vivus-ignis
Emulate File.stream! for a string variable
Hello there!
I have a mix task which grabs some data from a remote API, collects it into a file (a biggish xml), then sucks it in as a stream and processes it with a bunch of text transformations.
Now I’m trying to sketch an integration test for the processing part (skipping the getting-the-data part) and I wonder… is it possible to use a string variable instead of a file for that purpose?
This is what I have in my mix task code:
File.stream!("#{@download_dir}/#{category}.xml", [:read])
|> ... processing part I want to test follows
And this is how I’m trying to simulate (unsuccessfully so far) the File.stream! part:
input_xml = """
... xml fragment ...
...
"""
input_xml
|> Stream.unfold( &(String.split(&1, "~n")) )
|> ... processing part ...
Of course what I’m getting after this Stream.unfold is different from File.stream! – the Stream.unfold results in all the newlines being removed.
And then my processing part breaks as it relies on newlines in certain places (yeah, it sounds crazy, but inside that xml I have wiki markup-formatted fragments where newlines do matter).
So my question is: is it possible to split a string by newlines in such a way that I can preserve those "\n"s? That is, can I emulate File.stream! without an actual file?
Would appreciate any hints. Thank you!
Marked As Solved
wmnnd
You’re in luck: you can use StringIO.open/1 and IO.binstream/2 to stream any string line by line:
# StringIO.open/1 returns an IO device (a pid), not a stream
{:ok, device} = StringIO.open("abc\ndef\nghi\n")
device
|> IO.binstream(:line)
|> Enum.to_list() # replace with your own stream processing
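For the integration test itself, one way to use this is to let the processing code accept any enumerable of lines, so the mix task can pass a File.stream! and the test can pass a StringIO-backed stream. The sketch below assumes that split; the `Processing` module and its `run/1` function are hypothetical stand-ins for the real pipeline, not anything from the original task.

```elixir
defmodule Processing do
  # Hypothetical stand-in for the real transformation pipeline.
  # It accepts any enumerable of lines, so it does not care where they come from.
  def run(lines) do
    lines
    |> Stream.map(&String.trim_trailing(&1, "\n"))
    |> Stream.reject(&(&1 == ""))
    |> Enum.to_list()
  end
end

# In the mix task the input would be a File.stream!, piped into Processing.run/1.
# In a test the same function can be fed from a string:
{:ok, device} = StringIO.open("<a>\n\n<b>\n")
result = device |> IO.binstream(:line) |> Processing.run()
# result == ["<a>", "<b>"]
```

The lines arrive with their trailing "\n" intact, just as they would from File.stream!, so any newline-sensitive step sees the same input in both cases.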
Also Liked
ku1ik
An alternative to StringIO + IO.binstream would be to use Stream.unfold with binary pattern matching:
def binary_stream(b, chunk_size \\ 50_000) when is_binary(b) do
Stream.unfold(0, fn skip ->
case b do
<<_skipped::binary-size(skip), chunk::binary-size(chunk_size), _rest::binary>> ->
{chunk, skip + chunk_size}
<<_skipped::binary-size(skip)>> ->
nil
<<_skipped::binary-size(skip), chunk::binary>> ->
{chunk, skip + byte_size(chunk)}
end
end)
end
I tested this on pretty big XML docs (hundreds of megabytes) and it seems to be performant and doesn’t require much memory due to not copying/cloning any part of the binary data.
PS: for this particular case (XML parsing), it’s not necessary to read by line, and in fact some XML documents (or SOAP API responses) return the whole XML doc as a single long line without line breaks.
michalmuskala
Why not use one of the string functions? I understand you want to get an enumerable of lines from the string. This can be achieved eagerly with String.split(str, "\n") or lazily with String.splitter(str, "\n").
NobbZ
Because String.split/2 and String.splitter/2 will remove the split-points, but the OP said he needs them intact.
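If the split points have to survive, one option is to do the splitting by hand with Stream.unfold and put the newline back on each line. This is only a minimal sketch: the `LineStream` module name is made up for illustration, and it handles "\n" line endings only.

```elixir
defmodule LineStream do
  @doc "Lazily yields the lines of `string`, keeping each trailing newline."
  def lines(string) when is_binary(string) do
    Stream.unfold(string, fn
      # Nothing left: end the stream.
      "" ->
        nil

      rest ->
        # :binary.split/2 splits on the first occurrence only.
        case :binary.split(rest, "\n") do
          [line, tail] -> {line <> "\n", tail}
          # Last line without a trailing newline.
          [last] -> {last, ""}
        end
    end)
  end
end

lines = LineStream.lines("abc\ndef\nghi\n") |> Enum.to_list()
# => ["abc\n", "def\n", "ghi\n"]
```

Because the stream is built from the string itself rather than from an IO device, it can be enumerated more than once and starts from the first line each time.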
aseigo
Ok .. I think I have found the issue …
In every call to get a line, the process returned by StringIO.open does this:
defp io_request({:get_line, encoding, prompt}, s) do
get_line(encoding, prompt, s)
end
get_line, in turn, calls Erlang’s :unicode.characters_to_list. This converts the whole bitstring to a list with the proper encoding.
If this succeeds, StringIO.do_get_line is called, which iterates over the items in the list until it finds a termination (newline or no more data) and returns that line and the rest of the string. It then goes back to Erlang, calling :unicode.characters_to_binary on both the line just retrieved and the remainder of the string.
This means that the longer the string, the bigger the lists and resulting binaries that are generated on each request for a line. I expect this is doing some unhappy things to the memory management. A potential fix would be to do the conversion to a list once, keep that list in the state data of the StringIO process, and then iterate over it one line at a time.
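A rough way to check whether this affects a given setup is to time a full line-by-line read for a few input sizes and see how the time grows. The snippet below is only a measurement sketch: the input sizes and the sample line are arbitrary, and the numbers will depend on the Elixir version, since the StringIO internals described above may differ between releases.

```elixir
# Times a full line-by-line read of an n-line string through StringIO.
time_lines = fn n ->
  input = String.duplicate("some line of text\n", n)
  {:ok, device} = StringIO.open(input)

  # :timer.tc/1 returns {elapsed_microseconds, result_of_fun}.
  {micros, count} =
    :timer.tc(fn -> device |> IO.binstream(:line) |> Enum.count() end)

  StringIO.close(device)
  %{lines: count, microseconds: micros}
end

results = Enum.map([1_000, 2_000, 4_000], time_lines)
IO.inspect(results)
```

If doubling the number of lines roughly quadruples the elapsed time, the per-line conversion cost described above is the likely cause; roughly doubling suggests it is not an issue on that version.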
antoine
Thanks for this post, it was very useful!
But I think the post marked as the solution is not the most appropriate one.
The solution using String.splitter(str, "\n") seems to behave more as expected.
Explanations:
When using this:
str = "abc\ndef\nghi\n"
{:ok, stream} = str |> StringIO.open()
s = stream |> IO.binstream(:line)
it does not work as expected:
iex> s |> Enum.take(1)
["abc\n"]
iex> s |> Enum.take(1)
["def\n"]
=> The result should always be the same as it’s the same operation.
Like we have here:
iex> s = File.stream!("/tmp/foo.csv")
iex> s |> Enum.take(1)
["hey\n"]
iex> s |> Enum.take(1)
["hey\n"]
Instead, String.splitter(str, "\n") does the job as expected:
iex> str = "abc\ndef\nghi\n"
iex> s = String.splitter(str, "\n")
iex> s |> Enum.take(1)
["abc"]
iex> s |> Enum.take(1)
["abc"]
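The difference comes from the StringIO device being a stateful process: each read consumes input, so a second enumeration continues where the first one stopped. If both the repeatable behaviour and the trailing newlines are wanted, one possible workaround is to open a fresh device for every enumeration via Stream.resource. This is a sketch under that assumption; `RestartableLines` is a hypothetical name, not an existing module.

```elixir
defmodule RestartableLines do
  # Hypothetical helper: opens a new StringIO device per enumeration,
  # so every run starts again from the first line.
  def stream(string) when is_binary(string) do
    Stream.resource(
      fn ->
        {:ok, device} = StringIO.open(string)
        device
      end,
      fn device ->
        case IO.binread(device, :line) do
          line when is_binary(line) -> {[line], device}
          # :eof or {:error, reason}: stop the stream.
          _eof_or_error -> {:halt, device}
        end
      end,
      fn device -> StringIO.close(device) end
    )
  end
end

s = RestartableLines.stream("abc\ndef\nghi\n")
first = Enum.take(s, 1)
again = Enum.take(s, 1)
# first == ["abc\n"] and again == ["abc\n"]
```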








