When streaming input from a file of unknown size, formatted as a single line of comma-separated values, what’s the best way to operate on each comma-separated element in that file?
Example input file input_file.txt:
"AAA","BBB","CCC","DDD"... for a few million characters
Splitting by line:
File.stream!/3 conveniently defaults to separating by :line, but that mode is fixed to splitting on \n or \r\n.
> File.stream!("input_file.txt") |> Enum.to_list()
> ["\"AAA\",\"BBB\",\"CCC\"..."]
Splitting by byte & chunking stream:
File.stream!/3 also accepts a number of bytes, so setting the byte size to 1 (reading the input as raw bytes, not UTF-8) and passing it to Stream.chunk_by/2 constructs something closer to the stream we want:
> File.stream!("input_file.txt", [], 1) |> Stream.chunk_by(&(&1 == ",")) |> Enum.to_list()
> [
["\"", "A","A","A", "\""],
[","],
["\"", "B","B","B", "\""],
[","],
...
]
From here we could probably chain further operations onto the stream to filter out the unwanted punctuation and join the desired characters back together, but it doesn’t feel like the best way to solve this problem.
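For illustration, that chaining might look roughly like the sketch below. CsvChunks and fields/1 are hypothetical names made up for this example, not part of any library. It assumes no commas appear inside quoted values, and it silently drops empty fields (a run of consecutive commas ends up in a single chunk that gets rejected). An in-memory list of 1-byte binaries stands in for the file stream so the snippet runs on its own.

```elixir
defmodule CsvChunks do
  # Hypothetical helper: accepts any enumerable of 1-byte binaries,
  # such as the stream returned by File.stream!("input_file.txt", [], 1).
  def fields(byte_stream) do
    byte_stream
    # Group consecutive bytes by whether they are a comma or not.
    |> Stream.chunk_by(&(&1 == ","))
    # Drop the chunks made up of separators.
    |> Stream.reject(&match?(["," | _], &1))
    # Rejoin the bytes, then strip surrounding whitespace and double quotes.
    |> Stream.map(fn bytes ->
      bytes |> Enum.join() |> String.trim() |> String.trim("\"")
    end)
  end
end

# Stand-in for the byte stream: split a sample string into 1-byte binaries.
sample = "\"AAA\",\"BBB\",\"CCC\"\n"
bytes = for <<byte::binary-size(1) <- sample>>, do: byte

result = bytes |> CsvChunks.fields() |> Enum.to_list()
IO.inspect(result)
# prints ["AAA", "BBB", "CCC"]
```

It stays lazy, but every single byte still passes through three stream stages, which is part of why it doesn’t feel right.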
What other patterns are there for chunking a stream by an arbitrary character?