I am building a JSON REST API with Phoenix that (among other services) has to convert static CSV files from a repository and serve them as JSON responses, which will be consumed downstream by other services in our project. Some of these CSV files are quite big, ranging from several hundred KB to tens of MB.
So far so good: I got it working with a few lines of code.
On my backend side:
defp csv(file) do
  file
  |> Path.expand(__DIR__)
  |> File.stream!()
  |> CSV.decode(headers: true)
  # CSV.decode/2 emits {:ok, row} / {:error, reason} tuples; keep the rows
  |> Enum.map(fn {:ok, val} -> val end)
end
In the API controller:
def show(conn, bhandler) do
  conn
  |> Plug.Conn.put_resp_header("content-type", "application/json; charset=utf-8")
  |> Plug.Conn.send_resp(200, Jason.encode!(bhandler, pretty: true))
end
The program does what it is supposed to do, but only on small files. As the file gets bigger, the response time grows much faster than the file size:
69 KB (CSV file) → 124 ms
234 KB → 2.8 s
1.1 MB → 24 s
22.4 MB → 835 s
Obviously something is very wrong with this implementation, or with its libraries (CSV, Jason). A similar (simple) API we have in PHP took a few seconds to read/convert/send/receive a 40 MB CSV file.
I see two possible solutions. First, as @moogle19 suggested, you can use NimbleCSV's parse_stream and send a chunked response with Plug.Conn.send_chunked/2, so the whole file is never loaded into memory (see the sketch below). Second, you can, again, use parse_stream to stream the contents into a temporary file and serve it with Plug.Conn.send_file/5. I believe the second one is trickier, because with a lot of concurrent requests you would need rate limiting so you don't fill up your disk.
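A minimal sketch of the first approach (the "file" param, the path handling via __DIR__, and the per-row JSON shape are assumptions; adapt them to your controller):

def show(conn, %{"file" => file}) do
  conn =
    conn
    |> Plug.Conn.put_resp_content_type("application/json")
    |> Plug.Conn.send_chunked(200)

  # open the JSON array
  {:ok, conn} = Plug.Conn.chunk(conn, "[")

  conn =
    file
    |> Path.expand(__DIR__)
    |> File.stream!()
    |> NimbleCSV.RFC4180.parse_stream()
    # each parsed row is a list of fields; encode it as a JSON array
    |> Stream.map(&Jason.encode!/1)
    |> Stream.intersperse(",")
    |> Enum.reduce_while(conn, fn part, conn ->
      case Plug.Conn.chunk(conn, part) do
        {:ok, conn} -> {:cont, conn}
        {:error, :closed} -> {:halt, conn}
      end
    end)

  # close the JSON array (the client may already have gone away)
  case Plug.Conn.chunk(conn, "]") do
    {:ok, conn} -> conn
    {:error, :closed} -> conn
  end
end

Note that this emits each row as a JSON array rather than a header-keyed map; if the downstream services need maps, you can read the header row first and zip it with each row inside the Stream.map.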
Thank you all, I managed to do this implementation using NimbleCSV:
defp csv(p) do
  [header | data] =
    p
    |> Path.expand(__DIR__)
    |> File.stream!()
    |> NimbleCSV.RFC4180.parse_stream(skip_headers: false)
    # Enum.into([]) materializes every row, so this is not streaming end-to-end
    |> Enum.into([])

  Enum.map(data, fn x -> Enum.zip(header, x) |> Map.new() end)
end
It does the job, but not much improvement:
3.5 MB → 22.46 s
21.4 MB → 7.2 min (tested this file on a similar API, written a long time ago in PHP 5, calling a humble fgetcsv() with no streams, and we got 22.22 s)
So it is still far from being usable.
I presume the issue is not streaming in the CSV file but the second part: assembling the maps and sending out the JSON.
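A quick way to check that presumption is to time the two phases separately (rough sketch; the file path is just an example):

# split the work: parse + map assembly vs. JSON encoding
{parse_us, rows} = :timer.tc(fn -> csv("data/big.csv") end)
{encode_us, _json} = :timer.tc(fn -> Jason.encode!(rows, pretty: true) end)
IO.puts("parse/assemble: #{div(parse_us, 1000)} ms, encode: #{div(encode_us, 1000)} ms")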
is being faster in my book. Now if you can handle having all the data in memory (reading from the filesystem once instead of streaming), you'll likely see another speed improvement.
Sure, but you're also going back to the file system for each individual line. File.stream! works on a per-line basis by default. Unless you're memory constrained, it's likely a lot faster to load more data into memory at a time.
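A minimal sketch of that in-memory variant, mirroring the earlier csv/1 helper (path handling is the same assumption as before): read the whole file once with File.read!/1 and parse the binary with NimbleCSV's parse_string instead of streaming line by line from disk.

defp csv(p) do
  # read the whole file in one pass, then parse the binary in memory
  [header | data] =
    p
    |> Path.expand(__DIR__)
    |> File.read!()
    |> NimbleCSV.RFC4180.parse_string(skip_headers: false)

  Enum.map(data, fn row -> header |> Enum.zip(row) |> Map.new() end)
end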