Phoenix JSON API extremely slow when serving CSV files as JSON

I am building a JSON REST API with Phoenix that (among other services) has to convert static CSV files from a repository and "serve" them as JSON responses, which are used downstream by other services in our project. Some of these CSV files are quite big, ranging from a few hundred KB to tens of MB.
So far so good: I made it work with a few lines of code.
On my backend side:

defp csv(file) do
  file
  |> Path.expand(__DIR__)
  |> File.stream!()
  |> CSV.decode(headers: true)
  |> Enum.map(fn {:ok, val} -> val end)
end

in the API controller:

def show(conn, bhandler) do
  conn
  |> Plug.Conn.put_resp_header("content-type", "application/json; charset=utf-8")
  |> Plug.Conn.send_resp(200, Jason.encode!(bhandler, pretty: true))
end

The program does what it is supposed to do, but only on small files.
As the file gets bigger, the response time gets disproportionately bigger:

  • for 69 KB (CSV file) → 124 ms :grinning:
  • 234 KB → 2.8 s :slightly_smiling_face:
  • 1.1 MB → 24 s :neutral_face:
  • 22.4 MB → 835 s :flushed:

Obviously something is very wrong with this implementation, or with its libraries (CSV, Jason). A similar (simple) API we have in PHP took a few seconds to read/convert/send/receive a 40 MB CSV file.

Do you have any suggestions?

Maybe try nimble_csv. It has NimbleCSV.RFC4180 built in, which should parse your file.

"path/to/file"
|> File.read!() 
|> NimbleCSV.RFC4180.parse_string()

It also seems faster than streaming the file (at least for ~20 MB files).

The only downside is that parse_string/1 returns a list of lists, so you have to map it to a map yourself.
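For example, a minimal sketch of that mapping, assuming the first row holds the column names ("path/to/file" is a placeholder):

[header | rows] =
  "path/to/file"
  |> File.read!()
  |> NimbleCSV.RFC4180.parse_string(skip_headers: false)

# zip each data row with the header to get one map per row
Enum.map(rows, fn row -> header |> Enum.zip(row) |> Map.new() end)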


I see two possible solutions. First, as @moogle19 suggested, you can use NimbleCSV's parse_stream and send a chunked response using Plug.Conn.send_chunked/2, never loading the whole file into memory. Second, you can, again, use parse_stream to stream the contents into a temporary file and use Plug.Conn.send_file/5. I believe the second one is trickier, because if you have a lot of requests there should be rate limiting in order to not fill your disk.
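A rough sketch of the second option, assuming some helper that yields the converted rows as JSON iodata (tmp_path and csv_to_json_stream/1 are hypothetical names, not from this thread):

# stream the converted rows into a temporary file on disk
tmp_path = Path.join(System.tmp_dir!(), "export-#{System.unique_integer([:positive])}.json")

csv_path
|> csv_to_json_stream()
|> Stream.into(File.stream!(tmp_path))
|> Stream.run()

# then let Plug send the file without holding it in memory
conn
|> Plug.Conn.put_resp_content_type("application/json")
|> Plug.Conn.send_file(200, tmp_path)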


Can you measure which part of the code is slow? Is it the parsing, the JSON encoding, etc.?

Also you could:

  • use Stream.map instead of Enum.map to avoid keeping too much data in memory
  • use Jason.Fragment on each item in Stream.map to encode to JSON line by line (see the sketch below)
  • use Jason.encode_to_iodata instead of Jason.encode to avoid converting the full JSON structure to a string before sending it
  • convert your files to JSON once and serve the JSON files instead of converting from CSV for every request
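A minimal sketch of the Jason.Fragment and encode_to_iodata ideas, assuming `rows` is a stream of maps built from the CSV (the names here are placeholders):

json_iodata =
  rows
  # encode each map once; Jason.Fragment marks it as already-encoded JSON
  |> Stream.map(&Jason.Fragment.new(Jason.encode_to_iodata!(&1)))
  |> Enum.to_list()
  # the outer encode just splices the fragments instead of re-encoding the maps
  |> Jason.encode_to_iodata!()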

Can you measure which part of the code is slow? Is it the parsing, the JSON encoding, etc.?

Good idea: find out which part is the "culprit" and isolate the problem.
But I don't know how to do that.

When input data is expected to be large, always use streaming and never keep all of it in memory.

You can use Benchee…

I tested it with a ~130MB CSV file and the default benchee config:

Name             ips        average  deviation         median         99th %
nimble         0.121         8.24 s     ±0.00%         8.24 s         8.24 s
csv           0.0173        57.91 s     ±0.00%        57.91 s        57.91 s

Code is:

file = "samples/large.csv"

Benchee.run(%{
  "nimble" => fn ->
    file
    |> File.stream!()
    |> NimbleCSV.RFC4180.parse_stream()
    |> Stream.map(fn [first, second, third, fourth, fifth] -> %{f: first, s: second, t: third, fo: fourth, fi: fifth} end)
    |> Enum.into([])
  end,
  "csv" => fn ->
    file 
    |> File.stream!() 
    |> CSV.decode!() 
    |> Enum.into([])
  end
})

It seems nimble_csv is much faster for large files.


You can wrap each part of the code with :timer.tc/1:

{decode_µs, data} = :timer.tc(fn -> decode_from_csv(file) end)
{encode_µs, json} = :timer.tc(fn -> Jason.encode_to_iodata!(data) end)

etc.
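The returned times are in microseconds, so dividing by 1 000 makes them easier to compare, e.g.:

# :timer.tc reports microseconds; print milliseconds for readability
IO.puts("decode: #{div(decode_µs, 1000)} ms, encode: #{div(encode_µs, 1000)} ms")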

Thank you all, I managed to do this implementation using NimbleCSV:

defp csv(p) do
  [header | data] =
    p
    |> Path.expand(__DIR__)
    |> File.stream!()
    |> NimbleCSV.RFC4180.parse_stream(skip_headers: false)
    |> Enum.into([])

  Enum.map(data, fn x -> Enum.zip(header, x) |> Map.new() end)
end

It does the job, but there is not much improvement:
3.5 MB → 22.46 s
21.4 MB → 7.2 min (I tested this file on a similar API written a long time ago in PHP 5, calling a humble fgetcsv() with no streams, and got 22.22 s :pensive:)

So still far from usable.
I presume the issue is not streaming in the CSV file but the second part: assembling the maps and sending out the JSON.
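One way to check that presumption is to reuse the :timer.tc idea from earlier on this implementation (the file path is a placeholder):

# time the CSV-to-maps part and the JSON part separately
{parse_us, data} = :timer.tc(fn -> csv("data/sample.csv") end)
{encode_us, _json} = :timer.tc(fn -> Jason.encode!(data, pretty: true) end)
IO.inspect({parse_us, encode_us}, label: "µs spent in parse vs encode")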

Is there a reason to stream the data into memory? If not, try using the non-streamed API, which should be faster.

@LostKobrakai Initially we tried this:

    |> File.read!()
    |> CSV.decode()

It was even worse.
I followed the suggestion to use streams, since the data is large.

Streaming doesn't make things faster. It trades less memory for more CPU cycles. Switching from CSV to NimbleCSV is what makes things faster.

As you see above, NimbleCSV did not make it faster.

Going from 835 s (22.4 MB with CSV) to 7.2 min (21.4 MB with NimbleCSV) is faster in my book. Now if you can handle having all the data in memory (reading from the filesystem once instead of streaming), you'll likely see another speed improvement.

Right. Slightly faster, but still not usable/performant:

21.4 MB → 7.2 min

Why does the timing grow exponentially with the size of the file?!

Sure, but you're also going back to the file system for each individual line: File.stream! by default works on a per-line basis. Unless you're memory constrained, it's likely a lot faster to load more data into memory at a time.
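For instance, File.stream! accepts a :read_ahead mode, so reads hit the filesystem in larger blocks while lines are still emitted one at a time (the path and the 100 KB buffer size below are placeholder values):

# read from disk in ~100 KB blocks instead of line by line
"data/sample.csv"
|> File.stream!(read_ahead: 100_000)
|> NimbleCSV.RFC4180.parse_stream()
|> Enum.to_list()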


I was wondering the same thing. @quda is still loading all the data into memory. Maybe something like the following would work.

# read the header row once so each data row can become a map
[headers] =
  path
  |> Path.expand(__DIR__)
  |> File.stream!()
  |> Stream.take(1)
  |> NimbleCSV.RFC4180.parse_stream(skip_headers: false)
  |> Enum.to_list()

result_stream =
  path
  |> Path.expand(__DIR__)
  |> File.stream!()
  |> NimbleCSV.RFC4180.parse_stream()
  |> Stream.map(fn row -> headers |> Enum.zip(row) |> Map.new() end)

# in your controller
conn = send_chunked(conn, 200)

{:ok, conn} = chunk(conn, "[")

result_stream
|> Stream.transform(true, fn
  m, true -> {[Jason.encode_to_iodata!(m)], false}
  m, false -> {[["," | Jason.encode_to_iodata!(m)]], false}
end)
|> Stream.each(&chunk(conn, &1))
|> Stream.run()

chunk(conn, "]")

If they are static, you could use the same technique as the Dashbit blog:

https://dashbit.co/blog/welcome-to-our-blog-how-it-was-made
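A rough sketch of that technique adapted to CSV: read and convert the file at compile time, so requests only return a precomputed binary (the module name and path here are hypothetical):

defmodule MyApp.StaticData do
  @csv_path Path.expand("../data/sample.csv", __DIR__)
  # recompile this module whenever the CSV file changes
  @external_resource @csv_path

  [header | rows] =
    @csv_path
    |> File.read!()
    |> NimbleCSV.RFC4180.parse_string(skip_headers: false)

  @json rows
        |> Enum.map(fn row -> header |> Enum.zip(row) |> Map.new() end)
        |> Jason.encode!()

  # serve the already-encoded JSON, no per-request CSV parsing
  def json, do: @json
end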

I really doubt memory is the issue here, given that it worked with loading all the data into memory for the CSV-backed implementation.