Phoenix JSON API extremely slow when serving CSV files as JSON

I am building a JSON REST API with Phoenix that (among other services) has to convert static CSV files from a repository and "serve" them as JSON responses, which are used downstream by other services in our project. Some of these CSV files are quite big, ranging from a few hundred KB to tens of MB.
So far so good: I made it work with a few lines of code.
On my backend side:

defp csv(file) do
  file
  |> Path.expand(__DIR__)
  |> File.stream!()
  |> CSV.decode(headers: true)
  |> Enum.map(fn {:ok, val} -> val end)
end

in the API controller:

def show(conn, bhandler) do
  conn
  |> Plug.Conn.put_resp_header("content-type", "application/json; charset=utf-8")
  |> Plug.Conn.send_resp(200, Jason.encode!(bhandler, pretty: true))
end

The program does what it is supposed to do, but only on small files.
As the file gets bigger, the response time gets disproportionately bigger:

  • for 69 KB (CSV file) → 124 ms :grinning:
  • 234 KB → 2.8 s :slightly_smiling_face:
  • 1.1 MB → 24 s :neutral_face:
  • 22.4 MB → 835 s :flushed:

Obviously something is very wrong with this implementation, or with its libraries (CSV, Jason). A similar (simple) API we have in PHP took a few seconds to read/convert/send/receive a 40 MB CSV file.

Do you have any suggestions?

Maybe try nimble_csv. It has NimbleCSV.RFC4180 built in, which should parse your file.

"path/to/file"
|> File.read!() 
|> NimbleCSV.RFC4180.parse_string()

It also seems faster than streaming the file (at least for ~20 MB files).

The only downside is that parse_string/1 returns a list of lists, so you have to map it to a map yourself.
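For example, a minimal sketch of that mapping, assuming the first row holds the column names ("path/to/file" is a placeholder):

[header | rows] =
  "path/to/file"
  |> File.read!()
  |> NimbleCSV.RFC4180.parse_string(skip_headers: false)

# zip each data row with the header to get one map per row
Enum.map(rows, fn row -> header |> Enum.zip(row) |> Map.new() end)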


I see two possible solutions. First, as @moogle19 suggested, you can use NimbleCSV's parse_stream and send a chunked response using Plug.Conn.send_chunked/2, never loading the whole file into memory. Second, you can, again, use parse_stream to stream the contents into a temporary file and use Plug.Conn.send_file/5. I believe the second one is trickier, because if you have a lot of requests there should be rate limiting in order to not fill your disk.
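A rough sketch of the second option, assuming some helper that yields the converted rows as JSON iodata (tmp_path and csv_to_json_stream/1 are hypothetical names, not from this thread):

# stream the converted rows into a temporary file on disk
tmp_path = Path.join(System.tmp_dir!(), "export-#{System.unique_integer([:positive])}.json")

csv_path
|> csv_to_json_stream()
|> Stream.into(File.stream!(tmp_path))
|> Stream.run()

# then let Plug send the file without holding it in memory
conn
|> Plug.Conn.put_resp_content_type("application/json")
|> Plug.Conn.send_file(200, tmp_path)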


Can you measure which part of the code is slow? Is it the parsing, the JSON encoding, etc.?

Also you could:

  • use Stream.map instead of Enum.map to avoid keeping too much data in memory
  • use Jason.Fragment on each item in Stream.map to encode to JSON line by line (see the sketch below)
  • use Jason.encode_to_iodata instead of Jason.encode to avoid converting the full JSON structure to a string before sending it
  • convert your files to JSON once and serve the JSON files instead of converting from CSV for every request
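A minimal sketch of the Jason.Fragment and encode_to_iodata ideas, assuming `rows` is a stream of maps built from the CSV (the names here are placeholders):

json_iodata =
  rows
  # encode each map once; Jason.Fragment marks it as already-encoded JSON
  |> Stream.map(&Jason.Fragment.new(Jason.encode_to_iodata!(&1)))
  |> Enum.to_list()
  # the outer encode just splices the fragments instead of re-encoding the maps
  |> Jason.encode_to_iodata!()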

Can you measure which part of the code is slow? Is it the parsing, the JSON encoding, etc.?

Good idea: find out which part is the "culprit" and isolate the problem.
But I don't know how to do that.

When input data is expected to be large, always use streaming and never keep all of it in memory.

You can use Benchee…

I tested it with a ~130MB CSV file and the default benchee config:

Name             ips        average  deviation         median         99th %
nimble         0.121         8.24 s     ±0.00%         8.24 s         8.24 s
csv           0.0173        57.91 s     ±0.00%        57.91 s        57.91 s

Code is:

file = "samples/large.csv"

Benchee.run(%{
  "nimble" => fn ->
    file
    |> File.stream!()
    |> NimbleCSV.RFC4180.parse_stream()
    |> Stream.map(fn [first, second, third, fourth, fifth] -> %{f: first, s: second, t: third, fo: fourth, fi: fifth} end)
    |> Enum.into([])
  end,
  "csv" => fn ->
    file 
    |> File.stream!() 
    |> CSV.decode!() 
    |> Enum.into([])
  end
})

It seems nimble_csv is much faster for large files.


You can wrap each part of the code with :timer.tc/1:

{decode_µs, data} = :timer.tc(fn -> decode_from_csv(file) end)
{encode_µs, json} = :timer.tc(fn -> Jason.encode_to_iodata!(data) end)

etc.
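The returned times are in microseconds, so dividing by 1 000 makes them easier to compare, e.g.:

# :timer.tc reports microseconds; print milliseconds for readability
IO.puts("decode: #{div(decode_µs, 1000)} ms, encode: #{div(encode_µs, 1000)} ms")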

Thank you all, I managed to do this implementation using NimbleCSV:

defp csv(p) do
  [header | data] =
    p
    |> Path.expand(__DIR__)
    |> File.stream!()
    |> NimbleCSV.RFC4180.parse_stream(skip_headers: false)
    |> Enum.into([])

  Enum.map(data, fn x -> Enum.zip(header, x) |> Map.new() end)
end

It does the job, but there is not much improvement:
3.5 MB → 22.46 s
21.4 MB → 7.2 min (I tested this file on a similar API written a long time ago in PHP 5, calling a humble fgetcsv() with no streams, and got 22.22 s :pensive:)

So still far from usable.
I presume the issue is not streaming in the CSV file but the second part: assembling the maps and sending out the JSON.
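One way to check that presumption is to reuse the :timer.tc idea from earlier on this implementation (the file path is a placeholder):

# time the CSV-to-maps part and the JSON part separately
{parse_us, data} = :timer.tc(fn -> csv("data/sample.csv") end)
{encode_us, _json} = :timer.tc(fn -> Jason.encode!(data, pretty: true) end)
IO.inspect({parse_us, encode_us}, label: "µs spent in parse vs encode")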

Is there a reason to stream the data into memory? If not, try using the non-streamed API, which should be faster.

@LostKobrakai Initially we tried this:

    |> File.read!()
    |> CSV.decode()

It was even worse.
I followed the suggestion to use streams, since the data is large.

Streaming doesn't make things faster. It trades less memory for more CPU cycles. Switching from CSV to NimbleCSV is what makes things faster.

As you see above, NimbleCSV did not make it faster.

Going from 835 s (22.4 MB with CSV) to 7.2 min (21.4 MB with NimbleCSV) is faster in my book. Now if you can handle having all the data in memory (reading from the filesystem once instead of streaming), you'll likely see another speed improvement.

Right. Slightly faster, but still not usable/performant:

21.4 MB → 7.2 min

Why does the timing grow exponentially with the size of the file?!

Sure, but you're also going back to the file system for each individual line: File.stream! by default works on a per-line basis. Unless you're memory constrained, it's likely a lot faster to load more data into memory at a time.
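For instance, File.stream! accepts a :read_ahead mode, so reads hit the filesystem in larger blocks while lines are still emitted one at a time (the path and the 100 KB buffer size below are placeholder values):

# read from disk in ~100 KB blocks instead of line by line
"data/sample.csv"
|> File.stream!(read_ahead: 100_000)
|> NimbleCSV.RFC4180.parse_stream()
|> Enum.to_list()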


I was wondering the same thing. @quda is still loading all the data into memory. Maybe something like the following would work.

# read the header row once so each data row can become a map
[headers] =
  path
  |> Path.expand(__DIR__)
  |> File.stream!()
  |> Stream.take(1)
  |> NimbleCSV.RFC4180.parse_stream(skip_headers: false)
  |> Enum.to_list()

result_stream =
  path
  |> Path.expand(__DIR__)
  |> File.stream!()
  |> NimbleCSV.RFC4180.parse_stream()
  |> Stream.map(fn row -> headers |> Enum.zip(row) |> Map.new() end)

# in your controller
conn = send_chunked(conn, 200)

{:ok, conn} = chunk(conn, "[")

result_stream
|> Stream.transform(true, fn
  m, true -> {[Jason.encode_to_iodata!(m)], false}
  m, false -> {[["," | Jason.encode_to_iodata!(m)]], false}
end)
|> Stream.each(&chunk(conn, &1))
|> Stream.run()

chunk(conn, "]")

If they are static, you could use the same technique as the Dashbit blog:

https://dashbit.co/blog/welcome-to-our-blog-how-it-was-made
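A rough sketch of that technique adapted to CSV: read and convert the file at compile time, so requests only return a precomputed binary (the module name and path here are hypothetical):

defmodule MyApp.StaticData do
  @csv_path Path.expand("../data/sample.csv", __DIR__)
  # recompile this module whenever the CSV file changes
  @external_resource @csv_path

  [header | rows] =
    @csv_path
    |> File.read!()
    |> NimbleCSV.RFC4180.parse_string(skip_headers: false)

  @json rows
        |> Enum.map(fn row -> header |> Enum.zip(row) |> Map.new() end)
        |> Jason.encode!()

  # serve the already-encoded JSON, no per-request CSV parsing
  def json, do: @json
end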

I really doubt memory is the issue here, given that it worked with loading all the data into memory for the CSV-backed implementation.