@quda Could you give us a big anonymized file so we can try stuff?
For the prior benchmarks I used CSV files from here: CSV files for download | Stats NZ
"Geographic units, by industry and statistical area: 2000–2021 descending order – CSV" is ~120MB when unzipped.
But I am not sure how the number of columns affects the runtime. The example has only 5 columns.
I used the 10000_sales.csv file from NimbleCSV and duplicated its content until the total size was 20 MB (19.2 MiB). It has 15 columns.
Now, with the code below using chunks, it takes 2.8 seconds with wget to download the JSON data (size is 50 MB), although in the browser it loads forever.
defmodule XsvWeb.PageController do
  use XsvWeb, :controller

  alias NimbleCSV.RFC4180, as: CSV

  def index(conn, _params) do
    conn = send_chunked(conn, 200)
    send_csv(conn, "big_sales.csv")
  end

  defp send_csv(conn, file) do
    csv_stream = csv_stream(file)

    {:ok, conn} = chunk(conn, "[")

    conn =
      Enum.reduce_while(csv_stream, conn, fn chunk, conn ->
        case Plug.Conn.chunk(conn, chunk) do
          {:ok, conn} -> {:cont, conn}
          {:error, :closed} -> {:halt, conn}
        end
      end)

    {:ok, conn} = chunk(conn, "]")
    conn
  end

  defp csv_stream(file) do
    headers = headers(file)
    headers |> IO.inspect(label: ~S[headers])

    stream =
      file
      |> File.stream!()
      |> CSV.parse_stream(skip_headers: true)

    stream
    |> Stream.take(1)
    |> Stream.map(fn x -> Enum.zip(headers, x) |> Map.new() |> Jason.encode_to_iodata!() end)
    |> Stream.concat(
      stream
      |> Stream.drop(1)
      |> Stream.map(fn x ->
        [?,, Enum.zip(headers, x) |> Map.new() |> Jason.encode_to_iodata!()]
      end)
    )
  end

  defp headers(file) do
    file
    |> File.stream!()
    |> Stream.take(1)
    |> CSV.parse_stream(skip_headers: false)
    |> Enum.at(0)
  end
end
Note that the browser is slow at displaying the result, but with fetch("http://localhost:4000/").then(x => x.json()).then(x => console.log('x', x)) it still takes less than 4 seconds for 54.8 MB of JSON.
Now @quda, I know it is not the original problem, but converting to JSON makes the final file size more than twice as big. Is it possible to just send the CSV file, or JSON arrays instead of objects?
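For the JSON-arrays idea, a minimal, untested sketch (assuming the same send_csv/2 wrapper and headers/1 helper from above) could look like this: the headers go out once and each record is a JSON array, so the keys are not repeated in every object.

defp csv_array_stream(file) do
  headers = headers(file)

  rows =
    file
    |> File.stream!()
    |> CSV.parse_stream(skip_headers: true)
    # Each row becomes a JSON array prefixed with a comma; keys are not repeated.
    |> Stream.map(fn row -> [?,, Jason.encode_to_iodata!(row)] end)

  # The headers are emitted once as the first element of the outer JSON array
  # that send_csv/2 wraps in "[" and "]".
  Stream.concat([Jason.encode_to_iodata!(headers)], rows)
end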
That being said I do not think it is possible to beat something as simple as basic PHP like you said:
<?php
$csvFile = file('big_sales.csv');
$data = [];
foreach ($csvFile as $i => $line) {
    if ($i == 0) {
        $headers = str_getcsv($line);
    } else {
        $row = str_getcsv($line);
        $data[] = array_combine($headers, $row);
    }
}
echo json_encode($data, JSON_PRETTY_PRINT);
@lud you’re right about (not) beating PHP.
Very disappointed.
We were asked to migrate/refactor the old backend for APIs, developed 7 years ago on PHP 5/nginx, because it recently started to have performance issues: too many concurrent calls from the downstream API consumers. The CPU and memory were always at 100% on this server.
The idea of our IT solution architect was to switch to the Elixir/Phoenix stack for its performance with concurrent connections.
But it seems that this stack is far from being fast at processing individual requests…
Honestly, I couldn’t imagine there could be something worse than PHP in terms of performance.
EDIT: @lud’s code is more complete than mine, I only demonstrated how to avoid passing huge collections around.
Seconded, without that we can just theorize until the end of time.
I agree with @LostKobrakai that memory is likely not a problem, but I disagree that it should be the default solution. You can use streaming with a big buffer and get the best of both worlds: a low rate of hitting your storage + predictable memory usage.
@quda Here is a complete example on how I would do it:
def for_each_csv_row(path, processor) when is_function(processor, 1) do
  path
  |> Path.expand(__DIR__)         # or `Path.absname` if you need that
  |> File.stream!([], 256 * 1024) # use a 256KB streaming buffer
  |> NimbleCSV.RFC4180.parse_stream(skip_headers: false)
  |> Enum.map(processor)          # --- FUNCTIONAL PROGRAMMING, HELL YEAH! ---
  # or you can use `Enum.each` if you don't want to convert the values
  # and only need the side effects.
end
It is pointless to use optimal methods for accessing data if you just end up assigning it all to one huge variable at the end and returning it from a function. It’s best to pass a function that operates on a single record instead. That way the full collection of original elements is never held in memory and the runtime doesn’t have to copy it around – you only get the converted collection at the end. (Or, if you only need side effects, you won’t even have the resulting collection in memory either.)
I find Elixir and Phoenix a joy to work with. Individual requests are generally very fast for common purposes. But here you are doing a lot of string manipulation and iterating over large chunks of data, which is, unfortunately, not what Elixir is designed to do fast.
Honestly, I couldn’t imagine there could be something worse than PHP in terms of performance.
I have been a long-time user of PHP and I am surprised by your comment. PHP uses a lot of C code and is actually quite fast for many things. Lots of PHP libraries are bloated with overengineered, convoluted, interface-heavy APIs, but the raw performance is there (to my taste at least).
Now, if you just need to serve static CSV files that are sitting in the app directory, why not write the JSON version to the same directory at the same time and just serve that?
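As a rough sketch of that idea (module name and paths are made up, and this assumes the JSON only needs to be rebuilt when the CSV changes):

defmodule Xsv.Pregenerate do
  alias NimbleCSV.RFC4180, as: CSV

  # Convert a CSV file to a JSON file once, so requests can serve the static
  # result instead of re-parsing the CSV every time. Paths are illustrative.
  def csv_to_json(csv_path, json_path) do
    headers =
      csv_path
      |> File.stream!()
      |> Stream.take(1)
      |> CSV.parse_stream(skip_headers: false)
      |> Enum.at(0)

    body =
      csv_path
      |> File.stream!()
      |> CSV.parse_stream(skip_headers: true)
      |> Stream.map(fn row -> Enum.zip(headers, row) |> Map.new() |> Jason.encode_to_iodata!() end)
      |> Stream.intersperse(",")

    # Write lazily so the whole JSON never sits in memory at once.
    [["["], body, ["]"]]
    |> Stream.concat()
    |> Stream.into(File.stream!(json_path))
    |> Stream.run()
  end
end

Serving it afterwards is then just Plug.Conn.send_file(conn, 200, json_path) behind whatever authentication you need, or letting nginx/Plug.Static handle it.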
Thanks for the suggestion. I don’t understand what this processor function is.
How much Elixir did you learn before coming to the forum? Do you know the basics of functional programming?
I personally prefer to give people good advice after they have done some homework and not before. Still, let me add to my example so it is at least complete:
#
# The function signature `[first, second, third, fourth, fifth]` is just an example
# for when your CSV records have five fields: first name, last name, age,
# employee position and salary.
#
processor = fn [first, second, third, fourth, fifth] ->
  %{
    first_name: first,
    last_name: second,
    age: third,
    position: fourth,
    salary: fifth
  }
end
Passing that to the function I gave you above will yield a list of maps in the shape visible in the processor function.
Exactly! This PHP backend became an unmanageable beast over the last years (thousands of internal commits made by people who no longer work on this project), which is why it was decided to refactor it using a better stack. Slim, less code, better maintainability. And better performance on concurrent calls.
Unfortunately, it seems Elixir/Phoenix is not suitable for this.
We will look for alternatives.
My suggestion is to use @lud's solution and see whether it is good enough or not.
If you still want more speed, you can write a native CSV-to-JSON converter as a command-line tool in Rust or Go, reading from the FS and writing to stdout. That should be doable in 100 LOC. Then use a port to run the converter and stream the chunked results over HTTP. That should be trivial in Elixir. I believe you can achieve line speed in this case.
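A very rough sketch of the port side (csv2json is a hypothetical external binary that takes the CSV path as an argument and writes JSON to stdout; this assumes conn has already been through send_chunked/2):

defp send_converted(conn, csv_path) do
  # `csv2json` is a hypothetical external converter (Rust/Go); the port
  # delivers its stdout in binary chunks as they are produced.
  port =
    Port.open({:spawn_executable, System.find_executable("csv2json")}, [
      :binary,
      :exit_status,
      args: [csv_path]
    ])

  stream_port(conn, port)
end

defp stream_port(conn, port) do
  receive do
    {^port, {:data, data}} ->
      # Forward each chunk of converter output straight to the HTTP response.
      case Plug.Conn.chunk(conn, data) do
        {:ok, conn} ->
          stream_port(conn, port)

        {:error, :closed} ->
          Port.close(port)
          conn
      end

    {^port, {:exit_status, _status}} ->
      conn
  end
end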
You could have explained this from the beginning.
My (humble) experience with Elixir does not matter. I was assigned this task, given a deadline, and I have to do it. That’s why I turned to this forum. I thought members here are supposed to help people like me, not (only) experts in Elixir.
No offence, though.
I didn’t mean to be rude and I’m sorry if it read that way – my point is that without some level of understanding we cannot help you, because you might not understand, or not be able to apply, the suggestions given to you. So then how can we help you?
What I suggested to you is: integrate all processing of the CSV file together in one place so you can reap the full benefits of streaming and save CPU time and memory. If you need an even more detailed example, I’ll be happy to give it to you – I can even make a GitHub repo for you if you provide some sample CSV data we can work with. (E.g. how many columns does your CSV file have?)
One of my first tasks with Elixir – 5.5 years ago now – was exactly to process hundreds of megabytes of CSV and XML files per hour. On a virtual host with a 4-core vCPU, using full parallelism (via Task.async_stream or Flow), I wasn’t able to load the CPU to more than 60% while still ingesting several 500MB+ CSV / XML files per hour. There are ways, and we here are trying to show you those ways.
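For illustration only (the file name and the ingest_row/1 callback are made up), that kind of pipeline has roughly this shape:

# Illustrative sketch: parse the CSV lazily and fan the rows out over all
# schedulers; `ingest_row/1` stands in for whatever per-row work you do.
"huge.csv"
|> File.stream!([], 256 * 1024)
|> NimbleCSV.RFC4180.parse_stream()
|> Task.async_stream(&ingest_row/1,
  max_concurrency: System.schedulers_online(),
  ordered: false
)
|> Stream.run()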
I don’t have a horse in this race – just don’t be too quick to discount Elixir as unsuitable. There’s a lot that can be done. And you won’t regret sticking with Elixir. It saves you from a ton of problems that no other language has solved yet (like transparent parallelism and concurrency).
Totally agree, @dimitarvp.
Please understand that I am under a lot of pressure about this task. I didn’t have years to study Elixir, just a couple of weeks, but it convinced me. I’m coming from the OOP world (C#, Java, PHP), so the functional paradigm still feels strange to me.
I’m still trying to give Elixir a chance, but the deadline for this task is getting close and I have to deliver something – using Elixir/Phoenix or something else…
Thanks.
Q.T.
I understand and sympathize. Been there, done that myself.
Send us a link to a sample CSV file – or just tell us how many columns it has. I am sure we can think of something to help you in short order. Also, which Phoenix version do you use?
Try us.
(And btw, @lud’s reply seems excellent and IMO you can apply it quickly to your project.)
Just want to echo this, using ports and running a binary from Go or Rust (or whatever floats your boat) is an excellent way of doing data manipulation in an easy and understandable way.
@quda after reading this thread I really think what you have is an architecture problem. I’m basically echoing @lud here, but just to clarify: what you’re describing would probably not be a good idea regardless of the language/technology.
Short version:
- Don’t parse CSV->JSON, simply write JSON files to the server and serve those.
- If you can, send the CSV file directly to the client and parse it there.
- Depending on whether you actually need the whole file at once in the client, you could implement pagination for large files.
Long version:
- As @lud mentioned: why write CSV files to the server if what you need is JSON? Nothing - not even C or Rust - converting CSV to JSON at request time would be faster than Nginx serving static, pre-generated JSON files. So if you must use JSON, I would write those files directly as JSON and not convert them on the fly in the request. If you need authentication you can, of course, also just serve the file contents with Phoenix. It would be a tad slower than Nginx but much, much faster than parsing and transforming the file. (This would probably also solve the performance problem of the server, because you wouldn’t compute something on every request.)
- If you can, you should really just serve the CSV files and parse them on the client. Browser JS engines are really, really fast when dealing with data structures, and you simply cannot render all the data from 20MB files as HTML at once anyway. That would kill the browser.
- You probably want some pagination for those files in order to only send a subset of the data; a rough sketch follows below. I have only once had the misfortune of having to send such an amount of data to the client and process it there. Most of the time you want to chunk it up and load more data on demand.
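Rough sketch of such a paginated endpoint (the param names, default page size and the CSV alias from the controller above are assumptions; it still scans the file up to the requested offset, which is usually acceptable for files of this size):

# Sketch: serve one page of rows at a time instead of the whole file.
# Assumes the `CSV` alias (NimbleCSV.RFC4180) from the controller above.
def page(conn, params) do
  page = params |> Map.get("page", "1") |> String.to_integer()
  per_page = params |> Map.get("per_page", "500") |> String.to_integer()

  rows =
    "big_sales.csv"
    |> File.stream!()
    |> CSV.parse_stream(skip_headers: true)
    |> Stream.drop((page - 1) * per_page)
    |> Enum.take(per_page)

  json(conn, %{page: page, per_page: per_page, rows: rows})
end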
@lud and @dimitarvp you were both right.
I re-wrote the controller using @lud’s example and guess what? It works out of the box, like a charm.
curl -x "" localhost:4000/api/D3/env.ERA638.csv &> /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  160M    0  160M    0     0  8198k      0 --:--:--  0:00:20 --:--:-- 8141k
My only problem… I feel like scum copy-pasting code written by others without even understanding it. I never did that before.
In principle it is clear what @lud did, but at the level of detail it is elusive to me, and that is because of my level of understanding of Elixir syntax/concepts.
You are not my teachers (@dimitarvp was right before), therefore I cannot ask you what this [?,, does or why |> Stream.drop(1). I think you get the point: I have to find it out myself.
Thank you all.
Q.T.
?, means: "give me the codepoint of the comma" (in this case it returns 44, which you can try in IEx: iex> ?,). You can read about the usages of IO data here: IO — Elixir v1.12.3
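For example, in IEx (the JSON fragments here are just placeholders), nested lists of binaries and codepoints only get flattened into one binary when the data is finally written out:

iex> ?,
44
iex> IO.iodata_to_binary(["{\"a\":1}", [?,, "{\"a\":2}"]])
"{\"a\":1},{\"a\":2}"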
So here is what this function does (as I understand it):
stream
# Take the first line from the CSV (since we used "skip_headers" this is the first line of data)
|> Stream.take(1)
# 1. Merge the headers of the CSV with the data we just read into tuples of {HEADER, DATA}
# 2. Make a new map from that list of tuples
# 3. Encode the map as JSON and output it as io_data instead of as a string
|> Stream.map(fn x -> Enum.zip(headers, x) |> Map.new() |> Jason.encode_to_iodata!() end)
# We now have a stream of io_data lists, so something like: `[ [12, 31, 55], [13, 14, 99], ... ]`,
# which goes as the first argument into Stream.concat/2
|> Stream.concat(
  # The second argument to the concat function
  stream
  # But this time we remove the first line...
  |> Stream.drop(1)
  # ... map over _each other line_ in the CSV...
  |> Stream.map(fn x ->
    # ... and then prepend it with a comma in codepoint format
    [?,, Enum.zip(headers, x) |> Map.new() |> Jason.encode_to_iodata!()]
  end)
)
All this will result in something like {"hello": "world"}, {"hello": "foo"}, {"hello": "bar"} (notice that we don’t have a comma at the first position?). The send_csv function wraps that in [ and ], which makes it a valid JSON array. The whole thing with the codepoints is probably done for performance reasons in this case; I personally have not had the need to use this form. This is relatively advanced stuff for a relatively advanced problem.
So please don’t beat yourself up about this. We’re all here to learn and to give back where we can! This forum, or rather its management and the people in it, might just be the best feature of Elixir in general - and that says a lot.
Awesome explanation, kudos.
Indeed, I used "[" and "]" for the brackets and ?, for the comma; it is not very consistent.
I guess when sending a chunk I was thinking "we send some content to the browser here", while when building the JSON chunks I was thinking "here we are building some IO data".
Also, there is the Stream.intersperse/2 function, which would make the whole code cleaner by inserting commas between records instead of using take(1) and then drop(1).
The code would look simpler:
defp csv_stream(file) do
  headers = headers(file)

  file
  |> File.stream!()
  |> CSV.parse_stream(skip_headers: true)
  |> Stream.map(fn x -> Enum.zip(headers, x) |> Map.new() |> Jason.encode_to_iodata!() end)
  |> Stream.intersperse(",")
end
But note that for my 50 MB JSON it takes ~4s instead of ~3s, because each comma is sent as its own chunk.
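A possible way around that (untested sketch): batch the interspersed records into bigger iodata before chunking, e.g. with Stream.chunk_every/2, so each HTTP chunk carries a few hundred records rather than a lone comma. The batch size of 500 is arbitrary.

defp csv_stream(file) do
  headers = headers(file)

  file
  |> File.stream!()
  |> CSV.parse_stream(skip_headers: true)
  |> Stream.map(fn x -> Enum.zip(headers, x) |> Map.new() |> Jason.encode_to_iodata!() end)
  |> Stream.intersperse(",")
  # Group ~500 elements into one iodata list so commas are not sent as
  # individual HTTP chunks.
  |> Stream.chunk_every(500)
end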