Speed up parsing a text file

quazar · June 29, 2018, 2:44pm

I am parsing a DNS zone file (approx. 12 GB) in Elixir and it is taking an absurd amount of time. I killed the process after 30+ minutes and switched to Go, which finished the same job in less than 5 minutes. I know Elixir is not well suited to CPU-intensive work, but here I seem to be stuck on I/O, which is strange.
My code:

    dest_stream = File.stream!(destination, [{:delayed_write, 10_000_000, 60}])
    File.stream!(source, [read_ahead: 10_000_000])
    |> Stream.map(&String.split(&1))
    |> Stream.filter(&valid_entry?(&1, zone))
    |> Stream.map(&extract_domain_name(&1, zone))
    |> Stream.dedup
    |> Stream.into(dest_stream, fn args -> args <> "\n" end)
    |> Stream.run

Is there any way I can speed up this process?

peerreynders · June 29, 2018, 2:51pm

Stream runs everything strictly sequentially and lazily, in a single process. Have a look at Flow instead.

NobbZ · June 29, 2018, 2:52pm

Try using :line mode instead of :read_ahead; that will reduce the size of the chunks in memory. Holding 10 MB binary chunks and splitting them by line is inefficient.

Also as far as I can tell, you might cut entries in half and discard those.

Aside from that, I'd probably try to rewrite that with leex and yecc or another proper parser library.

benwilson512 · June 29, 2018, 3:04pm

Without seeing what you’re doing in each of those functions we can’t really suggest anything concrete.

jakemorrison · June 30, 2018, 1:46am

I suspect the biggest issue you are hitting is that you are interleaving processing and I/O. That causes you to thrash the scheduler. Things work a lot better if you can read in the data, process it, then write it out all in one chunk.

With 12 GB of data, if you don't have enough RAM to hold everything, streams are still useful, but you want bigger chunks: use the stream to read a block of data from disk, split it into records, group the records into chunks, process the chunks in parallel, then write each processed chunk to disk.
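A minimal sketch of that read/chunk/process/write shape, using only the standard library (Task.async_stream) rather than any particular package. The module name, process_line/1 and the chunk size are placeholders I made up for illustration; the real per-line work would be the split/filter/extract steps from the original post.

```elixir
defmodule ChunkedPipeline do
  # Hypothetical per-line work; stands in for the real parsing steps.
  def process_line(line) do
    line |> String.trim_trailing("\n") |> String.downcase()
  end

  # Processes one chunk of lines and returns iodata ready to be written.
  def process_chunk(lines) do
    Enum.map(lines, fn line -> [process_line(line), "\n"] end)
  end

  def run(source, destination, chunk_size \\ 10_000) do
    File.open!(destination, [:write, :delayed_write], fn out ->
      source
      |> File.stream!(read_ahead: 1_000_000)
      # Group lines so each task gets a meaningful amount of work.
      |> Stream.chunk_every(chunk_size)
      # One task per chunk; results come back in input order by default.
      |> Task.async_stream(&process_chunk/1,
        max_concurrency: System.schedulers_online(),
        timeout: :infinity
      )
      # A single write per processed chunk, done by the calling process.
      |> Enum.each(fn {:ok, iodata} -> IO.binwrite(out, iodata) end)
    end)
  end
end
```

Usage would look like ChunkedPipeline.run("zone.txt", "domains.txt"), with both paths being placeholders. Whether this beats the sequential version depends on how heavy the per-line work is, so it is worth measuring with a few chunk sizes.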

You can parallelize the processing of the entries to take advantage of multiple cores. I have found https://github.com/beatrichartz/parallel_stream easy to use and fast, though there are other options, some of them part of the standard library. It lets you set the number of workers and the number of records to process per worker, e.g.

    workers = :erlang.system_info(:schedulers) * 2
    stream = ParallelStream.map(records, &process_record/1, num_workers: workers, worker_work_ratio: 1000)
    results = Enum.into(stream, [])

Instead of concatenating strings, you can generate iolists, e.g. fn args -> [args, "\n"] end
See https://www.bignerdranch.com/blog/elixir-and-io-lists-part-1-building-output-efficiently/
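To make the iolist point concrete, here is a small sketch. The module and function names are made up for illustration; File.write!/2 and IO.iodata_to_binary/1 are standard library calls that accept nested lists of binaries.

```elixir
defmodule IolistDemo do
  # Builds one output line as iodata. The domain binary is referenced
  # from the list, not copied into a new concatenated binary.
  def line(domain), do: [domain, "\n"]

  # Writes all domains with a single call; the nested list is valid iodata.
  def write_all(path, domains) do
    File.write!(path, Enum.map(domains, &line/1))
  end
end
```

The same [args, "\n"] shape can be used as the transform function passed to Stream.into in the original pipeline, since file streams accept iodata.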

We have one high-volume application which has configuration info in JSON, about 1M records with 1KB of JSON for each record. The data starts in a Postgres database. We have one job that reads all the data in the database, parses the JSON, massages it, then writes out a CSV file with key and JSON data. On startup, the app parses the CSV and loads the data into an ETS table.

The export job was originally taking 30 minutes. By processing the data in parallel and paying attention to I/O, it now takes about two minutes. Similar optimization on the load job took it from about three minutes down to about 8 seconds.

Elixir is not as fast as C, but it is reasonably efficient. The ability to easily parallelize work and take advantage of all the cores often makes up for absolute processing speed. Binary pattern matching works at about half the speed of C, and https://github.com/plataformatec/nimble_parsec makes it easy to implement efficient text parsers. For things which are driven by I/O and concurrency, it is very competitive.
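As a rough illustration of binary pattern matching for this kind of job, the sketch below pulls the first whitespace-delimited field out of a line without splitting the whole line. The module name and the sample record are assumptions for illustration, not the original poster's format or code.

```elixir
defmodule FirstField do
  # Returns the bytes before the first space, tab, CR, or LF.
  # If the line starts with whitespace, the result is an empty binary.
  def first_field(line) when is_binary(line), do: scan(line, 0, line)

  # Hit a delimiter: slice the prefix out of the original binary.
  defp scan(<<c, _rest::binary>>, len, original) when c in [?\s, ?\t, ?\r, ?\n] do
    binary_part(original, 0, len)
  end

  # Any other byte: keep counting.
  defp scan(<<_c, rest::binary>>, len, original), do: scan(rest, len + 1, original)

  # No delimiter at all: the whole line is the field.
  defp scan(<<>>, _len, original), do: original
end
```

For example, FirstField.first_field("example.com. 3600 IN NS ns1.example.net.\n") returns "example.com.". If only one or two fields of each record are needed, this avoids building a list of all fields per line.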

mbuhot · June 30, 2018, 9:48am

I recently found String.split to be much slower than a regex for parsing lines of input.

mix profile.fprof is easy to use and will tell you immediately where your bottlenecks are. If it takes too long to analyse, try running with a smaller input file.
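Short of a full profile, a quick way to check the split cost on your own Elixir/OTP version is to time the candidates with :timer.tc. This is only a sketch: the module name is made up, and the three splitters are not strictly equivalent (String.split/1 handles all Unicode whitespace, the :binary.split call only splits on space, tab, and newline).

```elixir
defmodule SplitTiming do
  # Runs `fun` over every line and returns the elapsed microseconds.
  def time(lines, fun) do
    {micros, _results} = :timer.tc(fn -> Enum.map(lines, fun) end)
    micros
  end

  # Times three ways of splitting each line on whitespace.
  def compare(lines) do
    regex = ~r/\s+/

    %{
      string_split: time(lines, &String.split/1),
      regex_split: time(lines, &Regex.split(regex, &1, trim: true)),
      binary_split: time(lines, &:binary.split(&1, [" ", "\t", "\n"], [:global, :trim_all]))
    }
  end
end
```

Feeding it a sample such as the first 100_000 lines of the input (for instance via File.stream!/1 and Enum.take/2) is usually enough to see which variant dominates. Single runs are noisy, so repeat the measurement a few times before drawing conclusions.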

michalmuskala · June 30, 2018, 12:21pm

There was a performance bug in String.split (or rather the underlying :binary.split) that should be fixed in OTP 21.

quazar · July 16, 2018, 9:03am

Thanks for the suggestions. String.split is indeed taking a lot of time. I ended up implementing this in Go and calling it from Phoenix. Now it flies without a hiccup.