Need help improving CSV related code which is slower than expected

ahsanghalib · October 11, 2023, 10:07pm

I am new to Elixir, learning it. I have a list of US Cities from this github link https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv
I am trying to process this, remove the duplicate after removing the last two columns.
I have a NodeJS script doing the same but it’s way faster the elixir. Can anyone help me improve it.

Elixir : 55.9859 sec | Node: 4.2381143870018425 sec

Elixir Script.

#!/usr/bin/env elixir

defmodule Benchmark do
  def measure(function) do
    time =
      function
      |> :timer.tc()
      |> elem(0)
      |> Kernel./(1_000_000)
      |> to_string()

    IO.puts(time <> " sec")
  end
end

defmodule UsCitiesStates do
  def uniq([]), do: []

  def uniq([head | tail]) do
    [head | for(x <- uniq(tail), x != head, do: x)]
  end

  def process() do
    IO.puts("stated...")

    list =
      File.stream!("./us_cities_states_counties.csv")
      |> Stream.map(&String.trim(&1))
      |> Stream.map(&String.split(&1, "|"))
      |> Stream.filter(fn
        ["city" | _] -> false
        _ -> true
      end)
      |> Stream.map(&(Stream.drop(&1, -2) |> Enum.to_list()))

    cities =
      list
      |> Enum.to_list()
      |> uniq()

    IO.inspect(cities)
    IO.inspect(length(cities))
  end
end

Benchmark.measure(&UsCitiesStates.process/0)

Elxir Result

stated...
[
  ["Holtsville", "NY", "New York"],
  ["Adjuntas", "PR", "Puerto Rico"],
  ["Aguirre", "PR", "Puerto Rico"],
  ["Aibonito", "PR", ...],
  ["Maunabo", ...],
  [...],
  ...
]
29860
55.9859 sec

Node

import fs from "node:fs";
import readline from "node:readline/promises";
import { PerformanceObserver, performance } from "node:perf_hooks";

const perf = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(entry.duration / 1000, "sec");
  }
});

perf.observe({ entryTypes: ["measure"], buffered: true });

(async function main() {
  try {
    performance.mark("start");

    const data = [];

    const rl = readline.createInterface({
      input: fs.createReadStream("./us_cities_states_counties.csv"),
    });

    const clean = (msg, err) => {
      console.log(msg);
      if (msg === "error") console.log("error", err || "none");
      if (msg === "close") {
        performance.mark("end");
        performance.measure("fs-stat", "start", "end");
        console.log(data);
      }
    };

    rl.on("line", (ln) => {
      const d = ln.toString().trim();
      if (d === "city|state_short|state_full|county|city_alias") return;
      const city = d.split("|").slice(0, -2).join(",");
      if (data.includes(city)) return;
      data.push(city);
    });

    rl.on("close", () => clean("close"));
  } catch (e) {
    console.log(e);
  }
})();

Node Result

close
[
  [ 'Holtsville', 'NY', 'New York' ],
  [ 'Adjuntas', 'PR', 'Puerto Rico' ],
  [ 'Aguada', 'PR', 'Puerto Rico' ],
  ... 29760 more items
]
4.2381143870018425 sec

kip · October 11, 2023, 10:14pm

A more idiomatic Elixir version might look like this:

# Requires nimble_csv dependency to be configured
NimbleCSV.define(CSV, separator: "|", escape: "\"")

defmodule CityB do
  def process do
    "./us_cities_states_counties.csv"
    |> File.stream!
    |> CSV.parse_stream()
    |> Stream.map(fn [city, state_short, state_full, _, _] ->
      [city, state_short, state_full]
    end)
    |> Enum.uniq()
  end
end

Which when benchmarked (using Benchee) against your original version is quite a bit faster and more memory efficient:

Name                       ips        average  deviation         median         99th %
Revised                  19.57       0.0511 s     ±5.87%       0.0504 s       0.0663 s
ElixirForum post        0.0464        21.54 s     ±0.00%        21.54 s        21.54 s

Comparison:
Revised                  19.57
ElixirForum post        0.0464 - 421.51x slower +21.49 s

Memory usage statistics:

Name                Memory usage
Revised                0.0753 GB
ElixirForum post        28.33 GB - 376.23x memory usage +28.26 GB

dimitarvp · October 11, 2023, 10:22pm

Damn. That’s 95% identical to what I was writing right now. Kudos, you’re fast!

ahsanghalib · October 11, 2023, 11:13pm

So, I need to use external library…

Can you explain why Nimble is fast ?
I have checked it’s code, its using erlang functions mostly ?

Basically I should start with learning erlang.

kip · October 11, 2023, 11:39pm

Profiling would be required to establish if Nimble_Csv is the main contributor to the speed up.

I suspect it isn’t. I think your uniq function is the primary contributor since it processes the list an exponential number of times in the comprehension.

You don’t need to learn erlang to use Elixir. But over time you’re likely to find it helpful and eventually you’ll likely even enjoy it (although opinions vary).

Treat nimble_csv as a black box like you would in any CSV implementation. It uses metaprogrammjng to generate efficient code. But that’s not something you should be focused on early in Elixir use.

ahsanghalib · October 12, 2023, 12:00am

Thanks for help,

Yes, you are right about uniq function. I changed it to Enum.uniq() & got 0.252773 sec

kip · October 12, 2023, 1:22am

Elixir doesn’t try to do everything and external dependencies are encouraged to build out more capabilities. A lot of thought has been put into the hex library manager by the developers to obviate many of the challenges in other ecosystems. Therefore in general you shouldn’t be concerned about using external dependencies - it’s very normal with Elixir and totally expected.

ahsanghalib · October 12, 2023, 1:32am

yes, elixir or erlang external libs are good. but its not about their quality.

when i first start learning a new language i usually try to avoid external libs as much as possible. & also when i use libs is mostly after reading the code in github to get the gist of it.

aungmyooo2k17 · October 12, 2023, 1:45am

Check this library when you process with big file

> https://hexdocs.pm/flow/Flow.html

al2o3cr · October 12, 2023, 1:50am

This is the main performance issue in the code above. It does a loop inside a recursion making comparisons, so every element of the list ends up compared to every other element of the list. That means O(N^2) for an input list of length N, and bad performance for large datasets.

If that explanation is too hand-wavy, imagine running this uniq on a list with four elements [:a, :b, :c, :d]:

call to uniq([:a, :b, :c, :d])
- head bound to :a
- call to uniq([:b, :c, :d])
  - head bound to :b
  - call to uniq([:c, :d])
    - head bound to :c
    - call to uniq([:d])
      - head bound to :d
      - call to uniq([])
        
        returns []
      - compare each element of [] to head (:d) and skip ones that match
      - returns [:d | []] aka [:d]
    - compare each element of [:d] to head (:c) and skip ones that match
    - returns [:c | [:d]], aka [:c, :d]
  - compare each element of [:c, :d] to head (:b) and skip ones that match
  - returns [:b | [:c, :d]], aka [:b, :c, :d]
- compare each element of [:b, :c, :d] to head (:a) and skip ones that match
- returns [:a | [:b, :c, :d]] aka [:a, :b, :c, :d]

Here’s another implementation that might be slightly faster because it doesn’t construct intermediate results for values that have already been seen. It uses a simple list seen to track which elements have already appeared in the result.

defmodule SomewhatFasterUniq do
  def uniq(input, seen \\ [])
  def uniq([], _), do: []
  def uniq([head | tail], seen) do
    if head in seen do
      uniq(tail, seen)
    else
      [head | uniq(tail, [head | seen])]
    end
  end
end

HOWEVER

It’s labeled “slightly” faster because it still has the same core inefficiency - head in seen does its work by comparing head to each element of seen one at a time, so the more unique elements there are in the input the slower it goes.

What this kind of problem needs is a fast way to answer “has this value been seen before?”

The solution chosen by Enum.uniq (and other parts of stdlib like MapSet) is a Map. Looking up a key in a map is much faster (O(log N) for large maps) so this code will work much better than SomewhatFasterUniq:

defmodule MuchFasterUniq do
  def uniq(input, seen \\ %{})
  def uniq([], _), do: []
  def uniq([head | tail], seen) do
    if Map.has_key?(seen, head) do
      uniq(tail, seen)
    else
      [head | uniq(tail, Map.put(seen, head, true))]
    end
  end
end

Compare the implementation in Enum for lists:

github.com

elixir-lang/elixir/blob/44d3faad45a32d9648792b54e46e8ac27283ffb8/lib/elixir/lib/enum.ex#L4776-L4783


      
          defp uniq_list([head | tail], set, fun) do
            value = fun.(head)
          
            case set do
              %{^value => true} -> uniq_list(tail, set, fun)
              %{} -> [head | uniq_list(tail, Map.put(set, value, true), fun)]
            end
          end

dimitarvp · October 12, 2023, 2:40am

That’s fair but file format parsers should be external libraries. You don’t want to reinvent well-known (but sometimes difficult to implement, like HTTP) formats and protocols. It’s just not a productive expenditure of your time.

(At the same time, nothing is stopping you from inspecting NimbleCSV code and take lessons from it.)

And for your task in particular using a CSV parser (configured with a custom separator as @kip’s coding snippet demonstrates) is a very good idea due to corner cases, f.ex. imagine if one of the values in the CSV file actually used the | character that you use as a separator – and you’ll get more values in that row (i.e. if you normally have 5 values per row but you’ll get 6 or more values for the row that uses | inside a value). A proper parser will catch that, your homemade code will very likely not.