Need help improving CSV related code which is slower than expected

I am new to Elixir, learning it. I have a list of US Cities from this github link https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv
I am trying to process this, remove the duplicate after removing the last two columns.
I have a NodeJS script doing the same but it’s way faster the elixir. Can anyone help me improve it.

Elixir : 55.9859 sec | Node: 4.2381143870018425 sec

Elixir Script.

#!/usr/bin/env elixir

defmodule Benchmark do
  def measure(function) do
    time =
      function
      |> :timer.tc()
      |> elem(0)
      |> Kernel./(1_000_000)
      |> to_string()

    IO.puts(time <> " sec")
  end
end

defmodule UsCitiesStates do
  def uniq([]), do: []

  def uniq([head | tail]) do
    [head | for(x <- uniq(tail), x != head, do: x)]
  end

  def process() do
    IO.puts("stated...")

    list =
      File.stream!("./us_cities_states_counties.csv")
      |> Stream.map(&String.trim(&1))
      |> Stream.map(&String.split(&1, "|"))
      |> Stream.filter(fn
        ["city" | _] -> false
        _ -> true
      end)
      |> Stream.map(&(Stream.drop(&1, -2) |> Enum.to_list()))

    cities =
      list
      |> Enum.to_list()
      |> uniq()

    IO.inspect(cities)
    IO.inspect(length(cities))
  end
end

Benchmark.measure(&UsCitiesStates.process/0)

Elxir Result

stated...
[
  ["Holtsville", "NY", "New York"],
  ["Adjuntas", "PR", "Puerto Rico"],
  ["Aguirre", "PR", "Puerto Rico"],
  ["Aibonito", "PR", ...],
  ["Maunabo", ...],
  [...],
  ...
]
29860
55.9859 sec

Node

import fs from "node:fs";
import readline from "node:readline/promises";
import { PerformanceObserver, performance } from "node:perf_hooks";

const perf = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(entry.duration / 1000, "sec");
  }
});

perf.observe({ entryTypes: ["measure"], buffered: true });

(async function main() {
  try {
    performance.mark("start");

    const data = [];

    const rl = readline.createInterface({
      input: fs.createReadStream("./us_cities_states_counties.csv"),
    });

    const clean = (msg, err) => {
      console.log(msg);
      if (msg === "error") console.log("error", err || "none");
      if (msg === "close") {
        performance.mark("end");
        performance.measure("fs-stat", "start", "end");
        console.log(data);
      }
    };

    rl.on("line", (ln) => {
      const d = ln.toString().trim();
      if (d === "city|state_short|state_full|county|city_alias") return;
      const city = d.split("|").slice(0, -2).join(",");
      if (data.includes(city)) return;
      data.push(city);
    });

    rl.on("close", () => clean("close"));
  } catch (e) {
    console.log(e);
  }
})();

Node Result

close
[
  [ 'Holtsville', 'NY', 'New York' ],
  [ 'Adjuntas', 'PR', 'Puerto Rico' ],
  [ 'Aguada', 'PR', 'Puerto Rico' ],
  ... 29760 more items
]
4.2381143870018425 sec

A more idiomatic Elixir version might look like this:

# Requires nimble_csv dependency to be configured
NimbleCSV.define(CSV, separator: "|", escape: "\"")

defmodule CityB do
  def process do
    "./us_cities_states_counties.csv"
    |> File.stream!
    |> CSV.parse_stream()
    |> Stream.map(fn [city, state_short, state_full, _, _] ->
      [city, state_short, state_full]
    end)
    |> Enum.uniq()
  end
end

Which when benchmarked (using Benchee) against your original version is quite a bit faster and more memory efficient:

Name                       ips        average  deviation         median         99th %
Revised                  19.57       0.0511 s     ±5.87%       0.0504 s       0.0663 s
ElixirForum post        0.0464        21.54 s     ±0.00%        21.54 s        21.54 s

Comparison:
Revised                  19.57
ElixirForum post        0.0464 - 421.51x slower +21.49 s

Memory usage statistics:

Name                Memory usage
Revised                0.0753 GB
ElixirForum post        28.33 GB - 376.23x memory usage +28.26 GB
9 Likes

Damn. That’s 95% identical to what I was writing right now. Kudos, you’re fast!

2 Likes

So, I need to use external library…

Can you explain why Nimble is fast ?
I have checked it’s code, its using erlang functions mostly ?

Basically I should start with learning erlang.

Profiling would be required to establish if Nimble_Csv is the main contributor to the speed up.

I suspect it isn’t. I think your uniq function is the primary contributor since it processes the list an exponential number of times in the comprehension.

You don’t need to learn erlang to use Elixir. But over time you’re likely to find it helpful and eventually you’ll likely even enjoy it (although opinions vary).

Treat nimble_csv as a black box like you would in any CSV implementation. It uses metaprogrammjng to generate efficient code. But that’s not something you should be focused on early in Elixir use.

2 Likes

Thanks for help,

Yes, you are right about uniq function. I changed it to Enum.uniq() & got 0.252773 sec

Elixir doesn’t try to do everything and external dependencies are encouraged to build out more capabilities. A lot of thought has been put into the hex library manager by the developers to obviate many of the challenges in other ecosystems. Therefore in general you shouldn’t be concerned about using external dependencies - it’s very normal with Elixir and totally expected.

3 Likes

yes, elixir or erlang external libs are good. but its not about their quality.

when i first start learning a new language i usually try to avoid external libs as much as possible. & also when i use libs is mostly after reading the code in github to get the gist of it.

Check this library when you process with big file

> https://hexdocs.pm/flow/Flow.html

This is the main performance issue in the code above. It does a loop inside a recursion making comparisons, so every element of the list ends up compared to every other element of the list. That means O(N^2) for an input list of length N, and bad performance for large datasets.

If that explanation is too hand-wavy, imagine running this uniq on a list with four elements [:a, :b, :c, :d]:

  • call to uniq([:a, :b, :c, :d])
    • head bound to :a
    • call to uniq([:b, :c, :d])
      • head bound to :b
      • call to uniq([:c, :d])
        • head bound to :c
        • call to uniq([:d])
          • head bound to :d
          • call to uniq([])
            • returns []
          • compare each element of [] to head (:d) and skip ones that match
          • returns [:d | []] aka [:d]
        • compare each element of [:d] to head (:c) and skip ones that match
        • returns [:c | [:d]], aka [:c, :d]
      • compare each element of [:c, :d] to head (:b) and skip ones that match
      • returns [:b | [:c, :d]], aka [:b, :c, :d]
    • compare each element of [:b, :c, :d] to head (:a) and skip ones that match
    • returns [:a | [:b, :c, :d]] aka [:a, :b, :c, :d]

Here’s another implementation that might be slightly faster because it doesn’t construct intermediate results for values that have already been seen. It uses a simple list seen to track which elements have already appeared in the result.

defmodule SomewhatFasterUniq do
  def uniq(input, seen \\ [])
  def uniq([], _), do: []
  def uniq([head | tail], seen) do
    if head in seen do
      uniq(tail, seen)
    else
      [head | uniq(tail, [head | seen])]
    end
  end
end

HOWEVER

It’s labeled “slightly” faster because it still has the same core inefficiency - head in seen does its work by comparing head to each element of seen one at a time, so the more unique elements there are in the input the slower it goes.

What this kind of problem needs is a fast way to answer “has this value been seen before?”

The solution chosen by Enum.uniq (and other parts of stdlib like MapSet) is a Map. Looking up a key in a map is much faster (O(log N) for large maps) so this code will work much better than SomewhatFasterUniq:

defmodule MuchFasterUniq do
  def uniq(input, seen \\ %{})
  def uniq([], _), do: []
  def uniq([head | tail], seen) do
    if Map.has_key?(seen, head) do
      uniq(tail, seen)
    else
      [head | uniq(tail, Map.put(seen, head, true))]
    end
  end
end

Compare the implementation in Enum for lists:

5 Likes

That’s fair but file format parsers should be external libraries. You don’t want to reinvent well-known (but sometimes difficult to implement, like HTTP) formats and protocols. It’s just not a productive expenditure of your time.

(At the same time, nothing is stopping you from inspecting NimbleCSV code and take lessons from it.)

And for your task in particular using a CSV parser (configured with a custom separator as @kip’s coding snippet demonstrates) is a very good idea due to corner cases, f.ex. imagine if one of the values in the CSV file actually used the | character that you use as a separator – and you’ll get more values in that row (i.e. if you normally have 5 values per row but you’ll get 6 or more values for the row that uses | inside a value). A proper parser will catch that, your homemade code will very likely not.

1 Like