ahsanghalib

ahsanghalib

Need help improving CSV related code which is slower than expected

I am new to Elixir, learning it. I have a list of US Cities from this github link https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv
I am trying to process this, remove the duplicate after removing the last two columns.
I have a NodeJS script doing the same but it’s way faster the elixir. Can anyone help me improve it.

Elixir : 55.9859 sec | Node: 4.2381143870018425 sec

Elixir Script.

#!/usr/bin/env elixir

defmodule Benchmark do
  def measure(function) do
    time =
      function
      |> :timer.tc()
      |> elem(0)
      |> Kernel./(1_000_000)
      |> to_string()

    IO.puts(time <> " sec")
  end
end

defmodule UsCitiesStates do
  def uniq([]), do: []

  def uniq([head | tail]) do
    [head | for(x <- uniq(tail), x != head, do: x)]
  end

  def process() do
    IO.puts("stated...")

    list =
      File.stream!("./us_cities_states_counties.csv")
      |> Stream.map(&String.trim(&1))
      |> Stream.map(&String.split(&1, "|"))
      |> Stream.filter(fn
        ["city" | _] -> false
        _ -> true
      end)
      |> Stream.map(&(Stream.drop(&1, -2) |> Enum.to_list()))

    cities =
      list
      |> Enum.to_list()
      |> uniq()

    IO.inspect(cities)
    IO.inspect(length(cities))
  end
end

Benchmark.measure(&UsCitiesStates.process/0)

Elxir Result

stated...
[
  ["Holtsville", "NY", "New York"],
  ["Adjuntas", "PR", "Puerto Rico"],
  ["Aguirre", "PR", "Puerto Rico"],
  ["Aibonito", "PR", ...],
  ["Maunabo", ...],
  [...],
  ...
]
29860
55.9859 sec

Node

import fs from "node:fs";
import readline from "node:readline/promises";
import { PerformanceObserver, performance } from "node:perf_hooks";

const perf = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(entry.duration / 1000, "sec");
  }
});

perf.observe({ entryTypes: ["measure"], buffered: true });

(async function main() {
  try {
    performance.mark("start");

    const data = [];

    const rl = readline.createInterface({
      input: fs.createReadStream("./us_cities_states_counties.csv"),
    });

    const clean = (msg, err) => {
      console.log(msg);
      if (msg === "error") console.log("error", err || "none");
      if (msg === "close") {
        performance.mark("end");
        performance.measure("fs-stat", "start", "end");
        console.log(data);
      }
    };

    rl.on("line", (ln) => {
      const d = ln.toString().trim();
      if (d === "city|state_short|state_full|county|city_alias") return;
      const city = d.split("|").slice(0, -2).join(",");
      if (data.includes(city)) return;
      data.push(city);
    });

    rl.on("close", () => clean("close"));
  } catch (e) {
    console.log(e);
  }
})();

Node Result

close
[
  [ 'Holtsville', 'NY', 'New York' ],
  [ 'Adjuntas', 'PR', 'Puerto Rico' ],
  [ 'Aguada', 'PR', 'Puerto Rico' ],
  ... 29760 more items
]
4.2381143870018425 sec

Most Liked

kip

kip

ex_cldr Core Team

A more idiomatic Elixir version might look like this:

# Requires nimble_csv dependency to be configured
NimbleCSV.define(CSV, separator: "|", escape: "\"")

defmodule CityB do
  def process do
    "./us_cities_states_counties.csv"
    |> File.stream!
    |> CSV.parse_stream()
    |> Stream.map(fn [city, state_short, state_full, _, _] ->
      [city, state_short, state_full]
    end)
    |> Enum.uniq()
  end
end

Which when benchmarked (using Benchee) against your original version is quite a bit faster and more memory efficient:

Name                       ips        average  deviation         median         99th %
Revised                  19.57       0.0511 s     ±5.87%       0.0504 s       0.0663 s
ElixirForum post        0.0464        21.54 s     ±0.00%        21.54 s        21.54 s

Comparison:
Revised                  19.57
ElixirForum post        0.0464 - 421.51x slower +21.49 s

Memory usage statistics:

Name                Memory usage
Revised                0.0753 GB
ElixirForum post        28.33 GB - 376.23x memory usage +28.26 GB
al2o3cr

al2o3cr

This is the main performance issue in the code above. It does a loop inside a recursion making comparisons, so every element of the list ends up compared to every other element of the list. That means O(N^2) for an input list of length N, and bad performance for large datasets.

If that explanation is too hand-wavy, imagine running this uniq on a list with four elements [:a, :b, :c, :d]:

  • call to uniq([:a, :b, :c, :d])
    • head bound to :a
    • call to uniq([:b, :c, :d])
      • head bound to :b
      • call to uniq([:c, :d])
        • head bound to :c
        • call to uniq([:d])
          • head bound to :d
          • call to uniq([])
            • returns []
          • compare each element of [] to head (:d) and skip ones that match
          • returns [:d | []] aka [:d]
        • compare each element of [:d] to head (:c) and skip ones that match
        • returns [:c | [:d]], aka [:c, :d]
      • compare each element of [:c, :d] to head (:b) and skip ones that match
      • returns [:b | [:c, :d]], aka [:b, :c, :d]
    • compare each element of [:b, :c, :d] to head (:a) and skip ones that match
    • returns [:a | [:b, :c, :d]] aka [:a, :b, :c, :d]

Here’s another implementation that might be slightly faster because it doesn’t construct intermediate results for values that have already been seen. It uses a simple list seen to track which elements have already appeared in the result.

defmodule SomewhatFasterUniq do
  def uniq(input, seen \\ [])
  def uniq([], _), do: []
  def uniq([head | tail], seen) do
    if head in seen do
      uniq(tail, seen)
    else
      [head | uniq(tail, [head | seen])]
    end
  end
end

HOWEVER

It’s labeled “slightly” faster because it still has the same core inefficiency - head in seen does its work by comparing head to each element of seen one at a time, so the more unique elements there are in the input the slower it goes.

What this kind of problem needs is a fast way to answer “has this value been seen before?”

The solution chosen by Enum.uniq (and other parts of stdlib like MapSet) is a Map. Looking up a key in a map is much faster (O(log N) for large maps) so this code will work much better than SomewhatFasterUniq:

defmodule MuchFasterUniq do
  def uniq(input, seen \\ %{})
  def uniq([], _), do: []
  def uniq([head | tail], seen) do
    if Map.has_key?(seen, head) do
      uniq(tail, seen)
    else
      [head | uniq(tail, Map.put(seen, head, true))]
    end
  end
end

Compare the implementation in Enum for lists:

kip

kip

ex_cldr Core Team

Elixir doesn’t try to do everything and external dependencies are encouraged to build out more capabilities. A lot of thought has been put into the hex library manager by the developers to obviate many of the challenges in other ecosystems. Therefore in general you shouldn’t be concerned about using external dependencies - it’s very normal with Elixir and totally expected.

Where Next?

Popular in Questions Top

JorisKok
I have a server on AWS, and was running a load test using artillery. When looking at the Phoenix dashboard I see the Ports going to 100% ...
New
lessless
I believe there are people here who are dealing with CSV files import on the daily basis, and since Excel is a really popular tool there ...
New
Kurisu
For example for a current url like http://localhost:4000/cosmetic/products?_utf8=✓&amp;query=perfume&amp;page=2, I would like to get: ...
New
stefanluptak
Hello everybody, usually, I use a 29" ultra-wide monitor for VSCode which can easily accomodate explorer (files panel) + file with code ...
New
vac
Hi, I'm quite new in Elixir and I'm trying to format a string to a PEM format. I have the certificate value like MIIDBTCCAe2...... and ...
New
baxterw3b
Hi guys, i’m new in the Elixir world, and i have to say, that i love it! i’m having some problem to understand anonymous functions with ...
New
srinivasu
How to handle excepions in elixir? Suppose i have A, B, C ,D, E modules. and each module has get() function. A.get() method will call th...
New
chensan
I have a User schema with a :from_id field set to type :string: defmodule TweetBot.Repo.Migrations.CreateUsers do use Ecto.Migration ...
New
shijith.k
I am trying to start a new phoenix project with elixir 1.9, but mix phx.new does not work. It says that ** (Mix) The task "phx.new" could...
New
hariharasudhan94
Lets say i have map like this fetching from my database %{"_id" =&gt; #BSON.ObjectId&lt;58eb1a7a9ad169198c3dXXXX&gt;, "email" =&gt; "XX...
New

Other popular topics Top

sorentwo
Hello! tl;dr Announcing Oban, an Ecto based job processing library with a focus on reliability and historical observability. After spen...
985 42842 311
New
aesmail
Hello guys, I have finally made it. I created an admin interface for a framework. It’s been on my todo list for years and with the curre...
New
belgoros
I’m not a pro in using Regex and can’t figure out why the following behaviour happens, especially if we take into account the difference ...
New
chrismccord
This release brings a number of exciting features, including integration with the new Phoenix LiveDashboard and Phoenix LiveView. There h...
New
ashish173
I am using Ecto timestamps with postgres, I can see the timestamps() use the :naive_dateime but for my use case I wanted to store the ti...
New
jason.o
In the code below, if the create action is not set to accept “extra_key” as an input, it errors out with a message shown above. Is there ...
New
KronicDeth
Elixir plugin for JetBrain’s IntelliJ Platform (including Rubymine) This is a plugin that adds support for Elixir to JetBrains IntelliJ...
289 35953 110
New
dblack
I’ve got an issue with an app and I’ve no idea of how to troubleshoot it. I’m hoping someone here might have seen something similar. I p...
New
romenigld
I am trying to run a deploy with docker and I successfully runned with this command: docker build -t romenigld/blog-prod . but when I t...
New
sergio
Kind of like when jquery came out, it was super necessary. Existing drag and drop libraries have a bunch of baggage to support old browse...
New

We're in Beta

About us Mission Statement