ahsanghalib
Need help improving CSV related code which is slower than expected
I am new to Elixir, learning it. I have a list of US Cities from this github link https://raw.githubusercontent.com/grammakov/USA-cities-and-states/master/us_cities_states_counties.csv
I am trying to process this, remove the duplicate after removing the last two columns.
I have a NodeJS script doing the same but it’s way faster the elixir. Can anyone help me improve it.
Elixir : 55.9859 sec | Node: 4.2381143870018425 sec
Elixir Script.
#!/usr/bin/env elixir
defmodule Benchmark do
def measure(function) do
time =
function
|> :timer.tc()
|> elem(0)
|> Kernel./(1_000_000)
|> to_string()
IO.puts(time <> " sec")
end
end
defmodule UsCitiesStates do
def uniq([]), do: []
def uniq([head | tail]) do
[head | for(x <- uniq(tail), x != head, do: x)]
end
def process() do
IO.puts("stated...")
list =
File.stream!("./us_cities_states_counties.csv")
|> Stream.map(&String.trim(&1))
|> Stream.map(&String.split(&1, "|"))
|> Stream.filter(fn
["city" | _] -> false
_ -> true
end)
|> Stream.map(&(Stream.drop(&1, -2) |> Enum.to_list()))
cities =
list
|> Enum.to_list()
|> uniq()
IO.inspect(cities)
IO.inspect(length(cities))
end
end
Benchmark.measure(&UsCitiesStates.process/0)
Elxir Result
stated...
[
["Holtsville", "NY", "New York"],
["Adjuntas", "PR", "Puerto Rico"],
["Aguirre", "PR", "Puerto Rico"],
["Aibonito", "PR", ...],
["Maunabo", ...],
[...],
...
]
29860
55.9859 sec
Node
import fs from "node:fs";
import readline from "node:readline/promises";
import { PerformanceObserver, performance } from "node:perf_hooks";
const perf = new PerformanceObserver((list) => {
for (const entry of list.getEntries()) {
console.log(entry.duration / 1000, "sec");
}
});
perf.observe({ entryTypes: ["measure"], buffered: true });
(async function main() {
try {
performance.mark("start");
const data = [];
const rl = readline.createInterface({
input: fs.createReadStream("./us_cities_states_counties.csv"),
});
const clean = (msg, err) => {
console.log(msg);
if (msg === "error") console.log("error", err || "none");
if (msg === "close") {
performance.mark("end");
performance.measure("fs-stat", "start", "end");
console.log(data);
}
};
rl.on("line", (ln) => {
const d = ln.toString().trim();
if (d === "city|state_short|state_full|county|city_alias") return;
const city = d.split("|").slice(0, -2).join(",");
if (data.includes(city)) return;
data.push(city);
});
rl.on("close", () => clean("close"));
} catch (e) {
console.log(e);
}
})();
Node Result
close
[
[ 'Holtsville', 'NY', 'New York' ],
[ 'Adjuntas', 'PR', 'Puerto Rico' ],
[ 'Aguada', 'PR', 'Puerto Rico' ],
... 29760 more items
]
4.2381143870018425 sec
Most Liked
kip
A more idiomatic Elixir version might look like this:
# Requires nimble_csv dependency to be configured
NimbleCSV.define(CSV, separator: "|", escape: "\"")
defmodule CityB do
def process do
"./us_cities_states_counties.csv"
|> File.stream!
|> CSV.parse_stream()
|> Stream.map(fn [city, state_short, state_full, _, _] ->
[city, state_short, state_full]
end)
|> Enum.uniq()
end
end
Which when benchmarked (using Benchee) against your original version is quite a bit faster and more memory efficient:
Name ips average deviation median 99th %
Revised 19.57 0.0511 s ±5.87% 0.0504 s 0.0663 s
ElixirForum post 0.0464 21.54 s ±0.00% 21.54 s 21.54 s
Comparison:
Revised 19.57
ElixirForum post 0.0464 - 421.51x slower +21.49 s
Memory usage statistics:
Name Memory usage
Revised 0.0753 GB
ElixirForum post 28.33 GB - 376.23x memory usage +28.26 GB
al2o3cr
This is the main performance issue in the code above. It does a loop inside a recursion making comparisons, so every element of the list ends up compared to every other element of the list. That means O(N^2) for an input list of length N, and bad performance for large datasets.
If that explanation is too hand-wavy, imagine running this uniq on a list with four elements [:a, :b, :c, :d]:
- call to
uniq([:a, :b, :c, :d])headbound to:a- call to
uniq([:b, :c, :d])headbound to:b- call to
uniq([:c, :d])headbound to:c- call to
uniq([:d])headbound to:d- call to
uniq([])- returns
[]
- returns
- compare each element of
[]tohead(:d) and skip ones that match - returns
[:d | []]aka[:d]
- compare each element of
[:d]tohead(:c) and skip ones that match - returns
[:c | [:d]], aka[:c, :d]
- compare each element of
[:c, :d]tohead(:b) and skip ones that match - returns [:b | [:c, :d]], aka
[:b, :c, :d]
- compare each element of
[:b, :c, :d]tohead(:a) and skip ones that match - returns
[:a | [:b, :c, :d]]aka[:a, :b, :c, :d]
Here’s another implementation that might be slightly faster because it doesn’t construct intermediate results for values that have already been seen. It uses a simple list seen to track which elements have already appeared in the result.
defmodule SomewhatFasterUniq do
def uniq(input, seen \\ [])
def uniq([], _), do: []
def uniq([head | tail], seen) do
if head in seen do
uniq(tail, seen)
else
[head | uniq(tail, [head | seen])]
end
end
end
HOWEVER
It’s labeled “slightly” faster because it still has the same core inefficiency - head in seen does its work by comparing head to each element of seen one at a time, so the more unique elements there are in the input the slower it goes.
What this kind of problem needs is a fast way to answer “has this value been seen before?”
The solution chosen by Enum.uniq (and other parts of stdlib like MapSet) is a Map. Looking up a key in a map is much faster (O(log N) for large maps) so this code will work much better than SomewhatFasterUniq:
defmodule MuchFasterUniq do
def uniq(input, seen \\ %{})
def uniq([], _), do: []
def uniq([head | tail], seen) do
if Map.has_key?(seen, head) do
uniq(tail, seen)
else
[head | uniq(tail, Map.put(seen, head, true))]
end
end
end
Compare the implementation in Enum for lists:
kip
Elixir doesn’t try to do everything and external dependencies are encouraged to build out more capabilities. A lot of thought has been put into the hex library manager by the developers to obviate many of the challenges in other ecosystems. Therefore in general you shouldn’t be concerned about using external dependencies - it’s very normal with Elixir and totally expected.







