Very slow compiles loading large data as module attributes for escript

I have an escript CLI application where I am processing large lists (in the millions) of street addresses and matching them against government databases of valid zip codes, city names, counties, and states. To get great matching performance I’ve converted these databases into maps such as:

%{"47401" => ["BLOOMINGTON,IN", "WOODBRIDGE,IN"], "47402" => ["BLOOMINGTON,IN"], ...}

and

%{"BLOOMINGTON,IN" => %{county: "MONROE", fips: "18105", lat: "39.165325", long: "-86.5263857"}, ...}

I then saved the maps as .etf files in the /priv directory.

When processing the millions of addresses I run a lot of code such as Map.get(zip_city_map(), "47401") to validate the city name matches. It’s very fast and works perfectly!

I don’t mind loading them into memory for good runtime performance, even at the expense of compile-time performance. This is also a command-line escript app, which can’t read /priv at runtime, so I think I have to load the lookup tables into compiled .beam files like this:

defmodule ZipCityData do
  @external_resource Path.join(__DIR__, "../../../priv/zip_cities.etf")
  @external_resource Path.join(__DIR__, "../../../priv/city_states.etf")
  @external_resource Path.join(__DIR__, "../../../priv/gnis_civil_pop.etf")

  # 725 KB file
  @zip_cities File.read!(Path.join(__DIR__, "../../../priv/zip_cities.etf")) |> :erlang.binary_to_term()
  def zip_city_map(), do: @zip_cities

  # 308 KB file
  @city_states File.read!(Path.join(__DIR__, "../../../priv/city_states.etf")) |> :erlang.binary_to_term()
  def usps_city_state_map(), do: @city_states

  # 8.4 MB file
  @gnis_civil_pop File.read!(Path.join(__DIR__, "../../../priv/gnis_civil_pop.etf")) |> :erlang.binary_to_term()
  def gnis_city_state_map(), do: @gnis_civil_pop
end

The problem is that my compiles became very slow (from 2 seconds to 30 seconds). This would be no big deal if it only happened when I changed the ZipCityData module (very rarely), but it happens on every recompile, no matter which unrelated module in my project I edit. I’ve searched around and can’t find a better way to do it that works with compiled escripts.

If the data doesn’t have interdependencies, then I’d suggest using multiple modules (each in its own file). That way you can take advantage of the Elixir compiler, which compiles modules in separate files in parallel. You could still have one central module that delegates to the actual implementations in the modules that have the data compiled in.
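A minimal sketch of that layout follows. The submodule names ZipCityData.ZipCities and ZipCityData.CityStates are made up for illustration, and small inline maps stand in for the File.read!/binary_to_term calls so the example runs on its own. In a real project each module would live in its own file so the parallel compiler can work on them separately.

```elixir
# Hypothetical data module; in the real project the attribute would be
# loaded from priv/zip_cities.etf at compile time.
defmodule ZipCityData.ZipCities do
  @zip_cities %{
    "47401" => ["BLOOMINGTON,IN", "WOODBRIDGE,IN"],
    "47402" => ["BLOOMINGTON,IN"]
  }
  def zip_city_map, do: @zip_cities
end

# Second hypothetical data module, standing in for priv/city_states.etf.
defmodule ZipCityData.CityStates do
  @city_states %{
    "BLOOMINGTON,IN" => %{county: "MONROE", fips: "18105"}
  }
  def usps_city_state_map, do: @city_states
end

# Central module: callers keep using ZipCityData.zip_city_map/0 etc.,
# while the data itself is compiled into the modules above.
defmodule ZipCityData do
  defdelegate zip_city_map(), to: ZipCityData.ZipCities
  defdelegate usps_city_state_map(), to: ZipCityData.CityStates
end
```

Existing calls such as Map.get(ZipCityData.zip_city_map(), "47401") would keep working unchanged with this arrangement.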

Another step I’d consider is not compiling the data into the modules as one blob, but, if possible, compiling it into multiple function clauses with different function heads. That could reduce the runtime cost of working through big chunks of data over and over again. You can look at ex_cldr for inspiration, which does exactly that with the CLDR database.

E.g. for your zips, compile them into functions like this:

@external_resource Path.join(__DIR__, "../../../priv/zip_cities.etf")
zip_cities = File.read!(Path.join(__DIR__, "../../../priv/zip_cities.etf")) |> :erlang.binary_to_term()
for {zip, cities} <- zip_cities do
  # each zip maps to a list of "CITY,ST" strings
  def zip_city(unquote(zip)), do: unquote(cities)
end
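For reference, here is a self-contained variant of the same idea that can be pasted into a script. The module name ZipLookup and the inline sample map are assumptions made for the example; the real data would still come from the .etf file. The catch-all clause is an addition so that unknown zips return nil, as Map.get would, instead of raising a FunctionClauseError.

```elixir
defmodule ZipLookup do
  # Sample data standing in for the decoded priv/zip_cities.etf contents.
  zip_cities = %{
    "47401" => ["BLOOMINGTON,IN", "WOODBRIDGE,IN"],
    "47402" => ["BLOOMINGTON,IN"]
  }

  # One function clause per zip code, generated at compile time.
  # Lists of strings are valid quoted literals, so plain unquote/1 is enough;
  # values such as maps would need Macro.escape/1 first.
  for {zip, cities} <- zip_cities do
    def zip_city(unquote(zip)), do: unquote(cities)
  end

  # Fallback for zips that are not in the data set.
  def zip_city(_unknown), do: nil
end
```

A call like ZipLookup.zip_city("47401") then resolves through pattern matching on the function head rather than a map lookup.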

As far as I remember, mix escript.build has to strip the modules and compress them into the binary every time. Depending on the size of the module files, this can of course take a long time.

As a rule of thumb, building the escript will always take longer than tarring the _build/$env/lib/*/ebin folders.

Thanks for the advice… very good idea to put each data file in a different module to leverage the parallel compiler!

I’ve seen threads on here about generating thousands of functions in the same module, which I thought were interesting, but I’ve never benchmarked that. If I get some time I can compare it to my Map.get-based approach. One of my maps has ~44,000 keys (zip codes in the US) and another has ~180,000 keys (unique city/state pairs in the US), and the Map.get performance is still quite good.
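A rough way to run that comparison is sketched below. It uses synthetic keys (the numbers 1 to 1,000 as strings) and a made-up module name, FunHeadLookup, so the timings say nothing about the real zip data. A single :timer.tc run is also crude; a benchmarking library such as Benchee would give more trustworthy numbers.

```elixir
defmodule FunHeadLookup do
  # Generate one clause per synthetic key: get("1") -> 1, get("2") -> 2, ...
  for n <- 1..1_000 do
    key = Integer.to_string(n)
    def get(unquote(key)), do: unquote(n)
  end

  def get(_unknown), do: nil
end

# The same data as a plain map, for the Map.get side of the comparison.
map = Map.new(1..1_000, fn n -> {Integer.to_string(n), n} end)
keys = Map.keys(map)

# :timer.tc/1 returns {elapsed_microseconds, result}.
{map_us, _} = :timer.tc(fn -> Enum.each(keys, &Map.get(map, &1)) end)
{fun_us, _} = :timer.tc(fn -> Enum.each(keys, &FunHeadLookup.get/1) end)

IO.puts("Map.get: #{map_us} us, function heads: #{fun_us} us")
```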

I solved my initial problem by extracting ZipCityData into a separate umbrella app/library and including it as a dependency. It rarely changes, so my IEx recompiles and test runs are super fast again whenever I’m not changing that module. Of course mix escript.build is still slow, but that’s OK as I don’t release new versions very often.
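For anyone wanting to do the same, the CLI app's mix.exs might look roughly like this. The app names :address_cli and :zip_city_data and the module AddressCli.CLI are placeholders rather than the actual project's names, and the other settings a generated umbrella child normally carries (build_path, deps_path and so on) are left out for brevity.

```elixir
# apps/address_cli/mix.exs (hypothetical layout)
defmodule AddressCli.MixProject do
  use Mix.Project

  def project do
    [
      app: :address_cli,
      version: "0.1.0",
      # Placeholder entry point for the escript.
      escript: [main_module: AddressCli.CLI],
      deps: deps()
    ]
  end

  defp deps do
    [
      # The data-only app is a sibling inside the umbrella. Outside an
      # umbrella, {:zip_city_data, path: "../zip_city_data"} is the
      # equivalent path dependency.
      {:zip_city_data, in_umbrella: true}
    ]
  end
end
```

Because the data app is compiled as its own dependency, editing modules in the CLI app should not force the large data modules to be rebuilt.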