Compiling tens of thousands of functions with guards

Hi guys,
I am having an interesting issue with Elixir compilation time. The case is really simple: I needed a database that maps an IP address to a country. I know there are other solutions for this, but let’s focus on what I’ve observed.

The database is a flat CSV file containing “start IP”, “end IP” and “country code”. Here is an example:

"2.16.108.0","2.16.108.255","ES"
"2.16.109.0","2.16.109.255","PL"

I took it from https://db-ip.com

I decided to create a bunch of do_whereis/1 function definitions with guards. It looked like an easy job:

  dbip_stream = File.stream!(Path.join([__DIR__, @db]), [], :line) 
      |> CSV.decode 
      |> Stream.map(fn [_ip1, ip2, country] -> {ip2integer(ip2), String.to_atom(country)} end)

  for {ip2, country} <- dbip_stream do
    defp do_whereis(ip) when ip <= unquote(ip2), do: unquote(country)
  end

(ip2integer/1 translates an IP string into the corresponding integer (as 192.168.1.1 == 256*256*256*192 + 256*256*168 + 256*1 + 1); the rest should be obvious. do_whereis/1 is supposed to translate this integer into the corresponding country code.)
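
For reference, ip2integer/1 is roughly this (a minimal sketch, assuming the CSV decoder has already stripped the quotes):

  # Fold the four octets into a single integer, most significant first
  defp ip2integer(ip) do
    ip
    |> String.split(".")
    |> Enum.map(&String.to_integer/1)
    |> Enum.reduce(0, fn octet, acc -> acc * 256 + octet end)
  end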

Everything works, but the compilation takes forever. The database contains about 400,000 lines (I’ve filtered it down to IPv4 only). Interestingly, the compilation time is not linear. I ran some tests with the first N lines and these are the results:

 1000 lines ->   0,96s user 0,47s system 157% cpu   0,905 total
10000 lines ->   9,06s user 1,38s system 128% cpu   8,121 total
20000 lines ->  30,55s user 2,45s system 114% cpu  28,734 total
50000 lines -> 225,72s user 7,43s system 105% cpu 3:41,74 total

Just for fun (I don’t want to create 16 million function definitions) I’ve tried to generate a function for every single IP with:

for {ip1, ip2, country} <- dbip_stream, ip <- ip1..ip2 do
  defp do_whereis(unquote(ip)), do: unquote(country)
end

but obviously it is even worse in terms of compilation time.

Of course I could do this in many other ways, but that is not the point, I just would like to understand :slight_smile:

Is this method a bad practice? Elixir itself does it, as far as I remember, with Unicode. Have I hit the practical limits?

3 Likes

From what I remember of previous discussions, I am afraid you have indeed reached the reasonable limit on the number of function clauses. I think you are better off using a sorted ETS table in this case.
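
Something like this, for example (a minimal sketch with made-up names, assuming you already have the {end_ip_integer, country} tuples):

defmodule IpDb do
  def load(entries) do
    table = :ets.new(:ip_db, [:ordered_set, :named_table, read_concurrency: true])
    for {ip2, country} <- entries, do: :ets.insert(table, {ip2, country})
    table
  end

  def whereis(ip) when is_integer(ip) do
    # :ets.next/2 returns the smallest key strictly greater than its
    # argument, so probing with ip - 1 finds the first end_ip >= ip
    case :ets.next(:ip_db, ip - 1) do
      :"$end_of_table" ->
        nil

      end_ip ->
        [{^end_ip, country}] = :ets.lookup(:ip_db, end_ip)
        country
    end
  end
end

Lookups on an ordered_set are O(log n), and nothing has to be compiled.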

2 Likes

Exactly this. ^.^

2 Likes

You could preprocess the list and store the processed list in your project. You really don’t need a CSV library either. The preprocessor can convert the string IPs to integers and store them in a CSV file. Then the compile step simply reads each line and does a String.split(",").

Presuming the list does not change as often as you compile the program, this might be better.

If the preprocessor is still very slow, you could write it in C. You could even add it as a custom compiler if you want to get really fancy.
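
A rough sketch of such a preprocessor, reusing the ip2integer/1 helper from your post (file names are made up):

# One-off script: rewrite the vendor CSV as plain "start_int,end_int,country"
# lines, so the compile-time pass only needs String.split/2
"db/dbip-country.csv"
|> File.stream!()
|> Stream.map(fn line ->
  [ip1, ip2, country] =
    line |> String.trim() |> String.replace("\"", "") |> String.split(",")

  "#{ip2integer(ip1)},#{ip2integer(ip2)},#{country}\n"
end)
|> Stream.into(File.stream!("db/dbip-country.pre"))
|> Stream.run()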

1 Like

Processing the file is not an issue; it is relatively quick (it was the first thing I checked). Below are the benchmarks for this code:

  dbip_stream = File.stream!(Path.join([__DIR__, @db]), [], :line) 
      |> CSV.decode 
      |> Stream.filter(&is_ipv4?/1) 
      |> Stream.map(fn [ip1, ip2, country] -> {ip2integer(ip1), ip2integer(ip2), String.to_atom(country)} end)
      |> Stream.run()

I could even process the whole file in less than a minute:

10000  -> 1,27s user 0,84s system 219% cpu 0,962 total
50000  -> 3,79s user 2,74s system 311% cpu 2,093 total
~400000 -> 49,07s user 32,57s system 453% cpu 17,997 total

1 Like

Wow, that’s wild.

Please correct me if I’m wrong, but I think the whole issue is that the function definitions in the AST are stored in a list:

iex(3)> quote do       
...(3)> def a(1), do: 1
...(3)> def a(2), do: 2
...(3)> end
{:__block__, [],
 [{:def, [context: Elixir, import: Kernel],
   [{:a, [context: Elixir], [1]}, [do: 1]]},
  {:def, [context: Elixir, import: Kernel],
   [{:a, [context: Elixir], [2]}, [do: 2]]}]}

That would explain why adding more and more definitions slows things down. Appending to the end of a list is quite expensive :slight_smile:

1 Like

I wonder what it would be like if you built the AST yourself?

1 Like

You are psychic :slight_smile: , I am now thinking about building the AST myself, in reverse. I will let you guys know!

2 Likes

LOL. I was going to add the “in reverse” to my post, but I thought you would figure that out. I’d be interested in hearing what you find out. It’s an interesting problem.

1 Like

I’m pretty sure the issue is not the AST-building part, but actually compiling it and going through all the compiler passes.

Since the compiler performs many optimisations on the code, it usually is not linear with regard to the number of functions and clauses - most of the time this is completely fine.

The input file is 26MB (I checked on the website you linked) - this is huge. For comparison, all the Unicode sources Elixir uses for code generation are 1.9MB.

1 Like

Yes, you are right. The list is only a few (tens, hundreds of) thousands of elements long; it can’t just be an issue of appending to the end of it.

This is a known issue in the BEAM in general: compiling a single function is quadratic in the number of heads. You will have similar problems in Erlang, LFE, and Alpaca.

edit:

Also, before I forget! Erlang/BEAM tries hard to optimize your function heads into a tree, but because you only have “catch all” patterns with guards, it’s actually impossible, and regardless of all the optimizations that have been tried, you will end up with a sequential scan of the function heads.

If you do not want to use ETS for whatever reason, you could instead read your source into an ordered list, which is similar to your current one, but sorted (which you should do with your current approach as well!).

Then you could do something like this:

defmodule Foo do
  @the_list read_parse_and_sort_from_csv()

  def whereis(ip) do
    ip = ip2integer(ip)

    # skip all entries whose end IP is below ours; the first remaining
    # entry is the matching range
    @the_list
    |> Enum.drop_while(fn {ip2, _country} -> ip > ip2 end)
    |> hd()
    |> snd()
  end

  defp snd({_, x}), do: x
end

This should do it. Roughly, at least. But remember, as it is a sequential scan, it will take longer the “bigger” the IP gets.

4 Likes

Thanks! That explains the case. So I consider ~10,000 clauses the practical limit.

Another possible resolution (rather theoretical):

A binary tree, based on the bits of the IP address :slight_smile:

     root
    /     \
   0       1
  / \     / \
 0   1   0   1
:PL :CH :DK :PL  etc etc

Searching would be quite fast and take a bounded number of steps, every time only 16 steps to go (for IPv4)! But the tree would be quite huge (16 million leaves at the bottom) and I am afraid creating it would take forever.
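
For illustration, the trie could be built from nested maps (a toy sketch of my own; expanding the ranges into individual addresses is left out):

defmodule BitTrie do
  import Bitwise

  @bits 32  # full IPv4 depth; fewer bits if the ranges are coarser

  # Walk the address bits from most significant to least significant;
  # a leaf holds the country code
  def insert(trie, ip, country), do: do_insert(trie, ip, country, @bits - 1)

  defp do_insert(_trie, _ip, country, -1), do: country

  defp do_insert(trie, ip, country, bit) do
    key = band(bsr(ip, bit), 1)
    child = Map.get(trie, key, %{})
    Map.put(trie, key, do_insert(child, ip, country, bit - 1))
  end

  def lookup(trie, ip), do: do_lookup(trie, ip, @bits - 1)

  defp do_lookup(country, _ip, -1), do: country

  defp do_lookup(trie, ip, bit) do
    case Map.fetch(trie, band(bsr(ip, bit), 1)) do
      {:ok, child} -> do_lookup(child, ip, bit - 1)
      :error -> nil
    end
  end
end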

Well, building this tree is the optimisation the BEAM tries to accomplish, but this is only possible if it has concrete values to match on. Doing this in the function heads will give you the same quadratic compile time though…

case has the same limitation AFAIK.

1 Like

Hi @NobbZ,
I made something similar to your suggestion:

  @db "db/dbip-country.csv"
  @external_resource @db

  dbip_stream = File.stream!(Path.join([__DIR__, @db]), [], :line) 
      |> CSV.decode 
      |> Stream.filter(&is_ipv4?/1) 
      |> Stream.map(fn [_ip1, ip2, country] -> {ip2integer(ip2), String.to_atom(country)} end)
  @dbip_list Enum.sort(dbip_stream) # CSV file is already sorted, just in case

  defp do_whereis(ip) do
    {_, country} = Enum.find(@dbip_list, fn {ip2, _} -> ip <= ip2 end)
    country
  end

  def whereis(ip) do
    do_whereis(ip2integer(ip))
  end

and the compilation time is reasonable, less than a minute (53.37s user 13.43s system 194%). We can live with it, especially since it will be compiled only once in most cases.

Searching the list is quite expensive, though. For the worst case (255.255.255.255) it takes about 21ms to complete.

This might be an off-topic suggestion, but have you looked at radix trees?

4 Likes

I’ve been thinking a lot lately about actor-oriented databases, and if that piques your interest, give this paper a read: https://www.microsoft.com/en-us/research/publication/indexing-in-an-actor-oriented-database/

I wonder what it would look like to import each address, or range, into an actor and then search across that network of actors? I’m sure you could solve it another way, but it might be fun to spike this out.

1 Like

We are solving the IP -> country mapping problem here: https://github.com/ApolloCrawler/microcrawler-webapp/blob/master/lib/microcrawler_webapp/ip_info.ex
In short: load the IP ranges, sort them, convert the result to a big tuple (all at compile time), and use binary search to find the country corresponding to a given IP (at runtime).
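
The runtime half is roughly this (a simplified sketch, not the actual code from the repo):

defmodule TupleSearch do
  # table is a tuple of {end_ip, country} pairs sorted by end_ip;
  # find the leftmost entry with end_ip >= ip
  def whereis(table, ip), do: search(table, ip, 0, tuple_size(table) - 1)

  defp search(_table, _ip, lo, hi) when lo > hi, do: nil

  defp search(table, ip, lo, hi) do
    mid = div(lo + hi, 2)
    {end_ip, country} = elem(table, mid)

    cond do
      ip > end_ip -> search(table, ip, mid + 1, hi)
      lo == hi -> country
      true -> search(table, ip, lo, mid)
    end
  end
end

elem/2 on a tuple is O(1), so each lookup is O(log n) - about 19 steps for 400,000 entries.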

2 Likes