Using metaprogramming to create lookup functions

I’m trying to create a sort of inverted index of pharmaceutical terms. For example, I’ll have a file magnesium.txt with the following content:

magnesium
Mg
atomic number 12
Magnesio
Epsom Salts

What I have created looks something like the following:

defmodule Lookup do
  def lookup("magnesium"), do: "magnesium"
  def lookup("Mg"), do: "magnesium"
  def lookup("atomic number 12"), do: "magnesium"
  def lookup("Magnesio"), do: "magnesium"
  def lookup("Epsom Salts"), do: "magnesium"
end

My dataset is much larger, so I used metaprogramming to generate these functions. It works pretty well, although I had to increase the default maximum number of atoms when compiling, and compilation itself takes around 5 minutes.

I’m wondering if this is actually a good idea. I originally got the idea from Chris McCord’s Metaprogramming Elixir, but is there a better way to handle this?

2 Likes

Hello @uri, I think metaprogramming is the perfect way to go about doing this. If you’ve read Metaprogramming Elixir by Chris McCord, then you should have a good foundation for it (he has a similar example in the book).

Metaprogramming can totally be used to generate code dynamically from external data. This is exactly what Elixir does with its Unicode support: there’s a unicode.txt file with all the data relevant to each Unicode codepoint, and all the needed code is generated via metaprogramming while reading that file. Each time a new Unicode codepoint must be added, the core team just needs to add the new data to the file.
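
As a minimal sketch of that pattern (the path and one-term-per-line format here are assumptions for illustration, not the actual Unicode source):

defmodule Sketch.Lookup do
  # Each line of the data file becomes one function clause at compile time.
  for line <- File.stream!("data/magnesium.txt") do
    term = String.trim(line)
    def lookup(unquote(term)), do: "magnesium"
  end

  def lookup(_term), do: nil
end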

I’d say to try to do the code generation by yourself so you learn and explore metaprogramming a bit, and if you can’t, come back and I (or someone else) will help you with the task :slight_smile:

Why did you need to increase the default number of atoms? In your example, at least, you are using strings and not creating any new atoms. It’s not strange that compiling takes such a long time if you are creating one whopping lookup function.

1 Like

Besides @rvirding’s question, which I would love to know the answer to, remember to use the @external_resource attribute and make it point to the magnesium.txt file so the module is recompiled if the source file changes.
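
For example (path assumed):

defmodule Lookup do
  # Recompile this module whenever the source data file changes.
  @external_resource Path.join("data", "magnesium.txt")

  # ... generated lookup clauses ...
end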

4 Likes

Honestly, I’d stuff all that into ETS, as it is a simple key -> value mapping. Significantly simpler to do: just read the files into ETS on load. You can even update the list very easily at runtime then. :slight_smile:
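
A sketch of what that could look like (the table name and paths are assumptions):

defmodule Lookup.Store do
  # Create a named table and fill it from the data files on load.
  def init do
    :ets.new(:lookup, [:named_table, :set, :public, read_concurrency: true])

    for file <- File.ls!("data"), Path.extname(file) == ".txt" do
      monograph = file |> Path.rootname() |> String.replace("_", " ")

      for line <- File.stream!(Path.join("data", file)) do
        term = line |> String.downcase() |> String.trim()
        :ets.insert(:lookup, {term, monograph})
      end
    end

    :ok
  end

  def lookup(term) do
    case :ets.lookup(:lookup, term) do
      [{^term, monograph}] -> monograph
      [] -> nil
    end
  end
end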

4 Likes

Yes, I agree. Why not just put this into a map or ETS?

2 Likes

Hey @sashaafm, thanks for taking the time to reply. I could have been clearer: I have already implemented this via metaprogramming. It works well, it just takes forever to compile, and I have to pass additional compiler options because it ends up hitting the default maximum number of atoms (1,048,576).

Yeah, I’m not sure exactly where the atoms are coming from. I assume that somewhere, when Elixir is being converted into Erlang, it’s creating an atom for each function.

I actually forgot about that! The only problem is that each term has its own file, so I have something like 1,400 files. Is it still feasible to list them all as @external_resource?

And as for increasing the atom limit, I just assumed it had something to do with how Elixir transforms into Erlang.

Yeah, I was just trying to feel out what other kinds of solutions are out there. I might try this as well.

It should be.
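
Since @external_resource accumulates, you can register one entry per file from inside the module body, roughly:

# Inside the module body: @external_resource accumulates,
# so register every data file as a compile-time dependency.
for file <- File.ls!("data") do
  @external_resource Path.join("data", file)
end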

Oh, if everything is in a single module, both the Elixir and Erlang compilers may generate intermediate variables, and since those are atoms, yes, they can end up exceeding the limit.

Yup, this is also a good option. One approach is to have an explicit task that builds the table, so you don’t need to build it every time your app starts. The tzinfo project does something similar.
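
A sketch of such a task (the module name and file paths are made up), persisting the table with :ets.tab2file/2 so the app can later load it with :ets.file2tab/1:

defmodule Mix.Tasks.Lookup.Build do
  use Mix.Task

  @shortdoc "Builds the lookup table and dumps it to disk"

  # Build the ETS table once, then dump it to a file that the
  # application can load at startup via :ets.file2tab/1.
  def run(_args) do
    table = :ets.new(:lookup, [:set, :public])

    for file <- File.ls!("data"), Path.extname(file) == ".txt" do
      monograph = file |> Path.rootname() |> String.replace("_", " ")

      for line <- File.stream!(Path.join("data", file)) do
        term = line |> String.downcase() |> String.trim()
        :ets.insert(table, {term, monograph})
      end
    end

    :ok = :ets.tab2file(table, String.to_charlist("priv/lookup.tab"))
  end
end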

1 Like

Is there a way to debug something like this? Here is the code in question:

defmodule Lookup do
  path = Path.join([".", "data"])
  files = File.ls!(path)

  Enum.each(files, fn filename ->
    # Derive the monograph name from the file name,
    # e.g. "Epsom_Salts.txt" -> "epsom salts".
    monograph =
      filename
      |> String.replace("_", " ")
      |> String.replace(".txt", "")
      |> String.downcase

    filestream = path |> Path.join(filename) |> File.stream!([], :line)

    # Generate one lookup/1 clause per line of the file.
    Enum.each(filestream, fn line ->
      match = line |> String.downcase |> String.trim
      def lookup(unquote(match)), do: unquote(monograph)
    end)
  end)

  # Catch-all clauses for empty and unknown terms.
  def lookup(""), do: ""
  def lookup(nil), do: ""
  def lookup(_term), do: nil
end
1 Like

I’m curious about the relative performance characteristics of a hard-coded lookup vs :ets. When the values start being large, I’m sure there are memory-copying concerns involved. Are there any situations where :ets may be faster? I suppose if the functions don’t return bare literals but have to make some additional function calls?

2 Likes

I’d say look at the generated Erlang AST and see what it does, and maybe make a few benchmarks? It would be fascinating to see. :slight_smile:
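
For example, a rough micro-benchmark with :timer.tc/1, assuming the generated Lookup module and an :lookup ETS table built as sketched earlier (the terms are downcased to match the generated clauses):

terms = ["mg", "epsom salts", "atomic number 12"]

# Time 100k rounds of function-clause lookups.
{func_us, _} =
  :timer.tc(fn ->
    for _ <- 1..100_000, term <- terms, do: Lookup.lookup(term)
  end)

# Time the same workload against ETS.
{ets_us, _} =
  :timer.tc(fn ->
    for _ <- 1..100_000, term <- terms, do: :ets.lookup(:lookup, term)
  end)

IO.puts("function clauses: #{func_us}us, ets: #{ets_us}us")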

There is always copying involved when using ETS. Every time you access an ETS table, you are copying data between the process heap and the ETS table. Processes don’t share data.

I don’t think there is any problem with the generated Erlang AST as such, unless the Elixir compiler does something really, really strange :smile: . There is definitely an issue with how the compiler handles pattern matching of literal binaries/Elixir strings: the binaries are split into separate bytes, and the pattern-matching compiler seems to create a variable per byte. Internally, a variable is just a tuple {:var, :"_ker953"} where the variable name is an atom. Hence the very large number of atoms.

I don’t know if there is any way around this. Keeping the data in ETS tables is an alternative.
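
If you want to watch the atom table fill up, newer OTP releases (20+) expose counters you can poll; a quick check:

# Compare current atom usage against the VM limit (OTP 20+).
count = :erlang.system_info(:atom_count)
limit = :erlang.system_info(:atom_limit)
IO.puts("atoms in use: #{count} of #{limit}")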

2 Likes

Thanks for pointing this out, @rvirding! I switched from using binaries to charlists and I didn’t have to increase the global atom limit.
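
Roughly, the change is converting each term to a charlist before unquoting it, so the generated clauses match on lists rather than binaries (callers then have to pass charlists too); a minimal sketch:

defmodule Lookup do
  for line <- File.stream!("data/magnesium.txt") do
    # Matching on charlists sidesteps the per-byte variables created
    # for literal binary patterns, as described above.
    match = line |> String.downcase() |> String.trim() |> String.to_charlist()
    def lookup(unquote(match)), do: "magnesium"
  end

  def lookup(_term), do: nil
end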

How did it go with the speed? Is it noticeably faster or slower?

I haven’t done a proper benchmark, but using binaries seems to take slightly longer to compile.