Using metaprogramming to create lookup functions

I’m trying to create a sort of inverted index of pharmaceutical terms. For example, I’ll have a file magnesium.txt with the following content:

magnesium
Mg
atomic number 12
Magnesio
Epsom Salts

What I have created looks something like the following:

defmodule Lookup do
  def lookup("magnesium"), do: "magnesium"
  def lookup("Mg"), do: "magnesium"
  def lookup("atomic number 12"), do: "magnesium"
  def lookup("Magnesio"), do: "magnesium"
  def lookup("Epsom Salts"), do: "magnesium"
end

My dataset is much larger, so I used metaprogramming to generate these functions. It works pretty well, although I had to increase the default maximum number of atoms when compiling, and compilation itself takes around 5 minutes.

I’m wondering if this is actually a good idea. I originally got the idea from Chris McCord’s Metaprogramming Elixir, but is there a better way to handle this?

2 Likes

Hello @uri, I think metaprogramming is the perfect way to go about doing this. If you’ve read Metaprogramming Elixir by Chris McCord, then you should have a good foundation for it (he has a similar example in the book).

Metaprogramming can totally be used to generate code dynamically from external data. This is exactly what Elixir does with its Unicode support: there’s a unicode.txt file with all the data relevant to each Unicode codepoint, and all the needed code is generated via metaprogramming while reading that file. Each time a new Unicode codepoint must be added, the core team just needs to add the new data to the file.
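
As a minimal sketch of that pattern (the path and one-term-per-line format here are assumptions for illustration, not the actual Unicode source):

defmodule Sketch.Lookup do
  # Each line of the data file becomes one function clause at compile time.
  for line <- File.stream!("data/magnesium.txt") do
    term = String.trim(line)
    def lookup(unquote(term)), do: "magnesium"
  end

  def lookup(_term), do: nil
end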

I’d say to try to do the code generation by yourself so you learn and explore metaprogramming a bit, and if you can’t, come back and I (or someone else) will help you with the task :slight_smile:

Why did you need to increase the default number of atoms? In your example, at least, you are using strings and not creating any new atoms. It’s not strange that compiling takes such a long time if you are creating one whopping lookup function.

1 Like

Besides @rvirding’s question, which I would love to know the answer to, remember to use the @external_resource attribute and make it point to the magnesium.txt file so the module is recompiled if the source file changes.
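
For example (path assumed):

defmodule Lookup do
  # Recompile this module whenever the source data file changes.
  @external_resource Path.join("data", "magnesium.txt")

  # ... generated lookup clauses ...
end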

4 Likes

Honestly, I’d stuff all that into ETS, as it is a simple key -> value mapping. Significantly simpler to do: just read the files into ETS on load. You can even update the list very easily at runtime then. :slight_smile:
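
A sketch of what that could look like (the table name and paths are assumptions):

defmodule Lookup.Store do
  # Create a named table and fill it from the data files on load.
  def init do
    :ets.new(:lookup, [:named_table, :set, :public, read_concurrency: true])

    for file <- File.ls!("data"), Path.extname(file) == ".txt" do
      monograph = file |> Path.rootname() |> String.replace("_", " ")

      for line <- File.stream!(Path.join("data", file)) do
        term = line |> String.downcase() |> String.trim()
        :ets.insert(:lookup, {term, monograph})
      end
    end

    :ok
  end

  def lookup(term) do
    case :ets.lookup(:lookup, term) do
      [{^term, monograph}] -> monograph
      [] -> nil
    end
  end
end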

4 Likes

Yes, I agree. Why not just put this into a map or ETS?

2 Likes

Hey @sashaafm, thanks for taking the time to reply. I could have been clearer: I have already implemented this via metaprogramming. It works well, it just takes forever to compile, and I have to pass additional compiler options because it ends up hitting the default maximum number of atoms (1,048,576).

Yeah, I’m not sure exactly where the atoms are coming from. I assume that somewhere, when Elixir is being converted into Erlang, it’s creating an atom for each function.

I actually forgot about that! The only problem is that each term has its own file, so I have something like 1,400 files. Is it still feasible to list them all as @external_resource?

And as for increasing the atom limit, I just assumed it had something to do with how Elixir transforms into Erlang.

Yeah, I was just trying to feel out what other kinds of solutions are out there. I might try this as well.

It should be.
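
Since @external_resource accumulates, you can register one entry per file from inside the module body, roughly:

# Inside the module body: @external_resource accumulates,
# so register every data file as a compile-time dependency.
for file <- File.ls!("data") do
  @external_resource Path.join("data", file)
end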

Oh, if everything is in a single module, both the Elixir and Erlang compilers may generate intermediate variables, and since those are atoms, yes, they can end up exceeding the limit.

Yup, this is also a good option. One approach is to have an explicit task that builds the table, so you don’t need to build it every time your app starts. The tzinfo project does something similar.
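
A sketch of such a task (the module name and file paths are made up), persisting the table with :ets.tab2file/2 so the app can later load it with :ets.file2tab/1:

defmodule Mix.Tasks.Lookup.Build do
  use Mix.Task

  @shortdoc "Builds the lookup table and dumps it to disk"

  # Build the ETS table once, then dump it to a file that the
  # application can load at startup via :ets.file2tab/1.
  def run(_args) do
    table = :ets.new(:lookup, [:set, :public])

    for file <- File.ls!("data"), Path.extname(file) == ".txt" do
      monograph = file |> Path.rootname() |> String.replace("_", " ")

      for line <- File.stream!(Path.join("data", file)) do
        term = line |> String.downcase() |> String.trim()
        :ets.insert(table, {term, monograph})
      end
    end

    :ok = :ets.tab2file(table, String.to_charlist("priv/lookup.tab"))
  end
end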

1 Like

Is there a way to debug something like this? Here is the code in question:

defmodule Lookup do
  path = Path.join([".", "data"])
  files = File.ls!(path)

  Enum.each(files, fn filename ->
    # Derive the monograph name from the file name,
    # e.g. "Epsom_Salts.txt" -> "epsom salts".
    monograph =
      filename
      |> String.replace("_", " ")
      |> String.replace(".txt", "")
      |> String.downcase

    filestream = path |> Path.join(filename) |> File.stream!([], :line)

    # Generate one lookup/1 clause per line of the file.
    Enum.each(filestream, fn line ->
      match = line |> String.downcase |> String.trim
      def lookup(unquote(match)), do: unquote(monograph)
    end)
  end)

  # Catch-all clauses for empty and unknown terms.
  def lookup(""), do: ""
  def lookup(nil), do: ""
  def lookup(_term), do: nil
end
1 Like

I’m curious about the relative performance characteristics of a hard-coded lookup vs :ets. When the values start being large, I’m sure there are memory-copying concerns involved. Are there any situations where :ets may be faster? I suppose if the functions don’t return bare literals but have to make some additional function calls?

2 Likes

I’d say look at the generated Erlang AST and see what it does, and maybe make a few benchmarks? It would be fascinating to see. :slight_smile:
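
For example, a rough micro-benchmark with :timer.tc/1, assuming the generated Lookup module and an :lookup ETS table built as sketched earlier (the terms are downcased to match the generated clauses):

terms = ["mg", "epsom salts", "atomic number 12"]

# Time 100k rounds of function-clause lookups.
{func_us, _} =
  :timer.tc(fn ->
    for _ <- 1..100_000, term <- terms, do: Lookup.lookup(term)
  end)

# Time the same workload against ETS.
{ets_us, _} =
  :timer.tc(fn ->
    for _ <- 1..100_000, term <- terms, do: :ets.lookup(:lookup, term)
  end)

IO.puts("function clauses: #{func_us}us, ets: #{ets_us}us")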

There is always copying involved when using ETS. Every time you access an ETS table, you are copying data between the process heap and the ETS table. Processes don’t share data.

I don’t think there is any problem with the generated Erlang AST as such, unless the Elixir compiler does something really, really strange :smile: . There is definitely an issue with how the compiler handles pattern matching of literal binaries/Elixir strings: the binaries are split into separate bytes, and the pattern-matching compiler seems to create a variable per byte. Internally, a variable is just a tuple {:var, :"_ker953"} where the variable name is an atom. Hence the very large number of atoms.

I don’t know if there is any way around this. Keeping the data in ETS tables is an alternative.
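
If you want to watch the atom table fill up, newer OTP releases (20+) expose counters you can poll; a quick check:

# Compare current atom usage against the VM limit (OTP 20+).
count = :erlang.system_info(:atom_count)
limit = :erlang.system_info(:atom_limit)
IO.puts("atoms in use: #{count} of #{limit}")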

2 Likes

Thanks for pointing this out, @rvirding! I switched from using binaries to charlists and I didn’t have to increase the global atom limit.
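
Roughly, the change is converting each term to a charlist before unquoting it, so the generated clauses match on lists rather than binaries (callers then have to pass charlists too); a minimal sketch:

defmodule Lookup do
  for line <- File.stream!("data/magnesium.txt") do
    # Matching on charlists sidesteps the per-byte variables created
    # for literal binary patterns, as described above.
    match = line |> String.downcase() |> String.trim() |> String.to_charlist()
    def lookup(unquote(match)), do: "magnesium"
  end

  def lookup(_term), do: nil
end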

How did it go with the speed? Is it noticeably faster or slower?

I haven’t done a proper benchmark, but using binaries seems to take slightly longer to compile.