Compiler optimization with many macros

Wondering if there are any compiler gurus out there who might be able to help me. I’ve been building a dynamic query system that allows a user to compose a relatively complex Ecto query using a JSON schema with dynamic naming for keys, joins, and such. Since Ecto requires that query aliases in joins be compile-time atoms, we’ve had to pivot to a system that assigns each join a compile-time-generated name and converts the aliases back to the user-defined keys when we select the values out, so the client can identify their values.
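For readers unfamiliar with the constraint: the `as:` option on an Ecto join only accepts a literal atom, not a runtime value. A minimal illustration (table names are placeholders):

```elixir
import Ecto.Query

# Compiles: the alias is a literal atom known at compile time
from p in "posts", join: c in "comments", as: :comments, on: true

# Does not compile: Ecto requires `as` to be a compile-time atom
# name = :comments
# from p in "posts", join: c in "comments", as: ^name, on: true
```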

I’ve created a list of atoms (using a simple naming pattern :column_x, where x is a number) as a config variable to serve as identifiers. Each key on the incoming JSON schema is assigned an identifier when the request comes in, and I later use the following macro to implement the individual joins dynamically:

  defmacro join_column(query, qual, binding, expr, column_id, clause \\ true) do
    # Reconstructed from a garbled snippet: iterate the configured list of
    # :column_x identifier atoms (here assumed to be in @column_identifiers)
    column_branches =
      Enum.map(@column_identifiers, fn c ->
        {:->, [],
         [[Macro.escape(c, unquote: true)],
          Ecto.Query.Builder.Join.build(query, qual, binding, expr, nil, clause,
            Macro.escape(c, unquote: true), nil, nil, __CALLER__) |> elem(0)]}
      end)

    column_fn = {:fn, [], column_branches ++ [{:->, [], [[{:_, [], nil}], query]}]}

    quote do
      # (body truncated in the original post; presumably dispatches on column_id)
      unquote(column_fn).(unquote(column_id))
    end
  end
The goal here is to use a syntax similar to Ecto’s, and because I’m able to Macro.escape the atom, I can use the identifiers to reference the joins and satisfy Ecto’s compile-time atom requirement. This, however, balloons compile time and memory. We’re allowing users to add up to 50 columns per report, so the problem of generating a 51-clause anonymous function per macro instance is further complicated by the fact that this macro is used in conjunction with another parent macro to link the join functions to individual Ecto schemas, of which we have potentially hundreds due to our db being organized as a star schema.

My boundaries seem to be:

  1. The number of columns/schemas is a hard requirement, so I can’t simply reduce the number of columns to reduce overhead.
  2. The compile-time requirement makes it so that I can’t use any kind of lookup to store the identifiers, since escaping the return value would still be a reference to a runtime variable.
  3. We can’t use keys to reference the schemas, since we might want to include the same schema more than once with different parameters (filters) around the join.

Does anyone know if there is any way to make this more performant with respect to compilation time and memory overhead? I’m wondering if there is either a different way to write this, or some other syntax that I’m missing, like case, that might compile faster to the same bytecode. I’m trying not to expand too much and to keep the scope of my problem narrow enough to identify the issue. Hopefully this is clear enough for y’all to get the basic gist, but I appreciate any help!
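For concreteness, the two dispatch shapes in question, an immediately-invoked multi-clause fn versus a `case` over the same literal atoms, look like this (toy sketch, two branches standing in for 51; whether the BEAM emits the same pattern-match for both is exactly the open question):

```elixir
# fn-based dispatch (the shape the macro currently generates):
dispatch = fn
  :column_1 -> {:joined, :column_1}
  :column_2 -> {:joined, :column_2}
  _ -> :passthrough
end
dispatch.(:column_2)   # => {:joined, :column_2}

# case-based dispatch over the same branches:
case :column_2 do
  :column_1 -> {:joined, :column_1}
  :column_2 -> {:joined, :column_2}
  _ -> :passthrough
end
```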


Definitely no idea comes to mind, but it would be helpful if you gave several practical examples of the functionality you’re aiming at and how exactly the compile-time-atom requirement is impeding it (compiler error?).

I did read your post and I might be dumb but I can’t understand what you’re going for and why it has to be done the way you propose.

At this point it might be easier to keep track of the order of joins and use that to identify bindings than to try to name things dynamically. Names are useful to humans, but not really to computers. Just map your incoming dynamic names to positions and use positional bindings.
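A sketch of that idea, assuming placeholder table names and trivial `on:` clauses: the Nth join is simply the Nth binding, so a runtime map from the user’s key to a position replaces the compile-time alias entirely.

```elixir
import Ecto.Query

# Join order is fixed while the request is processed; remember it, e.g.
# %{"user_key_a" => 1, "user_key_b" => 2}
query =
  "facts"
  |> join(:left, [f], d in "dim_one", on: true)     # binding position 1
  |> join(:left, [f, d1], d in "dim_two", on: true) # binding position 2

# Later bindings can be reached positionally; `...` skips to the end:
query = where(query, [f, ..., d2], not is_nil(d2.id))
```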


Hot take: there is a way - use something besides Ecto to build these queries.

I really enjoy Ecto, but it’s ultimately got a region of use-case-space it’s designed to work efficiently for and what you’re describing is basically the opposite of that.

Ecto assumes you have columns grouped into tables. Your problem has a star schema.

Ecto assumes most of your query is known at compile time so it can generate efficient code and convert query bugs into compiler errors. Your problem generates most of the query at runtime.


Hi Matt, former Neo coworker. :slight_smile:

A star schema, a dimensional model, or other data models are still just relational models with tables and columns. Maybe they’re grouped differently than an OLTP relational model, but Ecto handles them perfectly well.

Ecto is also composable in a way that no other query system I’ve seen is. It’s super flexible and fits this use case exceedingly well. What Alex described is one of the few areas in Ecto where a value can’t be interpolated in at runtime; it’s just a limitation of the library. It has a couple of other oddities like that around naming things, which we’ve run into as well, but by and large it handles this exceedingly well.
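To illustrate the contrast: most of Ecto happily takes runtime values via `^` interpolation, including dynamic column names via `field/2`; the join alias is one of the few places that doesn’t. Sketch with placeholder names:

```elixir
import Ecto.Query

column = :inserted_at                 # runtime value
cutoff = ~U[2024-01-01 00:00:00Z]

# Fine: values and even column names interpolate at runtime
from p in "posts", where: field(p, ^column) > ^cutoff

# Not fine: the join alias cannot be interpolated
# a = :my_alias
# from p in "posts", join: c in "comments", as: ^a, on: true
```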

I just don’t want anyone to go away with the impression that Ecto isn’t suitable for reporting engines, denormalized data models, or anything like that. It handles those exceedingly well. Its unique composability also allows you to do things that I don’t know would be possible with any other data-access model.
