Best way to handle ETL tasks in Elixir?

I am looking for suggestions on the best way to handle ETL tasks in Elixir, or whether I should just do these tasks in SQL + dbt + Airflow.

I’m building an application around forecasting the effects of training on athletes, to be used by team athletic staff. Users will upload data from wearable devices; we extract what we need, then run two stages of transformation before writing to the DB, and users can see the results graphed on a nice dashboard.

I currently have these transformations written in SQL and can orchestrate them with dbt, but I would like to simplify this.

Would GenStage be a good fit here, or should I avoid putting this pipeline into the application and keep it separate? Ideally I would like to move away from SQL and do this all with Nx/Explorer, because the cost of running containers is a lot cheaper than managed databases.

If anyone has any experience or advice here, it would be greatly appreciated. Thank you in advance.

Really, this depends on the size of your dataset. Explorer (as I understand it) is “in memory only”. If your dataset fits comfortably in RAM then you should be good to go. How much data are we talking about here, in terms of rows and columns?
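For reference, here’s a minimal sketch of what the in-memory approach looks like with Explorer, assuming the wearable data lands as CSV and fits comfortably in RAM (the file name and column names are hypothetical):

```elixir
# Hypothetical file and column names; Explorer's default Polars backend
# loads the whole frame into memory, so this only works at in-memory scale.
alias Explorer.DataFrame, as: DF
alias Explorer.Series

df = DF.from_csv!("player_123_session.csv")

summary =
  df
  |> DF.filter_with(fn f -> Series.greater(f["heart_rate"], 0) end)
  |> DF.group_by("session_id")
  |> DF.summarise_with(fn f -> [mean_hr: Series.mean(f["heart_rate"])] end)
```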

Thanks for your reply; I had not considered in-memory limitations.

The raw data over a single season adds up to about 10-30 GB per player. Each time the first transformation runs, it needs to pull in all of this, because it updates based on how the player is progressing physically. This produces about 1-2 GB of data for the final transformation to use.

I might be misunderstanding, but given the limitations in your original post, plus the fact that you don’t want to use Elixir for the complete ETL pipeline, it does seem like you’d be better served by a faster, more number-crunching-friendly language like Golang or Rust.

Gotcha. I mean, doing this in memory isn’t impossible by any stretch, but it may be cheaper to use something like ClickHouse, which is optimized for these sorts of bulk queries / transformations. We’ve been testing it lately for internal reporting and IoT data management and it is crazy fast, while still giving you a lot of the flexibility you’d expect from a SQL interface.

EDIT: @dimitarvp I mean Explorer is Rust if you’re using the Polars backend. The issue is the same either way: if in memory is fine, do Rust / Polars / Explorer / whatever, but if not, an OLAP database is probably the next best bet.
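For illustration, here’s a hedged sketch of running one of those bulk aggregations from the Elixir side, assuming the `ch` hex package (Plausible’s ClickHouse client); the table and column names are hypothetical:

```elixir
# Assumes a ClickHouse server on localhost:8123 and the `ch` client library.
# Table and columns are made up for the example.
{:ok, conn} =
  Ch.start_link(scheme: "http", hostname: "localhost", port: 8123, database: "sports")

{:ok, %Ch.Result{rows: rows}} =
  Ch.query(conn, """
  SELECT player_id, avg(heart_rate) AS mean_hr
  FROM raw_readings
  WHERE session_date >= today() - 30
  GROUP BY player_id
  """)
```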


It’s not that I necessarily want to use Elixir for the ETL itself. I want to keep the complexity down by using my Elixir backend to orchestrate the process.

Thanks, I hadn’t heard of ClickHouse. Could I orchestrate the queries from something like a GenStage pipeline kicked off by a user request?

GenStage / Broadway and other libraries in that family are a good fit when you have a continuous stream of data to operate on. If, when you say “users will upload data”, you mean that data is being continuously streamed to you, then GenStage could be a good fit to at least get the data written, and then run queries to do further transformations.
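Roughly this shape, as a minimal sketch with hypothetical module names and a stubbed-out data source:

```elixir
defmodule Ingest.Producer do
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:producer, :no_state}

  # Demand-driven: a real pipeline would pull readings from a queue,
  # socket, or upload endpoint here instead of fabricating them.
  @impl true
  def handle_demand(demand, state) do
    events = Enum.map(1..demand, fn i -> %{heart_rate: 60 + i} end)
    {:noreply, events, state}
  end
end

defmodule Ingest.Consumer do
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: {:consumer, :no_state, subscribe_to: [Ingest.Producer]}

  @impl true
  def handle_events(events, _from, state) do
    # Write the batch of readings, then trigger follow-up transformations.
    Enum.each(events, &IO.inspect/1)
    {:noreply, [], state}
  end
end
```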

However, if by “users will upload data” you mean it’s more of a bulk / batch operation, then I’d consider something like Oban to do this in a “job” structure: a file gets uploaded and you kick off a job to ingest data from the file, then run further queries to transform that data. It depends a bit on how the data comes to you.
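For the job-based version, here’s a minimal sketch of an Oban worker, assuming Oban is already configured in the app; the module, queue, and helper names are hypothetical:

```elixir
defmodule Ingest.FileImportWorker do
  use Oban.Worker, queue: :etl, max_attempts: 3

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"path" => path, "player_id" => player_id}}) do
    path
    |> File.stream!()
    |> parse_readings()
    |> write_raw_rows(player_id)

    # Kick off the downstream transformation queries once ingest succeeds.
    run_transformations(player_id)
    :ok
  end

  # Stubs standing in for the real device-format parser, bulk insert,
  # and transformation queries.
  defp parse_readings(stream), do: stream
  defp write_raw_rows(rows, _player_id), do: rows
  defp run_transformations(_player_id), do: :ok
end
```

The upload (or webhook) handler would then enqueue one job per file with `Ingest.FileImportWorker.new(%{path: path, player_id: id}) |> Oban.insert()`.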


Thanks, this will be more of a batch operation. When players do an activity, the wearable device records it as a file. After completion these get uploaded manually or pushed to a webhook. Thanks for all your help!