Data science and machine learning workflows in Elixir

As a data scientist at an insurtech, I primarily develop in Python, but the rest of our company uses Elixir for everything outside DSML (data science/machine learning), and I’ve been trying to get into it for my own workflows. An engineer suggested I reach out here to see whether @josevalim or someone else familiar with DSML work might be able to help me bridge the gap and get productive in Elixir.

I’ve had some luck using Explorer to look at data and tinkering with Ecto to hit our databases, but not much more. I largely get stuck because the following DSML workflows don’t seem suited to the process/concurrency paradigm underneath Erlang/Elixir:

DS Workflows [data exploration]
When I’m deserializing a dataset, that’s what I need my CPU doing, firing on all 8 cylinders. When I want to aggregate by a column and print some statistics, that’s all I need to be doing. The same goes for every other data exploration and discovery transformation. I haven’t been able to find a way to use Elixir’s process model that improves upon imperative approaches in Python for tasks like A) load dataset, B) transform dataset, C) calculate a number of interest.

ML Workflows [deployment/inference]
Each machine learning model runs inference behind its own endpoint: data in, number out. Parallel execution under heavy load is handled horizontally by spinning up more machines on demand. Is there some obvious benefit to doing this in Elixir, or do we need to go back to the drawing board and rethink everything about how we deploy models within the Elixir VM?

My hope is that I’m just overlooking or misunderstanding something about the benefits of working with Elixir. I want to get into it for the native functional approach, immutable data, and the ability to debug processes while they’re running. I appreciate anything people here can share to help me get off the ground.

5 Likes

Hi @ryan-duve, welcome! :wave:

You are right that the concurrency aspect of Elixir won’t help much with data exploration. In this area, we are pretty much on the same baseline as Python. The good news is that we can also delegate to C/Rust, just as Python does, and that’s the goal with Explorer. It is based on the amazing Polars (which also has Python bindings), which folks report to be consistently faster than Pandas. So my suggestion is to give Explorer a try and let us know what you are missing in the issue tracker. It is very much a work in progress and you will hit some walls, but this feedback will help us know what to focus on next. Also, the next release of Livebook will start bringing more features geared towards data exploration.
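For example, the load/transform/aggregate loop you describe maps directly onto Explorer. A minimal sketch, assuming a hypothetical CSV and made-up column names:

```elixir
# Load -> transform -> aggregate with Explorer.
# "policies.csv" and its columns are hypothetical.
require Explorer.DataFrame, as: DF

df = DF.from_csv!("policies.csv")

df
|> DF.filter(premium > 0)
|> DF.group_by("state")
|> DF.summarise(mean_premium: mean(premium), policies: count(premium))
```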

For machine learning inference, one of the benefits of Elixir is that you can run it within your application. I cover this in the Bumblebee announcement. Whether this is useful will depend on your application. For example, for data processing with Broadway or embedded devices with Nerves, it can be really useful. But for a web application, you may resort to separate services anyway, especially for larger models, just as you would in Python.
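As a rough sketch of what in-app inference looks like (the Hugging Face model below is only an example, not a recommendation):

```elixir
# In-app inference with Bumblebee + Nx.Serving.
{:ok, model_info} =
  Bumblebee.load_model({:hf, "finiteautomata/bertweet-base-sentiment-analysis"})

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "vinai/bertweet-base"})

# Build a serving: tokenization plus the model forward pass, batchable.
serving = Bumblebee.Text.text_classification(model_info, tokenizer)

# One-off call; in an application you would put the serving under a
# supervisor and use Nx.Serving.batched_run/2 instead.
Nx.Serving.run(serving, "The claim was resolved quickly!")
```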

In other words, for your questions the answers are mostly “just do it as you have been doing in Python” - unless you see a benefit in deploying the model with your app. So you may be wondering where Elixir can be beneficial:

  • Using a functional programming language for writing computational models can have its benefits (see the sketch after this list): Nx (Numerical Elixir) is now publicly available - Dashbit Blog

  • Nx was designed from scratch to support multiple compilers. This means that you can easily choose between Torch and XLA. Most of our focus is on XLA at the moment, but this vision will become more concrete as we pour more work into it.

  • By using Nx as a foundation, everything we build can be compiled to the GPU. For example, we are working on Scholar, an equivalent of scikit-learn, where everything can be tensor-compiled to both CPU and GPU, and we get consistently faster results thanks to it.

  • Finally, when it comes to distributed and federated work, Elixir has the potential to stand out. We want to explore how to coordinate work across multiple machines for both data processing and machine learning.
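As promised above, here is a minimal Nx.Defn sketch; the Stats module and softmax/1 are illustrative names, not part of any library:

```elixir
defmodule Stats do
  import Nx.Defn

  # A numerically stable softmax, written as a pure tensor expression.
  defn softmax(t) do
    exp = Nx.exp(t - Nx.reduce_max(t))
    exp / Nx.sum(exp)
  end
end

Stats.softmax(Nx.tensor([1.0, 2.0, 3.0]))
```

Because defn functions are pure tensor expressions, the same definition can be handed to different compilers (such as EXLA) and run on either CPU or GPU without changes.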

I hope this gives you an idea of where we are and what to expect. There is a lot of work ahead, but there is a group of people working hard and excited about making this vision come true. Questions, feedback, and help are all welcome!

8 Likes

I could imagine that a lot of the data exploration is ad-hoc, and that’s where Livebook really helps. It gives you a nice playground with the added bonus of documenting your process and sharing back and forth with your developer colleagues.

I’ve also had nice results with Flow, which lets you define a concurrent data-crunching pipeline within Livebook (a small sketch follows below). There’s no better feeling than setting up a pipeline that utilises all your resources just right and being able to tinker with it in a short feedback loop.
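Something along these lines, adapted from the classic Flow word-count example (the file path is a placeholder):

```elixir
# Count word occurrences across all cores with Flow.
File.stream!("path/to/data.txt")
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split(&1, " "))
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, acc ->
  Map.update(acc, word, 1, &(&1 + 1))
end)
|> Enum.to_list()
```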

Add to this the ability to use Ecto within Livebook, and you’re all set up to load and dump to the database at full speed.
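The wiring is just a few lines in a setup cell. A hedged sketch, where the versions, credentials, and table name are all placeholders:

```elixir
# Using Ecto inside a Livebook session.
Mix.install([{:ecto_sql, "~> 3.10"}, {:postgrex, "~> 0.17"}])

defmodule MyApp.Repo do
  use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.Postgres
end

# Options passed to start_link are merged into the repo configuration.
{:ok, _pid} =
  MyApp.Repo.start_link(
    hostname: "localhost",
    database: "my_database",
    username: "postgres",
    password: "postgres"
  )

MyApp.Repo.query!("SELECT count(*) FROM policies")
```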

And then, at the end of the ride, you can graph everything nicely with the kino_vegalite package to impress everyone with your colourful results.
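For instance (the data shape here is made up; in Livebook the final VegaLite struct renders as a chart automatically):

```elixir
alias VegaLite, as: Vl

data = [
  %{"state" => "GA", "mean_premium" => 812},
  %{"state" => "NY", "mean_premium" => 1045}
]

Vl.new(width: 400, height: 300)
|> Vl.data_from_values(data)
|> Vl.mark(:bar)
|> Vl.encode_field(:x, "state", type: :nominal)
|> Vl.encode_field(:y, "mean_premium", type: :quantitative)
```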

I can’t add anything on the ML side of things, except that the interop between Livebook and all the ML libraries mentioned is already great, and getting better. Welcome to the community!

4 Likes

Hi,
Possibly late, and sorry to dig up this post, but in case you haven’t come across this book yet:
https://pragprog.com/titles/smelixir/machine-learning-in-elixir/
Best regards