Links:
- GitHub: elixir-explorer/explorer (Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir)
- HexDocs: Explorer v0.8.0
The Explorer library has been out for some time. But we just released the latest version (see below), and we thought we’d start posting updates to the forum.
Wait, Explorer is new to me. What are Series and DataFrames?
Explorer is a DataFrame library for Elixir.
DataFrame libraries are common in languages that focus on data manipulation.
If you’d like a more in-depth tutorial, there’s an excellent Livebook called Ten Minutes to Explorer that you can play with.
But we’ll provide a quick overview here.
Briefly, you can think of a DataFrame like an in-memory table. Its purpose is to facilitate common data exploration and analysis tasks. As such, it’s a column-oriented table.
Column-oriented tables
If you’re unfamiliar with column-oriented tables, suppose you have a table of pet data like this:
| type | age | color |
|---|---|---|
| cat | 5 | black |
| dog | 2 | brown |
| dog | 3 | brindle |
A row-oriented organization of that data might look like this in Elixir:
rows = [
  [type: "cat", age: 5, color: "black"],
  [type: "dog", age: 2, color: "brown"],
  [type: "dog", age: 3, color: "brindle"]
]
It matches the original table fairly one-to-one. But the column-oriented version might instead look like:
columns = [
  type: ["cat", "dog", "dog"],
  age: [5, 2, 3],
  color: ["black", "brown", "brindle"]
]
It has the same information, but “transposed”.
Column-orientation is beneficial if you’re asking questions that require a lot of number-crunching like “What’s the average age of all pets?”. In the row-oriented version, finding the average age would require first looking through the entire contents of the table to collect the relevant data. But in the column-oriented version, those values have already been co-located in memory.
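To make the difference in access pattern concrete, here is a minimal sketch using only plain Elixir lists and the `rows` and `columns` values from above. It only illustrates how much data each layout has to touch; ordinary Elixir lists are linked lists, so this sketch does not reproduce the memory-locality benefit that a real columnar backend provides.

```elixir
# Same pet data as above, in both layouts.
rows = [
  [type: "cat", age: 5, color: "black"],
  [type: "dog", age: 2, color: "brown"],
  [type: "dog", age: 3, color: "brindle"]
]

columns = [
  type: ["cat", "dog", "dog"],
  age: [5, 2, 3],
  color: ["black", "brown", "brindle"]
]

# Row-oriented: visit every row and pull out the one field we care about.
ages_from_rows = Enum.map(rows, fn row -> Keyword.fetch!(row, :age) end)
row_mean = Enum.sum(ages_from_rows) / length(ages_from_rows)

# Column-oriented: the ages are already stored together under one key.
ages = Keyword.fetch!(columns, :age)
column_mean = Enum.sum(ages) / length(ages)

# Both layouts give the same answer; only the amount of traversal differs.
IO.inspect({row_mean, column_mean})
```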
Series and DataFrames: columns and tables
In dataframe parlance, a “series” is a single column and a “dataframe” is a collection of named series, aka a table.
Our example above would look like this:
type = Explorer.Series.from_list(["cat", "dog", "dog"])
age = Explorer.Series.from_list([5, 2, 3])
color = Explorer.Series.from_list(["black", "brown", "brindle"])
df = Explorer.DataFrame.new(type: type, age: age, color: color)
# #Explorer.DataFrame<
#   Polars[3 x 3]
#   type string ["cat", "dog", "dog"]
#   age s64 [5, 2, 3]
#   color string ["black", "brown", "brindle"]
# >
Some things to note:
- Each series has a corresponding data type or “dtype”, e.g. `type` has the dtype `string`.
- The word “Polars” appears. That indicates that this dataframe is using the backend powered by the fantastic Polars library (the default backend).
And if we really did want to know the average age of the pets, that would look like this:
Explorer.Series.mean(df["age"])
# 3.3333333333333335
Features and design
Preliminaries out of the way, here are Explorer’s high-level features:
- Simply typed series: :binary, :boolean, :category, :date, :datetime, :duration, floats of 32 and 64 bits ({:f, size}), integers of 8, 16, 32 and 64 bits ({:s, size}, {:u, size}), :null, :string, :time, :list, and :struct.
- A powerful but constrained and opinionated API, so you spend less time looking for the right function and more time doing data manipulation.
- Support for CSV, Parquet, NDJSON, and Arrow IPC formats.
- Integration with external databases via ADBC and direct connection to file storages such as S3.
- Pluggable backends, providing a uniform API whether you’re working in-memory or (forthcoming) on remote databases or even Spark dataframes.
- The first (and default) backend is based on NIF bindings to the blazing-fast Polars library.
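As a small illustration of the format support, the sketch below round-trips a dataframe through an in-memory CSV string. It assumes a standalone script or Livebook cell where `Mix.install/1` can fetch Explorer; the version requirement and the sample data are illustrative choices, not part of the release notes.

```elixir
# Assumption: running as a standalone script, so we install Explorer here.
Mix.install([{:explorer, "~> 0.8"}])

alias Explorer.DataFrame, as: DF

df = DF.new(type: ["cat", "dog", "dog"], age: [5, 2, 3])

# Serialise the dataframe to a CSV binary, then parse it back.
csv = DF.dump_csv!(df)
restored = DF.load_csv!(csv)

# The column names and row count survive the round trip.
IO.inspect(DF.names(restored))
IO.inspect(DF.n_rows(restored))
```

The same dataframe could be written to a file instead with the `to_*` family of functions (for example `DF.to_parquet/2`); the in-memory variant is used here only to keep the example free of file-system side effects.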
The API is heavily influenced by Tidy Data and borrows much of its design from dplyr.
The philosophy is heavily influenced by this passage from dplyr’s documentation:
> By constraining your options, it helps you think about your data manipulation challenges.
> It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
> It uses efficient backends, so you spend less time waiting for the computer.
The aim here isn’t to have the fastest dataframe library around (though it certainly helps that we’re building on Polars, one of the fastest).
Instead, we’re aiming to bridge the best of many worlds:
- the elegance of dplyr
- the speed of Polars
- the joy of Elixir
That means you can expect the guiding principles to be ‘Elixir-ish’. For example, you won’t see the underlying data mutated, even if that’s the most efficient implementation. Explorer functions will always return a new dataframe or series.
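Here is a short sketch of what the dplyr-style verbs and the no-mutation principle look like in practice, reusing the pet data from earlier. It assumes a standalone script or Livebook cell where `Mix.install/1` can fetch Explorer; the variable names and the `age_months` and `mean_age` columns are made up for illustration.

```elixir
# Assumption: running as a standalone script, so we install Explorer here.
Mix.install([{:explorer, "~> 0.8"}])

# The query macros (filter, mutate, summarise) require the module.
require Explorer.DataFrame, as: DF

df =
  DF.new(
    type: ["cat", "dog", "dog"],
    age: [5, 2, 3],
    color: ["black", "brown", "brindle"]
  )

# filter: keep only the pets older than two years.
older = DF.filter(df, age > 2)

# mutate: derive a new column from an existing one.
with_months = DF.mutate(df, age_months: age * 12)

# group_by + summarise: mean age per pet type.
mean_by_type =
  df
  |> DF.group_by("type")
  |> DF.summarise(mean_age: mean(age))

IO.inspect(DF.n_rows(older))
IO.inspect(DF.names(with_months))
IO.inspect(mean_by_type)

# The original dataframe is untouched: it still has its three columns.
IO.inspect(DF.names(df))
```

Each verb returns a new dataframe, so intermediate results such as `older` and `with_months` can be kept, compared, or piped onward without worrying about one step changing another.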
Acknowledgements
Explorer is an extensive library and there’s much more we could say. But for now, we’d just like to thank the dozens of contributors who’ve added wonderful improvements over the years.