Explorer - Series (1D) and dataframes (2D) for fast and elegant data exploration in Elixir

billylanchantin · January 21, 2024, 3:26pm

Links:

GitHub: elixir-explorer/explorer
HexDocs: Explorer — Explorer v0.8.0

The Explorer library has been out for some time, but we just released the latest version (see below) and thought we’d start posting updates to the forum.

Wait, Explorer is new to me. What are Series and DataFrames?

Explorer is a DataFrame library for Elixir.

DataFrame libraries are common in language ecosystems that focus on data manipulation. Examples include:

dplyr (R)
Polars (Rust/Python)

If you’d like a more in-depth tutorial, there’s an excellent Livebook called Ten Minutes to Explorer that you can play with:

Ten Minutes to Explorer — Explorer v0.8.0

But we’ll provide a quick overview here.

Briefly, you can think of a DataFrame like an in-memory table. Its purpose is to facilitate common data exploration and analysis tasks. As such, it’s a column-oriented table.

Column-oriented tables

If you’re unfamiliar with column-oriented tables, suppose you have a table of pet data like this:

type	age	color
cat	5	black
dog	2	brown
dog	3	brindle

A row-oriented organization of that data might look like this in Elixir:

rows = [
  [type: "cat", age: 5, color: "black"],
  [type: "dog", age: 2, color: "brown"],
  [type: "dog", age: 3, color: "brindle"]
]

It matches the original table fairly one-to-one. But the column-oriented version might instead look like:

columns = [
  type: ["cat", "dog", "dog"],
  age: [5, 2, 3],
  color: ["black", "brown", "brindle"]
]

It has the same information, but “transposed”.

Column-orientation is beneficial if you’re asking questions that require a lot of number-crunching like “What’s the average age of all pets?”. In the row-oriented version, finding the average age would require first looking through the entire contents of the table to collect the relevant data. But in the column-oriented version, those values have already been co-located in memory.
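
To make the difference concrete, here is a minimal sketch in plain Elixir (no Explorer needed) that computes the average age from both layouts shown above. The variable names are only illustrative. Note that ordinary Elixir lists are linked lists, so this shows the difference in access pattern rather than the memory layout a real dataframe backend uses.

```elixir
rows = [
  [type: "cat", age: 5, color: "black"],
  [type: "dog", age: 2, color: "brown"],
  [type: "dog", age: 3, color: "brindle"]
]

columns = [
  type: ["cat", "dog", "dog"],
  age: [5, 2, 3],
  color: ["black", "brown", "brindle"]
]

# Row-oriented: visit every row and pull out the one field we care about.
ages_from_rows = Enum.map(rows, fn row -> Keyword.fetch!(row, :age) end)
mean_from_rows = Enum.sum(ages_from_rows) / length(ages_from_rows)

# Column-oriented: the ages are already grouped together under one key.
ages = Keyword.fetch!(columns, :age)
mean_from_columns = Enum.sum(ages) / length(ages)

IO.inspect({mean_from_rows, mean_from_columns})
```

Both approaches give the same answer; the column-oriented one simply skips the step of collecting the values first.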

Series and DataFrames: columns and tables

In dataframe parlance, a “series” is a single column and a “dataframe” is a collection of named series, aka a table.

Our example above would look like this:

type = Explorer.Series.from_list(["cat", "dog", "dog"])
age = Explorer.Series.from_list([5, 2, 3])
color = Explorer.Series.from_list(["black", "brown", "brindle"])

df = Explorer.DataFrame.new(type: type, age: age, color: color)
# #Explorer.DataFrame<
#   Polars[3 x 3]
#   type string ["cat", "dog", "dog"]
#   age s64 [5, 2, 3]
#   color string ["black", "brown", "brindle"]
# >

Some things to note:

Each series has a corresponding data type or “dtype”, e.g. type has the dtype string.
The word “Polars” appears. That indicates that this dataframe is using the backend powered by the fantastic Polars library (the default backend).

And if we really did want to know the average age of the pets, that would look like this:

Explorer.Series.mean(df["age"])
# 3.3333333333333335

Features and design

Preliminaries out of the way, here are Explorer’s high-level features:

Simply typed series: :binary, :boolean, :category, :date, :datetime, :duration, floats of 32 and 64 bits ({:f, size}), integers of 8, 16, 32 and 64 bits ({:s, size}, {:u, size}), :null, :string, :time, :list, and :struct.
A powerful but constrained and opinionated API, so you spend less time looking for the right function and more time doing data manipulation.
Support for CSV, Parquet, NDJSON, and Arrow IPC formats
Integration with external databases via ADBC and direct connection to file storages such as S3
Pluggable backends, providing a uniform API whether you’re working in-memory or (forthcoming) on remote databases or even Spark dataframes.
The first (and default) backend is based on NIF bindings to the blazing-fast Polars library.
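
As a small illustration of the file-format support, the sketch below round-trips a dataframe through an in-memory CSV binary using Explorer.DataFrame.dump_csv!/1 and load_csv!/1. The Mix.install version constraint and the sample data are assumptions made for the example.

```elixir
# Assumption: a script or IEx session with network access to fetch Explorer.
Mix.install([{:explorer, "~> 0.8.0"}])

alias Explorer.DataFrame, as: DF

df = DF.new(type: ["cat", "dog", "dog"], age: [5, 2, 3])

# Serialize to a CSV binary in memory, then parse it back into a dataframe.
csv = DF.dump_csv!(df)
round_tripped = DF.load_csv!(csv)

IO.puts(csv)
IO.inspect(DF.to_columns(round_tripped, atom_keys: true))
```

The same pattern should apply to files on disk via the corresponding from_csv and to_csv functions.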

The API is heavily influenced by Tidy Data and borrows much of its design from dplyr.
The philosophy is heavily influenced by this passage from dplyr’s documentation:

By constraining your options, it helps you think about your data manipulation challenges.

It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.

It uses efficient backends, so you spend less time waiting for the computer.

The aim here isn’t to have the fastest dataframe library around (though it certainly helps that we’re building on Polars, one of the fastest).
Instead, we’re aiming to bridge the best of many worlds:

the elegance of dplyr
the speed of Polars
the joy of Elixir

That means you can expect the guiding principles to be ‘Elixir-ish’. For example, you won’t see the underlying data mutated, even if that’s the most efficient implementation. Explorer functions will always return a new dataframe or series.
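
A minimal sketch of what that immutability looks like in practice, assuming the pet data from earlier and an assumed Mix.install version constraint: filtering returns a new dataframe and leaves the original untouched.

```elixir
# Assumption: a script or IEx session with network access to fetch Explorer.
Mix.install([{:explorer, "~> 0.8.0"}])

# `require` is needed because filter/2 is a macro built on Explorer.Query.
require Explorer.DataFrame, as: DF

df = DF.new(type: ["cat", "dog", "dog"], age: [5, 2, 3])

# filter/2 is one of the dplyr-style "verbs"; it returns a new dataframe.
dogs = DF.filter(df, type == "dog")

IO.inspect(DF.n_rows(df))   # the original dataframe is unchanged
IO.inspect(DF.n_rows(dogs)) # only the rows that matched the filter
```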

Acknowledgements

Explorer is an extensive library and there’s much more we could say. But for now, we’d just like to thank the dozens of contributors who’ve added wonderful improvements over the years.

billylanchantin · January 21, 2024, 3:36pm

Explorer - version 0.8

Explorer has released version 0.8!

Added

Add explode/2 to Explorer.DataFrame. This function is useful to expand the contents of a {:list, inner_dtype} series into an “inner_dtype” series.
Add the new series functions all?/1 and any?/1, to work with boolean series.
Add support for the “struct” dtype. This new dtype represents the struct dtype from Polars/Arrow.
Add map/2 and map_with/2 to the Explorer.Series module.
This change enables the use of the Explorer.Query features in a series.
Add sort_by/2 and sort_with/2 to the Explorer.Series module.
This change enables the use of lazy computations and the Explorer.Query module.
Add unnest/2 to Explorer.DataFrame. It works by taking the fields of a “struct” (the new dtype) and transforming them into columns.
Add pairwise correlation - Explorer.DataFrame.correlation/2 - to calculate the correlation between numeric columns inside a data frame.
Add pairwise covariance - Explorer.DataFrame.covariance/2 - to calculate the covariance between numeric columns inside a data frame.
Add support for more integer dtypes. This change introduces new signed and unsigned integer dtypes:
- {:s, 8}, {:s, 16}, {:s, 32}
- {:u, 8}, {:u, 16}, {:u, 32}, {:u, 64}.
The existing :integer dtype is now represented as {:s, 64}, and it’s still the default dtype for integers. But series and data frames can now work with the new dtypes. Short names for these new dtypes can be used in functions like Explorer.Series.from_list/2. For example, {:u, 32} can be represented with the atom :u32.

This may bring more interoperability with Nx, and with Arrow related things, like ADBC and Parquet.
Add ewm_standard_deviation/2 and ewm_variance/2 to Explorer.Series.
They calculate the “exponentially weighted moving” variance and standard deviation.
Add support for :skip_rows_after_header option for the CSV reader functions.
Support {:list, numeric_dtype} for Explorer.Series.frequencies/1.
Support pins in cond, inside the context of Explorer.Query.
Introduce the :null dtype. This is a special dtype from Polars and Apache Arrow to represent “all null” series.
Add Explorer.DataFrame.transpose/2 to transpose a data frame.
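
As a rough illustration of the new integer dtypes described above, the sketch below builds the same series with the default dtype, with a short atom name, and with the full tuple form. The sample values and the Mix.install version constraint are assumptions.

```elixir
# Assumption: a script or IEx session with network access to fetch Explorer.
Mix.install([{:explorer, "~> 0.8.0"}])

alias Explorer.Series

# Without a :dtype option, integers default to signed 64-bit.
default = Series.from_list([1, 2, 3])

# The short atom :u32 and the tuple {:u, 32} name the same dtype.
short_name = Series.from_list([1, 2, 3], dtype: :u32)
tuple_name = Series.from_list([1, 2, 3], dtype: {:u, 32})

IO.inspect(Series.dtype(default))
IO.inspect(Series.dtype(short_name))
IO.inspect(Series.dtype(tuple_name))
```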

Changed

Rename the functions related to sorting/arranging of the Explorer.DataFrame.
arrange_with is now named sort_with, and arrange is now named sort_by.

sort_by/3 is a macro that works through the Explorer.Query module, whereas sort_with/2 takes a callback function.
Remove unnecessary casts to {:s, 64} now that we support more integer dtypes.
It affects some functions, like the following in the Explorer.Series module:
- argsort
- count
- rank
- day_of_week, day_of_year, week_of_year, month, year, hour, minute, second
- abs
- clip
- lengths
- slice
- n_distinct
- frequencies
And also some functions from the Explorer.DataFrame module:
- mutate - mostly because of series changes
- summarise - mostly because of series changes
- slice
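
To show the renamed sorting functions side by side, here is a short sketch that sorts the same dataframe by descending age, once with the sort_by macro and once with the sort_with callback. The sample data and the Mix.install version constraint are assumptions.

```elixir
# Assumption: a script or IEx session with network access to fetch Explorer.
Mix.install([{:explorer, "~> 0.8.0"}])

# `require` is needed because sort_by is a macro built on Explorer.Query.
require Explorer.DataFrame, as: DF
alias Explorer.Series

df = DF.new(type: ["cat", "dog", "dog"], age: [5, 2, 3])

# Macro form: column names are written as bare variables.
by_macro = DF.sort_by(df, desc: age)

# Callback form: the function receives the dataframe and returns the ordering.
by_callback = DF.sort_with(df, fn d -> [desc: d["age"]] end)

IO.inspect(Series.to_list(by_macro["age"]))
IO.inspect(Series.to_list(by_callback["age"]))
```

The callback form is handy when the column names are only known at runtime, since they can be passed in as ordinary strings.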

Fixed

Fix inspection of series and data frames between nodes.
Fix cast of :string series to {:datetime, any()}.
Fix mismatched types in Explorer.Series.pow/2, making it more consistent.
Normalize sorting options.
Fix functions with dtype mismatching the result from Polars.
This fix affects the following functions:
- quantile/2 in the context of a lazy series
- mode/1 inside a summarisation
- strftime/2 in the context of a lazy series
- mutate_with/2 when creating a column from a NaiveDateTime or Explorer.Duration.

Contributors

Thank you to everyone who opened a PR:

@billylanchantin
@cigrainger
@costaraphael
@cristineguadelupe
@Jhonatannunessilva
@kellyfelkins
@lkarthee
@philss

And thank you to the first-time contributors:

@JonGretar made their first contribution in Support for skip_rows_after_header option in reading csv files by JonGretar · Pull Request #782 · elixir-explorer/explorer · GitHub
@rtvu made their first contribution in Added Series.ewm_std/2 and Series.ewm_var/2 by rtvu · Pull Request #778 · elixir-explorer/explorer · GitHub

Changelogs

Full Changelog: Comparing v0.7.2...v0.8.0 · elixir-explorer/explorer · GitHub
Official Changelog: Changelog — Explorer v0.8.0

billylanchantin · January 22, 2024, 4:02pm

[Blog] Explorer 0.8: The `dtype` release