Simple count by group in Explorer

This feels like a dumb question but I am stuck. Using Explorer I want to count the number of rows for each group (groups may be defined by one column, or multiple columns).

I would think this would be:

alias Explorer.{Datasets, DataFrame}

Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise(country: [:count])

And the result would be a data frame like:

#Explorer.DataFrame<
  Polars[222 x 2]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  country_count integer [5, 5, 5, 5, 5, ...]
>

Instead I get this error:

%Inspect.Error{
  message: "got RuntimeError with message \"Polars(NotFound(\\\"country_count\\\"))\" 
  while inspecting %{__struct__: Explorer.DataFrame, data: %{__struct__:  
  Explorer.PolarsBackend.DataFrame, resource: #Reference<0.3895999949.566099982.184593>},
  dtypes: %{\"country\" => :string, \"country_count\" => :string}, groups: [], names: 
  [\"country\", \"country_count\"]}"
}

It works fine if I group and count by different variables, e.g.,

Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise(total: [:count])

returns:

#Explorer.DataFrame<
  Polars[222 x 2]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total_count integer [5, 5, 5, 5, 5, ...]
>

but it seems odd that I have to summarise an arbitrary other variable. What am I missing?

1 Like

From the docs:

When the dataframe has grouping variables, operations are performed per group

You can think of this as Explorer creating sub-dataframes using the grouping variable and applying operations on each group. But importantly, the sub-dataframe does not contain the variable any more (since it does not vary) and the variable cannot be used any longer.


                                 ┌───────────────────────────────────────────┐
  ┌──────────────────┐           │  Country 1    Country 2      Country N    │
  │                  │  group_by │ ┌─────────┐   ┌───────────┐  ┌──────────┐ │
  │   Fossil Fuels   ├───────────► │ Sub-DF  │   │ Sub-DF    │  │  Sub-DF  │ │
  │                  │           │ │         │   │           │  │          │ │
  └──────────────────┘           │ └─────────┘   └───────────┘  └──────────┘ │
                                 └───────────────────────────────────────────┘

1 Like

As of version 0.3.0, Explorer.Series.summarise_with/2 is a bit more flexible:

alias Explorer.{Datasets, DataFrame, Series}

Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise_with(&[country_count: Series.count(&1["country"])])

returns

#Explorer.DataFrame<
  Polars[222 x 2]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  country_count integer [5, 5, 5, 5, 5, ...]
>