Timeseries upsampling with linear interpolation in explorer

suchasurge · August 22, 2023, 12:17pm

Hi there.

The last week I had to work on some timeseries data. Specifically upsampling with linear interpolation of timeseries data.

This is quite easy in python with pandas:

import pandas as pd

# Your data
data = [
    ["2023-02-13", 100],
    ["2023-02-15", 100.01],
    ["2023-02-16", 100.09],
    ["2023-02-17", 101.02],
    ["2023-02-20", 105.00],
    ["2023-02-22", 103.06]
]

# Convert to DataFrame
df = pd.DataFrame(data, columns=['Date', 'Value'])

# Convert 'Date' column to datetime type and set as index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Resample to daily frequency and interpolate missing values
upsampled = df.resample('D').interpolate(method='linear')

print(upsampled)

But the project where I need to do it is, obviously , written in elixir.
So the explorer library came into my mind.

I haven’t used explorer before and had a hard time playing with the timeseries data in a livebook.

In the end I build the upsampling with linear interpolation in plain elixir.

Now I’m wondering if any of you think that it’s currently possible to translate the above python/pandas code to elixir/explorer?

With the usage of explorer I hope to get some performance benefits because as of now it looks like that my plain elixir solution is slow and consumes more CPU than we saw for the app before.
This is especially true for quite large timeseries (about 30 years) with long gaps (weeks to month for the large timeseries).

Can I even get performance improvements writing this in explorer? Does anyone of you have any experience?

When trying to translate to explorer I’m stuck at the upsampling part:

alias Explorer.DataFrame, as: DF
require Explorer.DataFrame

data = [
    ["2023-02-13", 100],
    ["2023-02-15", 100.01],
    ["2023-02-16", 100.09],
    ["2023-02-17", 101.02],
    ["2023-02-20", 105.00],
    ["2023-02-22", 103.06]
]

df = data
|> Enum.map(fn [date, value] ->
  %{
    date: date, 
    value: value
  }
end)
|> DF.new()
|> DF.put("date", Explorer.Series.strptime(df["date"], "%Y-%m-%d"))

So proper dates are there but how do I loop through the data now to be able to fill the missing rows (upsampling)?

Cheers
Frank

suchasurge · September 27, 2023, 12:30pm

Hi there, it’s me again.

I still cannot provide a solution but I now have a sample in rust and polars.
Maybe not the best and idiomatic way of doing it but I just wanted to have something for comparison and to may find my way towards a solution with explorer.

cargo.toml:

[dependencies]
polars = { version = "0.32.1", features = ["lazy", "interpolate"]}

main.rs:

use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    let data = vec![
        vec!["2023-02-13".to_string(), "100".to_string()],
        vec!["2023-02-15".to_string(), "100.01".to_string()],
        vec!["2023-02-16".to_string(), "100.09".to_string()],
        vec!["2023-02-17".to_string(), "101.02".to_string()],
        vec!["2023-02-20".to_string(), "105.00".to_string()],
        vec!["2023-02-22".to_string(), "103.06".to_string()],
    ];

    let len = data.len();
    let mut dates: Vec<&str> = Vec::with_capacity(len);
    let mut values: Vec<f64> = Vec::with_capacity(len);

    for row in &data {
        dates.push(&row[0]);
        values.push(row[1].parse::<f64>().unwrap());
    }

    let date_series = Series::new("date", dates);
    let value_series = Series::new("value", values);

    let df = DataFrame::new(vec![date_series, value_series])?;
    let df = df
        .clone()
        .lazy()
        .with_columns(vec![col("date")
            .str()
            .strptime(DataType::Date, StrptimeOptions::default())])
        .sort(
            "date",
            SortOptions {
                descending: false,
                nulls_last: true,
                maintain_order: true,
                multithreaded: true,
            },
        )
        .collect()?;

    let upsampled_df = df
        .clone()
        .upsample::<[String; 0]>([], "date", Duration::parse("1d"), Duration::parse("0"))?
        .lazy()
        .with_columns(vec![col("value").interpolate(InterpolationMethod::Linear)])
        .collect()?
        .fill_null(FillNullStrategy::Forward(None))?;

    println!("{}", &upsampled_df);

    Ok(())
}

When running this returns:

shape: (10, 2)
┌────────────┬────────────┐
│ date       ┆ value      │
│ ---        ┆ ---        │
│ date       ┆ f64        │
╞════════════╪════════════╡
│ 2023-02-13 ┆ 100.0      │
│ 2023-02-14 ┆ 100.005    │
│ 2023-02-15 ┆ 100.01     │
│ 2023-02-16 ┆ 100.09     │
│ …          ┆ …          │
│ 2023-02-19 ┆ 103.673333 │
│ 2023-02-20 ┆ 105.0      │
│ 2023-02-21 ┆ 104.03     │
│ 2023-02-22 ┆ 103.06     │
└────────────┴────────────┘

Much appreciated if anyone of you have an idea how to proceed.

Cheers
Frank

billylanchantin · September 27, 2023, 8:37pm

I don’t believe Explorer supports interpolation at this time. Until it does, I think you’ll need to use a workaround.

Here is an example workaround:

alias Explorer.DataFrame, as: DF
alias Explorer.Series, as: S
require DF

defmodule Interpolate do
  def evenly_spaced(x_step, df, x_col, y_col) do
    n_rows = DF.n_rows(df)
    last = DF.slice(df, n_rows - 1, 1)[[x_col, y_col]]

    DF.new(
      x1: S.head(df[x_col], n_rows - 1),
      x2: S.tail(df[x_col], n_rows - 1),
      y1: S.head(df[y_col], n_rows - 1),
      y2: S.tail(df[y_col], n_rows - 1),
    )
    |> DF.to_rows_stream()
    |> Stream.map(fn %{"x1" => x1, "y1" => y1, "x2" => x2, "y2" => y2} ->
      m = (y2 - y1) / (x2 - x1)
      x = x1..x2//x_step |> Enum.to_list() |> S.from_list()
      n = S.size(x)
      x = S.head(x, n - 1)
      y = x |> S.subtract(x1) |> S.multiply(m) |> S.add(y1)

      DF.new(%{x_col => x, y_col => y})
    end)
    |> Enum.reduce(&DF.concat_rows(&2, &1))
    |> DF.concat_rows(last)
  end
end

# Build data
data = [
  ["2023-02-13", 100],
  ["2023-02-15", 100.01],
  ["2023-02-16", 100.09],
  ["2023-02-17", 101.02],
  ["2023-02-20", 105.00],
  ["2023-02-22", 103.06]
]

df = (
  data
  |> Enum.map(fn [date, value] ->
    %{date: Date.from_iso8601!(date), value: value}
  end)
  |> DF.new()
)

# Cast date column to integer.
# (Units make working with Date/Durations a bit awkward.)
df = DF.mutate(df, date_int: cast(date, :integer))

# Run interpolation.
interp = Interpolate.evenly_spaced(1, df, "date_int", "value")

# Cast back to date.
interp = DF.mutate(interp, date: cast(date_int, :date))

# Print result.
interp[["date", "value"]] |> DF.print(limit: :infinity)

# +---------------------------------------------+
# | Explorer DataFrame: [rows: 10, columns: 2]  |
# +------------------+--------------------------+
# |       date       |          value           |
# |      <date>      |         <float>          |
# +==================+==========================+
# | 2023-02-13       | 100.0                    |
# +------------------+--------------------------+
# | 2023-02-14       | 100.005                  |
# +------------------+--------------------------+
# | 2023-02-15       | 100.01                   |
# +------------------+--------------------------+
# | 2023-02-16       | 100.09                   |
# +------------------+--------------------------+
# | 2023-02-17       | 101.02                   |
# +------------------+--------------------------+
# | 2023-02-18       | 102.34666666666666       |
# +------------------+--------------------------+
# | 2023-02-19       | 103.67333333333333       |
# +------------------+--------------------------+
# | 2023-02-20       | 105.0                    |
# +------------------+--------------------------+
# | 2023-02-21       | 104.03                   |
# +------------------+--------------------------+
# | 2023-02-22       | 103.06                   |
# +------------------+--------------------------+

This workaround will be quite slow. If you need a performant solution, you’ll need to shell out to Polars proper.

Also, note that because of the simplicity of your use case (evenly spaced dates), I was able to use a simple algorithm. In the general case is more complicated.

suchasurge · September 28, 2023, 6:47am

Thanks for your explorer based solution.

I had the hope that there is a solution which is more performant than with plain elixir.

Anyways. I think I follow your suggestion and try to directly use polars.

Hopefully with this I collect enough knowledge to contribute to explorer some day.

Thx again

Cheers
Frank

billylanchantin · September 28, 2023, 2:44pm

Sure thing

Some other thoughts while it’s on my mind.

The interpolation operation is not very array-programming friendly. Check out what these two major implementations are doing:

Both are quite loop heavy. The reason is that the two arrays in question – the original array and the array of points to sample at – are different sizes. Fundamentally, some work needs to be done to figure out which values of the original array are relevant to each sampling point.

Now this fact is essentially irrelevant to Explorer since Explorer is using Polars under the hood (at least the primary backend is). Adding interpolation to Explorer is more about API design than it is about algorithms. I mostly note this fundamental limitation of interpolation to set expectations about what’s possible from an Elixir-based solution.