I have a DataFrame Value columns and a Column with a start date (NaiveDateTime) and an end date (NaiveDateTime) at which the values where meassured. I would like to “unroll” the dateframe so that, I get a row for every date (day of the year), for every sample in the data frame.
I know, how I would do it in pandas (perhaps not the most elegant solution).
I would compute the day of the year (day) of the start and end dates, construct a list of every doy withing this range as a new column and than explode to return t 1st NF and that it is.
After the flatmap have literaly gotten the day column correct, but the value columns still have the original length. Therefore I can not put the new col.
P.S.: The next steps would be, group(day)->mean(values)-> distinct(). I know how to that. The final step is, what I do not know. How would I fill all the missing day between the first and the last date, that are missing and fill the value columns with nils, so that I cann fill the value will last values. Although I would like to fill with linear interpolation. Still don’t know how to do that.
In the end I would have a timeline with all values filled and a row for each day.^^
Hi,
I’m not really sure I understand the context, in case it helps?
polls_timeline =
polls_timeline
|> DF.map_rows(fn row →
start_day = elem(row, 2) # Assume than start_day at position 2
end_day = elem(row, 3) # Assume end_day at position 3
date_range = Enum.to_list(start_day…end_day)
Tuple.append(row, date_range)
end)
@nseaSeb map_rows was neither included in the Documentation of Explorer, nor could I use it in the git Version I currently use. Could you tell, me a bit more about it?
Yes, I’m not at all sure about this. The map_rows/2 function is not available in Elixir’s Explorer library. It seems to be a function of the df library in R.
However, you can get a similar effect by using Enum.map/2 to iterate over each row of the DataFrame, unless I’m mistaken?
The closest I have seen is the pivot_longer function. But it’s works on existing columns that are combined into one column. But instead of columns I have a list of data (of variable length). @josevalim I hope that makes clear, what I want. Do you know how to address it?
@billylanchantin I musst be afk and try clear the problem from remote as good as I can.
What I meant is, the original DataFrame has about ten entries.
Than I take the dates, as start and End number and construct either a list of list, or what @josevalim proposed A flat_map which creates a flat list list of length 40 or so and than makes a Series out of it.
So the problems from my point of few are:
a) Putting a Series of length 40 into a DataFrame of Shape 10, n_cols is not possible?!
b) I don’t know to transform the DataFrame to a length shape of 40, n_cols and preserve the same row order as in the Series.
You talked about grouping. How would it help in this topic?
alias Explorer.Series, as: S
require Explorer.DataFrame, as: DF
data = [
%{start_date: ~D[2023-01-01], end_date: ~D[2023-01-03], value: 1},
%{start_date: ~D[2023-02-01], end_date: ~D[2023-02-03], value: 2}
]
df =
data
|> DF.new()
|> DF.mutate(
start_day: day_of_year(start_date),
end_day: day_of_year(end_date)
)
computed_values =
df
|> DF.to_rows()
|> Enum.flat_map(fn row ->
Enum.map(row["start_day"]..row["end_day"], fn day ->
# Hold onto "start_day" so we group-then-sort by it later
row |> Map.take(["start_day", "value"]) |> Map.put("day", day)
end)
end)
# Do whatever computation you're trying to do here.
|> Enum.map(fn %{"day" => day, "value" => value} = row ->
Map.put(row, "value2", day + value)
end)
# Now we can group by "start_day", which makes n_rows match again.
|> Enum.group_by(&(&1["start_day"]), &(&1["value2"]))
|> Enum.sort()
|> Enum.map(fn {_start_day, values} -> Enum.sum(values) end)
DF.put(df, :computed_values, computed_values)
A few points:
The above is not optimal. I focused in individual steps to clarify the idea.
I admit, I don’t quite know what unrolling the list accomplishes if you need to collapse it back down. Presumably, for most computations you could skip the unrolling and just operate on the dataframe directly.
If you show exactly what computation you’re trying to accomplish, I could say more. But the above is what I meant by “grouping/etc”.
@billylanchantin I did it in Python Pandas and ilustrate it with some images.
I start with a DataFrame with start and Enddates and some values:
Than I add a column (last column), that holds the days from start to end date:
You see the brackets. Which implies the list. Than I “explode” the DataFrame. So that from any value in the list, the other row values are copied and the row-exists n times.