I have a data set for which I need to calculate a running total for sub-segments. My data set looks like this:
trip_id, station_code, departure_time, station_sequence
1, A, 08:00:00, 1
1, B, 08:05:00, 2
1, C, 08:07:00, 3
1, D, 08:12:00, 4
2, Z, 08:30:00, 1
2, Y, 08:37:00, 2
2, X, 08:41:00, 3
For each record in this data set, I want to add another field, called trip_duration_min. This will be the total time of the trip since the first stop sequence.
trip_id, station_code, departure_time, station_sequence, trip_duration_min
1, A, 08:00:00, 1, 0
1, B, 08:05:00, 2, 5
1, C, 08:07:00, 3, 7
1, D, 08:12:00, 4, 12
2, Z, 08:30:00, 1, 0
2, Y, 08:37:00, 2, 7
2, X, 08:41:00, 3, 11
I’m wondering what is the best way to calculate this value using Stream or Enum. My thought was to use Stream.chunk_by to break up the data set into individual enumerables for each trip_id, then use Stream.scan to accumulate the total time in minutes of each trip. However, I’m not sure what the best way is to reference the prior record in an enumerable.
Any thoughts?