Merging a time series dataframe on a multi index
[ python data-science ] · 20 min read
In my current role I am frequently (pun intended!) working with time series data. I train, validate, build, and deploy machine learning models to forecast financial time series.
It is all about the index
It is best practice to use a multiindex when working with big time series dataframes. My usual tool of choice is pandas. Often enough you get your data from some data source like SQL or an API, and that data can often not be used for testing purposes (for confidentiality reasons, for instance).
Thus every good data scientist has some sort of helper function to create a test dataframe, especially when you practice TDD.
One function I almost always come back to is posted below.
```python
import pandas as pd
import numpy as np


def dfTS_gen(nsize, ncat, ncols, start="2018-01-01", freq="D"):
    # generate one dataframe per category and concatenate them at the end
    list_df = []
    for i in range(ncat):
        df = pd.DataFrame()
        df["date"] = pd.date_range(start, periods=nsize, freq=freq)
        df["cat0"] = i
        df["cat1"] = "test"
        for j in range(ncols):
            # fill every column with linearly spaced values 0 .. nsize-1
            df[f"col{j}"] = np.linspace(0, nsize - 1, nsize)
        df.set_index(["date", "cat0", "cat1"], inplace=True)
        list_df.append(df)
    return pd.concat(list_df).sort_index()
```
The function takes five arguments: `nsize`, `ncat`, `ncols`, `start`, and `freq`.

- `nsize` sets the number of observations per category, i.e. the length of the date range.
- `ncat` defines the number of distinct categorical values (which will also be indexed by!).
- `ncols` sets the horizontal length, i.e. the number of columns in the new dataframe.
- `start` sets the first date of the datetime index.
- `freq` defines the frequency of the main index column `date`. Possible values are, for instance, monthly (`freq="M"` or `freq="MS"`) and hourly (`freq="H"`). The default is daily data with `freq="D"`.
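As a quick sanity check of the `freq` parameter, a call like the one below generates two days of hourly data for two categories (the exact shapes follow directly from the arguments):

```python
# 48 hourly observations, 2 categories, 3 columns -> 96 rows in total
df_hourly = dfTS_gen(48, 2, 3, start="2018-01-01", freq="H")
print(df_hourly.shape)  # (96, 3)
```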
Let's see an example. I create a dataframe that spans 31 days, with 3 categories and 10 columns.
```python
df = dfTS_gen(31, 3, 10)
df
```
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|
date | cat0 | cat1 | ||||||||||
2018-01-01 | 0 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
2 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
2018-01-02 | 0 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
1 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-01-30 | 1 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 |
2 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | |
2018-01-31 | 0 | test | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
1 | test | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | |
2 | test | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
93 rows × 10 columns
I populated the dataframe with linearly spaced values (using numpy's `linspace`) to have some simple linear data to work with.
How to correctly loc on a multiindex
Once you start working with a multiindexed time series you will come across the problem of selecting (or looking at) a subset of the data.
How to do that? The answer is simple: `pd.IndexSlice` to the rescue.
Once you have mastered that handy tool, working with multiindexed dataframes becomes really easy. Say we want to select the full data for category 0 (i.e. the 2nd index level).
```python
idx = pd.IndexSlice
# specify both axes (rows and columns) when slicing a multiindex
df.loc[idx[:, 0, "test"], :].head()
```
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |
---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||
2018-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2018-01-02 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
2018-01-03 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
2018-01-04 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
2018-01-05 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
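As an aside, when you only need single labels on some levels, `DataFrame.xs` is a sometimes more readable alternative to `IndexSlice`. A small sketch, equivalent to the selection above:

```python
# cross-section on two levels at once; the matched levels are dropped
df.xs((0, "test"), level=("cat0", "cat1")).head()
```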
You can also slice on the datetime level with the same idea. Let's say we want the data for all `cat0` categories for the four days January 10th to 13th:
df.loc[idx["2018-01-10":"2018-01-13", :, "test"]]
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|
date | cat0 | cat1 | ||||||||||
2018-01-10 | 0 | test | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |
1 | test | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | |
2 | test | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | |
2018-01-11 | 0 | test | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
1 | test | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | |
2 | test | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | |
2018-01-12 | 0 | test | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
1 | test | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | |
2 | test | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | |
2018-01-13 | 0 | test | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 |
1 | test | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | |
2 | test | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 |
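One more `IndexSlice` trick worth knowing: instead of a scalar or a slice you can also pass a list of labels per level. A small sketch, again on the test frame from above:

```python
# all dates, but only the categories 0 and 2
df.loc[idx[:, [0, 2], :], :].head()
```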
How to merge onto a multiindex dataframe
Now, I work regularly with time series data in the predictive analytics space. Especially in forecasting, you often end up creating a forecast by running some script or project that returns a future dataframe. That means the date index (often) lies in the future. If you run a validation, you cut off the dataframe at some point in the past and create forecasts for the (real) past, but with a datetime index that is not in the training dataframe. A typical business application is to compare the validation dataframe with the actual values from the training data. It is important to understand how cross-validation techniques for time series work. See the illustration below to get an idea:
This is the moving window technique!
- Pick a start date: cut the dataframe at a datetime in the past, e.g. using `df.query("date < '2018-01-05'")` or the `pd.IndexSlice` functionality shown above (when working with multiindexed time series data); a sketch follows after this list.
- Create your forecasts and return the dataframe with the datetime index (or column).
- Merge the dataframes and calculate the forecasting error from the validation values and the actual values.
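To make the windowing in the first step concrete, here is a minimal sketch of such a validation loop. The function name `moving_window_splits` and the parameters `n_splits` and `horizon` are my own illustration, not an established API; it assumes the (date, cat0, cat1) multiindex created by `dfTS_gen` above:

```python
import pandas as pd

idx = pd.IndexSlice

def moving_window_splits(df, n_splits=3, horizon=5):
    """Yield (train, validation) pairs, sliding the cutoff date forward."""
    # requires n_splits * horizon < number of unique dates
    dates = df.index.get_level_values("date").unique().sort_values()
    for k in range(n_splits, 0, -1):
        cutoff = len(dates) - k * horizon
        # everything up to the cutoff date is train ...
        train = df.loc[idx[: dates[cutoff - 1], :, :], :]
        # ... the next `horizon` dates form the validation window
        val = df.loc[idx[dates[cutoff] : dates[cutoff + horizon - 1], :, :], :]
        yield train, val

for train, val in moving_window_splits(df):
    print(train.index[-1][0].date(), "->",
          val.index[0][0].date(), "to", val.index[-1][0].date())
```

Strictly speaking this sketch expands the training window at each step; for a true moving window you would drop the oldest dates from `train` as well.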
Let's create a hypothetical dataframe with validation data now. Remember, we created `df` spanning 31 days (the full January of 2018).
I will create a shorter dataframe with a different start date that overlaps with it (for merging purposes). I also multiply the data by 100 to distinguish it from the training dataframe:
```python
df_val = dfTS_gen(15, 3, 1, start="2018-01-20")
df_val["col0"] = df_val["col0"] * 100
df_val
```
col0 | |||
---|---|---|---|
date | cat0 | cat1 | |
2018-01-20 | 0 | test | 0.0 |
1 | test | 0.0 | |
2 | test | 0.0 | |
2018-01-21 | 0 | test | 100.0 |
1 | test | 100.0 | |
2 | test | 100.0 | |
2018-01-22 | 0 | test | 200.0 |
1 | test | 200.0 | |
2 | test | 200.0 | |
2018-01-23 | 0 | test | 300.0 |
1 | test | 300.0 | |
2 | test | 300.0 | |
2018-01-24 | 0 | test | 400.0 |
1 | test | 400.0 | |
2 | test | 400.0 | |
2018-01-25 | 0 | test | 500.0 |
1 | test | 500.0 | |
2 | test | 500.0 | |
2018-01-26 | 0 | test | 600.0 |
1 | test | 600.0 | |
2 | test | 600.0 | |
2018-01-27 | 0 | test | 700.0 |
1 | test | 700.0 | |
2 | test | 700.0 | |
2018-01-28 | 0 | test | 800.0 |
1 | test | 800.0 | |
2 | test | 800.0 | |
2018-01-29 | 0 | test | 900.0 |
1 | test | 900.0 | |
2 | test | 900.0 | |
2018-01-30 | 0 | test | 1000.0 |
1 | test | 1000.0 | |
2 | test | 1000.0 | |
2018-01-31 | 0 | test | 1100.0 |
1 | test | 1100.0 | |
2 | test | 1100.0 | |
2018-02-01 | 0 | test | 1200.0 |
1 | test | 1200.0 | |
2 | test | 1200.0 | |
2018-02-02 | 0 | test | 1300.0 |
1 | test | 1300.0 | |
2 | test | 1300.0 | |
2018-02-03 | 0 | test | 1400.0 |
1 | test | 1400.0 | |
2 | test | 1400.0 |
The dataframe does indeed overlap with the training dataframe, named `df` in this notebook.
Let's see how we can merge them. If we simply run the default `pd.DataFrame.merge()`, it will not work the way you most likely intend:
```python
df.merge(df_val)
```
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
We see that the index vanished! We lose a lot of information. Also, be aware that the default merge joined on the only common column, `col0`, with an inner join, which is why only the rows where `col0` matches in both frames (here the zeros) survived.
In practice you would most likely have another column that you want to add to the training dataframe. Let's first check how to correctly solve the first case.
Merging another dataframe with the same column
```python
df_merge_same_col = df.merge(
    df_val,
    how="outer",
    left_on=["date", "cat0", "cat1"],
    right_on=["date", "cat0", "cat1"],
)
df_merge_same_col
```
col0_x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col0_y | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | cat0 | cat1 | |||||||||||
2018-01-01 | 0 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
1 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | |
2 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | |
2018-01-02 | 0 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN |
1 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-02-02 | 1 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1300.0 |
2 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1300.0 | |
2018-02-03 | 0 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1400.0 |
1 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1400.0 | |
2 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1400.0 |
102 rows × 11 columns
This looks quite good. Remember the SQL merge types; it is also quite helpful to refer to the pandas docs. Let's dissect the code:

- Use `how="outer"` as the merging strategy.
- Specify the index levels to merge on as a list (since it is a multiindex).
You see that the default setup renames the overlapping columns with suffixes: `_x` (left/old) and `_y` (right/new).
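If you find `_x`/`_y` unreadable, `merge` accepts a `suffixes` parameter; a small sketch with suffix names of my own choosing:

```python
# same merge as above, but with self-documenting suffixes
df.merge(
    df_val,
    how="outer",
    left_on=["date", "cat0", "cat1"],
    right_on=["date", "cat0", "cat1"],
    suffixes=("_train", "_val"),
).filter(like="col0").head()
```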
Let's now look at an overlapping entry (using the `idx` pandas `IndexSlice` object defined above):

```python
df_merge_same_col.loc[idx["2018-01-30", :, :], :]
```
col0_x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col0_y | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
cat0 | cat1 | |||||||||||
0 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 1000.0 |
1 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 1000.0 |
2 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 1000.0 |
And we find our 1000 for all the `cat0` and `cat1` entries. Note that the date level gets dropped from the index when you `loc` on a single date.
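With the merged frame at hand, the last step of the recipe above, calculating the forecasting error, becomes straightforward. A minimal sketch computing the mean absolute error on the overlapping window (the choice of MAE is just an example):

```python
# keep only rows where both actuals (col0_x) and forecasts (col0_y) exist
overlap = df_merge_same_col.dropna(subset=["col0_x", "col0_y"])
mae = (overlap["col0_x"] - overlap["col0_y"]).abs().mean()
print(f"MAE on the overlapping window: {mae:.1f}")
```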
Merging another dataframe with a new column onto an old column
Sometimes you want to append the findings from your validation dataframe (here `df_val`) to the old dataframe (i.e. `df`).
That is, we want to append the values from `col0` in `df_val` to `col0` in `df`.
For this we need the pandas functionality `pd.concat`. Beautifully, this works out of the box and gives us what we need, even with the default arguments (note the default `join='outer'`):
```python
pd.concat([df, df_val])
```
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|
date | cat0 | cat1 | ||||||||||
2018-01-01 | 0 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
2 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
2018-01-02 | 0 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
1 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-02-02 | 1 | test | 1300.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | test | 1300.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2018-02-03 | 0 | test | 1400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | test | 1400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2 | test | 1400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
138 rows × 10 columns
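One caveat worth pointing out: because both frames contain the dates January 20th to 31st, the concatenated result carries duplicate index entries for those days (93 + 45 = 138 rows). If you want one row per (date, cat0, cat1) key, a minimal sketch that keeps the newer validation values (the `keep="last"` choice is my assumption about which value should win):

```python
combined = pd.concat([df, df_val])
# for duplicated keys keep the row appended last, i.e. the validation value
deduped = combined[~combined.index.duplicated(keep="last")].sort_index()
```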
Conclusion
Merging and concatenating can be quite a daunting field, especially for time series data. I hope this short overview helps you get a more stable footing in this area. I have been working with time series for years now and still often find myself reading the pandas docs for help. Thus, I can highly recommend the article about merging dataframes in the official pandas docs.