Merging a time series dataframe on a multi index
[ python data-science ] · 20 min read
In my current role I am frequently (pun intended!) working with time series data. I train, validate, build, and deploy machine learning models to forecast financial time series.
It is all about the index
It is best practice to use a multiindex when working with big time series dataframes. My usual tool of choice is pandas. Often enough you get your data from some data source like SQL or an API, and that data can often not be used for testing purposes (for confidentiality reasons, for instance).
Thus every good data scientist has some sort of helper function to create a test dataframe, especially when you practice TDD.
One function I almost always come back to is posted below.
```python
import pandas as pd
import numpy as np


def dfTS_gen(nsize, ncat, ncols, start="2018-01-01", freq="D"):
    # generate one dataframe per category and concatenate them at the end
    list_df = []
    for i in range(ncat):
        df = pd.DataFrame()
        df["date"] = pd.date_range(start, periods=nsize, freq=freq)
        df["cat0"] = i
        df["cat1"] = "test"
        for j in range(ncols):
            # fill every column with linearly spaced values 0 .. nsize-1
            df[f"col{j}"] = np.linspace(0, nsize - 1, nsize)
        df.set_index(["date", "cat0", "cat1"], inplace=True)
        list_df.append(df)
    return pd.concat(list_df).sort_index()
```
The function takes five arguments: `nsize`, `ncat`, `ncols`, `start`, and `freq`.

- `nsize` sets the number of observations per category, i.e. the length of the date range.
- `ncat` defines the number of distinct categorical values (which will also be indexed by!).
- `ncols` sets the horizontal length, i.e. the number of columns in the new dataframe.
- `start` sets the first date of the datetime index.
- `freq` defines the frequency of the main index column `date`. Possible values are, for instance, monthly (`freq="M"` or `freq="MS"`) and hourly (`freq="H"`). The default is daily data with `freq="D"`.
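As a quick sanity check of the `freq` parameter, a call like the one below generates two days of hourly data for two categories (the exact shapes follow directly from the arguments):

```python
# 48 hourly observations, 2 categories, 3 columns -> 96 rows in total
df_hourly = dfTS_gen(48, 2, 3, start="2018-01-01", freq="H")
print(df_hourly.shape)  # (96, 3)
```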
Let's see an example. I create a dataframe that spans 31 days, with 3 categories and 10 columns.
```python
df = dfTS_gen(31, 3, 10)
df
```
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|
date | cat0 | cat1 | ||||||||||
2018-01-01 | 0 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
2 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
2018-01-02 | 0 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
1 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-01-30 | 1 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 |
2 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | |
2018-01-31 | 0 | test | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
1 | test | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | |
2 | test | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
93 rows × 10 columns
I populated the dataframe with linearly spaced values (using numpy's `linspace`) to have some simple linear data to work with.
How to correctly loc on a multiindex
Once you start working with a multiindexed time series you will come across the problem of selecting (or looking at) a subset of the data.
How to do that? The answer is simple: `pd.IndexSlice` to the rescue.
Once you have mastered that handy tool, working with multiindexed dataframes becomes really easy. Say we want to select the full data for category 0 (i.e. the 2nd index level).
```python
idx = pd.IndexSlice
# specify both axes (rows and columns) when slicing a multiindex
df.loc[idx[:, 0, "test"], :].head()
```
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |
---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||
2018-01-01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2018-01-02 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
2018-01-03 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
2018-01-04 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
2018-01-05 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
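As an aside, when you only need single labels on some levels, `DataFrame.xs` is a sometimes more readable alternative to `IndexSlice`. A small sketch, equivalent to the selection above:

```python
# cross-section on two levels at once; the matched levels are dropped
df.xs((0, "test"), level=("cat0", "cat1")).head()
```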
You can also slice on the datetime level with the same idea. Let's say we want the data for all `cat0` categories for the four days January 10th to 13th:
df.loc[idx["2018-01-10":"2018-01-13", :, "test"]]
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|
date | cat0 | cat1 | ||||||||||
2018-01-10 | 0 | test | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |
1 | test | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | |
2 | test | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | |
2018-01-11 | 0 | test | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
1 | test | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | |
2 | test | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | |
2018-01-12 | 0 | test | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
1 | test | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | |
2 | test | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | |
2018-01-13 | 0 | test | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 |
1 | test | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | |
2 | test | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 |
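One more `IndexSlice` trick worth knowing: instead of a scalar or a slice you can also pass a list of labels per level. A small sketch, again on the test frame from above:

```python
# all dates, but only the categories 0 and 2
df.loc[idx[:, [0, 2], :], :].head()
```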
How to merge onto a multiindex dataframe
Now, I work regularly with time series data in the predictive analytics space. Especially in forecasting, you often end up creating a forecast by running some script or project that returns a future dataframe. That means the date index (often) lies in the future. If you run a validation, you cut off the dataframe at some point in the past and create forecasts for the (real) past, but with a datetime index that is not in the training dataframe. A typical business application is to compare the validation dataframe with the actual values from the training data. It is important to understand how cross-validation techniques for time series work. See the illustration below to get an idea:
This is the moving window technique!
- Pick a start date: cut the dataframe at a datetime in the past, e.g. using `df.query("date < '2018-01-05'")` or the `pd.IndexSlice` functionality shown above (when working with multiindexed time series data); a sketch follows after this list.
- Create your forecasts and return the dataframe with the datetime index (or column).
- Merge the dataframes and calculate the forecasting error from the validation values and the actual values.
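To make the windowing in the first step concrete, here is a minimal sketch of such a validation loop. The function name `moving_window_splits` and the parameters `n_splits` and `horizon` are my own illustration, not an established API; it assumes the (date, cat0, cat1) multiindex created by `dfTS_gen` above:

```python
import pandas as pd

idx = pd.IndexSlice

def moving_window_splits(df, n_splits=3, horizon=5):
    """Yield (train, validation) pairs, sliding the cutoff date forward."""
    # requires n_splits * horizon < number of unique dates
    dates = df.index.get_level_values("date").unique().sort_values()
    for k in range(n_splits, 0, -1):
        cutoff = len(dates) - k * horizon
        # everything up to the cutoff date is train ...
        train = df.loc[idx[: dates[cutoff - 1], :, :], :]
        # ... the next `horizon` dates form the validation window
        val = df.loc[idx[dates[cutoff] : dates[cutoff + horizon - 1], :, :], :]
        yield train, val

for train, val in moving_window_splits(df):
    print(train.index[-1][0].date(), "->",
          val.index[0][0].date(), "to", val.index[-1][0].date())
```

Strictly speaking this sketch expands the training window at each step; for a true moving window you would drop the oldest dates from `train` as well.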
Let's create a hypothetical dataframe with validation data now. Remember, we created `df` spanning 31 days (the full January of 2018).
I will create a shorter dataframe with a different start date that overlaps with it (for merging purposes). I also multiply the data by 100 to distinguish it from the training dataframe:
```python
df_val = dfTS_gen(15, 3, 1, start="2018-01-20")
df_val["col0"] = df_val["col0"] * 100
df_val
```
col0 | |||
---|---|---|---|
date | cat0 | cat1 | |
2018-01-20 | 0 | test | 0.0 |
1 | test | 0.0 | |
2 | test | 0.0 | |
2018-01-21 | 0 | test | 100.0 |
1 | test | 100.0 | |
2 | test | 100.0 | |
2018-01-22 | 0 | test | 200.0 |
1 | test | 200.0 | |
2 | test | 200.0 | |
2018-01-23 | 0 | test | 300.0 |
1 | test | 300.0 | |
2 | test | 300.0 | |
2018-01-24 | 0 | test | 400.0 |
1 | test | 400.0 | |
2 | test | 400.0 | |
2018-01-25 | 0 | test | 500.0 |
1 | test | 500.0 | |
2 | test | 500.0 | |
2018-01-26 | 0 | test | 600.0 |
1 | test | 600.0 | |
2 | test | 600.0 | |
2018-01-27 | 0 | test | 700.0 |
1 | test | 700.0 | |
2 | test | 700.0 | |
2018-01-28 | 0 | test | 800.0 |
1 | test | 800.0 | |
2 | test | 800.0 | |
2018-01-29 | 0 | test | 900.0 |
1 | test | 900.0 | |
2 | test | 900.0 | |
2018-01-30 | 0 | test | 1000.0 |
1 | test | 1000.0 | |
2 | test | 1000.0 | |
2018-01-31 | 0 | test | 1100.0 |
1 | test | 1100.0 | |
2 | test | 1100.0 | |
2018-02-01 | 0 | test | 1200.0 |
1 | test | 1200.0 | |
2 | test | 1200.0 | |
2018-02-02 | 0 | test | 1300.0 |
1 | test | 1300.0 | |
2 | test | 1300.0 | |
2018-02-03 | 0 | test | 1400.0 |
1 | test | 1400.0 | |
2 | test | 1400.0 |
The dataframe does indeed overlap with the training dataframe, named `df` in this notebook.
Let's see how we can merge them. If we simply run the default `pd.DataFrame.merge()`, it will not work the way you most likely intend:
```python
df.merge(df_val)
```
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
We see that the index vanished! We lose a lot of information. Also, be aware that the default merge joined on the only common column, `col0`, with an inner join, which is why only the rows where `col0` matches in both frames (here the zeros) survived.
In practice you would most likely have another column that you want to add to the training dataframe. Let's first check how to correctly solve the first case.
Merging another dataframe with the same column
```python
df_merge_same_col = df.merge(
    df_val,
    how="outer",
    left_on=["date", "cat0", "cat1"],
    right_on=["date", "cat0", "cat1"],
)
df_merge_same_col
```
col0_x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col0_y | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | cat0 | cat1 | |||||||||||
2018-01-01 | 0 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
1 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | |
2 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | |
2018-01-02 | 0 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN |
1 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-02-02 | 1 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1300.0 |
2 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1300.0 | |
2018-02-03 | 0 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1400.0 |
1 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1400.0 | |
2 | test | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1400.0 |
102 rows × 11 columns
This looks quite good. Remember the SQL merge types; it is also quite helpful to refer to the pandas docs. Let's dissect the code:

- Use `how="outer"` as the merging strategy.
- Specify the index levels to merge on as a list (since it is a multiindex).
You see that the default setup renames the overlapping columns with suffixes: `_x` (left/old) and `_y` (right/new).
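If you find `_x`/`_y` unreadable, `merge` accepts a `suffixes` parameter; a small sketch with suffix names of my own choosing:

```python
# same merge as above, but with self-documenting suffixes
df.merge(
    df_val,
    how="outer",
    left_on=["date", "cat0", "cat1"],
    right_on=["date", "cat0", "cat1"],
    suffixes=("_train", "_val"),
).filter(like="col0").head()
```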
Let's now look at an overlapping entry (using the `idx` pandas `IndexSlice` object defined above):

```python
df_merge_same_col.loc[idx["2018-01-30", :, :], :]
```
col0_x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col0_y | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
cat0 | cat1 | |||||||||||
0 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 1000.0 |
1 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 1000.0 |
2 | test | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 1000.0 |
And we find our 1000 for all the `cat0` and `cat1` entries. Note that the date level gets dropped from the index when you `loc` on a single date.
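With the merged frame at hand, the last step of the recipe above, calculating the forecasting error, becomes straightforward. A minimal sketch computing the mean absolute error on the overlapping window (the choice of MAE is just an example):

```python
# keep only rows where both actuals (col0_x) and forecasts (col0_y) exist
overlap = df_merge_same_col.dropna(subset=["col0_x", "col0_y"])
mae = (overlap["col0_x"] - overlap["col0_y"]).abs().mean()
print(f"MAE on the overlapping window: {mae:.1f}")
```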
Merging another dataframe with a new column onto an old column
Sometimes you want to append the findings from your validation dataframe (here `df_val`) to the old dataframe (i.e. `df`).
That is, we want to append the values from `col0` in `df_val` to `col0` in `df`.
For this we need the pandas functionality `pd.concat`. Beautifully, this works out of the box and gives us what we need, even with the default arguments (note the default `join='outer'`):
```python
pd.concat([df, df_val])
```
col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|
date | cat0 | cat1 | ||||||||||
2018-01-01 | 0 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
2 | test | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
2018-01-02 | 0 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
1 | test | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-02-02 | 1 | test | 1300.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | test | 1300.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2018-02-03 | 0 | test | 1400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | test | 1400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2 | test | 1400.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
138 rows × 10 columns
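One caveat worth pointing out: because both frames contain the dates January 20th to 31st, the concatenated result carries duplicate index entries for those days (93 + 45 = 138 rows). If you want one row per (date, cat0, cat1) key, a minimal sketch that keeps the newer validation values (the `keep="last"` choice is my assumption about which value should win):

```python
combined = pd.concat([df, df_val])
# for duplicated keys keep the row appended last, i.e. the validation value
deduped = combined[~combined.index.duplicated(keep="last")].sort_index()
```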
Conclusion
Merging and concatenating can be quite a daunting field, especially for time series data. I hope this short overview helps you get a more stable footing in this area. I have been working with time series for years now and still often find myself reading the pandas docs for help. Thus, I can highly recommend the article about merging dataframes in the official pandas docs.