Merging a time series dataframe on a multi index

In my current role I am frequently (pun intended!) working with time series data. I train, validate, build and deploy machine learning models to forecast financial time series.

It is all about the index

It is best practice to make use of a MultiIndex when working with big time series dataframes. My usual tool of choice is pandas. Often enough you get your data from some data source like SQL or an API. That data can often not be used for testing purposes (for confidentiality reasons, for instance).

Thus every good data scientist has some sort of helper function to create a test dataframe, especially when you practice TDD.

One function I almost always come back to is posted below.

import pandas as pd
import numpy as np

def dfTS_gen(nsize, ncat, ncols, start="2018-01-01", freq="D"):
    """Generate a multiindexed time series test dataframe."""
    list_df = []
    for i in range(ncat):
        df = pd.DataFrame()
        # the main time index
        df["date"] = pd.date_range(start, periods=nsize, freq=freq)
        # two categorical index levels
        df["cat0"] = i
        df["cat1"] = "test"
        # fill the value columns with a linear space
        for j in range(ncols):
            df[f"col{j}"] = np.linspace(0, nsize - 1, nsize)

        df.set_index(["date", "cat0", "cat1"], inplace=True)
        list_df.append(df)

    return pd.concat(list_df).sort_index()

The function takes five arguments: nsize, ncat, ncols, start and freq. nsize sets the number of observations per category, so the resulting dataframe has nsize * ncat rows. ncat defines the number of different categorical values (which will also be indexed by!). ncols sets the horizontal length, i.e. the number of columns in the new dataframe. start sets the first date of the index. The freq parameter defines the frequency of the main index column date. For instance: possible parameters are monthly (set freq="M" or freq="MS") or hourly (set freq="H"); newer pandas releases prefer the aliases "ME" and "h". The default is daily data with freq="D".
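To get a feel for what the freq strings produce, pd.date_range on its own is enough. A small sketch (using the lowercase "h" alias, which newer pandas versions prefer over "H"):

```python
import pandas as pd

# Three index variants with different frequencies.
daily = pd.date_range("2018-01-01", periods=3, freq="D")
monthly = pd.date_range("2018-01-01", periods=3, freq="MS")  # month start
hourly = pd.date_range("2018-01-01", periods=3, freq="h")    # "H" in older pandas

print(daily[-1])    # 2018-01-03 00:00:00
print(monthly[-1])  # 2018-03-01 00:00:00
print(hourly[-1])   # 2018-01-01 02:00:00
```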

Let's see an example: I create a dataframe that spans 31 days, with 3 categories and 10 columns.

df = dfTS_gen(31, 3, 10)
df
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9
date cat0 cat1
2018-01-01 0 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-01-02 0 test 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 test 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ...
2018-01-30 1 test 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0
2 test 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0
2018-01-31 0 test 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0
1 test 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0
2 test 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0

93 rows × 10 columns

I populated the dataframe with a linear space (using numpy) to have some linear data to work with.

How to correctly loc on a multiindex

Once you start working with a multiindexed time series you will come across the problem of selecting (or looking at) a subset of the data. How do you do that? The answer is simple: pd.IndexSlice to the rescue.

Once you have mastered that handy tool, working with multiindexed dataframes becomes really easy. Say we want to loc onto the full data for category 0 (i.e. the 2nd index level).

idx = pd.IndexSlice
df.loc[idx[:, 0, "test"]].head()
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9
date
2018-01-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-01-02 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2018-01-03 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2018-01-04 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2018-01-05 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0

You could also loc on the datetime level with the same idea. Let's say we want the data for all cat0 values for the four days January 10th to 13th:

df.loc[idx["2018-01-10":"2018-01-13", :, "test"]]
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9
date cat0 cat1
2018-01-10 0 test 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
1 test 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
2 test 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
2018-01-11 0 test 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0
1 test 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0
2 test 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0
2018-01-12 0 test 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0
1 test 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0
2 test 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0 11.0
2018-01-13 0 test 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0
1 test 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0
2 test 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0

How to merge onto a multiindex dataframe

Now, I am working regularly with time series data in the predictive analytics space. Especially in forecasting you often end up creating a forecast by running some script or project that returns a future dataframe, meaning the date index (often) lies in the future. If you run a validation you would cut off the dataframe in the past and create forecasts for the (real) past, but with a datetime index that is not in the train dataframe. A typical business application is to compare the validation dataframe with the actual values from the train data. It is important to understand how cross-validation techniques for time series work. See the illustration below to give you an idea:

TS Cross Validation

This is the moving window technique!

  1. Pick a start date: Cut the dataframe at a datetime in the past, e.g. using df.query("date < '2018-01-05'") or the pd.IndexSlice functionality above (when working with multiindexed time series data).
  2. Create your forecasts and return the dataframe with the datetime index (or column).
  3. Merge the dataframe and calculate the forecasting error from the validation values and actual values.
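The three steps above can be sketched in a few lines. This is a minimal sketch, assuming a single-column toy series and a naive "repeat the last value" forecast as a stand-in for a real model:

```python
import numpy as np
import pandas as pd

# Toy series: one value per day, so the splits are easy to inspect.
s = pd.DataFrame(
    {"y": np.arange(31, dtype=float)},
    index=pd.date_range("2018-01-01", periods=31, freq="D", name="date"),
)

horizon = 5  # forecast 5 days ahead in each window
errors = []
for cutoff in pd.date_range("2018-01-10", "2018-01-20", freq="5D"):
    # step 1: cut the dataframe at the cutoff date
    train = s.loc[:cutoff]
    # step 2: naive forecast with a future datetime index
    future_index = pd.date_range(cutoff + pd.Timedelta(days=1),
                                 periods=horizon, freq="D")
    forecast = pd.DataFrame({"yhat": train["y"].iloc[-1]}, index=future_index)
    # step 3: merge forecast with actuals and compute the error
    merged = forecast.join(s, how="left")
    errors.append((merged["yhat"] - merged["y"]).abs().mean())

print(errors)  # the naive forecast is off by 3.0 on average in every window
```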

Let’s create a hypothetical dataframe with validation data now. Remember, we created df with a length of 31 rows per category (the full January of 2018). I will create a shorter dataframe with a different start date that overlaps (for merging purposes). I also manipulate the data by multiplying it by 100 to distinguish it from the training dataframe:

df_val = dfTS_gen(15, 3, 1, start="2018-01-20")
df_val["col0"] = df_val["col0"] * 100
df_val
col0
date cat0 cat1
2018-01-20 0 test 0.0
1 test 0.0
2 test 0.0
2018-01-21 0 test 100.0
1 test 100.0
2 test 100.0
2018-01-22 0 test 200.0
1 test 200.0
2 test 200.0
2018-01-23 0 test 300.0
1 test 300.0
2 test 300.0
2018-01-24 0 test 400.0
1 test 400.0
2 test 400.0
2018-01-25 0 test 500.0
1 test 500.0
2 test 500.0
2018-01-26 0 test 600.0
1 test 600.0
2 test 600.0
2018-01-27 0 test 700.0
1 test 700.0
2 test 700.0
2018-01-28 0 test 800.0
1 test 800.0
2 test 800.0
2018-01-29 0 test 900.0
1 test 900.0
2 test 900.0
2018-01-30 0 test 1000.0
1 test 1000.0
2 test 1000.0
2018-01-31 0 test 1100.0
1 test 1100.0
2 test 1100.0
2018-02-01 0 test 1200.0
1 test 1200.0
2 test 1200.0
2018-02-02 0 test 1300.0
1 test 1300.0
2 test 1300.0
2018-02-03 0 test 1400.0
1 test 1400.0
2 test 1400.0

The dataframe does indeed overlap with the training dataframe, named df in this notebook. Let’s see how we can merge them. If we simply run the default pd.DataFrame.merge() function it will not do what you most likely intend.

df.merge(df_val)
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

We see the index vanished! We lose so much information. The reason: by default merge performs an inner join on all columns the two dataframes have in common, here only col0, and ignores the index entirely. The only col0 value present in both dataframes is 0.0, so we end up with 3 × 3 = 9 rows of zeros.

In practice you would most likely have another column that you want to add to the train dataframe. Let’s first check how to correctly solve the first case.

Merging another dataframe with the same column

df_merge_same_col = df.merge(
    df_val,
    how="outer",
    left_on=["date", "cat0", "cat1"],
    right_on=["date", "cat0", "cat1"],
)
df_merge_same_col

col0_x col1 col2 col3 col4 col5 col6 col7 col8 col9 col0_y
date cat0 cat1
2018-01-01 0 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
1 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
2 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
2018-01-02 0 test 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
1 test 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018-02-02 1 test NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1300.0
2 test NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1300.0
2018-02-03 0 test NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1400.0
1 test NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1400.0
2 test NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1400.0

102 rows × 11 columns

This looks quite good. Remember the SQL merging types; it is also quite helpful to refer to the pandas docs.

Let's dissect the code:

  1. Use how="outer" as the merging strategy.
  2. Specify the index to merge on as a list (since it is a multiindex).

You see that the default setup renames the overlapping columns with suffixes: _x (old) and _y (new).
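If _x/_y is too cryptic, the suffixes parameter of merge lets you pick your own names. A tiny sketch with two hypothetical single-key frames:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "col0": [10.0, 20.0]})
right = pd.DataFrame({"key": [2, 3], "col0": [200.0, 300.0]})

# Control the suffixes instead of the default "_x"/"_y".
merged = left.merge(right, how="outer", on="key", suffixes=("_train", "_val"))
print(merged.columns.tolist())  # ['key', 'col0_train', 'col0_val']
```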

Let’s now look at an overlapping entry (using the idx pandas IndexSlice object defined above):

df_merge_same_col.loc[idx["2018-01-30", :, :]]
col0_x col1 col2 col3 col4 col5 col6 col7 col8 col9 col0_y
cat0 cat1
0 test 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 1000.0
1 test 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 1000.0
2 test 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 29.0 1000.0

And we find our 1000 for all the cat0 and cat1 entries. Note that the date index level gets dropped by using loc with a scalar key.
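With the old and new columns sitting side by side, the forecast error falls out almost for free. A sketch that rebuilds reduced one-column versions of df and df_val (via a hypothetical helper) and computes the mean absolute error on the overlapping dates:

```python
import numpy as np
import pandas as pd

def make(nsize, start, scale=1.0):
    """Reduced stand-in for dfTS_gen: one value column, three categories."""
    frames = []
    for i in range(3):
        d = pd.DataFrame({"date": pd.date_range(start, periods=nsize, freq="D")})
        d["cat0"] = i
        d["cat1"] = "test"
        d["col0"] = np.linspace(0, nsize - 1, nsize) * scale
        frames.append(d.set_index(["date", "cat0", "cat1"]))
    return pd.concat(frames).sort_index()

df = make(31, "2018-01-01")                 # "train" / actuals
df_val = make(15, "2018-01-20", scale=100)  # "validation" / forecasts

merged = df.merge(
    df_val,
    how="outer",
    left_on=["date", "cat0", "cat1"],
    right_on=["date", "cat0", "cat1"],
)
# Only rows where both sides exist contribute to the error.
overlap = merged.dropna(subset=["col0_x", "col0_y"])
mae = (overlap["col0_y"] - overlap["col0_x"]).abs().mean()
print(mae)
```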

Merging another dataframe with a new column onto an old column

Sometimes you want to append the findings from your validation dataframe (here df_val) to the old dataframe (i.e. df), that is, append the values from col0 in df_val to col0 in df.

For this we need the pandas functionality pd.concat. Beautifully, this works out of the box and gives us what we need (even with the default arguments; note the default join='outer'):

pd.concat([df, df_val])
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9
date cat0 cat1
2018-01-01 0 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 test 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-01-02 0 test 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 test 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ...
2018-02-02 1 test 1300.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 test 1300.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2018-02-03 0 test 1400.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 test 1400.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 test 1400.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN

138 rows × 10 columns
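One caveat: pd.concat keeps both rows for the overlapping dates (hence the 138 rows, not 31 + 15 - 12 days per category). If you instead want the validation values to win on the overlap, one option is Index.duplicated. A minimal sketch with two tiny hypothetical frames:

```python
import pandas as pd

# Two overlapping single-column frames, like a train and a validation frame.
a = pd.DataFrame({"col0": [0.0, 1.0, 2.0]},
                 index=pd.date_range("2018-01-01", periods=3, freq="D", name="date"))
b = pd.DataFrame({"col0": [200.0, 300.0]},
                 index=pd.date_range("2018-01-03", periods=2, freq="D", name="date"))

stacked = pd.concat([a, b])
print(len(stacked))  # 5 rows: 2018-01-03 appears twice

# keep="last" prefers the later (validation) values for overlapping dates.
deduped = stacked[~stacked.index.duplicated(keep="last")].sort_index()
print(deduped["col0"].tolist())  # [0.0, 1.0, 200.0, 300.0]
```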

Conclusion

Merging and concatenating can be quite a daunting field, especially for time series data. This short overview is hopefully helpful to get a more stable footing in this area. I have been working with time series for years now and still find myself often reading the pandas docs for help. Thus, I can highly recommend the article about merging dataframes in the official pandas docs.