Resampling timeseries dataframe with multi-index - python

Generate data:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(freq='10T', start='2020-10-01', periods=12 * 24))  # 10-minute timestamps over two days
df['col1'] = np.random.normal(size=df.shape[0])
df['col2'] = np.random.randint(1, 101, size=df.shape[0])  # integers in [1, 100]
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq='10T', start='2020-10-01', periods=12 * 24))
df2['col1'] = np.random.normal(size=df2.shape[0])
df2['col2'] = np.random.randint(1, 51, size=df2.shape[0])  # integers in [1, 50]
df2['uid'] = 2
df3 = pd.concat([df, df2]).reset_index()
df3 = df3.set_index(['index', 'uid'])
I am trying to resample the data to 30-minute intervals and choose how each column is aggregated, per uid and per column. I have many columns and need to specify whether I want the mean, median, std, max, or min for each of them. Since there are duplicate timestamps, I need to do this operation per user, which is why I set the MultiIndex and then try the following:
df3.groupby(pd.Grouper(freq='30Min', closed='right', label='right')).agg(
    {"col1": "max", "col2": "min", 'uid': 'max'})
but I get the following error
ValueError: MultiIndex has no single backing array. Use
'MultiIndex.to_numpy()' to get a NumPy array of tuples.
How can I do this operation?

You have to specify the level name when you use pd.Grouper on an index level, and add 'uid' as a second grouping key:
out = (df3.groupby([pd.Grouper(level='index', freq='30T', closed='right', label='right'), 'uid'])
          .agg({"col1": "max", "col2": "min"}))
print(out)
# Output
                              col1  col2
index               uid
2020-10-01 00:00:00 1    -0.222489    77
                    2    -1.490019    22
2020-10-01 00:30:00 1     1.556801    16
                    2     0.580076     1
2020-10-01 01:00:00 1     0.745477    12
...                            ...   ...
2020-10-02 23:00:00 2     0.272276    13
2020-10-02 23:30:00 1     0.378779    20
                    2     0.786048     5
2020-10-03 00:00:00 1     1.716791    20
                    2     1.438454     5

[194 rows x 2 columns]
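Since you want different statistics (mean, median, std, max, min) per column, the same groupby also accepts several aggregations per column via named aggregation. A minimal sketch, with placeholder column/statistic pairs you would adjust to your data:

out = (df3.groupby([pd.Grouper(level='index', freq='30T', closed='right', label='right'), 'uid'])
          .agg(col1_mean=('col1', 'mean'),
               col1_max=('col1', 'max'),
               col2_min=('col2', 'min'),
               col2_std=('col2', 'std')))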

Related

Get the value of a data frame column with respect to another data frame column value

I have two data frames
df1:
ID Date Value
0 9560 07/3/2021 25
1 9560 03/03/2021 20
2 9712 12/15/2021 15
3 9712 08/30/2021 10
4 9920 4/11/2021 5
df2:
ID Value
0 9560
1 9712
2 9920
In df2, I want to fill the "Value" column with the latest value from df1's "Value" column for each ID.
This is my expected output:
ID Value
0 9560 25
1 9712 15
2 9920 5
How could I achieve it?
Based on Daniel Afriyie's approach, I came up with this solution:
import pandas as pd

# Setup for demo
df1 = pd.DataFrame(
    columns=['ID', 'Date', 'Value'],
    data=[
        [9560, '07/3/2021', 25],
        [9560, '03/03/2021', 20],
        [9712, '12/15/2021', 15],
        [9712, '08/30/2021', 10],
        [9920, '4/11/2021', 5]
    ]
)
df2 = pd.DataFrame(
    columns=['ID', 'Value'],
    data=[[9560, None], [9712, None], [9920, None]]
)

## Actual solution
# Casting 'Date' column to actual dates
df1['Date'] = pd.to_datetime(df1['Date'])
# Sorting by date, newest first
df1 = df1.sort_values(by='Date', ascending=False)
# Dropping duplicates of 'ID' (since rows are ordered by date, only the newest row of each ID is kept)
df1 = df1.drop_duplicates(subset=['ID'])
# Merging the values from df1 into df2
df2 = pd.merge(df2[['ID']], df1[['ID', 'Value']])
output:
ID Value
0 9560 25
1 9712 15
2 9920 5
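An alternative sketch that skips the explicit drop_duplicates step (assuming 'Date' has already been cast with pd.to_datetime as above): keep the newest row per ID, then merge it into df2.

# Latest 'Value' per ID: sort ascending by date, then take the last row of each group.
latest = df1.sort_values('Date').groupby('ID', as_index=False)['Value'].last()
df2 = df2[['ID']].merge(latest, on='ID')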

Python Pandas: How to subtract values in two non-consecutive rows in a specific column of a dataframe from one another

I am trying to populate a new column in a Pandas df by subtracting the values of two non-consecutive rows in another column of the same df. I can do this as long as the df does not have a date column, but when it does, pandas throws an error.
Assume the following dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 55, 9], [10, 99, 19], [27, 38, 29], [39, 10, 72]]),
                  columns=['a', 'b', 'c'])
df['Date'] = ['2020-01-02', '2020-01-05', '2020-06-10', '2020-08-05', '2020-09-01', '2020-10-29']
df['Date'] = pd.to_datetime(df['Date'])
df['d'] = ''
df = df[['Date', 'a', 'b', 'c', 'd']]
This gives me a df that looks like this:
Date a b c d
0 2020-01-02 1 2 3
1 2020-01-05 4 5 6
2 2020-06-10 7 55 9
3 2020-08-05 10 99 19
4 2020-09-01 27 38 29
5 2020-10-29 39 10 72
I am trying to create a new column 'd' that, for each row, subtracts the value in column 'b' two rows below from the row in question. For instance, the value in row [0], column ['d'] would be calculated as df.loc[2]['b'] - df.loc[0]['b'].
What I'm trying (which doesn't work) is:
for i in range(len(df)-2):
    df.loc[i]['d'] = df.loc[i+2]['b'] - df.loc[i]['b']
I can get this to work if I have no date in the df. But when I add a column with dates, it throws an error message saying
A value is trying to be set on a copy of a slice from a DataFrame
I can't figure out why a date column causes the df to be unable to do math on columns with only int64 data. I've tried searching this site and just can't seem to solve the problem. Any help would be greatly appreciated.
You can do it in vectorized form using shift (which is considerably faster than using loops):
df['d'] = df['b'].shift(-2) - df['b']
df
Output:
Date a b c d
0 2020-01-02 1 2 3 53.0
1 2020-01-05 4 5 6 94.0
2 2020-06-10 7 55 9 -17.0
3 2020-08-05 10 99 19 -89.0
4 2020-09-01 27 38 29 NaN
5 2020-10-29 39 10 72 NaN
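As a side note on the original loop (not part of the answer above): the warning comes from the chained indexing df.loc[i]['d'] = ..., which writes to a temporary copy once the frame holds mixed dtypes (the datetime column splits it into separate internal blocks). If a loop is really wanted, assigning through a single .loc[row, col] indexer avoids that:

# Assign through one .loc indexer instead of chained indexing.
for i in range(len(df) - 2):
    df.loc[i, 'd'] = df.loc[i + 2, 'b'] - df.loc[i, 'b']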

Python Pandas merge on row index and column index across 2 dataframes

I am trying to calculate a new dataframe, which is dataframe1 divided by dataframe2, where column names match and the date index is matched on the closest date (non-exact match).
idx1 = pd.DatetimeIndex(['2017-01-01','2018-01-01','2019-01-01'])
idx2 = pd.DatetimeIndex(['2017-02-01','2018-03-01','2019-04-01'])
df1 = pd.DataFrame(index = idx1,data = {'XYZ': [10, 20, 30],'ABC': [15, 25, 30]})
df2 = pd.DataFrame(index = idx2,data = {'XYZ': [1, 2, 3],'ABC': [3, 5, 6]})
#looking for some code
#df3 = df1/df2 on matching column and closest matching row
This should produce a dataframe which looks like this
XYZ ABC
2017-01-01 10 5
2018-01-01 10 5
2019-01-01 10 5
You can use an asof merge to match on the closest row, then group over the columns axis and divide.
df3 = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                    direction='nearest')
# XYZ_x ABC_x XYZ_y ABC_y
#2017-01-01 10 15 1 3
#2018-01-01 20 25 2 5
#2019-01-01 30 30 3 6
df3 = (df3.groupby(df3.columns.str.split('_').str[0], axis=1)
          .apply(lambda x: x.iloc[:, 0]/x.iloc[:, 1]))
# ABC XYZ
#2017-01-01 5.0 10.0
#2018-01-01 5.0 10.0
#2019-01-01 5.0 10.0
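An alternative sketch that avoids the suffix handling entirely (assuming both frames share exactly the same column names): align df2 to df1's index with nearest-match reindexing, then divide column-wise by name.

# Align df2's rows to df1's dates by nearest match, then divide element-wise.
df3 = df1 / df2.reindex(df1.index, method='nearest')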

Advanced Pivot Table in Pandas

I am trying to optimize some table transformation scripts in Python Pandas, which I need to feed with large data sets (above 50k rows). I wrote a script that iterates over every index and copies values into a new data frame (see example below), but I am running into performance issues. Is there any pandas function that could get the same results without iterating?
Example code:
from datetime import datetime
import pandas as pd
date1 = datetime(2019,1,1)
date2 = datetime(2019,1,2)
df = pd.DataFrame({"ID": [1,1,2,2,3,3],
"date": [date1,date2,date1,date2,date1,date2],
"x": [1,2,3,4,5,6],
"y": ["a","a","b","b","c","c"]})
new_df = pd.DataFrame()
for i in df.index:
new_df.at[df.at[i, "ID"], "y"] = df.at[i, "y"]
if df.at[i, "date"] == datetime(2019,1,1):
new_df.at[df.at[i, "ID"], "x1"] = df.at[i, "x"]
elif df.at[i, "date"] == datetime(2019,1,2):
new_df.at[df.at[i, "ID"], "x2"] = df.at[i, "x"]
output (the input df followed by the resulting new_df):
ID date x y
0 1 2019-01-01 1 a
1 1 2019-01-02 2 a
2 2 2019-01-01 3 b
3 2 2019-01-02 4 b
4 3 2019-01-01 5 c
5 3 2019-01-02 6 c
y x1 x2
1 a 1.0 2.0
2 b 3.0 4.0
3 c 5.0 6.0
The transformation basically groups the rows by the "ID" column, taking the "x1" values from the rows with date 2019-01-01 and the "x2" values from the rows with date 2019-01-02. The "y" value is the same within each "ID". The "ID" column becomes the new index.
I'd appreciate any advice on this matter.
Using pivot_table will get what you are looking for:
result = df.pivot_table(index=['ID', 'y'], columns='date', values='x')
result.rename(columns={date1: 'x1', date2: 'x2'}).reset_index('y')
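If there can be more than two distinct dates, a variation of the same pivot (a sketch; the x1, x2, ... names are assumed targets) renames the pivoted columns by position instead of hard-coding date1 and date2:

result = df.pivot_table(index=['ID', 'y'], columns='date', values='x')
result.columns = [f'x{i}' for i in range(1, len(result.columns) + 1)]  # x1, x2, ... in date order
result = result.reset_index('y')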

Displaying only the intersection of date range rows in pandas

Following from here
import pandas as pd
data = {'date': ['1998-03-01 00:00:01', '2001-04-01 00:00:01', '1998-06-01 00:00:01',
                 '2001-08-01 00:00:01', '2001-05-03 00:00:01', '1994-03-01 00:00:01'],
        'node1': [1, 1, 2, 2, 3, 2],
        'node2': [8, 316, 26, 35, 44, 56],
        'weight': [1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns = ['date', 'node1','node2','weight'])
df['date'] = pd.to_datetime(df['date'])
mask = df.groupby('node1').apply(lambda x : (x['date'].dt.year.isin([1998,1999,2000])).any())
mask2 = df.groupby('node1').apply(lambda x : (x['date'].dt.year.isin([2001,2002,2003])).any())
print(df[df['node1'].isin(mask[mask & mask2].index)])
The output I require is the set of nodes that appear in both year ranges, (98-00) and (01-03), and it should only display the rows that fall in those ranges.
Expected Output-
node1 node2 date
1 8 1998-03-01
1 316 2001-04-01
2 26 1998-06-01
2 35 2001-08-01
Right now this code also prints the row 2 56 1994-03-01, which it should not.
One simple solution is to first remove the dates that fall in neither range and then apply the masks, i.e.
l1 = [1998,1999,2000]
l2 = [2001,2002,2003]
ndf = df[df['date'].dt.year.isin(l1+l2)]
After getting the ndf:
Option 1: You can go for dual groupby mask based approach i.e
mask = ndf.groupby('node1').apply(lambda x : (x['date'].dt.year.isin(l1)).any())
mask2 = ndf.groupby('node1').apply(lambda x : (x['date'].dt.year.isin(l2)).any())
new = ndf[ndf['node1'].isin(mask[mask & mask2].index)]
Thank you #Zero
Option 2: You can go for groupby transform
new = ndf[ndf.groupby('node1')['date'].transform(lambda x: x.dt.year.isin(l1).any() & x.dt.year.isin(l2).any())]
Option 3: groupby filter
new = ndf.groupby('node1').filter(lambda x: x['date'].dt.year.isin(l1).any() & x['date'].dt.year.isin(l2).any())
Output :
date node1 node2 weight
0 1998-03-01 00:00:01 1 8 1
1 2001-04-01 00:00:01 1 316 1
2 1998-06-01 00:00:01 2 26 1
3 2001-08-01 00:00:01 2 35 1
