I am trying to combine as many pairs of rows as possible in a single run of the code. As my example shows, for two rows to be combined, the rules are:
values in the PT, DS, and SC columns must be the same.
the timestamps in FS must be the closest pair.
the ID column (string) is combined like ID1,ID2.
the WT and CB columns (numeric) are combined with sum().
FS is combined by taking the later time.
My example is:
df0 = pd.DataFrame({'ID':['1001','1002','1003','1004','2001','2002','2003','2004','3001','3002','3003','3004','4001','4002','4003','4004','5001','5002','5003','5004','6001'],
'PT':['B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','D','D','D','D','F'],
'DS':['AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAA','AAA','AAA','AAB','AAB'],
'SC':['P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P2','P2'],
'FS':['2020-10-16 00:00:00','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:04','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:05','2020-10-16 00:00:07','2020-10-16 00:00:01','2020-10-16 00:00:10','2020-10-16 00:10:00','2020-10-16 00:10:40','2020-10-16 00:00:00','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[1,2,3,4,10,11,12,13,20,21,22,23,30,31,32,33,40,41,42,43,53],
'CB':[0.1,0.2,0.3,0.4,1,1.1,1.2,1.3,2,2.1,2.2,2.3,3,3.1,3.2,3.3,4,4.1,4.2,4.3,5.3]})
When the code is run once, the new dataframe df1 is:
df1 = pd.DataFrame({'ID':['1001,1002','1003,1004','2001,2002','2003,2004','3001,3002','3003,3004','4001,4002','4003,4004','5001,5002','5003','5004','6001'],
'PT':['B','B','B','B','B','B','B','B','D','D','D','F'],
'DS':['AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAA','AAA','AAB','AAB'],
'SC':['P1','P1','P2','P2','P1','P1','P2','P2','P1','P1','P2','P2'],
'FS':['2020-10-16 00:00:02','2020-10-16 00:00:04','2020-10-16 00:00:01','2020-10-16 00:00:03','2020-10-16 00:00:01','2020-10-16 00:00:07','2020-10-16 00:00:10','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[3,7,21,25,41,45,61,65,81,42,43,53],
'CB':[0.3,0.7,2.1,2.5,4.1,4.5,6.1,6.5,8.1,4.2,4.3,5.3]})
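(For example, the first row of df1 combines 1001 and 1002: ID becomes '1001,1002', WT = 1 + 2 = 3, CB = 0.1 + 0.2 = 0.3, and FS is the later of the two timestamps, 2020-10-16 00:00:02.)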
When the code is run again on df1, the new dataframe df2 is:
df2 = pd.DataFrame({'ID':['1001,1002,1003,1004','2001,2002,2003,2004','3001,3002,3003,3004','4001,4002,4003,4004','5001,5002,5003','5004','6001'],
'PT':['B','B','B','B','D','D','F'],
'DS':['AAA','AAA','AAB','AAB','AAA','AAB','AAB'],
'SC':['P1','P2','P1','P2','P1','P2','P2'],
'FS':['2020-10-16 00:00:04','2020-10-16 00:00:03','2020-10-16 00:00:07','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[10,46,86,126,123,43,53],
'CB':[1,4.6,8.6,12.6,12.3,4.3,5.3]})
At this point no more combining can be done on df2, because no pair of rows meets the rules.
The background is that I have a memory limit and have to reduce the size of the data without losing information, so I am trying to bundle IDs that share the same features and occur close to each other in time. I plan to run the code repeatedly until the memory issue goes away or no more combines are possible, as sketched in the loop below.
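A minimal sketch of that outer loop might look like the following, where combine_once is a hypothetical placeholder for whatever single-pass implementation is chosen (for instance, one of the answers below):
# Hypothetical driver loop: combine_once stands for a function that performs one
# combining pass under the rules above and returns a new, possibly smaller dataframe.
df = df0
prev_len = None
while prev_len != len(df):   # stop once a pass no longer shrinks the data
    prev_len = len(df)
    df = combine_once(df)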
This is a good place to use GroupBy operations.
My source was Wes McKinney's Python for Data Analysis.
# Join all IDs within each (PT, DS, SC) group into one comma-separated string.
df0['ID'] = df0.groupby([df0['PT'], df0['DS'], df0['SC']])['ID'].transform(lambda x: ','.join(x))
# Latest FS per group (string max works here because the timestamps are ISO-formatted).
max_times = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index=False).max().drop(['WT', 'CB'], axis=1)
# Summed WT and CB per group (in newer pandas, .sum(numeric_only=True) avoids
# string-concatenating the FS column).
sums_WT_CB = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index=False).sum()
df2 = pd.merge(max_times, sums_WT_CB, on=['ID', 'PT', 'DS', 'SC'])
This code just takes the most recent time for each unique grouping of the columns you specified. If there are other requirements for the FS column, you will have to modify this.
Code to concatenate the IDs came from:
Concatenate strings from several rows using Pandas groupby
Perhaps there's something more straightforward (please comment if so :)
but the following seems to work:
def combine(data):
    # Collapse one (PT, DS, SC) group into a single combined row.
    return pd.DataFrame(
        {
            "ID": ",".join(map(str, data["ID"])),
            "PT": data["PT"].iloc[0],
            "DS": data["DS"].iloc[0],
            "SC": data["SC"].iloc[0],
            "WT": data["WT"].sum(),
            "CB": data["CB"].sum(),
            "FS": data["FS"].max(),
        },
        index=[0],
    ).reset_index(drop=True)

df_agg = (
    df0.sort_values(["PT", "DS", "SC", "FS"])
    .groupby(["PT", "DS", "SC"])
    .apply(combine)
    .reset_index(drop=True)
)
returns
ID PT DS SC WT CB FS
0 1001,1002,1003,1004 B AAA P1 10 1.0 2020-10-16 00:00:04
1 2001,2002,2003,2004 B AAA P2 46 4.6 2020-10-16 00:00:03
2 3001,3002,3003,3004 B AAB P1 86 8.6 2020-10-16 00:00:07
3 4001,4002,4003,4004 B AAB P2 126 12.6 2020-10-16 00:10:40
4 5001,5003,5002 D AAA P1 123 12.3 2020-10-16 00:10:00
5 5004 D AAB P2 43 4.3 2020-10-16 00:00:10
6 6001 F AAB P2 53 5.3 2020-10-16 00:00:05
I have 2 dataframes of different sizes in Python. The smaller dataframe has 2 datetime columns, one for the beginning datetime and one for the ending datetime. The other dataframe is bigger (more rows and columns) and it has one datetime column.
df A
Date_hour_beginning Date_hour_end
3/8/2019 18:35 3/8/2019 19:45
4/8/2019 14:22 4/8/2019 14:55
df B
Date_hour compression
3/8/2019 18:37 41
3/8/2019 18:55 47
3/8/2019 19:30 55
3/8/2019 19:51 51
4/8/2019 14:10 53
4/8/2019 14:35 48
4/8/2019 14:51 51
4/8/2019 15:02 58
I want to add to df_A the mean and amplitude of the compression values that fall within each datetime range, to get the following result:
df_A
Date_hour_beginning Date_hour_end mean_compression amplitude
3/8/2019 18:35 3/8/2019 19:45 47.66 14
4/8/2019 14:22 4/8/2019 14:55 49.5 3
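(For the first row, for example, the df_B timestamps that fall between 18:35 and 19:45 have compression values 41, 47, and 55, so mean_compression = (41 + 47 + 55) / 3 ≈ 47.66 and amplitude = 55 - 41 = 14.)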
I tried np.where and groupby, but I got an error about mismatched dataframe shapes.
Here is my solution. It is a more verbose (and maybe more readable?) version of eva-vw's. eva-vw's answer uses the .apply() method, which is the fastest way of looping over the rows of your dataframe. However, it should only make a significant difference in runtime if your df_A has a very large number of rows, which does not seem to be the case here.
for i, row in df_A.iterrows():
    start = row['Date_hour_beginning']
    end = row['Date_hour_end']
    # Select the compression values whose timestamp falls inside this row's range.
    mask = (df_B['Date_hour'] >= start) & (df_B['Date_hour'] <= end)
    compression_values = df_B.loc[mask, 'compression']
    df_A.loc[i, 'avg comp'] = compression_values.mean()
    df_A.loc[i, 'amp comp'] = compression_values.max() - compression_values.min()
For completeness, here is how I created the dataframes:
import numpy as np
import pandas as pd
columns = ['Date_hour_beginning', 'Date_hour_end']
times_1 = pd.to_datetime(['3/8/2019 18:35', '3/8/2019 19:45'])
times_2 = pd.to_datetime(['4/8/2019 14:22', '4/8/2019 14:55'])
df_A = pd.DataFrame(data=[times_1, times_2], columns=columns)
data_B = [ ['3/8/2019 18:37', 41],
['3/8/2019 18:55', 47],
['3/8/2019 19:30', 55],
['3/8/2019 19:51', 51],
['4/8/2019 14:10', 53],
['4/8/2019 14:35', 48],
['4/8/2019 14:51', 51],
['4/8/2019 15:02', 58]]
columns_B = ['Date_hour', 'compression']
df_B = pd.DataFrame(data=data_B, columns=columns_B)
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
To go a bit further: to solve your problem, you need to loop over the rows of df_A. This can be done in three main ways: (i) with a plain for loop over the indices of the rows of the dataframe, (ii) with a for loop using the .iterrows() method, or (iii) with the .apply() method.
I ordered them from slowest to fastest at runtime. I picked method (ii) and eva-vw picked method (iii). The advantage of .apply() is that it is the fastest, but its disadvantage (to me) is that everything you want to do with the row has to be written in a one-line lambda function.
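For comparison, here is a minimal sketch of method (i), the plain loop over the index; it assumes the same df_A and df_B as above, already converted to datetime:
# Method (i): plain loop over the index labels of df_A (slowest of the three).
for i in df_A.index:
    start = df_A.loc[i, 'Date_hour_beginning']
    end = df_A.loc[i, 'Date_hour_end']
    in_range = df_B.loc[(df_B['Date_hour'] >= start) & (df_B['Date_hour'] <= end), 'compression']
    df_A.loc[i, 'mean_compression'] = in_range.mean()
    df_A.loc[i, 'amplitude'] = in_range.max() - in_range.min()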
# create test dataframes
df_A = pd.DataFrame(
{
"Date_hour_beginning": ["3/8/2019 18:35", "4/8/2019 14:22"],
"Date_hour_end": ["3/8/2019 19:45", "4/8/2019 14:55"],
}
)
df_B = pd.DataFrame(
{
"Date_hour": [
"3/8/2019 18:37",
"3/8/2019 18:55",
"3/8/2019 19:30",
"3/8/2019 19:51",
"4/8/2019 14:10",
"4/8/2019 14:35",
"4/8/2019 14:51",
"4/8/2019 15:02",
],
"compression": [41, 47, 55, 51, 53, 48, 51, 58],
}
)
# convert to datetime
df_A['Date_hour_beginning'] = pd.to_datetime(df_A['Date_hour_beginning'])
df_A['Date_hour_end'] = pd.to_datetime(df_A['Date_hour_end'])
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
# accumulate compression values per range
df_A["compression"] = df_A.apply(
lambda row: df_B.loc[
(df_B["Date_hour"] >= row["Date_hour_beginning"])
& (df_B["Date_hour"] <= row["Date_hour_end"]),
"compression",
].values.tolist(),
axis=1,
)
# calculate mean compression and amplitude
df_A['mean_compression'] = df_A['compression'].apply(lambda x: sum(x) / len(x))
df_A['amplitude'] = df_A['compression'].apply(lambda x: max(x) - min(x))
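As a quick check, the result matches the expected output in the question:
print(df_A[['mean_compression', 'amplitude']])
# mean_compression is roughly 47.67 and 49.5; amplitude is 14 and 3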
Use this:
df_A['Date_hour_beginning'] = pd.to_datetime(df_A['Date_hour_beginning'])
df_A['Date_hour_end'] = pd.to_datetime(df_A['Date_hour_end'])
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
# Cross join df_A and df_B via a constant key, then keep only the rows whose
# Date_hour falls inside the [beginning, end] range.
df_A = df_A.assign(key=1)
df_B = df_B.assign(key=1)
df_merge = pd.merge(df_A, df_B, on='key').drop('key', axis=1)
df_merge = df_merge.query('Date_hour >= Date_hour_beginning and Date_hour <= Date_hour_end')
# Amplitude (max - min) per range; the final groupby().mean() averages compression within each range.
df_merge['amplitude'] = df_merge.groupby(['Date_hour_beginning','Date_hour_end'])['compression'].transform(lambda x: x.max()-x.min())
df_merge = df_merge.groupby(['Date_hour_beginning','Date_hour_end']).mean()
Output:
compression amplitude
Date_hour_beginning Date_hour_end
2019-03-08 18:35:00 2019-03-08 19:45:00 47.666667 14.0
2019-04-08 14:22:00 2019-04-08 14:55:00 49.500000 3.0
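If the exact column names from the question are wanted, a small follow-up (my addition, not part of the original answer) can rename and flatten the result:
# Rename the averaged column and bring the range bounds back as ordinary columns.
df_merge = df_merge.rename(columns={'compression': 'mean_compression'}).reset_index()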
groupby can also accept a Series aligned on the same index, i.e.:
df['Date_hour'] = pd.to_datetime(df['Date_hour'])
df_a['begin'] = pd.to_datetime(df_a['begin'])
df_a['end'] = pd.to_datetime(df_a['end'])
# For each row of df, find the index of the df_a range that contains its Date_hour
# (this assumes every Date_hour falls inside some [begin, end] interval; otherwise
# .index[0] raises an IndexError).
selector = df.apply(lambda x: df_a.query(f'begin <= \'{x["Date_hour"]}\' <= end').index[0], axis=1)
for i_gr, gr in df.groupby(selector):
    print(i_gr, gr)
And then go on with your .mean() or .median().
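For instance, a hedged sketch of that aggregation step, keeping the df / df_a naming above and assuming every Date_hour falls inside some range:
# Mean compression and amplitude per matched range, joined back onto df_a.
agg = df.groupby(selector)['compression'].agg(
    mean_compression='mean',
    amplitude=lambda s: s.max() - s.min(),
)
df_a = df_a.join(agg)   # selector values are df_a index labels, so this aligns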
I have two dataframes:
df1 is a pivot table that has totals for both columns and rows, both with the default name "All".
df2 is a dataframe I created manually by specifying values and using the same index and column names as the pivot table above. This table does not have totals.
I need to multiply the first dataframe by the values in the second. I expect the totals to come back as NaN, since totals don't exist in the second table.
When I perform multiplication, I get the following error:
ValueError: cannot join with no level specified and no overlapping names
When I try the same on dummy dataframes it works as expected:
import pandas as pd
import numpy as np
table1 = np.matrix([[10, 20, 30, 60],
[50, 60, 70, 180],
[90, 10, 10, 110],
[150, 90, 110, 350]])
df1 = pd.DataFrame(data = table1, index = ['One','Two','Three', 'All'], columns =['A', 'B','C', 'All'] )
print(df1)
table2 = np.matrix([[1.0, 2.0, 3.0],
[5.0, 6.0, 7.0],
[2.0, 1.0, 5.0]])
df2 = pd.DataFrame(data = table2, index = ['One','Two','Three'], columns =['A', 'B','C'] )
print(df2)
df3 = df1*df2
print(df3)
This gives me the following output:
A B C All
One 10 20 30 60
Two 50 60 70 180
Three 90 10 10 110
All 150 90 110 350
A B C
One 1.00 2.00 3.00
Two 5.00 6.00 7.00
Three 2.00 1.00 5.00
A All B C
All nan nan nan nan
One 10.00 nan 40.00 90.00
Three 180.00 nan 10.00 50.00
Two 250.00 nan 360.00 490.00
So, visually, the only difference between df1 and df2 is the presence/absence of the column and row "All".
And I think the only difference between my dummy dataframes and the real ones is that the real df1 was created with the pd.pivot_table method:
df1_real = pd.pivot_table(PY, values = ['Annual Pay'], index = ['PAR Rating'],
columns = ['CR Range'], aggfunc = [np.sum], margins = True)
I do need to keep the totals, as I'm using them in other calculations.
I'm sure there is a workaround, but I just really want to understand why the same code works on some dataframes of different sizes but not others. Or maybe the issue is something completely different.
Thank you for reading. I realize it's a very long post.
IIUC,
My Preferred Approach
You can use the mul method in order to pass the fill_value argument. In this case, you'll want a value of 1 (the multiplicative identity) to preserve the value from the dataframe in which the value is not missing.
df1.mul(df2, fill_value=1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
Alternate Approach
You can also embrace the np.nan and use a follow-up combine_first to fill back in the missing bits from df1:
(df1 * df2).combine_first(df1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
I really like Pir's approach, and here is mine :-)
df1.loc[df2.index,df2.columns]*=df2
df1
Out[293]:
A B C All
One 10.0 40.0 90.0 60
Two 250.0 360.0 490.0 180
Three 180.0 10.0 50.0 110
All 150.0 90.0 110.0 350
#Wen, #piRSquared, thank you for your help. This is what I ended up doing. There is probably a more elegant solution but this worked for me.
Since I was able to multiply two dummy dataframes of different sizes, I reasoned the issue wasn't the size but the fact that one of the dataframes was created as a pivot table. In the pivot table, the headers were not recognized even though visually they were there (the pivot table's columns form a MultiIndex, so they don't line up with df2's flat column names). So I decided to convert the pivot table to a regular dataframe. Steps I took (a sketch follows the list):
Converted the pivot table to records and then back to a dataframe, using the solution from this thread: pandas pivot table to data frame.
Cleaned up the column headers using the solution from the same thread: pandas pivot table to data frame.
Set my first column as the index, following the suggestion in this thread: How to remove index from a created Dataframe in Python?
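A self-contained sketch of the same idea, with made-up data standing in for PY and taking a slightly different route than to_records() (flattening the MultiIndex columns in place):
import pandas as pd

# Illustrative data: the real PY and the 'PAR Rating', 'CR Range', 'Annual Pay'
# columns come from the question; the values here are made up.
PY = pd.DataFrame({
    'PAR Rating': ['One', 'One', 'Two', 'Two', 'Three'],
    'CR Range':   ['A', 'B', 'A', 'C', 'B'],
    'Annual Pay': [10, 20, 50, 70, 10],
})
pivot = pd.pivot_table(PY, values=['Annual Pay'], index=['PAR Rating'],
                       columns=['CR Range'], aggfunc=['sum'], margins=True)

# The pivot's columns are tuples like ('sum', 'Annual Pay', 'A'); keep only the
# last level so the headers match a plain dataframe such as df2.
flat = pivot.copy()
flat.columns = [col[-1] for col in flat.columns]
print(flat)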
This gave me a dataframe that was visually identical to what I had before but was no longer a pivot table.
I was then able to multiply the two dataframes with no issues. I used the approach suggested by #Wen because I like that it preserves the structure.