Pandas: How to conditionally sum values in two different dataframes - python

I have the following dataframes:
df1
Name Leads
0 City0 22
1 City1 11
2 City2 28
3 City3 15
4 City4 14
5 City5 15
6 City6 25
df2
Name Leads
0 City1 13
1 City2 0
2 City4 2
3 City6 5
I'd like to sum the values in the Leads columns only when the values in the Name columns match. I've tried:
df3 = df1['Leads'] + df2['Leads'].where(df1['Name']==df2['Name'])
which returns the error:
ValueError: Can only compare identically-labeled Series objects
I've looked at similar issues on Stack Overflow, but none fit my specific use case. Could someone point me in the right direction?
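To reproduce the setup, the two frames can be rebuilt from the printed output above (a small sketch, not part of the question itself):
import pandas as pd

df1 = pd.DataFrame({'Name': ['City0', 'City1', 'City2', 'City3', 'City4', 'City5', 'City6'],
                    'Leads': [22, 11, 28, 15, 14, 15, 25]})
df2 = pd.DataFrame({'Name': ['City1', 'City2', 'City4', 'City6'],
                    'Leads': [13, 0, 2, 5]})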

Assuming df2.Name values are unique and df2 has exactly 2 columns as in your sample, let's try something different using map and defaultdict:
from collections import defaultdict
df1.Leads + df1.Name.map(defaultdict(int, df2.to_numpy()))
Out[38]:
0 22
1 24
2 28
3 15
4 16
5 15
6 30
dtype: int64
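The defaultdict supplies 0 for any name missing from df2. If you prefer an explicit fallback, a plain Series lookup plus fillna (my variation, not part of the original answer) does the same job:
df1.Leads + df1.Name.map(df2.set_index('Name')['Leads']).fillna(0)
# the intermediate NaNs make the result float64; append .astype(int) if you need integers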

Let us try merge
df = df1.merge(df2,on='Name', how='left')
df['Leads'] = df['Leads_x'].add(df['Leads_y'], fill_value=0)
df
Out[9]:
Name Leads_x Leads_y Leads
0 City0 22 NaN 22.0
1 City1 11 13.0 24.0
2 City2 28 0.0 28.0
3 City3 15 NaN 15.0
4 City4 14 2.0 16.0
5 City5 15 NaN 15.0
6 City6 25 5.0 30.0
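If you only need Name and the summed Leads afterwards, you can drop the merge helpers (a small follow-up of mine, not in the original answer):
df = df.drop(columns=['Leads_x', 'Leads_y'])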

You can use merge:
df1.merge(df2,how='left',on=['Name']).set_index(['Name']).sum(1).reset_index()
output:
Name 0
0 City0 22.0
1 City1 24.0
2 City2 28.0
3 City3 15.0
4 City4 16.0
5 City5 15.0
6 City6 30.0
You can remove the how argument if you only want the matching elements, resulting in this output:
Name 0
0 City1 24
1 City2 28
2 City4 16
3 City6 30
If your actual data has more columns besides Name that you do not wish to sum, include them all in the index right before the sum.
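The summed column comes back named 0; if you want it called Leads again, a name can be given when resetting the index (my addition, not part of the original answer):
df3 = (df1.merge(df2, how='left', on=['Name'])
          .set_index(['Name']).sum(1)
          .reset_index(name='Leads'))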

I am also new to Python, and I am pretty sure there are people who can solve it in a better way, but the solution below worked when I tried it on my system. You can give it a try too.
for i in df2.Name:
    temp = df1[df1.Name == i].Leads.sum() + df2[df2.Name == i].Leads.sum()
    df1.loc[df1.Name == i, 'Leads'] = temp

You could work with a merge and sum across the columns:
df1['Leads'] = df1.merge(df2, on='Name', how='outer').filter(like='Lead').sum(1)
Name Leads
0 City0 22.0
1 City1 24.0
2 City2 28.0
3 City3 15.0
4 City4 16.0
5 City5 15.0
6 City6 30.0

You can try:
df1.set_index('Name').add(df2.set_index('Name')).dropna().reset_index()
Output:
Name Leads
0 City1 24.0
1 City2 28.0
2 City4 16.0
3 City6 30.0
This uses data alignment by setting Name as the index on both dataframes, then drops the NaN values produced where an index value has no match in df2.
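If you would rather keep the cities that have no match in df2 instead of dropping them, add accepts a fill_value (a variation on the answer above, assuming the same frames):
df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()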

Related

How to apply function on very large pandas dataframe, with function depending on consecutive rows?

I want to calculate the speed (m/s and km/h) with the Euclidean distance, based on positions (x, y in meters) and time (in seconds). I found a way to take into account the fact that each time a name appears for the first time in the dataframe, the speed is equal to NaN.
Problem: my dataframe is so large (> 1.5 million rows) that, when I run the code, it is not done after more than 2 hours...
The code works with a shorter dataframe; the problem seems to be the length of the initial df.
Here is the simplified dataframe, followed by the code:
df
name time x y
0 Mary 0 17 15
1 Mary 1 18.5 16
2 Mary 2 21 18
3 Steve 0 12 16
4 Steve 1 10.5 14
5 Steve 2 8 13
6 Jane 0 15 16
7 Jane 1 17 17
8 Jane 2 18 19
from math import sqrt
import numpy as np

# calculating speeds:
for i in range(len(df)):
    if i >= 1:
        df.loc[i, 'speed (m/s)'] = sqrt((df.loc[i, 'x'] - df.loc[i-1, 'x'])**2 + (df.loc[i, 'y'] - df.loc[i-1, 'y'])**2)
        df.loc[i, 'speed (km/h)'] = df.loc[i, 'speed (m/s)'] * 3.6

# each first time a name appears, speeds are equal to NaN:
first_indexes = []
names = df['name'].unique()
for j in names:
    a = df.index[df['name'] == j].tolist()
    if len(a) > 0:
        first_indexes.append(a[0])

for index in first_indexes:
    df.loc[index, 'speed (m/s)'] = np.nan
    df.loc[index, 'speed (km/h)'] = np.nan
Iterating over this dataframe takes way too long; I'm looking for a way to do this faster...
Thanks in advance for helping!
EDIT
df = pd.DataFrame([["Mary",0,17,15],
["Mary",1,18.5,16],
["Mary",2,21,18],
["Steve",0,12,16],
["Steve",1,10.5,14],
["Steve",2,8,13],
["Jane",0,15,16],
["Jane",1,17,17],
["Jane",2,18,19]],columns = [ "name","time","x","y" ])
You can apply the method to all the data at once without loops and then set missing values for the first row of each name (the data has to be sorted by name):
df['speed (m/s)'] = np.sqrt(df['x'].sub(df['x'].shift()).pow(2) +
                            df['y'].sub(df['y'].shift()).pow(2))
df['speed (km/h)'] = df['speed (m/s)']*3.6
cols = ['speed (m/s)','speed (km/h)']
df[cols] = df[cols].mask(~df['name'].duplicated())
print (df)
name time x y speed (m/s) speed (km/h)
0 Mary 0 17.0 15 NaN NaN
1 Mary 1 18.5 16 1.802776 6.489992
2 Mary 2 21.0 18 3.201562 11.525624
3 Steve 0 12.0 16 NaN NaN
4 Steve 1 10.5 14 2.500000 9.000000
5 Steve 2 8.0 13 2.692582 9.693297
6 Jane 0 15.0 16 NaN NaN
7 Jane 1 17.0 17 2.236068 8.049845
8 Jane 2 18.0 19 2.236068 8.049845
Try this:
from math import sqrt

df = pd.read_csv('data.csv')

def calculate_speed(s):
    return sqrt((s['dx'])**2 + (s['dy'])**2)

df = df.join(df.groupby('name')[['x','y']].diff().rename({'x':'dx', 'y':'dy'}, axis=1))
df['speed (m/s)'] = df.apply(calculate_speed, axis=1)
df['speed (km/h)'] = df['speed (m/s)']*3.6
print(df)
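Since the question is about performance on more than a million rows, note that the row-wise apply can itself become a bottleneck; a fully vectorized variant of the same groupby/diff idea (my sketch, not part of the original answer) avoids it:
import numpy as np

d = df.groupby('name')[['x', 'y']].diff()          # per-name differences; NaN on each name's first row
df['speed (m/s)'] = np.hypot(d['x'], d['y'])       # sqrt(dx**2 + dy**2), elementwise
df['speed (km/h)'] = df['speed (m/s)'] * 3.6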
If you want to work with a MultiIndex (which has many nice properties when working with dataframes with names and time indices), you could pivot your table to make name, x and y a column MultiIndex with time being the index:
dfp = df.pivot(index='time', columns=['name'])
Then you can easily calculate the speed for each name without having to check for np.NaN, duplicates or other invalid values:
speed_ms = np.sqrt((dfp['x'] - dfp['x'].shift(-1))**2 + (dfp['y'] - dfp['y'].shift(-1))**2).shift(1)
Now get the speed in km/h
speed_kmh = speed_ms * 3.6
And turn both into MultiIndex columns to make merging/concatenating the dataframes more explicit:
speed_ms.columns = pd.MultiIndex.from_product((['speed (m/s)'], speed_ms.columns))
speed_kmh.columns = pd.MultiIndex.from_product((['speed (km/h)'], speed_kmh.columns))
And finally concatenate the results to the dataframe. swaplevel makes all columns primarily indexable by the name, while sort_index sorts by the names:
dfp = pd.concat((dfp, speed_ms, speed_kmh), axis=1).swaplevel(1, 0, 1).sort_index(axis=1)
Now your dataframe looks like:
# Out[100]:
name Jane ... Steve
speed (km/h) speed (m/s) x y ... speed (km/h) speed (m/s) x y
time ...
0 NaN NaN 15.0 16 ... NaN NaN 12.0 16
1 8.049845 2.236068 17.0 17 ... 9.000000 2.500000 10.5 14
2 8.049845 2.236068 18.0 19 ... 9.693297 2.692582 8.0 13
[3 rows x 12 columns]
And you can easily index speeds and positions by names:
dfp['Mary']
#Out[107]:
speed (km/h) speed (m/s) x y
time
0 NaN NaN 17.0 15
1 6.489992 1.802776 18.5 16
2 11.525624 3.201562 21.0 18
With dfp.stack(0) you re-transform it to your input-df-style, while keeping the names as a second index level:
dfp.stack(0).sort_index(level=1)
# Out[104]:
speed (km/h) speed (m/s) x y
time name
0 Jane NaN NaN 15.0 16
Mary NaN NaN 17.0 15
Steve NaN NaN 12.0 16
1 Jane 8.049845 2.236068 17.0 17
Mary 6.489992 1.802776 18.5 16
Steve 9.000000 2.500000 10.5 14
2 Jane 8.049845 2.236068 18.0 19
Mary 11.525624 3.201562 21.0 18
Steve 9.693297 2.692582 8.0 13
dfp.stack(1), on the other hand, sets the names as columns and the speeds etc. as index levels.

Python Drop all instances of Feature from DF if NaN thresh is met

Using df.dropna(thresh = x, inplace=True), I can successfully drop the rows lacking at least x non-nan values.
But because my df looks like:
2001 2002 2003 2004
bob A 123 31 4 12
bob B 41 1 56 13
bob C nan nan 4 nan
bill A 451 8 nan 24
bill B 32 5 52 6
bill C 623 12 41 14
#Repeating features (A,B,C) for each index/name
This drops the one row/instance where the thresh= condition is met, but leaves the other instances of that feature.
What I want is something that drops the entire feature, if the thresh is met for any one row, such as:
df.dropna(thresh = 2, inplace=True):
2001 2002 2003 2004
bob A 123 31 4 12
bob B 41 1 56 13
bill A 451 8 nan 24
bill B 32 5 52 6
#Drops C from the whole df
wherein C is removed from the entire df, not just the one time it meets the condition under bob
Your sample looks like a MultiIndex dataframe where index level 1 is the feature (A, B, C) and index level 0 is the name. You may use notna and sum to create a mask identifying the rows where the number of non-NaN values is less than 2, and get their index level 1 values. Finally, use df.query to slice the rows:
a = df.notna().sum(1).lt(2).loc[lambda x: x].index.get_level_values(1)
df_final = df.query('ilevel_1 not in @a')
Out[275]:
2001 2002 2003 2004
bob A 123.0 31.0 4.0 12.0
B 41.0 1.0 56.0 13.0
bill A 451.0 8.0 NaN 24.0
B 32.0 5.0 52.0 6.0
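The query string refers to the local variable a through the @ prefix; the same filter can also be written with a boolean mask (an equivalent sketch of mine, not from the original answer):
df_final = df[~df.index.get_level_values(1).isin(a)]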
Method 2:
Use notna, sum, groupby and transform to create a mask that is True for groups where every row has at least 2 non-NaN values. Finally, use this mask to slice the rows:
m = df.notna().sum(1).groupby(level=1).transform(lambda x: x.ge(2).all())
df_final = df[m]
Out[296]:
2001 2002 2003 2004
bob A 123.0 31.0 4.0 12.0
B 41.0 1.0 56.0 13.0
bill A 451.0 8.0 NaN 24.0
B 32.0 5.0 52.0 6.0
Keep only the rows with at least 5 non-NA values:
df.dropna(thresh=5)
thresh is for keeping rows with a minimum number of non-NaN values.
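To get the drop-the-whole-feature behaviour the question asks for, the same thresh idea can be applied per feature group with groupby and filter (a sketch of mine, assuming the MultiIndex layout described above):
df_final = df.groupby(level=1).filter(lambda g: (g.notna().sum(axis=1) >= 2).all())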

Python: how to merge and divide two dataframes?

I have a dataframe df containing the population p assigned to some buildings b
df
p b
0 150 3
1 345 7
2 177 4
3 267 2
and a dataframe df1 that associates some other buildings b1 to the buildings in df
df1
b1 b
0 17 3
1 9 7
2 13 7
I want to assign to the buildings that have an association in df1 the population divided by the number of buildings. In this way we generate df2, which assigns a population of 150/2 = 75 to buildings 3 and 17 and a population of 345/3 = 115 to buildings 7, 9 and 13.
df2
p b
0 75 3
1 75 17
2 115 7
3 115 9
4 115 13
5 177 4
6 267 2
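To reproduce the setup, the two input frames can be rebuilt from the printed output above (a small sketch, not part of the question itself):
import pandas as pd

df = pd.DataFrame({'p': [150, 345, 177, 267], 'b': [3, 7, 4, 2]})
df1 = pd.DataFrame({'b1': [17, 9, 13], 'b': [3, 7, 7]})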
IIUC, you can try merging both dfs on b, then stack() and some cleanup, and finally group on p, transform a count and divide p by it to get the divided values of p:
m = (df.merge(df1, on='b', how='left').set_index('p').stack().reset_index(name='b')
       .drop_duplicates().drop('level_1', 1).sort_values('p'))
m.p = m.p / m.groupby('p')['p'].transform('count')
print(m.sort_index())
p b
0 75.0 3.0
1 75.0 17.0
2 115.0 7.0
3 115.0 9.0
5 115.0 13.0
6 177.0 4.0
7 267.0 2.0
Another way uses pd.concat. After that, fillna b1 and p individually. Next, transform with mean and assign the filled b1 to the final dataframe:
df2 = pd.concat([df, df1], sort=True).sort_values('b')
df2['b1'] = df2.b1.fillna(df2.b)
df2['p'] = df2.p.fillna(0)
df2.groupby('b').p.transform('mean').to_frame().assign(b=df2.b1).reset_index(drop=True)
Out[159]:
p b
0 267.0 2.0
1 75.0 3.0
2 75.0 17.0
3 177.0 4.0
4 115.0 7.0
5 115.0 9.0
6 115.0 13.0
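For comparison, a more explicit route (my own sketch, using the df/df1 reconstruction shown under the question) is to build one long table of buildings keyed by their original b and divide each group's population by its size:
# one row per building: the original b plus each associated b1, keyed by b
orig = df[['b']].assign(building=df['b'])
assoc = df1.rename(columns={'b1': 'building'})[['b', 'building']]
long = pd.concat([orig, assoc], ignore_index=True)

# attach the population of the original building and split it evenly across the group
out = long.merge(df, on='b')
out['p'] = out['p'] / out.groupby('b')['building'].transform('size')
df2 = out[['p', 'building']].rename(columns={'building': 'b'})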

Python Pandas - difference between 'loc' and 'where'?

Just curious on the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID':[1,2,3,4,5,6,7,8,9,10],
'Run Distance':[234,35,77,787,243,5435,775,123,355,123],
'Goals':[12,23,56,7,8,0,4,2,1,34],
'Gender':['m','m','m','f','f','m','f','m','f','m']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but sets everything else to NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks whether each element fits a condition. So it gives you back the entire array, with either the original value or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
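For completeness (not part of the original answers), DataFrame.mask is the inverse of where: it replaces the values where the condition is True instead of where it is False:
df.mask(df['Goals'] > 10)        # NaN where Goals > 10, original values elsewhere
df.mask(df['Goals'] > 10, 0)     # or replace with 0 instead of NaN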
If you check the docs for DataFrame.where, it replaces values by condition - with NaN by default, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax is called boolean indexing and is for filtering rows - it keeps only the rows that match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by condition and select columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the values in rows that don't match the condition (with NaN by default).
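Put differently, you can roughly recover the loc result from where by dropping the rows that became all NaN (my illustration, not from the answers above); note that where upcasts the numeric columns to float along the way:
df.where(df['Goals'] > 10).dropna(how='all')   # roughly df.loc[df['Goals'] > 10], but with float dtypes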

Rolling subtraction in pandas

I am trying to do something like this:
ff = pd.DataFrame({'uid':[1,1,1,20,20,20,4,4,4],
'date':['09/06','10/06','11/06',
'09/06','10/06','11/06',
'09/06','10/06','11/06'],
'balance':[150,200,230,12,15,15,700,1000,1500],
'difference':[np.NaN,50,30,np.NaN,3,0,np.NaN,300,500]})
I have tried rolling, but I cannot find a rolling function or method that subtracts - only sum, var and other stats.
Is there a way?
I was thinking that I could create two dfs: one with the first row of every uid eliminated, and one with the last row of every uid eliminated. But to be honest, I have no idea how to do that dynamically for every uid.
Use groupby with diff:
df = pd.DataFrame({'uid':[1,1,1,20,20,20,4,4,4],
'date':['09/06','10/06','11/06',
'09/06','10/06','11/06',
'09/06','10/06','11/06'],
'balance':[150,200,230,12,15,15,700,1000,1500]})
df['difference'] = df.groupby('uid')['balance'].diff()
Output:
uid date balance difference
0 1 09/06 150 NaN
1 1 10/06 200 50.0
2 1 11/06 230 30.0
3 20 09/06 12 NaN
4 20 10/06 15 3.0
5 20 11/06 15 0.0
6 4 09/06 700 NaN
7 4 10/06 1000 300.0
8 4 11/06 1500 500.0
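diff is just shorthand for subtracting the previous row; the same result can be written explicitly with shift (an equivalent sketch, not part of the original answer), which also generalizes to gaps larger than one row:
df['difference'] = df['balance'] - df.groupby('uid')['balance'].shift()
# for a gap of n rows within each uid, use .shift(n) (or equivalently .diff(n))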
