Python: how to merge and divide two dataframes?

I have a dataframe df containing the population p assigned to some buildings b
df
p b
0 150 3
1 345 7
2 177 4
3 267 2
and a dataframe df1 that associates some other buildings b1 to the buildings in df
df1
b1 b
0 17 3
1 9 7
2 13 7
I want to assign to the buildings that have an association in df1 the population divided by the number of associated buildings. In this way we generate df2, which assigns a population of 150/2=75 to buildings 3 and 17, and a population of 345/3=115 to buildings 7, 9 and 13.
df2
p b
0 75 3
1 75 17
2 115 7
3 115 9
4 115 13
5 177 4
6 267 2
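For reference, here is a minimal construction of the two example frames (my addition, assuming pandas is imported as pd), matching the tables above:
import pandas as pd

df = pd.DataFrame({'p': [150, 345, 177, 267], 'b': [3, 7, 4, 2]})
df1 = pd.DataFrame({'b1': [17, 9, 13], 'b': [3, 7, 7]})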

IIUC, you can try merging both dfs on b, then stack() plus some cleanup; finally, group on p, transform with count, and divide p by that count to get the divided values:
m = (df.merge(df1, on='b', how='left')
       .set_index('p').stack().reset_index(name='b')
       .drop_duplicates().drop(columns='level_1').sort_values('p'))
m['p'] = m['p'] / m.groupby('p')['p'].transform('count')
print(m.sort_index())
p b
0 75.0 3.0
1 75.0 17.0
2 115.0 7.0
3 115.0 9.0
5 115.0 13.0
6 177.0 4.0
7 267.0 2.0
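Note that the merge/stack round trip upcasts b (and p, once divided) to float, which is why the printed result shows 3.0, 17.0 and so on. If integer building ids are wanted back, a small cast (my addition, not part of the original answer) restores them:
# optional: restore integer dtype for the building ids after the stack upcast
m['b'] = m['b'].astype(int)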

Another way, using pd.concat: after concatenating, fill the NaNs in b1 and p individually, then transform with mean within each b group and assign the filled b1 as the final building column.
df2 = pd.concat([df, df1], sort=True).sort_values('b')
df2['b1'] = df2.b1.fillna(df2.b)   # rows coming from df have no b1: use b itself
df2['p'] = df2.p.fillna(0)         # rows coming from df1 have no p: count them as 0
df2.groupby('b').p.transform('mean').to_frame().assign(b=df2.b1).reset_index(drop=True)
Out[159]:
p b
0 267.0 2.0
1 75.0 3.0
2 75.0 17.0
3 177.0 4.0
4 115.0 7.0
5 115.0 9.0
6 115.0 13.0

Related

How to use each row's value as the comparison object to get the count of rows satisfying a condition from the whole DataFrame?

date data
0 2012/1/1 100
1 2012/1/2 109
2 2012/1/3 108
3 2012/1/4 120
4 2012/1/5 80
5 2012/1/6 130
6 2012/1/7 100
7 2012/1/8 140
Given the dataframe above, I want to get the number of rows whose data value is within ±10 of each row's data value, and append that count to each row, such that:
date data Count
0 2012/1/1 100.0 4.0
1 2012/1/2 109.0 4.0
2 2012/1/3 108.0 4.0
3 2012/1/4 120.0 2.0
4 2012/1/5 80.0 1.0
5 2012/1/6 130.0 3.0
6 2012/1/7 100.0 4.0
7 2012/1/8 140.0 2.0
Since each row's own value is the comparison target, I use iterrows, although I know this is not elegant:
result = pd.DataFrame(index=df.index)
for i, r in df.iterrows():
    high = r['data'] + 10
    low = r['data'] - 10
    df2 = df.loc[(df['data'] <= high) & (df['data'] >= low)]
    result.loc[i, 'date'] = r['date']
    result.loc[i, 'data'] = r['data']
    result.loc[i, 'count'] = df2.shape[0]
result
Is there any more Pandas-style way to do that?
Thank you for any help!
Use numpy broadcasting for the boolean mask, and to count the Trues use sum:
arr = df['data'].to_numpy()
df['count'] = ((arr[:, None] <= arr+10)&(arr[:, None] >= arr-10)).sum(axis=1)
print (df)
date data count
0 2012/1/1 100 4
1 2012/1/2 109 4
2 2012/1/3 108 4
3 2012/1/4 120 2
4 2012/1/5 80 1
5 2012/1/6 130 3
6 2012/1/7 100 4
7 2012/1/8 140 2
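For completeness, here is a self-contained sketch of the same idea (my rewording) that builds the example frame and uses the equivalent absolute-difference form of the mask; note that the pairwise comparison allocates an n x n boolean array, so it trades memory for speed on large frames:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2012/1/1', '2012/1/2', '2012/1/3', '2012/1/4',
                            '2012/1/5', '2012/1/6', '2012/1/7', '2012/1/8'],
                   'data': [100, 109, 108, 120, 80, 130, 100, 140]})

arr = df['data'].to_numpy()
# |data_i - data_j| <= 10 is the same condition as the two-sided comparison above
df['count'] = (np.abs(arr[:, None] - arr) <= 10).sum(axis=1)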

Conditionally insert columns of one Pandas dataframe into columns of another dataframe

I have 2 dataframes:
dfA = pd.DataFrame({'label': [1,5,2,4,2,3],
                    'group': ['A']*3 + ['B']*3,
                    'x': [np.nan]*3 + [1,2,3],
                    'y': [np.nan]*3 + [1,2,3]})
dfB = pd.DataFrame({'uniqid': [1,2,3,4,5,6,7],
                    'horizontal': [34,41,23,34,23,43,22],
                    'vertical': [98,67,19,57,68,88,77]})
...which look like:
label group x y
0 1 A NaN NaN
1 5 A NaN NaN
2 2 A NaN NaN
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
uniqid horizontal vertical
0 1 34 98
1 2 41 67
2 3 23 19
3 4 34 57
4 5 23 68
5 6 43 88
6 7 22 77
Basically, dfB contains 'horizontal' and 'vertical' values for all unique IDs. I want to populate the 'x' and 'y' columns in dfA with the 'horizontal' and 'vertical' values in dfB but only for group A; data for group B should remain unchanged.
The desired output would be:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
I've used .merge() to add the extra columns to the dataframe for both groups A and B, then copied the data into the x and y columns for group A only, and finally dropped the columns that came from dfB.
dfA = dfA.merge(dfB, how = 'left', left_on = 'label', right_on = 'uniqid')
dfA.loc[dfA['group'] == 'A','x'] = dfA.loc[dfA['group'] == 'A','horizontal']
dfA.loc[dfA['group'] == 'A','y'] = dfA.loc[dfA['group'] == 'A','vertical']
dfA = dfA[['label','group','x','y']]
The correct output is produced:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
...but this is a really, really ugly solution. Is there a better solution?
combine_first
dfA.set_index(['label', 'group']).combine_first(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
fillna
Works as well
dfA.set_index(['label', 'group']).fillna(
dfB.set_axis(['label', 'x', 'y'], axis=1).set_index(['label'])
).reset_index()
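A brief note on why group B is left untouched in both variants (my reading, not stated in the answer): combine_first and fillna only fill positions that are NaN in dfA, and the group B rows already have x and y values. A quick check, assuming the dfA construction from the question:
# group B rows have no NaN in x/y, so neither combine_first nor fillna changes them
print(dfA.loc[dfA['group'] == 'B', ['x', 'y']].isna().any().any())  # False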
We can try loc to extract/update only the part we want. And since you are merging on one column, which also has unique values in dfB, you can use set_index and loc/reindex:
mask = dfA['group']=='A'
dfA.loc[mask, ['x','y']] = (dfB.set_index('uniqid')
                               .loc[dfA.loc[mask, 'label'], ['horizontal','vertical']]
                               .values)
Output:
label group x y
0 1 A 34.0 98.0
1 5 A 23.0 68.0
2 2 A 41.0 67.0
3 4 B 1.0 1.0
4 2 B 2.0 2.0
5 3 B 3.0 3.0
Note that the above would fail if some of dfA.label is not in dfB.uniqid, in which case we need to use reindex:
dfA.loc[mask, ['x','y']] = (dfB.set_index('uniqid')
                               .reindex(dfA.loc[mask, 'label'])
                               [['horizontal','vertical']]
                               .values)
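As a cross-check, a sketch using Series.map per column (an alternative not used in the answers above, assuming the dfA/dfB construction from the question) gives the same result and also tolerates labels missing from dfB, which simply stay NaN:
mask = dfA['group'] == 'A'
lookup = dfB.set_index('uniqid')
# map each label to its horizontal/vertical value; unmatched labels become NaN
dfA.loc[mask, 'x'] = dfA.loc[mask, 'label'].map(lookup['horizontal'])
dfA.loc[mask, 'y'] = dfA.loc[mask, 'label'].map(lookup['vertical'])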

Python Drop all instances of Feature from DF if NaN thresh is met

Using df.dropna(thresh = x, inplace=True), I can successfully drop the rows lacking at least x non-nan values.
But because my df looks like:
2001 2002 2003 2004
bob A 123 31 4 12
bob B 41 1 56 13
bob C nan nan 4 nan
bill A 451 8 nan 24
bill B 32 5 52 6
bill C 623 12 41 14
#Repeating features (A,B,C) for each index/name
This drops the one row/instance where the thresh= condition is met, but leaves the other instances of that feature.
What I want is something that drops the entire feature, if the thresh is met for any one row, such as:
df.dropna(thresh = 2, inplace=True):
2001 2002 2003 2004
bob A 123 31 4 12
bob B 41 1 56 13
bill A 451 8 nan 24
bill B 32 5 52 6
#Drops C from the whole df
wherein C is removed from the entire df, not just the one time it meets the condition under bob
Your sample looks like a MultiIndex dataframe where index level 1 is the feature (A, B, C) and index level 0 is the name. You may use notna and sum to create a mask identifying rows whose number of non-NaN values is less than 2, and take their index level 1 values. Finally, use df.query to slice rows:
a = df.notna().sum(1).lt(2).loc[lambda x: x].index.get_level_values(1)
df_final = df.query('ilevel_1 not in @a')
Out[275]:
2001 2002 2003 2004
bob A 123.0 31.0 4.0 12.0
B 41.0 1.0 56.0 13.0
bill A 451.0 8.0 NaN 24.0
B 32.0 5.0 52.0 6.0
Method 2:
Use notna, sum, groupby and transform to create a mask that is True for groups where every row has at least 2 non-NaN values. Finally, use this mask to slice rows:
m = df.notna().sum(1).groupby(level=1).transform(lambda x: x.ge(2).all())
df_final = df[m]
Out[296]:
2001 2002 2003 2004
bob A 123.0 31.0 4.0 12.0
B 41.0 1.0 56.0 13.0
bill A 451.0 8.0 NaN 24.0
B 32.0 5.0 52.0 6.0
Keep only the rows with at least 5 non-NA values.
df.dropna(thresh=5)
thresh is for keeping rows with a minimum number of non-NaN values.
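For anyone who wants to reproduce the second approach, here is a self-contained sketch (my construction of the sample frame, assuming the names and features form a two-level index):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['bob', 'bill'], ['A', 'B', 'C']])
df = pd.DataFrame([[123, 31, 4, 12],
                   [41, 1, 56, 13],
                   [np.nan, np.nan, 4, np.nan],
                   [451, 8, np.nan, 24],
                   [32, 5, 52, 6],
                   [623, 12, 41, 14]],
                  index=idx, columns=[2001, 2002, 2003, 2004])

# keep a feature only if every row of that feature has at least 2 non-NaN values
m = df.notna().sum(axis=1).groupby(level=1).transform(lambda x: x.ge(2).all())
print(df[m])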

Python Pandas - difference between 'loc' and 'where'?

Just curious on the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID': [1,2,3,4,5,6,7,8,9,10],
                   'Run Distance': [234,35,77,787,243,5435,775,123,355,123],
                   'Goals': [12,23,56,7,8,0,4,2,1,34],
                   'Gender': ['m','m','m','f','f','m','f','m','f','m']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
If you check the docs for DataFrame.where, it replaces values where the condition is not met - by default with NaN, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax is called boolean indexing and is for filtering rows - it keeps only the rows that match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by condition and columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the values in rows that don't match the condition (with NaN by default).
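To tie the two together, a small sketch (my addition) shows that where followed by dropping the all-NaN rows selects the same rows as loc, apart from the dtype upcast to float:
cond = df['Goals'] > 10
via_where = df.where(cond).dropna(how='all')
via_loc = df.loc[cond]
print(via_where.index.equals(via_loc.index))  # True: same rows, but where upcasts ints to float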

opposite of df.diff() in pandas

I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of each row with the previous row - the opposite of the .diff() function, which takes the difference.
this is how I'm currently solving the problem
df = pd.DataFrame({'c': ['dd','ee','ff','gg','hh'], 'd': [1,2,3,4,5]})
df['e'] = df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling, due to a MultiIndex or something else, you can try using .cumsum() and then .diff(2) to subtract the .cumsum() result from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, to get an inverse of .diff(n) you should be able to do .cumsum().diff(n+1). The issue is that the first n+1 results will be NaN.
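And to verify the rolling answer against the question's own shift-based version, a quick check (my addition) on the original frame:
df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})
# sum of each row with the next row, written two ways
via_shift = df['d'] + df['d'].shift(-1)
via_rolling = df['d'].rolling(2).sum().shift(-1)
print(via_shift.equals(via_rolling))  # True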
