Pandas create new dataframe by querying other dataframes without using iterrows - python

I have two huge dataframes that share the same id field. I want to make a simple summary dataframe showing the maximum of specific columns. I understand iterrows() is frowned upon, so is there a one-liner (or two) to do this? I don't understand lambda/apply very well, but maybe that would work here.
Stand-alone example
import pandas as pd

myid = [1, 1, 2, 3, 4, 4, 5]
name = ['A', 'A', 'B', 'C', 'D', 'D', 'E']
x = [15, 12, 3, 3, 1, 4, 8]
df1 = pd.DataFrame(list(zip(myid, name, x)),
                   columns=['myid', 'name', 'x'])
display(df1)

myid = [1, 2, 2, 2, 3, 4, 5, 5]
name = ['A', 'B', 'B', 'B', 'C', 'D', 'E', 'E']
y = [9, 6, 3, 4, 6, 2, 8, 2]
df2 = pd.DataFrame(list(zip(myid, name, y)),
                   columns=['myid', 'name', 'y'])
display(df2)

mylist = df1['myid'].unique()
df_summary = pd.DataFrame(mylist, columns=['MY_ID'])
## do work here...
Desired output

Use merge() followed by groupby with named aggregations:
(df1.merge(df2, on=["myid", "name"], how="outer")
    .groupby(["myid", "name"], as_index=False)
    .agg(MAX_X=("x", "max"), MAX_Y=("y", "max")))
   myid name  MAX_X  MAX_Y
0     1    A     15      9
1     2    B      3      6
2     3    C      3      6
3     4    D      4      2
4     5    E      8      8
Updated: you have noted that your data frames are large and that this solution gives you an OOM error. Logically, aggregating first and then merging will use less memory:
pd.merge(
    df1.groupby(["myid", "name"], as_index=False).agg(MAX_X=("x", "max")),
    df2.groupby(["myid", "name"], as_index=False).agg(MAX_Y=("y", "max")),
    on=["myid", "name"]
)

You can try concat + groupby.max:
out = (pd.concat((df1, df2), sort=False)
         .groupby(['myid', 'name']).max()
         .add_prefix("Max_").reset_index())
myid name Max_x Max_y
0 1 A 15.0 9.0
1 2 B 3.0 6.0
2 3 C 3.0 6.0
3 4 D 4.0 2.0
4 5 E 8.0 8.0

Related

How to pass the value of previous row to the dataframe apply function?

I have the following pandas dataframe and would like to build a new column 'c' which is the sum of the value in column 'b' and the previous value in column 'a'. By shifting column 'a' it is possible to do so. However, I would like to know how I can pass the previous values of column 'a' inside the apply() function.
l1 = [1,2,3,4,5]
l2 = [3,2,5,4,6]
df = pd.DataFrame(data=l1, columns=['a'])
df['b'] = l2
df['shifted'] = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['shifted']+ row['b'], axis=1)
print(df)
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
I appreciate your help.
Edit: this is a dummy example. I need to use the apply function because I'm passing another function to it which uses previous rows of some columns and checks some condition.
First, let's make it clear that you do not need apply for this simple operation, so I'll treat it as a dummy example of a more complex function.
Assuming non-duplicate indices, you can generate a shifted Series and reference it inside apply through the row's name attribute:
s = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['b'] + s[row.name], axis=1)
output:
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
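For completeness: in this dummy example the new column needs no apply at all, and the same row.name pattern extends to a named function that checks a condition on the previous row. A minimal sketch follows; the specific condition and the helper name my_func are made up for illustration, not taken from the original question:
# Apply-free equivalent of the dummy example:
df['c'] = df['b'] + df['a'].shift(1)

# Sketch: a custom function that uses the previous row's 'a' via row.name.
# The condition (previous 'a' smaller than current 'b') is hypothetical.
prev_a = df['a'].shift(1)

def my_func(row):
    prev = prev_a[row.name]          # previous row's 'a' value
    if pd.notna(prev) and prev < row['b']:
        return row['b'] + prev
    return row['b']

df['c2'] = df.apply(my_func, axis=1)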

How to combine duplicate rows in python pandas

I have a data frame similar to the one listed below. For some reason, each team is listed twice, one listing corresponding to each column.
import pandas as pd
import numpy as np
d = {'Team': ['1', '2', '3', '1', '2', '3'], 'Points for': [5, 10, 15, np.nan,np.nan,np.nan], 'Points against' : [np.nan,np.nan,np.nan, 3, 6, 9]}
df = pd.DataFrame(data=d)
  Team  Points for  Points against
0    1           5             NaN
1    2          10             NaN
2    3          15             NaN
3    1         NaN               3
4    2         NaN               6
5    3         NaN               9
How can I just combine rows of duplicate team names so that there are no missing values? This is what I would like:
Team Points for Points against
0 1 5 3
1 2 10 6
2 3 15 9
I have been trying to figure it out with pandas, but can't seem to get it. Thanks!
I made changes to your code, replacing string 'Nan' with numpy's nan.
One solution is to melt the data, drop the null entries, and pivot back to wide from long:
df = (df
      .melt('Team')
      .dropna()
      .pivot(index='Team', columns='variable', values='value')
      .reset_index()
      .rename_axis(None, axis='columns')
      .astype(int)
      )
df
Team Points against Points for
0 1 3 5
1 2 6 10
2 3 9 15
One way, using groupby and first:
df = df.replace("Nan", np.nan)
new_df = df.groupby("Team").first()
print(new_df)
Output:
Points for Points against
Team
1 5.0 3.0
2 10.0 6.0
3 15.0 9.0
You need to groupby the unique identifiers. If there is also a game ID or date or something like that, you might need to group on that as well.
df.groupby('Team').agg({'Points for': 'max', 'Points against': 'max'})
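To illustrate the game-ID point above, here is a minimal sketch with a hypothetical game_id column (not part of the posted data), where each (Team, game) pair is split across two complementary rows:
import pandas as pd
import numpy as np

# Hypothetical data: game_id is invented for illustration only.
d2 = {'Team': ['1', '1', '1', '1'],
      'game_id': [1, 1, 2, 2],
      'Points for': [5, np.nan, 7, np.nan],
      'Points against': [np.nan, 3, np.nan, 2]}
df2 = pd.DataFrame(d2)

# Group on both keys so rows from different games are not collapsed together.
out = df2.groupby(['Team', 'game_id'], as_index=False).agg(
    {'Points for': 'max', 'Points against': 'max'})
print(out)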
pd.pivot_table(df, values=['Points for', 'Points against'], index=['Team'],
               aggfunc=np.sum)[['Points for', 'Points against']]
Output
Points for Points against
Team
1 5.0 3.0
2 10.0 6.0
3 15.0 9.0

Merging dataframes with different dimensions and related data

I have 2 dataframes of different sizes, with related data, that I want to merge in an efficient way:
master_df = pd.DataFrame({'kpi_1': [1, 2, 3, 4]},
                         index=['dn1_app1_bar.com',
                                'dn1_app2_bar.com',
                                'dn2_app1_foo.com',
                                'dn2_app2_foo.com'])
guard_df = pd.DataFrame({'kpi_2': [1, 2],
                         'kpi_3': [10, 20]},
                        index=['dn1_bar.com', 'dn2_foo.com'])
master_df:
kpi_1
dn1_app1_bar.com 1
dn1_app2_bar.com 2
dn2_app1_foo.com 3
dn2_app2_foo.com 4
guard_df:
kpi_2 kpi_3
dn1_bar.com 1 10
dn2_foo.com 2 20
I want to get a dataframe where the values from the guard_df row indexed with <group>_<name> are 'propagated' to all master_df rows matching <group>_.*_<name>.
Expected result:
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1.0 10.0
dn1_app2_bar.com 2 1.0 10.0
dn2_app1_foo.com 3 2.0 20.0
dn2_app2_foo.com 4 2.0 20.0
What I've managed so far is the following basic approach:
def eval_base_dn(dn):
    chunks = dn.split('_')
    return '_'.join((chunks[0], chunks[2]))

for dn in master_df.index:
    for col in guard_df.columns:
        master_df.loc[dn, col] = guard_df.loc[eval_base_dn(dn), col]
but I'm looking for a more performant way to "broadcast" the values and merge the dataframes.
With pandas 0.25+ it is possible to pass an array (here, the transformed index) to the left_on parameter of merge, with a left join:
master_df = master_df.merge(guard_df,
                            left_on=master_df.index.str.replace('_.+_', '_', regex=True),
                            right_index=True,
                            how='left')
print (master_df)
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20
Try this one:
>>> (pd.merge(master_df.assign(guard_df_id=master_df.index.str.split("_")
...                            .map(lambda x: "{0}_{1}".format(x[0], x[-1]))),
...           guard_df, left_on="guard_df_id", right_index=True)
...    .drop(["guard_df_id"], axis=1))
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20
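Another way to express the same "broadcast", sketched here as an alternative that is not from the original answers: starting from the original frames, derive the base key once, look the guard rows up with reindex, and join index-on-index.
# Sketch: derive the <group>_<name> key for every master_df row, pull the
# matching guard_df rows with reindex, re-label them with master_df's index,
# then join on the index.
base_key = master_df.index.str.replace('_.+_', '_', regex=True)
propagated = guard_df.reindex(base_key).set_index(master_df.index)
result = master_df.join(propagated)
print(result)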

populate missing values for multiple columns with multiple values

I have gone through the posts about filling multiple columns in pandas in one go, but my problem here is a little different: I need to populate a missing column's value with the value of another specific column, and be able to do that for multiple columns in one go.
E.g. I can use the commands below individually to fill the NAs:
result1_copy['BASE_B'] = np.where(pd.isnull(result1_copy['BASE_B']), result1_copy['BASE_S'], result1_copy['BASE_B'])
result1_copy['QWE_B'] = np.where(pd.isnull(result1_copy['QWE_B']), result1_copy['QWE_S'], result1_copy['QWE_B'])
However, if I try populating them in one go, it does not work:
result1_copy['BASE_B','QWE_B'] = result1_copy['BASE_B', 'QWE_B'].fillna(result1_copy['BASE_S','QWE_S'])
Do we know why?
Please note I have only used 2 columns here for simplicity; in reality I have tens of columns to impute, and they are either object, float or datetime.
Are the datatypes the issue here?
You need to add [] to select a filtered DataFrame (with single brackets, ('BASE_B', 'QWE_B') is looked up as one column label), and because fillna aligns on column labels, you also need rename so the S columns line up with the B columns:
d = {'BASE_S': 'BASE_B', 'QWE_S': 'QWE_B'}
result1_copy[['BASE_B', 'QWE_B']] = (result1_copy[['BASE_B', 'QWE_B']]
                                     .fillna(result1_copy[['BASE_S', 'QWE_S']]
                                             .rename(columns=d)))
More dynamic solution:
L = ['BASE_', 'QWE_']
orig = ['{}B'.format(x) for x in L]
new = ['{}S'.format(x) for x in L]
d = dict(zip(new, orig))
result1_copy[orig] = (result1_copy[orig]
                      .fillna(result1_copy[new].rename(columns=d)))
Another solution, if the B and S columns match by prefix, is a simple loop:
for x in ['BASE_', 'QWE_']:
    result1_copy[x + 'B'] = result1_copy[x + 'B'].fillna(result1_copy[x + 'S'])
Sample:
result1_copy = pd.DataFrame({'A': list('abcdef'),
                             'BASE_B': [np.nan, 5, 4, 5, 5, np.nan],
                             'QWE_B': [np.nan, 8, 9, 4, 2, np.nan],
                             'BASE_S': [1, 3, 5, 7, 1, 0],
                             'QWE_S': [5, 3, 6, 9, 2, 4],
                             'F': list('aaabbb')})
print (result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a NaN 1 a NaN 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f NaN 0 b NaN 4
d = {'BASE_S': 'BASE_B', 'QWE_S': 'QWE_B'}
result1_copy[['BASE_B', 'QWE_B']] = (result1_copy[['BASE_B', 'QWE_B']]
                                     .fillna(result1_copy[['BASE_S', 'QWE_S']]
                                             .rename(columns=d)))
print (result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a 1.0 1 a 5.0 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f 0.0 0 b 4.0 4

Outer join in python Pandas

I have two data sets as follows:
  A       B
 IDs     IDs
  1       1
  2       2
  3       5
  4       7
How, in Pandas/NumPy, can we apply a join that gives me all the data from B which is not present in A?
Something like the following:
  B
 IDs
  5
  7
I know it can be done with a for loop, but I don't want that, since my real data is in the millions, and I am really not sure how to use Pandas/NumPy here. Something like the following:
pd.merge(A, B, on='ids', how='right')
Thanks
You can use NumPy's setdiff1d, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d could be used for the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
IDs
1 7
2 5
You can use merge with the indicator parameter and then boolean indexing. Finally, you can drop the _merge column:
A = pd.DataFrame({'IDs': [1, 2, 3, 4],
                  'B': [4, 5, 6, 7],
                  'C': [1, 8, 9, 4]})
print (A)
B C IDs
0 4 1 1
1 5 8 2
2 6 9 3
3 7 4 4
B = pd.DataFrame({'IDs': [1, 2, 5, 7],
                  'A': [1, 8, 3, 7],
                  'D': [1, 8, 9, 4]})
print (B)
A D IDs
0 1 1 1
1 8 8 2
2 3 9 5
3 7 4 7
df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
B C IDs A D
4 NaN NaN 5.0 3.0 9.0
5 NaN NaN 7.0 7.0 4.0
You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable "C" then yields
C
0 5
1 7
You can simply use pandas' .isin() method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
IDs
2 5
3 7
