Hi I have 2 dataframes as below:
df1:

key  Apple  Banana
abc      1      12
bcd     23      21

df2:

key  Train  Car
abc     11   20
jkn      2   19
I want to merge these 2 dataframes together with my key column so that I can get following table:
key  Train  Car  Banana  Apple
abc     11   20      12      1
jkn      2   19    0/NA   0/NA
bcd   0/NA 0/NA      21     23
For columns where I don't have any record, like jkn / Apple, either 0 or NA should be printed.
Currently I have tried pd.concat, but I am not able to figure out how to get my desired result.
Use pd.merge() with how='outer' (read further in the documentation):
import pandas as pd
import io

# rebuild the two sample frames from whitespace-separated text
data_string = """key Apple Banana
abc 1 12
bcd 23 21
"""
df1 = pd.read_csv(io.StringIO(data_string), sep=r'\s+')

data_string = """key Train Car
abc 11 20
jkn 2 19"""
df2 = pd.read_csv(io.StringIO(data_string), sep=r'\s+')

# Solution
df_result = pd.merge(df1, df2, on=['key'], how='outer')
print(df_result)
key Apple Banana Train Car
0 abc 1.0 12.0 11.0 20.0
1 bcd 23.0 21.0 NaN NaN
2 jkn NaN NaN 2.0 19.0
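If you want 0 instead of NaN for the missing combinations, as asked in the question, you can fill the gaps afterwards:

# optional: replace the NaN placeholders with 0
df_result = df_result.fillna(0)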
Let's try concat then groupby.sum
out = (pd.concat([df1, df2], ignore_index=True)
         .groupby('key', as_index=False).sum())
print(out)
key Apple Banana Train Car
0 abc 1.0 12.0 11.0 20.0
1 bcd 23.0 21.0 0.0 0.0
2 jkn 0.0 0.0 2.0 19.0
Related
Let's say I have 2 dataframes,
both have different lengths but the same number of columns
df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'],
                    'population': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'],
                    'population': [44,12,23,64]})
Let's assume that some of the data in df1 is outdated, and I've received a new dataframe that contains data which may or may not already exist in the outdated dataframe.
I want to find out whether any of the values of df2.country are inside df1.country.
The following returns a boolean Series:
df = df1.country.isin(df2.country)
print(df)
Unfortunately this just creates a new Series containing the answer to my question:
0 True
1 False
2 True
3 True
4 False
5 False
Name: country, dtype: bool
My goal here is to delete the rows of df1 whose values match with df2 and add the new data, kind of like an update.
I've managed to come up with something like this:
df = df1.country.isin(df2.country)
i = 0
for x in df:
    if x:
        df1.drop(i, inplace=True)
    i += 1

frames = [df1, df2]
df1 = pd.concat(frames)
df1.reset_index(drop=True, inplace=True)
print(df1)
which in fact works and updates the dataframe
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
But I really believe there's a better and more practical way of doing the same thing, considering that the real dataframe is much bigger and updates every few seconds.
I'd love to hear some suggestions. Thanks!
Assuming col1 remains unique in the original dataframe, you can join the two tables together. Once you have them in the same dataframe, you can apply your logic, i.e. take the value from the new dataframe if it is not null. You actually don't need to check whether col2 has changed for every entry in col1: you can simply replace col2_x with col2_y wherever the latter is not NaN (based on your sample output).
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': ['a','f','r','g','d','s'], 'col2': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'col1': ['a','g','o','r'], 'col2': [44,12,23,64]})

# do the join
x = pd.merge(df1, df2, how='outer', left_on="col1", right_on="col1")
col1 col2_x col2_y
0 a 41.0 44.0
1 f 12.0 NaN
2 r 26.0 64.0
3 g 64.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o NaN 23.0
# apply your update rules
x['col2_x'] = np.where(
    ~x['col2_y'].isnull(),
    x['col2_y'], x['col2_x']
)
col1 col2_x col2_y
0 a 44.0 44.0
1 f 12.0 NaN
2 r 64.0 64.0
3 g 12.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o 23.0 23.0
# clean up
x.drop("col2_y", axis=1, inplace=True)
x.columns = ["col1", "col2"]
col1 col2
0 a 44.0
1 f 12.0
2 r 64.0
3 g 12.0
4 d 123.0
5 s 24.0
6 o 23.0
The isin approach is so close! Simply use the results from isin as a mask, then concat the rows of df1 that are not in (~) df2 with df2 itself:
m = df1['country'].isin(df2['country'])
df3 = pd.concat((df1[~m], df2), ignore_index=True)
df3:
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
I have the following dataframes:
df1
Name Leads
0 City0 22
1 City1 11
2 City2 28
3 City3 15
4 City4 14
5 City5 15
6 City6 25
df2
Name Leads
0 City1 13
1 City2 0
2 City4 2
3 City6 5
I'd like to sum the values in the Leads columns only when the values in the Name columns match. I've tried:
df3 = df1['Leads'] + df2['Leads'].where(df1['Name']==df2['Name'])
which returns the error:
ValueError: Can only compare identically-labeled Series objects
Have looked at similar issues on StackOverflow but none fit my specific use. Could someone point me in the right direction?
Assuming df2.Name values are unique and df2 has exactly 2 columns as in your sample, let's try something different using map and defaultdict:
from collections import defaultdict

# map Name -> Leads from df2, defaulting to 0 for names absent from df2
df1.Leads + df1.Name.map(defaultdict(int, df2.to_numpy()))
Out[38]:
0 22
1 24
2 28
3 15
4 16
5 15
6 30
dtype: int64
Let us try merge:
df = df1.merge(df2, on='Name', how='left')
df['Leads'] = df['Leads_x'].add(df['Leads_y'], fill_value=0)
df
Out[9]:
Name Leads_x Leads_y Leads
0 City0 22 NaN 22.0
1 City1 11 13.0 24.0
2 City2 28 0.0 28.0
3 City3 15 NaN 15.0
4 City4 14 2.0 16.0
5 City5 15 NaN 15.0
6 City6 25 5.0 30.0
You can use merge:
df1.merge(df2, how='left', on=['Name']).set_index(['Name']).sum(1).reset_index()
output:
Name 0
0 City0 22.0
1 City1 24.0
2 City2 28.0
3 City3 15.0
4 City4 16.0
5 City5 15.0
6 City6 30.0
You can remove the how argument if you only want the matching elements, resulting in this output:
Name 0
0 City1 24
1 City2 28
2 City4 16
3 City6 30
If in your actual case you have more columns than Name that you do not wish to sum, include them all in the index right before the sum, as sketched below.
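A minimal sketch, assuming a hypothetical extra column Region in df1 that should not be summed:

out = (df1.merge(df2, how='left', on=['Name'])
          .set_index(['Name', 'Region'])   # park every non-summed column in the index
          .sum(axis=1)
          .reset_index())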
I am also new to Python, and I am sure there are people who can solve this in a better way, but the solution below worked when I tried it on my system. You can give it a try too:
for i in df2.Name:
    temp = df1[df1.Name == i].Leads.sum() + df2[df2.Name == i].Leads.sum()
    df1.loc[df1.Name == i, 'Leads'] = temp
You could work with a merge and sum across the columns:
df1['Leads'] = df1.merge(df2, on='Name', how='outer').filter(like='Lead').sum(1)
Name Leads
0 City0 22.0
1 City1 24.0
2 City2 28.0
3 City3 15.0
4 City4 16.0
5 City5 15.0
6 City6 30.0
You can try:
df1.set_index('Name').add(df2.set_index('Name')).dropna().reset_index()
Output:
Name Leads
0 City1 24.0
1 City2 28.0
2 City4 16.0
3 City6 30.0
This uses data alignment by setting indexes on the dataframes, then drops the NaN values produced where the indexes don't match df2.
df1 =
date column1 column2 column3 column4
1 123 124 125 126
2 23 24 25 26
3 42 43 44 45
df2 =
date c_xyz1 c_xyz2 c_xyz3
1 123 124 125
2 23 24 25
3 42 43 44
i need output like
df3 =
date column1 column2 column3 column4
1 123 124 125 126
2 23 24 25 26
3 42 43 44 45
date c_xyz1 c_xyz2 c_xyz3
1 123 124 125
2 23 24 25
3 42 43 44
I need to append df2 below df1, keeping each frame's own column names: df1 has 4 columns and df2 has 3 columns with different names, and both headers need to appear.
If this can be done while writing the xlsx file, that is fine too.
I'm pretty sure you can just use
df3 = pandas.concat([df1, df2])
though you might want
pandas.concat([df1, df2], axis=1)
import pandas
df1 = pandas.DataFrame({"A":[1,2,3],"B":[2,3,4]})
df2 = pandas.DataFrame({"C":[3,4,5],"D":[4,5,6]})
df3 = pandas.concat([df1,df2])
"""
A B C D
0 1.0 2.0 NaN NaN
1 2.0 3.0 NaN NaN
2 3.0 4.0 NaN NaN
0 NaN NaN 3.0 4.0
1 NaN NaN 4.0 5.0
2 NaN NaN 5.0 6.0
"""
df3_alt = pandas.concat([df1,df2],axis=1)
"""
A B C D
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
"""
You could combine them to make a CSV that has a different set of columns in the middle, I guess:
with open("out.csv", "w") as f:
    f.write("\n".join([df1.to_csv(), df2.to_csv()]))

print("".join([df1.to_csv(), df2.to_csv()]))
"""
,A,B
0,1,2
1,2,3
2,3,4
,C,D
0,3,4
1,4,5
2,5,6
"""
Here is the Excel version:
with pandas.ExcelWriter("out.xlsx") as w:
    df1.to_excel(w)
    df2.to_excel(w, startrow=len(df1)+1)
I write into xlsx using:
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
workbook = writer.book
worksheet = workbook.add_worksheet('Validation')
writer.sheets['Validation'] = worksheet
df.to_excel(writer, sheet_name='Validation', startrow=0, startcol=0)
another_df.to_excel(writer, sheet_name='Validation', startrow=20, startcol=0)
Using df.dropna(thresh = x, inplace=True), I can successfully drop the rows lacking at least x non-nan values.
But because my df looks like:
2001 2002 2003 2004
bob A 123 31 4 12
bob B 41 1 56 13
bob C nan nan 4 nan
bill A 451 8 nan 24
bill B 32 5 52 6
bill C 623 12 41 14
#Repeating features (A,B,C) for each index/name
This drops the one row/instance that fails the thresh condition, but leaves the other instances of that feature.
What I want is something that drops the entire feature if the thresh condition fails for any one of its rows, such as:
df.dropna(thresh = 2, inplace=True):
2001 2002 2003 2004
bob A 123 31 4 12
bob B 41 1 56 13
bill A 451 8 nan 24
bill B 32 5 52 6
#Drops C from the whole df
wherein C is removed from the entire df, not just the one time it meets the condition under bob
Your sample looks like a MultiIndex dataframe where index level 1 is the feature (A, B, C) and index level 0 is the name. You may use notna and sum to create a mask identifying rows where the number of non-NaN values is less than 2, and get their index level 1 values. Finally, use df.query to slice the rows:
a = df.notna().sum(1).lt(2).loc[lambda x: x].index.get_level_values(1)
df_final = df.query('ilevel_1 not in @a')
Out[275]:
2001 2002 2003 2004
bob A 123.0 31.0 4.0 12.0
B 41.0 1.0 56.0 13.0
bill A 451.0 8.0 NaN 24.0
B 32.0 5.0 52.0 6.0
Method 2:
Use notna, sum, groupby and transform to create a mask that is True on groups whose rows all have at least 2 non-NaN values. Finally, use this mask to slice the rows:
m = df.notna().sum(1).groupby(level=1).transform(lambda x: x.ge(2).all())
df_final = df[m]
Out[296]:
2001 2002 2003 2004
bob A 123.0 31.0 4.0 12.0
B 41.0 1.0 56.0 13.0
bill A 451.0 8.0 NaN 24.0
B 32.0 5.0 52.0 6.0
Keep only the rows with at least 5 non-NA values.
df.dropna(thresh=5)
thresh is for keeping rows with a minimum number of non-NaN values.
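A minimal sketch of thresh on the question's sample (values taken from the question); note it only drops the single failing row, not the whole feature:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['bob', 'bill'], ['A', 'B', 'C']])
df = pd.DataFrame({2001: [123, 41, np.nan, 451, 32, 623],
                   2002: [31, 1, np.nan, 8, 5, 12],
                   2003: [4, 56, 4, np.nan, 52, 41],
                   2004: [12, 13, np.nan, 24, 6, 14]}, index=idx)

# keep rows with at least 2 non-NaN values; only ('bob', 'C') is dropped
print(df.dropna(thresh=2))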
Total meltdown here, need some assistance.
I have a DataFrame with 10m+ rows and some 150 columns, including two id columns, looking like below:
df = pd.DataFrame({'id1': [1, 2, 5, 3, 6, 4],
                   'id2': [2, 1, np.nan, 4, np.nan, 3],
                   'num': [123, 3231, 123, 231, 6534, 2394]})
id1 id2 num
0 1 2.0 123
1 2 1.0 3231
2 5 NaN 123
3 3 4.0 231
4 6 NaN 6534
5 4 3.0 2394
Rows 0 and 1 form a pair given id1 and id2, and rows 3 and 5 form a pair in the same way. I want the table below, where the second row of each pair is merged into the first:
df = pd.DataFrame({'id1': [1, 5, 3, 6],
                   'id2': [2, np.nan, 4, np.nan],
                   'num': [123, 123, 231, 6534],
                   '2_num': [3231, np.nan, 2394, np.nan]})

   id1  id2   num   2_num
0    1  2.0   123  3231.0
1    5  NaN   123     NaN
2    3  4.0   231  2394.0
3    6  NaN  6534     NaN
How can this be achieved using id1 and id2, labeling all columns coming from the second row of the pair with a "2_" prefix?
Here's a merge-based approach (thanks to @pirSquared for the improvement), i.e.
ndf = (df.merge(df, 'left', left_on=['id1', 'id2'], right_on=['id2', 'id1'],
                suffixes=['', '_2'])
         .drop(['id1_2', 'id2_2'], axis=1))
cols = ['id1', 'id2']
ndf[cols] = np.sort(ndf[cols], axis=1)
new = ndf.drop_duplicates(subset=cols, keep='first')
id1 id2 num num_2
0 1.0 2.0 123 3231.0
2 5.0 NaN 123 NaN
3 3.0 4.0 231 2394.0
4 6.0 NaN 6534 NaN
The idea is to sort each pair of ids so that we group by them.
cols = ['id1', 'id2']
df[cols] = np.sort(df[cols], axis=1)

df.set_index(
    cols + [df.fillna(-1).groupby(cols).cumcount() + 1]
).num.unstack().add_suffix('_num').reset_index()
id1 id2 1_num 2_num
0 1.0 2.0 123.0 3231.0
1 3.0 4.0 231.0 2394.0
2 5.0 NaN 123.0 NaN
3 6.0 NaN 6534.0 NaN
Use:
df[['id1','id2']] = pd.DataFrame(np.sort(df[['id1','id2']].values, axis=1)).fillna('tmp')
print (df)
id1 id2 num
0 1.0 2 123
1 1.0 2 3231
2 5.0 tmp 123
3 3.0 4 231
4 6.0 tmp 6534
5 3.0 4 2394
df1 = df.groupby(['id1','id2'])['num'].apply(list)
print (df1)
id1 id2
1.0 2.0 [123, 3231]
3.0 4.0 [231, 2394]
5.0 tmp [123]
6.0 tmp [6534]
Name: num, dtype: object
df2 = (pd.DataFrame(df1.values.tolist(),
                    index=df1.index,
                    columns=['num', '2_num'])
         .reset_index()
         .replace('tmp', np.nan))
print (df2)
id1 id2 num 2_num
0 1.0 2.0 123 3231.0
1 3.0 4.0 231 2394.0
2 5.0 NaN 123 NaN
3 6.0 NaN 6534 NaN