Merge dataframe dynamic - python

I have 2 dataframes, df1 and df2, and I would like to merge them using the link column in df2. The link column contains comma-separated column=value pairs that identify matching rows in df1:
df1 = pd.DataFrame({'p':[1,2,3,4], 'a':[1,2,2,2],'b':['z','z','z','z'],'c':[3,3,4,4],'d':[5,5,5,6]})
df2 = pd.DataFrame({'e':[11,22,33,44], 'link':['a=1,c=3','a=2,c=3','a=2,c=4,d=5','a=2,c=4']})
The result should be a dataframe like this, where column e from df2 is merged into df1:
df_res = pd.DataFrame({'p':[1,2,3,3,4], 'a':[1,2,2,2,2],'b':['z','z','z','z','z'],'c':[3,3,4,4,4],'d':[5,5,5,5,6],'e':[11,22,33,44,44]})
How can this be done in pandas?

df1["e"] = df2["e"]

frames = [df1, df2]
result = pd.concat(frames)
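The assignment and plain concat above just line rows up by position and ignore the link criteria entirely. A sketch that parses each link string into column/value pairs and filters df1 row by row (assuming the linked columns hold integers, as in the sample data):

```python
import pandas as pd

df1 = pd.DataFrame({'p': [1, 2, 3, 4], 'a': [1, 2, 2, 2],
                    'b': ['z', 'z', 'z', 'z'], 'c': [3, 3, 4, 4],
                    'd': [5, 5, 5, 6]})
df2 = pd.DataFrame({'e': [11, 22, 33, 44],
                    'link': ['a=1,c=3', 'a=2,c=3', 'a=2,c=4,d=5', 'a=2,c=4']})

parts = []
for e_val, link in zip(df2['e'], df2['link']):
    # 'a=1,c=3' -> {'a': '1', 'c': '3'}
    crit = dict(pair.split('=') for pair in link.split(','))
    mask = pd.Series(True, index=df1.index)
    for col, val in crit.items():
        mask &= df1[col] == int(val)   # assumes linked columns are integers
    matched = df1[mask].copy()
    matched['e'] = e_val
    parts.append(matched)

df_res = pd.concat(parts, ignore_index=True)
print(df_res)
```

A row of df1 can match several links (here the row with p=3 matches both 'a=2,c=4,d=5' and 'a=2,c=4'), which is why the expected result has five rows.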


python - drop duplicated index in place in a pandas dataframe

I have a list of dataframes:
all_df = [df1, df2, df3]
I would like to remove rows with duplicated indices in all dataframes in the list, such that the changes are reflected in the original dataframes df1, df2 and df3.
I tried to do
for df in all_df:
    df = df[~df.index.duplicated()]
But the changes are only applied in the list, not on the original dataframes.
Essentially, I want to avoid doing the following:
df1 = df1[~df1.index.duplicated()]
df2 = df2[~df2.index.duplicated()]
df3 = df3[~df3.index.duplicated()]
all_df = [df1,df2,df3]
You need to recreate the list of DataFrames:
all_df = [df[~df.index.duplicated()] for df in all_df]
Or:
for i, df in enumerate(all_df):
    all_df[i] = df[~df.index.duplicated()]
print(all_df[0])
EDIT: If keeping the names matters, use a dictionary of DataFrames. This still does not modify df1 and df2 in place; select the deduplicated frames by the dict keys:
d = {'price': df1, 'volumes': df2}
d = {k: df[~df.index.duplicated()] for k, df in d.items()}
print(d['price'])
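A minimal runnable illustration (toy data) of why the result must be assigned back rather than rebound inside a plain for loop:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 0, 1])
df2 = pd.DataFrame({'y': [4, 5]}, index=[2, 2])
all_df = [df1, df2]

# rebinding the loop variable would only change that local name;
# a comprehension assigned back to the list keeps the deduplicated frames
all_df = [df[~df.index.duplicated()] for df in all_df]

print(len(df1), len(all_df[0]))   # df1 still has 3 rows; all_df[0] has 2
```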

Pandas error in merging two files based on different number of columns (2 and 1) in 2 dataframes

I have two files with the structures like below:
df1
intA,intB
4933401J01Rik,Gm37180
Gm37686,Gm37363
df2
chr,gene_type,gene_symbol
chr1,TEC,4933401J01Rik
chr2,TEC,Gm37180
chr3,TEC,Gm37363
chr4,TEC,Gm37686
I am trying to merge these two files. So basically I need to lift information from the df2 for columns intA and intB in df1. In the final output, for each column of df1 there should be two additional columns reporting the chr and gene_type based on the df2. The final output should look like:
result
intA,intB,chr,chr,gene_type,gene_type
4933401J01Rik,Gm37180,chr1,chr2,TEC,TEC
Gm37686,Gm37363,chr4,chr3,TEC,TEC
I ran this code, but it gives the error Can only merge Series or DataFrame objects, a <class 'str'> was passed.
df1 = pd.read_csv(df1)
df2 = pd.read_csv(df2)
result = pd.merge(df1, df2, how='left', left_on=['intA','intB'], right_on = ['gene_symbol'])
print(result)
Any help is appreciated - thank you.
You can do it in an idiomatic / Pandas-ish way as follows:
As you intend to merge the contents of two columns (intA, intB) in df1 against a single column (gene_symbol) in df2, you cannot merge them directly: the numbers of columns to match on differ, and pandas raises ValueError: len(right_on) must equal len(left_on).
Instead, first reshape the two columns intA and intB into one column, with their contents in separate rows, before merging.
1. Transform df1 with intA, intB combined into one column with contents in separate rows:
df1a = df1.copy()
df1a.columns = df1a.columns.str.split(r'(int)', expand=True) # split column labels
df1a = df1a.droplevel(level=0, axis=1)
df1a = df1a.stack().rename_axis(index=['index', 'int_type']).reset_index()
2. Merge on new column int (combined intA and intB) from df1 and gene_symbol from df2:
Now, we can merge on the same number of columns from the 2 dataframes:
df_merge = pd.merge(df1a, df2, how='left', left_on='int', right_on='gene_symbol')
# remove column 'gene_symbol' which has same duplicated info as 'int'
df_merge2 = df_merge.drop('gene_symbol', axis=1)
3. Pivot to put intA, intB back to 2 separate columns:
df_out = df_merge2.pivot(index='index', columns='int_type')
df_out.columns = df_out.columns.map(''.join) # combine column labels
Result:
print(df_out)
intA intB chrA chrB gene_typeA gene_typeB
index
0 4933401J01Rik Gm37180 chr1 chr2 TEC TEC
1 Gm37686 Gm37363 chr4 chr3 TEC TEC
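An alternative sketch (not from the original answer): merge twice, once per lookup column, suffixing df2's columns so the two lookups don't collide. The data is reconstructed from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'intA': ['4933401J01Rik', 'Gm37686'],
                    'intB': ['Gm37180', 'Gm37363']})
df2 = pd.DataFrame({'chr': ['chr1', 'chr2', 'chr3', 'chr4'],
                    'gene_type': ['TEC', 'TEC', 'TEC', 'TEC'],
                    'gene_symbol': ['4933401J01Rik', 'Gm37180',
                                    'Gm37363', 'Gm37686']})

# one left merge per column, with suffixed copies of df2
result = (df1
          .merge(df2.add_suffix('A'), left_on='intA',
                 right_on='gene_symbolA', how='left')
          .merge(df2.add_suffix('B'), left_on='intB',
                 right_on='gene_symbolB', how='left')
          .drop(columns=['gene_symbolA', 'gene_symbolB']))
print(result)
```

Column order differs slightly from the stack/pivot answer (chrA/gene_typeA come before chrB/gene_typeB), but the content is the same.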
There's probably a more pandas-ish way to do this, but this will do what you want:
import pandas as pd

df1 = pd.read_csv('a')
df2 = pd.read_csv('b')

rows = []
for index, row in df1.iterrows():
    aMatch = df2.loc[df2['gene_symbol'] == row['intA']]
    bMatch = df2.loc[df2['gene_symbol'] == row['intB']]
    if aMatch.empty or bMatch.empty:
        # malformed data somehow
        print("malformed data")
        continue
    rows.append({'intA': row['intA'],
                 'intB': row['intB'],
                 'chrA': aMatch['chr'].values[0],
                 'chrB': bMatch['chr'].values[0],
                 'gene_typeA': aMatch['gene_type'].values[0],
                 'gene_typeB': bMatch['gene_type'].values[0]})
# DataFrame.append was removed in pandas 2.0, so collect the rows first
df3 = pd.DataFrame(rows, columns=['intA', 'intB', 'chrA', 'chrB',
                                  'gene_typeA', 'gene_typeB'])
result:
intA intB chrA chrB gene_typeA gene_typeB
0 4933401J01Rik Gm37180 chr1 chr2 TEC TEC
1 Gm37686 Gm37363 chr4 chr3 TEC TEC

How to Concat() dataframe having 1 row with dataframe having multiple rows in Python

df1 (shown as an image in the original post)
df2 (shown as an image in the original post)
I need to concatenate columns from df1 and df2: for each row in df2, df1's columns should be appended.
Final result (shown as an image)
Assuming that these are represented as Pandas data frames, the following should return the concatenated data frame you're looking for:
result = pd.concat([df1, df2], axis=1, join="inner")
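Note that an inner concat aligns on the index, so a one-row df1 only pairs with the matching index in df2. If the intent is instead to repeat df1's single row for every row of df2, a cross merge (available since pandas 1.2) broadcasts it. A sketch with made-up columns, since the original frames were only shown as images:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2]})      # single-row frame
df2 = pd.DataFrame({'c': [10, 20, 30]})       # multi-row frame

result = df2.merge(df1, how='cross')          # df1's row repeated per df2 row
print(result)
```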

Pandas merge on part of two columns

I have two dataframes with a common column called 'upc' as such:
df1:
upc
23456793749
78907809834
35894796324
67382808404
93743008374
df2:
upc
4567937
9078098
8947963
3828084
7430083
Notice that df2's 'upc' values are the innermost 7 digits of df1's 'upc' values.
Note that both df1 and df2 have other columns not shown above.
What I want to do is do an inner merge on 'upc' but only on the innermost 7 values. How can I achieve this?
1) Create both dataframes and convert to string type.
2) pd.merge the two frames, but using the left_on keyword to access the inner 7 characters of your 'upc' series
df1 = pd.DataFrame(data=[23456793749, 78907809834, 35894796324,
                         67382808404, 93743008374], columns=['upc1'])
df1 = df1.astype(str)
df2 = pd.DataFrame(data=[4567937, 9078098, 8947963,
                         3828084, 7430083], columns=['upc2'])
df2 = df2.astype(str)
pd.merge(df1, df2, left_on=df1['upc1'].astype(str).str[2:-2], right_on='upc2', how='inner')
Out[5]:
upc1 upc2
0 23456793749 4567937
1 78907809834 9078098
2 35894796324 8947963
3 67382808404 3828084
4 93743008374 7430083
Using str.extract, match each item in df1 against the values of df2, then use the result as the merge key to join with df2:
pattern = r'({})'.format('|'.join(df2.upc.astype(str).tolist()))
df1['keyfordf2'] = df1.upc.astype(str).str.extract(pattern, expand=True).fillna(False)
df1.merge(df2.astype(str),left_on='keyfordf2',right_on='upc')
Out[273]:
upc_x keyfordf2 upc_y
0 23456793749 4567937 4567937
1 78907809834 9078098 9078098
2 35894796324 8947963 8947963
3 67382808404 3828084 3828084
4 93743008374 7430083 7430083
You could make a new column in df1 and merge on that.
import pandas as pd
df1= pd.DataFrame({'upc': [ 23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2= pd.DataFrame({'upc': [ 4567937, 9078098, 8947963, 3828084, 7430083]})
df1['upc_old'] = df1['upc'] #in case you still need the old (longer) upc column
df1['upc'] = df1['upc'].astype(str).str[2:-2].astype(int)
merged_df = pd.merge(df1, df2, on='upc')

python merge two dataframe with common columns

I am trying to merge two dataframes that share 3 columns with the same names.
df1 = ['id','first','last','present']
df2 = ['id','first','last','age','date','location']
I want to join on these three common columns, I tried following
cols = df2.columns.difference(df1.columns)
result = pd.merge(df1, df2[cols], left_index=True, right_index=True, how='outer')
to get
['id','first','last','present','age','date','location']
Can this be done in a single pd.merge?
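Yes: passing the shared columns to the on= parameter merges in one pd.merge call, with no index workaround needed. A sketch with invented sample values, since the question only lists the column names:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'first': ['a', 'b'], 'last': ['x', 'y'],
                    'present': [True, False]})
df2 = pd.DataFrame({'id': [1, 2], 'first': ['a', 'b'], 'last': ['x', 'y'],
                    'age': [30, 40], 'date': ['2020', '2021'],
                    'location': ['NY', 'LA']})

# merge on all three shared columns at once
result = pd.merge(df1, df2, on=['id', 'first', 'last'], how='outer')
print(result)
```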
