pandas.merge with coinciding column names - python

Consider the following data frames:
import pandas as pd
df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A' + str(i) for i in range(5)]})
A id
0 A0 f
1 A1 g
2 A2 h
3 A3 i
4 A4 j
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B' + str(i) for i in range(2)]})
B id
0 B0 f
1 B1 g
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B' + str(i) for i in range(3, 5)]})
B id
0 B3 i
1 B4 j
I want to merge them to get
A id B
0 A0 f B0
1 A1 g B1
2 A2 h NaN
3 A3 i B3
4 A4 j B4
Inspired by this answer I tried
from functools import reduce
final = reduce(lambda l, r: pd.merge(l, r, how='outer', on='id'), [df1, df2, df3])
but unfortunately it yields
A id B_x B_y
0 A0 f B0 NaN
1 A1 g B1 NaN
2 A2 h NaN NaN
3 A3 i NaN B3
4 A4 j NaN B4
The suffixes appear because merge treats the two B columns as distinct overlapping columns rather than combining them. Additionally, I checked out this question, but I can't adapt its solution to my problem, and I didn't find any option in the docs for pandas.merge to make this happen.
In my real-world problem, the list of data frames may be much longer and the data frames themselves much larger.
Is there any "pythonic" way to do this directly, without "postprocessing"? It would be perfect to have a solution that raises an exception if column B of df2 and df3 overlapped (i.e., if some id had multiple candidate values for column B in the final data frame).

Consider pd.concat + groupby: stack the frames vertically, then take the first non-null value per id.
pd.concat([df1, df2, df3], axis=0).groupby('id').first().reset_index()
id A B
0 f A0 B0
1 g A1 B1
2 h A2 NaN
3 i A3 B3
4 j A4 B4
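If you also want the exception the question asks for, here is a minimal sketch (using df1, df2, df3 from above; the conflict check is my own addition, not part of the answer): count the distinct non-null B values per id before collapsing, and fail loudly on a conflict.
import pandas as pd

combined = pd.concat([df1, df2, df3], axis=0)

# fail loudly if any id carries more than one distinct non-null B value
b_counts = combined.groupby('id')['B'].nunique(dropna=True)
conflicts = b_counts[b_counts > 1]
if not conflicts.empty:
    raise ValueError(f"conflicting B values for id(s): {list(conflicts.index)}")

final = combined.groupby('id').first().reset_index()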

Related

Efficiently merge a dataframe with another by selecting the second dataframe based on each row

I have a pandas DataFrame like this:
A B C
0 A0 B0 X
1 A1 B1 Y
2 A2 B2 X
And I want to merge the above with the following DataFrames,
df_x
A D
0 A0 X0
1 A1 X1
2 A2 X2
3 A3 X3
df_y
A D
0 A0 Y0
1 A1 Y1
2 A2 Y2
3 A3 Y3
When merging, I want to select the second DataFrame based on column C: if the value in C is X, that row should be merged with df_x, and similarly, if the value in C is Y, with df_y. So the final output would be:
A B C D
0 A0 B0 X X0
1 A1 B1 Y Y1
2 A2 B2 X X2
Possible approaches include i) iterating over each row and processing it, or ii) adding a C column to each of df_x and df_y and then merging. Obviously, iterating would not be very efficient, and the other method consumes additional space for a column of redundant data. Is there a better method to achieve this?
Try this:
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''A B C
0 A0 B0 X
1 A1 B1 Y
2 A2 B2 X'''), sep=r'\s+', engine='python')
df_x = pd.read_csv(io.StringIO('''A D
0 A0 X0
1 A1 X1
2 A2 X2
3 A3 X3'''), sep=r'\s+', engine='python')
df_y = pd.read_csv(io.StringIO('''A D
0 A0 Y0
1 A1 Y1
2 A2 Y2
3 A3 Y3'''), sep=r'\s+', engine='python')

# split df by the value of C, merge each part with its lookup frame, then recombine
dfx = df[df.C == 'X']
dfy = df[df.C == 'Y']
df1 = dfx.merge(df_x, on='A')
df2 = dfy.merge(df_y, on='A')
df_final = pd.concat([df1, df2]).sort_values('A')
print(df_final)
Output
A B C D
0 A0 B0 X X0
0 A1 B1 Y Y1
1 A2 B2 X X2
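This split-merge-concat pattern generalizes to any number of categories. A sketch of my own, assuming a dict lookups that maps each value of C to its lookup frame (the mapping and the names are assumptions, not from the answer):
lookups = {'X': df_x, 'Y': df_y}
parts = [
    grp.reset_index().merge(lookups[c], on='A', how='left').set_index('index')
    for c, grp in df.groupby('C')
]
# concat, then restore the original row order via the preserved index
df_final = pd.concat(parts).sort_index().rename_axis(None)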
There is no direct way of doing it; however, merge will do the job: merge df with both lookup frames on 'A', then keep D from whichever frame matches the row's C value.
df_new = df.merge(df_x, how='left', on='A').merge(df_y, how='left', on='A', suffixes=('*x', '*y'))
df_new['D'] = df_new['D*x'].where(df_new.C == 'X', df_new['D*y'])
df_new = df_new.drop(columns=['D*x', 'D*y'])
This may not be the most direct answer, but you can easily adapt the idea in the code above.

Change order of randomly selected rows within a pandas dataframe

I have a pandas dataframe that looks like:
c1 c2 c3 c4 result
a b c d 1
b c d a 1
a e d b 1
g a f c 1
but I want to randomly select 50% of the rows, swap their (c1, c2) values with (c3, c4), and flip the result column from 1 to 0 (as shown below):
c1 c2 c3 c4 result
a b c d 1
d a b c 0 (we swapped c3 and c4 with c1 and c2)
a e d b 1
f c g a 0 (we swapped c3 and c4 with c1 and c2)
What's the idiomatic way to accomplish this?
You had the general idea. Shuffle the DataFrame and split it in half. Then modify one half and join back.
import numpy as np
import pandas as pd

np.random.seed(410112)
dfs = np.array_split(df.sample(frac=1), 2)  # shuffle, then split in half
# on one half, set result to 0 and swap (c1, c2) with (c3, c4)
dfs[1]['result'] = 0
dfs[1] = dfs[1].rename(columns={'c1': 'c3', 'c3': 'c1', 'c2': 'c4', 'c4': 'c2'})
# join back; concat aligns on column names, so the renamed values land in the swapped slots
df = pd.concat(dfs).sort_index()
c1 c2 c3 c4 result
0 a b c d 1
1 d a b c 0
2 d b a e 0
3 g a f c 1
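A minimal in-place alternative (a sketch of my own, assuming the same df): sample half the row labels, then assign the swapped values through .values so column labels don't realign them.
import numpy as np

np.random.seed(410112)
idx = df.sample(frac=0.5).index  # randomly pick half the rows
# .values strips the column labels, so the assignment is purely positional
df.loc[idx, ['c1', 'c2', 'c3', 'c4']] = df.loc[idx, ['c3', 'c4', 'c1', 'c2']].values
df.loc[idx, 'result'] = 0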

How can I efficiently replicate a pandas row, changing only one column?

I have a dataframe that looks like this:
v1 v2
0 a A1
1 b A2,A3
2 c B4
3 d A5, B6, B7
I want to modify this dataframe such that any row which has more than one value in the v2 column gets replicated for each value in v2. For example for the above dataframe, the result is as follows:
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
I was able to do this with the following code:
new_df = pd.DataFrame()
for index, row in df.iterrows():
    if len(row["v2"].split(',')) > 1:
        row_base = row
        for r in row["v2"].split(','):
            row_base["v2"] = r
            new_df = new_df.append(row_base, ignore_index=True)
    else:
        new_df = new_df.append(row)
However, it is extremely inefficient on a large dataframe, and I would like to learn how to do it more efficiently.
Pandas solution for version 0.25+ with Series.str.split and DataFrame.explode:
df = df.assign(v2=df.v2.str.split(',')).explode('v2').reset_index(drop=True)
print(df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
For older versions, and also for better performance, use numpy:
from itertools import chain

s = df.v2.str.split(',')
lens = s.str.len()  # number of split values per original row
df = pd.DataFrame({
    'v1': df['v1'].values.repeat(lens),                    # repeat each v1 once per value
    'v2': list(chain.from_iterable(s.values.tolist())),    # flatten the lists of values
})
print(df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
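One caveat for both approaches (my own note, not from the answer): the sample values contain spaces after the commas ('A5, B6, B7'), so splitting on ',' alone leaves leading whitespace on some of the new values. Stripping afterwards fixes that:
df['v2'] = df['v2'].str.strip()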

Combining rows of a dataframe with string columns

I think this is a simple one, but I am not able to figure it out today and need some help.
I have a pandas dataframe:
df = pd.DataFrame({
    'id': [0, 0, 1, 1, 2],
    'q.name': ['A'] * 3 + ['B'] * 2,
    'q.value': ['A1', 'A2', 'A3', 'B1', 'B2'],
    'w.name': ['Q', 'W', 'E', 'R', 'Q'],
    'w.value': ['B1', 'B2', 'C3', 'C1', 'D2'],
})
that looks like this
id q.name q.value w.name w.value
0 0 A A1 Q B1
1 0 A A2 W B2
2 1 A A3 E C3
3 1 B B1 R C1
4 2 B B2 Q D2
I am looking to convert it to
id q.name q.value w.name w.value
0 0 A A A1 A2 Q W B1 B2
1 1 A B A3 B1 E R C3 C1
2 2 B B2 Q D2
I tried pd.DataFrame(df.apply(lambda s: s.str.cat(sep=" "))) but that did not give me the result I wanted. I have done this before but I'm struggling to recall or find any post on SO to help me.
Update:
I should have mentioned this before: Is there a way of doing this without specifying which column? The DataFrame changes based on context.
I have also updated the dataframe and shown an id field, as I just realised that this was possible. I think now a groupby on the id field should solve this.
UPDATE:
In [117]: df.groupby('id', as_index=False).agg(' '.join)
Out[117]:
id q.name q.value w.name w.value
0 0 A A A1 A2 Q W B1 B2
1 1 A B A3 B1 E R C3 C1
2 2 B B2 Q D2
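Note (an assumption on my part, not from the original answer): agg(' '.join) only works if every non-grouped column holds strings; with mixed dtypes, you could cast inside the aggregation instead:
df.groupby('id', as_index=False).agg(lambda s: ' '.join(map(str, s)))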
Old answer (for the original dataframe, before the question was updated with the id field):
In [106]: df.groupby('category', as_index=False).agg(' '.join)
Out[106]:
category name
0 A A1 A2 A3
1 B B1 B2

Compare Multiple Columns to Get Rows that are Different in Two Pandas Dataframes

I have two dataframes:
df1=
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
df2=
A B C
0 A2 B2 C10
1 A1 B3 C11
2 A9 B4 C12
and I want to find rows in df1 that are not found in df2 based on one or two columns (or more columns). So, if I only compare column 'A' then the following rows from df1 are not found in df2 (note that column 'B' and column 'C' are not used for comparison between df1 and df2)
A B C
0 A0 B0 C0
And I would like to return a series with
0 False
1 True
2 True
Or, if I only compare column 'A' and column 'B' then the following rows from df1 are not found in df2 (note that column 'C' is not used for comparison between df1 and df2)
A B C
0 A0 B0 C0
1 A1 B1 C1
And I would want to return a series with
0 False
1 False
2 True
I know how to accomplish this using sets but I am looking for a straightforward Pandas way of accomplishing this.
If your version of pandas is 0.17.0 or newer, you can use pd.merge, passing the columns of interest, how='left', and indicator=True to show whether each row is present only in the left frame or in both. You can then test whether the appended _merge column equals 'both':
In [102]:
pd.merge(df1, df2, on='A',how='left', indicator=True)['_merge'] == 'both'
Out[102]:
0 False
1 True
2 True
Name: _merge, dtype: bool
In [103]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)['_merge'] == 'both'
Out[103]:
0 False
1 False
2 True
Name: _merge, dtype: bool
output from the merge:
In [104]:
pd.merge(df1, df2, on='A',how='left', indicator=True)
Out[104]:
A B_x C_x B_y C_y _merge
0 A0 B0 C0 NaN NaN left_only
1 A1 B1 C1 B3 C11 both
2 A2 B2 C2 B2 C10 both
In [105]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)
Out[105]:
A B C_x C_y _merge
0 A0 B0 C0 NaN left_only
1 A1 B1 C1 NaN left_only
2 A2 B2 C2 C10 both
Ideally, one would like to be able to just use ~df1[COLS].isin(df2[COLS]) as a mask, but this requires index labels to match (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html)
Here is a succinct form that uses .isin but converts the second DataFrame to a dict so that index labels don't need to match:
COLS = ['A', 'B']  # or whichever columns to use for comparison
df1[~df1[COLS].isin(df2[COLS].to_dict(orient='list')).all(axis=1)]
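One caveat worth knowing (my own observation, not part of the original answer): isin with a dict checks each column independently, so a row whose values all appear somewhere in df2, but in different rows, still counts as found:
left = pd.DataFrame({'A': ['A1'], 'B': ['B2']})
right = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
left[['A', 'B']].isin(right[['A', 'B']].to_dict(orient='list')).all(axis=1)
# 0    True   <- ('A1', 'B2') is not a row of right, yet it passes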
~df1['A'].isin(df2['A'])
should get you the series you want, and
df1[~df1['A'].isin(df2['A'])]
the dataframe:
A B C
0 A0 B0 C0
Method (1)
In [63]:
df1['A'].isin(df2['A']) & df1['B'].isin(df2['B'])
Out[63]:
0 False
1 False
2 True
Method (2)
You can use a left merge to obtain the values that exist in both frames plus the values that exist only in the first data frame:
In [10]:
left = pd.merge(df1 , df2 , on = ['A' , 'B'] ,how = 'left')
left
Out[10]:
A B C_x C_y
0 A0 B0 C0 NaN
1 A1 B1 C1 NaN
2 A2 B2 C2 C10
Values that exist only in the first frame will, of course, have NaN in the columns that came from the other data frame, so you can filter on those NaN values as follows:
In [16]:
left.loc[pd.isnull(left['C_y']) , 'A':'C_x']
Out[16]:
A B C_x
0 A0 B0 C0
1 A1 B1 C1
If you want a boolean series indicating whether each row's values in df1 also exist in df2, you can do the following:
In [20]:
pd.notnull(left['C_y'])
Out[20]:
0 False
1 False
2 True
Name: C_y, dtype: bool
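Wrapping the indicator trick into a small reusable helper (a hypothetical sketch of my own; the name found_in is not from any answer above):
import pandas as pd

def found_in(left, right, cols):
    # boolean Series, aligned to left's index: True where left's rows (on cols) also appear in right
    merged = left[cols].merge(right[cols].drop_duplicates(), on=cols, how='left', indicator=True)
    return merged['_merge'].eq('both').set_axis(left.index)

found_in(df1, df2, ['A', 'B'])  # 0 False, 1 False, 2 True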
