How to left join several DataFrames in Python

I have several dataframes with the same name. Each dataframe has one row and two columns, and one column is common to all of them. I would like to left-join them together. The dataframes all share the same name, and I don't plan to rename them because there are so many of them; I am only showing a few here. Is there any way to left-join them and generate the desired output shown below?
Here are the dataframes:
col1 col2_4
0 1 2
col1 col2_9
0 1 10
col1 col2_1
0 1 12
col1 col2_3
0 1 5
Output:
col1 col2_4 col2_9 col2_1 col2_3
0 1 2 10 12 5
Code:
group = df.groupby([randomcolumnname])
for name, groups in group:
    #do some stuff for groups
    print(groups)
    #I want to join the groups dataframes after this line(some groups dataframes are given above)
Thanks in advance!

I believe you need a left join: merge the list of DataFrames on column col1 with reduce:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, df4]
df = reduce(lambda left, right: pd.merge(left, right, on='col1', how='left'), dfs)
print (df)
col1 col2_1 col2_2 col2_3 col2_4
0 1 2 10 12 5
Or, for an outer join, set col1 as the index with set_index and use concat:
df = pd.concat([x.set_index('col1') for x in dfs], axis=1).reset_index()
print (df)
col1 col2_1 col2_2 col2_3 col2_4
0 1 2 10 12 5
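The two approaches above give the same result for the frames in the question, but they differ once the key values differ. A minimal sketch of that difference, assuming two hypothetical frames df_a and df_b (not from the question) where df_b has an extra key:

```python
import pandas as pd

# Hypothetical frames: df_b has an extra key (2) that df_a lacks.
df_a = pd.DataFrame({"col1": [1], "col2_4": [2]})
df_b = pd.DataFrame({"col1": [1, 2], "col2_9": [10, 20]})

# Left join: only keys present in the left frame survive.
left = pd.merge(df_a, df_b, on="col1", how="left")

# Outer join via concat: the union of all keys is kept, gaps become NaN.
outer = pd.concat([x.set_index("col1") for x in (df_a, df_b)], axis=1).reset_index()

print(left)
print(outer)
```

If every frame is known to share the same key values, either form should work; the left merge is the safer choice when the first frame defines which keys to keep.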
EDIT:
I think it is better to use a custom function with GroupBy.apply:
def func(x):
    print (x)
    #do some stuff for groups
    return x
group = df.groupby([randomcolumnname]).apply(func)
If that is not possible, collect the groups in a list of DataFrames:
dfs = []
group = df.groupby([randomcolumnname])
for name, groups in group:
    #do some stuff for groups
    print(groups)
    dfs.append(groups)
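Putting the loop and the merge together might look like the sketch below. The input frame, the key and value column names, and the renaming step that stands in for "do some stuff" are all assumptions made for illustration; only the collect-then-reduce pattern comes from the answer.

```python
from functools import reduce

import pandas as pd

# Hypothetical input: one row per group key, all sharing the same col1 value.
df = pd.DataFrame({"key": [4, 9, 1, 3],
                   "col1": [1, 1, 1, 1],
                   "value": [2, 10, 12, 5]})

dfs = []
# sort=False keeps the groups in order of first appearance.
for name, g in df.groupby("key", sort=False):
    # Stand-in for "do some stuff": name the value column after the group key.
    dfs.append(g[["col1", "value"]].rename(columns={"value": f"col2_{name}"}))

# Left-join all collected frames on the shared column.
joined = reduce(lambda left, right: pd.merge(left, right, on="col1", how="left"), dfs)
print(joined)
```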

Related

How do I stop aggregate functions from adding unwanted rows to dataframe?

I wrote a line of code that groups the dataframe by column
df = df.groupby(['where','when']).agg({'col1': ['max'], 'col2': ['sum']})
After running the above code, the aggregated columns in the output have an extra header row, with 'max' and 'sum' shown below the 'col1' and 'col2' labels. It looks like this:
           col1 col2
            max  sum
where when
home  1       a    a
work  2       b    b
This is my expected outcome:
where when col1 col2
home  1    a    a
work  2    b    b
I want to bring col1 and col2 down to the same row as where and when, and at the same time stop 'max' and 'sum' from showing. I couldn't think of a way to make this work, so help would be appreciated.
You need named aggregation, which sets the output column names inside agg, followed by reset_index.
Use the following:
df = df.groupby(['where','when']).agg(col1 = ('col1', 'max'), col2 = ('col2', 'sum')).reset_index()
Dataframe:
where when col1 col2
0 home 1 1 1
1 work 2 2 2
2 home 1 3 3
Output:
where when col1 col2
0 home 1 3 4
1 work 2 2 2
Update:
We can pass as_index=False to groupby, which stops pandas from using the group keys as the index, so we don't need to reset the index afterwards.
df = df.groupby(['where','when'], as_index = False).agg(col1 = ('col1', 'max'), col2 = ('col2', 'sum'))
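As a self-contained check of the named-aggregation form, here is a small sketch using the sample dataframe from the answer. The variable name flat is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"where": ["home", "work", "home"],
                   "when": [1, 2, 1],
                   "col1": [1, 2, 3],
                   "col2": [1, 2, 3]})

# Named aggregation: each keyword is an output column, each tuple is
# (input column, aggregation function). as_index=False keeps the group
# keys as ordinary columns, so the result has a single flat header row.
flat = df.groupby(["where", "when"], as_index=False).agg(
    col1=("col1", "max"),
    col2=("col2", "sum"),
)
print(flat)
```

For the home group this gives max(1, 3) = 3 for col1 and 1 + 3 = 4 for col2.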

How to compare two dataframes in Python pandas and output the difference?

I have two dataframes with the same number of columns but different numbers of rows.
df1
col1 col2
0 a 1,2,3,4
1 b 1,2,3
2 c 1
df2
col1 col2
0 b 1,3
1 c 1,2
2 d 1,2,3
3 e 1,2
df1 is the existing list, df2 is the updated list. The expected result is whatever is in df2 that was not previously in df1.
Expected result:
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I've tried with
mask = df1['col2'] != df2['col2']
but it doesn't work when the dataframes have different numbers of rows.
Split the values in col2 and use DataFrame.explode to get one value per row, then use DataFrame.merge with a right join and the indicator parameter. Filter with boolean indexing to keep only the right_only rows, and finally aggregate with join:
df11 = df1.assign(col2 = df1['col2'].str.split(',')).explode('col2')
df22 = df2.assign(col2 = df2['col2'].str.split(',')).explode('col2')
df = df11.merge(df22, indicator=True, how='right', on=['col1','col2'])
df = (df[df['_merge'].eq('right_only')]
      .groupby('col1')['col2']
      .agg(','.join)
      .reset_index(name='col2'))
print (df)
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2

Union of two pandas DataFrames

Say I have two data frames:
df1:
A
0 a
1 b
df2:
A
0 a
1 c
I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:
A B
0 a df1, df2
1 b df1
2 c df2
I can get the concatenated data frame (df3) without duplicates as follows:
import pandas as pd
df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)
I can't think of/find a method to have control over what element goes where. How can I add the extra column?
Thank you very much for any tips.
Merge with an indicator argument, and remap the result:
m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}
result = df1.merge(df2, on=['A'], how='outer', indicator='B')
result['B'] = result['B'].map(m)
result
A B
0 a df1, df2
1 b df1
2 c df2
Use the command below:
df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) \
.groupby('A') \
.aggregate(list) \
.reset_index()
The result will be:
A source
0 a [df1, df2]
1 b [df1]
2 c [df2]
assign adds a column named source, with the value df1 or df2, to each dataframe. groupby groups the rows with the same A value into a single row. aggregate describes how to combine the other columns (here source) within each group. I used list as the aggregate function, so the source column holds the list of sources for each value of A.
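The question asks for a string such as "df1, df2" rather than a list. One way to get that from the same concat and groupby pattern is to aggregate with a string join instead of list. This is a variation on the answer above, not the answerer's own code:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["a", "b"]})
df2 = pd.DataFrame({"A": ["a", "c"]})

# Tag each frame with its source, stack them, then join the tags per value of A.
df3 = (pd.concat([df1.assign(B="df1"), df2.assign(B="df2")])
       .groupby("A")["B"]
       .agg(", ".join)
       .reset_index())
print(df3)
```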
We can use an outer join to solve this:
df1 = pd.DataFrame({'A':['a','b']})
df2 = pd.DataFrame({'A':['a','c']})
df1['col1']='df1'
df2['col2']='df2'
df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
df['B']=df['col1']+','+df['col2']
df['B'] = df['B'].str.strip(',')
df=df[['A','B']]
df
A B
0 a df1,df2
1 b df1
2 c df2

python pandas: return indexes of common rows

Apologies, if this is a fairly newbie question. I was trying to find which rows are common between two data frames. The return values should be the row indexes of df2 that are common with df1. My clunky example:
df1 = pd.DataFrame({'col1':['cx','cx','cx2'], 'col2':[1,4,12]})
df1['col2'] = df1['col2'].map(str);
df2 = pd.DataFrame({'col1':['cx','cx','cx','cx','cx2','cx2'], 'col2':[1,3,5,10,12,12]})
df2['col2'] = df2['col2'].map(str);
df1['idx'] = df1[['col1','col2']].apply(lambda x: '_'.join(x),axis=1);
df2['idx'] = df2[['col1','col2']].apply(lambda x: '_'.join(x),axis=1);
df1['idx_values'] = df1.index.values
df2['idx_values'] = df2.index.values
df3 = pd.merge(df1,df2,on = 'idx');
myindexes = df3['idx_values_y'];
myindexes.to_csv(idir + 'test.txt',sep='\t',index = False);
The return values should be [0,4,5]. It would be great to have this done efficiently since the two dataframes would have several million rows.
A new column with joined values is not necessary: merge performs an inner join by default and can match on both columns. If you need the values of df2.index, add reset_index:
df1 = pd.DataFrame({'col1':['cx','cx','cx2'], 'col2':[1,4,12]})
df2 = pd.DataFrame({'col1':['cx','cx','cx','cx','cx2','cx2'], 'col2':[1,3,5,10,12,12]})
df3 = pd.merge(df1,df2.reset_index(), on = ['col1','col2'])
print (df3)
col1 col2 index
0 cx 1 0
1 cx2 12 4
2 cx2 12 5
If you need the indexes of both DataFrames:
df4 = pd.merge(df1.reset_index(),df2.reset_index(), on = ['col1','col2'])
print (df4)
index_x col1 col2 index_y
0 0 cx 1 0
1 2 cx2 12 4
2 2 cx2 12 5
For only the intersection of both DataFrames:
df5 = pd.merge(df1,df2, on = ['col1','col2'])
#if 2 column DataFrame
#df5 = pd.merge(df1,df2)
print (df5)
col1 col2
0 cx 1
1 cx2 12
2 cx2 12
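To get the indexes as a plain list, as the question asks, the merged index column can be pulled out directly. A minimal sketch with the data from the question; the name df2_indexes is illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["cx", "cx", "cx2"], "col2": [1, 4, 12]})
df2 = pd.DataFrame({"col1": ["cx", "cx", "cx", "cx", "cx2", "cx2"],
                    "col2": [1, 3, 5, 10, 12, 12]})

# reset_index() turns df2's index into a column named "index",
# so it is still available after the merge.
common = pd.merge(df1, df2.reset_index(), on=["col1", "col2"], how="inner")
df2_indexes = common["index"].tolist()
print(df2_indexes)  # [0, 4, 5]
```

If df1 can contain duplicate rows, each df2 index would appear once per duplicate, so calling df1.drop_duplicates() before the merge is probably worth considering.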
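This can also be done with an inner merge of both dataframes on the combined key column idx built in the question (merging on idx_values would only match row positions, not row contents):
common_rows = pd.merge(df1, df2.reset_index(), how='inner', on=['idx'])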

How to turn the values of a pandas DataFrame column into columns

I would like to reshape my dataframe:
from Input_DF
col1 col2 col3
Course_66 0\nCourse_67 1\nCourse_68 0 a c
Course_66 1\nCourse_67 0\nCourse_68 0 a d
to Output_DF
Course_66 Course_67 Course_68 col2 col3
0 0 1 a c
0 1 0 a d
Please, note that col1 contains one long string.
Please, any help would be very appreciated.
Many Thanks in advance.
Best Regards,
Carlo
Use:
#first split by whitespaces to df
df1 = df['col1'].str.split(expand=True)
#for each column split by \n and select first value
df2 = df1.apply(lambda x: x.str.split(r'\\n').str[0])
#for columns select only first row and select second splitted value
df2.columns = df1.iloc[0].str.split(r'\\n').str[1]
print (df2)
0 Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
#join to original, remove unnecessary column
df = df2.join(df.drop('col1', axis=1))
print (df)
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
Another solution with list comprehension:
L = [[y.split('\\n')[0] for y in x.split()] for x in df['col1']]
cols = [x.split('\\n')[1] for x in df.loc[0, 'col1'].split()]
df1 = pd.DataFrame(L, index=df.index, columns=cols)
print (df1)
Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
EDIT:
#split values by whitespace - this splits on \n too
df1 = df['course_vector'].str.split(expand=True)
#select every second column, starting at position 1 (the values)
df2 = df1.iloc[:, 1::2]
#use the first-row values at the even positions (the course names) as column names
df2.columns = df1.iloc[0, 0::2]
#join to original
df = df2.join(df.drop('course_vector', axis=1))
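A self-contained version of this whitespace-split approach might look like the sketch below. It assumes col1 holds real newline characters and name-then-value pairs, as displayed in the question; the variable names parts, values and result are illustrative.

```python
import pandas as pd

# Assumed input: each col1 string is "name value" pairs separated by newlines.
df = pd.DataFrame({
    "col1": ["Course_66 0\nCourse_67 1\nCourse_68 0",
             "Course_66 1\nCourse_67 0\nCourse_68 0"],
    "col2": ["a", "a"],
    "col3": ["c", "d"],
})

# Splitting on any whitespace (spaces and newlines) alternates names and values.
parts = df["col1"].str.split(expand=True)

# Odd positions hold the values; even positions of the first row hold the names.
values = parts.iloc[:, 1::2].copy()
values.columns = parts.iloc[0, 0::2].to_numpy()

# Attach the remaining original columns.
result = values.join(df.drop(columns="col1"))
print(result)
```

The extracted values are still strings; values.astype(int) could be applied before the join if numeric columns are needed.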
Since your data are ordered in value, key pairs, you can split on newlines and multiple spaces with regex to get a list, and then take every other value starting at the first position for values and the second position for labels and return a Series object. By applying, you will get back a DataFrame from these multiple series, which you can then combine with the original DataFrame.
import pandas as pd
df = pd.DataFrame({'col1': ['0\nCourse_66 0\nCourse_67 1\nCourse_68',
                            '0\nCourse_66 1\nCourse_67 0\nCourse_68'],
                   'col2': ['a', 'a'], 'col3': ['c', 'd']})
def to_multiple_columns(str_list):
    # take the numeric values for each series and column labels and return as a series
    # by taking every other value
    return pd.Series(str_list[::2], str_list[1::2])
# split on newlines and spaces
splits = df['col1'].str.split(r'\n|\s+').apply(to_multiple_columns)
output = pd.concat([splits, df.drop('col1', axis=1)], axis=1)
print(output)
Output:
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
