How to compare two dataframes in Python pandas and output the difference? - python

I have two DataFrames with the same number of columns but different numbers of rows.
df1
col1 col2
0 a 1,2,3,4
1 b 1,2,3
2 c 1
df2
col1 col2
0 b 1,3
1 c 1,2
2 d 1,2,3
3 e 1,2
df1 is the existing list, df2 is the updated list. The expected result is whatever is in df2 that was not previously in df1.
Expected result:
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I've tried with
mask = df1['col2'] != df2['col2']
but it doesn't work when the DataFrames have different numbers of rows.

Use DataFrame.explode on the split values of column col2, then use DataFrame.merge with a right join and the indicator parameter. Filter by boolean indexing to keep only the right_only rows, and finally aggregate with join:
df11 = df1.assign(col2 = df1['col2'].str.split(',')).explode('col2')
df22 = df2.assign(col2 = df2['col2'].str.split(',')).explode('col2')
df = df11.merge(df22, indicator=True, how='right', on=['col1','col2'])
df = (df[df['_merge'].eq('right_only')]
.groupby('col1')['col2']
.agg(','.join)
.reset_index(name='col2'))
print (df)
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
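For small frames, the same difference can also be computed with plain Python sets instead of explode and merge. This is only a minimal sketch of an alternative, not the method above. It assumes that col1 values are unique within each frame and that col2 is always a comma-separated string; the names old, rows and diff are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c"],
                    "col2": ["1,2,3,4", "1,2,3", "1"]})
df2 = pd.DataFrame({"col1": ["b", "c", "d", "e"],
                    "col2": ["1,3", "1,2", "1,2,3", "1,2"]})

# Map each key in df1 to the set of values already known for it.
old = {key: set(vals.split(",")) for key, vals in zip(df1["col1"], df1["col2"])}

rows = []
for key, vals in zip(df2["col1"], df2["col2"]):
    # Keep df2's value order, dropping values df1 already had for this key.
    new_vals = [v for v in vals.split(",") if v not in old.get(key, set())]
    if new_vals:  # skip keys that contribute nothing new
        rows.append({"col1": key, "col2": ",".join(new_vals)})

diff = pd.DataFrame(rows, columns=["col1", "col2"])
print(diff)
```

The merge-based approach is likely to scale better on large frames; the loop is mainly easier to step through when checking a result by hand.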

Related

How do you drop a column by index?

When I run this code it drops the first row instead of the first column:
df.drop(axis=1, index=0)
How do you drop a column by index?
You can use df.columns[i] to denote the column. Example:
df.drop(df.columns[0], axis=1)
Using the example
df = pd.DataFrame([
[1023.423,12.59595],
[1000,11.63024902],
[975,9.529815674],
[100,-48.20524597]], columns = ['col1', 'col2'])
col1 col2
0 1023.423 12.595950
1 1000.000 11.630249
2 975.000 9.529816
3 100.000 -48.205246
If you call df.drop(index=0), the row with index label 0 is dropped:
col1 col2
1 1000.0 11.630249
2 975.0 9.529816
3 100.0 -48.205246
If you call df.drop('col1', axis=1), the column named 'col1' is dropped:
col2
0 12.595950
1 11.630249
2 9.529816
3 -48.205246
Remember that drop returns a new DataFrame. Assign the result back, or pass inplace=True, if you want df itself to change.
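As a minimal sketch of dropping a column by position, the two variants below can be compared side by side. The two-row frame is a shortened copy of the example data, and the names by_label and by_position are illustrative. Note that df.columns[0] is still a label lookup, so if several columns share that label, all of them are dropped; the iloc variant works purely by position.

```python
import pandas as pd

df = pd.DataFrame(
    [[1023.423, 12.59595], [1000, 11.63024902]],
    columns=["col1", "col2"],
)

# Look up the label at position 0, then drop the column with that label.
by_label = df.drop(columns=df.columns[0])

# Or select every column except position 0, without going through labels.
by_position = df.iloc[:, 1:]

print(by_label.columns.tolist())
print(by_position.columns.tolist())
# df itself is unchanged because neither call modifies it in place.
print(df.columns.tolist())
```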

Identifying rows shared between two Pandas DataFrames based on two columns

Related to: How to find row with same value in 2 columns between 2 dataframes but different values in other columns pandas
I have two DataFrames: df1 and df2.
I would like to find all the rows in these combined DataFrames that have identical values in 'columnA' (object) and 'columnB' (int). These rows will have differing values in other columns I don't care about. The shape of these DataFrames also differs.
I've tried something like:
concat = pd.concat([df1, df2])
overlap = concat[concat.duplicated(subset=['columnA','columnB'], keep=False)]
But the output doesn't look right (maybe it is). Just want to check - am I missing anything?
Edit:
Say I wanted all the rows with the same value in columnA but different values in columnB - would this work?
df3 = (concat[concat.duplicated(subset=['columnA'], keep=False)]
.drop_duplicates(subset=['columnB']))
You can use pd.merge
df1 = pd.DataFrame(data=[('A','B','C'),('E','F','G'),('A','B','F')], columns=['columnA','columnB','columnC'])
df2 = pd.DataFrame(data=[('X','Y','G'),('A','B','Y'),('A','C','F')], columns=['columnA','columnB','columnC'])
df2['columnB'] = df2['columnB'].astype(str) #convert to string
print(df1)
columnA columnB columnC
0 A B C
1 E F G
2 A B F
print(df2)
columnA columnB columnC
0 X Y G
1 A B Y
2 A C F
And then after applying pd.merge:
df_m = pd.merge(df1,df2,how='inner',on='columnA')
----
df_m
columnA columnB_x columnC_x columnB_y columnC_y
0 A B C B Y
1 A B C C F
2 A B F B Y
3 A B F C F
Regarding your edit, try this:
df_final = df_m[df_m['columnB_x'] != df_m['columnB_y']]
------
print(df_final)
columnA columnB_x columnC_x columnB_y columnC_y
1 A B C C F
3 A B F C F
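For the original question (rows that share both columnA and columnB across the two frames), one possible sketch is to merge on both key columns. The sample data below is made up, with columnB as int as described in the question, and the names keys, shared_keys and overlap are illustrative. Unlike concat followed by duplicated, this does not flag rows that are only duplicated within a single frame.

```python
import pandas as pd

df1 = pd.DataFrame({"columnA": ["A", "E", "A"],
                    "columnB": [1, 2, 1],
                    "columnC": ["C", "G", "F"]})
df2 = pd.DataFrame({"columnA": ["X", "A", "A"],
                    "columnB": [9, 1, 3],
                    "columnC": ["G", "Y", "F"]})

keys = ["columnA", "columnB"]

# Key pairs that occur in both frames (inner join on both key columns).
shared_keys = df1[keys].merge(df2[keys], on=keys).drop_duplicates()

# All rows from either frame whose key pair is one of the shared pairs.
overlap = pd.concat([df1, df2]).merge(shared_keys, on=keys)
print(overlap)
```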

How to do left join on several dataframe

I have several DataFrames that all have the same name. Each one has one row and two columns, and one column (col1) is common to all of them. I would like to left-join them together. I don't plan to give them different names, because there are many of them; I am only showing a few here. Is there a way to left-join them and produce the desired output below?
Here is the dataframes:
col1 col2_4
0 1 2
col1 col2_9
0 1 10
col1 col2_1
0 1 12
col1 col2_3
0 1 5
Output:
col1 col2_4 col2_9 col2_1 col2_3
0 1 2 10 12 5
Code:
group = df.groupby([randomcolumnname])
for name, groups in group:
    #do some stuff for groups
    print(groups)
    #I want to join the groups dataframes after this line(some groups dataframes are given above)
Thanks in advance!
I believe you need a left-join merge over a list of DataFrames on column col1:
dfs = [df1, df2, df3, df4]
from functools import reduce
df = df_final = reduce(lambda left,right: pd.merge(left,right,on='col1', how='left'), dfs)
print (df)
col1 col2_1 col2_2 col2_3 col2_4
0 1 2 10 12 5
Or, for an outer join, set col1 as the index with set_index and use concat:
df = pd.concat([x.set_index('col1') for x in dfs], axis=1).reset_index()
print (df)
col1 col2_1 col2_2 col2_3 col2_4
0 1 2 10 12 5
EDIT:
I think it is better to use a custom function with GroupBy.apply:
def func(x):
    print (x)
    #do some stuff for groups
    return x
group = df.groupby([randomcolumnname]).apply(func)
If that is not possible, collect the groups in a list of DataFrames:
dfs = []
group = df.groupby([randomcolumnname])
for name, groups in group:
    #do some stuff for groups
    print(groups)
    dfs.append(groups)
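Putting the pieces together, here is a minimal self-contained sketch of the reduce approach. The four one-row frames are rebuilt by hand from the question's sample data, and the name joined is illustrative. In practice the list would be filled inside the groupby loop shown above.

```python
from functools import reduce

import pandas as pd

# One-row frames sharing the key column col1, as in the question.
dfs = [
    pd.DataFrame({"col1": [1], "col2_4": [2]}),
    pd.DataFrame({"col1": [1], "col2_9": [10]}),
    pd.DataFrame({"col1": [1], "col2_1": [12]}),
    pd.DataFrame({"col1": [1], "col2_3": [5]}),
]

# Fold the list pairwise: ((df0 ⋈ df1) ⋈ df2) ⋈ df3, each step a left join.
joined = reduce(
    lambda left, right: left.merge(right, on="col1", how="left"),
    dfs,
)
print(joined)
```

Because every step is a left join, only the col1 values present in the first frame of the list survive, so the order of the list matters when the keys differ between frames.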

how to create a dataframe aggregating (grouping?) a dataframe containing only strings

I would like to create a dataframe "aggregating" a larger data set.
Starting:
df:
col1 col2
1 A B
2 A C
3 A B
and getting:
df_aggregated:
col1 col2
1 A B
2 A C
without using any calculation (count())
I would write:
df_aggreagated = df.groupby('col1')
but I do not get anything
print ( df_aggregated )
"error"
any help appreciated
You can accomplish this by dropping the duplicate rows with DataFrame.drop_duplicates. Keep the default keep='first'; keep=False would discard every copy of a duplicated row, including the first one:
df_aggregated = df.drop_duplicates(subset=['col1', 'col2'])
print(df_aggregated)
col1 col2
1 A B
2 A C
You can also use groupby with an aggregation function:
In [849]: df.groupby('col2', as_index=False).max()
Out[849]:
col2 col1
0 B A
1 C A
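The effect of the keep parameter of drop_duplicates is easy to check on the question's data. The sketch below rebuilds that frame by hand; the names first_kept and none_kept are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["A", "A", "A"], "col2": ["B", "C", "B"]},
    index=[1, 2, 3],
)

# Default keep='first': one representative of each duplicated row survives.
first_kept = df.drop_duplicates(subset=["col1", "col2"])

# keep=False: every row that has a duplicate is removed, including the first.
none_kept = df.drop_duplicates(subset=["col1", "col2"], keep=False)

print(first_kept)
print(none_kept)
```

With keep='first' the rows at index 1 and 2 remain (A B and A C); with keep=False only the row at index 2 remains, because both A B rows count as duplicates of each other.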

How to make the values of a pandas dataframe column as column

I would like to reshape my dataframe:
from Input_DF
col1 col2 col3
Course_66 0\nCourse_67 1\nCourse_68 0 a c
Course_66 1\nCourse_67 0\nCourse_68 0 a d
to Output_DF
Course_66 Course_67 Course_68 col2 col3
0 0 1 a c
0 1 0 a d
Please, note that col1 contains one long string.
Please, any help would be very appreciated.
Many Thanks in advance.
Best Regards,
Carlo
Use:
#first split on whitespace into a DataFrame
df1 = df['col1'].str.split(expand=True)
#for each column, split on \n and take the first part (the value)
df2 = df1.apply(lambda x: x.str.split(r'\\n').str[0])
#build the column names from the first row: take the second part of each split
df2.columns = df1.iloc[0].str.split(r'\\n').str[1]
print (df2)
0 Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
#join to original, remove unnecessary column
df = df2.join(df.drop('col1', axis=1))
print (df)
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
Another solution with list comprehension:
L = [[y.split('\\n')[0] for y in x.split()] for x in df['col1']]
cols = [x.split('\\n')[1] for x in df.loc[0, 'col1'].split()]
df1 = pd.DataFrame(L, index=df.index, columns=cols)
print (df1)
Course_66 Course_67 Course_68
0 0 0 1
1 0 1 0
EDIT:
#split on whitespace; this also splits on \n
df1 = df['course_vector'].str.split(expand=True)
#select every second column, starting at position 1 (the values)
df2 = df1.iloc[:, 1::2]
#use every second value of the first row, starting at position 0, as column names
df2.columns = df1.iloc[0, 0::2]
#join to original
df = df2.join(df.drop('course_vector', axis=1))
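The EDIT variant can be tried on a small frame. This is only a sketch under two assumptions that the thread does not confirm: the course_vector strings contain real newline characters, and each string is laid out as name, value pairs. The sample rows and the names parts, values and out are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "course_vector": ["Course_66 0\nCourse_67 1\nCourse_68 0",
                      "Course_66 1\nCourse_67 0\nCourse_68 0"],
    "col2": ["a", "a"],
    "col3": ["c", "d"],
})

# With no pattern, str.split splits on any whitespace, newlines included.
parts = df["course_vector"].str.split(expand=True)

# Positions 1, 3, 5 hold the values; positions 0, 2, 4 hold the names.
values = parts.iloc[:, 1::2].copy()
values.columns = parts.iloc[0, 0::2].to_numpy()

# Attach the remaining original columns by index.
out = values.join(df.drop(columns="course_vector"))
print(out)
```

The values stay as strings here; converting them with astype(int) would be a separate step if numeric columns are needed.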
Since your data are ordered as value, key pairs, you can split on newlines and runs of spaces with a regex to get a list. Then take every other item, starting at the first position for the values and at the second position for the labels, and return them as a Series. Applying this function to the column gives back a DataFrame built from these Series, which you can then combine with the original DataFrame.
import pandas as pd
df = pd.DataFrame({'col1': ['0\nCourse_66 0\nCourse_67 1\nCourse_68',
                            '0\nCourse_66 1\nCourse_67 0\nCourse_68'],
                   'col2': ['a', 'a'], 'col3': ['c', 'd']})
def to_multiple_columns(str_list):
    # take the numeric values for each series and column labels and return as a series
    # by taking every other value
    return pd.Series(str_list[::2], str_list[1::2])
# split on newlines and spaces
splits = df['col1'].str.split(r'\n|\s+').apply(to_multiple_columns)
output = pd.concat([splits, df.drop('col1', axis=1)], axis=1)
print(output)
Output:
Course_66 Course_67 Course_68 col2 col3
0 0 0 1 a c
1 0 1 0 a d
