Merging rows from different dataframes together - python

I have two DataFrames: one with the columns "Name", "Year" and "Type", and another with different parameters. There are 4 different types, and each one has its own specific parameters. Now I need to merge them together.
My approach is to use an if-statement to find out the "type". For example, in row two of df3 I have type 'a'. The parameters for type 'a' are in row 3 of df4. I tried to connect them with the following code:
df3.ix[[2]]
s1 = df3.ix[[2]]
s2 = df4.ix[[3]]
result = pd.concat([s1, s2], axis=1)
My problem now is that the parameters end up in a separate row instead of being added to row 2. Is there a way to merge them together into one row? Thanks for your answers!

If df3 has a Type column and df4 has a type column, then the two DataFrames can be merged with
pd.merge(df3, df4, left_on='Type', right_on='type')
This is by default an inner join.
In [13]: df3
Out[13]:
  Name  Year   Type
1    A  2012   boat
2    B  2013    car
3    C  2011  truck
4    D  2013   boat
In [14]: df4
Out[14]:
    type  Parameter1  Parameter2  Parameter3
0   boat           2           8           7
1    car           1           9           3
2  truck           5           4           2
In [15]: pd.merge(df3, df4, left_on='Type', right_on='type')
Out[15]:
  Name  Year   Type   type  Parameter1  Parameter2  Parameter3
0    A  2012   boat   boat           2           8           7
1    D  2013   boat   boat           2           8           7
2    B  2013    car    car           1           9           3
3    C  2011  truck  truck           5           4           2
Note that if the column names matched exactly, then
pd.merge(df3, df4)
would merge on column names shared in common by default.
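If the duplicated key column in the result is unwanted, one option is to rename the key so that both frames share it before merging. A minimal sketch, assuming the example frames above; the rename step, how="left", and the variable name merged are illustrative choices, not part of the original answer:

```python
import pandas as pd

df3 = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"],
     "Year": [2012, 2013, 2011, 2013],
     "Type": ["boat", "car", "truck", "boat"]},
    index=[1, 2, 3, 4],
)
df4 = pd.DataFrame(
    {"type": ["boat", "car", "truck"],
     "Parameter1": [2, 1, 5],
     "Parameter2": [8, 9, 4],
     "Parameter3": [7, 3, 2]}
)

# Rename the key so it matches df3, then merge on the single shared column.
# how="left" (an assumption) keeps every row of df3, in its original order,
# even if a Type has no parameters in df4.
merged = df3.merge(df4.rename(columns={"type": "Type"}), on="Type", how="left")
print(merged)
```

With a left join, a Type that is missing from df4 would show up with NaN parameters instead of silently disappearing, which may or may not be what you want.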

Related

Keep only the final or the latest rev of a file name

I have a dataframe with columns as below:
Name Measurement
0 Blue_Water_Final_Rev_0 3
1 Blue_Water_Final_Rev_1 4
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
4 Red_Water_Initial_Rev_0 6
I want to keep only the rows with the latest rev or rows with "Final" if the other is "Initial".
In the case above, my output will be as below:
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
How can I do this in python in my pandas dataframe? Thanks.
You can extract the name before "Final" and drop_duplicates with keep='last':
keep = (df['Name']
.str.extract('^(.*)_Final', expand=False)
.drop_duplicates(keep='last')
.dropna()
)
out = df.loc[keep.index]
NB. Assuming the data is sorted by revision.
Output:
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
If you want to keep all duplicates of the last revision:
out = df[df['Name'].isin(df.loc[keep.index, 'Name'])]
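The approach above assumes the rows are already sorted by revision. If that is not guaranteed, a possible sketch is to extract the revision number and sort on it first. The helper column rev and the unsorted sample data are assumptions made for illustration:

```python
import pandas as pd

# Same rows as in the question, deliberately out of order.
df = pd.DataFrame({
    "Name": ["Blue_Water_Final_Rev_2", "Blue_Water_Final_Rev_0",
             "Blue_Water_Final_Rev_1", "Red_Water_Final_Rev_0",
             "Red_Water_Initial_Rev_0"],
    "Measurement": [5, 3, 4, 7, 6],
})

# Pull the trailing revision number out of the name and sort by it.
rev = df["Name"].str.extract(r"_Rev_(\d+)$", expand=False).astype(int)
df_sorted = df.assign(rev=rev).sort_values("rev", kind="stable")

# Same idea as above: keep the last (highest) revision per "Final" prefix.
keep = (df_sorted["Name"]
        .str.extract(r"^(.*)_Final", expand=False)
        .drop_duplicates(keep="last")
        .dropna())
out = df_sorted.loc[keep.index].drop(columns="rev").sort_index()
print(out)
```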
If a group may have only Initial rows and no Final row, and you need to keep it, use Series.str.extract to get three columns (the group, Final or Initial, and the revision number), convert the last column to integers, sort by all columns with DataFrame.sort_values, and keep the last row per group with DataFrame.duplicated:
print (df)
Name Measurement
0 Blue_Water_Final_Rev_0 3
1 Blue_Water_Final_Rev_1 4
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
4 Red_Water_Initial_Rev_0 6
5 Green_Water_Initial_Rev_0 6
df1 = (df['Name'].str.extract(r'(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$')
.assign(c=lambda x: x.c.astype(int)))
df = df[~df1.sort_values(['a','c','b'], ascending=[True, True, False])
.duplicated('a', keep='last')]
print (df)
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
5 Green_Water_Initial_Rev_0 6
But if you need to remove all Initial rows and process only Final rows, start the same way as above, then filter out the rows with Initial and select the last revisions with DataFrame.loc and SeriesGroupBy.idxmax:
df1 = (df['Name'].str.extract(r'(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$')
.assign(c=lambda x: x.c.astype(int)))
df = df.loc[df1[df1.b.ne('Initial')].groupby('a')['c'].idxmax()]
print (df)
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
You can use df.iloc[2:4, :] for this; it selects the rows at positions 2 and 3.

How to Compare 2 DataFrames which have different Column Names(have same and different values) in Python

I have several dataframes. Some of their column names are the same and some differ, and columns may hold the same values regardless of their names. I want to find the columns in one dataset whose values match those of columns in another dataset (whether or not the column names match). Is there an efficient way to do that in Python?
For example:
df1:
   ID  count Name
    0      1    A
    1      2    B
    2      3    C
df2:
   person_id  count_number Name  Value
           0             1    A     11
           2             3    C     22
           3             4    D     33
df3:
   key  Value
    11     11
    22     22
    33     33
I tried isin(), but it is not efficient, and datacompy, which I can't use because my column names differ.
My expected output is the names of the columns that match, ideally together with the number of matching values.
In this example, I want to find the matching columns of df1, df2 and df3. The output I want is their pairwise matches. For df1 and df2: ID & person_id, count & count_number, Name; for df2 and df3: Value; and so on.
As you have not shown an exact expected output, it's hard to answer. A first suggestion:
>>> df1.merge(df2, left_on='ID', right_on='person_id').merge(df3, on='Value')
ID count Name_x person_id count_number Name_y Value key
0 0 1 A 0 1 A 11 11
1 2 3 C 2 3 C 22 22
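Another way to read the question is as a search over column pairs. The sketch below counts the distinct values each pair of columns has in common. The function name matching_columns and its output layout are hypothetical, and the brute-force loop is only reasonable for a modest number of columns:

```python
from itertools import product

import pandas as pd


def matching_columns(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """List column pairs that share at least one value, with the shared count."""
    rows = []
    for lcol, rcol in product(left.columns, right.columns):
        # Compare distinct, non-null values only.
        shared = set(left[lcol].dropna()) & set(right[rcol].dropna())
        if shared:
            rows.append((lcol, rcol, len(shared)))
    return pd.DataFrame(rows, columns=["left_column", "right_column", "n_shared_values"])


df1 = pd.DataFrame({"ID": [0, 1, 2], "count": [1, 2, 3], "Name": ["A", "B", "C"]})
df2 = pd.DataFrame({"person_id": [0, 2, 3], "count_number": [1, 3, 4],
                    "Name": ["A", "C", "D"], "Value": [11, 22, 33]})

pairs = matching_columns(df1, df2)
print(pairs)
```

Note that small integer columns overlap by coincidence (for example ID and count_number share the value 1), so the counts are a hint about which columns correspond, not proof.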

How to find out the difference between two dataframes irrespective of index? [duplicate]

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?
In other words, a data frame that has all the rows/columns in df1 that are not in df2?
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update:
The above method only works for data frames that don't already contain duplicates themselves. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It will produce the output below, which is wrong:
Wrong output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct Output
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
For rows, try this, where Name is the column to join on (it can be a list for multiple common columns, or you can specify left_on and right_on):
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
The indicator=True setting is useful as it adds a column called _merge, with all changes between df1 and df2, categorized into 3 possible kinds: "left_only", "right_only" or "both".
For columns, try this:
set(df1.columns).symmetric_difference(df2.columns)
Method 1 of the accepted answer will not work for data frames containing NaNs, as np.nan != np.nan. I am not sure if this is the best way, but it can be avoided with
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]
It's slower, because it needs to cast the data to strings, but thanks to this casting the NaN values compare equal (as the string 'nan').
Let's go through the code. First we cast the values to strings and apply the tuple function to each row.
df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)
Thanks to that, we get a pd.Series of tuples. Each tuple contains a whole row from df1/df2.
Then we apply the isin method to the df1 tuples to check whether each tuple "is in" df2.
The result is a pd.Series of bool values: True if the tuple from df1 is in df2. Finally, we negate the result with ~ and use it to filter df1. Long story short, we get only those rows from df1 that are not in df2.
To make it more readable, we may write it as:
df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]
import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
'Age':[23,12,34,44,28,40]})
# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)
# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)
# df1
# Age Name
# 0 23 John
# 1 45 Mike
# 2 12 Smith
# 3 34 Wale
# 4 27 Marry
# 5 44 Tom
# 6 28 Menda
# 7 39 Bolt
# 8 40 Yuswa
# df2
# Age Name
# 0 23 John
# 1 12 Smith
# 2 34 Wale
# 3 44 Tom
# 4 28 Menda
# 5 40 Yuswa
# df_1notin2
# Age Name
# 0 45 Mike
# 1 27 Marry
# 2 39 Bolt
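One caveat with checking each column separately: a row counts as "present" as soon as its Name appears anywhere in df2 and its Age appears anywhere in df2, even if they never occur in the same row. A small sketch of the difference, using made-up data and a row-wise check built on pd.MultiIndex.from_frame (one possible alternative, not the method of the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Mike"], "Age": [23, 45]})
df2 = pd.DataFrame({"Name": ["John", "Mike"], "Age": [45, 23]})

# Column-by-column check: every Name and every Age of df1 appears somewhere
# in df2, so no row is reported, although no complete row matches.
columnwise = df1[~(df1["Name"].isin(df2["Name"]) & df1["Age"].isin(df2["Age"]))]

# Row-wise check: compare whole (Name, Age) tuples instead.
df2_rows = list(df2.itertuples(index=False, name=None))
rowwise = df1[~pd.MultiIndex.from_frame(df1).isin(df2_rows)]

print(len(columnwise))  # no rows reported
print(rowwise)          # both rows of df1 are missing from df2
```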
Perhaps a simpler one-liner, with identical or different column names. It worked even when df2['Name2'] contained duplicate values.
newDf = (df1.set_index('Name1')
         .drop(df2['Name2'], errors='ignore')
         .reset_index(drop=False))
Edit 2: I figured out a new solution that does not need to set the index:
newdf=pd.concat([df1,df2]).drop_duplicates(keep=False)
Okay, I found that the highest-voted answer already contains what I figured out. Yes, this code can only be used on the condition that neither df contains duplicates.
I have a tricky method. First we set 'Name' as the index of the two dataframes given in the question. Since both dfs share the same 'Name' values, we can just drop the 'smaller' df's index from the 'bigger' df.
Here is the code:
df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)
Pandas now offers a new API to do data frame diff: pandas.DataFrame.compare
df.compare(df2)
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
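For context, a self-contained sketch that produces a comparison like the one above. The frames are made up for illustration; DataFrame.compare expects both frames to have identical row and column labels and was added in pandas 1.1:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b"],
    "col2": [1.0, 2.0, 3.0],
    "col3": [1.0, 2.0, 3.0],
})

# Copy the frame and change two cells so there is something to report.
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# Only rows and columns with at least one difference are kept;
# cells that are equal in both frames are shown as NaN.
changes = df.compare(df2)
print(changes)
```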
In addition to the accepted answer, I would like to propose a more general solution that finds the 2D set difference of two dataframes with any index/columns (they need not coincide between the two dataframes). The method also allows setting a tolerance for float elements in the comparison (it uses np.isclose).
import numpy as np
import pandas as pd
def get_dataframe_setdiff2d(df_new: pd.DataFrame,
                            df_old: pd.DataFrame,
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""
    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)
    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)
    mask_diff = ~np.isclose(new, old, rtol, atol)
    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)
    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)
    df_diff.columns = ["New", "Old"]
    return df_diff
Example:
In [1]
df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})
print("df1:\n", df1, "\n")
print("df2:\n", df2, "\n")
diff = get_dataframe_setdiff2d(df1, df2)
print("diff:\n", diff, "\n")
Out [1]
df1:
A C
0 2 2
1 1 1
2 2 2
df2:
A B
0 1 1
1 1 1
diff:
New Old
0 A 2.0 1.0
B NaN 1.0
C 2.0 NaN
1 B NaN 1.0
C 1.0 NaN
2 A 2.0 NaN
C 2.0 NaN
As mentioned here,
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
is a correct solution, but it will produce the wrong output if
df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
In that case the above solution returns an empty DataFrame. Instead, use the concat method after removing the duplicates from each dataframe.
Use concat with drop_duplicates:
df1=df1.drop_duplicates(keep="first")
df2=df2.drop_duplicates(keep="first")
pd.concat([df1,df2]).drop_duplicates(keep=False)
I had issues handling duplicates when there were duplicates on one side and at least one matching row on the other side, so I used collections.Counter to do a better diff that ensures both sides have the same count. This doesn't return duplicates, and it returns nothing for a row when both sides have the same count.
from collections import Counter
def diff(df1, df2, on=None):
    """
    :param on: same as pandas.df.merge(on) (a list of columns)
    """
    on = on if on else df1.columns
    df1on = df1[on]
    df2on = df2[on]
    c1 = Counter(df1on.apply(tuple, 'columns'))
    c2 = Counter(df2on.apply(tuple, 'columns'))
    c1c2 = c1-c2
    c2c1 = c2-c1
    df1ondf2on = pd.DataFrame(list(c1c2.elements()), columns=on)
    df2ondf1on = pd.DataFrame(list(c2c1.elements()), columns=on)
    df1df2 = df1.merge(df1ondf2on).drop_duplicates(subset=on)
    df2df1 = df2.merge(df2ondf1on).drop_duplicates(subset=on)
    return pd.concat([df1df2, df2df1])
> df1 = pd.DataFrame({'a': [1, 1, 3, 4, 4]})
> df2 = pd.DataFrame({'a': [1, 2, 3, 4, 4]})
> diff(df1, df2)
a
0 1
0 2
There is a new method in pandas, DataFrame.compare, that compares two dataframes and returns the values that changed in each column for each record.
Example
First Dataframe
Id Customer Status Date
1 ABC Good Mar 2023
2 BAC Good Feb 2024
3 CBA Bad Apr 2022
Second Dataframe
Id Customer Status Date
1 ABC Bad Mar 2023
2 BAC Good Feb 2024
5 CBA Good Apr 2024
Comparing Dataframes
print("Dataframe difference -- \n")
print(df1.compare(df2))
print("Dataframe difference keeping equal values -- \n")
print(df1.compare(df2, keep_equal=True))
print("Dataframe difference keeping same shape -- \n")
print(df1.compare(df2, keep_shape=True))
print("Dataframe difference keeping same shape and equal values -- \n")
print(df1.compare(df2, keep_shape=True, keep_equal=True))
Result
Dataframe difference --
Id Status Date
self other self other self other
0 NaN NaN Good Bad NaN NaN
2 3.0 5.0 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping equal values --
Id Status Date
self other self other self other
0 1 1 Good Bad Mar 2023 Mar 2023
2 3 5 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping same shape --
Id Customer Status Date
self other self other self other self other
0 NaN NaN NaN NaN Good Bad NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 3.0 5.0 NaN NaN Bad Good Apr 2022 Apr 2024
Dataframe difference keeping same shape and equal values --
Id Customer Status Date
self other self other self other self other
0 1 1 ABC ABC Good Bad Mar 2023 Mar 2023
1 2 2 BAC BAC Good Good Feb 2024 Feb 2024
2 3 5 CBA CBA Bad Good Apr 2022 Apr 2024
A slight variation of @liangli's nice solution that does not require changing the index of the existing dataframes:
newdf = df1.drop(df1.join(df2.set_index('Name').index))
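Finding the difference by index, assuming df2 is a subset of df1 and the indexes are carried forward when subsetting. (The set is converted to a list because recent pandas versions do not accept a set as an indexer.)
df1.loc[list(set(df1.index).symmetric_difference(set(df2.index)))].dropna()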
# Example
df1 = pd.DataFrame({"gender":np.random.choice(['m','f'],size=5), "subject":np.random.choice(["bio","phy","chem"],size=5)}, index = [1,2,3,4,5])
df2 = df1.loc[[1,3,5]]
df1
gender subject
1 f bio
2 m chem
3 f phy
4 m bio
5 f bio
df2
gender subject
1 f bio
3 f phy
5 f bio
df3 = df1.loc[list(set(df1.index).symmetric_difference(set(df2.index)))].dropna()
df3
gender subject
2 m chem
4 m bio
Defining our dataframes:
df1 = pd.DataFrame({
'Name':
['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
'Age':
[23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]
df1
Name Age
0 John 23
1 Mike 45
2 Smith 12
3 Wale 34
4 Marry 27
5 Tom 44
6 Menda 28
7 Bolt 39
8 Yuswa 40
df2
Name Age
0 John 23
2 Smith 12
3 Wale 34
5 Tom 44
6 Menda 28
8 Yuswa 40
The difference between the two would be:
df1[~df1.isin(df2)].dropna()
Name Age
1 Mike 45.0
4 Marry 27.0
7 Bolt 39.0
Where:
df1.isin(df2) returns a boolean DataFrame marking the values in df1 that equal the value at the same index and column label in df2.
~ (element-wise logical NOT) in front of the expression negates the result, so indexing df1 with it keeps the elements in df1 that are NOT in df2 (the difference between the two) and sets the rest to NaN.
.dropna() drops the rows with NaN, giving the desired output.
Note: This only works if len(df1) >= len(df2). If df2 is longer than df1 you can reverse the expression: df2[~df2.isin(df1)].dropna()
I found the deepdiff library is a wonderful tool that also extends well to dataframes if different detail is required or ordering matters. You can experiment with diffing to_dict('records'), to_numpy(), and other exports:
import pandas as pd
from deepdiff import DeepDiff
df1 = pd.DataFrame({
'Name':
['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
'Age':
[23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]
DeepDiff(df1.to_dict(), df2.to_dict())
# {'dictionary_item_removed': [root['Name'][1], root['Name'][4], root['Name'][7], root['Age'][1], root['Age'][4], root['Age'][7]]}
Symmetric Difference
If you are interested in the rows that are only in one of the dataframes but not both, you are looking for the symmetric difference:
pd.concat([df1,df2]).drop_duplicates(keep=False)
⚠️ Only works, if both dataframes do not contain any duplicates.
Set Difference / Relational Algebra Difference
If you are interested in the relational algebra difference / set difference, i.e. df1-df2 or df1\df2:
pd.concat([df1,df2,df2]).drop_duplicates(keep=False)
⚠️ Only works, if both dataframes do not contain any duplicates.
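A small sketch contrasting the two idioms above on frames without duplicates (the data is made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3]})
df2 = pd.DataFrame({"A": [2, 3, 4]})

# Symmetric difference: rows that occur in exactly one of the two frames.
sym_diff = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Set difference df1 - df2: adding df2 twice guarantees that every row of
# df2 is duplicated and therefore dropped, leaving only rows unique to df1.
set_diff = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

print(sym_diff["A"].tolist())  # [1, 4]
print(set_diff["A"].tolist())  # [1]
```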
Another possible solution is to use numpy broadcasting:
df1[np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)]
Output:
Name Age
1 Mike 45
4 Marry 27
7 Bolt 39
Using a lambda function, you can filter the rows whose _merge value is "left_only" to get all the rows in df1 that are missing from df2:
df3 = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x :x['_merge']=='left_only']
df3
Try this one:
df_new = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns='_merge')
It will produce a new dataframe with the differences: the rows that exist in df1 but not in df2.
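The merge-with-indicator idea can be wrapped in a small helper. This is a sketch, not a canonical implementation: the function name rows_only_in_left is hypothetical, it uses a left join because only rows of the first frame are wanted, and it joins on all columns the two frames share:

```python
import pandas as pd


def rows_only_in_left(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `left` that have no identical row in `right`."""
    # De-duplicating `right` keeps the join from multiplying rows of `left`.
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


df1 = pd.DataFrame({
    "Name": ["John", "Mike", "Smith", "Wale", "Marry", "Tom", "Menda", "Bolt", "Yuswa"],
    "Age": [23, 45, 12, 34, 27, 44, 28, 39, 40],
})
df2 = pd.DataFrame({
    "Name": ["John", "Smith", "Wale", "Tom", "Menda", "Yuswa"],
    "Age": [23, 12, 34, 44, 28, 40],
})

only_in_df1 = rows_only_in_left(df1, df2)
print(only_in_df1)
```

Unlike the concat/drop_duplicates approach, this keeps duplicate rows of df1 that are missing from df2. The result gets a fresh positional index from the merge, so the original index labels of df1 are not preserved.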

Pandas vectorization for a multiple data frame operation

I am looking to increase the speed of an operation within pandas and I have learned that it is generally best to do so via using vectorization. The problem I am looking for help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column, and city column
df2 = another (considerably larger) table with a date-time column, and city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am also just beginning to learn pandas vector operations and would like some help in figuring out a more optimal way to solve this problem
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
time city
0 24 A
1 20 B
2 15 C
3 10 D
4 5 E
>>> df2
time city
0 2 A
1 4 B
2 6 C
3 8 F
4 10 G
5 12 H
6 14 D
From what I understand, you only need to get all the rows in your df2 whose city also appears in df1, where the difference in the times is greater than 8 hours.
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
3 14 D 10
time_x basically is the time in your df2 dataframe, and time_y is from your df1.
Now we need to compute the difference of those times and keep the rows where it is greater than 8, using numpy.where() to flag them for filtering later:
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, ['Retain'], ['Remove'])
>>> new_df
time_x city time_y flag
0 2 A 24 Retain
1 4 B 20 Retain
2 6 C 15 Retain
3 14 D 10 Remove
Now that you have that, you can simply filter your new_df by the flag column, removing the column in the final output as such:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
And there you go, no looping needed. Hope this helps :D
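An alternative sketch that follows the loop in the question more literally: a df2 row is flagged when some df1 row of the same city is more than 8 hours earlier, which is the same as comparing against the earliest df1 time for that city. The sample data and the names earliest and gap are assumptions for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    "date-time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 10:00",
                                 "2024-01-01 12:00"]),
    "city": ["A", "A", "B"],
})
df2 = pd.DataFrame({
    "date-time": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 05:00",
                                 "2024-01-01 13:00", "2024-01-01 13:00"]),
    "city": ["A", "A", "B", "C"],
})

# Earliest df1 timestamp per city; only this one matters for the ">" test.
earliest = df1.groupby("city")["date-time"].min()

# Look up each df2 row's city; cities missing from df1 become NaT.
gap = df2["date-time"] - df2["city"].map(earliest)

# Comparisons against NaT are False, so unmatched cities are not flagged.
df2["result"] = gap > pd.Timedelta("8 hours")
print(df2)
```

One difference from the original loop: rows that are not flagged get False here, whereas the loop leaves them unset (NaN).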

How to Stack Data Frames on top of one another (Pandas,Python3)

Let's say I have 3 pandas DataFrames
DF1
Words Score
The Man 2
The Girl 4
Df2
Words2 Score2
The Boy 6
The Mother 7
Df3
Words3 Score3
The Son 3
The Daughter 4
Right now, I have them concatenated together so that it becomes 6 columns in one DF. That's all well and good but I was wondering, is there a pandas function to stack them vertically into TWO columns and change the headers?
So to make something like this?
Family Members Score
The Man 2
The Girl 4
The Boy 6
The Mother 7
The Son 3
The Daughter 4
everything I'm reading here http://pandas.pydata.org/pandas-docs/stable/merging.html seems to only have "horizontal" methods of joining DF!
As long as you rename the columns so that they're the same in each dataframe, pd.concat() should work fine:
# I read in your data as df1, df2 and df3 using:
# df1 = pd.read_clipboard(sep=r'\s\s+')
# Example dataframe:
Out[8]:
Words Score
0 The Man 2
1 The Girl 4
all_dfs = [df1, df2, df3]
# Give all df's common column names
for df in all_dfs:
    df.columns = ['Family_Members', 'Score']
pd.concat(all_dfs).reset_index(drop=True)
Out[16]:
Family_Members Score
0 The Man 2
1 The Girl 4
2 The Boy 6
3 The Mother 7
4 The Son 3
5 The Daughter 4
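If the original frames should keep their own column names, a variant is to relabel copies on the fly instead of assigning to df.columns. A minimal sketch, assuming the three frames from the question; set_axis returns a new frame and ignore_index=True takes the place of reset_index(drop=True):

```python
import pandas as pd

df1 = pd.DataFrame({"Words": ["The Man", "The Girl"], "Score": [2, 4]})
df2 = pd.DataFrame({"Words2": ["The Boy", "The Mother"], "Score2": [6, 7]})
df3 = pd.DataFrame({"Words3": ["The Son", "The Daughter"], "Score3": [3, 4]})

new_names = ["Family Members", "Score"]

# Relabel each frame's columns on a copy, then stack them vertically.
stacked = pd.concat(
    [d.set_axis(new_names, axis=1) for d in (df1, df2, df3)],
    ignore_index=True,
)
print(stacked)
```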
