Python: compare dataframes based on two conditions

I have the following two dataframes:
df1:
date id
2000 1
2001 1
2002 2
df2:
date id
2000 1
2002 2
I now want to extract a list of observations that are in df1 but not in df2 based on date AND id.
The result should look like this:
date id
2001 1
I know how to compare a column to a list with isin, like this:
result = df1[~df1["id"].isin(df2["id"].tolist())]
However, this compares the two dataframes based on the id column alone. Since an id could be present in both df1 and df2 but for different dates, it is important that the comparison uses both columns, id and date, together. Does somebody know how to do that?

Using merge
In [795]: (df1.merge(df2, how='left', indicator='_a')
              .query('_a == "left_only"')
              .drop(columns='_a'))
Out[795]:
date id
1 2001 1
Details
In [796]: df1.merge(df2, how='left', indicator='_a')
Out[796]:
date id _a
0 2000 1 both
1 2001 1 left_only
2 2002 2 both
In [797]: df1.merge(df2, how='left', indicator='_a').query('_a == "left_only"')
Out[797]:
date id _a
1 2001 1 left_only
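An alternative sketch, extending the asker's own isin idea to both columns by comparing (date, id) tuples (column names and data as in the question):
import pandas as pd

df1 = pd.DataFrame({"date": [2000, 2001, 2002], "id": [1, 1, 2]})
df2 = pd.DataFrame({"date": [2000, 2002], "id": [1, 2]})

# Collect the (date, id) pairs present in df2, then keep only the
# df1 rows whose pair is absent from that set
pairs = set(zip(df2["date"], df2["id"]))
mask = [pair not in pairs for pair in zip(df1["date"], df1["id"])]
result = df1[mask]
print(result)
#    date  id
# 1  2001   1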

Related

Combining two pandas dataframes into one based on conditions

I have two dataframes; simplified, they look like this:
Dataframe A:
ID  item
1   apple
2   peach
Dataframe B:
ID  flag  price ($)
1   A     3
1   B     2
2   B     4
2   A     2
ID: unique identifier for each item
flag: unique identifier for each vendor
price: varies for each vendor
In this simplified case I want to extract the price values of dataframe B and add them to dataframe A in separate columns depending on their flag value.
The result should look similar to this:
Dataframe C:
ID  item   price_A  price_B
1   apple  3        2
2   peach  2        4
I tried splitting dataframe B into two dataframes by flag value and merging each with dataframe A afterwards, but there must be an easier solution.
Thank you in advance! :)
You can use pd.merge and pd.pivot_table for this:
df_C = pd.merge(df_A, df_B, on=['ID']).pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
df_C.columns = ['price_' + alpha for alpha in df_C.columns]
df_C = df_C.reset_index()
Output:
>>> df_C
ID item price_A price_B
0 1 apple 3 2
1 2 peach 2 4
Or, as a single chain:
(df_B
.merge(df_A, on="ID")
.pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
.add_prefix("price_")
.reset_index()
)
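For completeness, a minimal reproducible sketch with the question's data. Note that pivot_table aggregates duplicate (ID, flag) pairs with the mean by default, so pass aggfunc explicitly if that matters:
import pandas as pd

df_A = pd.DataFrame({"ID": [1, 2], "item": ["apple", "peach"]})
df_B = pd.DataFrame({"ID": [1, 1, 2, 2],
                     "flag": ["A", "B", "B", "A"],
                     "price ($)": [3, 2, 4, 2]})

df_C = (df_B.merge(df_A, on="ID")
            .pivot_table(index=["ID", "item"], columns="flag",
                         values="price ($)", aggfunc="first")
            .add_prefix("price_")
            .reset_index())
print(df_C)
#    ID   item  price_A  price_B
# 0   1  apple        3        2
# 1   2  peach        2        4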

Merge dataframes based on column values with duplicated rows

I want to merge two dataframes based on equal column values. The problem is that one of my columns has duplicated values, which cannot be dropped since they are correlated with other columns. Here's an example of my two dataframes:
Essentially, I want to merge these two dataframes on equal values of the FromPatchID (df1) and Id (df2) columns, in order to get something like this:
FromPatchID ToPatchID ... Id MMM LB
1 1 ... 1 26.67 27.67
1 2 ... 1 26.67 27.67
1 3 ... 1 26.67 27.67
2 1 ... 2 26.50 27.50
3 1 ... 3 26.63 27.63
I already tried a simple merge with df_merged = pd.merge(df1, df2, on=['FromPatchID','Id']), but I got a KeyError telling me to check for duplicates in the FromPatchID column.
You have to specify the different column names to match on with left_on and right_on. Also specify how='right' to use only keys from the right frame.
df_merged = pd.merge(df1, df2, left_on='FromPatchID', right_on='Id', how='right')
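A minimal runnable sketch, with hypothetical data reconstructed from the expected output above:
import pandas as pd

# Hypothetical data inferred from the question's expected output
df1 = pd.DataFrame({"FromPatchID": [1, 1, 1, 2, 3],
                    "ToPatchID":   [1, 2, 3, 1, 1]})
df2 = pd.DataFrame({"Id":  [1, 2, 3],
                    "MMM": [26.67, 26.50, 26.63],
                    "LB":  [27.67, 27.50, 27.63]})

# Each df2 row is repeated for every matching FromPatchID row in df1
df_merged = pd.merge(df1, df2, left_on='FromPatchID', right_on='Id', how='right')
print(df_merged)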

Match multiple columns on Python to a single value

I hope you are doing well.
I am trying to perform a match based on multiple columns, where the values of column B of df1 are scattered across three to four columns in df2. The goal is to return the value of column A of df2 whenever a value of column B matches any value in columns C, D, or E.
What I have done until now is multiple left merges (renaming column B each time to match the names of columns C, D, and E of df2).
I am trying to simplify the process but am unsure how to do this.
My dataset looks like that:
df1:
ID
0 77
1 4859
2 LSP
df2:
X id1 id2 id3
0 AAAAA_XX 889 77 BSP
1 BBBBB_XX 4859 CC 998P
2 CCCC_YY YUI TYU LSP
My goal is to have in df1:
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
Thank you very much!
You can first gather all the values of the id columns into one column with pd.concat, then merge the tables like this:
# Stack the id columns into a single lookup column (id3 included as well)
df3 = pd.concat([df2.id1, df2.id2, df2.id3]).reset_index()
df1 = df2.merge(df3, how="left", left_on=df1.ID, right_on=df3[0])
df1 = df1.iloc[:, :2]                          # keep only the key and X columns
df1 = df1.rename(columns={"key_0": "ID"})
Not the most beautiful code in the world, but it works.
output:
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
Use DataFrame.merge with DataFrame.melt:
df = df1.merge(df2.melt(id_vars='X', value_name='ID').drop('variable', axis=1),
               how='left',
               on='ID')
print (df)
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
If duplicated ID values are possible, use:
df = (df1.merge(df2.melt(id_vars='X', value_name='ID')
                   .drop('variable', axis=1)
                   .drop_duplicates('ID'),
          how='left',
          on='ID'))
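A reproducible sketch of the melt approach with the question's data (IDs stored as strings, since they mix numbers and text):
import pandas as pd

df1 = pd.DataFrame({"ID": ["77", "4859", "LSP"]})
df2 = pd.DataFrame({"X":   ["AAAAA_XX", "BBBBB_XX", "CCCC_YY"],
                    "id1": ["889", "4859", "YUI"],
                    "id2": ["77", "CC", "TYU"],
                    "id3": ["BSP", "998P", "LSP"]})

# melt turns the wide id1/id2/id3 columns into one long "ID" column,
# which can then be merged against df1 directly
df = df1.merge(df2.melt(id_vars='X', value_name='ID').drop('variable', axis=1),
               how='left', on='ID')
print(df)
#      ID         X
# 0    77  AAAAA_XX
# 1  4859  BBBBB_XX
# 2   LSP   CCCC_YY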

Merging Dataframe with Different Dates?

I want to merge a separate dataframe (df2) into the main dataframe (df1), but if, for a given row, the date in df1 does not exist in df2, then use the most recent df2 date before that df1 date.
I tried to use pd.merge, but it removes rows with unmatched dates and only keeps the rows that match in both df's.
df1 = [['2007-01-01','A'],
       ['2007-01-02','B'],
       ['2007-01-03','C'],
       ['2007-01-04','B'],
       ['2007-01-06','C']]
df2 = [['2007-01-01','B',3],
       ['2007-01-02','A',4],
       ['2007-01-03','B',5],
       ['2007-01-06','C',3]]
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
df1[0] = pd.to_datetime(df1[0])
df2[0] = pd.to_datetime(df2[0])
Current result of pd.merge():
0 1 2
0 2007-01-06 C 3
This only keeps dates that match exactly in both df's; it does not fall back to values from the most recent earlier dates.
Expected df1:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3
2 2007-01-03 C NaN
3 2007-01-04 B 3
4 2007-01-06 C 3
The NaNs appear because no data exists on or before that date in df2. Index row 1 takes its value from the day before, while index row 4 matches exactly on the same day.
Check your output by using merge_asof:
pd.merge_asof(df1, df2, on=0, by=1, allow_exact_matches=True)
Out[15]:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3.0
2 2007-01-03 C NaN
3 2007-01-04 B 5.0 # this should be 5, not 3: the 2007-01-03 B row is the closest earlier date (note df2 has two B rows)
4 2007-01-06 C 3.0
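Note that merge_asof requires both frames to be sorted by the on key, and its default direction='backward' already picks the most recent earlier (or equal) date, which is exactly the semantics asked for. A defensive sketch, sorting first in case the inputs are unsorted:
out = pd.merge_asof(df1.sort_values(0), df2.sort_values(0),
                    on=0, by=1, direction='backward')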
Using your merge code, which I assume you have since it's not present in your question, add the argument how='left' or how='outer'.
It should look like this:
dfmerged = pd.merge(df1, df2, how='left', left_on=['Date'], right_on=['Date'])
You can then use slicing and renaming to keep the columns you wish.
dfmerged = dfmerged[['Date', 'Letters', 'Numbers']]
Note: I do not know your column names since you haven't shown any code; substitute as necessary.

pandas how to outer join without creating new columns

I have 2 pandas dataframes like this:
date value
20100101 100
20100102 150
date value
20100102 150.01
20100103 180
The expected output should be:
date value
20100101 100
20100102 150
20100103 180
The 2nd dataframe always contains the newest values that I'd like to add into the 1st dataframe. However, the value on the same day may differ slightly between the two dataframes. I would like to ignore the overlapping dates and focus on adding the new dates and values into the 1st dataframe.
I've tried an outer join in pandas, but it gives me two columns, value_x and value_y, because the values are not exactly the same on the same dates. Any solution to this?
I believe you need concat with drop_duplicates:
df = pd.concat([df1,df2]).drop_duplicates('date', keep='last')
print (df)
date value
0 20100101 100.00
0 20100102 150.01
1 20100103 180.00
Or, if you want to keep the 1st dataframe's value on overlapping dates:
df = pd.concat([df1,df2]).drop_duplicates('date', keep='first')
print (df)
date value
0 20100101 100.0
1 20100102 150.0
1 20100103 180.0
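An alternative sketch using combine_first, equivalent to the keep='first' variant above (df1's values win on overlapping dates):
# Index both frames by date so combine_first can align them;
# values from df1 take priority, and dates only in df2 are appended
df = (df1.set_index('date')
         .combine_first(df2.set_index('date'))
         .reset_index())
print (df)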
