Difference of two DataFrames based on two columns in Python pandas

Df1
A B C
1 1 'a'
2 3 'b'
3 4 'c'
Df2
A B C
1 1 'k'
5 4 'e'
Expected output (after taking the difference of Df1 and Df2 and then merging), i.e. all rows of Df1 plus the rows of Df2 whose (A, B) pair does not appear in Df1:
A B C
1 1 'a'
2 3 'b'
3 4 'c'
5 4 'e'
The difference should be based on two columns A and B and not all three columns. I do not care what column C contains in both Df2 and Df1.

Try setting A and B as the index and using combine_first:
In [44]: df1.set_index(['A','B']).combine_first(df2.set_index(['A','B'])).reset_index()
Out[44]:
A B C
0 1 1 'a'
1 2 3 'b'
2 3 4 'c'
3 5 4 'e'
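
For a self-contained check, the frames from the question can be rebuilt and passed through the same call. This is only a minimal sketch: the name `result` is illustrative, and it assumes each (A, B) pair occurs at most once within each frame.

```python
import pandas as pd

# Rebuild the two frames from the question
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 3, 4], "C": ["a", "b", "c"]})
df2 = pd.DataFrame({"A": [1, 5], "B": [1, 4], "C": ["k", "e"]})

# With (A, B) as the index, combine_first keeps df1's values and only
# takes values from df2 where df1 has nothing for that (A, B) pair
result = (
    df1.set_index(["A", "B"])
       .combine_first(df2.set_index(["A", "B"]))
       .reset_index()
)
print(result)
```

Because the match is done on the index, column C from df2 is ignored whenever df1 already has a row for the same key.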

This is an outer join on A and B; column C is then filled from df2 wherever df1 has no value:
dfx = df1.merge(df2, how='outer', on=['A', 'B'])
dfx['C'] = dfx.apply(
    lambda r: r.C_x if not pd.isnull(r.C_x) else r.C_y, axis=1)
dfx[['A', 'B', 'C']]
=>
A B C
0 1 1 a
1 2 3 b
2 3 4 c
3 5 4 e
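
The row-wise apply can probably be replaced by a column-wise fillna, which expresses the same "prefer df1, fall back to df2" rule. The sketch below rebuilds the question's data; `result` is an illustrative name, and it assumes a genuine missing value in df1's C should also be filled from df2.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 3, 4], "C": ["a", "b", "c"]})
df2 = pd.DataFrame({"A": [1, 5], "B": [1, 4], "C": ["k", "e"]})

# Outer join on the key columns; the clashing C columns get the
# default suffixes _x (from df1) and _y (from df2)
dfx = df1.merge(df2, how="outer", on=["A", "B"])

# Prefer df1's value and fall back to df2's value where it is missing
dfx["C"] = dfx["C_x"].fillna(dfx["C_y"])
result = dfx[["A", "B", "C"]]
print(result)
```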

Using concat and drop_duplicates:
output = pd.concat([df1, df2])
output = output.drop_duplicates(subset = ["A", "B"], keep = 'first')
Desired df:
A B C
0 1 1 a
1 2 3 b
2 3 4 c
1 5 4 e
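
The repeated index label 1 in the output above comes from pd.concat keeping each frame's original row labels. If a clean 0..n-1 index is wanted, one option is to discard the labels, sketched here on the question's data (the name `output` follows the answer; everything else is standard pandas):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 3, 4], "C": ["a", "b", "c"]})
df2 = pd.DataFrame({"A": [1, 5], "B": [1, 4], "C": ["k", "e"]})

output = (
    pd.concat([df1, df2], ignore_index=True)          # df1 rows come first
      .drop_duplicates(subset=["A", "B"], keep="first")  # so df1 wins on ties
      .reset_index(drop=True)                         # renumber the surviving rows
)
print(output)
```

The order of the frames in the list matters: keep='first' retains the df1 row only because df1 is concatenated first.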

Related

Join tables and create combinations in python

Apologies in advance: the title is a bit fuzzy.
I have two tables in Python. One contains unique names, for example 'A', 'B', 'C'; the other contains a time series of months, for example 10/2021, 11/2021, 12/2021. I want to join the tables so that every name is paired with every timestamp. The final data should look like this:
Month    Name
10/2021  A
11/2021  A
12/2021  A
10/2021  B
11/2021  B
12/2021  B
10/2021  C
11/2021  C
12/2021  C

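Adapted from the "cartesian product in pandas" answer:
import pandas as pd
df1 = pd.DataFrame([1, 2, 3], columns=['A'])
df2 = pd.DataFrame(["a", "b", "c"], columns=['B'])
# A constant helper key makes every row of df1 match every row of df2
df = (df1.assign(key=1)
      .merge(df2.assign(key=1), on="key")
      .drop("key", axis=1)
)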
A B
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 2 c
6 3 a
7 3 b
8 3 c
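
Applied to the Month/Name tables from the question, the same idea can be written without a helper key by using merge with how="cross". This is a sketch that assumes pandas 1.2 or newer (where the cross join was added); the frame names `names` and `months` are made up for illustration.

```python
import pandas as pd

names = pd.DataFrame({"Name": ["A", "B", "C"]})
months = pd.DataFrame({"Month": ["10/2021", "11/2021", "12/2021"]})

# how="cross" pairs every row of `names` with every row of `months`;
# no `on` argument is given because there is no join key
result = names.merge(months, how="cross")[["Month", "Name"]]
print(result)
```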

If you only need the cartesian product of the values, you can use itertools.product:
import pandas as pd
from itertools import product
df1 = pd.DataFrame(list('abcd'), columns=['letters'])
df2 = pd.DataFrame(list('1234'), columns=['numbers'])
df_combined = pd.DataFrame(product(df1['letters'], df2['numbers']), columns=['letters', 'numbers'])
output
letters numbers
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
12 d 1
13 d 2
14 d 3
15 d 4
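
A pandas-only variant of the same product is MultiIndex.from_product followed by to_frame. The sketch below uses the names and months from the question as plain lists; `table` is an illustrative name.

```python
import pandas as pd

names = ["A", "B", "C"]
months = ["10/2021", "11/2021", "12/2021"]

# from_product enumerates every (Name, Month) pair;
# the first level varies slowest, the last level fastest
idx = pd.MultiIndex.from_product([names, months], names=["Name", "Month"])

# Turn the index levels into ordinary columns and reorder them
table = idx.to_frame(index=False)[["Month", "Name"]]
print(table)
```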

drop rows using pandas groupby and filter

I'm trying to drop rows from a df where certain conditions are met. In the code below, I group values by column C. For each group, I want to drop ALL of its rows if any single row has A less than 1 AND B greater than 100. Both conditions have to hold on the same row. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
'A' : [1,0,1,0,1,0,0,1,0,1],
'B' : [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
'C' : ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f

For better performance, collect all C values that match the condition, then filter the original column C with Series.isin and an inverted mask in boolean indexing:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
Another idea is to use GroupBy.transform with 'any' to test whether at least one row in each group matches:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
Your solution also works if the lambda returns a scalar, using any with not (or the equivalent all form); on a large DataFrame it is likely to be slow:
df1 = df.groupby(['C']).filter(lambda g:not ( g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print (df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
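
To see that the approaches agree, they can be run side by side on the question's data. This is a small sketch; the names `bad_row`, `by_isin`, `by_transform` and `by_filter` are illustrative, and column C is assumed to end in three 'f' values so that all columns have ten entries.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    "B": [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
    "C": ["d", "d", "d", "d", "e", "e", "e", "f", "f", "f"],
})

# Row-wise condition: both tests must hold on the same row
bad_row = df["A"].lt(1) & df["B"].gt(100)

# 1) drop every group label that owns at least one bad row
by_isin = df[~df["C"].isin(df.loc[bad_row, "C"])]

# 2) broadcast "this group has a bad row" back onto every row
by_transform = df[~bad_row.groupby(df["C"]).transform("any")]

# 3) groupby.filter with a lambda that returns a single bool per group
by_filter = df.groupby("C").filter(
    lambda g: not (g["A"].lt(1) & g["B"].gt(100)).any()
)

print(by_isin)
```

The first two variants build one boolean mask over the whole frame, while filter calls the lambda once per group, which is the likely reason it is slower on large data.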

Add all columns from one dataframe to another without joining on a key/index

Given two DataFrames df1 and df2 with the same number of rows, how can we, very simply, take all the columns from df2 and add them to df1? join works on the index or on a given column, but assume their indexes are completely different and they have no columns in common. Is that doable (without the obvious approach of looping over each column in df2 and adding it to df1 as a new column)?
EDIT: added an example.
Note: no index or column names are given, since they should not matter (that is the "problem").
df1 = [[1,3,2],
       [11,20,33]]
df2 = [["bird",np.nan,37,np.sqrt(2)],
       ["dog",0.123,3.14,0]]
pd.some_operation(df1,df2)
#[[1,3,2,"bird",np.nan,37,np.sqrt(2)],
# [11,20,33,"dog",0.123,3.14,0]]

Samples:
df1 = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
}, index = list('QRSTUW'))
df2 = pd.DataFrame({
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
}, index = list('KLMNOP'))
pandas always aligns on index values when you use join or concat with axis=1, so for correct alignment you need to give both DataFrames the same index values:
df = df1.join(df2.set_index(df1.index))
df = pd.concat([df1, df2.set_index(df1.index)], axis=1)
print (df)
A B C D E F
Q a 4 7 1 5 a
R b 5 8 3 3 a
S c 4 9 5 6 a
T d 5 4 7 9 b
U e 5 2 1 2 b
W f 4 3 0 4 b
Or create a default index in both DataFrames:
df = df1.reset_index(drop=True).join(df2.reset_index(drop=True))
df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
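
The reason the index has to be aligned first becomes clear when concat is called on frames whose labels do not overlap. The sketch below uses smaller frames than the samples above; `naive` and `paired` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"A": list("abc"), "B": [4, 5, 4]}, index=list("QRS"))
df2 = pd.DataFrame({"D": [1, 3, 5], "E": [5, 3, 6]}, index=list("KLM"))

# Label-based alignment: no labels are shared, so nothing is paired up
# and every row is padded with NaN on one side
naive = pd.concat([df1, df2], axis=1)
print(naive.shape)   # (6, 4)

# Positional pairing: discard both indexes before concatenating
paired = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)
print(paired.shape)  # (3, 4)
```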

Assign each unique value of a column to a whole DataFrame, as if the DataFrame duplicated itself for each value of another column

I am trying to iterate over the values of a column in df2 and assign each value to df1, as if df1 were repeated once for each value of that column.
Let's say I have df1 as below:
df1
1
2
3
and df2 as per below:
df2
A
B
C
I want a third DataFrame, df3, to look like this:
df3
1 A
2 A
3 A
1 B
2 B
3 B
1 C
2 C
3 C
So far I have tried the code below:
for i, value in ACS_shock['scenario'].iteritems():
    df1['sec'] = df1[i] = value[:]
But when I generate the file from df1, my output looks like this:
1 A B C
2 A B C
3 A B C
Any idea how I can correct this code?
Much appreciated.

You can use pd.concat and np.repeat:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.Series([1,2,3])
>>> df1
0 1
1 2
2 3
dtype: int64
>>> df2 = pd.Series(list('ABC'))
>>> df2
0 A
1 B
2 C
dtype: object
>>> df3 = pd.DataFrame({'df1': pd.concat([df1]*3).reset_index(drop=True),
...                     'df2': np.repeat(df2, 3).reset_index(drop=True)})
>>> df3
df1 df2
0 1 A
1 2 A
2 3 A
3 1 B
4 2 B
5 3 B
6 1 C
7 2 C
8 3 C
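
The same layout can be produced without hard-coding the factor 3, by working on the underlying arrays with np.tile and np.repeat. This is a sketch under the assumption that both inputs are one-dimensional; `s1` and `s2` are illustrative names for the two Series.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series(list("ABC"))

df3 = pd.DataFrame({
    # tile repeats the whole sequence: 1, 2, 3, 1, 2, 3, ...
    "df1": np.tile(s1.to_numpy(), len(s2)),
    # repeat repeats each element in place: A, A, A, B, B, B, ...
    "df2": np.repeat(s2.to_numpy(), len(s1)),
})
print(df3)
```

Using len(s1) and len(s2) keeps the two columns the same length even when the inputs differ in size.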

Adding a column in dataframes based on similar columns in them

I want to sum column d across d1 and d2 for rows where a, b and c are the same (like a groupby).
For example:
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
That is, merge the two DataFrames and add up column d where a, b and c are the same.
d1.add(d2) or radd gives me an aggregate of all columns.
The solution should be a DataFrame that can be added to another one in the same way.
Any help is appreciated.

You can use set_index first:
print (d2.set_index(['a','b','c'])
.add(d1.set_index(['a','b','c']), fill_value=0)
.astype(int)
.reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5

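Concatenating and dropping duplicates only removes repeated rows; it does not sum d (the first row keeps d = 4 rather than 8):
df = pd.concat([d1, d2])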
df.drop_duplicates()
a b c d
0 1 2 3 4
1 2 3 4 5
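
Another way to get the summed result asked for in the question (d = 8 for the first row) is to concatenate and then group on a, b and c. This is a minimal sketch on the question's data; `summed` is an illustrative name.

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "b", "c", "d"])
d2 = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5]], columns=["a", "b", "c", "d"])

# Stack the rows, then add up d within each (a, b, c) combination;
# as_index=False keeps a, b, c as ordinary columns
summed = (
    pd.concat([d1, d2])
      .groupby(["a", "b", "c"], as_index=False)["d"]
      .sum()
)
print(summed)
```

The result is again a plain DataFrame with columns a, b, c, d, so it can be concatenated with a further frame and summed in the same way.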
