join two pandas dataframes using a specific column - python

I am new to pandas and I am trying to join two dataframes based on the equality of one specific column. For example, suppose I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns and the value of only one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks

pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner')
If you wish to maintain column A as a non-index column, then:
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner').reset_index()

Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
The suffixes then let you keep track of each value's origin.
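A minimal runnable sketch of the merge approach, reconstructing the two frames from the question:

```python
import pandas as pd

# The two example frames from the question
df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 2], 'C': [3, 2]})
df2 = pd.DataFrame({'A': [5, 2], 'B': [6, 8], 'C': [7, 9]})

# Inner merge on A; suffixes disambiguate the overlapping B and C columns
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
print(df3)
#    A  B_1  C_1  B_2  C_2
# 0  2    2    2    8    9
```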

Related

How to reduce conditionality of a categorical feature using a lookup table

I have a dataframe (df1) with one categorical column:
df1=pd.DataFrame({'COL1': ['AA','AB','BC','AC','BA','BB','BB','CA','CB','CD','CE']})
I have another dataframe (df2) which has two columns:
df2=pd.DataFrame({'Category':['AA','AB','AC','BA','BB','BC','CA','CB','CC','CD','CE','CF'],'general_mapping':['A','A','A','B','B','B','C','C','C','C','C','C']})
I need to modify df1 using df2 so that it finally looks like:
df1->> ({'COL1': ['A','A','B','A','B','B','B','C','C','C','C']})
You can use pd.Series.map after setting Category as index using df.set_index.
df1['COL1'] = df1['COL1'].map(df2.set_index('Category')['general_mapping'])
df1
COL1
0 A
1 A
2 B
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
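Put together as a self-contained sketch (note that any COL1 value missing from df2's Category column would become NaN, which you could handle with a chained `.fillna(...)`):

```python
import pandas as pd

df1 = pd.DataFrame({'COL1': ['AA', 'AB', 'BC', 'AC', 'BA', 'BB',
                             'BB', 'CA', 'CB', 'CD', 'CE']})
df2 = pd.DataFrame({'Category': ['AA', 'AB', 'AC', 'BA', 'BB', 'BC',
                                 'CA', 'CB', 'CC', 'CD', 'CE', 'CF'],
                    'general_mapping': ['A', 'A', 'A', 'B', 'B', 'B',
                                        'C', 'C', 'C', 'C', 'C', 'C']})

# Category becomes the index, so map() can look each COL1 value up in it
df1['COL1'] = df1['COL1'].map(df2.set_index('Category')['general_mapping'])
print(df1['COL1'].tolist())
# ['A', 'A', 'B', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
```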

Joining two dataframes that have overlapping columns without specifying them, overwriting one of them (in python)

I'm struggling with the following problem:
I have two dataframes:
df1
A B C
1 5 8
2 1 2
3 2 1
4 3 6
and df2 with the same column names, but fewer columns than df1:
A B
1 1
8 2
1 5
6 3
df1 and df2 always have the same number of rows; only the number of columns of df2 is less than or equal to that of df1. The column names are the same, but not necessarily the values in the columns (they can be equal, but this is definitely not always the case).
Now, I want to create a new dataframe where the overlapping columns between df1 and df2 (columns A and B, NOT C) are determined by df2, but which has the same shape as df1 (so df1 dominates in number of columns, but df2 dominates in which values to take for the overlapping columns). Importantly, I don't want to have to specify which columns overlap.
So the result should give:
df3:
A B C
1 1 8
8 2 2
1 5 1
6 3 6
Is this possible, especially given that the overlapping columns are not specified upfront? Does anyone have a clever solution? It does not seem achievable with any of the variations of merge and join.
As long as there are no column labels in df2 not present in df1, you can use
df3 = df1.copy()
df3.loc[:,df2.columns] = df2
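A runnable sketch using the frames from the question; note this relies on df1 and df2 sharing the same row index (here the default RangeIndex), since the assignment aligns on index:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 1, 2, 3], 'C': [8, 2, 1, 6]})
df2 = pd.DataFrame({'A': [1, 8, 1, 6], 'B': [1, 2, 5, 3]})

# Overwrite only the columns the two frames share; no need to name them
df3 = df1.copy()
df3.loc[:, df2.columns] = df2
print(df3)
#    A  B  C
# 0  1  1  8
# 1  8  2  2
# 2  1  5  1
# 3  6  3  6
```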

Filter out non-existent numbers on columns A or B in a pandas dataframe?

I have two dataframes. df1 has the form
deva devb c
1 3 5
and df2 has the form
dev
1
3
Now, I would like to join both dataframes so that I get only the numbers in deva or devb that appear in dev. In other words, I'd like to filter out the numbers that are not in df2. I've tried the following, but to no avail:
df1 = df2.merge(df1, left_on=["dev", "dev"], right_on=["deva","devb"])
How do you join/merge with an "OR" of two different columns?
Use DataFrame.where + Series.isin:
df1[['deva', 'devb']] = df1[['deva', 'devb']].where(df1[['deva', 'devb']].isin(df2['dev'].tolist()))
# deva devb c
#0 1 3 5
Check with isin + any:
df1[df1[['deva','devb']].isin(df2.dev.tolist()).any(axis=1)]
Out[76]:
deva devb c
0 1 3 5
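A self-contained sketch of the isin + any approach (the second row, with values absent from df2, is added here to show a row actually being filtered out):

```python
import pandas as pd

df1 = pd.DataFrame({'deva': [1, 4], 'devb': [3, 9], 'c': [5, 7]})
df2 = pd.DataFrame({'dev': [1, 3]})

# True for rows where deva OR devb appears in df2['dev']
mask = df1[['deva', 'devb']].isin(df2['dev'].tolist()).any(axis=1)
out = df1[mask]
print(out)
#    deva  devb  c
# 0     1     3  5
```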

Pandas groupby and then join multiple columns

Given the following dataframe:
A B C
1 2 3
1 9 8
df = df.groupby(['A'])['B'].apply(','.join).reset_index()
this produces
A B
1 2,9
However, I also want to join the 'C' column values together with a comma, the same way as B.
Expected:
A B C
1 2,9 3,8
I tried:
df = df.groupby(['A'])['B','C'].apply(','.join).reset_index()
Use GroupBy.agg (note the double brackets: selecting multiple columns from a groupby requires a list):
df = df.groupby(['A'])[['B', 'C']].agg(','.join).reset_index()
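A runnable sketch; since `','.join` only accepts strings, B and C are assumed to hold string values here (which the question's output implies):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': ['2', '9'], 'C': ['3', '8']})

# agg applies ','.join to each group of each selected column
out = df.groupby('A')[['B', 'C']].agg(','.join).reset_index()
print(out)
#    A    B    C
# 0  1  2,9  3,8
```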

Will passing ignore_index=True to pd.concat preserve index succession within dataframes that I'm concatenating?

I have two dataframes:
df1 =
value
0 a
1 b
2 c
df2 =
value
0 d
1 e
I need to concatenate them across index, but I have to preserve the index of the first dataframe and continue it in the second dataframe, like this:
result =
value
0 a
1 b
2 c
3 d
4 e
My guess is that pd.concat([df1, df2], ignore_index=True) will do the job. However, I'm worried that for large dataframes the order of the rows may be changed and I'll end up with something like this (first two rows changed indices):
result =
value
0 b
1 a
2 c
3 d
4 e
So my question is: does pd.concat with ignore_index=True preserve the row order within the dataframes being concatenated, or is there randomness in the index assignment?
In my experience, pd.concat concatenates the rows in the order the DataFrames are passed to it.
If you want to be safe, specify sort=False which will also avoid sorting on columns:
pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
value
0 a
1 b
2 c
3 d
4 e
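A quick sketch confirming the behavior on the question's own data: the rows keep their within-frame order, and ignore_index simply renumbers the result 0..n-1:

```python
import pandas as pd

df1 = pd.DataFrame({'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'value': ['d', 'e']})

# Rows stay in the order the frames are passed; only the index is renumbered
result = pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
print(result['value'].tolist())  # ['a', 'b', 'c', 'd', 'e']
```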
