Python Pandas Compare 2 dataFrame [duplicate] - python

I'm working in Python with pandas and I have two DataFrames:
df1:
1 'A'
2 'B'
df2:
1 'A'
2 'B'
3 'C'
4 'D'
and I want to return the difference:
1 'C'
2 'D'

You can concatenate two dataframes and drop duplicates:
pd.concat([df1, df2]).drop_duplicates(keep=False)
If your DataFrames contain more columns, you can pass a column name as subset so that only that column is compared:
pd.concat([df1, df2]).drop_duplicates(subset='col_name', keep=False)
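The approach above can be sketched on the sample data from the question. The column name N is an assumption made for illustration, and reset_index is added only to tidy the resulting index:

```python
import pandas as pd

# Sample data from the question; the column name "N" is assumed for illustration.
df1 = pd.DataFrame({"N": ["A", "B"]})
df2 = pd.DataFrame({"N": ["A", "B", "C", "D"]})

# keep=False drops every row that occurs more than once in the combined frame,
# so only rows present in exactly one of the two inputs survive.
diff = pd.concat([df1, df2]).drop_duplicates(keep=False).reset_index(drop=True)
print(diff)
```

Note that this gives a symmetric difference, and a row repeated inside a single DataFrame is dropped as well, so it may not fit every use case.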

This is what I get with pd.concat([df1, df2]).drop_duplicates(keep=False)
(N = name of the column)
df1:
N
0 A
1 B
2 C
df2:
N
0 A
1 B
2 C
df3:
N
0 A
1 B
2 C
0 A
1 B
2 C
The values in the DataFrames are phone numbers without the '+', so I can't show them.
I import them with:
df1 = pd.DataFrame(ListResponse, columns=['33000000000'])
df2 = pd.read_csv('number.csv')
ListResponse returns a list of numbers, and number.csv is the ListResponse that I saved to CSV the last time I ran the script.
Edit:
(What I want in this case is "Empty DataFrame".)
I just tested with new values:
df3:
N
0 A
1 B
2 C
3 D
0 B
1 C
2 D
Edit 2: I think drop_duplicates is not working because my function adds the new values starting at index 0 rather than at index length + 1, as you can see just above. But when both DataFrames hold the same values, it still does not return an empty DataFrame...
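One possible cause, offered as an assumption because the real data is not shown: drop_duplicates compares values only and ignores the index, so the repeated index labels should not matter. However, pd.read_csv normally parses digit-only values as integers, while the list may hold strings, and a string never equals an integer. In the sketch below the phone numbers are made up and io.StringIO stands in for number.csv:

```python
import io
import pandas as pd

# Hypothetical data: the list holds strings, the CSV holds the same numbers.
list_response = ["33600000001", "33600000002"]
csv_text = "N\n33600000001\n33600000002\n"

df1 = pd.DataFrame(list_response, columns=["N"])
df2 = pd.read_csv(io.StringIO(csv_text))  # "N" is parsed as int64

# Mixed str/int values never match, so nothing is dropped.
mismatched = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Reading the column as text makes both frames comparable.
df2_str = pd.read_csv(io.StringIO(csv_text), dtype=str)
diff = pd.concat([df1, df2_str]).drop_duplicates(keep=False)
print(len(mismatched), len(diff))
```

The column names of both frames must also match, otherwise concat places the values in separate columns and no row can be a duplicate.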

Related

How do I delete columns where the average of the column already exists

In the example below, column C should be deleted because its average already exists (column A should remain).
type(df): pandas.core.frame.DataFrame
A B C
1 2 1
0 2 0
3 2 3
I tried creating a dictionary to later delete repeated values, but got stuck:
dict_test = {}
for each_column in df:
    dict_test[each_column] = df[[each_column]].mean()
dict_test
The result came out as 'A': A 1.33333, dtype: float64.
The problem is that each dictionary value stores both the key and the value, so I can't compare the values to one another.
You can use df.mean().drop_duplicates() and pandas indexing:
In [30]: df[df.mean().drop_duplicates().index]
Out[30]:
A B
0 1 2
1 0 2
2 3 2
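A small sketch of the answer above on the sample data. Equal means do not guarantee identical columns, so a stricter variant that compares all values is shown alongside it; whether that stricter check is wanted is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, 3], "B": [2, 2, 2], "C": [1, 0, 3]})

# Keep the first column for each distinct mean (what the answer above does).
by_mean = df[df.mean().drop_duplicates().index]

# Stricter variant: drop a column only if all of its values repeat an earlier column.
by_values = df.loc[:, ~df.T.duplicated()]
print(by_mean.columns.tolist(), by_values.columns.tolist())
```

On this data both variants keep A and B, but they can differ when two different columns happen to share the same mean.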

Concatenate dataframes alternating rows with Pandas

I have two dataframes df1 and df2 that are defined like so:
df1 df2
Out[69]: Out[70]:
A B A B
0 2 a 0 5 q
1 1 s 1 6 w
2 3 d 2 3 e
3 4 f 3 1 r
My goal is to concatenate the dataframes by alternating the rows so that the resulting dataframe is like this:
dff
Out[71]:
A B
0 2 a <--- belongs to df1
0 5 q <--- belongs to df2
1 1 s <--- belongs to df1
1 6 w <--- belongs to df2
2 3 d <--- belongs to df1
2 3 e <--- belongs to df2
3 4 f <--- belongs to df1
3 1 r <--- belongs to df2
As you can see the first row of dff corresponds to the first row of df1 and the second row of dff is the first row of df2. The pattern repeats until the end.
I tried to reach my goal by using the following lines of code:
import pandas as pd
df1 = pd.DataFrame({'A':[2,1,3,4], 'B':['a','s','d','f']})
df2 = pd.DataFrame({'A':[5,6,3,1], 'B':['q','w','e','r']})
dfff = pd.DataFrame()
for i in range(0,4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
However, this approach doesn't work because df1.iloc[i] and df2.iloc[i] are automatically reshaped into columns instead of rows, and I cannot revert the process (even by using .T).
Question: Can you please suggest a nice and elegant way to reach my goal?
Optional: Can you also explain how to convert a column back to a row?
I'm unable to comment on the accepted answer, but note that the sort operation is unstable by default, so you must choose a stable sorting algorithm:
pd.concat([df1, df2]).sort_index(kind='mergesort')
If I understand correctly:
In [64]: pd.concat([df1, df2]).sort_index()
Out[64]:
A B
0 2 a
0 5 q
1 1 s
1 6 w
2 3 d
2 3 e
3 4 f
3 1 r
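Putting the two answers together, here is a minimal sketch with the question's data. It also touches on the optional question: selecting with a list, or calling to_frame().T on a Series, is one way to get a row back as a one-row DataFrame:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [2, 1, 3, 4], "B": ["a", "s", "d", "f"]})
df2 = pd.DataFrame({"A": [5, 6, 3, 1], "B": ["q", "w", "e", "r"]})

# A stable sort keeps df1's row ahead of df2's row for each shared index label.
dff = pd.concat([df1, df2]).sort_index(kind="mergesort")

# Double brackets keep a single row as a one-row DataFrame instead of a Series.
row_as_frame = df1.iloc[[0]]

# A Series can also be turned back into a row explicitly.
row_from_series = df1.iloc[0].to_frame().T
print(dff)
```

This relies on both DataFrames sharing the same index labels; if they do not, calling reset_index(drop=True) on each before concatenating is one option.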

Pandas (Python) - Update column of a dataframe from another one with conditions and different columns

I had a problem and found a solution, but I feel it's the wrong way to do it. Maybe there is a more 'canonical' way.
I already got an answer to a very similar problem, but here the two dataframes do not have the same number of rows. Sorry for the "double post", but the first one is still valid, so I think it's better to open a new one.
Problem
I have two dataframes that I would like to merge without adding extra columns and without erasing existing information. Example:
Existing dataframe (df)
A A2 B
0 1 4 0
1 2 5 1
2 2 5 1
Dataframe to merge (df2)
A A2 B
0 1 4 2
1 3 5 2
I would like to update df with df2 where columns 'A' and 'A2' match.
The result would be:
A A2 B
0 1 4 2 <= Update value ONLY
1 2 5 1
2 2 5 1
Here is my solution, but I don't think it's a very good one.
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'] = df['B_y'].fillna(0)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I also tried this solution:
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I get this error because the number of rows differs:
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone have a better way to do this?
Thanks!
I think you can use DataFrame.isin to check which rows are the same in both DataFrames. Then create NaN with mask, which is filled by combine_first. Finally, cast to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1
With a minor tweak in the way in which the boolean mask gets created, you can get it to work:
cols = ['A', 'A2']
# Slice it to match the shape of the other dataframe to compare elementwise
rows = (df[cols].values[:df2.shape[0]] == df2[cols].values).all(1)
df.loc[rows,'B'] = df2.loc[rows,'B']
df
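Both answers above compare rows by position. As an alternative sketch that matches on the key columns instead, a left merge with a suffix can be used. It assumes each ('A', 'A2') pair appears at most once in df2; the suffix '_new' and the names merged and result are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 4, 0], [2, 5, 1], [2, 5, 1]], columns=["A", "A2", "B"])
df2 = pd.DataFrame([[1, 4, 2], [3, 5, 2]], columns=["A", "A2", "B"])

# Left merge keeps every row of df; unmatched keys get NaN in "B_new".
merged = df.merge(df2, on=["A", "A2"], how="left", suffixes=("", "_new"))

# Take the new value where a match exists, otherwise keep the old one.
merged["B"] = merged["B_new"].fillna(merged["B"]).astype(int)
result = merged.drop(columns="B_new")
print(result)
```

If df2 could contain repeated key pairs, the merge would duplicate rows of df, so deduplicating df2 first would be needed.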

Pandas python - matching values

I currently have two dataframes that share a matching column. For example:
Data frame 1 with columns: A, B, C
Data frame 2 with column: A
I want to keep all rows in the first dataframe whose value in A appears in column A of the second dataframe. For example, if df1 and df2 are:
df1
A B C
0 1 3
4 2 5
6 3 1
8 0 0
2 1 1
df2
A
4
6
1
So in this case, I want to keep only the second and third rows of df1.
I tried doing it like this, but it didn't work since both dataframes are pretty big:
for index, row in df1.iterrows():
    counter = 0
    for index2, row2 in df2.iterrows():
        if row["A"] == row2["A"]:
            counter = counter + 1
    if counter == 0:
        # no match found in df2: remove this row from df1
        df1.drop(index, inplace=True)
Use isin to test for membership:
In [176]:
df1[df1['A'].isin(df2['A'])]
Out[176]:
A B C
1 4 2 5
2 6 3 1
Or use the merge method:
import pandas
df1 = pandas.DataFrame([[0,1,3],[4,2,5],[6,3,1],[8,0,0],[2,1,1]], columns=['A', 'B', 'C'])
df2 = pandas.DataFrame([4,6,1], columns=['A'])
df2.merge(df1, on='A')
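The two suggestions above can be compared side by side on the question's data. This is only a sketch; the drop_duplicates call is an added precaution for the case where df2 repeats a value, which the question does not mention:

```python
import pandas as pd

df1 = pd.DataFrame(
    [[0, 1, 3], [4, 2, 5], [6, 3, 1], [8, 0, 0], [2, 1, 1]],
    columns=["A", "B", "C"],
)
df2 = pd.DataFrame({"A": [4, 6, 1]})

# Boolean mask: True where df1["A"] appears anywhere in df2["A"].
kept = df1[df1["A"].isin(df2["A"])]

# Inner merge gives the same rows but renumbers the index.
merged = df1.merge(df2.drop_duplicates(), on="A")
print(kept)
print(merged)
```

isin keeps the original index labels of df1, while merge returns a fresh index, which may matter if later code relies on the labels.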

set multiple Pandas DataFrame columns to values in a single column or multiple scalar values at the same time

I'm trying to set multiple new columns to one column and, separately, multiple new columns to multiple scalar values. Can't do either. Any way to do it other than setting each one individually?
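import numpy as np
import pandas as pd
df = pd.DataFrame(columns=['A', 'B'], data=np.arange(6).reshape(3, 2))
# What I tried (neither works for me):
df.loc[:, ['C', 'D']] = df['A']
df.loc[:, ['C', 'D']] = [0, 1]
# What I want to avoid, setting each column individually:
for c in ['C', 'D']:
    df[c] = df['A']
df['C'] = 0
df['D'] = 1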
Maybe this is what you are looking for:
df=pd.DataFrame(columns=['A','B'],data=np.arange(6).reshape(3,2))
df['C'], df['D'] = df['A'], df['A']
df['E'], df['F'] = 0, 1
# Result
A B C D E F
0 0 1 0 0 0 1
1 2 3 2 2 0 1
2 4 5 4 4 0 1
The assign method creates multiple new columns in one step. You can unpack a dict of column names and values into it, and it returns a new DataFrame with the new columns appended at the end.
Using your examples:
df = df.assign(**{'C': df['A'], 'D': df['A']})
and
df = df.assign(**{'C': 0, 'D':1})
See this answer for additional detail: https://stackoverflow.com/a/46587717/4843561
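As a runnable sketch of the assign approach, the dict can also be built from a list of column names, so the names are not repeated by hand. The column names E and F follow the earlier answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["A", "B"], data=np.arange(6).reshape(3, 2))

# Copy one existing column into several new columns.
df = df.assign(**{name: df["A"] for name in ["C", "D"]})

# Set several new columns to different scalar values.
df = df.assign(**dict(zip(["E", "F"], [0, 1])))
print(df)
```

Because assign returns a new DataFrame rather than modifying in place, the result has to be bound to a name, as done here.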
