How to append a dataframe to a dataframe with duplicate ids [duplicate] - python

This question already has answers here:
Concatenate a list of pandas dataframes together
(6 answers)
Closed 2 years ago.
I have a dataframe df1
id Name City type
1 Anna Paris AB
2 Marc Rome D
3 erika madrid AC
and a dataframe df2
id Name City type
1 Anna Paris B
and a dataframe df3
id Name City type
1 Anna Paris C
I want to append df2 and df3 to df1. This is my expected output:
id Name City type
1 Anna Paris AB
2 Marc Rome D
3 erika madrid AC
1 Anna Paris B
1 Anna Paris C
df1 = df1.append(df2)
df1 = df1.append(df3)
but the resulting dataframe keeps only the last row and drops the other rows with the same id:
id Name City type
2 Marc Rome D
3 erika madrid AC
1 Anna Paris C
I also tried concat:
df1= pd.concat([df1,df2,df3], join='inner')

I think the problem with pd.concat() is that you are passing the parameter join='inner'. I expect this to work:
output = pd.concat([df1,df2,df3])
Using this example code:
import pandas as pd

df1 = pd.DataFrame({'Name': ['Anna', 'Marc', 'erika'],
                    'City': ['Paris', 'Rome', 'madrid'],
                    'Type': ['AB', 'D', 'AC']})
df2 = pd.DataFrame({'Name': ['Anna'],
                    'City': ['Paris'],
                    'Type': ['B']})
df3 = pd.DataFrame({'Name': ['Anna'],
                    'City': ['Paris'],
                    'Type': ['C']})
pd.concat([df1,df2,df3])
It outputs:
Name City Type
0 Anna Paris AB
1 Marc Rome D
2 erika madrid AC
0 Anna Paris B
0 Anna Paris C
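The snippets above leave out the id column from the question; a minimal sketch (assuming id is a regular column, as shown in the question) confirms that a plain pd.concat keeps every row, duplicate ids included:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'Name': ['Anna', 'Marc', 'erika'],
                    'City': ['Paris', 'Rome', 'madrid'],
                    'type': ['AB', 'D', 'AC']})
df2 = pd.DataFrame({'id': [1], 'Name': ['Anna'],
                    'City': ['Paris'], 'type': ['B']})
df3 = pd.DataFrame({'id': [1], 'Name': ['Anna'],
                    'City': ['Paris'], 'type': ['C']})

# row-wise concatenation; ignore_index renumbers the index 0..4
# but leaves the id column (with its duplicates) untouched
out = pd.concat([df1, df2, df3], ignore_index=True)
print(out)
```

Note that ignore_index=True only resets the positional index; the duplicate values in the id column itself are preserved.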

Related

How to compare pandas dataframe using given keys

I have two datasets. I want to compare them using id and name, and write any mismatched values into a new data frame as "Mismatched", keeping the mismatched rows otherwise as they are.
df1
Index id name dept addr
0 1 Jeff IT Delhi
1 2 Tom Support Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT Pune
4 5 Lee Dev Delhi
df2
Index id name dept addr
0 1 Jeff IT Delhi
1 2 Tom QA Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT Hyderabad
And I need result like,
Result
Index id name dept addr
0 2 Tom Mismatched Bangalore
1 4 Kaif IT Mismatched
2 5 Lee Dev Delhi
One way to do what you intend (if 'id' and 'name' already match, as in the case you show) is to do an inner merge on the 'name' column, and then change the dept value to 'mismatch' wherever the 'dept_x' and 'dept_y' values of the merged dataframe don't match.
A = pd.merge(df1,df2, on='name', how='inner')
# It creates new columns
print(A)
id_x name dept_x addr_x id_y dept_y addr_y
0 1 Jeff IT Delhi 1 IT Delhi
1 2 Tom Support Bangalore 2 QA Bangalore
2 3 Peter Admin Pune 3 Admin Pune
3 4 Kaif IT Pune 4 IT Hyderabad
B = A.copy()
B['dept_x'] = A.apply(lambda x : 'mismatch' if x.dept_x!=x.dept_y else x.dept_x, axis=1)
print(B)
id_x name dept_x addr_x id_y dept_y addr_y
0 1 Jeff IT Delhi 1 IT Delhi
1 2 Tom mismatch Bangalore 2 QA Bangalore
2 3 Peter Admin Pune 3 Admin Pune
3 4 Kaif IT Pune 4 IT Hyderabad
Then you can do the same for the address column, filter the rows with mismatch if you intend to only keep them, and rename or delete the columns that you need/don't need accordingly.
If you have many columns, you can use a function inside the .apply() to make it more general:
# the columns that you intend to check the mismatch for
cols = ['dept','addr']
# or if you want to do it on all columns except the first two because there's too many
cols = [a for a in df1.columns if a not in ['name','id']]
# define a function that compares all columns
def is_mismatch(x):
    L = ['mismatch' if x[cols[i] + '_x'] != x[cols[i] + '_y'] else x[cols[i] + '_x']
         for i in range(len(cols))]
    return pd.Series(L)
C = A.copy()
C[cols] = C.apply(is_mismatch, axis=1) # don't forget that axis=1 here !
print(C)
id_x name dept_x addr_x id_y dept_y addr_y dept \
0 1 Jeff IT Delhi 1 IT Delhi IT
1 2 Tom Support Bangalore 2 QA Bangalore mismatch
2 3 Peter Admin Pune 3 Admin Pune Admin
3 4 Kaif IT Pune 4 IT Hyderabad IT
addr
0 Delhi
1 Bangalore
2 Pune
3 mismatch
# if you want to clean the columns
C = C[['id_x','name']+cols]
print(C)
id_x name dept addr
0 1 Jeff IT Delhi
1 2 Tom mismatch Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT mismatch
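As an aside, the row-wise .apply above can be slow on large frames; a self-contained vectorized sketch of the same comparison, using numpy.where instead (same data as the question):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'name': ['Jeff', 'Tom', 'Peter', 'Kaif', 'Lee'],
                    'dept': ['IT', 'Support', 'Admin', 'IT', 'Dev'],
                    'addr': ['Delhi', 'Bangalore', 'Pune', 'Pune', 'Delhi']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'name': ['Jeff', 'Tom', 'Peter', 'Kaif'],
                    'dept': ['IT', 'QA', 'Admin', 'IT'],
                    'addr': ['Delhi', 'Bangalore', 'Pune', 'Hyderabad']})

A = pd.merge(df1, df2, on='name', how='inner')
cols = ['dept', 'addr']
for c in cols:
    # vectorized comparison: no per-row Python function call
    A[c] = np.where(A[c + '_x'] != A[c + '_y'], 'mismatch', A[c + '_x'])

result = A[['id_x', 'name'] + cols]
print(result)
```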

Merged column doesn't display properly

This is my first data frame df1
ID Name
1 Sam
2 Mam
3 Dam
This is my second data frame df2
ID Parent ID
1 4
2 5
3 6
This is my third dataframe df3
Parent ID Location
7 New York
8 London
9 San Diego
4 Mumbai
5 Pataya
6 Canberra
df4=pd.merge(df1,df2,on=['ID'], how='left')
ID Game Parent ID
1 Sam 4
2 Mam 5
3 Dam 6
Now I want to merge again to get df5, but this gives an invalid result:
df5=pd.merge(df4,df3,on=['Parent ID'], how='left')
ID Game Parent ID Location
1 Sam 4 New york
2 Mam 5 London
3 Dam 6 San Diego
I am not sure why it is picking the first values of df3 instead of matching on the common keys, like the first merge did.
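For reference, the merge as written behaves correctly when the 'Parent ID' key columns share a dtype; a self-contained reproduction with the data from the question yields the expected locations, which suggests the problem lies in the real data (for example, mismatched key dtypes or stray whitespace) rather than in the merge call itself:

```python
import pandas as pd

df4 = pd.DataFrame({'ID': [1, 2, 3],
                    'Game': ['Sam', 'Mam', 'Dam'],
                    'Parent ID': [4, 5, 6]})
df3 = pd.DataFrame({'Parent ID': [7, 8, 9, 4, 5, 6],
                    'Location': ['New York', 'London', 'San Diego',
                                 'Mumbai', 'Pataya', 'Canberra']})

# left merge matches each Parent ID in df4 against df3's key column
df5 = pd.merge(df4, df3, on=['Parent ID'], how='left')
print(df5)
```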

Compare 2 dataframes Pandas, returns wrong values

There are 2 dfs
datatypes are the same
df1 =
ID city name value
1 LA John 111
2 NY Sam 222
3 SF Foo 333
4 Berlin Bar 444
df2 =
ID city name value
1 NY Sam 223
2 LA John 111
3 SF Foo 335
4 London Foo1 999
5 Berlin Bar 444
I need to compare them and produce a new df containing only the rows which are in df2 but not in df1.
For some reason the results of the different methods I have applied are wrong.
So far I've tried
pd.concat([df1, df2], join='inner', ignore_index=True)
but it returns all values together
pd.merge(df1, df2, how='inner')
it returns df1
then this one
df1[~(df1.iloc[:, 0].isin(list(df2.iloc[:, 0])))]
it returns df1
The desired output is
ID city name value
1 NY Sam 223
2 SF Foo 335
3 London Foo1 999
Use DataFrame.merge on all columns except the first, together with the indicator parameter:
c = df1.columns[1:].tolist()
Or:
c = ['city', 'name', 'value']
df = (df2.merge(df1,on=c, indicator = True, how='left', suffixes=('','_'))
.query("_merge == 'left_only'")[df1.columns])
print (df)
ID city name value
0 1 NY Sam 223
2 3 SF Foo 335
3 4 London Foo1 999
Try this:
print("------------------------------")
print(df1)
df2 = DataFrameFromString(s, columns)  # helper from the answerer's own test setup
print("------------------------------")
print(df2)
common = (df1.merge(df2, on=["city", "name"])
             .rename(columns={"value_y": "value", "ID_y": "ID"})
             .drop(columns=["value_x", "ID_x"]))
print("------------------------------")
print(common)
OUTPUT:
------------------------------
ID city name value
0 ID city name value
1 1 LA John 111
2 2 NY Sam 222
3 3 SF Foo 333
4 4 Berlin Bar 444
------------------------------
ID city name value
0 1 NY Sam 223
1 2 LA John 111
2 3 SF Foo 335
3 4 London Foo1 999
4 5 Berlin Bar 444
------------------------------
city name ID value
0 LA John 2 111
1 NY Sam 1 223
2 SF Foo 3 335
3 Berlin Bar 5 444

Python: how to drop rows in Pandas if two columns don't appear in another pandas column?

I have two dataframes df and df1. df contains name and attributes of people.
df Name Age
0 Jack 33
1 Anna 25
2 Emilie 49
3 Frank 19
4 John 42
while df1 contains the info of the number of contacts between two people. In df1 we can have some people that don't appear in df.
df1 Name1 Name2 c
0 Frank Paul 2
1 Julia Anna 5
2 Frank John 1
3 Emilie Jack 3
4 Tom Steven 2
5 Tom Jack 5
I want to drop all the rows from df1 where Name1 or Name2 doesn't appear in df.
df1 Name1 Name2 c
0 Frank John 1
1 Emilie Jack 3
Use isin, passing the names as a list (DataFrame.isin aligns on index when given a Series) -
df1[df1[['Name1', 'Name2']].isin(df.Name.tolist()).all(axis=1)]
# Name1 Name2 c
#2 Frank John 1
#3 Emilie Jack 3
Or:
df1[df1.Name1.isin(df.Name) & df1.Name2.isin(df.Name)]
# Name1 Name2 c
#2 Frank John 1
#3 Emilie Jack 3
Can also use np.isin:
df1[np.isin(df1.Name1, df.Name) &
    np.isin(df1.Name2, df.Name)]

Python: how to get sum of values based on different columns

I have a dataframe df like the following:
df name city
0 John New York
1 Carl New York
2 Carl Paris
3 Eva Paris
4 Eva Paris
5 Carl Paris
I want to know the total number of people in the different cities
df2 city number
0 New York 2
1 Paris 3
or the number of people with the same name in the cities
df2 name city number
0 John New York 1
1 Eva Paris 2
2 Carl Paris 2
3 Eva New York 0
I believe you need GroupBy.size:
df1 = df.groupby(['city']).size().reset_index(name='number')
print (df1)
city number
0 New York 2
1 Paris 4
df2 = df.groupby(['name','city']).size().reset_index(name='number')
print (df2)
name city number
0 Carl New York 1
1 Carl Paris 2
2 Eva Paris 2
3 John New York 1
If you need all combinations, one solution is to add unstack and stack:
df3 = df.groupby(['name','city']).size().unstack(fill_value=0).stack().reset_index(name='number')
print (df3)
name city number
0 Carl New York 1
1 Carl Paris 2
2 Eva New York 0
3 Eva Paris 2
4 John New York 1
5 John Paris 0
Or reindex with MultiIndex.from_product:
df2 = df.groupby(['name','city']).size()
mux = pd.MultiIndex.from_product(df2.index.levels, names=df2.index.names)
df2 = df2.reindex(mux, fill_value=0).reset_index(name='number')
print (df2)
name city number
0 Carl New York 1
1 Carl Paris 2
2 Eva New York 0
3 Eva Paris 2
4 John New York 1
5 John Paris 0
To count the number of people in each city:
groups = df.groupby('city').count().reset_index()
To count the number of people with the same name in each city:
groups = df.groupby(['name', 'city']).count().reset_index()
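The size-based answers above can be verified with a quick self-contained sketch (data from the question):

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Carl', 'Carl', 'Eva', 'Eva', 'Carl'],
                   'city': ['New York', 'New York', 'Paris',
                            'Paris', 'Paris', 'Paris']})

# people per city
per_city = df.groupby('city').size().reset_index(name='number')
print(per_city)

# people per (name, city), including zero-count combinations
per_name_city = (df.groupby(['name', 'city']).size()
                   .unstack(fill_value=0).stack()
                   .reset_index(name='number'))
print(per_name_city)
```

Note that for this data the Paris total is 4 (not the 3 shown in the question's expected output), since Carl and Eva each appear twice in Paris.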
