Pandas DataFrame merge summing column - python

I'm trying to merge two DataFrames, summing the values of a shared column.
>>> print(df1)
id name weight
0 1 A 0
1 2 B 10
2 3 C 10
>>> print(df2)
id name weight
0 2 B 15
1 3 C 10
I need to sum the weight values during the merge for rows that match on the common columns. I tried:
merge = pd.merge(df1, df2, how='inner')
The output should look like the following:
id name weight
1 2 B 25
2 3 C 20

This solution also works if you want to sum more than one column. Assume the data frames
>>> df1
id name weight height
0 1 A 0 5
1 2 B 10 10
2 3 C 10 15
>>> df2
id name weight height
0 2 B 25 20
1 3 C 20 30
You can concatenate them and group by index columns.
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
id name weight height
0 1 A 0 5
1 2 B 35 30
2 3 C 30 45
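Applied to the question's original frames, note that concat/groupby keeps row A as well, i.e. it behaves like an outer join rather than an inner merge. A minimal sketch, reconstructing df1 and df2 from the question:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'weight': [0, 10, 10]})
df2 = pd.DataFrame({'id': [2, 3], 'name': ['B', 'C'], 'weight': [15, 10]})

# stack the rows, then sum weights per (id, name) key
out = pd.concat([df1, df2]).groupby(['id', 'name'], as_index=False).sum()
print(out)
#    id name  weight
# 0   1    A       0
# 1   2    B      25
# 2   3    C      20
Filter afterwards to keys present in both frames if you need strict inner-merge behavior.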

In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id name
2 B 25
3 C 20
dtype: int64
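To turn that Series back into a flat DataFrame with a named weight column, a small follow-up (assuming the same df1 and df2 as in the question):
result = (pd.merge(df1, df2, on=['id', 'name'])
            .set_index(['id', 'name'])
            .sum(axis=1)
            .reset_index(name='weight'))
print(result)
#    id name  weight
# 0   2    B      25
# 1   3    C      20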

If you set the common columns as the index, you can simply add the two dataframes, which is much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
weight
id name
1 A NaN
2 B 25
3 C 20
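Row A comes out as NaN because it has no counterpart in df2. If missing keys should count as zero instead of dropping out, DataFrame.add with fill_value covers that; a small sketch using the indexed frames above:
# treat keys missing from either frame as weight 0 instead of NaN
summed = df1.add(df2, fill_value=0)
print(summed)
#          weight
# id name
# 1  A        0.0
# 2  B       25.0
# 3  C       20.0
Use (df1 + df2).dropna() instead if you want to reproduce the inner-merge output.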

Related

Assign each unique value of a column to the whole DataFrame, as if the data frame duplicates itself based on the values of another column

I am trying to iterate over the values of a column from df2 and assign each value to df1, as if df1 duplicates itself for each value of the column from df2.
Let's say I have df1 as below:
df1
1
2
3
and df2 as below:
df2
A
B
C
I want a third dataframe, df3, to look like this:
df3
1 A
2 A
3 A
1 B
2 B
3 B
1 C
2 C
3 C
So far I have tried the code below:
for i, value in ACS_shock['scenario'].iteritems():
    df1['sec'] = df1[i] = value[:]
But when I generate the file from df1, my output looks like this:
1 A B C
2 A B C
3 A B C
Any idea how I can correct this code? Much appreciated.
You can use pd.concat and np.repeat:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.Series([1,2,3])
>>> df1
0 1
1 2
2 3
dtype: int64
>>> df2 = pd.Series(list('ABC'))
>>> df2
0 A
1 B
2 C
dtype: object
>>> df3 = pd.DataFrame({'df1': pd.concat([df1]*3).reset_index(drop=True),
...                     'df2': np.repeat(df2, 3).reset_index(drop=True)})
>>> df3
df1 df2
0 1 A
1 2 A
2 3 A
3 1 B
4 2 B
5 3 B
6 1 C
7 2 C
8 3 C
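On pandas 1.2+ the same cross product can be written with merge(how='cross'), which adapts automatically if either input changes length. A sketch (frame and column names are illustrative):
import pandas as pd

letters = pd.DataFrame({'df2': list('ABC')})
numbers = pd.DataFrame({'df1': [1, 2, 3]})

# cartesian product: every row of letters paired with every row of numbers
df3 = letters.merge(numbers, how='cross')[['df1', 'df2']]
print(df3)
#    df1 df2
# 0    1   A
# 1    2   A
# 2    3   A
# 3    1   B
# ...
# 8    3   C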

How to pivot a dataframe into a square dataframe with the number of intersections in another column as values

How can I pivot a dataframe into a square dataframe whose values are the number of intersections in the value column? My input dataframe is:
field value
a 1
a 2
b 3
b 1
c 2
c 5
Output should be
a b c
a 2 1 1
b 1 2 0
c 1 0 2
The values in the output data frame should be the number of intersecting values in the value column for each pair of fields.
Use a self-join on value together with crosstab (using a separate variable name so df is not overwritten):
merged = df.merge(df, on='value')
res = pd.crosstab(merged['field_x'], merged['field_y'])
print (res)
field_y a b c
field_x
a 2 1 1
b 1 2 0
c 1 0 2
Then remove the index and column names with rename_axis:
# pandas 0.24+
res = pd.crosstab(merged['field_x'], merged['field_y']).rename_axis(index=None, columns=None)
print (res)
a b c
a 2 1 1
b 1 2 0
c 1 0 2
# pandas below 0.24
res = pd.crosstab(merged['field_x'], merged['field_y']).rename_axis(None).rename_axis(None, axis=1)
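An equivalent route without the self-join is to build a field-by-value incidence table and take its dot product with its own transpose, which counts shared values directly. A sketch under the same input data:
import pandas as pd

df = pd.DataFrame({'field': list('aabbcc'), 'value': [1, 2, 3, 1, 2, 5]})

# incidence matrix: rows are fields, columns are values, entries are counts (0/1 here)
inc = pd.crosstab(df['field'], df['value'])

# the dot product counts how many values each pair of fields shares
out = inc.dot(inc.T).rename_axis(index=None, columns=None)
print(out)
#    a  b  c
# a  2  1  1
# b  1  2  0
# c  1  0  2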

Intersect dataframes in Python, keeping one dataframe's columns

I would like to join two dataframes so that the result is the intersection of the two datasets on the key column.
By doing this:
result = pd.merge(df1,df2,on='key', how='inner')
I get what I need, but with the extra columns of df2. I only want df1's columns in the result (I do not want to have to delete them afterwards).
Any ideas?
Thanks,
Here is a generic solution which works for one or multiple key (joining) columns:
Setup:
In [28]: a = pd.DataFrame({'a':[1,2,3,4], 'b':[10,20,30,40], 'c':list('abcd')})
In [29]: b = pd.DataFrame({'a':[3,4,5,6], 'b':[30,41,51,61], 'c':list('efgh')})
In [30]: a
Out[30]:
a b c
0 1 10 a
1 2 20 b
2 3 30 c
3 4 40 d
In [31]: b
Out[31]:
a b c
0 3 30 e
1 4 41 f
2 5 51 g
3 6 61 h
multiple joining keys:
In [32]: join_cols = ['a','b']
In [33]: a.merge(b[join_cols], on=join_cols)
Out[33]:
a b c
0 3 30 c
single joining key:
In [34]: join_cols = ['a']
In [35]: a.merge(b[join_cols], on=join_cols)
Out[35]:
a b c
0 3 30 c
1 4 40 d
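As a side note, when there is a single key column the same intersection can be written with isin, avoiding the merge entirely. A sketch reusing the a/b frames above:
# keep only rows of `a` whose key appears in `b`; the columns of `a` are untouched
result = a[a['a'].isin(b['a'])]
print(result)
#    a   b  c
# 2  3  30  c
# 3  4  40  d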

Keep the same factorization between two data sets

We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization across the data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and applied to df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since the performance of vectorized operations improves if you can apply them once to a large DataFrame instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1 after setting col1 as the index:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them first using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get the unique values across both sets of data, then create a Series or a dictionary; this is your factorization, usable across both data sets. Use map to get the output you are looking for.
import numpy as np
u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
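Another option that fits the PS (df1 arrives first, df2 later): store the uniques array that pd.factorize returns, then build a Categorical with those categories when df2 shows up; values absent from the stored uniques would get the code -1. A sketch, not from the answers above:
import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "A", "B", "C", "D", "E"]})
df2 = pd.DataFrame({'col1': ["A", "B", "D", "E"]})

# factorize df1 once and keep the uniques as the shared category order
df1['f_col1'], uniques = pd.factorize(df1['col1'])

# later, reuse the same categories for df2
df2['f_col1'] = pd.Categorical(df2['col1'], categories=uniques).codes
print(df2)
#   col1  f_col1
# 0    A       0
# 1    B       1
# 2    D       3
# 3    E       4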

Pandas: check if a row exists in another dataframe and append its index

I'm having a problem iterating over my dataframe. The way I'm doing it takes a long time, and I don't have that many rows (about 300k).
What am I trying to do?
Check if one DF (A) contains the values of two columns of the other DF (B). You can think of this as a multi-column key.
If True, get the index of DF.B and assign it to one column of DF.A
If False, two steps:
a. append to DF.B the two columns not found
b. assign the new ID to DF.A (I couldn't do this one)
This is my code, where:
df is DF.A and df_id is DF.B:
SampleID and ParentID are the two columns I want to check for existence in both dataframes
Real_ID is the column to which I want to assign the id of DF.B (df_id)
for index, row in df.iterrows():
    # check if the key columns exist in the other dataframe
    real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]
    if real_id.empty:
        # row does not exist, append to df_id
        df_id = df_id.append(row[['SampleID','ParentID']])
    else:
        # row exists, assign id of df_id to df
        row['Real_ID'] = real_id.index
EXAMPLE:
DF.A (df)
Real_ID SampleID ParentID Something AnotherThing
0 20 21 a b
1 10 11 a b
2 40 51 a b
DF.B (df_id)
SampleID ParentID
0 10 11
1 20 21
Result:
Real_ID SampleID ParentID Something AnotherThing
0 1 10 11 a b
1 0 20 21 a b
2 2 40 51 a b
SampleID ParentID
0 20 21
1 10 11
2 40 51
Again, this solution is very slow. I'm sure there is a better way to do this and that's why I'm asking here. Unfortunately this was what I got after some hours...
Thanks
You can do it this way.
Data (pay attention to the index in the B DF):
In [276]: cols = ['SampleID', 'ParentID']
In [277]: A
Out[277]:
Real_ID SampleID ParentID Something AnotherThing
0 NaN 10 11 a b
1 NaN 20 21 a b
2 NaN 40 51 a b
In [278]: B
Out[278]:
SampleID ParentID
3 10 11
5 20 21
Solution:
In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)
In [280]: merged
Out[280]:
SampleID ParentID _merge
0 10 11 both
1 20 21 both
2 40 51 left_only
In [281]: B = pd.concat([B, merged.loc[merged._merge=='left_only', cols]])
In [282]: B
Out[282]:
SampleID ParentID
3 10 11
5 20 21
2 40 51
In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']
In [286]: A
Out[286]:
Real_ID SampleID ParentID Something AnotherThing
0 3 10 11 a b
1 5 20 21 a b
2 2 40 51 a b
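For reference, a self-contained version of the same approach on current pandas (where .ix has been removed), reconstructing the A/B frames from the example above:
import pandas as pd

cols = ['SampleID', 'ParentID']
A = pd.DataFrame({'Real_ID': [None, None, None],
                  'SampleID': [10, 20, 40],
                  'ParentID': [11, 21, 51],
                  'Something': ['a', 'a', 'a'],
                  'AnotherThing': ['b', 'b', 'b']})
B = pd.DataFrame({'SampleID': [10, 20], 'ParentID': [11, 21]}, index=[3, 5])

# flag rows of A that are missing from B, append them to B,
# then pull B's index back into A as the Real_ID
merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)
B = pd.concat([B, merged.loc[merged['_merge'] == 'left_only', cols]])
A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']
print(A)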
