join two overlapping dataframes vertically - python

I am trying to update df1 with df2:
add new rows from df2 to df1
update existing rows (if the row index exists)
df1 = pd.DataFrame([[1,3],[2,4]], index=[1,2], columns=['a','b'])
df2 = pd.DataFrame([[0,1],[3,2]], index=[3,2], columns=['b','a'])
The expected result should be
a b
1 1 3
2 2 3
3 1 0
but
df1.append(df2).drop_duplicates(keep='last') # drop_duplicates has no effect
gives a simple vertical stack
a b
1 1 3
2 2 4
3 1 0
2 2 3
df1.merge(df2, how='outer')
gives the same values and destroys the row index
a b
0 1 3
1 2 4
2 1 0
3 2 3
df1.join(df2)
df1.loc[df2.index] = df1.values
both raise an error

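Stack the two frames, then keep only the last row for each duplicated index label (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
new_df = pd.concat([df1, df2])
new_df = new_df[~new_df.index.duplicated(keep='last')]
new_df = new_df.sort_index()  # optional: restore the index order 1, 2, 3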
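For reference, here is a minimal self-contained sketch of the "stack, then drop duplicated index labels" idea, next to DataFrame.combine_first as a possible alternative. The frames are built here so that they match the expected result shown in the question; the names upserted and combined are only illustrative.

```python
import pandas as pd

# Frames built to match the expected result shown in the question.
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[1, 2])
df2 = pd.DataFrame({'a': [1, 2], 'b': [0, 3]}, index=[3, 2])

# Option 1: stack both frames, then keep the last row per index label,
# so rows coming from df2 win over rows coming from df1.
stacked = pd.concat([df1, df2])
upserted = stacked[~stacked.index.duplicated(keep='last')].sort_index()

# Option 2: combine_first prefers df2's values and falls back to df1.
# Depending on the pandas version, integer columns may be upcast to float.
combined = df2.combine_first(df1)

print(upserted)
print(combined)
```

One difference to keep in mind: combine_first treats NaN in df2 as missing and keeps df1's value in that cell, while the concat approach replaces the whole row.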

Related

Looking up values in two pandas data frames and create new columns

I have two data frames in my problem.
df1
ID Value
1 A
2 B
3 C
df2:
ID F_ID S_ID
1 2 3
2 3 1
3 1 2
I want to create a column next to each ID column that will store the values looked up from df1. The output should look like this :
ID ID_Value F_ID F_ID_Value S_ID S_ID_Value
1 A 2 B 3 C
2 B 3 C 1 A
3 C 1 A 2 B
Basically looking up from df1 and creating a new column to store these values.
You can use map on each column of df2 with the values of df1:
s = df1.set_index('ID')['Value']
for col in df2.columns:
    df2[f'{col}_value'] = df2[col].map(s)
print (df2)
ID F_ID S_ID ID_value F_ID_value S_ID_value
0 1 2 3 A B C
1 2 3 1 B C A
2 3 1 2 C A B
Or with apply and concat (add_suffix, not add_prefix, produces names such as ID_value):
df_ = pd.concat([df2, df2.apply(lambda x: x.map(s)).add_suffix('_value')], axis=1)
df_ = df_.reindex(sorted(df_.columns), axis=1)
If the column order matters, use DataFrame.insert with enumerate to place each new column right after its source column (at position i * 2 - 1):
s = df1.set_index('ID')['Value']
for i, col in enumerate(df2.columns, 1):
    df2.insert(i * 2 - 1, f'{col}_value', df2[col].map(s))
print (df2)
ID ID_value F_ID F_ID_value S_ID S_ID_value
0 1 A 2 B 3 C
1 2 B 3 C 1 A
2 3 C 1 A 2 B
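A self-contained sketch of the same lookup that leaves df2 untouched and builds the interleaved layout in a new frame. The names lookup, pieces, and result are illustrative, and the _Value suffix follows the column names requested in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'F_ID': [2, 3, 1], 'S_ID': [3, 1, 2]})

# Series indexed by ID, used as the lookup table for map.
lookup = df1.set_index('ID')['Value']

# Add each ID column followed by its looked-up values; dict insertion
# order becomes the column order of the new frame.
pieces = {}
for col in df2.columns:
    pieces[col] = df2[col]
    pieces[f'{col}_Value'] = df2[col].map(lookup)

result = pd.DataFrame(pieces)
print(result)
```

IDs that do not occur in df1 come back as NaN from map, which makes missing keys easy to spot.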

How to pivot a dataframe into a square dataframe with number of intersections in other column as values

How can I pivot a dataframe into a square dataframe whose values are the number of intersections in the value column?
My input dataframe is
field value
a 1
a 2
b 3
b 1
c 2
c 5
Output should be
a b c
a 2 1 1
b 1 2 0
c 1 0 2
The values in the output data frame should be the number of entries in the value column that the two fields have in common.
Use a self-merge on value with crosstab:
df = df.merge(df, on='value')
df = pd.crosstab(df['field_x'], df['field_y'])
print (df)
field_y a b c
field_x
a 2 1 1
b 1 2 0
c 1 0 2
Then remove the index and column names with rename_axis:
# pandas 0.24+
df = df.rename_axis(index=None, columns=None)
print (df)
a b c
a 2 1 1
b 1 2 0
c 1 0 2
# pandas below 0.24
df = df.rename_axis(None).rename_axis(None, axis=1)
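A runnable sketch of the whole pipeline on the question's data, together with an alternative formulation that is not from the answer above: build a field-by-value indicator table and multiply it by its transpose. The names pairs, by_merge, indicator, and by_dot are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'field': list('aabbcc'),
                   'value': [1, 2, 3, 1, 2, 5]})

# Approach from the answer: pair up rows that share a value, then count pairs.
pairs = df.merge(df, on='value')
by_merge = (pd.crosstab(pairs['field_x'], pairs['field_y'])
              .rename_axis(index=None, columns=None))

# Alternative sketch: rows are fields, columns are values, cells are counts.
# The product with the transpose sums the shared values for each field pair.
indicator = pd.crosstab(df['field'], df['value'])
by_dot = (indicator.dot(indicator.T)
            .rename_axis(index=None, columns=None))

print(by_merge)
print(by_dot)
```

Both versions assume each (field, value) pair occurs at most once; repeated pairs would be counted multiple times.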

Set a string as index of pandas DataFrame

Given the dataframe df
df = pd.DataFrame([1,2,3,4])
print(df)
0
0 1
1 2
2 3
3 4
I would like to modify it as
print(df)
0
A 1
A 2
A 3
A 4
In this specific case you can use:
df.index = ['A'] * len(df)
Use set_index
In [797]: df.set_index([['A']*len(df)], inplace=True)
In [798]: df
Out[798]:
0
A 1
A 2
A 3
A 4
When you create the df, you can add it.
df = pd.DataFrame([1,2,3,4],index=['A']*4)
df
Out[325]:
0
A 1
A 2
A 3
A 4

keep the same factorizing between two data

We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we got
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we got
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization across both data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used in df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that vectorized operations perform better when applied once to a large DataFrame than multiple times to smaller DataFrames, so you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1 through it by setting the index of df1 to col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them first using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get unique values across both sets of data. Then create a series or a dictionary. This is your factorization that can be used across both data sets. Use map to get the output you are looking for.
import numpy as np
u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u) # this is factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
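Since the question notes that df2 may arrive later than df1, here is a minimal sketch that stores only the uniques returned by pd.factorize and reuses them afterwards through pd.Categorical. How the uniques are persisted between the two steps is left open; here they simply stay in a variable.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "A", "B", "C", "D", "E"]})
df2 = pd.DataFrame({'col1': ["A", "B", "D", "E"]})

# Step 1: factorize df1 and keep the uniques (the code -> label table).
df1['f_col1'], uniques = pd.factorize(df1['col1'])

# Step 2, later: encode df2 against the stored uniques.
# A label's code is its position in uniques; unseen labels get -1.
df2['f_col1'] = pd.Categorical(df2['col1'], categories=uniques).codes

print(df1)
print(df2)
```

The -1 code for labels that never occurred in df1 is worth checking for before using the result downstream.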

concat two dataframe using python

We have one dataframe like
-0.140447131 0.124802527 0.140780106
0.062166349 -0.121484447 -0.140675515
-0.002989106 0.13984927 0.004382326
and the other as
1
1
2
We need to concat both the dataframe like
-0.140447131 0.124802527 0.140780106 1
0.062166349 -0.121484447 -0.140675515 1
-0.002989106 0.13984927 0.004382326 2
Let's say your first dataframe is like
In [281]: df1
Out[281]:
a b c
0 -0.140447 0.124803 0.140780
1 0.062166 -0.121484 -0.140676
2 -0.002989 0.139849 0.004382
And, the second like,
In [283]: df2
Out[283]:
d
0 1
1 1
2 2
Then you could create new column for df1 using df2
In [284]: df1['d_new'] = df2['d']
In [285]: df1
Out[285]:
a b c d_new
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 1
2 -0.002989 0.139849 0.004382 2
This assumes that both dataframes share the same index.
Use pd.concat and set axis to 1 (columns):
df_new = pd.concat([df1, df2], axis=1)
>>> df_new
0 1 2 0
0 -0.140447 0.124803 0.140780 1
1 0.062166 -0.121484 -0.140676 2
2 -0.002989 0.139849 0.004382 3
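Both approaches align rows by index label, so it helps to see what happens when the labels differ. The sketch below uses a hypothetical df2 whose index is 10, 11, 12 (not part of the question) and shows one way to align by position instead.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [-0.140447, 0.062166, -0.002989],
                    'b': [0.124803, -0.121484, 0.139849],
                    'c': [0.140780, -0.140676, 0.004382]})

# Hypothetical case: df2 carries a different index than df1.
df2 = pd.DataFrame({'d': [1, 1, 2]}, index=[10, 11, 12])

# Labels do not match, so concat pads with NaN and returns six rows.
misaligned = pd.concat([df1, df2], axis=1)

# Resetting both indexes aligns the rows by position instead.
aligned = pd.concat([df1.reset_index(drop=True),
                     df2.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

Resetting the index is only appropriate when the row order of both frames is known to correspond.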
