Keep the same factorization between two data sets - python

We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"] = pd.factorize(df1.col1)[0]
df1
we get
  col1  f_col1
0    A       0
1    A       0
2    B       1
3    C       2
4    D       3
5    E       4
But when we do it for df2
df2["f_col1"] = pd.factorize(df2.col1)[0]
df2
we get
  col1  f_col1
0    A       0
1    B       1
2    D       2
3    E       3
This is not what I want. I want to keep the same factorization across the data sets, i.e. in df2 we should get something like
  col1  f_col1
0    A       0
1    B       1
2    D       3
3    E       4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The mapping should be stored from df1 and used on df2 when it becomes available.

You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
       col1  f_col1
df1 0     A       0
    1     B       1
    2     C       2
    3     D       3
    4     E       4
df2 0     A       0
    1     B       1
    2     D       3
    3     E       4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
  col1  f_col1
0    A       0
1    B       1
2    C       2
3    D       3
4    E       4
In [117]: df.loc['df2']
Out[117]:
  col1  f_col1
0    A       0
1    B       1
2    D       3
3    E       4
(But note that since the performance of vectorized operations improves when you apply them once to one large DataFrame instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
  col1  f_col1
0    A       0
1    B       1
2    D       3
3    E       4
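Another option, sketched here on the question's data: pd.factorize also returns the uniques array, which fully defines the mapping, so it can be stored when df1 is processed and reused when df2 arrives later via pd.Categorical (values unseen in df1 get code -1):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "A", "B", "C", "D", "E"]})
df2 = pd.DataFrame({'col1': ["A", "B", "D", "E"]})

# Factorize df1 and keep the uniques array -- it fully defines the mapping.
df1['f_col1'], uniques = pd.factorize(df1['col1'])

# Later, when df2 becomes available, reuse the stored uniques:
# each value's code is its position in `uniques`; unseen values get -1.
df2['f_col1'] = pd.Categorical(df2['col1'], categories=uniques).codes

print(df2)
```

This avoids concatenating or merging the frames, which fits the PS requirement that the mapping be stored from df1.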

You could reuse the f_col1 column of df1 and map the values of df2.col1 by setting the index to df1.col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0    0
1    1
2    3
3    4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
  col1  f_col1
0    A       0
1    B       1
2    D       3
3    E       4
In case df1 has duplicate records, drop them using drop_duplicates:
In [290]: df1
Out[290]:
  col1  f_col1
0    A       0
1    A       0
2    B       1
3    C       2
4    D       3
5    E       4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0    0
1    1
2    3
3    4
Name: col1, dtype: int32

You want to get unique values across both sets of data. Then create a series or a dictionary. This is your factorization that can be used across both data sets. Use map to get the output you are looking for.
import numpy as np

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
  col1  f_col1
0    A       0
1    A       0
2    B       1
3    C       2
4    D       3
5    E       4
print(df2)
  col1  f_col1
0    A       0
1    B       1
2    D       3
3    E       4
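Since the two data sets may not be available at the same time, the factorization Series f can also be persisted to disk and reloaded later. A minimal sketch, building the mapping from df1 alone and assuming a hypothetical file name factorization.csv:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "A", "B", "C", "D", "E"]})

# Build the factorization once from df1 and save it (hypothetical file name).
u = np.unique(df1.col1.values)
f = pd.Series(range(len(u)), index=u)
f.to_csv('factorization.csv', header=False)

# Later, when df2 becomes available, reload the mapping and apply it.
f_loaded = pd.read_csv('factorization.csv', header=None, index_col=0)[1]
df2 = pd.DataFrame({'col1': ["A", "B", "D", "E"]})
df2['f_col1'] = df2.col1.map(f_loaded)
print(df2)
```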

Related

join two overlapping dataframes vertically

I am trying to update df1 with df2:
add new rows from df2 to df1
update existing rows (if the row index exists)
df1 = pd.DataFrame([[1,3],[2,4]], index=[1,2], columns=['a','b'])
df2 = pd.DataFrame([[1,0],[2,3]], index=[3,2], columns=['a','b'])
The expected result should be
   a  b
1  1  3
2  2  3
3  1  0
but
df1.append(df2).drop_duplicates(keep='last') # drop_duplicates has no effect
gives a simple vertical stack
   a  b
1  1  3
2  2  4
3  1  0
2  2  3
df1.merge(df2, how='outer')
gives the same values and destroys the row index
   a  b
0  1  3
1  2  4
2  1  0
3  2  3
df1.join(df2)
df1.loc[df2.index] = df1.values
both raise errors.
Try this (append was removed in pandas 2.0, so pd.concat is used here):
new_df = pd.concat([df1, df2])
new_df = new_df[~new_df.index.duplicated(keep='last')]
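An alternative sketch, assuming the goal is "take df2's row where the index overlaps, otherwise keep df1's row", is combine_first (note it returns floats when the index union introduces NaNs):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 3], [2, 4]], index=[1, 2], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 0], [2, 3]], index=[3, 2], columns=['a', 'b'])

# combine_first prefers df2's values and falls back to df1 for the
# indices df2 does not cover -- i.e. "update existing, add new".
new_df = df2.combine_first(df1)
print(new_df)
```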

Assign each unique value of column to whole Dataframe as if data frame duplicate itself based on value of another column

I am trying to iterate over the values of a column from df2 and assign each value to df1, as if df1 duplicated itself for each value of the column from df2.
Let's say I have df1 as below:
df1
1
2
3
and df2 as per below:
df2
A
B
C
I want third dataframe df3 will became like below:
df3
1 A
2 A
3 A
1 B
2 B
3 B
1 C
2 C
3 C
For now I have tried the code below:
for i, value in ACS_shock['scenario'].iteritems():
    df1['sec'] = df1[i] = value[:]
But when I generate the file from df1 my output is like below:
1 A B C
2 A B C
3 A B C
Any idea how I can correct this code?
Much appreciated.
You can use pd.concat and np.repeat:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.Series([1,2,3])
>>> df1
0    1
1    2
2    3
dtype: int64
>>> df2 = pd.Series(list('ABC'))
>>> df2
0    A
1    B
2    C
dtype: object
>>> df3 = pd.DataFrame({'df1': pd.concat([df1]*3).reset_index(drop=True),
...                     'df2': np.repeat(df2, 3).reset_index(drop=True)})
>>> df3
   df1 df2
0    1   A
1    2   A
2    3   A
3    1   B
4    2   B
5    3   B
6    1   C
7    2   C
8    3   C
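Another way to sketch the same cross product, assuming pandas >= 1.2, is merge with how='cross' (merging from df2 first keeps the letters grouped as in the desired df3):

```python
import pandas as pd

df1 = pd.DataFrame({'df1': [1, 2, 3]})
df2 = pd.DataFrame({'df2': ['A', 'B', 'C']})

# A cross join pairs every row of df2 with every row of df1;
# reordering the columns afterwards puts the numbers first.
df3 = df2.merge(df1, how='cross')[['df1', 'df2']]
print(df3)
```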

Looking up values in two pandas data frames and create new columns

I have two data frames in my problem.
df1
ID Value
1 A
2 B
3 C
df2:
ID F_ID S_ID
1 2 3
2 3 1
3 1 2
I want to create a column next to each ID column that will store the values looked up from df1. The output should look like this :
ID  ID_Value  F_ID  F_ID_Value  S_ID  S_ID_Value
 1         A     2           B     3           C
 2         B     3           C     1           A
 3         C     1           A     2           B
Basically looking up from df1 and creating a new column to store these values.
You can use map on each column of df2 with the values from df1:
s = df1.set_index('ID')['Value']
for col in df2.columns:
    df2[f'{col}_value'] = df2[col].map(s)
print(df2)
   ID  F_ID  S_ID ID_value F_ID_value S_ID_value
0   1     2     3        A          B          C
1   2     3     1        B          C          A
2   3     1     2        C          A          B
or with apply and concat:
df_ = pd.concat([df2, df2.apply(lambda x: x.map(s)).add_suffix('_value')], axis=1)
df_ = df_.reindex(sorted(df_.columns), axis=1)
If column order is important, it is necessary to use DataFrame.insert with enumerate and some maths:
s = df1.set_index('ID')['Value']
for i, col in enumerate(df2.columns, 1):
    df2.insert(i * 2 - 1, f'{col}_value', df2[col].map(s))
print (df2)
   ID ID_value  F_ID F_ID_value  S_ID S_ID_value
0   1        A     2          B     3          C
1   2        B     3          C     1          A
2   3        C     1          A     2          B
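A dictionary-based sketch of the same lookup is DataFrame.replace, which maps every ID in df2 at once (the _Value suffix here is my own naming choice):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'F_ID': [2, 3, 1], 'S_ID': [3, 1, 2]})

# Turn df1 into a plain dict and replace every ID in df2 with its value.
vals = df1.set_index('ID')['Value'].to_dict()
looked_up = df2.replace(vals).add_suffix('_Value')

out = pd.concat([df2, looked_up], axis=1)
print(out)
```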

Adding a column in dataframes based on similar columns in them

I am trying to get an output where I add up column d of d1 and d2 wherever columns a, b and c match (like a groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
   a  b  c  d
0  1  2  3  8
1  2  3  4  5
Merging the two data frames and adding the resulting column d where a, b and c are the same.
d1.add(d2) or radd gives me an aggregate of all columns.
The result should be a DataFrame that can be added again to another one in the same way.
Any help is appreciated.
You can use set_index first:
print (d2.set_index(['a','b','c'])
         .add(d1.set_index(['a','b','c']), fill_value=0)
         .astype(int)
         .reset_index())
   a  b  c  d
0  1  2  3  8
1  2  3  4  5
df = pd.concat([d1, d2])
df.drop_duplicates()
   a  b  c  d
0  1  2  3  4
1  2  3  4  5
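Since the question says "like groupby", the same result can also be sketched by stacking the frames and summing d within each (a, b, c) group:

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'c', 'd'])
d2 = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5]], columns=['a', 'b', 'c', 'd'])

# Stack both frames, then sum d within each (a, b, c) group.
out = pd.concat([d1, d2]).groupby(['a', 'b', 'c'], as_index=False)['d'].sum()
print(out)
```

The result is a plain DataFrame with the same four columns, so it can be added to yet another frame the same way.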

concat two dataframe using python

We have one dataframe like
-0.140447131 0.124802527 0.140780106
0.062166349 -0.121484447 -0.140675515
-0.002989106 0.13984927 0.004382326
and the other as
1
1
2
We need to concat both the dataframe like
-0.140447131 0.124802527 0.140780106 1
0.062166349 -0.121484447 -0.140675515 1
-0.002989106 0.13984927 0.004382326 2
Let's say your first dataframe is like
In [281]: df1
Out[281]:
          a         b         c
0 -0.140447  0.124803  0.140780
1  0.062166 -0.121484 -0.140676
2 -0.002989  0.139849  0.004382
And, the second like,
In [283]: df2
Out[283]:
   d
0  1
1  1
2  2
Then you could create a new column for df1 using df2:
In [284]: df1['d_new'] = df2['d']
In [285]: df1
Out[285]:
          a         b         c  d_new
0 -0.140447  0.124803  0.140780      1
1  0.062166 -0.121484 -0.140676      1
2 -0.002989  0.139849  0.004382      2
The assumption, however, is that both dataframes share a common index.
Use pd.concat and specify axis=1 (columns):
df_new = pd.concat([df1, df2], axis=1)
>>> df_new
          0         1         2  0
0 -0.140447  0.124803  0.140780  1
1  0.062166 -0.121484 -0.140676  1
2 -0.002989  0.139849  0.004382  2
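If the two frames do not share a common index (for example after filtering), resetting both indexes first makes the side-by-side concat purely positional; a small sketch with the question's data:

```python
import pandas as pd

df1 = pd.DataFrame([[-0.140447131, 0.124802527, 0.140780106],
                    [0.062166349, -0.121484447, -0.140675515],
                    [-0.002989106, 0.13984927, 0.004382326]])
df2 = pd.DataFrame([1, 1, 2])

# Align by position rather than by index label before concatenating.
df_new = pd.concat([df1.reset_index(drop=True),
                    df2.reset_index(drop=True)], axis=1)
print(df_new)
```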
