Transform multiple categorical columns - python

In my dataset I have two categorical columns that I would like to encode numerically. Both columns contain countries, and some overlap (appear in both columns). I would like to assign the same number to a given country in col1 and col2.
My data looks somewhat like:
import pandas as pd
d = {'col1': ['NL', 'BE', 'FR', 'BE'], 'col2': ['BE', 'NL', 'ES', 'ES']}
df = pd.DataFrame(data=d)
df
Currently I am transforming the data like this:
from sklearn.preprocessing import LabelEncoder
df.apply(LabelEncoder().fit_transform)
However, because each column is encoded independently, this makes no distinction between FR and ES. Is there another simple way to arrive at the following output?
o = {'col1': [2,0,1,0], 'col2': [0,2,4,4]}
output = pd.DataFrame(data=o)
output

Here is one way
df.stack().astype('category').cat.codes.unstack()
Out[190]:
col1 col2
0 3 0
1 0 3
2 2 1
3 0 1
Or
s=df.stack()
s[:]=s.factorize()[0]
s.unstack()
Out[196]:
col1 col2
0 0 1
1 1 0
2 2 3
3 1 3
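A numpy-flavored variant of the same idea is to factorize the flattened values and reshape; this should give the same codes as the stack/factorize approach above:
codes, uniques = pd.factorize(df.values.ravel())
pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)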

You can fit the LabelEncoder() on the unique values across the whole dataframe first, and then transform each column:
le = LabelEncoder()
le.fit(pd.concat([df.col1, df.col2]).unique()) # or np.unique(df.values.reshape(-1,1))
df.apply(le.transform)
Out[28]:
col1 col2
0 3 0
1 0 3
2 2 1
3 0 1
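Because the encoder is fitted once on the union of values, it can be reused on any later data with the same label set; a small usage sketch (the new frame here is made up for illustration):
new = pd.DataFrame({'col1': ['ES', 'NL'], 'col2': ['BE', 'FR']})
new.apply(le.transform)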

np.unique with return_inverse also works, though you then need to reconstruct the DataFrame:
import numpy as np

pd.DataFrame(np.unique(df, return_inverse=True)[1].reshape(df.shape),
             index=df.index,
             columns=df.columns)
col1 col2
0 3 0
1 0 3
2 2 1
3 0 1

Related

How to add interleaving rows as result of sort / groups?

I have the following sample input data:
import pandas as pd
df = pd.DataFrame({'col1': ['x', 'y', 'z'], 'col2': [1, 2, 3], 'col3': ['a', 'a', 'b']})
I would like to sort and group by col3 while interleaving the summaries on top of the corresponding group in col1 and get the following output:
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
I can of course do the part:
df.sort_values(by=['col3']).groupby(by=['col3']).sum()
col2
col3
a 3
b 3
but I am not sure how to interleave the group labels on top of col1.
Use a custom function that builds a one-row summary (group label plus sum) and prepends it to each group:
def f(x):
    # summary row: group name in col1, group total in col2
    top = pd.DataFrame({'col1': x.name, 'col2': x['col2'].sum()}, index=[0])
    return pd.concat([top, x])
df = (df.sort_values(by=['col3'])
        .groupby(by=['col3'], group_keys=False)
        .apply(f)
        .drop(columns='col3')
        .reset_index(drop=True))
print (df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
A more performant solution uses GroupBy.ngroup for the indices: aggregate the sums, then join the pieces back together with concat, sorting the index only with the stable mergesort:
df = df.sort_values(by=['col3'])
df1 = df.groupby(by=['col3'])['col2'].sum().rename_axis('col1').reset_index()
df2 = df.set_index(df.groupby(by=['col3']).ngroup())
df = pd.concat([df1, df2]).sort_index(kind='mergesort', ignore_index=True).drop(columns='col3')
print (df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
What about:
(df.melt(id_vars='col2')
.rename(columns={'value': 'col1'})
.groupby('col1').sum()
.reset_index()
)
output (note that the rows come out sorted by col1, not interleaved with the summary row on top of each group):
col1 col2
0 a 3
1 b 3
2 x 1
3 y 2
4 z 3
def function1(dd: pd.DataFrame):
    # write a summary row just above each group, using a fractional index label
    df.loc[dd.index.min() - 0.5, ['col1', 'col2']] = [dd.name, dd.col2.sum()]

df.groupby('col3').apply(function1)
df.sort_index(ignore_index=True).drop('col3', axis=1)
output
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
Or use the pandasql library:
def function1(dd: pd.DataFrame):
    return dd.sql("select '{}' as col1,{} as col2 union select col1,col2 from self".format(dd.name, dd.col2.sum()))

df.groupby('col3').apply(function1).reset_index(drop=False)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3

Create another column based on matching of two data frames' columns

I have two data frames, df1 and df2, with a common column ID. The data frames have different numbers of rows and columns.
I want to compare the IDs of these two data frames: create another column y in df1 whose value is 0 for all IDs present in both df1 and df2, else 1.
For example df1 is
Id col1 col2
1 Abc def
2 Geh ghk
3 Abd fg
1 Dfg abc
And df2 is
Id col3 col4
1 Dgh gjs
2 Gsj aie
The final dataframe should be
Id col1 col2 y
1 Abc def 0
2 Geh ghk 0
3 Abd fg 1
1 Dfg abc 0
Let's create df1 and df2 first:
df1=pd.DataFrame({'ID':[1,2,3,1], 'col1':['A','B','C', 'D'], 'col2':['C','D','E', 'F']})
df2=pd.DataFrame({'ID':[1,2], 'col3':['AA','BB'], 'col4':['CC','DD']})
Here, map with a lambda function comes in handy:
df1['y'] = df1['ID'].map(lambda x:0 if x in df2['ID'].values else 1)
df1
ID col1 col2 y
0 1 A C 0
1 2 B D 0
2 3 C E 1
3 1 D F 0
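A vectorized alternative using isin gives the same result without a Python-level loop:
df1['y'] = (~df1['ID'].isin(df2['ID'])).astype(int)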

Dummy variables when not all categories are present across multiple features & data sets

I want to ask an extension of this question, which talks about adding a label for missing classes so the dummies are encoded correctly.
Is there a way to do this automatically across multiple sets of data, with the labels kept in sync between them (i.e. for test & training sets: the same columns, but different classes of data represented in each)?
E.g.:
Suppose I have the following two dataframes:
df1 = pd.DataFrame({'col1': list('abc'), 'col2': list('123')})
df2 = pd.DataFrame({'col1': list('bcd'), 'col2': list('234')})
df1
col1 col2
0 a 1
1 b 2
2 c 3
df2
col1 col2
0 b 2
1 c 3
2 d 4
I want to have:
df1
col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
0 1 0 0 0 1 0 0 0
1 0 1 0 0 0 1 0 0
2 0 0 1 0 0 0 1 0
df2
col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
0 0 1 0 0 0 1 0 0
1 0 0 1 0 0 0 1 0
2 0 0 0 1 0 0 0 1
WITHOUT having to specify in advance that
col1_labels = ['a', 'b', 'c', 'd'], col2_labels = ['1', '2', '3', '4']
And can I do this systematically for many columns all at once? I'm imagining a function that, when passed two or more dataframes (assuming the columns are the same for all):
reads which columns in the pandas dataframe are categories
figures out what that overall labels are
and then provides the category labels to each column
Does that seem right? Is there a better way?
I think you need to reindex by the union of all columns, assuming both DataFrames have the same categorical column names:
print (df1)
df1
1 a
2 b
3 c
print (df2)
df1
1 b
2 c
3 d
df1 = pd.get_dummies(df1)
df2 = pd.get_dummies(df2)
union = df1.columns.union(df2.columns)
df1 = df1.reindex(columns=union, fill_value=0)
df2 = df2.reindex(columns=union, fill_value=0)
print (df1)
df1_a df1_b df1_c df1_d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
print (df2)
df1_a df1_b df1_c df1_d
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
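Another route is to cast both frames to shared categorical dtypes before calling get_dummies, so every category is emitted even when it is absent from one frame. A sketch, assuming the freshly constructed df1 and df2 from the question:
from pandas.api.types import union_categoricals

for col in df1.columns:
    # pool the categories seen in either frame
    cats = union_categoricals([df1[col].astype('category'),
                               df2[col].astype('category')]).categories
    df1[col] = df1[col].astype(pd.CategoricalDtype(categories=cats))
    df2[col] = df2[col].astype(pd.CategoricalDtype(categories=cats))

df1 = pd.get_dummies(df1)  # both frames now yield the same dummy columns
df2 = pd.get_dummies(df2)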

Pandas replace, multi column criteria

I'm trying to replace values in a Pandas data frame, based on certain criteria on multiple columns. For a single column criteria this can be done very elegantly with a dictionary (e.g. Remap values in pandas column with a dict):
import pandas as pd
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2}, 'col2': {0:10, 1:20, 2:20}})
rdict = {1:'a', 2:'b'}
df2 = df.replace({"col1": rdict})
Input df:
col1 col2
0 1 10
1 1 20
2 2 20
Resulting df2:
col1 col2
0 a 10
1 a 20
2 b 20
I'm trying to extend this to criteria over multiple columns (e.g. where col1==1, col2==10 -> replace). For a single criteria this can be done like:
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'c'
Which results in a df3:
col1 col2
0 c 10
1 1 20
2 2 20
My real-life problem has a large number of criteria, which would involve a large number of df3.loc[((criteria1)&(criteria2)), column] = value calls, which is far less elegant than the replacement using a dictionary as a "lookup table". Is it possible to extend the elegant solution (df2 = df.replace({"col1": rdict})) to a setup where values in one column are replaced by criteria based on multiple columns?
An example of what I'm trying to achieve (although in my real life case the number of criteria is a lot larger):
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'a'
df3.loc[((df['col1']==1)&(df['col2']==20)), 'col1'] = 'b'
df3.loc[((df['col1']==2)&(df['col2']==10)), 'col1'] = 'c'
df3.loc[((df['col1']==2)&(df['col2']==20)), 'col1'] = 'd'
Input df:
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
Resulting df3:
col1 col2
0 a 10
1 b 20
2 c 10
3 d 20
We can use merge.
Suppose your df looks like
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2, 4:2, 5:1}, 'col2': {0:10, 1:20, 2:10, 3:20, 4: 20, 5:10}})
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
4 2 20
5 1 10
And your conditional replacement can be represented as another dataframe:
df_replace
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
(As OP (Bart) pointed out, you can save this in a csv file.)
Then you can use
df = df.merge(df_replace, on=["col1", "col2"], how="left")
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
4 2 20 d
5 1 10 a
Then you just need to drop the original col1 (or overwrite it with val).
As MaxU pointed out, there could be rows that do not get matched, resulting in NaN. We can use a line like
df["val"] = df["val"].combine_first(df["col1"])
to fill in values from col1 wherever the merge produced NaN.
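Putting both steps together, a sketch starting again from the original df and the df_replace table above:
df = df.merge(df_replace, on=["col1", "col2"], how="left")
df["col1"] = df.pop("val").combine_first(df["col1"])  # col1 now holds a, b, c, d, d, a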
Demo:
Source DF:
In [120]: df
Out[120]:
col1 col2
0 1 10
1 1 10
2 1 20
3 1 20
4 2 10
5 2 20
6 3 30
Conditions & Replacements DF:
In [121]: cond
Out[121]:
col1 col2 repl
1 1 20 b
2 2 10 c
0 1 10 a
3 2 20 d
Solution:
In [121]: res = df.merge(cond, how='left')
yields:
In [122]: res
Out[122]:
col1 col2 repl
0 1 10 a
1 1 10 a
2 1 20 b
3 1 20 b
4 2 10 c
5 2 20 d
6 3 30 NaN # <-- NOTE
In [123]: res['col1'] = res.pop('repl').fillna(res['col1'])
In [124]: res
Out[124]:
col1 col2
0 a 10
1 a 10
2 b 20
3 b 20
4 c 10
5 d 20
6 3 30
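As noted above, the conditions-and-replacements table can live outside the code; a sketch that loads it from a CSV file (the file name is hypothetical):
cond = pd.read_csv('replacements.csv')  # hypothetical file with columns col1, col2, repl
res = df.merge(cond, how='left')
res['col1'] = res.pop('repl').fillna(res['col1'])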
This method is likely to be more efficient than pandas functionality, as it relies on numpy arrays and dictionary mappings.
import pandas as pd

df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}
df['col1'] = list(map(rdict.get, [tuple(x) for x in df[['col1', 'col2']].values]))
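Note that rdict.get returns None for combinations missing from the dictionary; if you prefer to keep the original value as a fallback, dict.get's second argument handles that (a sketch):
df['col1'] = [rdict.get(tuple(x), x[0]) for x in df[['col1', 'col2']].values]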

Keep the same factorizing between two data sets

We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorizing between the two data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used in df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since the performance of vectorized operations improves if you can apply them once to large DataFrames instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1, after setting df1's index to col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them first using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get unique values across both sets of data. Then create a series or a dictionary. This is your factorization that can be used across both data sets. Use map to get the output you are looking for.
import numpy as np

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
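Since the two data sets may not be available at the same time, another option is to persist the uniques returned by pd.factorize and reuse them later via pd.Categorical; unseen levels in df2 would get the code -1 (a sketch):
df1['f_col1'], uniques = pd.factorize(df1['col1'])
# later, when df2 arrives, reuse the stored uniques
df2['f_col1'] = pd.Categorical(df2['col1'], categories=uniques).codes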
