I have the following dataframe:
>>> df
   n1  n2   dense c1   c2     c3
0   1   4  [1, 4]  a   h1     tt
1   2   5  [2, 5]  b  bbw   ebay
2   3   6  [3, 6]  c   we  yahoo
I want to create one-hot encoded columns for the c1, c2, and c3 columns:
>>> df_updated = pd.get_dummies(df, prefix_sep='_', dummy_na=True, columns=['c1', 'c2', 'c3'])
>>> df_updated
   n1  n2   dense  c1_a  c1_b  c1_c  c1_nan  c2_bbw  c2_h1  c2_we  c2_nan  c3_ebay  c3_tt  c3_yahoo  c3_nan
0   1   4  [1, 4]     1     0     0       0       0      1      0       0        0      1         0       0
1   2   5  [2, 5]     0     1     0       0       1      0      0       0        1      0         0       0
2   3   6  [3, 6]     0     0     1       0       0      0      1       0        0      0         1       0
But how can I get a list of the columns generated by get_dummies()?
Ex. ['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1', 'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']
I know one way of doing that is list(set(df_updated.columns) - set(df.columns)), but is there a better way?
One way is to store the pre-encoding column names in a variable and then use filter:
cols, sep = ['c1', 'c2', 'c3'], '_'
df_updated = pd.get_dummies(df, prefix_sep=sep,
                            dummy_na=True, columns=cols)
df_dum = df_updated.filter(regex=rf'^(?:{"|".join(cols)}){sep}\w+', axis=1)
Or, more simply, use difference:
cols_dum = list(df_updated.columns.difference(df.columns))
Output:
print(list(df_dum.columns))  # or print(cols_dum)
['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1',
 'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']
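Note that difference returns the names sorted alphabetically. If the original column order matters, a small variation, assuming pandas 0.24+ (where difference accepts a sort argument):
# sort=False keeps the generated columns in their original order
cols_dum = list(df_updated.columns.difference(df.columns, sort=False))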
I have been using one-hot encoding for a while now in all of my data pre-processing pipelines.
But I have run into an issue now that I am trying to pre-process new data automatically with a Flask server running a model.
TL;DR: I am trying to search new data for a specific Date, region, and type, and run .predict() on it.
The problem arises because, after I search for a specific data point, I have to change the columns from objects to the one-hot encoded ones.
My question is: how do I know which column is for which category inside a feature? I have around 240 columns after one-hot encoding.
IIUC, use get_feature_names_out():
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 1, 0],
                   'C': [0, 2, 2], 'D': [0, 1, 1]})
ohe = OneHotEncoder()
data = ohe.fit_transform(df)
df1 = pd.DataFrame(data.toarray(), columns=ohe.get_feature_names_out(), dtype=int)
Output:
>>> df
   A  B  C  D
0  0  3  0  0
1  1  1  2  1
2  2  0  2  1
>>> df1
   A_0  A_1  A_2  B_0  B_1  B_3  C_0  C_2  D_0  D_1
0    1    0    0    0    0    1    1    0    1    0
1    0    1    0    0    1    0    0    1    0    1
2    0    0    1    1    0    0    0    1    0    1
>>> pd.Series(ohe.get_feature_names_out()).str.rsplit('_', n=1).str[0]
0    A
1    A
2    A
3    B
4    B
5    B
6    C
7    C
8    D
9    D
dtype: object
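To map each dummy column back to its source feature without string parsing, the fitted encoder's attributes can be combined directly. A minimal sketch, assuming scikit-learn >= 1.0 (for feature_names_in_):
# categories_ is aligned with feature_names_in_ by position, and
# get_feature_names_out() builds each name as feature + '_' + category
mapping = {col: [f'{col}_{cat}' for cat in cats]
           for col, cats in zip(ohe.feature_names_in_, ohe.categories_)}
print(mapping)
# {'A': ['A_0', 'A_1', 'A_2'], 'B': ['B_0', 'B_1', 'B_3'],
#  'C': ['C_0', 'C_2'], 'D': ['D_0', 'D_1']}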
Let's say I have a (pandas) dataframe like this:
Index  A  ID  B  C
1      a   1  0  0
2      b   2  0  0
3      c   2  a  a
4      d   3  0  0
I want to copy the data of the third row to the second row because their IDs match but the second row's data is not filled. However, I want to leave column 'A' intact. I am looking for a result like this:
Index  A  ID  B  C
1      a   1  0  0
2      b   2  a  a
3      c   2  a  a
4      d   3  0  0
What would you suggest as solution?
You can try replacing '0' with NaN, then ffill()+bfill() within each ID group using groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
Output of df:
   Index  A  ID  B  C
0      1  a   1  0  0
1      2  b   2  a  a
2      3  c   2  a  a
3      4  d   3  0  0
Note: you can also use the transform() method in place of apply(), as sketched below.
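For reference, a sketch of the transform() variant (a drop-in replacement for the apply() line above, after the same replace step):
# transform fills each column within its ID group and returns a result
# aligned with the original index, so no extra reindexing is needed
df[['B','C']] = df.groupby('ID')[['B','C']].transform(lambda x: x.ffill().bfill()).fillna('0')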
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
   ID  A  B  C  Index
0   1  a  0  0    1.0
1   2  b  a  a    2.0
2   2  c  a  a    3.0
3   3  d  0  0    4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does this solve your problem? I would also look into other pandas functions, such as join and merge; a merge-based sketch follows the output below.
❯ python3 b.py
   A  ID  B  C
1  a   1  0  0
2  b   2  a  a
3  c   2  a  a
4  d   3  0  0
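As for the merge suggestion, here is a hedged sketch of one merge-based variant, using the same df as above (note that merge resets the index):
# rows where B and C hold real values rather than the 0 placeholder
filled = df.loc[df['B'].ne(0) & df['C'].ne(0), ['ID', 'B', 'C']]
# broadcast those values to every row with a matching ID;
# fillna(0) restores the placeholder for IDs that have no filled row
out = df[['A', 'ID']].merge(filled, on='ID', how='left').fillna(0)
print(out)
#    A  ID  B  C
# 0  a   1  0  0
# 1  b   2  a  a
# 2  c   2  a  a
# 3  d   3  0  0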
Using get_dummies(), it is possible to create one-hot encoded dummy variables for categorical data. For example:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c']})
print(pd.get_dummies(df))
#    A_a  A_b  B_a  B_b  B_c
# 0    1    0    0    1    0
# 1    0    1    1    0    0
# 2    1    0    0    0    1
So far, so good. But how can I use get_dummies() in combination with multi-index columns? The default behavior is not very practical: the multi-index tuple is converted into a string, and the same suffix mechanism applies as with single-level columns.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
                   ('ii','B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df)
print(ret)
print(type(ret.columns[0]))
#    ('i','A')_a  ('i','A')_b  ('ii','B')_a  ('ii','B')_b  ('ii','B')_c
# 0            1            0             0             1             0
# 1            0            1             1             0             0
# 2            1            0             0             0             1
#
# str
What I would like to get, however, is for the dummies to create a new column level:
ret = pd.get_dummies(df, ???)
print(ret)
print(type(ret.columns[0]))
#    i     ii
#    A     B
#    a  b  a  b  c
# 0  1  0  0  1  0
# 1  0  1  1  0  0
# 2  1  0  0  0  1
#
# tuple
#
# Note that the ret would be equivalent to the following:
#    ('i','A','a')  ('i','A','b')  ('ii','B','a')  ('ii','B','b')  ('ii','B','c')
# 0              1              0               0               1               0
# 1              0              1               1               0               0
# 2              1              0               0               0               1
How could this be achieved?
Update: I placed a feature request for better support of multi-index data-frames in get_dummies: https://github.com/pandas-dev/pandas/issues/26560
You can parse the column names and rename them:
import ast
def parse_dummy(x):
    # split on the last underscore so category values containing '_' stay intact
    prefix, value = x.rsplit('_', 1)
    return ast.literal_eval(prefix) + (value,)
ret.columns = pd.Series(ret.columns).apply(parse_dummy)
#    (i, A, a)  (i, A, b)  (ii, B, a)  (ii, B, b)  (ii, B, c)
# 0          1          0           0           1           0
# 1          0          1           1           0           0
# 2          1          0           0           0           1
Note that this DataFrame is not the same as a DataFrame with three-level multiindex column names.
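If an actual three-level MultiIndex is wanted, the tuple columns can be promoted afterwards; a one-line sketch:
# turn the Index of 3-tuples into a proper 3-level MultiIndex
ret.columns = pd.MultiIndex.from_tuples(ret.columns)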
I had a similar need, but in a more complex DataFrame with a MultiIndex as the row index and numerical columns that should not be converted to dummies. My case required scanning through the columns, expanding only the columns of dtype='object' into dummies, and building a new column index by concatenating the name of the dummified column with the value of the dummy variable itself. This is because I didn't want to add a new column index level.
Here is the code. First, build a dataframe in the format I need:
import pandas as pd
import numpy as np
df_size = 3
objects = ['obj1','obj2']
attributes = ['a1','a2','a3']
cols = pd.MultiIndex.from_product([objects, attributes], names=['objects', 'attributes'])
lab1 = ['car','truck','van']
lab2 = ['bay','zoe','ros','lol']
time = np.arange(df_size)
node = ['n1','n2']
idx = pd.MultiIndex.from_product([time, node], names=['time', 'node'])
df = pd.DataFrame(np.random.randint(10,size=(len(idx),len(cols))),columns=cols,index=idx)
c1 = map(lambda i:lab1[i],np.random.randint(len(lab1),size=len(idx)))
c2 = map(lambda i:lab2[i],np.random.randint(len(lab2),size=len(idx)))
df[('obj1','a3')]=list(c1)
df[('obj2','a2')]=list(c2)
print(df)
objects     obj1            obj2
attributes    a1  a2     a3   a1   a2  a3
time node
0    n1        6   5  truck    3  ros   3
     n2        5   6    car    9  lol   7
1    n1        0   8  truck    7  zoe   8
     n2        4   3  truck    8  bay   3
2    n1        5   8    van    0  bay   0
     n2        4   8    car    5  lol   4
And here is the code to dummify only the object columns:
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [f'{t[1]}_{c}' for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
objects    obj1                            obj2
attributes   a1  a2  a3_car  a3_truck  a3_van  a1  a2_bay  a2_lol  a2_ros  a2_zoe  a3
time node
0    n1       6   5       0         1       0   3       0       0       1       0   3
     n2       5   6       1         0       0   9       0       1       0       0   7
1    n1       0   8       0         1       0   7       0       0       0       1   8
     n2       4   3       0         1       0   8       1       0       0       0   3
2    n1       5   8       0         0       1   0       1       0       0       0   0
     n2       4   8       1         0       0   5       0       1       0       0   4
It can easily be changed to answer the original use case by adding one more level to the columns MultiIndex.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
                   ('ii','B'): ['b', 'a', 'c']})
print(df)
   i ii
   A  B
0  a  b
1  b  a
2  a  c
df.columns = pd.MultiIndex.from_tuples([t+('',) for t in df.columns])
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [t[1]], [c for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
   i     ii
   A     B
   a  b  a  b  c
0  1  0  0  1  0
1  0  1  1  0  0
2  1  0  0  0  1
Note that it still works if there are numerical columns; it just adds an empty additional level for them in the column index as well.
I want to merge two datasets by their indexes and columns, combining the entire dataset.
df1 = pd.DataFrame([[1, 0, 0], [0, 2, 0], [0, 0, 3]],columns=[1, 2, 3])
df1
   1  2  3
0  1  0  0
1  0  2  0
2  0  0  3
df2 = pd.DataFrame([[0, 0, 1], [0, 2, 0], [3, 0, 0]],columns=[1, 2, 3])
df2
   1  2  3
0  0  0  1
1  0  2  0
2  3  0  0
I have tried this code, but I got the error below. I can't see why the size of the axis is the problem.
df_sum = pd.concat([df1, df2])\
    .groupby(df2.index)[df2.columns]\
    .sum().reset_index()
ValueError: Grouper and axis must be same length
This is the output I expected for df_sum:
df_sum
   1  2  3
0  1  0  1
1  0  4  0
2  3  0  3
You can use df1.add(df2, fill_value=0). It adds df2 to df1 element-wise and treats NaN values as 0. (The groupby attempt fails because pd.concat([df1, df2]) has six rows, while df2.index supplies only three grouping values, hence "Grouper and axis must be same length".)
>>> import numpy as np
>>> import pandas as pd
>>> df2 = pd.DataFrame([(10,9),(8,4),(7,np.nan)], columns=['a','b'])
>>> df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['a','b'])
>>> df1.add(df2, fill_value=0)
    a     b
0  11  11.0
1  11   8.0
2  12   6.0
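Applied to the frames from the question, the same call gives the expected df_sum; the concat approach also works once the grouper matches the stacked axis, e.g. by grouping on the index level. A sketch:
df_sum = df1.add(df2, fill_value=0)
# level=0 groups by the repeated index of the stacked frame,
# so grouper and axis now have the same length
df_sum_alt = pd.concat([df1, df2]).groupby(level=0).sum()
print(df_sum)
#    1  2  3
# 0  1  0  1
# 1  0  4  0
# 2  3  0  3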
I'm trying to one-hot encode one column of a dataframe.
enc = OneHotEncoder()
minitable = enc.fit_transform(df["ids"])
But I'm getting
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19.
Is there a workaround for this?
I think you can use get_dummies:
df = pd.DataFrame({'ids':['a','b','c']})
print (df)
  ids
0   a
1   b
2   c
print (df.ids.str.get_dummies())
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
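If you want to stay with OneHotEncoder, the warning itself suggests the workaround: pass a 2-D selection instead of a 1-D Series. A sketch, assuming a scikit-learn version that accepts string categories (0.20+):
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
# double brackets select a one-column DataFrame (2-D), not a Series (1-D)
minitable = enc.fit_transform(df[['ids']])
print(minitable.toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]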
EDIT:
If the input is a column of lists, first cast it to str, remove the [] with strip, and call get_dummies:
df = pd.DataFrame({'ids':[[0,4,5],[4,7,8],[5,1,2]]})
print(df)
         ids
0  [0, 4, 5]
1  [4, 7, 8]
2  [5, 1, 2]
print (df.ids.astype(str).str.strip('[]').str.get_dummies(', '))
   0  1  2  4  5  7  8
0  1  0  0  1  1  0  0
1  0  0  0  1  0  1  1
2  0  1  1  0  1  0  0
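An alternative that avoids the string round-trip, assuming pandas 0.25+ (for explode):
# explode yields one value per row; get_dummies encodes them, and
# max() within each original index collapses back to one row per list
print(pd.get_dummies(df['ids'].explode()).groupby(level=0).max())
#    0  1  2  4  5  7  8
# 0  1  0  0  1  1  0  0
# 1  0  0  0  1  0  1  1
# 2  0  1  1  0  1  0  0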