Dask DummyEncoder not returning all the columns - python

I tried using dask DummyEncoder for OneHotEncoding my data. But the results are not as expected.
dask's DummyEncoder Example:
from dask_ml.preprocessing import DummyEncoder
import pandas as pd
data = pd.DataFrame({
'B': ['a', 'a', 'a', 'b','c']
})
de = DummyEncoder()
de = de.fit(data)
testD = pd.DataFrame({'B': ['a','a']})
trans = de.transform(testD)
print(trans)
Ouput:
B_a
0 1
1 1
Why it doesn't show B_b, B_c? But when I change the testD as this:
testD = pd.DataFrame({'B': ['a','a', 'b', 'c']})
Result is:
B_a B_b B_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 0 1
sklearn's OneHotEncoder Example (After LabelEncoding):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
data = pd.DataFrame({
'B': [1, 1, 1, 2, 3]
})
encoder = OneHotEncoder()
encoder = encoder.fit(data)
testdf = pd.DataFrame({'B': [2, 2]})
trans = encoder.transform(testdf).toarray()
pd.DataFrame(trans, columns=encoder.active_features_)
Output:
1 2 3
0 0.0 1.0 0.0
1 0.0 1.0 0.0
How do I achieve the same results? The reason I want it this way because I will be encoding a subset of the columns and then concatenate the resultant encoded_df to the main df along with that dropping main column from the main df.
So something like below (main df):
A B C
0 M 1 10
1 F 2 20
2 T 3 30
3 M 4 40
4 F 5 50
5 F 6 60
Expected output:
A_F A_M A_T B C
0 0 1 0 1 10
1 1 0 0 2 20
2 0 0 1 3 30
3 0 1 0 4 40
4 1 0 0 5 50
5 1 0 0 6 60
EDIT:
Since dask internally uses pandas, I believe it uses get_dummies. Which is how DummyEncoder is behaving. If someone could point out a way to do the same in pandas will also be appreciated.

From dask's documentation for DummyEncoder columns parameter:
The columns to dummy encode. Must be categorical dtype.
Dummy encodes all categorical dtype columns by default.
Also, it says here that you must always use a Categorizer before using some encoders (DummyEncoder included).
A correct way to do this:
from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(
Categorizer(), DummyEncoder())
pipe.fit(data)
pipe.transform(testD)
Which will ouput:
B_a B_b B_c
0 1 0 0
1 1 0 0

Related

sklearn.preprocessing.OneHotEncoder and the way to read it

I have been using one-hot encoding for a while now in all pre-processing data pipelines that I have had.
But I have run into an issue now that I am trying to pre-process new data automatically with flask server running a model.
TLDR of what I am trying to do is to search new data for a specific Date, region and type and run a .predict on it.
The problem arises as after I search for a specific data point I have to change the columns from objects to the one-hot encoded ones.
My question is, how do I know which column is for which category inside a feature? As I have around 240 columns after one hot encoding.
IIUC, use get_feature_names_out():
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 1, 0],
'C': [0, 2, 2], 'D': [0, 1, 1]})
ohe = OneHotEncoder()
data = ohe.fit_transform(df)
df1 = pd.DataFrame(data.toarray(), columns=ohe.get_feature_names_out(), dtype=int)
Output:
>>> df
A B C D
0 0 3 0 0
1 1 1 2 1
2 2 0 2 1
>>> df1
A_0 A_1 A_2 B_0 B_1 B_3 C_0 C_2 D_0 D_1
0 1 0 0 0 0 1 1 0 1 0
1 0 1 0 0 1 0 0 1 0 1
2 0 0 1 1 0 0 0 1 0 1
>>> pd.Series(ohe.get_feature_names_out()).str.rsplit('_', 1).str[0]
0 A
1 A
2 A
3 B
4 B
5 B
6 C
7 C
8 D
9 D
dtype: object

Method chaining with pandas function

Why can't I chain the get_dummies() function?
import pandas as pd
df = (pd
.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
.drop(columns=['sepal_length'])
.get_dummies()
)
This works fine:
df = (pd
.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
.drop(columns=['sepal_length'])
)
df = pd.get_dummies(df)
DataFrame.pipe can be helpful in chaining methods or function calls which are not natively attached to the DataFrame, like pd.get_dummies:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
Or with lambda:
df = (
df.drop(columns=['sepal_length'])
.pipe(lambda current_df: pd.get_dummies(current_df))
)
Sample DataFrame:
df = pd.DataFrame({'sepal_length': 1, 'a': list('ABACC'), 'b': list('ACCAB')})
df:
sepal_length a b
0 1 A A
1 1 B C
2 1 A C
3 1 C A
4 1 C B
Sample Output:
df = df.drop(columns=['sepal_length']).pipe(pd.get_dummies)
df:
a_A a_B a_C b_A b_B b_C
0 1 0 0 1 0 0
1 0 1 0 0 0 1
2 1 0 0 0 0 1
3 0 0 1 1 0 0
4 0 0 1 0 1 0
You can't chain the pd.get_dummies() method since it is not a pd.DataFrame method. However, assuming -
You have a single column left after you drop your columns in the previous step in the chain.
Your column is a string column dtype.
... you can use pd.Series.str.get_dummies() which is a series level method.
### Dummy Dataframe
# A B
# 0 1 x
# 1 2 y
# 2 3 z
pd.read_csv(path).drop(columns=['A'])['B'].str.get_dummies()
x y z
0 1 0 0
1 0 1 0
2 0 0 1
NOTE: Make sure that before you call the get_dummies() method, the data type of the object is series. In this case, I fetch column ['B'] to do that, which kinda makes the previous pd.DataFrame.drop() method unnecessary and useless :)
But this is only for example's sake.

Copy data from a row to another row in Pandas dataframe based on condition

Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs are matching, but the data is not filled. However, I want to leave column 'A' intact. Looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN then ffill()+bfill() using groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use transform() method in place of apply() method
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does it solve your problem? I would check on other pandas functions, as well. Like join, merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0

Convert dataframe to pivot table with booleans(0, 1) with Pandas [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line :
pd.get_dummies(data=df, columns=['A', 'B'])
Columns specifies where to do the One Hot Encoding.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series, and see below for the workaround):
In [1]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
...: 'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column seperate and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
...: dummies = pd.get_dummies(df[column])
...: df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for-loop.
First separate categorical data from Data Frame by using select_dtypes(include="object"),
then by using for loop apply get_dummies to each column iteratively
as I have shown in code below:
train_cate=train_data.select_dtypes(include="object")
test_cate=test_data.select_dtypes(include="object")
# vectorize catagorical data
for col in train_cate:
cate1=pd.get_dummies(train_cate[col])
train_cate[cate1.columns]=cate1
cate2=pd.get_dummies(test_cate[col])
test_cate[cate2.columns]=cate2

How do I convert a DataFrame from long to wide format, aggregating a column's values by counts

My setup is as follows
import numpy as np
import pandas as pd
df = pd.DataFrame({'user_id':[1, 1, 1, 2, 3, 3], 'action':['b', 'b', 'c', 'a', 'c', 'd']})
df
action user_id
0 b 1
1 b 1
2 c 1
3 a 2
4 c 3
5 d 3
What is the best way to generate a dataframe from this where there is one row for each unique user_id, one column for each unique action and the column values are the count of each action per user_id?
I've tried
df.groupby(['user_id', 'action']).size().unstack('action')
action a b c d
user_id
1 NaN 2 1 NaN
2 1 NaN NaN NaN
3 NaN NaN 1 1
which comes close, but this seems to make user_id the index which is not what I want (I think). Maybe there's a better way involving pivot, pivot_table or even get_dummies?
You could use pd.crosstab:
In [37]: pd.crosstab(index=[df['user_id']], columns=[df['action']])
Out[37]:
action a b c d
user_id
1 0 2 1 0
2 1 0 0 0
3 0 0 1 1
Having user_id as the index seems appropriate to me, but if you'd like to drop the user_id, you could use reset_index:
In [39]: pd.crosstab(index=[df['user_id']], columns=[df['action']]).reset_index(drop=True)
Out[39]:
action a b c d
0 0 2 1 0
1 1 0 0 0
2 0 0 1 1

Categories