Pandas one-hot-encode columns to dummies, including an 'other' encoding [duplicate] - python

This question already has answers here:
How can I one hot encode in Python?
My ultimate goal is one-hot-encoding on a Pandas column.
In this case, I want to one-hot-encode column "b" as follows: keep apples, bananas and oranges, and encode any other fruit as "other".
Example: in the code below "grapefruit" will be re-written as "other", as would "kiwi"s and "avocado"s if they appeared in my data.
This code below works:
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})
print(df)
def analyze_fruit(s):
    if s in ("apple", "banana", "orange"):
        return s
    else:
        return "other"
df['b'] = df['b'].apply(analyze_fruit)
df2 = pd.get_dummies(df['b'], prefix='b')
print(df2)
My question: is there a shorter way to do the analyze_fruit() business? I tried DataFrame.replace() with a negative lookahead assertion without success.

You can set up a Categorical before get_dummies: anything that does not match the given categories becomes NaN, which can then be filled by fillna. Another benefit of the Categorical is that an ordering can be defined here as well by adding ordered=True:
df['b'] = pd.Categorical(
    df['b'],
    categories=['apple', 'banana', 'orange', 'other']
).fillna('other')
df2 = pd.get_dummies(df['b'], prefix='b')
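As a sketch of the ordering point (assuming the category order below is the one wanted), passing ordered=True makes the dummy columns follow the declared order:

```python
import pandas as pd

df = pd.DataFrame({"b": ["apple", "banana", "banana", "orange", "grapefruit"]})
df["b"] = pd.Categorical(
    df["b"],
    categories=["apple", "banana", "orange", "other"],
    ordered=True,  # apple < banana < orange < other
).fillna("other")
df2 = pd.get_dummies(df["b"], prefix="b")
print(df2.columns.tolist())  # columns appear in the declared category order
```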
A standard replacement with something like np.where would also work here, but dummies are typically used with Categorical data, so being able to define an ordering (and thus have the dummy columns appear in a set order) can be helpful:
import numpy as np

df['b'] = np.where(df['b'].isin(['apple', 'banana', 'orange']),
                   df['b'],
                   'other')
df2 = pd.get_dummies(df['b'], prefix='b')
Both produce df2:
   b_apple  b_banana  b_orange  b_other
0        1         0         0        0
1        0         1         0        0
2        0         1         0        0
3        0         0         1        0
4        0         0         0        1


Remove duplicates after a count but also removes count

I am trying to use 2 datasets to do a count, and afterwards I want to remove the duplicates from Col 1 while keeping the number of calls.
Basically I have a df like this:
Client Number  Call Count
Bob            3
Bob            3
John           1
Bob            3
So what happens is the duplicates get removed, but the call count also changes and turns to 1. How do I stop this from occurring? If anyone can please help.
# Count the number of times an account number comes up
CallCount['Call Count'] = CallCount.groupby('Client Number').cumcount() + 1

# Remove the duplicates
df2 = df1.drop_duplicates(subset=["Client Number"], keep=False)
I have tried these, but it's the same outcome.
import pandas as pd
df1 = pd.DataFrame({'c': ['Bob', 'Bob', 'John', 'Bob'], 'n': [3, 3, 1, 0]})
df1 = df1.groupby(['c'], as_index=False).count()
df1
      c  n
0   Bob  3
1  John  1
This is what you are trying to achieve, right?
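A hedged aside on the original attempt: drop_duplicates(..., keep=False) drops every row that has a duplicate, which removes 'Bob' entirely; if the goal is one row per client with its call count preserved, keep='first' may be what was intended. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Client Number': ['Bob', 'Bob', 'John', 'Bob'],
                   'Call Count': [3, 3, 1, 3]})
# keep='first' retains one row per client, along with its existing Call Count
df2 = df.drop_duplicates(subset=['Client Number'], keep='first')
print(df2)
```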

Adding rows based on column value

I have a data frame with only the columns ['234','apple','banana','orange']. Now I have a list like
l = ['apple', 'banana']
extracted from another data frame's column: I take the unique values of the column fruits with fruits.unique(), which results in an array, and loop over it to store the items in a list.
I then loop over the list to check whether its values are present among the columns of the data frame. If a value matches a column header, add 1 for that column, else add 0.
In the above case the data frame after matching should look like:
   234  apple  banana  orange
0    0      1       1       0
If you need a one-row DataFrame, compare the column names (converted to a DataFrame by Index.to_frame) with DataFrame.isin; then, to map True/False to 1/0, convert to integers and transpose:
df = pd.DataFrame(columns=['234','apple','banana','orange'])
l=['apple', 'banana']
df = df.columns.to_frame().isin(l).astype(int).T
print (df)
234 apple banana orange
0 0 1 1 0
If it is a nested list, use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(columns=['234', 'apple', 'banana', 'orange'])
L = [['apple', 'banana'], ['apple', 'orange', 'apple']]

mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(L), columns=mlb.classes_)
        .reindex(df.columns, fill_value=0, axis=1))
print (df)
234 apple banana orange
0 0 1 1 0
1 0 1 0 1
EDIT: If the data come from another DataFrame column, the solution is very similar to the second one:
df = pd.DataFrame(columns=['234','apple','banana','orange'])
df1 = pd.DataFrame({"col":[['apple', 'banana'],['apple', 'orange', 'apple']]})
print (df1)
col
0 [apple, banana]
1 [apple, orange, apple]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(df1['col']),columns=mlb.classes_)
.reindex(df.columns, fill_value=0, axis=1))
print (df)
234 apple banana orange
0 0 1 1 0
1 0 1 0 1
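An alternative sketch without sklearn, using Series.explode and pd.crosstab (explode requires pandas >= 0.25):

```python
import pandas as pd

df = pd.DataFrame(columns=['234', 'apple', 'banana', 'orange'])
df1 = pd.DataFrame({"col": [['apple', 'banana'], ['apple', 'orange', 'apple']]})

s = df1['col'].explode()              # one row per list element
out = (pd.crosstab(s.index, s)        # counts per (row, fruit)
         .clip(upper=1)               # presence flags, not counts
         .reindex(df.columns, fill_value=0, axis=1))
print(out)
```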

Label Encoder and Inverse_Transform on SOME Columns

Suppose I have a dataframe like the following
df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
'color': ['Black', 'Blue', 'Brown', 'Black'],
'age': [1, 10, 3, 6],
'pet': [1, 0, 1, 1],
'sex': ['m', 'm', 'f', 'f'],
'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})
I want to use label encoder to encode "animal", "color", "sex" and "name", but I don't need to encode the other two columns. I also want to be able to inverse_transform the columns afterwards.
I have tried the following, and although encoding works as I'd expect it to, reversing does not.
from sklearn.preprocessing import LabelEncoder

to_encode = ["animal", "color", "sex", "name"]
le = LabelEncoder()
for col in to_encode:
    df[col] = le.fit_transform(df[col])

## to inverse:
for col in to_encode:
    df[col] = le.inverse_transform(df[col])
The inverse_transform function results in the following dataframe:

  animal  color  age  pet    sex   name
0    Rex    Boo    1    1  Gizmo    Rex
1    Boo  Gizmo   10    0  Gizmo  Gizmo
2    Rex    Rex    3    1    Boo   Suzy
3  Gizmo    Boo    6    1    Boo    Boo
It's obviously not right, but I'm not sure how else I'd accomplish this?
Any advice would be appreciated!
As you can see in your output, when you try to inverse_transform, the code is only using the information it obtained for the last column, "name": all the rows of your columns now hold name values. You should have one LabelEncoder() for each column.
The key here is to have one LabelEncoder fitted for each different column. To do this, I recommend you save them in a dictionary:
from sklearn import preprocessing

to_encode = ["animal", "color", "sex", "name"]
d = {}
for col in to_encode:
    # For each column, create one instance in the dictionary.
    # Take care: we are only fitting for now.
    d[col] = preprocessing.LabelEncoder().fit(df[col])
If we print the dictionary now, we will obtain something like this:
{'animal': LabelEncoder(),
'color': LabelEncoder(),
'sex': LabelEncoder(),
'name': LabelEncoder()}
As we can see, for each column we want to transform, we have its LabelEncoder() information. This means, for example, that the animal LabelEncoder saves that 0 is equal to Bird, 1 to Cat, ... and the same for each column.
Once we have every column fitted, we can proceed to transform and then, if we want, to inverse_transform. The only thing to be aware of is that every transform/inverse_transform has to use the corresponding LabelEncoder of its column.
Here we transform:
for col in to_encode:
    df[col] = d[col].transform(df[col])  # be aware we are using the dictionary
df
animal color age pet sex name
0 2 0 1 1 1 2
1 0 1 10 0 1 1
2 2 2 3 1 0 3
3 1 0 6 1 0 0
And, once the df is transformed, we can inverse_transform:
for col in to_encode:
    df[col] = d[col].inverse_transform(df[col])
df
animal color age pet sex name
0 Dog Black 1 1 m Rex
1 Bird Blue 10 0 m Gizmo
2 Dog Brown 3 1 f Suzy
3 Cat Black 6 1 f Boo
One interesting idea could be using ColumnTransformer, but unfortunately it doesn't support inverse_transform().
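As an aside, the per-column dictionary idea can also be sketched without sklearn using pd.factorize, which returns both the integer codes and the uniques needed to reverse the encoding (a pandas-only sketch, not the answer's method):

```python
import pandas as pd

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'sex': ['m', 'm', 'f', 'f']})
to_encode = ['animal', 'sex']

# encode: keep each column's uniques so it can be reversed later
uniques = {}
for col in to_encode:
    df[col], uniques[col] = pd.factorize(df[col])

# decode: index the saved uniques with the integer codes
for col in to_encode:
    df[col] = uniques[col].take(df[col].to_numpy())
print(df)
```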

Delete columns but keep specific values pandas df

I'm sure this is in SO somewhere but I can't seem to find it. I'm trying to remove or select designated columns in a pandas df. But I want to keep certain values or strings from those deleted columns.
For the df below, I want to keep 'Big' and 'Cat' in columns B and C but delete everything else.
import pandas as pd
d = ({
'A' : ['A','Keep','A','Value'],
'B' : ['Big','X','Big','Y'],
'C' : ['Cat','X','Cat','Y'],
})
df = pd.DataFrame(data=d)
If I do either of the following, it only selects those rows:
Big = df[df['B'] == 'Big']
Cat = df[df['C'] == 'Cat']
My intended output is:
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
I need something like x = df[df['B','C'] != 'Big','Cat']
Seems like you want to keep only some values and have an empty string on the others. Use np.where:
import numpy as np

keeps = ['Big', 'Cat']
df['B'] = np.where(df.B.isin(keeps), df.B, '')
df['C'] = np.where(df.C.isin(keeps), df.C, '')
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
Another solution using df.where:
keeps = ['Big', 'Cat']
cols = ['B', 'C']
df[cols] = df[cols].where(df.isin(keeps)).fillna('')
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
IIUC, update:
df[['B','C']]=df[['B','C']][df[['B','C']].isin(['Big','Cat'])].fillna('')
df
Out[30]:
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
You can filter on column combinations via NumPy and np.ndarray.all:
mask = (df[['B', 'C']].values != ['Big', 'Cat']).all(1)
df.loc[mask, ['B', 'C']] = ''
print(df)
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
Or this:
df[['B','C']]=df[['B','C']].apply(lambda row: row if row.tolist()==['Big','Cat'] else ['',''],axis=1)
print(df)
Output:
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
Perhaps a concise version:
df.loc[df['B'] != 'Big', 'B'] = ''
df.loc[df['C'] != 'Cat', 'C'] = ''
print(df)
Output:
A B C
0 A Big Cat
1 Keep
2 A Big Cat
3 Value
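The per-column approach generalizes to a column-to-value mapping with Series.where (keep_map is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'Keep', 'A', 'Value'],
                   'B': ['Big', 'X', 'Big', 'Y'],
                   'C': ['Cat', 'X', 'Cat', 'Y']})

keep_map = {'B': 'Big', 'C': 'Cat'}   # column -> value to keep
for col, val in keep_map.items():
    # keep matching cells, blank out everything else
    df[col] = df[col].where(df[col] == val, '')
print(df)
```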

How to generate dummy variables for only specific values in a column?

I have a pandas dataframe column filled with country codes for 100 countries. I want to use these to do a regression, but I only want to create dummy variables for specific countries in my dataset.
I thought this would work:
dummies = pd.get_dummies(df.CountryCode, prefix='cc_')
df_and_dummies = pd.concat([df,dummies[dummies['cc_US', 'cc_GB']]], axis=1)
df_and_dummies
But it gives me the error:
KeyError: ('cc_US', 'cc_GB')
My dataframe currently looks something like:
dframe = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
'CountryCode': ['UK', 'US', 'RU']})
dframe
But I want it to look like this:
Is there a simple way to specify which values you want included in the get_dummies method, or is there another way to identify specific dummy variables?
The dummies DataFrame looks like this:
In [25]: dummies
Out[25]:
cc_RU cc_UK cc_US
0 0 1 0
1 0 0 1
2 1 0 0
To select certain columns of this, you can provide a list of column names within the [] getitem:
In [27]: dummies[['cc_US', 'cc_UK']]
Out[27]:
cc_US cc_UK
0 0 1
1 1 0
2 0 0
So you actually just missed a [ bracket.
Full code becomes then:
In [29]: pd.concat([df, dummies[['cc_US', 'cc_UK']]], axis=1)
Out[29]:
A B CountryCode cc_US cc_UK
0 a b UK 0 1
1 b a US 1 0
2 a c RU 0 0
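An alternative sketch that avoids creating the unwanted columns in the first place: mask the codes you don't care about to NaN before get_dummies (by default, get_dummies creates no column for NaN):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'CountryCode': ['UK', 'US', 'RU']})

wanted = ['US', 'UK']
masked = df['CountryCode'].where(df['CountryCode'].isin(wanted))  # RU -> NaN
dummies = pd.get_dummies(masked, prefix='cc')
print(dummies.columns.tolist())
```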
