Suppose I have a dataframe like the following
df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color': ['Black', 'Blue', 'Brown', 'Black'],
                   'age': [1, 10, 3, 6],
                   'pet': [1, 0, 1, 1],
                   'sex': ['m', 'm', 'f', 'f'],
                   'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})
I want to use label encoder to encode "animal", "color", "sex" and "name", but I don't need to encode the other two columns. I also want to be able to inverse_transform the columns afterwards.
I have tried the following, and although encoding works as I'd expect it to, reversing does not.
to_encode = ["animal", "color", "sex", "name"]
le = LabelEncoder()
for col in to_encode:
    df[col] = le.fit_transform(df[col])

## to inverse:
for col in to_encode:
    df[col] = le.inverse_transform(df[col])
The inverse_transform function results in the following dataframe:
animal  color  age  pet  sex    name
Rex     Boo    1    1    Gizmo  Rex
Boo     Gizmo  10   0    Gizmo  Gizmo
Rex     Rex    3    1    Boo    Suzy
Gizmo   Boo    6    1    Boo    Boo
It's obviously not right, but I'm not sure how else I'd accomplish this?
Any advice would be appreciated!
As you can see in your output, when you try to inverse_transform, the code only uses the information it obtained for the last column, "name": every encoded column now holds name values. That is because a single LabelEncoder is refitted on each column in turn, so only the last fit survives. You should have one LabelEncoder() for each column.
The key here is to have one LabelEncoder fitted for each different column. To do this, I recommend you save them in a dictionary:
to_encode = ["animal", "color", "sex", "name"]
d = {}
for col in to_encode:
    d[col] = preprocessing.LabelEncoder().fit(df[col])  # one instance per column; note we are only fitting here
If we print the dictionary now, we will obtain something like this:
{'animal': LabelEncoder(),
'color': LabelEncoder(),
'sex': LabelEncoder(),
'name': LabelEncoder()}
As we can see, for each column we want to transform we keep its own LabelEncoder() information. This means, for example, that the animal LabelEncoder remembers that 0 corresponds to Bird, 1 to Cat, and so on; the same holds for each column.
Once every column is fitted, we can proceed to transform, and later to inverse_transform if we want. The only thing to be aware of is that every transform/inverse_transform call has to use the LabelEncoder that corresponds to that column.
Here we transform:
for col in to_encode:
    df[col] = d[col].transform(df[col])  # note we look the encoder up in the dictionary
df
animal color age pet sex name
0 2 0 1 1 1 2
1 0 1 10 0 1 1
2 2 2 3 1 0 3
3 1 0 6 1 0 0
And, once the df is transformed, we can inverse_transform:
for col in to_encode:
    df[col] = d[col].inverse_transform(df[col])
df
animal color age pet sex name
0 Dog Black 1 1 m Rex
1 Bird Blue 10 0 m Gizmo
2 Dog Brown 3 1 f Suzy
3 Cat Black 6 1 f Boo
One interesting idea could be using ColumnTransformer, but unfortunately it doesn't support inverse_transform().
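For reference, the whole round trip can be condensed into a self-contained sketch (assuming scikit-learn's LabelEncoder, as in the question, with one fitted encoder per column held in a dict):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color': ['Black', 'Blue', 'Brown', 'Black'],
                   'age': [1, 10, 3, 6],
                   'pet': [1, 0, 1, 1],
                   'sex': ['m', 'm', 'f', 'f'],
                   'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})

to_encode = ["animal", "color", "sex", "name"]
original = df[to_encode].copy()

# One fitted LabelEncoder per column, stored in a dictionary
encoders = {col: LabelEncoder().fit(df[col]) for col in to_encode}

for col in to_encode:
    df[col] = encoders[col].transform(df[col])       # encode each column with its own encoder

for col in to_encode:
    df[col] = encoders[col].inverse_transform(df[col])  # and reverse with the same encoder
```

After the second loop the encoded columns are back to their original string values.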
Related
I am looking for a more elegant approach to replace the values of a categorical column based on its category codes. I am not able to use the map method, as the original values are not known in advance.
I am currently using the following approach:
df['Gender'] = pd.Categorical.from_codes(df['Gender'].cat.codes.fillna(-1), categories=['Female', 'Male'])
This approach feels inelegant because I convert categorical column to integer, and then convert it back to categorical. Full code is below.
import pandas as pd
df = pd.DataFrame({
    'Name': ['Jack', 'John', 'Jil', 'Jax'],
    'Gender': ['M', 'M', 'F', pd.NA],
})
df['Gender'] = df['Gender'].astype('category')
# don't want to do this as original values may not be known to establish the dict
# df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})
# offline, we know 0 = Female, 1 = Male
# what is more elegant way to do below?
df['Gender'] = pd.Categorical.from_codes(df['Gender'].cat.codes.fillna(-1), categories=['Female', 'Male'])
Here is one way to do that: create a dictionary of unique items and, using enumerate, assign an index to each.
d = {item: i for i, item in enumerate(df['Gender'].unique())}
Then use map to map the values:
df['cat'] = df['Gender'].map(d)
df
Name Gender cat
0 Jack M 0
1 John M 0
2 Jil F 1
3 Jax <NA> 2
What about using cat.rename_categories?
df['Gender'] = (df['Gender'].astype('category')
.cat.rename_categories(['Female', 'Male'])
)
output:
Name Gender
0 Jack Male
1 John Male
2 Jil Female
3 Jax NaN
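A small variation (my own sketch, not part of the answer above): `rename_categories` also accepts a mapping, which avoids relying on the alphabetical order of the categories:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jack', 'John', 'Jil', 'Jax'],
    'Gender': ['M', 'M', 'F', pd.NA],
})

# Rename by explicit mapping instead of a positional list
df['Gender'] = (df['Gender'].astype('category')
                            .cat.rename_categories({'M': 'Male', 'F': 'Female'}))
```

Missing values stay NaN, and the mapping makes the intent explicit even if new categories appear later.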
I want to convert a string column with multiple labels into separate columns for each label, and rearrange the dataframe so that identical labels end up in the same column. For example:
ID  Label
0   apple, tom, car
1   apple, car
2   tom, apple
to
ID  Label            0      1     2
0   apple, tom, car  apple  car   tom
1   apple, car       apple  car   None
2   tom, apple       apple  None  tom
df["Label"].str.split(',', n=3, expand=True)
       0      1     2
0  apple    tom    car
1  apple    car   None
2    tom  apple   None
I know how to split the string column, but I can't really figure out how to sort the label columns, especially since the number of labels per sample differs.
Here's a way to do this.
First call df['Label'].apply() to replace the csv strings with lists and also to populate a Python dict mapping labels to new column index values.
Then create a second data frame df2 that fills new label columns as specified in the question.
Finally, concatenate the two DataFrames horizontally and drop the 'Label' column.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [0, 1, 2],
    'Label': ['apple, tom, car', 'apple, car', 'tom, apple']
})

labelInfo = [labels := {}, curLabelIdx := 0]

def foo(x, labelInfo):
    theseLabels = [s.strip() for s in x.split(',')]
    labels, curLabelIdx = labelInfo
    for label in theseLabels:
        if label not in labels:
            labels[label] = curLabelIdx
            curLabelIdx += 1
    labelInfo[1] = curLabelIdx
    return theseLabels

df['Label'] = df['Label'].apply(foo, labelInfo=labelInfo)

df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()),
                   columns=list(labels.values()))

df = pd.concat([df, df2], axis=1).drop(columns=['Label'])
print(df)
Output:
ID 0 1 2
0 0 apple tom car
1 1 apple None car
2 2 apple tom None
If you'd prefer to have the new columns named using the labels they contain, you can replace the df2 assignment line with this:
df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()),
                   columns=list(labels))
Now the output is:
ID apple tom car
0 0 apple tom car
1 1 apple None car
2 2 apple tom None
Try:
df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")
df["Col"] = df.groupby("xxx").ngroup()
df = (
    df.set_index(["ID", "Label", "Col"])
      .unstack(2)
      .droplevel(0, axis=1)
      .reset_index()
)
df.columns.name = None
print(df)
Prints:
ID Label 0 1 2
0 0 apple, tom, car apple car tom
1 1 apple, car apple car NaN
2 2 tom, apple apple NaN tom
I believe what you want is something like this:
import pandas as pd

data = {'Label': ['apple, tom, car', 'apple, car', 'tom, apple']}
df = pd.DataFrame(data)
print(f"df: \n{df}")

def norm_sort(series):
    mask = []
    for line in series:
        mask.extend([l.strip() for l in line.split(',')])
    mask = sorted(list(set(mask)))
    labels = []
    for line in series:
        labels.append(', '.join([m if m in line else 'None' for m in mask]))
    return labels

df.Label = norm_sort(df.loc[:, 'Label'])
df = df.Label.str.split(', ', expand=True)
print(f"df: \n{df}")
The goal of your program is not clear. If you are curious which elements are present in the different rows, then we can just get them all and stack the dataframe like such:
df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})
final_df = df['label'].str.split(', ', expand=True).stack()
final_df.reset_index(drop=True, inplace=True)
>>> final_df
0 apple
1 banana
2 grape
3 apple
4 banana
5 banana
6 grape
At this point we can drop the duplicates or count the occurrence of each, depending on your use case.
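The counting variant mentioned above can be done directly with value_counts on the stacked result (a sketch using the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})

# Split each row into labels, stack into one long Series, then count occurrences
counts = df['label'].str.split(', ', expand=True).stack().value_counts()
print(counts)
# banana appears in all three rows, apple and grape in two each
```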
I have a data frame with only the columns ['234', 'apple', 'banana', 'orange']. Now I have a list like
l = ['apple', 'banana']
extracted from another data frame's column: I take the unique values of the column fruits with fruits.unique(), which returns an array, and loop over its index values to store the items in a list.
I then want to loop over that list and check whether each value is present among the columns of the data frame: if a value matches a column header, put 1 under that column, else 0. In the above case the data frame after matching should look like:
234  apple  banana  orange
0    1      1       0
If you need a one-row DataFrame, compare the column names (converted to a DataFrame by Index.to_frame) with DataFrame.isin; then, to map True/False to 1/0, convert to integers and transpose:
df = pd.DataFrame(columns=['234','apple','banana','orange'])
l=['apple', 'banana']
df = df.columns.to_frame().isin(l).astype(int).T
print (df)
234 apple banana orange
0 0 1 1 0
If it is nested list use MultiLabelBinarizer:
df = pd.DataFrame(columns=['234','apple','banana','orange'])
L= [['apple', 'banana'], ['apple', 'orange', 'apple']]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(L), columns=mlb.classes_)
        .reindex(df.columns, fill_value=0, axis=1))
print (df)
234 apple banana orange
0 0 1 1 0
1 0 1 0 1
EDIT: If the data come from another DataFrame column, the solution is very similar to the second one:
df = pd.DataFrame(columns=['234','apple','banana','orange'])
df1 = pd.DataFrame({"col":[['apple', 'banana'],['apple', 'orange', 'apple']]})
print (df1)
col
0 [apple, banana]
1 [apple, orange, apple]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(df1['col']), columns=mlb.classes_)
        .reindex(df.columns, fill_value=0, axis=1))
print (df)
234 apple banana orange
0 0 1 1 0
1 0 1 0 1
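If you'd rather not pull in scikit-learn, a plain membership check over the column names gives the same indicator frame; this is a sketch of an alternative, not part of the original answer:

```python
import pandas as pd

cols = ['234', 'apple', 'banana', 'orange']
L = [['apple', 'banana'], ['apple', 'orange', 'apple']]

# 1 if the column name appears in the row's list, else 0
df = pd.DataFrame([[int(c in row) for c in cols] for row in L], columns=cols)
print(df)
```

This is O(len(cols)) per row, which is fine for small column sets; MultiLabelBinarizer scales better when there are many distinct labels.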
My ultimate goal is one-hot-encoding on a Pandas column.
In this case, I want to one-hot-encode column "b" as follows: keep apples, bananas and oranges, and encode any other fruit as "other".
Example: in the code below "grapefruit" will be re-written as "other", as would "kiwi"s and "avocado"s if they appeared in my data.
This code below works:
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})
print(df)
def analyze_fruit(s):
    if s in ("apple", "banana", "orange"):
        return s
    else:
        return "other"
df['b'] = df['b'].apply(analyze_fruit)
df2 = pd.get_dummies(df['b'], prefix='b')
print(df2)
My question: is there a shorter way to do the analyze_fruit() business? I tried DataFrame.replace() with a negative lookahead assertion without success.
You can set up the Categorical before get_dummies: anything that does not match the set categories becomes NaN, which can then be easily filled by fillna. Another benefit of the Categorical is that ordering can be defined here as well, by adding ordered=True:
df['b'] = pd.Categorical(
    df['b'],
    categories=['apple', 'banana', 'orange', 'other']
).fillna('other')
df2 = pd.get_dummies(df['b'], prefix='b')
Standard replacement with something like np.where would also work here. But dummies are typically used with Categorical data, and being able to add an ordering, so the dummy columns appear in a set order, can be helpful:
# import numpy as np
df['b'] = np.where(df['b'].isin(['apple', 'banana', 'orange']),
                   df['b'],
                   'other')
df2 = pd.get_dummies(df['b'], prefix='b')
Both produce df2:
b_apple b_banana b_orange b_other
0 1 0 0 0
1 0 1 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
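As a further variant (my own sketch, not from the answers above): `Series.where` keeps the values that satisfy the condition and substitutes 'other' elsewhere, which reads close to the np.where version but stays within pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})

# Keep the known fruit; everything else becomes "other"
df['b'] = df['b'].where(df['b'].isin(['apple', 'banana', 'orange']), 'other')
df2 = pd.get_dummies(df['b'], prefix='b')
```

Note that without the Categorical, a column such as b_other only appears when at least one "other" value is actually present in the data.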
I have two dataframes:
Row No. Subject
1 Apple
2 Banana
3 Orange
4 Lemon
5 Strawberry
row_number Subjects Special?
1 Banana Yes
2 Lemon No
3 Apple No
4 Orange No
5 Strawberry Yes
6 Cranberry Yes
7 Watermelon No
I want to change the Row No. of the first dataframe to match the second. It should be like this:
Row No. Subject
3 Apple
1 Banana
4 Orange
2 Lemon
5 Strawberry
I have tried this code:
for index, row in df1.iterrows():
    if df1['Subject'] == df2['Subjects']:
        df1['Row No.'] = df2['row_number']
But I get the error:
ValueError: Can only compare identically-labeled Series objects
Does that mean the dataframes have to have the same number of rows and columns? Do they have to be labelled the same too? Is there a way to bypass this limitation?
Edit: I have found a promising alternative formula:
for x in df1['Subject']:
    if x in df2['Subjects'].values:
        df2.loc[df2['Subjects'] == x]['row_number'] = df1.loc[df1['Subject'] == x]['Row No.']
But it appears it doesn't modify the first dataframe like I want it to. Any tips why?
Furthermore, I get this warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would avoid using for loops especially when pandas has such great methods to handle these types of problems already.
Using pd.Series.replace
Here is a vectorized way of doing this -
d is the dictionary that maps the fruit to the number in second dataframe
You can use df.Subject.replace(d) to simply replace the keys of the dict d with their values.
Overwrite the Row No. column with this now.
d = dict(zip(df2['Subjects'], df2['row_number']))
df1['Row No.'] = df1.Subject.replace(d)
print(df1)
Subject Row No.
0 Apple 3
1 Banana 1
2 Orange 4
3 Lemon 2
4 Strawberry 5
Using pd.merge
Let's try simply merging the 2 dataframe and replace the column completely.
ddf = pd.merge(df1['Subject'],
               df2[['row_number', 'Subjects']],
               left_on='Subject',
               right_on='Subjects',
               how='left').drop('Subjects', axis=1)
ddf.columns = df1.columns[::-1]
print(ddf)
Subject Row No.
0 Apple 3
1 Banana 1
2 Orange 4
3 Lemon 2
4 Strawberry 5
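For what it's worth, `Series.map` with the same dictionary behaves almost like `replace`; the difference is that subjects missing from df2 would become NaN instead of passing through unchanged, which can be a useful sanity check. A minimal sketch of that variant:

```python
import pandas as pd

df1 = pd.DataFrame({'Row No.': [1, 2, 3, 4, 5],
                    'Subject': ['Apple', 'Banana', 'Orange', 'Lemon', 'Strawberry']})
df2 = pd.DataFrame({'row_number': [1, 2, 3, 4, 5, 6, 7],
                    'Subjects': ['Banana', 'Lemon', 'Apple', 'Orange',
                                 'Strawberry', 'Cranberry', 'Watermelon']})

# Build the lookup once, then map every subject to its row number in df2
d = dict(zip(df2['Subjects'], df2['row_number']))
df1['Row No.'] = df1['Subject'].map(d)
print(df1)
```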
Assuming the first is df1 and the second is df2, this should do what you want it to:
import pandas as pd

d1 = {'Row No.': [1, 2, 3, 4, 5],
      'Subject': ['Apple', 'Banana', 'Orange', 'Lemon', 'Strawberry']}
df1 = pd.DataFrame(data=d1)

d2 = {'row_number': [1, 2, 3, 4, 5, 6, 7],
      'Subjects': ['Banana', 'Lemon', 'Apple', 'Orange', 'Strawberry', 'Cranberry', 'Watermelon'],
      'Special?': ['Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No']}
df2 = pd.DataFrame(data=d2)

for x in df1['Subject']:
    if x in df2['Subjects'].values:
        df1.loc[df1['Subject'] == x, 'Row No.'] = (df2.loc[df2['Subjects'] == x]['row_number']).item()

#print(df1)
#print(df2)
In your edited attempt it looks like you had the dataframes swapped, and you were missing .item() to get the actual row_number value rather than a Series object.