LabelEncoder and OneHotEncoder within the same for loop - python

I am writing a for loop to encode all of the columns in a dataset. I have many categorical columns. The loop works when it only uses the label encoder, but I am trying to add a OneHotEncoder to it instead of calling get_dummies on a separate line.
Sample data:
   STYP_DESC             Gender  RACE_DESC       DEGREE  MAJR_DESC1               FTPT  Target
0  New                   Female  White           BA      Business Administration  FT    1
1  New 1st Time Freshmn  Female  White           BA      Studio Art               FT    1
2  New                   Male    White           MBAX    Business Administration  FT    1
3  New                   Female  Unknown         JD      Juris Doctor             PT    1
4  New                   Female  Asian-American  MBAX    Business Administration  PT    1
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore',drop='first')
le_count = 0
enc_count = 0
for col in X_train.columns[1:]:
    if X_train[col].dtype == 'object':
        if len(list(X_train[col].unique())) <= 2:
            le.fit(X_train[col])
            X_train[col] = le.transform(X_train[col])
            le_count += 1
        else:
            enc.fit(X_train[[col]])
            X_train[[col]] = enc.transform(X_train[[col]])
            enc_count +=1
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))
When I run it I don't get any errors, but the encoding looks wrong: a slew of tuples is inserted into my new dataset.
When I run the code without the else clause, it runs fine, and I can simply use get_dummies to encode the other variables.
The only issue is that when I use get_dummies with drop_first set to True, I lose track of what is supposed to be 0 and what is supposed to be 1 (this is a major issue for tracking Gender and FTPT).
Any suggestions on this? I would use get_dummies, but since I'm doing the preprocessing after splitting my data, I'm worried about a category possibly being dropped.

Change the transform line in the else branch as below:
X_train[col] = enc.transform(X_train[[col]]).toarray()
Here is the full code; you can try it directly.
If it still misbehaves, the error may be in some other part of your code, so please check.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

styp = ['New','New 1st Time Freshmn','New','New','New']
gend = ['Female','Female','Male','Female','Female']
race = ['White','White','Unknown','Unknown','Asian-American']
deg = ['BA','BA','MBAX','JD','MBAX']
maj = ['Business Administration','Studio Art','Business Administration','Juris Doctor','Business Administration']
ftpt = ['FT','FT','FT','PT','PT']
df = pd.DataFrame({'STYP_DESC':styp, 'Gender':gend, 'RACE_DESC':race,'DEGREE':deg,
                   'MAJR_DESC1':maj, 'FTPT':ftpt})
le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore',drop='first')
le_count = 0
enc_count = 0
for col in df.columns[1:]:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            # two categories at most: a single 0/1 label column is enough
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
        else:
            # more than two categories: one-hot encode and densify the sparse result
            enc.fit(df[[col]])
            df[col] = enc.transform(df[[col]]).toarray()
            enc_count +=1
print(df)
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))


DataFrame has two features: how to add a row to split them

I have a DataFrame with a column called feature that can contain more than one feature, as illustrated by rows 3 and 4 below. How do I add a row to the DataFrame that splits the two features?
For example, row 3 would be:
sentiment = neg
feature = screen[-1], picture quality[-1]
attribute =
category = screen
sentence = when the screen was n't contracting or glitch...
and row 4:
sentiment = neg
feature = screen[-1], picture quality[-1]
attribute =
category = picture quality
sentence = when the screen was n't contracting or glitch...
So the idea is to add a row with the same information except for the category, which now contains the second feature. There can be up to 10 features.
Thank you in advance; I would truly appreciate assistance with this.
You can split the column value on the comma and then explode on the feature column.
df['feature'] = df['feature'].str.split(', ')
# If there is not always a space after comma, use `apply`
#df['feature'] = df['feature'].apply(lambda feature: list(map(str.strip, feature.split(','))))
df = df.explode('feature')
Try using pandas.DataFrame.explode:
df.explode(column='feature')
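To get the layout asked for in the question (the full feature string kept on every row, with category holding one feature per row), the split and explode steps can be applied to a helper column instead of feature itself. This is a minimal sketch on made-up rows; the helper column name feature_item and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows shaped like the question's data.
df = pd.DataFrame({
    'sentiment': ['pos', 'neg'],
    'feature': ['inexpensive[+1][a]', 'screen[-1], picture quality[-1]'],
    'category': ['', ''],
})

# One row per feature: split on commas into a helper column, then explode it.
out = df.assign(feature_item=df['feature'].str.split(',')).explode('feature_item')

# Derive the category from each item by removing bracketed annotations
# such as [-1] and trimming the surrounding whitespace.
out['category'] = (
    out['feature_item'].str.replace(r'\[.*?\]', '', regex=True).str.strip()
)
out = out.drop(columns='feature_item').reset_index(drop=True)
print(out)
```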
Maybe a bit late but try this:
features = df['feature'].str.replace(r'\[.*?\]', '', regex=True) \
                        .str.get_dummies(', ')
out = pd.concat([df, features], axis=1)
print(out)
# Output
                           feature  inexpensive  picture quality  screen
0               inexpensive[+1][a]            1                0       0
1  screen[-1], picture quality[-1]            0                1       1
2                       screen[-1]            0                0       1

Pandas finding a text in row and assign a dummy variable value based on this

I have a data frame that contains a text column, df["input"].
I would like to create new dummy variables, one per word list, that check whether the df["input"] column contains any of the words in that list. A variable should be set to 1 only if the previous dummy variables are equal to 0. The logic is: 1) create a dummy variable that equals zero; 2) set it to one if the text contains any word in the given list and did not match any of the previous lists.
# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
which looks like:
input
amazon listing subtitle
medical
film biotechnology dentist
The final dataset should look like:
input                       listings  scripting  medical
amazon listing subtitle     1         0          0
medical                     0         0          1
film biotechnology dentist  0         1          0
One possible implementation is to use str.contains in a loop to create the 3 columns, then use idxmax to get the column name (or the list name) of the first match, then create a dummy variable from these matches:
import numpy as np
d = {'listings':listings, 'scripting':scripting, 'medical':medical}
for k,v in d.items():
    df[k] = df['input'].str.contains('|'.join(v))
arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)
But in this case, it might be more efficient to use a nested for-loop:
import re
def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out
lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
Output:
                        input  listings  medical  scripting
0     amazon listing subtitle         1        0          0
1                     medical         0        1          0
2  film biotechnology dentist         0        0          1
Here is a simpler, more pandas-style vectorized solution:
import numpy as np
import pandas as pd

patterns = {} #<-- dictionary
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle", "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
#---------------------------------------------------------------#
# step 1, for each column create a regular expression
for col, items in patterns.items():
    # create a regex pattern (word1|word2|word3)
    pattern = f"({'|'.join(items)})"
    # find the pattern in the input column
    df[col] = df['input'].str.contains(pattern, regex=True).astype(int)
# step 2, if the value to the left is 1, change its value to 0
## 2.1 create a mask
## shift the columns to the right,
## --> if the left column contains the same value as the current column: True, otherwise False
mask = (df == df.shift(axis=1)).values
# subtract the mask from the df
## and clip the result --> negative values will become 0
df.iloc[:,1:] = np.clip( df[mask].iloc[:,1:] - mask[:,1:], 0, 1 )
print(df)
Result
                        input  listings  scripting  medical
0     amazon listing subtitle         1          0        0
1                     medical         0          0        1
2  film biotechnology dentist         0          1        0
Great question and good answers (I somehow missed it yesterday)! Here's another variation with .str.extractall():
search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
    f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
    for column, strings in search.items()
)
result = (
    df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
    .idxmax(axis=1).str.get_dummies().drop(columns="dummy")
)
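The "first matching list wins" rule from the question can also be written with a running total across the dummy columns. This is a sketch of an alternative, not a restatement of any answer above: it builds one 0/1 column per list with str.contains, then keeps a 1 only where the row-wise cumulative sum is exactly 1, which is true at the first match and nowhere after it. The names groups, flags and first_only are invented for the illustration.

```python
import re
import pandas as pd

listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})

# Dictionary order defines the priority of the lists.
groups = {'listings': listings, 'scripting': scripting, 'medical': medical}

# One 0/1 column per list: does the text contain any of that list's words?
flags = pd.DataFrame({
    name: df['input'].str.contains('|'.join(map(re.escape, words)), regex=True)
    for name, words in groups.items()
}).astype(int)

# Keep only the first 1 in each row: the running total along the columns
# equals 1 at the first match and is 2 or more at any later match.
first_only = (flags.eq(1) & flags.cumsum(axis=1).eq(1)).astype(int)

out = df.join(first_only)
print(out)
```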

Why does PCA output some components more than once?

I'm working on the CTU-13 dataset; an overview of its distributions is available here. I'm using the 11th scenario of the dataset (S11.csv), which you can access here.
Given the synthetic nature of the dataset, I need to identify the most important features for the feature engineering stage.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

#dataset loading
df = pd.read_csv('/content/drive/My Drive/s11.csv')
#Keep events/rows which have 'Normal' or 'Bot'
df = df.loc[(df['Label'].str.contains('Normal') == True) | (df['Label'].str.contains('Bot') == True)]
#binary labeling
df.loc[(df['Label'].str.contains('Normal') == True),'Label'] = 0
df.loc[(df['Label'].str.contains('Bot') == True),'Label'] = 1
#data cleaning
null_columns = df.columns[df.isnull().any()]
#omit columns that have more than 70% missing values
for i in null_columns:
    B = df[i].isnull().sum()
    if B > (df.shape[0]*70)//100:
        del df[i]
#replace string columns with integer codes
name_columns = list(df.columns)
for i in name_columns:
    if df[i].dtype == object:
        df[i] = pd.factorize(df[i])[0]+1
#impute mean of each column for missing values
name_columns = list(df.columns)
for i in name_columns:
    mean1 = df[i].mean()
    df[i] = df[i].replace(np.nan, mean1)
#Apply PCA (the last column, Label, is excluded)
arr = df.to_numpy()
arr=arr[:,:-1]
pca=PCA(n_components=10)
x_pca=pca.fit_transform(arr)
explain=pca.explained_variance_ratio_
#sort and index pca top 10
n_pcs= pca.components_.shape[0]
# get the index of the most important feature on EACH component
most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = []
for col in df.columns:
    initial_feature_names.append(col)
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
print('important column by order: ')
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
top_components = pd.DataFrame(dic.items())
print(top_components)
Problem: I was wondering why the output of PCA duplicates some components.
important column by order:
     0         1
0  PC0  TotBytes
1  PC1  SrcBytes
2  PC2      Load
3  PC3       Seq
4  PC4   DstLoad
5  PC5   DstLoad
6  PC6     Sport
7  PC7      Load
8  PC8      Rate
9  PC9      Rate
Any help to debug this problem will be appreciated! Probably I'm missing something in the implementation.
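One observation that may help narrow this down: the code takes the argmax of the absolute loadings for each component independently, so nothing in that rule stops the same feature from being picked for more than one component. The sketch below uses a small hand-made loading matrix (not values from the CTU-13 data; the numbers and the three feature names are chosen purely for illustration) in which two orthonormal components both have their largest absolute loading on the first feature.

```python
import numpy as np

feature_names = ['TotBytes', 'SrcBytes', 'Load']  # illustrative names only

# Hypothetical loadings: two orthonormal components (rows), three features (columns).
components = np.array([
    [np.sqrt(0.5),  0.5,  0.5],
    [np.sqrt(0.5), -0.5, -0.5],
])

# The rows really are orthonormal, as PCA components are.
print(np.allclose(components @ components.T, np.eye(2)))

# Same selection rule as in the question: largest absolute loading per component.
most_important = [int(np.abs(row).argmax()) for row in components]
names = [feature_names[i] for i in most_important]
print(names)
```

Under this reading, the repeated names would describe which feature dominates each component rather than indicate duplicated components.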

How can I check what value is assigned to what label while using sklearn's LabelEncoder()?

I am transforming categorical data to numeric values for machine learning purposes.
To give an example, the buying price (the "buying" variable) of a car is categorized as "vhigh, high, med, low".
To transform it into numeric values, I used:
le = preprocessing.LabelEncoder()
buying = le.fit_transform(list(data["buying"]))
Is there a way to check exactly how each of those labels was transformed into a numeric value, since this seems to be done randomly (e.g. vhigh = 0, high = 2)?
You can create an extra column in your dataframe to map the values:
mapping_df = data[['buying']].copy() #Create an extra dataframe which will be used to address only the encoded values
mapping_df['buying_encoded'] = le.fit_transform(data['buying'].values) #Using values is faster than using list
Here's a full working example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data = pd.DataFrame({'index':[0,1,2,3,4,5,6],
                     'buying':['Luffy','Nami','Luffy','Franky','Sanji','Zoro','Luffy']})
data['buying_encoded'] = le.fit_transform(data['buying'].values)
data = data.drop_duplicates('buying').set_index('index')
print(data)
Output:
       buying  buying_encoded
index
0       Luffy               1
1        Nami               2
3      Franky               0
4       Sanji               3
5        Zoro               4
You can also get a dictionary of how the categories were mapped, as follows,
starting from where @celius-stingher stopped:
d1 = data.drop_duplicates('buying').drop('index', axis=1).set_index('buying')
print(d1)
Output:
        buying_encoded
buying
Luffy                1
Nami                 2
Franky               0
Sanji                3
Zoro                 4
To transform this output to a dictionary,
dict_map = d1.to_dict()
print(dict_map)
Output:
{'buying_encoded': {'Luffy': 1, 'Nami': 2, 'Franky': 0, 'Sanji': 3, 'Zoro': 4}}
So we can get the dictionary by taking the buying_encoded key from the returned dict:
print(dict_map['buying_encoded'])
Output:
{'Luffy': 1, 'Nami': 2, 'Franky': 0, 'Sanji': 3, 'Zoro': 4}
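The fitted encoder also exposes the mapping directly: LabelEncoder stores the sorted unique labels in its classes_ attribute, and the position of a label in that array is its code. A minimal sketch, using the category names from the question as sample input:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['vhigh', 'high', 'med', 'low', 'high'])

# classes_ holds the sorted unique labels; the index of each label is its code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform goes the other way, from codes back to labels.
print(le.inverse_transform([0, 3]))
```

Because the classes are sorted, the assignment is alphabetical rather than random.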

Classification of test data containing string columns

I am using machine learning to predict the class of some data; a sample is given below.
My data relates to a scheduler running on a server, and I am labeling the class by submission time and server_type.
DataFrame df:
sch_name  server_type  subit_time  submit_by  Class
RCALCAPP  X3333        165703      AAAA       1
RCALCAPP  X3333        105703      BBBB       0
PCALCAPP  X3333        165703      AAAA       1
.
.
TCALCAPP  X3344        095703      CCCC       0
To run the classifier, I am label encoding the string column values. I am not sure whether this is the correct approach to encoding, but it is working for me:
le = preprocessing.LabelEncoder()
df = df.apply(le.fit_transform)
Also, I don't need the submit_by column to train the classifier, so I am removing it:
featureNames = [col for col in df.columns if col not in ['submit_by','status']]
To prepare a model, I have split the above DataFrame into training, cv, and test sets and am using the code below:
trainFeatures = training[featureNames].values
trainClasses = training['status'].values
testFeatures= test[featureNames].values
testClasses = test['status'].values
clf = RandomForestClassifier()
clf.fit(trainFeatures, trainClasses)
score = clf.score(testFeatures, testClasses)
print(score) #.99823742
Up to here everything is okay; the classifier runs on the data. But now I want to classify a new record. I tried the following:
test_sch = ['TCALCAPP', 'X3344', '075703']
class_code = clf.predict(test_sch) # [1]
It gave this error:
ValueError: could not convert string to float: 'TCALCAPP'
I know the reason: the record has not been encoded to numbers. My problem is how exactly to do that. I need to pass the encoded values for 'TCALCAPP' and 'X3344', but how would I know the encoded values for new test data? My approach could be wrong, but the requirement is the same as above. Kindly help.
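One possible direction, sketched under assumptions: fit a separate LabelEncoder per string column, keep the fitted encoders, and reuse them to transform any new record before calling predict. In the sketch below the training rows are the four sample rows from the question, the dictionary name encoders and the helper encode are made up for illustration, and the time column is treated as a plain number (an assumption, chosen because a time such as '075703' may never have appeared in training and a label encoder cannot transform an unseen value).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data shaped like the sample in the question.
train = pd.DataFrame({
    'sch_name':    ['RCALCAPP', 'RCALCAPP', 'PCALCAPP', 'TCALCAPP'],
    'server_type': ['X3333', 'X3333', 'X3333', 'X3344'],
    'subit_time':  ['165703', '105703', '165703', '095703'],
    'Class':       [1, 0, 1, 0],
})

text_cols = ['sch_name', 'server_type']
feature_names = ['sch_name', 'server_type', 'subit_time']

# Fit one encoder per text column and keep it, so the same mapping is reused later.
encoders = {col: LabelEncoder().fit(train[col]) for col in text_cols}

def encode(frame):
    """Return a numeric copy of frame using the stored encoders."""
    out = frame.copy()
    for col in text_cols:
        out[col] = encoders[col].transform(out[col])
    # Assumption: the time is used as a number rather than as a category.
    out['subit_time'] = out['subit_time'].astype(int)
    return out

clf = RandomForestClassifier(random_state=0)
clf.fit(encode(train)[feature_names].values, train['Class'].values)

# A new record goes through the same encoders before prediction.
new_record = pd.DataFrame([['TCALCAPP', 'X3344', '075703']], columns=feature_names)
new_encoded = encode(new_record)
class_code = clf.predict(new_encoded[feature_names].values)
print(new_encoded)
print(class_code)
```

A scheduler name or server type that was not present when the encoders were fitted would still raise an error in transform, so such values need a separate policy (for example, a reserved "unknown" category).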
