How to classify String Data into Integers? - python

I need to convert the string values of a feature of my dataset into integer classes, so that I can use them further for other things, say predicting or plotting.
How do I convert them?
I found this solution, but here I have to manually write code for every unique value of the feature. For 2-3 unique values that's alright, but I've got a feature with more than 50 unique country values; I can't write code for every country.
def sex_class(x):
    if x == 'male':
        return 1
    else:
        return 0
This changes the 'male' values to 1 and the 'female' values to 0 in the sex feature.

You can make use of scikit-learn's LabelEncoder:
from sklearn import preprocessing

# given a list containing all possible labels
sex_classes = ['male', 'female']
le = preprocessing.LabelEncoder()
le.fit(sex_classes)
This will assign labels to all the unique values in the given list. You can save this label encoder object as a pickle file for later use as well.
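For instance, a minimal sketch of transforming values and saving the fitted encoder (the pickle file name here is just an example):
import pickle

le.transform(['male', 'female'])  # array([1, 0]); classes_ are sorted, so 'female' -> 0, 'male' -> 1

# persist the fitted encoder for later use (file name is arbitrary)
with open('sex_label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

# ...and load it back when needed
with open('sex_label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)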

rank or pd.factorize
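Both work directly on a column; the snippets below assume an input frame like this (reconstructed from the output shown underneath):
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'a', 'b', 'c', 'A', 'b']})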
df['ID_int'] = df['id'].rank(method='dense').astype(int)
df['ID_int2'] = pd.factorize(df['id'])[0]
Output:
  id  ID_int  ID_int2
0  a       2        0
1  b       3        1
2  c       4        2
3  a       2        0
4  b       3        1
5  c       4        2
6  A       1        3
7  b       3        1
The labels are different, but consistent: rank(method='dense') numbers the values by sorted order ('A' < 'a' < 'b' < 'c', hence the codes start at 1 with 'A'), while pd.factorize numbers them by order of first appearance, starting at 0.

You can use a dictionary instead.
sex_class = {'male': 1, 'female': 0}
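You can then apply it to the column with map (assuming the column is named sex); values missing from the dict become NaN:
df['sex'] = df['sex'].map(sex_class)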

Related

Filter dataframe based on matching values from two columns

I have a dataframe like as shown below
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Label': [1, 2, 3, 0, 0]})
I would like to filter the dataframe based on the below criteria
cdf['Id']==cdf['Label'] # first 3 rows are matching for both columns in cdf
I tried the below
flag = np.where[cdf['Id'].eq(cdf['Label'])==True,1,0]
final_df = cdf[cdf['flag']==1]
but I got the below error
TypeError: 'function' object is not subscriptable
I expect my output to be as shown below
   Id  Label
0   1      1
1   2      2
2   3      3
I think you're overthinking this. Just compare the columns:
>>> cdf[cdf['Id'] == cdf['Label']]
   Id  Label
0   1      1
1   2      2
2   3      3
Your particular error, though, comes from the fact that you're using square brackets to call np.where, i.e. np.where[...], which is wrong. You should be using np.where(...) instead, but the above solution is bound to be as fast as it gets ;)
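For completeness, a sketch of the corrected version of the original attempt; note that flag also has to be assigned as a column before it can be used to filter:
import numpy as np

cdf['flag'] = np.where(cdf['Id'].eq(cdf['Label']), 1, 0)
final_df = cdf[cdf['flag'] == 1]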
You can also check query:
cdf.query('Id == Label')
Out[248]:
   Id  Label
0   1      1
1   2      2
2   3      3

How to change the value of a column items using pandas?

This is my first question on Stack Overflow.
I'm implementing a machine learning classification algorithm, and I want to generalize it to any input dataset that has its target class in the last column. For that, I want to modify all values of this column, without needing to know the names of the columns or rows, using pandas in Python.
For example, let's suppose I load a dataset:
dataset = pd.read_csv('random_dataset.csv')
Let's say the last column has the following data:
0    dog
1    dog
2    cat
3    dog
4    cat
I want to change each "dog" appearance to 1 and each "cat" appearance to 0, so that the column would look like:
0    1
1    1
2    0
3    1
4    0
I have found some ways of changing the values of specific cells using pandas, but for this case, what would be the best way to do that?
I appreciate each answer.
You can use pandas.Categorical:
df['column'] = pd.Categorical(df['column']).codes
You can also use the built in functionality for this too:
df['column'] = df['column'].astype('category').cat.codes
Use map to map the values as per your requirement:
df['col_name'] = df['col_name'].map({'dog' : 1 , 'cat': 0})
Or use factorize (encode the object as an enumerated type) if you just want distinct numeric codes, assigned in order of first appearance:
df['col_name'] = df['col_name'].factorize()[0]
OUTPUT:
0    1
1    1
2    0
3    1
4    0
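Since the question asks about modifying the last column without knowing its name, here is a minimal sketch that combines either approach with positional column selection (note that 'cat' happens to sort before 'dog', so cat.codes lines up with the desired 0/1 output):
# select the last column by position rather than by name
last_col = dataset.columns[-1]
dataset[last_col] = dataset[last_col].astype('category').cat.codes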

Re-categorize a column in a pandas dataframe

I am trying to build a simple classification model for my data stored in the pandas dataframe train. To make this model more efficient, I created a list of column names of columns I know to store categorical data, called category_cols. I categorize these columns as follows:
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')
# Convert train[category_cols] to a categorical type
train[category_cols] = train[category_cols].apply(categorize_label, axis=0)
My target variable, material, is categorical and has 64 unique labels it can be assigned to. However, some of these labels appear only once in train, which is too few to train the model well. So I'd like to filter out of train any observations that have these rare material labels. This answer provided a useful groupby + filter combination:
print('Num rows: {}'.format(train.shape[0]))
print('Material labels: {}'.format(len(train['material'].unique())))
min_count = 5
filtered = train.groupby('material').filter(lambda x: len(x) > min_count)
print('Num rows: {}'.format(filtered.shape[0]))
print('Material labels: {}'.format(len(filtered['material'].unique())))
----------------------
Num rows: 19999
Material labels: 64
Num rows: 19963
Material labels: 45
This works great in that it does filter the observations with rare material labels. However, something under the hood in the category type seems to maintain all the previous values for material even after they've been filtered. This becomes a problem when trying to create dummy variables, and happens even if I try to rerun my same categorize method:
filtered[category_cols] = filtered[category_cols].apply(categorize_label, axis=0)
print(pd.get_dummies(train['material']).shape)
print(pd.get_dummies(filtered['material']).shape)
----------------------
(19999, 64)
(19963, 64)
I would have expected the shape of the filtered dummies to be (19963, 45). However, pd.get_dummies includes columns for labels that have no appearances in filtered. I assume this has something to do with how the category type works. If so, could someone please explain how to re-categorize a column? Or if that is not possible, how to get rid of the unnecessary columns in the filtered dummies?
Thank you!
You can use Series.cat.remove_unused_categories. Note that its inplace argument is deprecated (and removed in recent pandas), so assign the result back:
Usage
df['category'] = df['category'].cat.remove_unused_categories()
Example
df = pd.DataFrame({'label': list('aabbccd'),
                   'value': [1] * 7})
print(df)
  label  value
0     a      1
1     a      1
2     b      1
3     b      1
4     c      1
5     c      1
6     d      1
Let's set label as type category:
df['label'] = df.label.astype('category')
print(df.label)
0    a
1    a
2    b
3    b
4    c
5    c
6    d
Name: label, dtype: category
Categories (4, object): [a, b, c, d]
Filter DataFrame to remove label d
df = df[df.label.ne('d')]
print(df)
  label  value
0     a      1
1     a      1
2     b      1
3     b      1
4     c      1
5     c      1
Remove unused categories (again assigning back rather than using inplace):
df['label'] = df.label.cat.remove_unused_categories()
print(df.label)
0    a
1    a
2    b
3    b
4    c
5    c
Name: label, dtype: category
Categories (3, object): [a, b, c]
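Applied to the question's dataframe, that would be:
filtered['material'] = filtered['material'].cat.remove_unused_categories()
print(pd.get_dummies(filtered['material']).shape)  # now (19963, 45)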
Per this answer, this can be solved via reindexing and transposing the dummy dataframe:
labels = filtered['material'].unique()
dummies = pd.get_dummies(filtered['material'])
dummies = dummies.T.reindex(labels).T
print(dummies.shape)
----------------------
(19963, 45)

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id columns. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
In R this is easily done with data.table:
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
Thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
implies that there is a name nunique in the namespace that performs some function. transform will take a function, or a string that it knows a function for, and 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform a transformation indicated by these strings are often optimized and should generally be preferred to passing your own functions. This is true even for numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
  id track type
0  X     1    A
1  X     1    A
2  Y     1    A
3  Z     1    A
4  W     2    B
5  W     2    B
6  W     2    B
7  W     2    B
df.groupby(['track', 'type'])['id'].transform('nunique')
0    3
1    3
2    3
3    3
4    1
5    1
6    1
7    1
Name: id, dtype: int64
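To get the new column the question asks for, assign the result back:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')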

Inconsistent labeling in sklearn LabelEncoder?

I have applied a LabelEncoder() to a dataframe. In the result (screenshot omitted), the same values (order/new_carts) get different label-encoded numbers in different columns, like 70, 64, 71, etc.
Is this inconsistent labeling, or did I do something wrong somewhere?
LabelEncoder works on one-dimensional arrays. If you apply it to multiple columns, it will be consistent within columns but not across columns.
As a workaround, you can flatten the dataframe to a one-dimensional array and call LabelEncoder on that array.
Assume this is the dataframe:
df
Out[372]:
   0  1  2
0  d  d  a
1  c  a  c
2  c  c  b
3  e  e  d
4  d  d  e
5  d  b  e
6  e  e  b
7  a  e  b
8  b  c  c
9  e  a  b
With ravel and then reshaping:
pd.DataFrame(LabelEncoder().fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Out[373]:
   0  1  2
0  3  3  0
1  2  0  2
2  2  2  1
3  4  4  3
4  3  3  4
5  3  1  4
6  4  4  1
7  0  4  1
8  1  2  2
9  4  0  1
Edit:
If you want to store the labels, you need to save the LabelEncoder object.
le = LabelEncoder()
df2 = pd.DataFrame(le.fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Now, le.classes_ gives you the classes (starting from 0).
le.classes_
Out[390]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)
If you want to access the integer by label, you can construct a dict:
dict(zip(le.classes_, np.arange(len(le.classes_))))
Out[388]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
You can do the same with transform method, without building a dict:
le.transform(['c'])
Out[395]: array([2])
Your LabelEncoder object is being re-fit to each column of your DataFrame.
Because of the way the apply and fit_transform functions work, you are accidentally calling the fit function on each column of your frame. Let's walk through what's happening in the following line:
labeled_df = String_df.apply(LabelEncoder().fit_transform)
1. Create a new LabelEncoder object.
2. Call apply, passing in the fit_transform method. For each column in your DataFrame it will call fit_transform on your encoder, passing in the column as an argument. This does two things:
   A. refit your encoder (modifying its state)
   B. return the codes for the elements of your column based on your encoder's new fitting
The codes will not be consistent across columns because each time you call fit_transform the LabelEncoder object can choose new transformation codes.
If you want your codes to be consistent across columns, you should fit your LabelEncoder to your whole dataset.
Then pass the transform function to your apply function, instead of the fit_transform function. You can try the following:
encoder = LabelEncoder()
all_values = String_df.values.ravel() #convert the dataframe to one long array
encoder.fit(all_values)
labeled_df = String_df.apply(encoder.transform)
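If you later need to map the codes back to the original labels, a sketch using the same fitted encoder's inverse_transform, applied per column:
decoded_df = labeled_df.apply(encoder.inverse_transform)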
