Inconsistent labeling in sklearn LabelEncoder? - python

I have applied a LabelEncoder() on a dataframe, which returns the following:
The same values in the order/new_carts columns get different label-encoded numbers, like 70, 64, 71, etc.
Is this inconsistent labeling, or did I do something wrong somewhere?

LabelEncoder works on one-dimensional arrays. If you apply it to multiple columns, it will be consistent within columns but not across columns.
As a workaround, you can convert the dataframe to a one-dimensional array and call LabelEncoder on that array.
Assume this is the dataframe:
df
Out[372]:
0 1 2
0 d d a
1 c a c
2 c c b
3 e e d
4 d d e
5 d b e
6 e e b
7 a e b
8 b c c
9 e a b
With ravel and then reshaping:
pd.DataFrame(LabelEncoder().fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Out[373]:
0 1 2
0 3 3 0
1 2 0 2
2 2 2 1
3 4 4 3
4 3 3 4
5 3 1 4
6 4 4 1
7 0 4 1
8 1 2 2
9 4 0 1
Edit:
If you want to store the labels, you need to save the LabelEncoder object.
le = LabelEncoder()
df2 = pd.DataFrame(le.fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Now, le.classes_ gives you the classes; the integer codes start from 0 and follow this order.
le.classes_
Out[390]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)
If you want to access the integer by label, you can construct a dict:
dict(zip(le.classes_, np.arange(len(le.classes_))))
Out[388]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
You can do the same with the transform method, without building a dict (note that transform expects an array-like, not a bare scalar):
le.transform(['c'])
Out[395]: array([2])
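Conversely, inverse_transform maps the integer codes back to the original labels. A minimal self-contained sketch (re-fitting a fresh encoder on the same five labels as above):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['a', 'b', 'c', 'd', 'e'])

# transform expects an array-like, even for a single label
codes = le.transform(['c', 'a', 'e'])
print(codes)                         # [2 0 4]

# inverse_transform maps integer codes back to the original labels
print(le.inverse_transform(codes))   # ['c' 'a' 'e']
```

This round trip is why saving the fitted LabelEncoder object is enough; you don't need to keep a separate dict.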

Your LabelEncoder object is being re-fit to each column of your DataFrame.
Because of the way the apply and fit_transform functions work, you are accidentally calling the fit function on each column of your frame. Let's walk through what's happening in the following line:
labeled_df = String_df.apply(LabelEncoder().fit_transform)
1. Create a new LabelEncoder object.
2. Call apply, passing in the fit_transform method. For each column in your DataFrame, it will call fit_transform on your encoder, passing in the column as an argument. This does two things:
A. refit your encoder (modifying its state)
B. return the codes for the elements of your column, based on your encoder's new fitting.
The codes will not be consistent across columns because each time you call fit_transform the LabelEncoder object can choose new transformation codes.
If you want your codes to be consistent across columns, you should fit your LabelEncoder to your whole dataset.
Then pass the transform function to your apply function, instead of the fit_transform function. You can try the following:
encoder = LabelEncoder()
all_values = String_df.values.ravel() #convert the dataframe to one long array
encoder.fit(all_values)
labeled_df = String_df.apply(encoder.transform)
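A quick self-contained check of the approach above, using a small hypothetical frame, showing that fitting once and then applying transform yields the same code for the same label in every column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

String_df = pd.DataFrame({'col1': ['d', 'c', 'a'],
                          'col2': ['a', 'd', 'b']})

encoder = LabelEncoder()
encoder.fit(String_df.values.ravel())   # fit once, on all values at once

labeled_df = String_df.apply(encoder.transform)
# 'a' -> 0, 'b' -> 1, 'c' -> 2, 'd' -> 3 in every column
print(labeled_df)
```

Here 'd' encodes to 3 and 'a' to 0 in both columns, which is exactly the consistency the repeated fit_transform calls destroy.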

Related

Python Pandas convert multiple string columns to specified integer values

I have a dataframe with thousands of rows; some columns all have ratings like A, B, C, D. I am trying to do some machine learning and would like to give the ratings certain values,
like A=32, B=16, C=4, D=2. I have read some posts on using factorize and LabelEncoder.
I got a simple method to work (while trying to explain the problem) from the link, but would like to know how to use other methods. I don't know how to tell those methods to use certain values; they seem to just assign their own values to the data. The method below works if only a few columns need to be transformed.
import pandas as pd
df = pd.DataFrame({'Studentid': ['12', '40', '36'],
                   'history': ['A', 'C', 'C'],
                   'math': ['B', 'C', 'D'],
                   'biology': ['A', 'C', 'B']})
print(df)
Studentid history math biology
0 12 A B A
1 40 C C C
2 36 C D B
df['history1'] = df['history'].replace(to_replace=['A', 'B', 'C', 'D'], value=[32, 16, 4, 2])
df['math1'] = df['math'].replace(to_replace=['A', 'B', 'C', 'D'], value=[32, 16, 4, 2])
df['biology1'] = df['biology'].replace(to_replace=['A', 'B', 'C', 'D'], value=[32, 16, 4, 2])
Studentid history math biology history1 math1 biology1
0 12 A B A 32 16 32
1 40 C C C 4 4 4
2 36 C D B 4 2 16
If you need to transform a relatively large number of columns, you probably don't want to spell out all the column names one by one in your code. You can do it this way:
Assuming the column Studentid is not going to be transformed:
grade_map = {'A': 32, 'B': 16, 'C': 4, 'D': 2}
df_transformed = df.drop('Studentid', axis=1).replace(grade_map).add_suffix('1')
df = df.join(df_transformed)
We exclude the column Studentid from the transformation by dropping it first with .drop(), and then use .replace() to translate the gradings. This way we never translate Studentid, in case a student ID happens to contain the same characters as the gradings. We add the suffix 1 to all transformed columns using .add_suffix().
After the transformation, we join the original dataframe with the transformed columns using .join().
Result:
print(df)
Studentid history math biology history1 math1 biology1
0 12 A B A 32 16 32
1 40 C C C 4 4 4
2 36 C D B 4 2 16
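As a variation (a sketch, not from the original answer), the same dict can be applied with Series.map, overwriting the grade columns in place rather than adding suffixed copies:

```python
import pandas as pd

df = pd.DataFrame({'Studentid': ['12', '40', '36'],
                   'history': ['A', 'C', 'C'],
                   'math': ['B', 'C', 'D'],
                   'biology': ['A', 'C', 'B']})

grade_map = {'A': 32, 'B': 16, 'C': 4, 'D': 2}
grade_cols = df.columns.drop('Studentid')   # every column except the id

# map each grade column through the dict; grades missing from the dict become NaN
df[grade_cols] = df[grade_cols].apply(lambda s: s.map(grade_map))
print(df)
```

Unlike .replace(), .map() turns any value not in the dict into NaN, which makes unexpected grades easy to spot.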

Python One Hot Encoder: 'PandasArray' object has no attribute 'reshape'

I have the following program:
cat_feats = ['x', 'y', 'z', 'a', 'b', 'c', 'd', 'e']
onehot_encoder = OneHotEncoder(categories='auto')
# convert each categorical feature from integer to one-hot
for feature in cat_feats:
    data[feature] = data[feature].array.reshape(len(data[feature]), 1)
    data[feature] = onehot_encoder.fit_transform(data[feature])
I am having issues with this. I get:
'PandasArray' object has no attribute 'reshape'
The output of data.head() before using the encoder is this:
0 2 1 4 6 3 2 1 37
2 1 7 2 10 0 4 1 37
3 2 15 2 6 0 2 1 37
5 2 0 4 7 1 4 1 37
7 4 14 2 9 0 4 1 37
This output is of type DataFrame and contains only integers, which I am trying to convert to one-hot. I have tried .array, .values, and .array.reshape(-1, 1), but none of these work. Using .values seemed to work in the first line of the for loop, but I got garbage from my one-hot conversion.
Please help.
The following information might be helpful.
The types of some of the objects:
data[feature]: pandas.Series
data[feature].values: numpy.ndarray
You can reshape a numpy.ndarray but not a pandas.Series, so you need to use .values to get a numpy.ndarray.
When you assign a numpy.ndarray to data[feature], automatic type conversion occurs, so data[feature] = data[feature].values.reshape(-1, 1) doesn't seem to do anything.
fit_transform takes an array-like object as its argument (it needs to be a 2D array, e.g. a pandas.DataFrame or numpy.ndarray), because sklearn.preprocessing.OneHotEncoder is designed to fit/transform multiple features at the same time; passing a pandas.Series (a 1D array) will cause an error.
fit_transform returns a sparse matrix (or a 2D array); assigning that to a pandas.Series may cause a disaster.
(Not recommended) If you insist on processing one feature after another:
for feature in categorical_feats:
    encoder = OneHotEncoder()
    tmp_ohe_data = pd.DataFrame(
        encoder.fit_transform(data[feature].values.reshape(-1, 1)).toarray(),
        columns=encoder.get_feature_names_out(),  # get_feature_names() in older scikit-learn
    )
    data = pd.concat([tmp_ohe_data, data], axis=1).drop([feature], axis=1)
I recommend doing the encoding like this instead:
encoder = OneHotEncoder()
ohe_data = pd.DataFrame(
    encoder.fit_transform(data[categorical_feats]).toarray(),
    columns=encoder.get_feature_names_out(),  # get_feature_names() in older scikit-learn
)
res = pd.concat([ohe_data, data], axis=1).drop(categorical_feats, axis=1)
pandas.get_dummies is also a good choice.
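For reference, a minimal pd.get_dummies sketch on a hypothetical frame; it one-hot encodes the listed columns and drops the originals in a single call:

```python
import pandas as pd

data = pd.DataFrame({'color': ['red', 'blue', 'red'],
                     'size': ['S', 'M', 'S'],
                     'price': [10, 20, 15]})

# only 'color' and 'size' are expanded; numeric 'price' passes through untouched
res = pd.get_dummies(data, columns=['color', 'size'])
print(res.columns.tolist())
# ['price', 'color_blue', 'color_red', 'size_M', 'size_S']
```

Unlike OneHotEncoder, get_dummies returns a labeled DataFrame directly, so no toarray()/column-naming step is needed; the trade-off is that it doesn't remember the fitted categories for transforming new data.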

How to classify String Data into Integers?

I need to classify the String Values of a feature of my dataset, so that I can further use it for other things, let's say predicting or plotting.
How do I convert it?
I found this solution, but here I have to manually write code for every unique value of the feature. For 2-3 unique values it's alright, but I've got a feature with more than 50 unique country values; I can't write code for every country.
def sex_class(x):
    if x == 'male':
        return 1
    else:
        return 0
This changes the male values to 1 and the female values to 0 in the sex feature.
You can make use of the scikit-learn LabelEncoder
#given a list containing all possible labels
sex_classes = ['male', 'female']
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sex_classes)
This will assign labels to all the unique values in the given list. You can save this label encoder object as a pickle file for later use as well.
You can use rank or pd.factorize:
df['ID_int'] = df['id'].rank(method='dense').astype(int)
df['ID_int2'] = pd.factorize(df['id'])[0]
Output:
id ID_int ID_int2
0 a 2 0
1 b 3 1
2 c 4 2
3 a 2 0
4 b 3 1
5 c 4 2
6 A 1 3
7 b 3 1
The labels are different, but consistent.
You can use a dictionary instead.
sex_class = {'male': 1, 'female': 0}
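A short sketch of applying such a dict to a column with Series.map (the column name sex is assumed):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

sex_class = {'male': 1, 'female': 0}
df['sex'] = df['sex'].map(sex_class)   # values not in the dict become NaN
print(df['sex'].tolist())              # [1, 0, 0, 1]
```

For a 50-country feature, the same pattern works with a dict built programmatically, e.g. from enumerate() over the sorted unique values.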

Having columns with subscripts/indices in pandas

Let's say that I have some data from a file where some columns are "of the same kind", only with different subscripts of some mathematical variable, say x:
n A B C x[0] x[1] x[2]
0 1 2 3 4 5 6
1 2 3 4 5 6 7
Is there some way I can load this into a pandas dataframe df and somehow treat the three x-columns as an indexable, array-like entity (I'm new to pandas)? I believe it would be convenient, because I could do operations on the data-series contained in x such as sum(df.x).
Kind regards.
EDIT:
Admittedly, my original post was not clear enough. I'm not just interested in getting the sum of three columns. That was just an example. I'm looking for a generally applicable abstraction that I hope is built into pandas.
I'd like to have multiple columns accessible through (sub-)indices of one entity, e.g. df.x[0], such that I (or any other user of the data) can do whichever operation he/she wants (sum/max/min/avg/standard deviation, you name it). You can consider the x's as an ensemble of time-dependent measurements if you like.
Kind regards.
Consider you define your dataframe like this:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]],
                  columns=['A', 'B', 'C', 'x0', 'x1', 'x2'])
Then with
x = ['x0', 'x1', 'x2']
you can use the following notation, allowing a quite general definition of x:
>>> df[x].sum(axis=1)
0 15
1 18
dtype: int64
Look for the columns which start with 'x' and perform the operations you need:
column_num=[col for col in df.columns if col.startswith('x')]
df[column_num].sum(axis=1)
I'll give you another answer, which departs from your initial data structure in exchange for letting you address the values of the dataframe as df.x[0] etc.
Consider you have defined your dataframe like this:
>>> dv = pd.DataFrame(np.random.randint(10, size=20),
...                   index=pd.MultiIndex.from_product([range(4), range(5)]),
...                   columns=['x'])
>>> dv
x
0 0 8
1 3
2 4
3 6
4 1
1 0 8
1 9
2 1
3 8
4 8
[...]
Then you can exactly do this
dv.x[1]
0 8
1 9
2 1
3 8
4 8
Name: x, dtype: int64
which is your desired notation. Requires some changes to your initial set-up but will give you exactly what you want.
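Another option (a sketch, not from the answers above) is to keep one row per observation but give the columns themselves a MultiIndex, so that df['x'] selects all x-subcolumns as a sub-DataFrame:

```python
import pandas as pd

# second level is '' for scalar columns and an integer subscript for x
columns = pd.MultiIndex.from_tuples(
    [('A', ''), ('B', ''), ('C', ''), ('x', 0), ('x', 1), ('x', 2)])
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]], columns=columns)

print(df['x'])               # the three x-columns as a DataFrame
print(df['x'].sum(axis=1))   # row-wise sum over x[0], x[1], x[2]
print(df[('x', 0)])          # a single subscripted column
```

This keeps the original row layout intact while still letting any user of the data run sum/max/min/etc. over the x-ensemble with ordinary DataFrame methods.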

Importing Data and Columns from Another Python Pandas Data Frame

I have been trying to select a subset of a correlation matrix using the Pandas Python library.
For instance, if I had a matrix like
0 A B C
A 1 2 3
B 2 1 4
C 3 4 1
I might want to select a matrix where some of the variables in the original matrix are correlated with some of the other variables, like :
0 A C
A 1 3
C 3 1
To do this, I tried the following code: slice the original correlation matrix using the names of the desired variables in a list, transpose it, reassign the original column names, and then slice again.
data = pd.read_csv("correlationmatrix.csv")

initial_vertical_axis = pd.DataFrame()
for x in var_list:
    a = data[x]
    initial_vertical_axis = initial_vertical_axis.append(a)
print initial_vertical_axis

initial_vertical_axis = pd.DataFrame(data=initial_vertical_axis, columns=var_list)
initial_matrix = pd.DataFrame()
for x in var_list:
    a = initial_vertical_axis[x]
    initial_matrix = initial_matrix.append(a)
print initial_matrix
However, this returns an empty correlation matrix with the right row and column labels but no data like
0 A C
A
C
I cannot find the error in my code that would lead to this. If there is a simpler way to go about this, I am open to suggestions.
Suppose data contains your matrix,
In [122]: data
Out[122]:
A B C
0
A 1 2 3
B 2 1 4
C 3 4 1
In [123]: var_list = ['A','C']
In [124]: data.loc[var_list,var_list]
Out[124]:
A C
0
A 1 3
C 3 1
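The same one-line .loc slice works end-to-end when the matrix comes straight from DataFrame.corr(); a self-contained sketch with made-up data:

```python
import pandas as pd

raw = pd.DataFrame({'A': [1, 2, 3, 4],
                    'B': [2, 1, 4, 3],
                    'C': [4, 3, 2, 1]})

corr = raw.corr()                    # full correlation matrix
var_list = ['A', 'C']
sub = corr.loc[var_list, var_list]   # the A/C sub-matrix, rows and columns at once
print(sub)                           # A and C are perfectly anti-correlated here
```

No transposing or re-appending is needed; .loc with the same list for rows and columns selects the symmetric sub-matrix directly.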
