SyntaxError while trying to perform RobustScaler on a Pandas DataFrame

I am working with the House Prices Kaggle dataset. I am trying to use sklearn's RobustScaler on only the numerical features in the dataset (LotFrontage, LotArea, etc.). First, I fit the scaler to the numerical columns of my dataframe, selected with select_dtypes(exclude=['object']). Once the transformer has been fit to those values, I call transform on the same columns and try to assign the result back to the select_dtypes(...) expression. When I attempt that, I get the following error message:
SyntaxError: can't assign to function call
The data has already been cleared of null values. Assigning the transform result to a separate variable does work, but then I get the results back as a numpy.ndarray rather than in my dataframe:
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(df_train.select_dtypes(exclude=['object']))
df_train.select_dtypes(exclude=['object']) = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This doesn't work
test = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This DOES work, but not in the format I need
All I want is for the transformed attributes to go back into the original pandas DataFrame at their corresponding locations. Is there some workaround I can implement if I can't assign to the original dataframe columns directly?

I managed to get it to work. Not sure how Pythonic this solution is, but it got me back on track:
df_train[list(df_train.select_dtypes(exclude=['object']).columns)] = RobustScaler().fit_transform(df_train[list(df_train.select_dtypes(exclude=['object']).columns)])
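For reference, the same assignment can be written a bit more compactly by computing the numeric column index once; this is just a tidier variant of the line above, not a different technique:

from sklearn.preprocessing import RobustScaler

num_cols = df_train.select_dtypes(exclude=['object']).columns
df_train[num_cols] = RobustScaler().fit_transform(df_train[num_cols])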

Related

LabelEncoder - use of the inverse_transform function

I'm trying to figure out how to use the inverse_transform function from LabelEncoder(). For example, in the code below:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Label'] = le.fit_transform(df['Actual'])
If I want to reverse it, I can simply call:
le.inverse_transform(df['Label'])
However, I need to apply that same transformation (and its inverse) to a new dataset, which might be predicted from the model above. That is, the prediction happens in a new notebook, so it seems like I have to store the labels somewhere. Any ideas how to do this? My only idea is to export a DataFrame with two columns and use pd.merge.
Make a dictionary containing the inverse transform of the LabelEncoder you used in the first notebook, and then use that dictionary to remap the values in the second notebook.
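A minimal sketch of that idea, assuming the column names from the question ('Actual', 'Label') and a hypothetical new_df holding the predictions in the second notebook:

import json
from sklearn.preprocessing import LabelEncoder

# First notebook: fit the encoder and persist the label -> code mapping
le = LabelEncoder()
df['Label'] = le.fit_transform(df['Actual'])
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
with open('label_mapping.json', 'w') as f:
    json.dump(mapping, f)

# Second notebook: load the mapping, invert it, and decode the predictions
with open('label_mapping.json') as f:
    mapping = json.load(f)
inverse_mapping = {code: label for label, code in mapping.items()}
new_df['Predicted_Label'] = new_df['Predicted'].map(inverse_mapping)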

Can sklearn.compose.ColumnTransformer be used with a custom-defined transformer?

I would like to take the logarithm of specific columns of my dataframe.
I created a new transformer object:
from sklearn.preprocessing import FunctionTransformer
log_transformer = FunctionTransformer(np.log10, validate=True)
and it works nicely on one specific column of my dataframe:
log_transformer.transform(df['_column'])
Also, one could also overwrite the specific column of the original dataframe like:
df[['_column']]=log_transformer.transform(df[['_column']])
However, this operation then changes the original dataframe and wouldn't be useful in a pipeline.
When I try to include this transformer object into ColumnTransformer, I get an error message:
columnTransformer = ColumnTransformer([('log_transform', log_transformer.transform(), [0, 5])], remainder='passthrough')
How should I pass the custom-defined transformer object to ColumnTransformer? (The same syntax works very nicely for built-in transformers, as suggested in this article: https://towardsdatascience.com/columntransformer-in-scikit-for-labelencoding-and-onehotencoding-in-machine-learning-c6255952731b)
Thank you for your help!
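For what it's worth, ColumnTransformer expects the transformer object itself (anything with fit/transform methods), not the result of calling .transform(). A minimal sketch of that fix, assuming columns 0 and 5 are the ones to be log-transformed:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log10, validate=True)

# Pass the transformer object itself; ColumnTransformer calls fit/transform
# on the listed columns and passes the remaining columns through untouched.
columnTransformer = ColumnTransformer(
    [('log_transform', log_transformer, [0, 5])],
    remainder='passthrough')
transformed = columnTransformer.fit_transform(df)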

How to handle string data in ML classification

Hello, I am a beginner in machine learning. I have previously worked on some binary ML tasks where the data was numerical. Now I am facing a problem where I have to find the probability of a particular combination; I cannot disclose the dataset or the code at this point. My data is a DataFrame of 10 columns. I have to train my model on 8 columns and predict the possibility of the last 2 columns, i.e. my labels are a combination of the last 2 columns.
The problem I am facing is that these column values are not numerical. I have tried everything I came across but can't find a suitable way of converting them to numerical values. I have tried LabelEncoder from sklearn, which works on the labels but throws a memory error if I use it again. I have tried to_numeric from pandas, which reads all the values as NaN. The values are of the form '2be74fad-4d4'. Any suggestions on how to handle this issue would be highly appreciated.
To convert categorical data to numerical, you can try these approaches in sklearn:
Label Encoding
Label Binarizer
OneHot Encoding
Now, for your problem, you can use LabelEncoder, but there is a catch. With other sklearn transformers, you can declare one instance, fit it, and then transform a number of columns with it.
With LabelEncoder, you have to fit_transform on one column in the train data and then transform the same column in the test data, then repeat the process for the next categorical column.
You can iterate over a list of categorical columns to make it simple. Consider the snippet below:
from sklearn.preprocessing import LabelEncoder

cat_cols = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined']
enc = LabelEncoder()
for col in cat_cols:
    # cast to str so missing values and mixed types encode consistently
    train[col] = train[col].astype('str')
    test[col] = test[col].astype('str')
    train[col] = enc.fit_transform(train[col])
    test[col] = enc.transform(test[col])
Alternatively, you can create a dictionary with the mapping from each string to an integer. Then you either one-hot encode the result or just feed the integer to the neural network. If the characters have some meaning, you could also work on a per-character basis instead of word-based, but that depends on the task. If this string is just a unique identifier, leave it out and don't feed it to your model.
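A minimal sketch of the dictionary approach, assuming a hypothetical column named 'code' holding strings like the ones in the question:

import pandas as pd

df = pd.DataFrame({'code': ['2be74fad-4d4', '9ab12cd3-7e1', '2be74fad-4d4']})

# Build a string -> integer mapping from the unique values
mapping = {value: idx for idx, value in enumerate(df['code'].unique())}
df['code_encoded'] = df['code'].map(mapping)

# One-hot encode the integers if the model should not assume an ordering
one_hot = pd.get_dummies(df['code_encoded'], prefix='code')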

How can I reference additional fields from a Pandas DataFrame when vectorizing text documents in scikit-learn?

I'm building a supervised learning application using scikit-learn. My input data comes from a table. The text document is essentially one column ('description') of this table, but I think I can improve my accuracy by referencing other columns in the vectorization (for example to remove street addresses from the description field).
I think I can do this by specifying my own preprocessor and tokenizer functions when I construct the vectorizer. However, I run into trouble with the vectorizer's fit() method. I'm trying to pass a DataFrame containing my columns as the raw_documents.
When the raw_documents get to CountVectorizer._count_vocab() method to build the vocabulary, the code iterates through each record using "for doc in raw_documents:". I was expecting this to walk through each row in the DataFrame and provide a Pandas Series containing that record as the "doc". This "doc" would then get passed to the analyzer and then to my preprocessor and tokenizer where I could reference the associated fields in the Series by name.
Unfortunately, the default behavior for DataFrame is that iter() iterates along the information axis instead of the index axis. This means my vectorizer is now walking along the list of column headings instead of each record row (as a Pandas Series). The data that gets to the analyzer as the "doc" is just the column heading strings.
This simple example shows what I am trying to do. The preprocessor is dumb and incomplete, but it shows how I am trying to access the adjacent field on a record. (I could also use direction on how to properly update the description value on the input to avoid the Pandas "SettingWithCopyWarning" problem; I tried to follow the recommendation to use .loc[], but I still get the warning.)
import re
from io import StringIO

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def my_preprocessor(record):
    try:
        description = record.loc['description']
        # try to update the description field on this record by removing the street address
        description = re.sub(record['street'], '', description.lower())
        # need help here with SettingWithCopyWarning
        record.loc['description'] = description
        return record
    except:
        return record.lower()

data = StringIO('''"id","street","description","label_1","label_2"
"2341324","123 Elm Street","Pine Point was situated at 123 Elm Street in Boston.",1,1''')
df = pd.read_csv(data)

vect = CountVectorizer(preprocessor=my_preprocessor)
vect.fit_transform(df)
print(vect.vocabulary_)
This results in the column headings as my vocabulary:
{'id': 1, 'street': 4, 'description': 0, 'label_1': 2, 'label_2': 3}
I looked at a couple of options:
1. Wrap my input data in a DataFrame subclass (RowIterableDataFrame) that overrides iter() with a row-wise iterator implementation. I can make the iterator work, but scikit-learn's GridSearchCV does a bunch of slicing of the input data, so by the time it reaches _count_vocab() and its "for doc in raw_documents:", my RowIterableDataFrame has been sliced back into a regular DataFrame holding a subset of rows.
2. Pass in the records using DataFrame's iterrows() or itertuples() methods. This gets the right data in on a row-by-row basis, but fails the check_consistent_length() test when the fit() methods call indexable().
3. Subclass CountVectorizer and write my own version of the _count_vocab() method that iterates through raw_documents differently in the case of DataFrames (i.e. using .iloc[] indexing). I'd rather not do this because _count_vocab() does a bunch of other stuff I don't want to risk breaking.
4. Pre-process my records outside of scikit-learn to build a delimited string as input, pass a list of these strings in as the raw_documents, and then parse them in my preprocessor. This means extra passes through the data.
5. Pass in the records using DataFrame.to_dict(orient='records'), as sketched below. This gets me the right data on a row-by-row basis and keeps the column names for referencing in my preprocessor and tokenizer. The downside appears to be that the data for each row is copied into a dictionary instead of referencing the original data in the DataFrame as a Series.
I would welcome some guidance on how to do this. Perhaps changing the iteration behavior of a Pandas DataFrame or the simplest approach to extending the CountVectorizer.
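For what it's worth, a minimal sketch of option 5, reusing the example data from above: each row becomes a plain dict, the preprocessor reads the adjacent field by key, and it returns the cleaned string so that CountVectorizer sees one document per row:

import re
from io import StringIO

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def my_preprocessor(record):
    # record is a plain dict holding one row of the original DataFrame
    description = record['description'].lower()
    # strip the street address out of the description field
    return re.sub(re.escape(record['street'].lower()), '', description)

data = StringIO('''"id","street","description","label_1","label_2"
"2341324","123 Elm Street","Pine Point was situated at 123 Elm Street in Boston.",1,1''')
df = pd.read_csv(data)

vect = CountVectorizer(preprocessor=my_preprocessor)
vect.fit_transform(df.to_dict(orient='records'))
print(vect.vocabulary_)  # vocabulary now comes from the description text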

Patsy: New levels in categorical fields in test data

I am trying to use Patsy (with sklearn and pandas) for creating a simple regression model. The R-style formula creation is a major draw.
My data contains a field called 'ship_city' which can have any city from India. Since I am partitioning the data into train and test sets, there are several cities which appear only in one of the sets. A code snippet is given below:
df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')
The last line throws the following error:
patsy.PatsyError: Error converting data to categorical: observation
with value 'Kolkata' does not match any of the expected levels
I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.
Is there any way I can make this work with Patsy?
The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.
One way is by using the levels= argument to C(...), like:
# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))
dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)
Another option if you're using pandas's default categorical support is to record the set of possible values when you set up your data frame; if patsy detects that the object you've passed it is a pandas categorical then it automatically uses the pandas categories attribute instead of trying to guess what the possible categories are by looking at the data.
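A minimal sketch of that second approach, using the ship_city field from the question and the all_cities list built above:

import pandas as pd

# Record the full set of levels on both frames before calling dmatrices;
# patsy picks up the categories attribute instead of guessing from the data.
df_train['ship_city'] = pd.Categorical(df_train['ship_city'], categories=all_cities)
df_test['ship_city'] = pd.Categorical(df_test['ship_city'], categories=all_cities)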
I ran into a similar problem and I built the design matrices prior to splitting the data.
df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
df_train_X, df_test_X, df_train_Y, df_test_Y = \
    train_test_split(df_X, df_Y, test_size=test_size)
Then as an example of applying a fit:
import statsmodels.api as sm

model = sm.OLS(df_train_Y, df_train_X)
model2 = model.fit()
predicted = model2.predict(df_test_X)
Technically I haven't built a test case, but I haven't run into the "Error converting data to categorical" error again since implementing the above.
