I am trying to use nltk's wrapper for scikit-learn's classifiers. I use this code to train the classifier:
classifier = SklearnClassifier(GaussianNB())
classifier.train(self.training_set)
Where training_set looks like
[({'name':'Alpha Hotel', 'clicks':765, 'zip_code':75025},'no bookings')]
The error I am getting is
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
I don't know how to convert to a dense array, especially since nltk's documentation for the train method asks for "a list of (featureset, label) where each featureset is a dict mapping strings to either numbers, booleans or strings."
You have three features, but only two of them are numerical. You first need to convert the 'name' feature to a number. If the name variable is categorical, you can encode it in a meaningful manner as described here:
http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
Your labels are presumably also a limited set, so you can encode them too. The last step is really easy: you just need to convert the nltk format to numpy array format. Read each featureset in a loop, then insert your desired features into X (features) and y (labels), which is what GaussianNB expects:
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
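As a minimal sketch of that conversion (the second training example and the GaussianNB usage are assumptions for illustration; LabelEncoder is one simple way to turn the strings into integers):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# hypothetical data in nltk's (featureset, label) format
training_set = [({'name': 'Alpha Hotel', 'clicks': 765, 'zip_code': 75025}, 'no bookings'),
                ({'name': 'Beta Inn', 'clicks': 112, 'zip_code': 75024}, 'bookings')]

# encode the categorical 'name' feature as integers
name_enc = LabelEncoder()
names = name_enc.fit_transform([fs['name'] for fs, _ in training_set])

# build the dense feature matrix X and the encoded label vector y
X = np.array([[name, fs['clicks'], fs['zip_code']]
              for name, (fs, _) in zip(names, training_set)])
label_enc = LabelEncoder()
y = label_enc.fit_transform([label for _, label in training_set])

classifier = GaussianNB().fit(X, y)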
This may be late, but it might help others who hit the same problem (I ran into it yesterday).
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
As the error says, the data needs to be converted to a dense array, so I did exactly that:
vector = vectorizer.transform(corpus).toarray()
Just adding .toarray() solved the problem.
When I switched to MultinomialNB or BernoulliNB, neither of them raised an error, with or without .toarray().
Note: don't forget to fit and transform your text into a numeric word representation first.
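A minimal sketch of the difference (the corpus, labels, and use of CountVectorizer are assumptions for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

corpus = ['the first document', 'the second document']   # hypothetical
labels = [0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)    # sparse matrix

GaussianNB().fit(X.toarray(), labels)   # GaussianNB requires dense input
MultinomialNB().fit(X, labels)          # MultinomialNB accepts sparse input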
I am working with the House Prices Kaggle dataset. I am trying to use sklearn's RobustScaler only on the numerical features in the dataset (LotFrontage, LotArea, etc.). First, I fit the scaler to the numerical values of my dataframe, selected with select_dtypes(exclude=['object']). Once the transformer has been fit to those values, I call the transform function, trying to assign the transformed values back to the same select_dtypes selection. Once I attempt that, I get the following error message:
SyntaxError: can't assign to function call
The data has already been rid of null values. What has worked: when I assign the transform result to a separate variable, I get the results back as a numpy.ndarray
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(df_train.select_dtypes(exclude=['object']))
df_train.select_dtypes(exclude=['object']) = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This doesn't work: you can't assign to a function call
test = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This DOES work, but not in the format I need
All I want is for the transformed attributes to go back into the original pandas dataframe at their corresponding locations. Is there some workaround I can implement if I can't assign to the original dataframe directly?
I managed to get it to work. Not sure how Pythonic this solution is, but it got me back on track:
df_train[list(df_train.select_dtypes(exclude=['object']).columns)] = RobustScaler().fit_transform(df_train[list(df_train.select_dtypes(exclude=['object']).columns)])
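An equivalent, slightly more readable variant (a sketch against the same df_train as above): compute the numeric column names once and reuse them; the list(...) calls are unnecessary since pandas accepts an Index for column selection.

from sklearn.preprocessing import RobustScaler

num_cols = df_train.select_dtypes(exclude=['object']).columns
df_train[num_cols] = RobustScaler().fit_transform(df_train[num_cols])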
Hello, I am a beginner in machine learning; I have previously worked on some binary ML tasks where the data was numerical. Now I am facing an issue where I have to find the probability of a particular combination. I cannot disclose the dataset or the code at this point. My data is a dataframe of 10 columns. I have to train my model on 8 columns and predict the possibility of the last 2 columns; that is, my labels are a combination of the last 2 columns. The problem I am facing is that these column values are not numerical. I have tried everything I came across but can't find any suitable means of converting them to numerical values. I have tried LabelEncoder from sklearn, which works with the labels but throws a memory error if I use it again. I have tried to_numeric from pandas, which reads all the values as NaN. The values are in the form '2be74fad-4d4'. Any suggestions about how to handle this issue would be highly appreciated.
To convert categorical data to numerical, you can try these approaches in sklearn:
Label Encoding
Label Binarizer
OneHot Encoding
Now, for your problem, you can use LabelEncoder. But there is a catch: with other sklearn transformers, you can declare one instance and then use it to fit and transform a number of columns at once. With LabelEncoder, you have to fit_transform one column in the train data, then transform the same column in the test data, and then repeat the process for the next categorical column.
You can iterate over a list of categorical columns to make it simple. Consider the snippet below:
from sklearn.preprocessing import LabelEncoder

cat_cols = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined']

enc = LabelEncoder()
for col in cat_cols:
    # LabelEncoder works on string values, so cast first
    train[col] = train[col].astype('str')
    test[col] = test[col].astype('str')
    # fit on the train column, then apply the same mapping to the test column
    train[col] = enc.fit_transform(train[col])
    test[col] = enc.transform(test[col])
You can create a dictionary with the mapping from each string to an integer, then use one-hot encoding or just feed the integer to the neural network. If the characters have some meaning, you could also do it on a per-character basis instead of word-based, but that depends on the task. If this string is a unique identifier of the row or similar, just leave it out and don't feed it to your model.
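A minimal sketch of the dictionary approach (the identifier values are hypothetical):

# hypothetical identifier values
ids = ['2be74fad-4d4', '3ac19bb0-1f2', '2be74fad-4d4']

# map each distinct string to an integer, preserving first-seen order
mapping = {s: i for i, s in enumerate(dict.fromkeys(ids))}
encoded = [mapping[s] for s in ids]   # [0, 1, 0]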
I need to feed an image and a vector sampled from a normal distribution simultaneously. As the image dataset I'm using is too large, I create an ImageDeserializer for that part. But I also need to add a random vector (sampled from numpy's normal distribution) to the input map before feeding it to the network. Is there any way to achieve this?
I also tried:
mb_data = reader_train.next_minibatch(mb_size, input_map=input_map)
mb_data[random_input_node] = np.random.normal((mb_size, 100))
but got the following error:
TypeError: cannot convert value of dictionary to N4CNTK13MinibatchDataE
The problem was solved with the following snippet to feed data to the trainer (note that np.random.normal needs the size= keyword; without it, mb_size would be interpreted as the mean and only a single value would be drawn):
mb_data = reader_train.next_minibatch(mb_size, input_map=input_map)
z = np.random.normal(size=(mb_size, 100))
my_trainer.train_minibatch({feature_image: mb_data[image].data, feature_z: z})
Also thanks to @mewahl: defining a new reader is another suitable way to solve the problem, and I think it would be faster than what I have done.
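For reference, a numpy-only sketch of that sampling difference (independent of CNTK):

import numpy as np

mb_size = 4
np.random.normal(mb_size)               # ONE sample drawn with mean mb_size
np.random.normal(size=(mb_size, 100))   # an (mb_size, 100) array of N(0, 1) samples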
For a simple web page classification system I am trying to combine some self-made features (frequency of HTML tags, frequency of certain word collocations) with the features obtained after applying tf-idf. I am facing the following problem, however, and I don't really know how to proceed from here.
Right now I am trying to put all of these together in one dataframe, mainly by following the code from the following link:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
vectorizer = TfidfVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(train_data['text_no_punkt'])
feature_names = vectorizer.get_feature_names()
dense = X_train_counts.todense()
denselist = dense.tolist()
# note: index=train_data['text_no_punkt'] replaces the original integer index
# with the raw text; index=train_data.index would keep the 0..2464 index
tfidf_df = pd.DataFrame(denselist, columns=feature_names, index=train_data['text_no_punkt'])
But this doesn't give me back the index (from 0 to 2464) I had in my original dataframe with the other features, nor does it seem to produce readable column names: instead of using the different words as titles, it uses numbers.
Furthermore, I am not sure if this is the right way to combine features, as it will result in an extremely high-dimensional dataframe which will probably not benefit the classifiers.
You can use hstack to merge the two sparse matrices, without having to convert to dense format.
from scipy.sparse import hstack
hstack([X_train_counts, X_train_custom])
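A self-contained sketch of that approach (the corpus and the hand-made feature values are hypothetical; the point is that both matrices stay sparse):

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['first page text', 'second page text']          # hypothetical corpus
X_train_counts = TfidfVectorizer().fit_transform(docs)  # sparse tf-idf matrix

# hypothetical hand-made features, one row per document (e.g. HTML tag counts)
X_train_custom = csr_matrix(np.array([[3, 1], [0, 2]]))

X_combined = hstack([X_train_counts, X_train_custom]).tocsr()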
How do you fix inconsistent numbers of samples when using GaussianNB()? Also, is it possible to pass a pandas dataframe as an argument to the model.fit function?
The issue is that GaussianNB is expecting weather to be in the shape (n_samples, n_features). You currently have it as a one-dimensional array, so GaussianNB is interpreting it as one sample with 14 features.
To convert to the right shape, you can use weather[:,None] as described in this answer. So, the following should do the trick:
model.fit(weather[:,None], play)
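A minimal sketch of the reshape (the weather and play values are hypothetical; weather[:, None] and weather.reshape(-1, 1) are equivalent):

import numpy as np
from sklearn.naive_bayes import GaussianNB

weather = np.array([0, 1, 2, 1, 0])   # hypothetical encoded 1-D feature
play = np.array([0, 1, 1, 1, 0])      # hypothetical labels

X = weather[:, None]                  # shape (5, 1): n_samples x n_features
model = GaussianNB().fit(X, play)

As for the second question: sklearn estimators accept pandas DataFrames in fit; they are converted to numpy arrays internally, so a DataFrame of shape (n_samples, n_features) works directly.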