Fitting RandomForest Model But Getting Pandas Error - python

I have 3 columns: id, sentiment, review. I am creating vectors and putting them through a RandomForest in order to predict the sentiment.
On the following line:
forest = forest.fit(trainDataVecs, train["sentiment"])
I keep getting the following error:
Error is: ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I got it working on a very small sample file, but it refuses to work on my large main one. I have checked and I am 100% certain there are no NULL entries. Some of the reviews are very long, and I think the review length must be causing a problem somewhere.
Please help!

The issue seems to be in one of the numerical columns. I would suggest that when you read the data from the source, you change the type to something more precise like np.float64, and also zero out any invalid values, like so:
# A is the vector you want to clean: zero out NaN and infinite entries
A[~np.isfinite(A)] = 0.0
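A minimal sketch of that cleaning step applied to the question's setup, assuming trainDataVecs is a NumPy array as in the question (np.nan_to_num replaces NaN with 0 and clips infinities to large finite values):
import numpy as np

# cast up to float64, then sanitise before fitting
trainDataVecs = np.asarray(trainDataVecs, dtype=np.float64)
trainDataVecs = np.nan_to_num(trainDataVecs)  # NaN -> 0.0, +/-inf -> finite
forest = forest.fit(trainDataVecs, train["sentiment"])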

Python data not being converted

I created a decision tree model in Python by training on the dataset, but the data was not converted from string to float.
Even after trying to convert to float manually, I still get errors that some arrays cannot be converted to float. Any solutions?
I have practiced with this dataset before, and I think what is going wrong for you is that you are shifting days before selecting the 'Close' column as a DataFrame. Try:
df = df[['Close']]
before you shift days (which is the 45th execution in your screenshots). It could do the trick.
(Next time, please add code as text instead of screenshots.)
Your x_train or y_train are not supposed to be strings; they should be of type numpy.ndarray. Can you check, or provide us the code for the place where you are splitting the data?
This is occurring due to something that was done wrong earlier. We need more insight into the code.
Your string data needs some preprocessing before it can be converted to float. You can convert your data to categorical variables (if you haven't already done so). For example, using pandas:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

x_train = pd.get_dummies(x_train)  # one-hot encode the categorical columns
tree = DecisionTreeRegressor().fit(x_train, y_train)
# more actions
Furthermore, I can see from the error that you have datetime data. You should convert it to a numeric timestamp, since the model cannot fit raw datetimes:
x_train['Date'] = pd.to_datetime(x_train['Date']).astype('int64')  # nanoseconds since the epoch
The rest of the preprocessing is up to you; there is a plethora of relevant tutorials. A self-contained sketch of these steps follows.
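For completeness, a minimal sketch of the whole preprocessing pipeline on a hypothetical frame (the column names 'Date', 'Sector', and 'Close' are made up for illustration):
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    'Date': ['2021-01-04', '2021-01-05', '2021-01-06'],
    'Sector': ['tech', 'energy', 'tech'],  # a categorical column
    'Close': [120.5, 64.2, 121.1],
})
df['Date'] = pd.to_datetime(df['Date']).astype('int64')  # numeric timestamp
X = pd.get_dummies(df.drop(columns='Close'))  # one-hot encode the categoricals
y = df['Close']
tree = DecisionTreeRegressor().fit(X, y)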

My data has no NaN but I keep getting the finite error

Context
I am trying to normalise my data to run an ML model. I am using np.log on my data:
plt.hist(np.log(Portfolio_rtns['Aveva Returns']))
I also tried this way:
log_Aveva = np.log(Portfolio_rtns['Aveva Returns'])
log_Aveva.hist();
But I get this error:
ValueError: supplied range of [-inf, -1.2977785811129585] is not finite
I checked my data and even made sure to replace any NaN values with 0.
I found this, which suggests using np.isfinite, but I feel as though my data distribution is messed up because of it:
Port = np.isfinite(Portfolio_rtns['Aveva Returns'])
plt.hist(np.log(Port));
I also ran this function
# Square root can also make normal distributed data
plt.hist(np.sqrt(Portfolio_rtns['Aveva Returns']));
Though I got the graph, it came with this warning:
358: RuntimeWarning: invalid value encountered in sqrt
result = getattr(ufunc, method)(*inputs, **kwargs)
Problem
Is there a problem with my data?
Having perused around Coursera, I found that np.log1p adds 1 to every value before taking the log, so the zeros in the data no longer produce -inf, and negative returns greater than -1 stay finite. So the solution is:
plt.hist(np.log1p(Portfolio_rtns['Aveva Returns']));
or:
data.describe()  # 'data' being your data
Find the minimum of the data.
Add the absolute value of that minimum (plus a small constant) to the entirety of the data, so every zero and negative value becomes positive.
Run np.log on the shifted data.
After running the ML model, subtract the shift again to convert back to the original values (the same applies to the log1p approach above). See the sketch below.
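A minimal sketch of both approaches, assuming Portfolio_rtns['Aveva Returns'] is a numeric Series as in the question (np.expm1 reverses log1p):
import numpy as np

returns = Portfolio_rtns['Aveva Returns']

# Option 1: log1p maps 0 -> 0 and keeps returns in (-1, 0] finite
log_rtns = np.log1p(returns)
recovered = np.expm1(log_rtns)  # back to the original scale

# Option 2: shift everything above zero, then take the log
shift = abs(returns.min()) + 1e-6  # small epsilon keeps the minimum > 0
log_shifted = np.log(returns + shift)
recovered2 = np.exp(log_shifted) - shift  # reverse the shift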

How to handle string data in ML classification

Hello, I am a beginner in machine learning. I have previously worked on some binary ML tasks where the data was numerical. Now I am facing an issue where I have to find the probability of a particular combination. I cannot disclose the dataset or the code at this point. My data is a dataframe of 10 columns. I have to train my model on 8 columns and predict the possibility of the last 2 columns; that is, my labels are a combination of the last 2 columns. The problem I am facing is that these column values are not numerical. I have tried everything I came across but can't find any suitable means of converting them to numerical values. I have tried LabelEncoder from sklearn, which works with the labels but throws a memory error if I use it again. I have tried to_numeric from pandas, which reads all the values as NaN. The values are of the form '2be74fad-4d4'. Any suggestions about how to handle this issue would be highly appreciated.
To convert categorical data to numerical, you can try these approaches in sklearn:
Label Encoding
Label Binarizer
OneHot Encoding
Now, for your problem, you can use LabelEncoder. But there is a catch: with other sklearn transformers, you can declare one instance and then use it to fit and transform a number of columns at once.
With LabelEncoder, you have to fit_transform on one column of the train data, then transform the same column in the test data, and then repeat the process for the next categorical column.
You can iterate over a list of categorical columns to keep it simple. Consider the snippet below:
from sklearn.preprocessing import LabelEncoder

cat_cols = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined']

enc = LabelEncoder()
for col in cat_cols:
    # LabelEncoder expects uniform string input
    train[col] = train[col].astype('str')
    test[col] = test[col].astype('str')
    # fit the mapping on train, then reuse it on test
    train[col] = enc.fit_transform(train[col])
    test[col] = enc.transform(test[col])
You can create a dictionary with the mapping from each string to an integer. Then you can use one-hot encoding, or just feed the integer to the neural network. If the characters have some meaning, you could also encode on a per-character basis instead of word-based, but that depends on the task. If this string is just a unique identifier, leave it out and don't feed it to your model.
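A minimal sketch of the dictionary approach, assuming df['id_col'] is a hypothetical column holding strings like '2be74fad-4d4':
# build a string -> integer mapping from the unique values (id_col is hypothetical)
mapping = {s: i for i, s in enumerate(df['id_col'].unique())}
df['id_col'] = df['id_col'].map(mapping)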

About LightFM: how to embed the feature matrices and recommend new items to users

My goal is to fit a model to my own data. I have processed the data into three files:
an interaction matrix (394 x 2188), an item feature matrix (5241 x 5241), and a user feature matrix (1043 x 1043). I have converted all of them into sparse matrices, and both feature matrices contain more entries than the interactions. When I fit the model on these data and make predictions, here are my code and errors:
codes:
model = LightFM(loss='warp')
model.fit(data,item_features=items,user_features=users,epochs=30, num_threads=2)
evaluation.auc_score(model,data)
errors:
raise ValueError('Incorrect number of features in item_features')
ValueError: Incorrect number of features in item_features
How can I handle data where the interaction matrix covers fewer users and items than the feature matrices?
How can I recommend both new and old items to all users (including new ones)?
I'm going to start by explaining why you're getting the error:
raise ValueError('Incorrect number of features in item_features')
ValueError: Incorrect number of features in item_features
This error has likely occurred because of the last line of your code
evaluation.auc_score(model,data)
If you check the docs here: https://lyst.github.io/lightfm/docs/lightfm.evaluation.html
You'll see that you need to also provide user_features=users and item_features=items as parameters to auc_score (when you're using user and item features in model.fit).
i.e. a fixed version may look like:
evaluation.auc_score(model, data, user_features=users, item_features=items)
Once you fix that error, we can take a closer look at your additional 2 questions.
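For reference, a minimal sketch of the corrected flow end to end, assuming the data, items, and users sparse matrices from the question:
from lightfm import LightFM
from lightfm import evaluation

model = LightFM(loss='warp')
model.fit(data, item_features=items, user_features=users, epochs=30, num_threads=2)
# pass the same feature matrices that were used during fit
auc = evaluation.auc_score(model, data, user_features=users, item_features=items).mean()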

Incorrect mean from pandas DataFrame

So here's an interesting thing:
Using Python 2.7:
I've got a dataframe of about 5,100 entries, each with a number (melting point) in a column titled 'Tm'. Using the code:
self.sort_df[['Tm']].mean(axis=0)
I get a mean of:
Tm 92.969204
dtype: float64
This doesn't make sense, because no entry has a Tm greater than 83.
Does .mean() not work for this many values? I've tried paring down the dataset, and it seems to work for ~1,000 entries, but considering I have a full dataset of 150,000 to run at once, I'd like to know if I need to find a different way to calculate the mean.
A more readable syntax would be:
sort_df['Tm'].mean()
Try sort_df['Tm'].value_counts() or sort_df['Tm'].max() to see what values are present; some unexpected values must have crept in. The sketch below shows how to check.
The .mean() function gives accurate results irrespective of the size of the data.
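A minimal sketch of that inspection, assuming sort_df is the DataFrame from the question:
# check the extremes and the distribution of 'Tm'
print(sort_df['Tm'].max())                  # anything above 83 is suspect
print(sort_df['Tm'].value_counts().head())  # most common values
# inspect the offending rows directly
print(sort_df[sort_df['Tm'] > 83])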
