Calculate Tf-Idf Scores in pandas? - python

I want to calculate tf and idf seperately from the documents below. I'm using python and pandas.
import pandas as pd
df = pd.DataFrame({'docId': [1,2,3],
'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})
I want to calculate using formula for Tf-Idf not using Sklearn library.
After tokenization,I have used this for TF calculation:
tf = df.sent.apply(pd.value_counts).fillna(0)
but this giving me a count but I want ratio of (count/total number of words).
For Idf:
df[df['sent'] > 0] / (1 + len(df['sent'])
but it doesn't seems to work.
I want both Tf and Idf as pandas series format.
Edit
for tokenization I used df['sent'] = df['sent'].apply(word_tokenize)
I got idf scores as :
tfidf = TfidfVectorizer()
feature_array = tfidf.fit_transform(df['sent'])
d=(dict(zip(tfidf.get_feature_names(), tfidf.idf_)))
How I can get tf scores seperately?

You'll need to do a little more work to compute this.
import numpy as np
df = pd.DataFrame({'docId': [1,2,3],
'sent': ['This is the first sentence',
'This is the second sentence',
'This is the third sentence']})
# Tokenize and generate count vectors
word_vec = df.sent.apply(str.split).apply(pd.value_counts).fillna(0)
# Compute term frequencies
tf = word_vec.divide(np.sum(word_vec, axis=1), axis=0)
# Compute inverse document frequencies
idf = np.log10(len(tf) / word_vec[word_vec > 0].count())
# Compute TF-IDF vectors
tfidf = np.multiply(tf, idf.to_frame().T)
print(tfidf)
is the first This sentence second third
0 0.0 0.0 0.095424 0.0 0.0 0.000000 0.000000
1 0.0 0.0 0.000000 0.0 0.0 0.095424 0.000000
2 0.0 0.0 0.000000 0.0 0.0 0.000000 0.095424
Depending on your situation, you may want to normalize:
# L2 (Euclidean) normalization
l2_norm = np.sum(np.sqrt(tfidf), axis=1)
# Normalized TF-IDF vectors
tfidf_norm = (tfidf.T / l2_norm).T
print(tfidf_norm)
is the first This sentence second third
0 0.0 0.0 0.308908 0.0 0.0 0.000000 0.000000
1 0.0 0.0 0.000000 0.0 0.0 0.308908 0.000000
2 0.0 0.0 0.000000 0.0 0.0 0.000000 0.308908

Here is my solution:
first tokenize, for convenience as a separate column:
df['tokens'] = [x.lower().split() for x in df.sent.values]
then TF as you did, but with normalize parameter (for technical reasons you need a lambda func):
tf = df.tokens.apply(lambda x: pd.Series(x).value_counts(normalize=True)).fillna(0)
then IDF (one per word in vocabulary):
idf = pd.Series([np.log10(float(df.shape[0])/len([x for x in df.tokens.values if token in x])) for token in tf.columns])
idf.index = tf.columns
then if you want TFIDF:
tfidf = tf.copy()
for col in tfidf.columns:
tfidf[col] = tfidf[col]*idf[col]

I think I had the same issue as you.
I wanted to use TfIdfVectorizer but their default tf-idf definition is not standard (tf-idf = tf + tf*idf instead of the normal tf-idf = tf*idf)
TF = the term "frequency" is generally used to mean count. For that you can use CountVectorizer() from sklearn.
Need to log transform and normalize if needed.
The option using numpy was much longer in processing time (> 50 times slower).

Related

ValueError when attempting to create dataframe with OneHotEncoder results

I have recently started learning python to develop a predictive model for a research project using machine learning methods. I have used OneHotEncoder to encode all the categorical variables in my dataset
# Encode categorical data with oneHotEncoder
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
Z = ohe.fit_transform(Z)
I now want to create a dataframe with the results from the OneHotEncoder. I want the dataframe columns to be the new categories that resulted from the encoding, that is why I am using the categories_ attribute. When running the following line of code:
ohe_df = pd.DataFrame(Z, columns=ohe.categories_)
I get the error: ValueError: all arrays must be same length
I understand that the arrays being referred to in the error message are the arrays of categories, each of which has a different length depending on the number of categories it contains, but am not sure what the correct way of creating a dataframe with the new categories as columns is (when there are multiple features).
I tried to do this with a small dataset that contained one feature only and it worked:
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
df = pd.DataFrame(['Male', 'Female', 'Female'])
results = ohe.fit_transform(df)
ohe_df = pd.DataFrame(results, columns=ohe.categories_)
ohe_df.head()
Female Male
0 0.0 1.0
1 1.0 0.0
2 1.0 0.0
So how do I do the same for my large dataset with numerous features.
Thank you in advance.
EDIT:
As requested, I have come up with a MWE to demonstrate how it is not working:
import numpy as np
import pandas as pd
# create dataframe
df = pd.DataFrame(np.array([['Male', 'Yes', 'Forceps'], ['Female', 'No', 'Forceps and ventouse'],
['Female','missing','None'], ['Male','Yes','Ventouse']]),
columns=['gender', 'diabetes', 'assistance'])
df.head()
# encode categorical data
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
results = ohe.fit_transform(df)
print(results)
By this step, I have created a dataframe of categorical data and encoded it. I now want to create another dataframe such that the columns of the new dataframe are the categories created by the OneHotEncoder and rows are the encoded data. To do this I tried two things:
ohe_df = pd.DataFrame(results, columns=np.concatenate(ohe.categories_))
And I tried:
ohe_df = pd.DataFrame(results, columns=ohe.get_feature_names(input_features=df.columns))
Which both resulted in the error:
ValueError: Shape of passed values is (4, 1), indices imply (4, 9)
IIUC,
import numpy as np
import pandas as pd
# create dataframe
df = pd.DataFrame(np.array([['Male', 'Yes', 'Forceps'], ['Female', 'No', 'Forceps and ventouse'],
['Female','missing','None'], ['Male','Yes','Ventouse']]),
columns=['gender', 'diabetes', 'assistance'])
df.head()
# encode categorical data
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
results = ohe.fit_transform(df)
df_results = pd.DataFrame.sparse.from_spmatrix(results)
df_results.columns = ohe.get_feature_names(df.columns)
df_results
Output:
gender_Female gender_Male diabetes_No diabetes_Yes diabetes_missing assistance_Forceps assistance_Forceps and ventouse assistance_None assistance_Ventouse
0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
1 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
2 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
3 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
Note, the output of ohe.fit_transform(df) is a sparse matrix.
print(type(results))
<class 'scipy.sparse.csr.csr_matrix'>
You can convert this to a dataframe using pd.DataFrame.sparse.from_spmatrix. Then, we can use ohe.get_feature_names and passing the original dataframe columns to name your columns in the results dataframe, df_results.
ohe.categories_ is a list of arrays, one array for each feature. You need to flatten that into a 1D list/array for pd.DataFrame, e.g. with np.concatenate(ohe.categories_).
But probably better, use the builtin method get_feature_names.

How to get coefficients of multinomial logistic regression?

I need to calculate coefficients of a multiple logistic regression using sklearn:
X =
x1 x2 x3 x4 x5 x6
0.300000 0.100000 0.0 0.0000 0.5 0.0
0.000000 0.006000 0.0 0.0000 0.2 0.0
0.010000 0.678000 0.0 0.0000 2.0 0.0
0.000000 0.333000 1.0 12.3966 0.1 4.0
0.200000 0.005000 1.0 0.4050 1.0 0.0
0.000000 0.340000 1.0 15.7025 0.5 0.0
0.000000 0.440000 1.0 8.2645 0.0 4.0
0.500000 0.055000 1.0 18.1818 0.0 4.0
The values of y are categorical in range [1; 4].
y =
1
2
1
3
4
1
2
3
This is what I do:
import pandas as pd
from sklearn import linear_modelion
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
h = .02
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)
# print the coefficients
print(logreg.intercept_)
print(logreg.coef_)
However, I get 6 columns in the output of logreg.intercept_ and 6 columns in the output of logreg.coef_ How can I get 1 coefficient per feature, e.g. a - f values?
y = a*x1 + b*x2 + c*x3 + d*x4 + e*x5 + f*x6
Also, probably I am doing something wrong, because y_pred = logreg.predict(X) gives me the value of 1 for all rows.
Check the online documentation:
coef_ : array, shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
As #Xochipilli has already mentioned in comments you are going to have (n_classes, n_features) or in your case (4,6) coefficients and 4 intercepts (one for each class)
Probably I am doing something wrong, because y_pred =
logreg.predict(X) gives me the value of 1 for all rows.
yes, you shouldn't try to use data that you've used for training your model for prediction. Split your data into training and test data sets, train your model using train data set and check it's accuracy using test data set.

Python - How to do prediction and testing over multiple files using sklearn

I want to train a model and finally predict a truth value using a random forest model in Python of the three column dataset (click the link to download the full CSV-dataset formatted as in the following
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I wanted to predict the current value of Y (the true value) using the last (for example: 5, 10, 100, 300, 1000, ..etc) data points of X using random forest model of sklearn in Python. Meaning taking [0,0,1,2,3] of X column as an input for the first window - i want to predict the 5th row value of Y trained on the previous values of Y.
Let's say we have 5 traces of dataset (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) in the current directory. For a single trace (dataset) (for example, a1.csv) – I can do the prediction of a 5 window as the following
import pandas as pd
import numpy as np
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
df = pd.read_csv('a1.csv')
for i in range(1,5):
df['X_t'+str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X=pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
reg = RandomForestRegressor(criterion='mse')
reg.fit(X,y)
modelPred = reg.predict(X)
print(modelPred)
print("Number of predictions:",len(modelPred))
modelPred.tofile('predictedValues1.txt',sep="\n",format="%s")
meanSquaredError=mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)
I have solved this problem with random forest, which yields df:
rolling_regression')
time X Y X_t1 X_t2 X_t3 X_t4
0 0.000543 0 10 NaN NaN NaN NaN
1 0.000575 0 10 0.0 NaN NaN NaN
2 0.041324 1 10 0.0 0.0 NaN NaN
3 0.041331 2 10 1.0 0.0 0.0 NaN
4 0.041336 3 10 2.0 1.0 0.0 0.0
5 0.041340 4 10 3.0 2.0 1.0 0.0
6 0.041345 5 10 4.0 3.0 2.0 1.0
7 0.041350 6 10 5.0 4.0 3.0 2.0
.........................................................
[2845 rows x 7 columns]
[ 10. 10. 10. ..., 20. 20. 20.]
RMSE: 0.5136564734333562
However, now I want to do the prediction over all of the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv)by dividing the training into 60% of the datasets whose file name start with a and the remaining 40% for testing whose file name start with a using sklearn in Python (meaning 3 traces will be used for training and 2 files for testing)?
PS: All the files have the same structure but they are with different lengths for they are generated with different parameters.
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# get your X and Y Df's
x_train,x_test,y_train,y_test=sklearn.cross_validation.train_test_split(X,Y,test_size=0.40)
To read in multiple files, you'll need a slight extension. Aggregate data from each csv, then call pd.concat to join them:
df_list = []
for i in range(1, 6):
df_list.append(pd.read_csv('a%d.csv' %i))
df = pd.concat(df_list)
This will read in all your csvs, and you can carry on as usual. Get X and y:
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
Use sklearn.cross_validation.train_test_split to segment your data:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
You can also look at StratifiedKFold.

How to train and predict a model using Random Forest?

How can we predict a model using random forest? I want to train a model and finally predict a truth value using a random forest model in Python of the three column dataset (click the link to download the full CSV-dataset formatted as in the following
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I wanted to predict the current value of Y (the true value) using the last (for example: 5, 10, 100, 300, 1000, ..etc) data points of X using random forest model of sklearn in Python. Meaning taking [0,0,1,2,3] of X column as an input for the first window - i want to predict the 5th row value of Y trained on the previous values of Y. Similarly, using a simple rolling OLS regression model, we can do it as in the following but I wanted to do it using random forest model.
import pandas as pd
df = pd.read_csv('data_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
window_type='rolling', window=5, intercept=True)
I have solved this problem with random forest, which yields df:
t_stamp X Y X_t1 X_t2 X_t3 X_t4 X_t5
0.000543 0 10 NaN NaN NaN NaN NaN
0.000575 0 10 0.0 NaN NaN NaN NaN
0.041324 1 10 0.0 0.0 NaN NaN NaN
0.041331 2 10 1.0 0.0 0.0 NaN NaN
0.041336 3 10 2.0 1.0 0.0 0.0 NaN
0.041340 4 10 3.0 2.0 1.0 0.0 0.0
0.041345 5 10 4.0 3.0 2.0 1.0 0.0
0.041350 6 10 5.0 4.0 3.0 2.0 1.0
0.041354 7 10 6.0 5.0 4.0 3.0 2.0
.........................................................
[ 10. 10. 10. 10. .................................]
MSE: 1.3273548431
This seems to work fine for ranges 5, 10, 15, 20, 22. However, it doesn't seem to work fine for ranges greater than 23 (it prints MSE: 0.0) and this is because, as you can see from the dataset the values of Y are fixed (10) from row 1 - 23 and then changes to another value (20, and so on) from row 24. How can we train and predict a model of such cases based on the last data points?
It seems with the existing code, when calling dropna, you truncate X but not y. You also train and test on the same data.
Fixing this will give non-zero MSE.
Code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
df = pd.read_csv('/Users/shivadeviah/Desktop/estimated_pred.csv')
df1 = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(25)})
df1['Y'] = df['Y']
df1 = df1.sample(frac=1).reset_index(drop=True)
df1.dropna(inplace=True)
X = df1.iloc[:, :-1].values
y = df1.iloc[:, -1].values
x = int(len(X) * 0.66)
X_train = X[:x]
X_test = X[x:]
y_train = y[:x]
y_test = y[x:]
reg = RandomForestRegressor(criterion='mse')
reg.fit(X_train, y_train)
modelPred = reg.predict(X_test)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError = mean_squared_error(y_test, modelPred)
print("MSE:", meanSquaredError)
print(df1.size)
df2 = df1.iloc[x:, :].copy()
df2['pred'] = modelPred
df2.head()
Output:
[ 267.7 258.26608241 265.07037249 ..., 267.27370169 256.7 272.2 ]
Number of predictions: 87891
MSE: 1954.9271256
6721026
X_0 pred
170625 48 267.700000
170626 66 258.266082
170627 184 265.070372
170628 259 294.700000
170629 271 281.966667

Create a DTM from large corpus

I have a set of texts contained in a list, which I loaded from a csv file
texts=['this is text1', 'this would be text2', 'here we have text3']
and I would like to create a document-term matrix, by using stemmed words.
I have also stemmed them to have:
[['text1'], ['would', 'text2'], ['text3']]
What I would like to do is to create a DTM that counts all the stemmed terms (then I would need to do some operations on the rows).
For what concerns the unstemmed texts, I am able to make the DTM for short texts, by using the function fn_tdm_df reported here.
What would be more practical for me, though, is to make a DTM of the stemmed words. Just to be clearer, the output I have from applying "fn_tdm_df":
be have here is text1 text2 text3 this we would
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0
First, I do not know why I have only two rows, instead of three. Second, my desired output would be something like:
text1 would text2 text3
0 1 0 0 0
1 0 1 1 0
2 0 0 0 1
I am sorry but I am really desperate on this output. I also tried to export and reimport the stemmed texts on R, but it doesn't encode correctly. I would probably need to handle DataFrames, as for the huge amount of data. What would you suggest me?
----- UPDATE
Using CountVectorizer I am not fully satisfied, as I do not get a tractable matrix in which I can normalize and sum rows/columns easily.
Here is the code I am using, but it is blocking Python (dataset too large). How can I run it efficiently?
vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(texts)
print(pd.DataFrame(X.A, columns=vect.get_feature_names()).to_string())
df = pd.DataFrame(X.toarray().transpose(), index = vect.get_feature_names())
Why don't you use sklearn? The CountVectorizer() method converts a collection of text documents to a matrix of token counts. What's more it gives a sparse representation of the counts using scipy.
You can either give your raw entries to the method or preprocess it as you have done (stemming + stop words).
Check this out : CountVectorizer()

Categories