SGDClassifier on Big Data (sparse) - Python

Hello everyone, I am relatively new to data science. I am trying to train an SGDClassifier on over 4,000,000 samples of data, without any positive results.
Each X vector has 6 features and looks like: [2, 4, 56431555, 1, 0, 33]
Each Y vector has 1 feature, the category, which can be 1 or 0, e.g. [1].
These are some examples of my data records:
X :
[[ 2 4 56431555 1 0 33]
[ 2 1 71716268 1 0 623]
[ 0 1 302 0 1 33]
...
[ 0 4 3707 0 1 33]
[ 0 1 733126 1 0 33]
[ 0 4 30960953 1 0 33]]
Y:
[0 0 1 ... 1 1 0]
When I use .predict() on test data, the only result I get is that every test vector belongs to class 0, so I get an array full of zeros.
These are the parameters with which I initialize the classifier:
from sklearn import linear_model
model = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)
model.fit(data_train, target_train)
Any suggestions on how to approach this problem? (I have already tried standard scaling on my data.)
Note: the average loss during training is huge, and when I scale my data it is 0.97. I don't know whether this says anything about my dataset or model.
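A classifier that predicts only the majority class often points to class imbalance, and the third feature here is several orders of magnitude larger than the others, so scaling matters. A minimal sketch (an assumption, not a guaranteed fix) that scales inside a Pipeline, so the scaler is fit on the training data only, and reweights the classes:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# data_train / target_train as in the question
model = make_pipeline(
    StandardScaler(),                 # bring the huge third feature onto the same scale
    SGDClassifier(max_iter=1000, tol=1e-3,
                  class_weight='balanced'),  # reweight the loss if class 0 dominates
)
model.fit(data_train, target_train)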

Related

Classification based on categorical data

I have a dataset
Inp1 Inp2 Output
A,B,C AI,UI,JI Animals
L,M,N LI,DO,LI Noun
X,Y AI,UI Extras
For these values, I need to apply an ML algorithm. Which algorithm would be best suited to find the relations between these groups and assign an output class to them?
Assuming each cell is a list (as you have multiple strings stored in each cell), and that you are not looking for a specific encoding, the following should work. It can also be adjusted to suit different encodings.
import pandas as pd

A = [["Inp1", "Inp2", "Inp3", "Output"],
     [["A","B","C"], ["AI","UI","JI"], ["Apple","Bat","Dog"], ["Animals"]],
     [["L","M","N"], ["LI","DO","LI"], ["Lawn","Moon","Noon"], ["Noun"]]]
dataframe = pd.DataFrame(A[1:], columns=A[0])

def my_encoding(row):
    encoded_row = []
    for ls in row:
        encoded_ls = []
        for s in ls:
            sbytes = s.encode('utf-8')               # string -> UTF-8 bytes
            sint = int.from_bytes(sbytes, 'little')  # bytes -> one integer
            encoded_ls.append(sint)
        encoded_row.append(encoded_ls)
    return encoded_row

print(dataframe.apply(my_encoding))
output:
Inp1 ... Output
0 [65, 66, 67] ... [32488788024979009]
1 [76, 77, 78] ... [1853189966]
If my assumptions are incorrect, or this is not what you're looking for, let me know.
As you mentioned, you are going to apply an ML algorithm (say classification); I think one-hot encoding (OHE) is what you are looking for.
Requested format:
Inp1 Inp2 Inp3 Output
7,44,87 4,65,2 47,36,20 45
This format can't help you train your model, as there are multiple labels in a single cell; you would have to pre-process it again, e.g. with OHE.
Suggesting format:
A B C L M N X Y AI DO JI LI UI Apple Bat Dog Lawn Moon Noon Yemen Zombie
1 1 1 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0
0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1
After this, you can label-encode or one-hot encode the output field as your model requires.
Happy learning!
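If you want to build the suggested 0/1 layout programmatically, here is a minimal sketch using scikit-learn's MultiLabelBinarizer, assuming dataframe is the list-valued frame from the first answer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

encoded = []
for col in ["Inp1", "Inp2", "Inp3"]:
    mlb = MultiLabelBinarizer()
    # one 0/1 column per distinct token appearing in this input column
    encoded.append(pd.DataFrame(mlb.fit_transform(dataframe[col]),
                                columns=mlb.classes_,
                                index=dataframe.index))
X = pd.concat(encoded, axis=1)
print(X)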
BCE is for multi-label classification, whereas categorical CE is for multi-class classification where each example belongs to a single class. For your task you need to decide whether a single example ends up in exactly one class (CE) or may end up in multiple classes (BCE). Probably the second is true, since an animal can be a noun. ;)

How to prioritize certain features with the max_features parameter in CountVectorizer

I have a working program, but I realized that some important n-grams in the test data were not among the 6500 max_features I had allowed in the training data. Is it possible to add an n-gram like "liar" or "pathetic" as a feature to train on with my training data?
This is what I currently have for making the vectorizer:
vectorizer = CountVectorizer(ngram_range=(1, 2),
                             max_features=6500)
X = vectorizer.fit_transform(train['text'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
This is hacky, and you probably cannot count on it working in the future, but CountVectorizer primarily relies on the learned attribute vocabulary_, which is a dictionary with tokens as keys and "feature index" as values. You can add to that dictionary and everything appears to work as intended; borrowing from the example in the docs:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(X2.toarray())
## Output:
# [[0 0 1 1 0 0 1 0 0 0 0 1 0]
#  [0 1 0 1 0 1 0 1 0 0 1 0 0]
#  [1 0 0 1 0 0 0 0 1 1 0 1 0]
#  [0 0 1 0 1 0 1 0 0 0 0 0 1]]

# Now we tweak:
vocab_len = len(vectorizer2.vocabulary_)
vectorizer2.vocabulary_['new token'] = vocab_len  # append to end
print(vectorizer2.transform(["And this document has a new token"]).toarray())
## Output
# [[1 0 0 0 0 0 0 0 0 0 1 0 0 1]]
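A less fragile alternative (a sketch, not part of the original answer): fit once to learn the top features, merge in the n-grams you must keep, and refit with an explicit vocabulary argument, which CountVectorizer supports:

from sklearn.feature_extraction.text import CountVectorizer

must_have = ["liar", "pathetic"]  # n-grams to guarantee a column for
base = CountVectorizer(ngram_range=(1, 2), max_features=6500).fit(train['text'])
vocab = sorted(set(base.get_feature_names_out()) | set(must_have))
vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary=vocab)
X = vectorizer.fit_transform(train['text'])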

Confusion matrix fails to show all labels

I have created a confusion matrix for a multi-label classification task to evaluate the performance of an MLPClassifier model. The confusion matrix should be 10x10, but at times I get 8x8: it does not show values for one or two class labels, as you can see from the confusion matrix heatmap below the code whenever I run the whole Jupyter notebook. The true and predicted class labels range from 1 to 10 (unordered). Is this a code bug, or does it just depend on which samples end up in the test set when the data is split into train and test sets? How should I fix this? The code looks like this:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
Out: [[20 0 0 1 0 5 1 0]
[ 3 0 0 0 0 0 0 0]
[ 1 1 0 1 0 1 0 0]
[ 3 0 0 0 0 3 1 1]
[ 0 0 0 0 0 1 0 0]
[ 3 0 0 1 0 2 1 1]
[ 3 0 0 0 0 0 0 2]
[ 1 0 0 0 0 0 0 1]]
import matplotlib.pyplot as plt
import seaborn as sns
side_bar = [1,2,3,4,5,6,7,8,9,10]
f, ax = plt.subplots(figsize=(12,12))
sns.heatmap(cm, annot=True, linewidth=.5, linecolor="r", fmt=".0f", ax = ax)
ax.set_xticklabels(side_bar)
ax.set_yticklabels(side_bar)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
(screenshot: confusion matrix heatmap)
I think there is confusion here! The label set of a confusion matrix is the union set(y_test) | set(y_pred), so if that union has 8 elements, the matrix will be 8x8. If you want more labels printed, even when their rows are all zeros, you need to pass the labels parameter when building the matrix:
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1, 5]
y_pred = [0, 0, 2, 2, 0, 2, 4]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4, 5, 6])
As you can see, 6 does not appear in y_true or y_pred, so you will get zeros for it.
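Applied to the question's 10-class case (a sketch): pass the full label list so the matrix is always 10x10, and reuse it for the tick labels so they line up with the data:

import seaborn as sns
from sklearn.metrics import confusion_matrix

side_bar = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cm = confusion_matrix(y_test, y_pred, labels=side_bar)  # always 10x10
sns.heatmap(cm, annot=True, fmt=".0f",
            xticklabels=side_bar, yticklabels=side_bar)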

How to handle "ValueError: Classification metrics can't handle a mix of multilabel-indicator and multiclass targets"

I got this error when trying to compute the prediction accuracy. I have tried every possible approach and all the related Stack Overflow issues, but I cannot fix the bug.
The code snippet with the bug is:
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support as score  # assumed import for `score`

author_pred1 = model1.predict([ThreeGramTest, ThreeGramTest, ThreeGramTest, ThreeGramTest])
print("class prediction without argmax:", author_pred1)
author_pred1 = np.argmax(author_pred1, axis=1)
# Evaluate
print("test data one hot label", TestAuthorHot)
print("class prediction with argmax:", author_pred1)
# author_pred1 = author_pred1.astype("int64")
print("type of prediction output", type(author_pred1))
print("type of test data", type(TestAuthorHot))
print(np.array(np.unique(author_pred1, return_counts=True)).T)
print(np.array(np.unique(TestAuthorHot, return_counts=True)).T)
# accuracy = accuracy_score(TestAuthorHot, author_pred1.round(), normalize=False)  # the bug is here
precision, recall, f1, support = score(TestAuthorHot, author_pred1)
ave_precision = np.average(precision, weights=support / np.sum(support))
ave_recall = np.average(recall, weights=support / np.sum(support))
For reference on the shapes, the data values look like this:
class prediction without argmax: [[3.9413989e-02 8.4685171e-03 2.7781539e-03 ... 5.0324947e-03
6.2263450e-07 3.1461464e-10]
[1.1533947e-02 4.0361892e-02 1.4060171e-02 ... 4.7175577e-05
1.4333490e-01 2.0528505e-07]
[4.5363868e-06 3.1557463e-03 1.4047540e-02 ... 1.3272668e-03
4.6724287e-07 5.9454552e-10]
...
[1.9417159e-04 1.7364822e-02 2.9031632e-03 ... 5.0036388e-04
1.3315305e-04 9.0704253e-07]
[1.8054984e-09 2.9453583e-08 2.3744430e-08 ... 2.7137769e-03
7.7114571e-08 4.9026494e-10]
[7.8946296e-06 5.9516740e-05 8.2868773e-10 ... 3.1905161e-04
2.5262805e-06 2.0384558e-09]]
test data one hot label [[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 1 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 1 ... 0 0 0]]
class prediction with argmax: [ 7 37 37 ... 39 4 4]
How can I fix this error?
The error happens because you are passing accuracy_score a 2D matrix (TestAuthorHot is a 2D one-hot matrix of labels). accuracy_score accepts only 1D vectors, so you need to transform TestAuthorHot into 1D to match author_pred1 (which is 1D).
To do this you can simply do:
accuracy_score(np.argmax(TestAuthorHot, axis=1), author_pred1)
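The same transformation fixes the precision_recall_fscore_support call in the question, since it also needs 1D class indices on both sides; a short sketch:

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = np.argmax(TestAuthorHot, axis=1)  # one-hot matrix -> 1D class indices
print(accuracy_score(y_true, author_pred1))
precision, recall, f1, support = precision_recall_fscore_support(y_true, author_pred1)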

Regression coefficient calculation in Python

I have a DataFrame (produced via pandas) and an input text file of activity values. I want to find the regression coefficient of each term using the following formula:
Y = C1a*X1a + C1b*X1b + ... + C2a*X2a + C2b*X2b + ... + C0,
where Y is the activity, Cna the regression coefficient for residue choice a at position n, Xna the dummy variable (Xna = 1 or 0) coding the presence or absence of residue choice a at position n, and C0 the mean value of the activity.
My dataframe look likes
2u 2s 4r 4n 4m 7h 7v
0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0
Here 1 and 0 represents the presence and absence of residues respectively.
Using MLR (multiple linear regression), how can I find the regression coefficient of each residue, i.e. 2u, 2s, 4r, 4n, 4m, 7h, 7v?
C1a represents the regression coefficient of residue a at position 1 (here 1a is 2u, 1b is 2s, 2a is 4r, ...), and X1a represents the dummy value, i.e. 0 or 1, corresponding to 1a.
The activity file contains the following data:
6.5
5.9
5.7
6.4
5.2
So the first equation will look like:
6.5 = C1a*0 + C1b*1 + C2a*1 + C2b*0 + C2c*0 + C3a*0 + C3b*1 + C0
…
Can I get the regression coefficients using numpy? Please help me; all suggestions are appreciated.
Let A be your dataframe (you can get it as a plain numpy array; read it in with np.loadtxt if it's a CSV) and let y be your activity file (again as a numpy array), then use np.linalg.lstsq:
DF = """0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0"""
res = """6.5, 5.9, 5.7, 6.4, 5.2"""
A = np.fromstring ( DF, sep=" " ).reshape((5,7))
y = np.fromstring(res, sep=" ")
(x, res, rango, svals ) = np.linalg.lstsq(A, y )
print x
# 2.115625, 2.490625, 1.24375 , 1.19375 , 2.16875 , 2.115625, 2.490625
print np.sum(A.dot(x)**2) # Sum of squared residuals:
# 177.24750000000003
print A.dot(x) # Print predicition
# 6.225, 6.175, 5.425, 6.4 , 5.475
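Note that the question's formula also includes an intercept C0, which the fit above omits; one way to estimate it (an assumption about the intended setup) is to append a column of ones to A:

A1 = np.hstack([A, np.ones((A.shape[0], 1))])  # constant column for the intercept
coef, residuals, rank, svals = np.linalg.lstsq(A1, y, rcond=None)
print(coef[:-1])  # residue coefficients C1a .. C3b
print(coef[-1])   # intercept C0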
