scikit-learn logistic regression feature importance - python
I'm looking for a way to get an idea of the impact of the features I'm using in a classification problem. Using sklearn's logistic regression classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), I understood that the .coef_ attribute gets me the information I'm after (as also discussed in this thread: How to find the importance of the features for a logistic regression model?).
The first few lines of my matrix:
phrase_type,type,complex_np,np_form,referentiality,grammatical_role,ambiguity,anaphor_type,dir_speech,length_of_span,length_of_coref_chain,position_in_coref_chain,position_in_sentence,is_topic
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,1,-1,18,True
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,2,1,1,True
np,none,no,defnp,discourse-new,sbj,not_ambig,_unspecified_,text_level,2,1,-1,9,True
Where the first line is the header, followed by the data (I use the LabelEncoder from sklearn.preprocessing in my code to convert these values to ints).
Now, when I do a
print(classifier.coef_)
I get
[[ 0.84768459 -0.56344453 0.00365928 0.21441586 -1.70290447 -0.18460676
1.6167634 0.08556331 0.02152226 -0.05111953 0.07310608 -0.073653 ]]
which contains 12 columns/elements. I'm confused by this, since my data contains 13 feature columns (plus a 14th with the label; I separate the features from the labels later on in my code).
I was wondering if maybe sklearn expects/assumes the first column to be the id and doesn't actually use the value of this column? But I cannot find any info on this.
Any help here would be much appreciated!
Not sure how to edit my original question in a way that it would still make sense for future reference, so I'll post a minimal example here:
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy
headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
['np','none','no','no,pds','referring','dir-obj','not_ambig','_unspecified_','text_level','1','1','-1','1','True'],
['np','none','no','pds','not_specified','sbj','not_ambig','_unspecified_','text_level','1','1','-1','21','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','1','-1','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','2','0','6','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','2','0','4','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','21','1','-1','1','True'],
['np','anaphoric','no','ne','referring','other','not_ambig','anaphor_nominal','text_level','1','9','4','2','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','3','9','5','1','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','9','7','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','2','1','1','True'],
['np','anaphoric','no','ne','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','2','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','9','1','13','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','3','0','5','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','1','-1','1','False'],
['np','none','no','ne','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','9','0','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','5','1','-1','5','False'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','1','5','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','3','3','0','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','3','1','1','True'],
['np','anaphoric','no','pds','referring','sbj','not_ambig','anaphor_nominal','text_level','1','1','-1','2','True']
]
df = pandas.DataFrame(matrix, columns=headers)
d = defaultdict(LabelEncoder)  # one LabelEncoder per column, created on first access
fit = df.apply(lambda x: d[x.name].fit_transform(x))  # fit an encoder to every column
df = df.apply(lambda x: d[x.name].transform(x))  # apply the fitted encoders
testrows = []
trainrows = []
splitIndex = len(matrix)//10  # integer division, so the comparison below is against an int
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
train_labels = traindf.is_topic
labels = list(set(train_labels))
train_labels = numpy.array([labels.index(x) for x in train_labels])
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
labels = list(set(test_labels))
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)
classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
print(classifier.coef_)
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)
print(f1)
I think I may have found the source of the error (thanks @Alexey Trofimov for pointing me in the right direction). My code at first contained:
train_features = traindf.iloc[:,1:len(headers)-1]
Which was copied from another script, where I did have ids as the first column in my matrix and hence didn't want to take them into account. The len(headers)-1, if I understand things correctly, excludes the label column. Testing this in a real-world scenario, deleting the -1 results in a perfect f-score, which makes sense: the label itself would then be included as a feature, so the classifier could always predict correctly...
So I now changed this to
train_features = traindf.iloc[:,0:len(headers)-1]
as in the code snippet above, and now get 13 columns (in train_features.shape, and consequently in classifier.coef_).
I think this solved my issue, but am still not 100% convinced, so if someone could point out an error in this line of reasoning/my code above, I'd be grateful to hear about it.
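One quick sanity check, a sketch assuming the training snippet above has already run, is to pair each coefficient with its feature name and assert that the counts line up:

feature_names = headers[:-1]  # every column except the is_topic label
# binary LogisticRegression stores one weight per feature, in a (1, n_features) array
assert classifier.coef_.shape == (1, len(feature_names))
for name, weight in zip(feature_names, classifier.coef_[0]):
    print(name, weight)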
Related
Random Forest Regressor Feature Importance all zero
I'm running a random forest regressor using scikit-learn, but all the predictions end up being the same. I realized that when I fit the data, all the feature importances are zero, which is probably why all the predictions are the same. This is the code that I'm using:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")

target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis=1, inplace=True)
features_list = list(merged_df.columns)

#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state=16)

#Train model
rf = RandomForestRegressor(n_estimators=150, random_state=16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)

#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)

Here's a link to the data file: https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing

If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.
Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:

np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])

This practically means that you can predict "400kmDensity" almost exclusively with "410kmDensity"; on a scatter plot the two form an almost perfect line. In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importances can help to identify explanatory variables afterward. Note that feature importance may not be a perfect metric to determine actual feature importance; you may want to take a look at other available methods, like the Boruta algorithm or permutation importance.

In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with your very low target values. I was able to get feature importances after I scaled train_target and train_features in rf.fit(). However, this should not actually be necessary at all in order to apply Random Forest! You may want to take a look at the respective documentation or extend your search in this direction. Hope this serves as a hint.

from sklearn.preprocessing import scale
fitted_rf = rf.fit(scale(train_features), scale(train_target))

As mentioned before, the feature importances after this change are unsurprisingly no longer all zero. Also, the column "second" holds only the value zero, which does not explain anything! Your first step should always be EDA (Exploratory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms in order to explore data distributions. There is much more to it, but I hope this gives you a leg-up!
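As a side note, a quick way to screen for such near-duplicate features before fitting. This is a sketch, assuming the columns of merged_df are numeric; the column names come from the question and the 0.99 threshold is illustrative:

import numpy as np

# Absolute pairwise correlations, upper triangle only so each pair is counted once
corr = merged_df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
print("Near-duplicate columns to consider dropping:", to_drop)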
Error with scikit-learn training: Unable to allocate array with shape
I have the following columns: Col1: string, Col2: float, Col3: float. During prediction I want to predict the value of Col3:

import pickle
import numpy as np
from sklearn import model_selection
from sklearn import linear_model
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

x_columns_to_encode = ['Col1']
x_columns_to_scale = ['Col2']
y_columns_to_scale = ['Col3']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)

# Scale and Encode Separate Columns
x_scaled_columns = scaler.fit_transform(df1[x_columns_to_scale])
x_encoded_columns = ohe.fit_transform(df1[x_columns_to_encode])
y_scaled_columns = scaler.fit_transform(df1[y_columns_to_scale])

df = np.concatenate([x_scaled_columns, x_encoded_columns], axis=1)

validation_size = 0.50
seed = 7
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(df, y_scaled_columns, test_size=validation_size, random_state=seed)

bestScore = 0.0
model = linear_model.LinearRegression()
score = model.fit(x_train, y_train).score(x_validation, y_validation)
print(score)

When I run this code I get the error: "Unable to allocate array with shape (2763330, 25380) and data type float64". Can someone please help me understand where I am making a mistake?
One hot encoding generates a new column for every unique class in the categorical column. If your categorical column has too many unique classes, you can run out of memory. It might help to show us what your data looks like, so we can provide better suggestions. In the meantime, you could try the following options (a sparse-encoding sketch follows the list):

- Use a sparse representation for the one hot encoded matrix. This saves a lot of space, since a one hot encoded matrix has a lot of entries which are 0.
- Use a label encoding. However, be careful, because label encoding assumes an ordinal relationship between classes. It's better to use a tree-based model if there is no ordinal relationship between classes, as trees don't use a mathematical equation to build the model, so the ordinal relationship is not relevant when the decision tree makes a split. Refer to this for a clear explanation.
- Make a logical grouping of some of the classes. For example, suppose your categorical column has the classes 'tiger', 'lion' and 'cheetah'; you could group these together logically under one class, say, 'big_cats'. This reduces the number of unique classes, and hence the number of new columns added.
- Drop the categorical column entirely.
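A minimal sketch of the first option, reusing the variable names from the question (note: older sklearn versions spell the flag sparse=True, newer ones sparse_output=True):

from scipy.sparse import hstack
from sklearn.preprocessing import OneHotEncoder

# Keep the encoded matrix sparse instead of materialising a huge dense array
ohe = OneHotEncoder(sparse=True)
x_encoded_columns = ohe.fit_transform(df1[x_columns_to_encode])  # scipy sparse matrix

# Stack the scaled numeric column with the sparse matrix without densifying it
X = hstack([x_scaled_columns, x_encoded_columns])

LinearRegression accepts such a sparse X directly, so the rest of the script can stay as it is.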
Time Series Classification
You can access the data set at this link: https://drive.google.com/file/d/0B9Hd-26lI95ZeVU5cDY0ZU5MTWs/view?usp=sharing

My task is to predict the price movement of a sector fund. How much it goes up or down doesn't really matter; I only want to know whether it's going up or down, so I define it as a classification problem. Since this data set is time-series data, I met many problems. I have read articles about these problems, for example that I can't use k-fold cross validation since this is time series data and you can't ignore the order of the data. My code is as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.linear_model import LinearRegression
from math import sqrt
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

lag1 = pd.read_csv(#local file path, parse_dates=['Date'])

#Trend: if price going up: True, otherwise False
lag1['Trend'] = lag1.XLF > lag1.XLF.shift()

train_size = round(len(lag1)*0.50)
train = lag1[0:train_size]
test = lag1[train_size:]

variable_to_use = ['rGDP','interest_rate','private_auto_insurance','M2_money_supply','VXX']
y_train = train['Trend']
X_train = train[variable_to_use]
y_test = test['Trend']
X_test = test[variable_to_use]

#SVM Lag1
this_C = 1.0
clf = SVC(kernel='linear', C=this_C).fit(X_train, y_train)
print('XLF Lag1 dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

#Check prediction results
clf.predict(X_test)

First of all, is my method right here: first generating a column of True and False? I am afraid the machine can't understand this column if I simply feed it in. Should I first perform a regression, then compare the numeric result to generate a list of going up or down?

The accuracy on the training set is very low, at 0.58, and clf.predict(X_test) gives me an array of all Trues, which I don't know why. I also don't know how the resulting accuracy is calculated: for example, I think my current accuracy only counts the number of Trues and Falses while ignoring their order. Since this is time-series data, ignoring the order is not right and gives me no information about predicting price movement. Say I have 40 examples in the test set and I got 20 Trues; I would get 50% accuracy, but I guess the Trues are not in the same positions as they appear in the ground truth set. (Tell me if I am wrong.)

I am also considering using a Gradient Boosted Tree to do the classification; would it be better?
Some preprocessing of this data would probably be helpful. Step one might go something like:

df = pd.read_csv('YOURLOCALFILEPATH', header=0)
#more code than your method, but labels rows as 0 or 1 and is easy to output to a new file for later reference
df['Date'] = pd.to_datetime(df['date'], unit='d')
df = df.set_index('Date')
df['compare'] = df['XLF'].shift(-1)
df['Label'] = np.where(df['XLF'] > df['compare'], 1, 0)
df.drop('compare', axis=1, inplace=True)

Step two can use one of sklearn's built-in scalers, such as the MinMaxScaler, to preprocess the data by scaling your feature inputs before feeding them into your model.
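For step two, a minimal sketch of that scaling, reusing the split variable names from the question (the essential point is fitting the scaler on the training period only, then reusing it unchanged on the test period):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit on the training window only; a scaler fit on all the data would leak future information
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = SVC(kernel='linear', C=1.0).fit(X_train_scaled, y_train)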
Feature selection with sklearn - ValueError: X has a different shape than during fitting
:) Very sorry in advance if my code looks like something a total newbie would write. Below is a portion of my code in Python. I am fiddling with sklearn and machine learning techniques.

I trained several Naive Bayes models based on different datasets and stored them in trained_models. Prior to this step I created an object chi_squared of the SelectPercentile class, using the chi2 function for feature selection. From my understanding, I should write:

data_feature_reduced = chi_squared.transform(some_data)

then use data_feature_reduced at training time, i.e.:

nb.fit(data_feature_reduced, data.target)

This is what I did, and I stored the resulting nb objects (and some other information) in the list trained_models. I am now attempting to apply these models to a different set of data (actually from the same source, if that matters to the question):

for name, model, intra_result, dev, training_data, chi_squarer in trained_models:
    cross_results = []
    new_vect = StemmedVectorizer(ngram_range=(1, 4), stop_words='english', max_df=0.90, min_df=2)
    for data in demframes:
        data_name = data[0]
        X_test_data = new_vect.fit_transform(data[1].values.astype('U'))
        Y_test_data = data[2]
        chi_squared_test_data = chi_squarer.transform(X_test_data)
        final_results.append((name, "applied to", data[0], model.score(X_test_data, Y_test_data)))

I have to admit that I am a bit of a stranger to the feature selection part. Here is the error that I get:

ValueError: X has a different shape than during fitting.

at the line chi_squared_test_data = chi_squarer.transform(X_test_data). I am assuming I am doing feature selection in an incorrect manner. Where did I go wrong?
Thanks to everyone for their help! I will just paste the comment from @Vivek-Kumar that helped me solve my problem:

This error is due to the line new_vect.fit_transform(). Like your trained models, you should use the same StemmedVectorizer that was used at training time. The same StemmedVectorizer object will transform X_test_data to the same shape it had during training. Currently, you are creating a different object and fitting on it (fit_transform is fit plus transform), hence the shape is different. Hence the error.
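In code, the fix might look like this sketch, assuming the vectorizer fitted at training time was also stored in trained_models (the trained_vect tuple element here is hypothetical; the score is computed on the chi-squared-reduced matrix, since the models were trained on reduced features):

for name, model, intra_result, dev, training_data, chi_squarer, trained_vect in trained_models:
    for data in demframes:
        # transform() only: maps the test data into the vocabulary learned at training time
        X_test_data = trained_vect.transform(data[1].values.astype('U'))
        chi_squared_test_data = chi_squarer.transform(X_test_data)
        final_results.append((name, "applied to", data[0], model.score(chi_squared_test_data, data[2])))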
Why not use a pipeline to make it simple? That way you don't have to transform twice and take care of the shapes.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

chi_squarer = SelectKBest(chi2, k=100)  # change accordingly
lr = LogisticRegression()  # or naive bayes
clf = Pipeline([('chi_sq', chi_squarer), ('model', lr)])

# for training:
clf.fit(training_data, targets)
# for predictions:
clf.predict(test_data)

You can also add the new_vect in the pipeline, as sketched below.
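A sketch of that last suggestion, putting the vectorizer in as the first step (StemmedVectorizer is the asker's custom class, and the raw_* variable names are illustrative):

clf = Pipeline([
    ('vect', StemmedVectorizer(ngram_range=(1, 4), stop_words='english', max_df=0.90, min_df=2)),
    ('chi_sq', SelectKBest(chi2, k=100)),
    ('model', LogisticRegression()),
])
# fit() fits the vectorizer once; predict()/score() then reuse its fitted vocabulary
clf.fit(raw_training_texts, targets)
clf.predict(raw_test_texts)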
Using ranking data in Logistic Regression
I will be putting the max bounty on this as I am struggling to learn these concepts! I am trying to use some ranking data in a logistic regression. I want to use machine learning to make a simple classifier as to whether a webpage is "good" or not. It's just a learning exercise, so I don't expect great results; I'm just hoping to learn the "process" and coding techniques.

I have put my data in a .csv as follows:

URL WebsiteText AlexaRank GooglePageRank

In my test CSV we have:

URL WebsiteText AlexaRank GooglePageRank Label

Label is a binary classification indicating "good" with 1 or "bad" with 0.

I currently have my LR running using only the website text, on which I run a TF-IDF. I have two questions which I need help with. I'll be putting a max bounty on this question and awarding it to the best answer, as this is something I'd like some good help with so I, and others, may learn.

First: how can I normalize my ranking data for AlexaRank? I have a set of 10,000 webpages, for which I have the Alexa rank of all of them; however, they aren't ranked 1-10,000. They are ranked out of the entire Internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in scikit-learn in order to get the best possible results from my data?

Second: I am running my logistic regression in this way; I am nearly sure I have done this incorrectly. I am trying to do the TF-IDF on the website text, then add the two other relevant columns and fit the logistic regression. I'd appreciate it if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any and all feedback on how I can improve myself would also be appreciated here.

loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')

print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:,2])  #Reading WebsiteText column for TF-IDF.
testdata = list(np.array(p.read_table('test.tsv'))[:,2])
y = np.array(p.read_table('train.tsv'))[:,-1]  #reading label

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1)
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True,
                           intercept_scaling=1.0, class_weight=None, random_state=None)

X_all = traindata + testdata
lentrain = len(traindata)

print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

#Add Two Integer Columns
AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:,3])  #Not sure if I am doing this correctly. Expecting it to contain AlexaRank and GooglePageRank columns.
AlexaAndGoogleTestData = list(np.array(p.read_table('test.tsv'))[2:,3])
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData + AlexaAndGoogleTrainData

#Add two columns to X.
X = np.append(X, AllAlexaAndGoogleInfo, 1)  #Think I have done this incorrectly.

print "training on full data"
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:,1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."

Thank you very much for all feedback - please post if you need any further information!
I guess sklearn.preprocessing.StandardScaler would be the first thing you want to try. StandardScaler transforms all of your features into mean-0, std-1 features. This definitely gets rid of your first problem: AlexaRank will be guaranteed to be spread around 0 and bounded. (Yes, even massive AlexaRank values like 83904803289480 are transformed to small floating point numbers.) Of course, the results will not be integers between 1 and 10000, but they will maintain the same order as the original ranks. And in this case, keeping the rank bounded and normalized will help solve your second problem, as follows.

In order to understand why normalization helps in LR, let's revisit the logit formulation of LR:

logit(y) = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4 + b5*X5

where in your case X1, X2, X3 are three TF-IDF features and X4, X5 are the Alexa/Google rank related features. The linear form of the equation suggests that each coefficient represents the change in the logit of y with one unit change in a variable. Think what happens when your X4 is kept fixed at a massive rank value, say 83904803289480. In that case, the Alexa rank variable dominates your LR fit and a small change in a TF-IDF value has almost no effect on the fit. Now one might think that the coefficients should be able to adjust to small/large values to account for differences between these features. Not in this case: it's not only the magnitude of the variables that matters but also their range. Alexa rank definitely has a large range and should definitely dominate your LR fit in this case. Therefore, I guess normalizing all variables using StandardScaler to adjust their range will improve the fit.

Here is how you can scale the X matrix:

sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)

Don't forget to use the same scaler to transform X_test:

X_test = sc.transform(X_test)

Now you can use the fitting procedure etc.:

rd.fit(X, y)
rd.predict_proba(X_test)

Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html

Edit: the parsing and column merging part can be easily done using pandas; there is no need to convert the matrices into lists and then append them. Moreover, pandas dataframes can be directly indexed by their column names:

AlexaAndGoogleTrainData = p.read_table('train.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AlexaAndGoogleTestData = p.read_table('test.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData.append(AlexaAndGoogleTrainData)

Note that we are passing the header=0 argument to read_table to maintain the original header names from the tsv file, and also note how we can index using an entire set of columns. Finally, you can stack this new matrix with X using numpy.hstack:

X = np.hstack((X, AllAlexaAndGoogleInfo))

hstack horizontally combines two multi-dimensional array-like structures, provided their lengths are the same.
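A tiny runnable illustration of the first point (a sketch; the rank values are invented):

import numpy as np
from sklearn import preprocessing

# Even wildly different rank magnitudes come out as small, order-preserving values
ranks = np.array([[1.0], [9500.0], [83904803289480.0]])
print(preprocessing.StandardScaler().fit_transform(ranks))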
Regarding normalizing the numeric ranks, either scikit's StandardScaler or a logarithmic transform (or both) should work well enough.

For building up a working pipeline, I find my sanity greatly benefits from using the pandas package and the sklearn.pipeline utilities. Here is a simple script that should do what you need.

First, a couple of utility classes I always seem to need. It would be nice to have something like these in sklearn.pipeline or sklearn.utilities.

from sklearn import base

class Columns(base.TransformerMixin, base.BaseEstimator):
    def __init__(self, columns):
        super(Columns, self).__init__()
        self.columns_ = columns
    def fit(self, *args, **kwargs):
        return self
    def transform(self, X, *args, **kwargs):
        return X[self.columns_]

class Text(base.TransformerMixin, base.BaseEstimator):
    def fit(self, *args, **kwargs):
        return self
    def transform(self, X, *args, **kwargs):
        return X.apply("\t".join, axis=1, raw=False)

Now set up the pipeline. I used the SGDClassifier implementation of logistic regression, since it tends to be more efficient for high-dimensional data like text classification; I also usually find that hinge loss gives better results than logistic regression anyway.

from sklearn import linear_model as lin
from sklearn import metrics
from sklearn.feature_extraction import text as txt
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing as prep
import numpy as np
from pandas.io import parsers
import pandas as pd

pipe = Pipeline([
    ('feat', FeatureUnion([
        ('txt', Pipeline([
            ('txtcols', Columns(["WebsiteText"])),
            ('totxt', Text()),
            ('vect', txt.TfidfVectorizer()),
        ])),
        ('num', Pipeline([
            ('numcols', Columns(["AlexaRank", "GooglePageRank"])),
            ('scale', prep.StandardScaler()),
        ])),
    ])),
    ('clf', lin.SGDClassifier(loss="log")),
])

Next, train the model:

train = parsers.read_csv("train.csv")
pipe.fit(train, train.Label)

Finally, evaluate on the test data:

test = parsers.read_csv("test.csv")
tstlbl = np.array(test.Label)

print pipe.score(test, tstlbl)

pred = pipe.predict(test)
print metrics.confusion_matrix(tstlbl, pred)
print metrics.classification_report(tstlbl, pred)
print metrics.f1_score(tstlbl, pred)

prob = pipe.decision_function(test)
print metrics.roc_auc_score(tstlbl, prob)
print metrics.average_precision_score(tstlbl, prob)

You will probably not get very good results with everything using default settings like this, but it should give you a working baseline to work from. I can suggest some parameter settings that usually work for me if you like.
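As a follow-up on tuning, a hedged sketch of a grid search over the pipeline above (the grid values are illustrative; the double-underscore names follow the step names in the Pipeline and FeatureUnion):

from sklearn.model_selection import GridSearchCV  # lives in sklearn.grid_search in older versions

params = {
    'clf__alpha': [1e-5, 1e-4, 1e-3],
    'feat__txt__vect__min_df': [1, 3, 5],
}
gs = GridSearchCV(pipe, params, scoring='roc_auc', cv=5)
gs.fit(train, train.Label)
print(gs.best_params_)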