I am new to machine learning.
I created a data set of random numbers in two groups. I am trying to work out which group a sample belongs to, but when I run the following I get a very low accuracy score:
from random import randint as R
from matplotlib import pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.model_selection import train_test_split as tts  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score
a = [R(100,200) for x in range(100)]
b = [R(1000,2000) for x in range(100)]
c = a+b
X = np.array(c).reshape(len(c),1)
y = np.arange(len(c))
train_X, test_X, train_y,test_y = tts(X,y,test_size=0.4)
mimi = KNC()
mimi.fit(train_X, train_y)
y__pred = mimi.predict(train_X)
print(accuracy_score(train_y,y__pred))
print(mimi.score(train_X,train_y))
I receive a result of 0.18... What exactly does this mean? That a prediction score is just 18%? Please, can you explain to me in most simple way. I would really appreciate it.
By doing y = np.arange(len(c)) you create len(c) different classes (here 200 classes), with only one example for each class. Learning nearest neighbors on such a setup does not make sense.
What you want (if I'm guessing right) is to have one class for the data in a and another class for the data in b.
Change y to:
y = np.concatenate([[0] *len(a), [1] *len(b)])
You'll see that you obtain an accuracy score of 1.0, which means that every example you score is classified correctly.
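As a minimal sketch of the fix above, assuming a recent scikit-learn (where train_test_split lives in sklearn.model_selection) and using a seeded NumPy generator in place of random.randint, the two-class version could look like this. The names rng, labels and accuracy are illustrative only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Two well-separated groups of random integers, as in the question
a = rng.integers(100, 201, size=100)
b = rng.integers(1000, 2001, size=100)
X = np.concatenate([a, b]).reshape(-1, 1)

# One label per group: 0 for values from a, 1 for values from b
labels = np.concatenate([np.zeros(len(a), dtype=int), np.ones(len(b), dtype=int)])

train_X, test_X, train_y, test_y = train_test_split(
    X, labels, test_size=0.4, random_state=0, stratify=labels
)

model = KNeighborsClassifier()  # default n_neighbors=5
model.fit(train_X, train_y)

# Score on the held-out test split rather than on the training data
accuracy = accuracy_score(test_y, model.predict(test_X))
print(accuracy)
```

Scoring on test_X instead of train_X gives a more honest estimate. With groups this far apart, the score should be at or very near 1.0.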
I have a csv file with 10.000 entries and 4 columns. I need to detect the outliers in this data and I am using Sklearn's isolation forest. I get decent results, but I have no idea how to print/calculate the score(between 0.0 and 1.0) and also draw the confusion matrix. I will post the code here so you can see my implementation of Isolation forest.
Please try to keep your explanation on a lower level since my skill with Machine Learning is minimal. Thank you.
Also, my code is from a Jupyter notebook so I might have pasted it wrong, but this is the code.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
#Reading CSV
df = pd.read_csv('usecase999.csv')
df.info()
df
anomaly_inputs = ['c0', 'ya0'] #name of the columns, ya0 shows if the value(c0) is an anomaly or not(not reliable enough)
model_IF = IsolationForest(contamination=0.01,random_state=37)
model_IF.fit(df[anomaly_inputs])
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])
df.loc[:, ['c0', 'ya0', 'anomaly_scores', 'anomaly']]
# plotting function
def outlier_plot(data, name, x_var, y_var, xaxis=[0,1], yaxis=[0,1]):
    print(f'Algorithm: {name}')
    method = f'{name}_anomaly'
    print(f"Nr of anomalies {len(data[data['anomaly']==-1])}")
    print(f"Nr of non anomalies {len(data[data['anomaly']==1])}")
    print(f"Nr of values {len(data)}")
    g = sns.FacetGrid(data, col='anomaly', height=4, hue='anomaly', hue_order=[1,-1])
    g.map(sns.scatterplot, x_var, y_var)
    g.fig.suptitle(f'Algorithm: {name}', y=1.10, fontweight='bold')
    g.set(xlim=xaxis, ylim=yaxis)
    axes=g.axes.flatten()
    axes[0].set_title(f"Outliers\n{len(data[data['anomaly']==-1])} points")
    axes[1].set_title(f"Inliers\n{len(data[data['anomaly']==1])} points")
    return g
outlier_plot(df, "Isolation Forest", "c0", "ya0", [0, 5], [0,1.5])
plt.show()
I have not tried anything yet, since I do not have a clue how to access the confusion matrix and the algorithm score.
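Isolation Forest is an unsupervised algorithm, which means it is trained without true labels. Therefore, common evaluation metrics such as accuracy, precision, recall and F1-score can only be computed if you have ground-truth labels from somewhere else.
Since your data has the ya0 column, you can use the confusion_matrix() function, provided both columns use the same encoding (predict() returns -1 for outliers and 1 for inliers):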
true_labels = df['ya0'] # true labels
pred_labels = df['anomaly'] # predicted labels
conf_matrix = confusion_matrix(true_labels, pred_labels)
print(conf_matrix)
Let me know if it works. You can also calculate the mean, the standard deviation, and the histogram of the anomaly scores:
import numpy as np
# mean anomaly scores
mean_scores = np.mean(df['anomaly_scores'])
print("Mean of anomaly scores: ", mean_scores)
# standard deviation
std_scores = np.std(df['anomaly_scores'])
print("Standard deviation of anomaly scores: ", std_scores)
# histogram
plt.hist(df['anomaly_scores'], bins=50)
plt.xlabel('Anomaly Scores')
plt.ylabel('Frequency')
plt.show()
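To calculate the AUC, use the roc_auc_score() function (you need true labels in your data; otherwise you can't calculate the AUC). Keep in mind that decision_function() returns lower scores for more abnormal points, so the scores may need to be negated depending on how ya0 encodes anomalies: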
from sklearn.metrics import roc_auc_score
# true labels
y_true = df['ya0']
# predicted scores
y_scores = df['anomaly_scores']
# calculate the AUC
auc = roc_auc_score(y_true, y_scores)
print("AUC: ", auc)
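The following sketch puts these pieces together on synthetic data, since the original CSV is not available. It assumes a ground-truth column that uses 1 for anomalies and 0 for normal points (the real encoding of ya0 is not shown in the question), and it fits the model on the feature column only. All variable names and the synthetic values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(37)

# Synthetic stand-in for the CSV: 990 normal values and 10 obvious anomalies
normal = rng.normal(loc=1.0, scale=0.1, size=990)
odd = rng.uniform(low=4.0, high=5.0, size=10)
df = pd.DataFrame({
    "c0": np.concatenate([normal, odd]),
    # assumed encoding: 1 = anomaly, 0 = normal
    "ya0": np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)]),
})

model_IF = IsolationForest(contamination=0.01, random_state=37)
model_IF.fit(df[["c0"]])

# predict() returns -1 for outliers and 1 for inliers; map that to 1/0
pred_labels = (model_IF.predict(df[["c0"]]) == -1).astype(int)

# Rows are true labels, columns are predicted labels, in the order [0, 1]
conf_matrix = confusion_matrix(df["ya0"], pred_labels, labels=[0, 1])
print(conf_matrix)

# decision_function() is lower for more abnormal points, so negate it
# to get a score that increases with how anomalous a point is
anomaly_score = -model_IF.decision_function(df[["c0"]])
auc = roc_auc_score(df["ya0"], anomaly_score)
print("AUC:", auc)
```

If the true labels are noisy, as the question suggests for ya0, these numbers measure agreement with that column rather than true detection quality.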
df.head() returns the first five rows, so why is [0:5] used in the code, and what is the purpose of the X[0:5] written after the 'preprocessing' line? Also, in the KNeighborsClassifier part, after fitting 'neigh', why is a bare 'neigh' written on the very next line? Please help, thank you.
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/teleCust1000t.csv')
df.head()
df['custcat'].value_counts()
df.hist(column='income', bins=50)
df.columns
X = df[['region', 'tenure','age', 'marital', 'address', 'income', 'ed', 'employ','retire', 'gender', 'reside']] .values #.astype(float)
X[0:5]
y = df['custcat'].values
y[0:5]
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
from sklearn.neighbors import KNeighborsClassifier
k = 4
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
yhat = neigh.predict(X_test)
yhat[0:5]
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
Exploring and cleaning your data is an early part of CRISP-DM (its data understanding and data preparation phases). During this step you need to look at the data in order to judge its quality. So you need some visible output after each step.
This code is used in a Jupyter notebook. Jupyter notebooks will display the result of an expression. It acts like a print(), but smarter, so it will display images, tables etc.
However, it will only generate that output if it's the last line of a cell. Copying and pasting all the code into a single cell will give you less output and it'll be harder to follow the analysis, because you don't see all the steps in between.
Here's a short example of what df.head() might output
Similarly, other seemingly useless statements will generate output:
From the comments:
Would you please answer why they used first 3 [0:5], more importantly 3rd [0:5]? And why 'neigh' is used in "KNeighbourClassifier" part after the line "neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)"?
X[0:5] will display the first 5 rows of X. y[0:5] will display the first 5 values of y. And neigh will display whatever neigh is.
And last one please. What does bins=50 mean?
It tells hist() to create 50 bars in the diagram instead of the default 10. That way you can better judge the distribution of values.
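To make the effect of bins concrete, here is a small sketch using numpy.histogram, which bins values in the same equal-width way but returns the counts instead of drawing them. The income values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=4.0, sigma=0.8, size=1000)  # hypothetical skewed incomes

# Default: the value range is cut into 10 equal-width bins
counts_10, edges_10 = np.histogram(income)

# bins=50: the same range is cut into 50 narrower bins
counts_50, edges_50 = np.histogram(income, bins=50)

print(len(counts_10), len(counts_50))  # number of bars in each histogram
print(edges_50[1] - edges_50[0])       # width of one of the 50 bins
```

Every value still lands in exactly one bin. Only the resolution of the picture changes.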
The nice thing about a Jupyter notebook (which you don't seem to have discovered yet) is that you can easily change the statements and immediately get new output. Get a fresh Jupyter notebook, type one line of code. Press Ctrl+Enter, then do the same for the next line. You'll see what it does.
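If you run the same code as a plain Python script instead of a notebook, bare expressions such as df.head() or X[0:5] produce no output at all, so you need print() to see them. A minimal sketch with a made-up DataFrame (the column names are borrowed from the question, the values are invented):

```python
import pandas as pd

# Tiny stand-in for the telecom data set
df = pd.DataFrame({
    "region": [2, 3, 3, 2, 2, 1],
    "tenure": [13, 11, 68, 33, 23, 41],
    "custcat": [1, 4, 3, 1, 3, 3],
})

X = df[["region", "tenure"]].values
y = df["custcat"].values

X[0:5]         # evaluated, but nothing is shown when run as a script
print(X[0:5])  # first 5 rows of X
print(y[0:5])  # first 5 values of y

first_rows = df.head()  # head() returns the first 5 rows by default
print(first_rows)
```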
I would like to perform a simple linear regression using statsmodels. I've tried several different methods by now, but I just can't get it to work. The code I have now doesn't give me any errors, but it also doesn't show me the result.
I am trying to create a model for the variable "Direction", which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explanatory variables are the (5) lags of the returns. df13 contains the lags and also the direction for each observed date. I tried this code and, as I mentioned, it doesn't give an error but only prints:
Optimization terminated successfully.
Current function value: 0.682314
Iterations 5
However, I would like to see the typical table with all the beta values, their significance etc.
Also, since Direction is a binary variable, would it be better to use a logit instead of a linear model? In the assignment, however, it appeared as a linear model.
And lastly, I am sorry it's not displayed here correctly, but I don't know how to format text as code or insert my dataframe.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression so I updated that part
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not familiar with statsmodels, but I would try converting your dataframes to numpy arrays. You can do this with
Xnum = X.to_numpy()
ynum = Y.to_numpy()
And try passing those to the regressors.
I want to get the marginal effects of a logistic regression from a sklearn model.
I know you can get these for a statsmodels logistic regression using '.get_margeff()'. Is there nothing for sklearn? I want to avoid doing the calculation myself, as I feel there would be a lot of room for error.
import statsmodels.api as sm  # Logit is exposed by statsmodels.api
from statsmodels.tools.tools import add_constant
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
data = load_breast_cancer()
x = data.data
y= data.target
x=add_constant(x,has_constant='add')
model = sm.Logit(y, x).fit_regularized()
margeff = model.get_margeff(dummy=True,count=True)
## print the marginal effects
print(margeff.margeff)
>> [ 6.73582136e-02 2.15779589e-04 1.28857837e-02 -1.06718136e-03
-1.96032750e+00 1.36137385e+00 -1.16303369e+00 -1.37422595e+00
8.14539021e-01 -1.95330095e+00 -4.86235558e-01 4.84260993e-02
7.16675627e-02 -2.89644712e-03 -5.18982198e+00 -5.93269894e-01
3.22934080e+00 -1.28363008e+01 3.07823155e+00 5.84122170e+00
1.92785670e-02 -9.86284081e-03 -7.53298463e-03 -3.52349287e-04
9.13527446e-01 1.69938656e-01 -2.89245493e-01 -4.65659522e-01
-8.32713335e-01 -1.15567833e+00]
# manual calculation, doing this as you can get the coef_ from a sklearn model and use in the function
def PDF(XB):
var1 = np.exp(XB)
var2 = np.power((1+np.exp(XB)),2)
var3 = (var1 / var2)
return var3
arrPDF = PDF(np.dot(x,model.params))
ME=pd.DataFrame(np.dot(arrPDF[:,None],model.params[None,:]))
print(ME.iloc[:,1:].mean().to_list())
>>
[0.06735821358791198, 0.0002157795887363032, 0.012885783711597246, -0.0010671813611730326, -1.9603274961356965, 1.361373851981879, -1.1630336876543224, -1.3742259536619654, 0.8145390210646809, -1.9533009514684947, -0.48623555805230195, 0.04842609927469917, 0.07166756271689229, -0.0028964471200298475, -5.189821981601878, -0.5932698935239838, 3.229340802910038, -12.836300822253634, 3.0782315528664834, 5.8412217033605245, 0.019278567008384557, -0.009862840813512401, -0.007532984627259091, -0.0003523492868714151, 0.9135274456151128, 0.16993865598225097, -0.2892454926120402, -0.46565952159093893, -0.8327133347971125, -1.1556783345783221]
The custom function gives the same result as ".get_margeff()", but there might be a lot of room for error when using the sklearn coef_ in the custom function above.
Is there some method/function/attribute in sklearn that can give me the marginal effects?
If there is not, is there another library that can get from the coef_ and the data to the marginal effects?
If the answer to both of the above is no, are there any circumstances in which the custom function will not work (e.g. with a particular solver or penalty in sklearn)?
I ran into this need a few days ago.
My supervisor gave me this information, which I want to share. Hope this can help you.
partial_dependence: This function computes the partial dependence, which may be what you mean by marginal effects.
plot_partial_dependence: This function plots the partial dependence.
Here is the sample code from the API Reference.
scikit-learn version: 0.21.2
from sklearn.inspection import plot_partial_dependence, partial_dependence
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
%matplotlib inline
X, y = make_friedman1()
# case1: linear model
lm = LinearRegression().fit(X, y)
# plot the partial dependence
plot_partial_dependence(lm, X, [0, (0, 1)])
# get the partial dependence
partial_dependence(lm, X, [0])
# case2: gradient boosting regressor
clf = GradientBoostingRegressor(n_estimators=10).fit(X, y)
# plot the partial dependence
plot_partial_dependence(clf, X, [0, (0, 1)])
# get the partial dependence
partial_dependence(clf, X, [0])
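As far as I know, scikit-learn has no direct counterpart to get_margeff(). For a plain logistic model, the average marginal effect of a continuous feature j is the mean of p(1-p) times coef_j, which is what the custom function in the question computes. A hedged sketch of that calculation from a fitted LogisticRegression follows. The helper name average_marginal_effects is hypothetical, the data is synthetic, and every feature is treated as continuous (no special handling for dummy or count variables). Note that LogisticRegression applies an L2 penalty by default, so its coefficients, and therefore these effects, will generally differ from an unpenalized statsmodels fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def average_marginal_effects(clf, X):
    """Average marginal effects for a binary LogisticRegression.

    Hypothetical helper: computes mean(p * (1 - p)) * coef_j for each
    feature j, treating all features as continuous.
    """
    p = clf.predict_proba(X)[:, 1]        # P(y = 1 | x) for every row
    density = p * (1.0 - p)               # logistic pdf at x @ coef + intercept
    return density.mean() * clf.coef_[0]  # one effect per feature

X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

ame = average_marginal_effects(clf, X)
print(ame)
```

Because the helper relies only on predict_proba and coef_, the solver should not matter. The penalty does matter, since it changes the coefficients themselves.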
I am trying to evaluate an sklearn predictor which I have made over a larger-than-memory dask array of inputs. I have read over the ParallelPostFit documentation https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.wrappers.ParallelPostFit.html and am still having some problems. The following code illustrates the kind of issue that I am running into:
from dask.base import tokenize
import numpy as np
import dask.array as da
from dask.array import Array
from sklearn.linear_model import LinearRegression
from dask_ml.wrappers import ParallelPostFit
"""
for stack overflow question
"""
x = np.linspace(0,100,100,dtype=np.int32)
y = np.linspace(0,100,100,dtype=np.int32)
z = np.linspace(0,100,100,dtype=np.int32)
Y = np.random.normal(size=(100,))
X = np.stack([x,y,z],axis=1)
reg = LinearRegression().fit(X,Y)
#now try to compute on dask arrays over the whole space
x= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
y= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
z= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
x,y,z = da.meshgrid(x,y,z,sparse=False,indexing='ij')
stacked = da.stack([x.flatten(),y.flatten(),z.flatten()],axis=1)
clf = ParallelPostFit(estimator=reg)
clf.predict(stacked)
Executing clf.predict throws a ValueError: "Can't drop an axis with more than 1 block. Please use atop instead.",
which I don't understand how to correct.
Thank You for any help.
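A likely cause, offered as a sketch rather than a confirmed diagnosis: da.stack(..., axis=1) creates one block per stacked array along the new feature axis, so the input has three blocks along axis 1, while ParallelPostFit.predict applies the estimator block by block and drops that axis. Rechunking so the feature axis is a single block before calling predict may resolve the error. The snippet below uses only dask.array with small made-up inputs to show the chunk layout before and after.

```python
import dask.array as da

# Three 1-D inputs, each split into 10 chunks of 10 elements
x = da.arange(100, chunks=10)
y = da.arange(100, chunks=10)
z = da.arange(100, chunks=10)

stacked = da.stack([x, y, z], axis=1)
print(stacked.chunks[1])  # one block per stacked array along the feature axis

# Merge the feature axis into a single block; -1 means "the full length"
fixed = stacked.rechunk({1: -1})
print(fixed.chunks[1])
```

Applied to the question's code, this would mean calling clf.predict(stacked.rechunk({1: -1})), assuming the chunking along axis 1 is indeed the problem.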