I want to implement prediction on the two data sets.
Pre-process the data:
Encode columns that contain text, normalize numerical columns, visualize the data.
Form a problem statement: select a model, train the model, evaluate the results.
www.kaggle.com/alfrednieves/notebooke0d8d2a5e2/edit
https://www.kaggle.com/questions-and-answers/217299
I need help with the solution.
import pandas as pd
import matplotlib.pyplot as plt
data2013=pd.read_csv('laser_incidents_2013.csv')
data2014=pd.read_csv('laser_incidents_2014.csv')
data2013.head()
data2014.head()
# Inspect column types and non-null counts
data2013.info()
data2014.info()
data2013.columns
data2014.columns
data2013 = data2013.rename(columns={'DATE': 'D1', 'Time_(UTC)': 'T1', 'Aircraft ID': 'AI1', 'No. A/C': 'N1',
                                    'ALT': 'A1', 'MAJOR CITY': 'MC1', 'COLOR': 'CO1', 'Injury Reported': 'IR1',
                                    'CITY': 'CI1', 'STATE': 'S1'})  # COLOR and CITY need distinct new names
data2014 = data2014.rename(columns={'DATE': 'D2', 'Time_(UTC)': 'T2', 'Aircraft ID': 'AI2', 'No. A/C': 'N2',
                                    'ALT': 'A2', 'MAJOR CITY': 'MC2', 'COLOR': 'CO2', 'Injury Reported': 'IR2',
                                    'CITY': 'CI2', 'STATE': 'S2'})
#Let's first import the preprocessing module
from sklearn import preprocessing
#Build one label encoder per data set (a second fit() would overwrite the first)
le2013 = preprocessing.LabelEncoder()
le2014 = preprocessing.LabelEncoder()
#fit() goes through each label column and finds the unique labels (the classes)
le2013.fit(data2013['S1'])
le2014.fit(data2014['S2'])
#The following lines print the labels found in each column
list(le2013.classes_)
list(le2014.classes_)
#The following lines convert the labels to arrays of numbers and store them back
data2013['S1'] = le2013.transform(data2013['S1'])
data2014['S2'] = le2014.transform(data2014['S2'])
data2013.head()
data2014.head()
#Visualize Data
#Note: X1, X2 and the label column L below are placeholder names from the lecture
#template; substitute the actual feature and label columns of the laser-incident data
import matplotlib.pyplot as plt
#Color every point by its class label
plt.scatter(data['X1'], data['X2'], c=data['L'])
#Or plot each class separately to get a legend
plt.scatter(data['X1'][data['L']==0], data['X2'][data['L']==0], label='class1', color='blue')
plt.scatter(data['X1'][data['L']==1], data['X2'][data['L']==1], label='class2', color='red')
plt.scatter(data['X1'][data['L']==2], data['X2'][data['L']==2], label='class3', color='green')
plt.legend()
from sklearn.model_selection import train_test_split
X_training, X_testing, Y_training, Y_testing = train_test_split(data[['X1','X2','X3','X4']], data['L'], test_size=0.3)
plt.scatter(X_training['X1'][Y_training==0],X_training['X2'][Y_training==0],label='class1',color='blue')
plt.scatter(X_training['X1'][Y_training==1],X_training['X2'][Y_training==1],label='class2',color='red')
plt.scatter(X_training['X1'][Y_training==2],X_training['X2'][Y_training==2],label='class3',color='green')
plt.title('Training Data')
plt.legend()
plt.figure()
plt.scatter(X_testing['X1'][Y_testing==0],X_testing['X2'][Y_testing==0],label='class1',color='blue')
plt.scatter(X_testing['X1'][Y_testing==1],X_testing['X2'][Y_testing==1],label='class2',color='red')
plt.scatter(X_testing['X1'][Y_testing==2],X_testing['X2'][Y_testing==2],label='class3',color='green')
plt.title('Test Data')
plt.legend()
MDC Classifier
Use the code described in the lectures and classify the data using an MDC (minimum distance classifier).
Then count the number of points that are misclassified in the training data and in the test data.
K-Nearest Neighbors Classifier
Use 5NN (5 nearest neighbors).
Then count the number of points that are misclassified in the training data and in the test data.
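For reference, here is a minimal sketch of both tasks, assuming the X_training/X_testing/Y_training/Y_testing split built above; sklearn's NearestCentroid is used here as the minimum distance classifier (the lecture code may differ):
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier
# NearestCentroid assigns each point to the class with the closest class mean (MDC)
mdc = NearestCentroid().fit(X_training, Y_training)
# 5NN: majority vote among the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_training, Y_training)
for name, model in [('MDC', mdc), ('5NN', knn)]:
    train_errors = (model.predict(X_training) != Y_training).sum()
    test_errors = (model.predict(X_testing) != Y_testing).sum()
    print(f'{name}: {train_errors} misclassified in training, {test_errors} in test')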
I have not clustered data in a while, and at the moment I have a massive list of accounts with their respective areas (or OUs in the table below).
I have used k-means and k-modes to try to cluster based on OU, meaning that I want the output to group the 17 OUs I have and cluster them based on the provided information. So far the output has clustered each record individually rather than each OU. Can someone help me figure out how to group the output and then cluster it? Below is a sample of the code used.
# Building the model with 3 clusters
from kmodes.kmodes import KModes
kmode = KModes(n_clusters=3, init="random", n_init=5, verbose=1)
clusters = kmode.fit_predict(df)
clusters
#insert the predicted cluster values in our original dataset.
df.insert(0, "Cluster", clusters, True)
df.head(10)
I don't have access to your data set, but below is a generic example of how to do clustering.
# Cluster analysis, or clustering, is an unsupervised machine learning task.
# It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling),
# clustering algorithms only interpret the input data and find natural groups or clusters in feature space.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X = df_cars[['mpg','hp']].copy()  # copy so adding the cluster column below doesn't warn
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans']=yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
See the link below for more details.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
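Since the goal above was to cluster the 17 OUs rather than the individual records, one option (a sketch only, assuming a hypothetical 'OU' column holding the area of each account) is to aggregate the records per OU first and then cluster the aggregated rows:
import pandas as pd
from sklearn.cluster import KMeans
# One-hot encode the categorical features, average them per OU, and cluster
# the resulting 17 OU-level profile rows instead of the raw records
ou_profiles = pd.get_dummies(df.drop(columns='OU')).groupby(df['OU']).mean()
ou_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ou_profiles)
print(dict(zip(ou_profiles.index, ou_clusters)))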
I am working on a binary classification using a random forest model and neural networks, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot.
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show] # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0],ord_test_t.iloc[row_to_show])
This generated the waterfall plot as expected.
However, I want to export the SHAP values to a dataframe; how can I do that? I expect the output to be a dataframe of per-feature SHAP values, and I want to export it for the full dataframe, not just one row. Can you help me please?
Let's do a small experiment:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer
# as_frame=True keeps X a DataFrame, so X.columns is available below
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
What is explainer here? If you do dir(explainer) you'll find out it has some methods and attributes among which is:
explainer.expected_value
which is of interest to you because this is the base value on which the SHAP values add up.
Furthermore:
sv = explainer.shap_values(X)
len(sv)
will give a hint that sv is a list of 2 objects, which are most probably the SHAP values for class 1 and class 0. These must be symmetric, because whatever moves a prediction towards 1 moves it by exactly the same amount, but with the opposite sign, towards 0.
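A quick sketch to verify that symmetry, assuming sv is the two-element list just described:
import numpy as np
print(np.allclose(sv[0], -sv[1]))  # expected: True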
Hence:
sv1 = sv[1]
Now you have everything to pack it to the desired format:
df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])
Q: How do I know?
A: Read docs and source code.
If I recall correctly, you can do something like this with pandas
import pandas as pd
shap_values = explainer.shap_values(data_for_prediction)
shap_values_df = pd.DataFrame(shap_values)
To get the feature names, you can do something like this (if data_for_prediction is a dataframe):
feature_names = data_for_prediction.columns.tolist()
shap_df = pd.DataFrame(shap_values, columns=feature_names)  # shap_values() returns plain arrays, not an Explanation with .values
I'm currently using this:
import shap
import pandas as pd

def getShapReport(classifier, X_test):
    shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
    shap.summary_plot(shap_values[1], X_test)
    return pd.DataFrame(shap_values[1])
It first displays the SHAP values for the model and for each prediction after that, and finally it returns the dataframe for the positive class (I'm in an imbalanced context).
It is for a TreeExplainer and not a waterfall plot, but it is basically the same idea.
I have a KMeans function I made. It takes the input def kmeans(x, k, no_of_iterations): and returns points, centroids. It gets plotted perfectly; the code for that isn't very relevant. But I want to calculate the silhouette score for it and graph this for each value.
#Load Data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

data = load_digits().data
pca = PCA(2)
#Transform the data
df = pca.fit_transform(data)
X = df
#y = kmeans.fit_predict(X)
#Applying our function
label, centroids = kmeans(df,10,1000)#returns points value and centroids
y = label  # the cluster assignment of each point, as returned by kmeans()
#Visualize the results
u_labels = np.unique(label)
for i in u_labels:
    plt.scatter(df[label == i, 0], df[label == i, 1], label=i)
plt.scatter(centroids[:,0] , centroids[:,1] , s = 80, color = 'k')
plt.legend()
plt.show()
The above is the code for running the KMeans plot.
Below is my attempt to calculate the silhouette score. It is from an example that imports KMeans from sklearn, but I don't really want to do that, nor did it work with my code.
from sklearn.metrics import silhouette_score, silhouette_samples
silhouette_avg = silhouette_score(X, y)
print("The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, y)
You may notice the value used for y here. I have found that y is supposed to be the number of clusters, I think? So I set it to 10 at first, and it gave an error message. From this code, can anyone tell me what I should do next to get this value?
Try this:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
mpl.rcParams["figure.figsize"] = (9,6)
# Generate synthetic dataset with 8 blobs
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, shuffle=True, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof()
# Instantiate the clustering model and visualizer
model = KMeans(8)
visualizer = SilhouetteVisualizer(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
Also, see this.
https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html
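If you would rather stay with the custom kmeans() from the question, note that the second argument to silhouette_score is the array of cluster labels, not the number of clusters. A minimal sketch, assuming label holds one cluster index per row of df:
from sklearn.metrics import silhouette_score, silhouette_samples
# label: the per-point cluster assignments returned by the custom kmeans() above
silhouette_avg = silhouette_score(df, label)
print("The average silhouette_score is:", silhouette_avg)
# Per-sample silhouette values, e.g. for a silhouette plot
sample_silhouette_values = silhouette_samples(df, label)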
df.head() returns the first five rows, so why is [0:5] used in the code, and what is the use of the X[0:5] written after the 'preprocessing' line? And in the KNeighborsClassifier part, after fitting 'neigh', why is another single 'neigh' used on the very next line? Please help, thank you.
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/teleCust1000t.csv')
df.head()
df['custcat'].value_counts()
df.hist(column='income', bins=50)
df.columns
X = df[['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed', 'employ', 'retire', 'gender', 'reside']].values  #.astype(float)
X[0:5]
y = df['custcat'].values
y[0:5]
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
from sklearn.neighbors import KNeighborsClassifier
k = 4
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
yhat = neigh.predict(X_test)
yhat[0:5]
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
Step 3 of CRISP-DM is "Explore and clean your data". During this step you need to have a look at the data in order to judge its quality, so you need some visible output after each step.
This code is used in a Jupyter notebook. Jupyter notebooks will display the result of an expression. It acts like a print(), but smarter, so it will display images, tables etc.
However, it will only generate that output if it's the last line of a cell. Copying and pasting all the code into a single cell will give you less output, and it'll be harder to follow the analysis, because you don't see all the steps in between.
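For example, putting two expressions in one cell only displays the second:
df.head()   # not displayed: it is not the last line of the cell
X[0:5]      # displayed automatically, because it is the last expression in the cell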
df.head(), for instance, outputs the first five rows of the dataframe as a table. Similarly, other seemingly useless statements (such as X[0:5] or a bare neigh) will generate output.
From the comments:
Would you please explain why they used the first three [0:5] expressions, and more importantly the third [0:5]? And why is 'neigh' used in the KNeighborsClassifier part after the line neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)?
X[0:5] will display the first 5 values of X. y[0:5] will display the first 5 values of y. And a bare neigh will display a representation of whatever neigh is.
And last one please. What does bins=50 mean?
It tells hist() to create 50 bars in the diagram instead of 10. That way you can better judge the distribution of the values.
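For example, with the dataframe from the question:
df.hist(column='income', bins=10)   # the default: 10 wide bars
df.hist(column='income', bins=50)   # 50 narrower bars, a finer view of the distribution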
The nice thing about a Jupyter notebook (which you don't seem to have discovered yet) is that you can easily change the statements and immediately get new output. Get a fresh Jupyter notebook and type one line of code. Press Ctrl+Enter, then do the same for the next line. You'll see what it does.
I have a plot of median listing prices over time, and now I want to add a trend line to it. How do I do that?
I wanted to plot how the median listing price in California has gone up over the years, so I did this:
# Get California data
state_ca = []
state_median_price = []
state_ca_month = []
for state, price, date in zip(data['ZipName'], data['Median Listing Price'], data['Month']):
    if ", CA" not in state:
        continue
    state_ca.append(state)
    state_median_price.append(price)
    state_ca_month.append(date)
Then I converted the strings in state_ca_month to datetime:
# Convert state_ca_month to datetime
from datetime import datetime
state_ca_month = [datetime.strptime(x, '%m/%d/%Y %H:%M') for x in state_ca_month]
Then I plotted it:
# Plot trends
import matplotlib.pyplot as plt
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(state_ca_month, state_median_price)
plt.show()
I thought of adding a trendline or some type of line but I am new to visualization. If anyone has any other suggestions I would appreciate it.
Following the advice in the comments, I now get a scatter plot instead.
I am wondering if I should further format the data to make a clearer plot to examine.
If by "trend line" you mean a literal line, then you probably want to fit a linear regression to your data. sklearn provides this functionality in python.
From the example hyperlinked above:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
To clarify, "the overall trend" is not a well-defined thing. Many times, by "trend", people mean a literal line that "fits" the data well. By "fits the data", in turn, we mean "predicts the data." Thus, the most common way to get a trend line is to pick a line that best predicts the data that you have observed. As it turns out, we even need to be clear about what we mean by "predicts". One way to do this (and a very common one) is by defining "best predicts" in such a way as to minimize the sum of the squares of all of the errors between the "trend line" and the observed data. This is called ordinary least squares linear regression, and is one of the simplest ways to obtain a "trend line". This is the algorithm implemented in sklearn.linear_model.LinearRegression.
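Applied to the data in the question, a minimal sketch might look like this (assuming the state_ca_month and state_median_price lists built above; the datetimes must be converted to numbers before fitting):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Convert datetimes to a numeric axis: days since the first observation
start = min(state_ca_month)
t = np.array([(d - start).days for d in state_ca_month]).reshape(-1, 1)
prices = np.array(state_median_price)
# Fit an ordinary least squares line and draw it over the raw series
trend = LinearRegression().fit(t, prices)
plt.plot(state_ca_month, prices, label='median listing price')
plt.plot(state_ca_month, trend.predict(t), color='red', label='trend')
plt.legend()
plt.show()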