I'm trying to retrieve the data points belonging to a specific cluster in Spark. In the following piece of code the data is made up, but I do obtain the predicted cluster.
Here is the code I have so far:
from pyspark.mllib.clustering import KMeans
import numpy as np

# Example data
flight_routes = np.array([[1, 3, 2, 0],
                          [4, 2, 1, 4],
                          [3, 6, 2, 2],
                          [0, 5, 2, 1]])
flight_routes = sc.parallelize(flight_routes)

model = KMeans.train(rdd=flight_routes, k=500, maxIterations=10)

route_test = np.array([[0, 2, 3, 4]])
test = sc.parallelize(route_test)

prediction = model.predict(test)
cluster_number_predicted = prediction.collect()
print(cluster_number_predicted)  # it returns [100] <-- COOL!!
Now I'd like to get all the data points belonging to cluster number 100. How do I get those?
What I want to achieve is something like the answer given to this SO question: Cluster points after KMeans (Sklearn).
Thank you in advance.
If you want both the record and the prediction (and are not willing to switch to Spark ML) you can zip the two RDDs:
predictions_and_values = model.predict(test).zip(test)
and filter afterwards:
predictions_and_values.filter(lambda x: x[0] == 100)  # x is (prediction, point)
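Putting it together (a minimal sketch; sc, the trained model, and the test RDD are assumed from the snippets above), you can collect the points assigned to cluster 100 like this:

# pair each point with its predicted cluster id: (prediction, point)
predictions_and_values = model.predict(test).zip(test)
# keep only the points in cluster 100 and bring them back to the driver
points_in_cluster_100 = (predictions_and_values
                         .filter(lambda x: x[0] == 100)
                         .map(lambda x: x[1])
                         .collect())
print(points_in_cluster_100)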
Related
I have 100 clusters, each with a mean and standard deviation value. These clusters are predefined using the SPSS software package, using the two-step cluster method, so the optimisation of these cluster distributions to fit the data has already been done.
For new (unseen) data, we want to assign cluster membership by selecting the maximum log-likelihood cluster for any given set of coordinates X. To do this, I have written my own code for comparison with what is output by SPSS using the same method: https://www.norusis.com/pdf/SPC_v19.pdf
Using data that has been correctly labelled by SPSS, about 42% of the cluster assignments are correct when minimising the RMSE to the cluster mean (which is not what SPSS does), and fewer than 20% of the assignments are correct when my code assigns the maximum log-likelihood cluster (which is what SPSS reports to do).
I know that the maximum log-likelihood cluster should be the correct cluster (https://www.norusis.com/pdf/SPC_v19.pdf), but there is only a 20% success rate from this code when compared to the correct cluster labels from SPSS. What am I doing wrong?
Here is the code below.
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import math
from scipy import stats

# import raw files
clusters_df = pd.read_csv('ClusterCoordinates.csv')  # clusters are in order of cluster number, so the index identifies them
clusters_df = clusters_df.drop(columns=['Cluster'])
print(clusters_df.shape)
clusters = clusters_df.to_numpy()

frames_df_raw = pd.read_csv('FrameCoordinates.csv')
frames_df = frames_df_raw.drop(columns=['frame','replica','voltage','system','ff','cluster'])
print(frames_df.shape)
frames = frames_df.to_numpy()

clusters_sd_df = pd.read_csv('ClusterCoordinates_SD.csv')
clusters_sd_df = clusters_sd_df.drop(columns=['Cluster'])
print(clusters_sd_df.shape)
clusters_sd = clusters_sd_df.to_numpy()

rmseCalc = []
llCalc = []
assignedCluster_RMSE = []
assignedCluster_LL = []

# create tables with RMSE and LL values
for frame in frames:
    for cluster, cluster_sd in zip(clusters, clusters_sd):
        # we compare cluster assignment using minimum-RMSE vs maximum-log-likelihood methods
        rmseCalc.append(math.sqrt(mean_squared_error(np.array(cluster), np.array(frame))))
        llCalc.append(-np.sum(stats.norm.logpdf(frame, loc=cluster, scale=cluster_sd)))
    rmseCalc = np.array(rmseCalc)
    llCalc = np.array(llCalc)
    llCalc = np.nan_to_num(llCalc)
    minRMSE = np.where(rmseCalc == rmseCalc.min())
    maxLL = np.where(llCalc == llCalc.min())  # llCalc holds negative log-likelihoods, so min = max likelihood
    print(maxLL[0][0] + 1)
    assignedCluster_RMSE.append(minRMSE[0][0] + 1)
    assignedCluster_LL.append(maxLL[0][0] + 1)
    rmseCalc = []
    llCalc = []

frames_df_raw['predCluster_RMSE'] = np.array(assignedCluster_RMSE)
frames_df_raw['predCluster_LL'] = np.array(assignedCluster_LL)
frames_df_raw.to_csv('frames_clustered.csv')
I was expecting the cluster labels assigned by the code to match those already assigned by SPSS, since the methods used are intended to be the same.
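For reference, here is a condensed, vectorized version of the maximum log-likelihood assignment described above (a sketch only, assuming independent Gaussian dimensions per cluster and the same frames/clusters/clusters_sd arrays as in the code):

from scipy import stats
import numpy as np

def assign_max_ll(frames, clusters, clusters_sd):
    # log-likelihood of every frame under every cluster's diagonal Gaussian,
    # shape (n_frames, n_clusters)
    ll = np.array([
        stats.norm.logpdf(frames, loc=mu, scale=sd).sum(axis=1)
        for mu, sd in zip(clusters, clusters_sd)
    ]).T
    return ll.argmax(axis=1) + 1  # 1-based cluster labels, matching the code above

labels = assign_max_ll(frames, clusters, clusters_sd)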
I am working on a binary classification problem using a random forest model and neural networks, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below.
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show] # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0],ord_test_t.iloc[row_to_show])
This generated the plot as shown below
However, I want to export this to a dataframe; how can I do that?
I expect my output to look like the example shown below, and I want to export this for the full dataframe. Can you help me please?
Let's do a small experiment:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer

# as_frame=True so X is a DataFrame and has .columns (used further down)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
What is explainer here? If you do dir(explainer) you'll find it has a number of methods and attributes, among which is:
explainer.expected_value
which is of interest to you, because this is the base value on top of which the SHAP values add up.
Furthermore:
sv = explainer.shap_values(X)
len(sv)
will give a hint that sv is a list consisting of 2 objects, which are most probably the SHAP values for class 1 and class 0. These must be symmetric, because whatever moves a prediction towards 1 moves it by exactly the same amount, but with the opposite sign, towards 0.
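You can check that symmetry directly (a quick sanity check, reusing sv from above):

import numpy as np
# for a binary classifier the two classes' SHAP values are mirror images
print(np.allclose(sv[0], -sv[1]))  # expected: True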
Hence:
sv1 = sv[1]
Now you have everything to pack it to the desired format:
df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])
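As a further sanity check (a sketch reusing model, explainer, and sv1 from above), the base value plus each row's SHAP values should reconstruct the predicted probability for class 1:

import numpy as np
# base value + sum of per-feature SHAP values should equal the model output for class 1
pred = model.predict_proba(X)[:, 1]
reconstructed = explainer.expected_value[1] + sv1.sum(axis=1)
print(np.allclose(pred, reconstructed))  # expected: True, up to floating point error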
Q: How do I know?
A: Read docs and source code.
If I recall correctly, you can do something like this with pandas:
import pandas as pd

shap_values = explainer.shap_values(data_for_prediction)
# for a classifier, shap_values is a list with one array per class, so pick a class first
shap_values_df = pd.DataFrame(shap_values[1])
To get the feature names, you can do something like this (if data_for_prediction is a dataframe):
feature_names = data_for_prediction.columns.tolist()
shap_df = pd.DataFrame(shap_values[1], columns=feature_names)
I'm currently using this:
import shap
import pandas as pd

def getShapReport(classifier, X_test):
    shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
    shap.summary_plot(shap_values[1], X_test)
    return pd.DataFrame(shap_values[1])
It first displays the SHAP summary plot for the model as a whole, then the one for the positive class, and finally returns the dataframe for the positive class (I'm working in an imbalanced context).
It is for a Tree explainer and not a waterfall, but it is basically the same.
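Hypothetical usage (assuming a fitted tree-based classifier clf and a test frame X_test; both names are placeholders):

shap_df = getShapReport(clf, X_test)            # shows both summary plots
shap_df.to_csv('shap_values.csv', index=False)  # one row of SHAP values per test row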
I am running some regression models to predict performance.
After running the models I created a variable to see the predictions (y_pred_* are lists with 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The types of these prediction lists are arrays of float64, while y_test is a DataFrame.
I wanted to create a table with the results. I tried some different ways: calling them as lists, trying to convert them, trying to select the values, and I did not succeed so far. Could anyone help?
My last trial was like below:
comparison = pd.DataFrame({'Real': y_test, 'LR': y_pred_LR, 'RF': y_pred_RF, 'SVM': y_pred_SVR})
In this case the DataFrame is created, but the values don't appear.
Additionally, I would like to create two new rows with the mean and standard deviation of the results, and these rows should be located at the beginning (first rows) of the DataFrame.
Thanks
import pandas as pd
import numpy as np

# mimic the question's setup: the real values are 2-D (a DataFrame column), the predictions are 1-D arrays
real = np.array([2] * 10).reshape(-1, 1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)

# flatten the 2-D column to 1-D so all inputs align in the DataFrame constructor
real = real.flatten()
comparison = pd.DataFrame({'real': real,
                           'y_pred_LR': y_pred_LR,
                           'y_pred_SVR': y_pred_SVR,
                           'y_pred_RF': y_pred_RF})

# per-column mean and standard deviation, turned into two rows
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean, StD], axis=1).T

# prepend the summary rows to the comparison table
result = pd.concat([Mean_StD, comparison], ignore_index=True)
print(result)
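If you'd rather have the two summary rows labelled instead of numbered, a small variant (same objects as above) keeps the index labels:

# label the summary rows, then concatenate without resetting the index
Mean_StD.index = ['mean', 'std']
result = pd.concat([Mean_StD, comparison])
print(result)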
I've got 10 clusters from k-modes.
Data: categorical (I converted it to binary, then ran the model).
Technology used: Jupyter/Python.
My questions:
1. How do I find the accuracy?
2. How do I plot/visualise the clusters in 2D and 3D?
Something like this should be a good start.
import numpy as np
from scipy.cluster.vq import kmeans, vq
from pylab import plot, show

# recreate data to feed into the algorithm
data = np.asarray([np.asarray(df['field1']), np.asarray(df['field2'])]).T
Now, running the following piece of code:

# computing K-Means with K = 5 (5 clusters)
centroids, _ = kmeans(data, 5)
# assign each sample to a cluster
idx, _ = vq(data, centroids)
# some plotting using numpy's logical indexing
plot(data[idx==0,0], data[idx==0,1], 'ob',
     data[idx==1,0], data[idx==1,1], 'oy',
     data[idx==2,0], data[idx==2,1], 'or',
     data[idx==3,0], data[idx==3,1], 'og',
     data[idx==4,0], data[idx==4,1], 'om')
plot(centroids[:,0], centroids[:,1], 'sg', markersize=8)
show()
This is a great resource.
https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/
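For the 3D part of the question, a minimal matplotlib sketch could look like the following (one assumption: data has a third column here, unlike the two-field example above):

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# colour each point by its cluster index and mark the centroids with green squares
ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=idx)
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], marker='s', s=64, c='g')
plt.show()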
I am trying to fit a OneVsAll classification output on training data; each row of the output adds up to 1.
One possible way is to read each row, find which column has the highest value, and prepare the data for training.
E.g.: y = [[0.2,0.8,0],[0,1,0],[0,0.3,0.7]] can be reduced to y = [b,b,c], considering a, b, c as the corresponding classes of columns 0, 1, 2 respectively.
Is there a function in scikit-learn which helps to achieve such transformations?
This code does what you want:
import numpy as np

y = np.array([[0.2, 0.8, 0], [0, 1, 0], [0, 0.3, 0.7]])

def transform(y, labels):
    # map each row to the label of its highest-scoring column
    f = np.vectorize(lambda i: labels[i])
    return f(y.argmax(axis=1))

y = transform(y, 'abc')
EDIT: Following the comment by alko, I made it more general by letting the user supply the labels to the transform function.
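Equivalently, without the helper function, plain NumPy indexing does the same thing (same example data):

import numpy as np

labels = np.array(['a', 'b', 'c'])
scores = np.array([[0.2, 0.8, 0], [0, 1, 0], [0, 0.3, 0.7]])
print(labels[scores.argmax(axis=1)])  # -> ['b' 'b' 'c']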