How to create imbalanced sample data with make_blobs? - python

I am using make_blobs from the sklearn package.
from sklearn.datasets.samples_generator import make_blobs
I want to create sample data with an imbalance between the two groups. For example, I want 400 samples of FeatureA and 50 of FeatureB.
By default, the code below generates both in equal numbers:
X, y = make_blobs(n_samples=450, centers=2, cluster_std=[10.0, 2], random_state=22, n_features=2)
Following is the count plot created for the data generated by the above code:
Please suggest how I can achieve my requirement.

I think you want to create two classes of data with a predetermined std and center, one with 400 samples and the other with 50. I set centers=None and pass a list to n_samples. Am I right? This code gives what you want; please refer to this link:
sklearn.datasets.make_blobs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=[400, 50], centers=None, cluster_std=[10.0, 2], random_state=22, n_features=2)
print(y)
Zero0 = np.where(y == 0)[0]  # indices of the first (400-sample) cluster
One1 = np.where(y == 1)[0]   # indices of the second (50-sample) cluster
print(Zero0)
print(One1)
plt.scatter(X[Zero0, 0], X[Zero0, 1], color='red')
plt.scatter(X[One1, 0], X[One1, 1], color='green')
plt.show()
plt.scatter(X[:, 0], X[:, 1])
plt.show()
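As a quick sanity check (my addition, not part of the original answer), you can count how many points ended up in each cluster to confirm the 400/50 split:
import numpy as np
print(np.bincount(y))  # expected output: [400  50]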


SHAP plotting waterfall using an index value in dataframe

I am working on a binary classification using the random forest algorithm.
Currently, I am trying to explain the model predictions using SHAP values.
So, I referred to this useful post here and tried the below.
from shap import TreeExplainer, Explanation
from shap.plots import waterfall

# explainer is assumed to be a TreeExplainer built from the fitted random forest (not shown)
sv = explainer(ord_test_t)
exp = Explanation(sv.values[:, :, 1],
                  sv.base_values[:, 1],
                  data=ord_test_t.values,
                  feature_names=ord_test_t.columns)
idx = 20
waterfall(exp[idx])
I like the above approach as it allows displaying the feature values along with the waterfall plot, so I wish to keep using it.
However, this doesn't help me get the waterfall plot for a specific row of ord_test_t (the test data).
For example, let's say that ord_test_t.index.tolist() returns 3, 5, 8, 9, etc.
Now, I want to plot the waterfall plot for ord_test_t.iloc[[9]], but when I pass exp[9] it just gets the 9th positional row, not the row whose index label is 9.
When I try exp.iloc[[9]] it throws an error, since the Explanation object doesn't have iloc.
Can you help me with this, please?
My suggestion is as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]]) # corrected, pass the row of interest as df
exp = Explanation(
    sv.values[:, :, 1],  # class to explain
    sv.base_values[:, 1],
    data=X.loc[[idx]].values,  # corrected, pass the row of interest as df
    feature_names=X.columns,
)
waterfall(exp[0])  # pretend you have only 1 data point which is 0th
0.40.0
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])
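If you would rather keep explaining the whole test set and pick a row by its index label instead of its position, one possible workaround (my sketch, assuming the ord_test_t dataframe and the exp object from the question) is to translate the label into a positional index first:
# hypothetical: locate the position of the row whose index label is 9
pos = ord_test_t.index.get_loc(9)
waterfall(exp[pos])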

How to add colour to a graph to differentiate positive and negative-numbered data?

Plot of dataset showing banknote authentication
I don't know how to add colors to the different dots to differentiate between the positive and negative datasets. I tried following other examples, but I did not make any progress.
For the record, the Python code I used is as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('Banknote_authentication_dataset.csv')
from sklearn.cluster import KMeans
#V1 is the Variance of Wavelet Transformed image
#V2 is the Skewness of Wavelet Transformed image
V1 = data['V1']
V2 = data['V2']
V1_V2 = np.column_stack((V1, V2))
km_res = KMeans(n_clusters=2).fit(V1_V2)
clusters = km_res.cluster_centers_
plt.xlabel('Variance')
plt.ylabel('Skewness')
plt.scatter(V1, V2)
plt.scatter(clusters[:,0], clusters[:,1], s=1000, alpha = 0.50)
The link to the dataset is: https://d3c33hcgiwev3.cloudfront.net/1fXr31hcEemkYxLyQ1aU1g_50fc36ee697c4b158fe26ade3ec3bc24_Banknote-authentication-dataset-.csv?Expires=1613433600&Signature=PhnPBuxjL9TwNwXV2dmS7HN3YOtLJsJo3A26UID0CBBC13cxsBmRmpsyUVN7MXIcrte6oUCBeybrhveDMCb-6-nMsQ8JzSH8qxZgYR7mwfO32WZYDQ7S6qm2Z6hFnkw76NIeEdto5L9CDDFpKkF8OhLd81bjxnTictbS1UTOPXw_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A.
You can get the predictions by using km_res.predict(V1_V2) and then just pass that into your first call to plt.scatter. So your code would change to look like:
# ... code above
preds = km_res.predict(V1_V2)
plt.scatter(V1, V2, c=preds)
# ... code below
If you want control over which colors are used, just map the numeric predictions to color names (for example, make every point with prediction 1 turn into the string 'red').
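A minimal sketch of that mapping, assuming preds comes from km_res.predict(V1_V2) as above:
import numpy as np
# map cluster 1 to red and cluster 0 to blue (the colour names are arbitrary)
colors = np.where(preds == 1, 'red', 'blue')
plt.scatter(V1, V2, c=colors)
plt.show()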

silhouette score returns inconsistent number of samples

I am using scikit-learn's silhouette_score with hierarchical clustering. I am not from a data science or Python background, but I do know some other languages and understand how the hierarchical clustering logic works. I was told to use scikit-learn's silhouette_score to calculate the silhouette score. This code returns an error of
ValueError: Found input variables with inconsistent numbers of samples: [149, 150]
The data used is a CSV containing 151 rows, with the first row as the header, so in total there are 150 data points.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.metrics import silhouette_score
iris = pd.read_csv("Iris.csv")
#iris hierarichal
iris_df = iris.iloc[:, 1:5]
plt.figure(figsize=(10, 7))
plt.title("Iris Dendograms Average Method")
link = linkage(iris_df, method='average')
dend = dendrogram(link)
plt.show()
clusters = fcluster(link, 3, criterion='maxclust')
print(silhouette_score(link, clusters))
You've got a problem here:
print(silhouette_score(link, clusters))
Change it to pass the feature matrix instead of the linkage matrix (in your code, that is iris_df) and you're fine to go:
print(silhouette_score(iris_df, clusters))
Please see docs for silhouette_score:
X: array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise.
Array of pairwise distances between samples, or a feature array.
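For completeness, here is a small sketch of the two valid call styles the docs describe, using the variables from the question (this is my illustration, not part of the original answer):
from scipy.spatial.distance import pdist, squareform
# option 1: pass the feature array directly
print(silhouette_score(iris_df, clusters))
# option 2: pass a precomputed square matrix of pairwise distances
dist = squareform(pdist(iris_df))
print(silhouette_score(dist, clusters, metric='precomputed'))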

How to use Kalman filter model in detecting peaks

For my evaluation, I wanted to use the PyKalman library. I have created a very small time series dataset with three columns, formatted as follows. The full dataset is attached here for reproducibility, since I can't attach a file on Stack Overflow:
http://www.mediafire.com/file/el1tkrdun0j2dk4/testdata.csv/file
time X Y
0.040662 1.041667 1
0.139757 1.760417 2
0.144357 1.190104 1
0.145341 1.047526 1
0.145401 1.011882 1
0.148465 1.002970 1
.... ..... .
I have read the PyKalman library documentation for Python and managed to do simple linear filtering using a Kalman filter. Here is my code:
import matplotlib.pyplot as plt
from pykalman import KalmanFilter
import numpy as np
import pandas as pd
df = pd.read_csv('testdata.csv')
print(df)
pd.set_option('use_inf_as_null', True)
df.dropna(inplace=True)
X = df.drop('Y', axis=1)
y = df['Y']
estimated_value= np.array(X)
real_value = np.array(y)
measurements = np.asarray(estimated_value)
kf = KalmanFilter(n_dim_obs=1, n_dim_state=1,
                  transition_matrices=[1],
                  observation_matrices=[1],
                  initial_state_mean=measurements[0, 1],
                  initial_state_covariance=1,
                  observation_covariance=5,
                  transition_covariance=1)
state_means, state_covariances = kf.filter(measurements[:,1])
state_std = np.sqrt(state_covariances[:,0])
print (state_std)
print (state_means)
print (state_covariances)
fig, ax = plt.subplots()
ax.margins(x=0, y=0.05)
plt.plot(measurements[:,0], measurements[:,1], '-r', label='Real Value Input')
plt.plot(measurements[:,0], state_means, '-b', label='Kalman-Filter')
plt.legend(loc='best')
ax.set_xlabel("Time")
ax.set_ylabel("Value")
plt.show()
This gives the following plot as output.
As we can see from the plot and my dataset, my input is non-linear. Therefore, I wanted to use a Kalman filter and see if I can detect and track the drops in the filtered signal (the blue line in the plot above). But since I am so new to Kalman filters, I have a hard time understanding the mathematical formulation and getting started with the Unscented Kalman Filter. I found a good example of basic PyKalman UKF usage, but it doesn't show how to define the percentage of a drop (peak). I would therefore appreciate any help that at least detects how big a drop from a peak of the filtered signal is (for example, 50% or 80% of the previous drop of the blue line in the plot). Any help would be appreciated.
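This question has no accepted answer here, but as a rough sketch of one way to quantify the drops: run a peak finder over the Kalman-filtered means and express each drop as a percentage of the preceding peak. This assumes state_means from the code above; scipy.signal.find_peaks is my choice and is not part of the original question.
from scipy.signal import find_peaks
import numpy as np

filtered = state_means[:, 0]        # Kalman-filtered signal as a 1-D array
peaks, _ = find_peaks(filtered)     # indices of local maxima
troughs, _ = find_peaks(-filtered)  # indices of local minima

for p in peaks:
    later = troughs[troughs > p]
    if len(later) == 0:
        continue                    # no trough after this peak
    t = later[0]
    drop_pct = 100.0 * (filtered[p] - filtered[t]) / filtered[p]
    print(f"peak at {p}, trough at {t}, drop of {drop_pct:.1f}%")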

using dask Parallel post fit on sklearn predictors (ParallelPostFit wrapper)

I am trying to evaluate an sklearn predictor, which I have fitted, over a larger-than-memory dask array of inputs. I have read the ParallelPostFit documentation https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.wrappers.ParallelPostFit.html and am still having some problems. The following code illustrates the kind of issue that I am running into:
from dask.base import tokenize
import numpy as np
import dask.array as da
from dask.array import Array
from sklearn.linear_model import LinearRegression
from dask_ml.wrappers import ParallelPostFit
"""
for stack overflow question
"""
x = np.linspace(0,100,100,dtype=np.int32)
y = np.linspace(0,100,100,dtype=np.int32)
z = np.linspace(0,100,100,dtype=np.int32)
Y = np.random.normal(size=(100,))
X = np.stack([x,y,z],axis=1)
reg = LinearRegression().fit(X,Y)
#now try to compute on dask arrays over the whole space
x = da.linspace(0, 100, 100, chunks=(10,)).astype(np.int32)
y = da.linspace(0, 100, 100, chunks=(10,)).astype(np.int32)
z = da.linspace(0, 100, 100, chunks=(10,)).astype(np.int32)
x, y, z = da.meshgrid(x, y, z, sparse=False, indexing='ij')
stacked = da.stack([x.flatten(), y.flatten(), z.flatten()], axis=1)
clf = ParallelPostFit(estimator=reg)
clf.predict(stacked)
Executing clf.predict throws the error
ValueError: Can't drop an axis with more than 1 block. Please use atop instead.
which I don't understand how to correct.
Thank you for any help.
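No answer is attached here, but one commonly suggested direction (an assumption on my part, not verified against this exact setup) is that ParallelPostFit.predict needs the feature axis of the dask array to sit in a single chunk, so rechunking stacked before predicting is one thing to try:
# the chunk size along axis 0 is illustrative; -1 keeps all 3 features in one block
stacked = stacked.rechunk({0: 100_000, 1: -1})
preds = clf.predict(stacked)
print(preds.chunks)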
