sklearn.transform and sklearn.fit_transform give different results - python

I am trying to plot my data in a 2-dimensional space using sklearn PCA. I want to re-use the same PCA representation to plot several datasets afterwards, but let us focus on one set first.
When I run a sklearn.fit_transform on my data I get the following result:
from sklearn.decomposition import PCA as sklearnPCA

sklearn_pca = sklearnPCA(n_components=2, random_state=55)
X_train_proj = sklearn_pca.fit_transform(X_train)
plt.scatter(X_train_proj[:, 0],
            X_train_proj[:, 1],
            c=dic[y_train.astype(int)],
            s=y_train * 10 + 1)
Output 1: https://i.ibb.co/B4FcV08/capture-1.png
When I run a sklearn.transform on the same data, using the PCA object fitted above by fit_transform, here is what I get:
X_train_proj_2 = sklearn_pca.transform(X_train)
plt.scatter(X_train_proj_2[:, 0],
            X_train_proj_2[:, 1],
            c=dic[y_train.astype(int)],
            s=y_train * 10 + 1)
Output 2: https://i.ibb.co/0MS3Jhy/capture-2.png
My data contains absolutely no NAs and is already scaled. It is fairly big, though: ~11,000 rows and ~20 columns.
I have also quickly checked that my columns are not correlated by computing the correlation matrix.
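For reference, a minimal way to quantify the gap between the two projections (a sketch, assuming X_train is the scaled matrix above):

import numpy as np
from sklearn.decomposition import PCA as sklearnPCA

# compare fit_transform against a subsequent transform on the same data
sklearn_pca = sklearnPCA(n_components=2, random_state=55)
proj_fit = sklearn_pca.fit_transform(X_train)
proj_tr = sklearn_pca.transform(X_train)
print(np.allclose(proj_fit, proj_tr))    # False confirms the mismatch
print(np.abs(proj_fit - proj_tr).max())  # magnitude of the largest deviation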

Related

How do I color clusters after k-means and TSNE in either seaborn or matplotlib?

I have a dataframe that looks something like this:
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[-true_k:, 0]
df["comp-2"] = transformed_centroids[-true_k:, 1]
The 'y' column holds the k-means labels I want to color by, and 'comp-1' and 'comp-2' are the results from the TSNE model. When I try to plot like this:
sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df['y'])
plt.show()
It gives me this error:
ValueError: Length of values (2) does not match length of index (35104) (from this line: df["comp-1"] = transformed_centroids[-true_k:, 0])
This happens even if I do this:
sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df.y.astype('category').cat.codes)
plt.show()
I've tried several other pieces of code scattered around random tutorials and here, but I haven't found a solution. I feel silly having successfully completed the clustering but failing on the colors.
EDIT: I realized I was using the wrong plot points. The updated code and error are below:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:, 0]
df["comp-2"] = transformed_centroids[:, 1]
ValueError: Length of values (35106) does not match length of index (35104)
I'm not sure where the two dropped data-points are being... dropped.
EDIT2: Here is the TSNE code:
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

centroids = model.cluster_centers_
tweets_df2['labels'] = model.labels_
everything = np.concatenate((X.todense(), centroids))

tsne_init = 'pca'  # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model2 = TSNE(n_components=2, random_state=0, init=tsne_init,
              perplexity=tsne_perplexity,
              early_exaggeration=tsne_early_exaggeration,
              learning_rate=tsne_learning_rate)
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
I took this code from another Stack Overflow post and fit it to my data, so I can't explain it 100%. I just know I needed TSNE to make my data points 2D-plottable, since the data consists of words vectorized with TF-IDF.
With help from @tdy, I realized one of the solutions I had tried a little while ago was the one I needed. My main problem was in my second edit: I wasn't graphing the right set of data. I changed the df to this:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-2, 0]
df["comp-2"] = transformed_centroids[:-2, 1]
Of course, since true_k is 2 for my 2-cluster code, this is the same as:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-true_k, 0]
df["comp-2"] = transformed_centroids[:-true_k, 1]
where true_k is the variable holding how many k-means clusters I have. I had this solution earlier but changed it because I thought getting rid of true_k would solve my 2-variable problem, and I never reverted it. I just needed the right transformed_centroids[] slice, and everything should run smoothly in 7 minutes when it's done melting my CPU... :)
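Putting it together, a minimal sketch of the final pattern (assuming, as in the code above, that everything stacks the data rows first and the true_k centroid rows last):

transformed_centroids = model2.fit_transform(everything)
points_2d = transformed_centroids[:-true_k]   # the original data points
centers_2d = transformed_centroids[-true_k:]  # the appended cluster centers

df = pd.DataFrame({"y": model.labels_,
                   "comp-1": points_2d[:, 0],
                   "comp-2": points_2d[:, 1]})
sns.scatterplot(x="comp-1", y="comp-2", hue="y", data=df)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], marker='x', c='black')
plt.show()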

pyclustering visualising xmeans when the matrix has more than three dimensions

I'm trying to cluster and visualise some data with xmeans from the pyclustering lib.
I copied the code directly from the example in the documentation:
from pyclustering.cluster import cluster_visualizer
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import SIMPLE_SAMPLES

sample = X  # read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# Prepare initial centers - the number of initial centers defines the number
# of clusters from which X-Means will start its analysis.
amount_initial_centers = 2
initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize()

# Create an instance of the X-Means algorithm. The algorithm will start its
# analysis from 2 clusters; the maximum number of clusters that can be
# allocated is 20.
xmeans_instance = xmeans(sample, initial_centers, 20)
xmeans_instance.process()

# Extract clustering results: clusters and their centers
clusters = xmeans_instance.get_clusters()
centers = xmeans_instance.get_centers()

# Print total sum of metric errors
print("Total WCE:", xmeans_instance.get_total_wce())

# Visualize clustering results
visualizer = cluster_visualizer()
visualizer.append_clusters(clusters, sample)
visualizer.append_cluster(centers, None, marker='*', markersize=10)
visualizer.show()
The only difference is that I assigned to sample the value of my matrix X instead of loading a sample dataset.
When I try to visualise the clustering result I get this error:
Only objects with size dimension 1 (1D plot), 2 (2D plot) or 3 (3D plot) can be displayed. For multi-dimensional data use 'cluster_visualizer_multidim'.
My X matrix is generated in this way:
features = ["I", "Iu", other 7 column names]
data = df[features]
...
X = scaler.fit_transform(data)
Is there a way to visualise the clusters by plotting only two or three features at a time?
I can't find anything in the documentation.
I tried this:
visualizer.append_clusters(clusters, sample[:,[0,1]])
in order to visualise only the first two features, and got this error:
Only clusters with the same dimension of objects can be displayed on canvas.
EDIT:
I updated the code as suggested in the answer by annoviko but now I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-69-6fd7d2ce5fcd> in <module>
20 visualizer.append_clusters(clusters, X)
21 visualizer.append_cluster(centers, None, marker='*', markersize=10)
---> 22 visualizer.show(pair_filter=[[0, 1], [0, 2]])
/usr/local/lib/python3.8/site-packages/pyclustering/cluster/__init__.py in show(self, pair_filter, **kwargs)
224 raise ValueError("There is no non-empty clusters for visualization.")
225
--> 226 cluster_data = self.__clusters[0].data or self.__clusters[0].cluster
227 dimension = len(cluster_data[0])
228
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
It is raised by visualizer.show(), and it happens even if I remove the pair_filter from within the function call.
In line with the error that you got:
Only objects with size dimension 1 (1D plot), 2 (2D plot) or 3 (3D plot) can be displayed. For multi-dimensional data use 'cluster_visualizer_multidim'.
You have to use cluster_visualizer_multidim, as the error message says. There is documentation (pyclustering 0.10.1) with an example: https://pyclustering.github.io/docs/0.10.1/html/dc/d6b/classpyclustering_1_1cluster_1_1cluster__visualizer__multidim.html
For example, if you have data with more than three dimensions and you want to display (x0, x1) and (x0, x2), you can do it in the following way:
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, sample_4d)
visualizer.show(pair_filter=[[0, 1], [0, 2]])
where pair_filter specifies which pairs of features should be shown. In the example above, it will show only (x0, x1), i.e. [0, 1], and (x0, x2), i.e. [0, 2].
So, in your particular case, where you want to display only the first two features, it should be:
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, sample)
visualizer.show(pair_filter=[[0, 1]])
I think I should make the error message more readable and have it propose the other class right in the first sentence. Let me know if this helps (if it is still relevant for you).
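As for the ValueError in the edit: judging from the traceback, the failing line applies or to whatever was passed to append_clusters, and or on a NumPy array is ambiguous. A possible workaround (a sketch, not tested against your data) is to pass plain Python lists instead:

# pass plain lists so the library's internal `or` sees a list, not an array
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, X.tolist())
visualizer.show(pair_filter=[[0, 1], [0, 2]])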

get the size of dataset after applying a filter from tf.data.Dataset

I wonder how I can get the size or length of the dataset after applying a filter. Using tf.data.experimental.cardinality gives -2, and this is not what I am looking for! I want to know how many filtered samples exist in my dataset in order to be able to split it into training and validation datasets using take() and skip().
Example:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
dataset = dataset.filter(lambda x: x < 4)
size = tf.data.experimental.cardinality(dataset).numpy()
# size here is -2, but I want the real size, which is 3
My dataset contains images and their labels; this is just an illustrative example.
A look at the documentation reveals that a cardinality of -2 (tf.data.experimental.UNKNOWN_CARDINALITY) means TensorFlow is unable to determine the cardinality of the dataset; you can find this here. For your example, you can do:
dataset = dataset.as_numpy_iterator()
dataset = list(dataset)
print(len(dataset))
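Note that this materializes every element in memory. A sketch of an alternative that only counts (assuming the filtered dataset is finite) and then feeds the take()/skip() split from the question:

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
dataset = dataset.filter(lambda x: x < 4)

# count by folding over the dataset instead of materializing it
size = dataset.reduce(0, lambda count, _: count + 1).numpy()
print(size)  # 3

train_size = int(0.8 * size)
train_ds = dataset.take(train_size)
val_ds = dataset.skip(train_size)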

How to create an SVM with multiple features for classification?

I am writing a piece of code to identify different 2D shapes using OpenCV. I get 4 sets of data from each image of a 2D shape and these are stored in the multidimensional array featureVectors.
I am trying to write an svm/svc that takes into account all 4 features obtained from the image. I have been able to make it work with just 2 features, but when I try all 4 my graph comes out looking like this:
[Image: my graph, which is incorrect]
My values for featureVectors are:
[[ 4.00000000e+00 1.74371349e-03 6.49705560e-01 9.07957236e+01]
[ 4.00000000e+00 4.60937436e-02 1.97642179e-01 9.02041472e+01]
[ 1.00000000e+00 1.18553450e-03 3.03491372e-01 6.03489082e+01]
[ 1.00000000e+00 1.54552898e-02 8.38091425e-01 1.09021207e+02]
[ 3.00000000e+00 1.69961646e-02 4.13691915e+01 1.36838300e+02]]
And my Labels are:
[[2]
[2]
[0]
[0]
[1]]
Here is my code for the SVM:
import numpy as np
import pandas as pd
from sklearn import svm
import matplotlib.pyplot as plt

# Saving featureVectors to a csv file
values1 = featureVectors
header1 = ["Number of Sides", "Standard Deviation of Number of Sides/Perimeter",
           "Standard Deviation of the Angles", "Largest Angle"]
my_df = pd.DataFrame(featureVectors)
my_df.to_csv('featureVectors.csv', index=True, header=header1)

# Saving labels to a csv file
values2 = labels
header2 = ["Label"]
my_df = pd.DataFrame(labels)
my_df.to_csv('labels.csv', index=True, header=header2)

# Writing the SVM
def Build_Data_Set(features=header1, features1=header2):
    # pd.DataFrame.from_csv is gone in modern pandas; read_csv with
    # index_col=0 is the equivalent
    data_df = pd.read_csv("featureVectors.csv", index_col=0)
    #data_df = data_df[:250]
    X = np.array(data_df[features].values)
    data_df2 = pd.read_csv("labels.csv", index_col=0)
    y = np.array(data_df2[features1].values)
    #print(X)
    #print(y)
    return X, y

def Analysis():
    X, y = Build_Data_Set()
    clf = svm.SVC(kernel='linear', C=1.0)
    clf.fit(X, y)
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(0, 5)
    yy = np.linspace(0, 185)
    h0 = plt.plot(xx, yy, "k-", label="non weighted")
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.ylabel("Maximum Angle (Degrees)")
    plt.xlabel("Number Of Sides")
    plt.title('Shapes')
    plt.legend()
    plt.show()

Analysis()
I have only used 5 data sets (shapes) so far because I knew it wasn't working correctly.
The SVM part of your code is actually correct. The plotting part around it is not, and given the code I'll try to give you some pointers.
First of all:
another example I found (i cant find the link again) said to do that
Copying code without understanding it will probably cause more problems than it solves. Given your code, I'm assuming you used this example as a starter.
plt.scatter(X[:, 0],X[:, 1],c=y)
In the sk-learn example, this snippet is used to plot data points, coloring them according to their label. It works there because the example deals with 2-dimensional data. The data you're dealing with is 4-dimensional, so you're actually just plotting the first two dimensions.
plt.scatter(X[:, 0], y, c=y)
on the other hand makes no sense.
xx = np.linspace(0,5)
yy = np.linspace(0,185)
h0 = plt.plot(xx,yy, "k-", label="non weighted")
This line actually has nothing to do with your decision boundary: it's just a plot of y over x in your coordinate system.
(In addition to that, you're dealing with multi-class data, so you'll have as many decision boundaries as you have classes.)
Now your actual problem is data dimensionality. You're trying to plot 4-dimensional data in a 2d plot, which simply won't work.
A possible approach would be to perform dimensionality reduction to map your 4d data into a lower-dimensional space; a sketch of the idea follows. If you want to dig deeper, I'd suggest reading e.g. the excellent sklearn documentation for an introduction to SVMs, and in addition something about dimensionality reduction.
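As a starting point, here is a minimal sketch of that idea (assuming X and y are the arrays returned by Build_Data_Set above): project the 4-dimensional features onto 2 principal components purely for visualization.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# reduce the 4d feature vectors to 2d for plotting only
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y.ravel())
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Shapes, projected to 2D")
plt.show()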

Scikitlearn - order of fit and predict inputs, does it matter?

Just getting started with this library... having some issues (I've read the docs but didn't get clarity) with RandomForestClassifiers.
My question is pretty simple: say I have a training data set like
A B C
1 2 3
where A is the target variable (y) and B-C are the features (x). Let's say the test set looks the same; however, the order is
B A C
1 2 3
When I call forest.fit(train_data[0:,1:],train_data[0:,0])
do I then need to reorder the test set to match this order before running? (Ignoring the fact that I need to remove the already predicted y value (A), so let's just say B and C are out of order...)
Yes, you need to reorder them. Imagine a simpler case, Linear Regression. The algorithm will calculate a weight for each of the features, so for example if feature 1 is unimportant, it will get assigned a weight close to 0.
If at prediction time the order is different, an important feature will be multiplied by this almost null weight, and the prediction will be totally off.
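A toy illustration of that point (the setup below is made up for demonstration): y depends only on the first feature, so swapping the columns at prediction time ruins the result.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0]  # the second feature is pure noise and gets a ~0 weight
model = LinearRegression().fit(X, y)
print(model.coef_)  # roughly [3, 0]

x_new = np.array([[2.0, 5.0]])
print(model.predict(x_new))           # ~6.0, as expected
print(model.predict(x_new[:, ::-1]))  # columns swapped: ~15.0, way off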
elyase is correct. scikit-learn will simply take the data in whatever order you give it. Hence, you'll have to ensure that the data is in the same order during training and prediction time.
Here's a simple illustrating example:
Training time:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
x = pd.DataFrame({
    'feature_1': [0, 0, 1, 1],
    'feature_2': [0, 1, 0, 1]
})
y = [0, 0, 1, 1]
model.fit(x, y)
# we now have a model that
# (i) predicts 0 when x = [0, 0] or [0, 1], and
# (ii) predicts 1 when x = [1, 0] or [1, 1]
Prediction time:
# positive example
http_request_payload = {
'feature_1': 0,
'feature_2': 1
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 0, as expected
# negative example
http_request_payload = {
'feature_2': 1, # notice that the order is jumbled up
'feature_1': 0
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 1, when it should have returned 0.
# scikit-learn doesn't care about the key-value mapping of the features.
# it simply vectorizes the dataframe in whatever order it comes in.
This is how I cache the column order during training so that I can use it during prediction time.
# training
x = pd.DataFrame([...])
column_order = x.columns
model = SomeModel().fit(x, y)  # train model

# save the things that we need at prediction time. you can also use pickle
# if you don't want to pip install joblib
import joblib
joblib.dump(model, 'my_model.joblib')
joblib.dump(column_order, 'column_order.txt')

# load the artifacts from disk
model = joblib.load('my_model.joblib')
column_order = joblib.load('column_order.txt')

# imaginary http request payload
request_payload = { 'feature_1': ..., 'feature_2': ... }

# create empty dataframe with the right shape and order (using column_order)
input_features = pd.DataFrame([], columns=column_order)
input_features = input_features.append(request_payload, ignore_index=True)
input_features = input_features.fillna(0)  # handle any missing data however you like
model.predict(input_features.values.tolist())
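For what it's worth, on newer pandas (DataFrame.append was removed in 2.0), the same reordering can be done with reindex; a minimal sketch:

# build the frame from the payload, then force the training-time column order;
# reindex also inserts any missing columns as NaN
input_features = pd.DataFrame([request_payload]).reindex(columns=column_order)
input_features = input_features.fillna(0)
model.predict(input_features.values.tolist())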
