Linear transformation for noisy data in Python

I have a dataset that you see below. The data is pretty noisy, but there is a clear linear trend that goes up and to the right. I'd like to fit a transform of the form y = m * x and use it to flatten the lines to horizontal. Essentially, I'd like to run a regression on the orange lines to pull out the slope, but I don't know how to separate the different linear clusters first. Is there a good method for transforming data like this? I'm using Python/pandas/NumPy.

It looks like you'll want to try clustering the orange points. Some clustering methods will cope with the parallel clusters; I would probably start with DBSCAN.
For more on clustering, check out the tutorial on this scikit-learn page. Your situation is a bit like the fourth row of the cluster-comparison figure there.
If you provide your data, I expect several people will take a look at it.
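For illustration, a rough sketch of that idea, assuming the orange points live in 1-D NumPy arrays x and y (the eps and min_samples values are placeholders you would tune to the spacing between your lines):
import numpy as np
from sklearn.cluster import DBSCAN

XY = np.column_stack([x, y])
labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(XY)  # tune eps/min_samples

# fit a slope per cluster (label -1 is DBSCAN's noise bucket)
for k in set(labels) - {-1}:
    m, b = np.polyfit(x[labels == k], y[labels == k], deg=1)
    y_flat = y[labels == k] - m * x[labels == k]  # subtracting m*x flattens this line
Once each cluster's slope is known, subtracting m * x (or dividing by x, if the lines pass through the origin) gives the horizontal form you're after.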

Related

HDBSCAN on MovieLens latent embeddings does not cluster well

I am working on a recommendation algorithm, which has now boiled down to finding the right clustering algorithm for the job.
Data
The data I'm working with is the MovieLens 100K dataset, from which I've extracted movie titles, genres, and tags and concatenated them into single documents (one per movie). This gives me about 10,000 documents. These have been vectorized with TF-IDF and then autoencoded down to 64-dimensional feature vectors (loss = 0.0014, down from 22.14, in 30 epochs). The autoencoder is able to reconstruct the data well.
Clustering
Currently, I am working with HDBSCAN, since it should be able to handle datasets with varying density, non-globular clusters, arbitrary cluster shapes, etc. It should be the correct algorithm to use here. A 2D representation of the original 64-dimensional data (obtained with t-SNE) shows what seems to be a decently clusterable space, but I cannot get HDBSCAN to work properly. Setting min_cluster_size to 15-30 gives me this; any higher and it labels all points as noise, and lowering it gives me this. Or it just lumps the large majority of points into one cluster, with some additional very small clusters, and the rest as noise, like this. It seems like the algorithm can't handle the data, but the data does look clusterable to me.
My Questions:
How can fiddling with parameters help HDBSCAN to cluster this space?
Is there a better algorithm for clustering such a space?
Or is the data simply non-clusterable, from what you can see in the plots?
Thanks so much in advance, I've been struggling with this for hours now.
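For reference, the knobs worth fiddling with look roughly like this. This is only a sketch: X stands in for the 64-dimensional embedding array, and the values are starting points to tune, not recommendations:
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=20,              # smallest group you care about
    min_samples=5,                    # lower = less eager to call points noise
    cluster_selection_method="leaf",  # the default "eom" tends to merge into one big cluster
)
labels = clusterer.fit_predict(X)     # -1 marks noise
Lowering min_samples relative to min_cluster_size, or switching cluster_selection_method to "leaf", are the usual first moves against the all-noise and one-giant-cluster failure modes described above.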

How to visualize k-means of multiple columns

I'm not a data scientist, but I am intrigued by data science, machine learning, etc.
In my efforts to understand all of this, I am continuously building a dataset (daily scraping) of Grand Exchange prices from one of my favourite games, Old School RuneScape.
One of my goals is to pick a set of stocks/items that would give me the most profit. Currently I am trying out clustering with k-means, to find stocks that are similar to each other based on some basic features I could think of.
However, I have no clue whether what I'm doing is correct.
For example: y = kmeans.fit_predict(df_items) includes my item_id, so is it actually considering item_id as a feature now?
And how do I even visualise the outcome of this? I mean, what goes on the x-axis and what goes on the y-axis? I have multiple columns...
https://github.com/extreme4all/OSRS_DataSet/blob/master/NoteBooks/Stock%20Picking.ipynb
To visualize something you have to reduce dimensionality to 2-3 dimensions; on top of that, you can use color as a fourth dimension, or, in your case, to indicate the cluster number.
t-SNE is a common choice for this task; check the sklearn docs for details: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
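A rough sketch of that suggestion (df_items, the item_id column, and n_clusters=5 are assumptions about your notebook, not things verified against it):
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X = df_items.drop(columns=["item_id"]).values   # keep the id out of the features
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

emb = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=10)
plt.show()
Note the drop(columns=["item_id"]) step: it also answers your first question, since fit_predict treats every column it is given, including an id, as a feature.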
Choose almost any visualization technique for multivariate data:
Scatterplot matrix (sketched after this list)
Parallel coordinates
Dimensionality reduction (PCA makes more sense for k-means than t-SNE, but also consider Fisher's LDA, LMNN, etc.)
Box plots
Violin plots
...
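As a concrete example of the first option, pandas ships a scatterplot matrix. A sketch, where feature_cols and labels are assumed names for your feature columns and cluster assignments:
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# one panel per pair of features, points colored by cluster
scatter_matrix(df_items[feature_cols], c=labels, figsize=(10, 10), diagonal="hist")
plt.show()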

How to plot classification regions in a lower dimensional space?

I'm working in a space which has 8 dimensions (i.e. 8 features). I have plotted the data points in 2D by applying PCA as well as t-SNE. Now I would also like to draw the borderlines of the classifiers I use, as shown here. By the way, I'm using several different classifiers (SVM, GNB, logistic regression).
This means that I have 8-dimensional points which I plot in 2D using PCA or t-SNE. On top of this plot I would like to draw the classification regions shown in the link above.
Of course the classification boundaries/regions are also 8-dimensional. How can I turn the classification boundaries/regions into 2D to match my 2D data points?
Interesting question; I once wondered about it myself.
It can be answered in several ways, with more or less detail, depending on whether you want to fully understand the method or just apply it.
Since you didn't give a lot of detail but did include an sklearn link, I will first answer from a technical point of view: "How can you do it with sklearn?"
There is a function for this: transform(X, y=None), which applies the PCA projection (yes, PCA is a projection from a high-dimensional space to a lower-dimensional one).
So you basically just need to call transform(your_boundaries) to apply it.
In terms of pseudocode this would give:
pca = PCA(n_components=2).fit(data)
boundaries_2d = pca.transform(boundaries)
Et voilĂ !
Do not hesitate to give more details or ask questions. I can go into more specifics if that would help.
Hope it helps
pltrdy
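For completeness, a commonly used variant of this idea (not necessarily pltrdy's exact method): project the data to 2D with PCA, refit the classifier on the projection, and paint its decision regions on a mesh grid. Here X (n x 8), the integer labels y, and the SVC choice are all assumptions:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X2 = PCA(n_components=2).fit_transform(X)   # 8-D -> 2-D
clf = SVC().fit(X2, y)                      # refit in the projected space

# evaluate the classifier on a dense grid covering the projection
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 300),
    np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 300),
)
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)         # shaded decision regions
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=10)  # data points on top
plt.show()
The caveat is that the regions you see belong to a classifier trained on the 2D projection, not to the original 8-dimensional one; the two only agree insofar as PCA preserves the class structure.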

Python 2D Polynomial Fit Equivalent to IRAF's 'surfit'

I'm working with some data, trying to create a 2D polynomial fit just like IRAF's surfit (see here). I have 16 data points distributed in a grid pattern (i.e. pixel values at 16 different x- and y-coordinates) that need to be fitted to produce a 1024x1024 array. I've tried a bunch of different methods, starting with things like astropy.modeling and scipy.interpolate, but nothing gives quite the right result compared to IRAF's surfit. I imagine it's because I'm only using 16 data points, but that's all I have! The result should look something like this:
But what I'm getting looks more like this, or this:
If you have any suggestions for how best to accomplish this task, I would very much appreciate your input! Thank you.
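One option worth sketching: a plain least-squares fit of a low-order 2D polynomial, evaluated on the full grid. This is an assumption-laden sketch (xs, ys, zs are length-16 coordinate and pixel-value arrays; order=3 is a guess that makes the 4x4 coefficient grid exactly determined by 16 points), and it is not guaranteed to match surfit's behaviour:
import numpy as np
from numpy.polynomial import polynomial as P

order = 3
# design matrix with one column per term x**i * y**j
A = np.array([[x**i * y**j for i in range(order + 1) for j in range(order + 1)]
              for x, y in zip(xs, ys)])
coef, *_ = np.linalg.lstsq(A, zs, rcond=None)
coef = coef.reshape(order + 1, order + 1)   # coef[i, j] multiplies x**i * y**j

gx, gy = np.meshgrid(np.arange(1024), np.arange(1024))
surface = P.polyval2d(gx, gy, coef)         # the 1024x1024 fitted array
If the result rings or overshoots between the 16 points, lowering order trades exactness at the data points for smoothness, which may be closer to what surfit produces.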

Scikit-learn kmeans clustering

I'm supposed to be doing a k-means clustering implementation with some data. The example I looked at from http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html shows their test data in 2 columns... however, the data I'm given is 68 subjects with 78 features (so a 68x78 matrix). How am I supposed to create an appropriate input for this?
I've basically just tried inputting the matrix anyway, but it doesn't seem to do what I want, and I don't know why it would. I'm pretty confused about what to do.
import numpy as np
from scipy.cluster.vq import kmeans, vq
from matplotlib.pyplot import plot, show

data = np.rot90(data)  # note: this rotates the 68x78 matrix; if rows are already subjects it is unnecessary
centroids, _ = kmeans(data, 2)
# assign each sample to a cluster
idx, _ = vq(data, centroids)
# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
I honestly don't know what other code to show you... the data format was already described above. Otherwise, it's the same as the tutorial I linked.
Your visualization only uses the first two dimensions.
That is why these points appear to be "incorrect" - they are closer in a different dimension.
Have a look at the next two dimensions:
plot(data[idx == 0, 2], data[idx == 0, 3], 'ob',
     data[idx == 1, 2], data[idx == 1, 3], 'or')
plot(centroids[:, 2], centroids[:, 3], 'sg', markersize=8)
show()
... repeat for all remaining pairs of your 78 dimensions ...
With this many features, (squared) Euclidean distance becomes meaningless, and k-means results tend to be no better than random convex partitions.
To get a more representative view, consider using MDS to project the data into 2D for visualization. It should work reasonably fast with just 68 subjects.
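A sketch of that MDS suggestion, reusing the data matrix and the idx labels from the code above:
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

emb = MDS(n_components=2, random_state=0).fit_transform(data)  # 78-D -> 2-D
plt.scatter(emb[:, 0], emb[:, 1], c=idx)                       # color by cluster
plt.show()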
Please include visualizations in your questions. We don't have your data.
