Unsupervised high-dimensional clustering - Python

I have a dataset of records where each record has 5 labels, and the importance of each label is different.
I know the order of the labels by importance but not the size of the differences, so the distance between two records looks like: a*dist(label1) + b*dist(label2) + c*dist(label3), such that a + b + c = 1.
The dataset contains around 3000 records and I want to cluster it (I don't know the number of clusters) in some way.
I thought about DBSCAN but it is not really good with high dimensional data.
Hierarchical clustering needs to know the number of clusters, and I also think the result depends on which record you start comparing from, so the outcome might be wrong in this case.
I also looked for graph clustering, where the difference between two records would be the weight of the edge between those two nodes, but I didn't find an algorithm that does that.
EDIT:
The data is CDR data, representing the antennas a user connected to while using his cellphone for calls, SMS and internet, so the labels are:
location (longitude, latitude), part_of_day (night, morning-noon, afternoon, evening),
workday/weekend, day_of_week, number of days of connection to this antenna.
I want to cluster it to detect this user's points of interest, such as the gym, the mall, etc., and to separate the gym from the mall even though they are close to each other, because they are different activities.
Any ideas about how to do it?
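For illustration only, here is a minimal sketch of how a weighted distance of the form a*dist(label1) + b*dist(label2) + c*dist(label3) could be turned into a precomputed distance matrix and fed to DBSCAN; the per-label distances, the weights and eps below are placeholders, not a worked-out solution for CDR data.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.cluster import DBSCAN

    n = 3000
    rng = np.random.default_rng(0)

    # Placeholder per-label distance matrices: in practice dist1 could be the
    # geographic distance between antenna locations, dist2 a distance on
    # part_of_day, dist3 a distance on day_of_week, and so on for the labels.
    dist1 = squareform(pdist(rng.random((n, 2))))
    dist2 = squareform(pdist(rng.random((n, 1))))
    dist3 = squareform(pdist(rng.random((n, 1))))

    # Assumed importance weights, a + b + c = 1.
    a, b, c = 0.5, 0.3, 0.2
    D = a * dist1 + b * dist2 + c * dist3   # combined weighted distance matrix

    # DBSCAN accepts a precomputed distance matrix, so the records never need
    # a single coordinate representation; eps would have to be tuned.
    labels = DBSCAN(eps=0.15, min_samples=5, metric="precomputed").fit_predict(D)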

Related

How do I find the 100 most different points within a pool of 10,000 points?

I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.
Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm, which begins by selecting one point at random, then the next point is selected as the most distant one from the first point, and then the second point is selected as having the longest average distance from the first two, etc. This solution doesn't need to be perfect, just roughly correct. Preferably, this solution of 100 points can also be found within ~10 minutes but finishing within 24 hours is also fine.
I don't care about distance, in particular, that's just something that comes to mind as a way to capture "differentness."
If it matters, every point has 10 values of TRUE and 60 values of FALSE.
Some already-built Python package to do this would be ideal, but I am also happy to just write the code myself if somebody could point me to a Wikipedia article.
Thanks
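For reference, here is a minimal sketch of the greedy selection described above, using the max-min rule (repeatedly add the point farthest, in Hamming distance, from its nearest already-chosen point) rather than the average-distance rule; the data is a random stand-in for the 10,000 boolean points.

    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)

    # Random stand-in for the real data: 10,000 points, 70 boolean dimensions,
    # exactly 10 TRUE values per point.
    X = np.zeros((10_000, 70), dtype=bool)
    for row in X:
        row[rng.choice(70, size=10, replace=False)] = True

    def greedy_diverse_subset(X, k=100, seed=0):
        """Greedily pick k points, each time adding the point whose Hamming
        distance to its nearest already-chosen point is largest (max-min)."""
        rng = np.random.default_rng(seed)
        chosen = [int(rng.integers(len(X)))]                 # start from a random point
        min_dist = cdist(X, X[chosen], metric="hamming").ravel()
        for _ in range(k - 1):
            nxt = int(np.argmax(min_dist))                   # farthest from current selection
            chosen.append(nxt)
            d = cdist(X, X[[nxt]], metric="hamming").ravel()
            min_dist = np.minimum(min_dist, d)               # distance to nearest chosen point
        return chosen

    subset = greedy_diverse_subset(X, k=100)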
Your use of "representative" is not standard terminology, but I read your question as you wish to find 100 items that cover a wide gamut of different examples from your dataset. So if 5000 of your 10000 items were near identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would have ~50 items from that group.
One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.
You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.
The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.
Typically you start by assigning each item randomly to a group. Then you calculate the centre of each group, re-assign items to groups based on the closest centre, recalculate the centres, and so on. Eventually this should converge. Multiple runs might be required to find a good set of centres (the algorithm can get stuck in a local optimum).
There are several implementations of this algorithm in Python. You could start with the scikit-learn implementation.
According to an IBM support page (from a comment by sascha), k-means may not work well with binary data. Other clustering algorithms may work better. You could also try converting your records to a space where Euclidean distance is more useful and continue to use k-means clustering. An algorithm that may do that for you is principal component analysis (PCA), which is also implemented in scikit-learn.
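A minimal sketch of that approach with scikit-learn; the PCA step, the number of components and k = 100 follow the discussion above, the data is a random stand-in, and the representative of each group is simply taken as the member closest to its centre.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import pairwise_distances_argmin_min

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(10_000, 70)).astype(float)  # stand-in for the boolean data

    # Optional: project to a lower-dimensional space where Euclidean distance
    # behaves better than on the raw binary coordinates (n_components is a guess).
    X_reduced = PCA(n_components=10).fit_transform(X)

    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_reduced)

    # For each of the 100 cluster centres, the index of the closest data point;
    # these 100 points can serve as the representatives.
    representatives, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X_reduced)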
The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices into 256 parts within seconds.
You could treat your 10,000 points as vertices of an undirected graph. A fully connected graph with 50 million edges would probably be too big. Therefore, you could restrict the edges to "similarity links" between points whose Hamming distance is below a certain threshold.
In general, Hamming distances for 70-bit words take values between 0 and 70. In your case, the upper limit is 20, as there are 10 TRUE coordinates and 60 FALSE coordinates per point; the maximum distance occurs when the TRUE coordinates of the two points do not overlap at all.
Creation of the graph is a costly O(n^2) operation, but it might be possible to get it done within your envisaged time frame.
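A sketch of building such a similarity graph in Python; the threshold of 10 bits is an arbitrary assumption, and the resulting edge list would then be handed to METIS (or any other graph partitioner).

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    n, d = 10_000, 70
    X = rng.integers(0, 2, size=(n, d)).astype(bool)   # stand-in for the real points

    # Condensed pairwise Hamming distances: this is the O(n^2) step mentioned
    # above (~50 million pairs for n = 10,000, so expect a few GB of memory;
    # computing the distances in blocks would reduce the footprint).
    dist = pdist(X, metric="hamming") * d               # raw differing-bit counts

    threshold = 10                                      # assumed cut-off for a "similarity link"
    iu, ju = np.triu_indices(n, k=1)                    # (i, j) pairs in pdist's ordering
    keep = dist <= threshold
    edges = np.column_stack((iu[keep], ju[keep]))       # edge list for METIS or networkx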

Decision Tree leaf node condition for numeric dataset

I am asked to implement a Random Forest classifier, which to my understanding is just a bunch of decision trees through which the test data is run after training, with the classification then determined by majority voting of all the trees.
This is all well and good, and I even understand that entropy determines which feature to use next. What I am struggling to understand is: for numeric data, how do I determine the conditions?
An example is whether a person will play golf depending on weather conditions. Given 3 features (outlook, humidity, wind) and a classification label (play -> yes or no), we first start with outlook:
Outlook -> Overcast (pure), Sunny, Rain
From Sunny, choose Humidity next: High, Normal (pure)
From Outlook to Rain, choose Wind (last feature): Weak (pure), Strong
Essentially, in this case the values of the features are taken individually. But what happens when I have a dataset with a bunch of decimals?
(Some of) the data:
In this case I would start by first looking at the label (0 or 1), then progress to the feature with the highest information gain at each step. But how do I know the conditions for going to a leaf node? Or even, how many children does a parent node have?
A poor diagram to aid my question:
For a theoretical answer to your question, I would start by recommending this excellent visual tutorial.
http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
In terms of implementation, there are several ways to go about it. You could try the following algorithm (inspired by this answer):
For each column (feature) in your dataset, start by sorting it. At every point where you have a class change, split your dataset. Say, for example, that your data points change from class 0 to 1 when feature A is equal to 5: all data points with A < 5 will belong to class 0, and the ones with A > 5 will belong to class 1. In case your dataset is not that simple, you can then proceed the way you would with a categorical decision tree, for example by calculating the entropy at each splitting candidate. You then work out which data points arrive at each child node, and proceed recursively.
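A minimal sketch of that idea for a single numeric feature: sort the values, consider a threshold wherever the class label changes, and keep the threshold with the highest entropy-based information gain. The feature and label arrays are made up for illustration.

    import numpy as np

    def entropy(y):
        """Shannon entropy of a label array."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_numeric_split(x, y):
        """Return (threshold, information_gain) for the best binary split of feature x."""
        order = np.argsort(x)
        x_sorted, y_sorted = x[order], y[order]
        base = entropy(y_sorted)
        best_gain, best_thr = 0.0, None
        for i in range(1, len(x_sorted)):
            # Only consider thresholds where the class label changes.
            if y_sorted[i] == y_sorted[i - 1] or x_sorted[i] == x_sorted[i - 1]:
                continue
            thr = (x_sorted[i] + x_sorted[i - 1]) / 2     # midpoint between the two values
            left, right = y_sorted[:i], y_sorted[i:]
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y_sorted)
            if gain > best_gain:
                best_gain, best_thr = gain, thr
        return best_thr, best_gain

    # Made-up numeric feature and binary labels.
    x = np.array([2.7, 1.3, 3.6, 0.9, 4.2, 2.1, 3.9, 1.8])
    y = np.array([1,   0,   1,   0,   1,   0,   1,   0])
    print(best_numeric_split(x, y))

Note that a numeric split of this kind always produces exactly two children (x <= threshold and x > threshold), which answers the "how many children" part of the question.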

How to visualize k-means of multiple columns

I'm not a data scientist, but I am intrigued by data science, machine learning, etc.
In my effort to understand all of this, I am continuously building a dataset (daily scraping) of Grand Exchange prices from one of my favourite games, Old School RuneScape.
One of my goals is to pick a set of stocks/items that would give me the most profit. Currently I am trying out clustering with k-means, to find stocks that are similar to each other based on some basic features that I could think of.
However, I have no clue whether what I'm doing is correct,
for example:
(y = kmeans.fit_predict(df_items): my item_id is included in df_items, so is it actually considering item_id as a feature now?)
And how do I even visualise the outcome of this? I mean, what goes on the x axis and what goes on the y axis? I have multiple columns...
https://github.com/extreme4all/OSRS_DataSet/blob/master/NoteBooks/Stock%20Picking.ipynb
To visualize something you have to reduce the dimensionality to 2-3 dimensions; in addition you can use color as a 4th dimension, or in your case to indicate the cluster number.
t-SNE is a common choice for this task; check the sklearn docs for details: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
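A hedged sketch of that approach: df_items and the k-means setup from the notebook are replaced by random placeholder data, and item_id is assumed to have been dropped before fitting.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)

    # Placeholder for df_items: only the numeric feature columns, with item_id
    # dropped (if item_id stays in the frame, k-means treats it as a feature).
    features = rng.random((500, 6))

    X = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Reduce the feature columns to 2 dimensions purely for plotting.
    embedding = TSNE(n_components=2, random_state=0).fit_transform(X)

    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=10)
    plt.title("k-means clusters in a 2D t-SNE projection")
    plt.show()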
Choose almost any visualization technique for multivariate data.
Scatterplot matrix (see the sketch after this list)
Parallel coordinates
Dimensionality reduction (PCA makes more sense for k-means than t-SNE, but also consider Fisher's LDA, LMNN, etc.)
Box plots
Violin plots
...
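As an example of the scatterplot-matrix option, here is a sketch with seaborn; the column names are invented stand-ins for the real feature table.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Invented stand-in for the real feature table.
    df = pd.DataFrame(rng.random((300, 4)),
                      columns=["avg_price", "volume", "volatility", "margin"])
    df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)

    # Pairwise scatterplots of every feature against every other, colored by cluster.
    sns.pairplot(df, hue="cluster", corner=True)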

After clustering, How do i choose the best customers (subset) from the top cluster?

I ran a clustering exercise to identify my top customers based on 12 distinct features, using K-Means (on 3 PCA dimensions and on 5 PCA dimensions) and GMM (using 5 PCA dimensions). Both K-Means runs produced almost the same customers as the best set (1182 customers in each case, with an overlap of 1156), while the GMM approach gave me 660 customers as my top customers. These 660 customers were present in both K-Means approaches.
Now I want to identify who my top customers are from among this list. Could you please suggest any statistical approaches I could use to say that these X customers are truly my best set, and run some A/B tests on them? I do not want to go with the full identified set, as it might cost me more to do what is planned for such a large set of customers.
Try some good old filtering! Select one or several features and create your own metric (maybe top customers are those who buy the most, or those who are more loyal/have stayed longer with the company, or a weighted sum of those two factors), sort the 660 customers in your cluster and pick only the first N customers, N being your maximum allowed number of customers.
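A small pandas sketch of that filtering idea; the column names, the weights and the value of N are invented for illustration.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # Invented stand-in for the 660 customers in the top cluster.
    customers = pd.DataFrame({
        "customer_id": np.arange(660),
        "total_spend": rng.gamma(2.0, 500.0, size=660),
        "tenure_months": rng.integers(1, 120, size=660),
    })

    # Hypothetical score: weighted sum of min-max normalised spend and tenure.
    for col in ["total_spend", "tenure_months"]:
        lo, hi = customers[col].min(), customers[col].max()
        customers[col + "_norm"] = (customers[col] - lo) / (hi - lo)
    customers["score"] = 0.7 * customers["total_spend_norm"] + 0.3 * customers["tenure_months_norm"]

    N = 200                                            # assumed budget-limited cut-off
    top_customers = customers.nlargest(N, "score")     # the N best-scoring customers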

How can you compare two cluster groupings in terms of similarity or overlap in Python?

Simplified example of what I'm trying to do:
Let's say I have 3 data points A, B, and C. I run KMeans clustering on this data and get 2 clusters [(A,B),(C)]. Then I run MeanShift clustering on this data and get 2 clusters [(A),(B,C)]. So clearly the two clustering methods have clustered the data in different ways. I want to be able to quantify this difference. In other words, what metric can I use to determine percent similarity/overlap between the two cluster groupings obtained from the two algorithms? Here is a range of scores that might be given:
100% score for [(A,B),(C)] vs. [(A,B),(C)]
~50% score for [(A,B),(C)] vs. [(A),(B,C)]
~20% score for [(A,B),(C)] vs. [(A,B,C)]
These scores are a bit arbitrary because I'm not sure how to measure similarity between two different cluster groupings. Keep in mind that this is a simplified example, and in real applications you can have many data points and also more than 2 clusters per cluster grouping. Having such a metric is also useful when trying to compare a cluster grouping to a labeled grouping of data (when you have labeled data).
Edit: One idea that I have is to take every cluster in the first cluster grouping and get its percent overlap with every cluster in the second cluster grouping. This would give you a similarity matrix of clusters in the first cluster grouping against clusters in the second cluster grouping. But then I'm not sure what you would do with this matrix. Maybe take the highest similarity score in each row or column and do something with that?
Use evaluation metrics.
Many metrics are symmetric. For example, the adjusted Rand index.
A value close to 1 means they are very similar, close to 0 is random, and much less than 0 means each cluster of one is "evenly" distributed over all clusters of the other.
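With scikit-learn this is a one-liner; here it is applied to the three-point example from the question, with each grouping encoded as one cluster label per point. Keep in mind that scores computed on only 3 points are extreme and not very meaningful.

    from sklearn.metrics import adjusted_rand_score

    # Points A, B, C encoded as one cluster label per point.
    kmeans_labels    = [0, 0, 1]   # [(A, B), (C)]
    meanshift_labels = [0, 1, 1]   # [(A), (B, C)]

    print(adjusted_rand_score(kmeans_labels, kmeans_labels))     # 1.0: identical groupings
    print(adjusted_rand_score(kmeans_labels, meanshift_labels))  # low (here negative) score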
Well, determining the number of clusters is a problem in data analysis, and a different issue from the clustering problem itself. There are quite a few criteria for this, such as AIC or the Cubic Clustering Criterion. I don't think scikit-learn has an option to calculate these two by default, but I know there are packages in R.
