How to add weight to KMeans - python

Good afternoon, hope all is going well.
I have the following question: I want to use the KMeans algorithm to group points into clusters, which I can already do, but the next step is to take into account a weight attached to each individual point.
For example, I have a cluster of 100 points where every point weighs 1 kg, and another cluster of 100 points where every point weighs 2 kg. I want to split the second cluster in two, so that each group carries the same total weight. The result would look like this:
group 1: 100 points - 100 kg
group 2: 50 points - 100 kg
group 3: 50 points - 100 kg
That way I could distribute the load equally across all my groups, since the groups with heavier points would have fewer points to work with.
I searched the internet and scikit-learn, but I couldn't find any solution to my problem. Maybe I'm not choosing the proper algorithm for this job?
I'll read your replies,
Thx
I searched on the website, checked some documentation and tried some scripts, but I'm not able to find any solution to this...
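For what it's worth, the closest built-in hook in scikit-learn is the sample_weight argument of KMeans. Be aware that it only biases the centroids toward heavy points; it does not balance the total weight per cluster, so it is at best a starting point. The data below is made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Made-up data: 200 2-D points, first 100 weigh 1 kg, last 100 weigh 2 kg
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
weights = np.concatenate([np.ones(100), 2 * np.ones(100)])

# sample_weight makes heavy points count more when centroids are computed;
# it does NOT enforce equal total weight per cluster
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(points, sample_weight=weights)

for k in range(3):
    print(k, (labels == k).sum(), weights[labels == k].sum())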

Related

How do I find the 100 most different points within a pool of 10,000 points?

I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.
Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm, which begins by selecting one point at random, then the next point is selected as the most distant one from the first point, and then the second point is selected as having the longest average distance from the first two, etc. This solution doesn't need to be perfect, just roughly correct. Preferably, this solution of 100 points can also be found within ~10 minutes but finishing within 24 hours is also fine.
I don't care about distance, in particular, that's just something that comes to mind as a way to capture "differentness."
If it matters, every point has 10 values of TRUE and 60 values of FALSE.
Some already-built Python package to do this would be ideal, but I am also happy to just write the code myself if somebody could point me to a Wikipedia article.
Thanks
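For what it's worth, here is a minimal sketch of the greedy idea from the question, using Hamming distance on the boolean vectors. It uses the common max-min variant (each new point is the one farthest from its nearest already-selected point) rather than the average distance, and the data is randomly generated as a stand-in:

import numpy as np

# Stand-in data: 10,000 points, 70 boolean dims, exactly 10 TRUEs each
rng = np.random.default_rng(0)
X = np.zeros((10000, 70), dtype=bool)
for row in X:
    row[rng.choice(70, size=10, replace=False)] = True

selected = [0]                      # start from an arbitrary point
min_dist = (X ^ X[0]).sum(axis=1)   # Hamming distance to the selected set
for _ in range(99):
    nxt = int(min_dist.argmax())    # farthest from everything chosen so far
    selected.append(nxt)
    min_dist = np.minimum(min_dist, (X ^ X[nxt]).sum(axis=1))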
Your use of "representative" is not standard terminology, but I read your question as you wish to find 100 items that cover a wide gamut of different examples from your dataset. So if 5000 of your 10000 items were near identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would have ~50 items from that group.
One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.
You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.
The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.
Typically you start by assigning each item randomly to a group. Then calculate the centre of each group. Then re-assign items to groups based on the closest centre. Then recalculate the centres, etc. Eventually this should converge. Multiple runs might be required to find a good set of centres (the algorithm can get stuck in a local optimum).
There are several implementations of this algorithm in Python. You could start with the scikit-learn implementation; a minimal usage sketch is shown below.
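A minimal sketch with scikit-learn, using random stand-in data shaped like the question's (10,000 points, 70 boolean dimensions); after fitting, the data point nearest each of the 100 centres serves as that group's representative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# Random stand-in for the real dataset
rng = np.random.default_rng(42)
X = (rng.random((10000, 70)) < 0.15).astype(float)

km = KMeans(n_clusters=100, n_init=4, random_state=0).fit(X)

# For each of the 100 centres, take the actual data point closest to it
picks = pairwise_distances_argmin(km.cluster_centers_, X)
representatives = X[picks]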
According to an IBM support page (from a comment by sascha), k-means may not work well with binary data. Other clustering algorithms may work better. You could also try to convert your records to a space where Euclidean distance is more useful and continue to use k-means clustering. An algorithm that may do that for you is principal component analysis (PCA), which is also implemented in scikit-learn.
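If you go the PCA route, a hedged sketch might look like this (n_components=10 is an assumption you would tune; the data is a random stand-in):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = (rng.random((10000, 70)) < 0.15).astype(float)  # stand-in boolean data

# Project to a few continuous components, then cluster
pipe = make_pipeline(PCA(n_components=10), KMeans(n_clusters=100, n_init=4))
labels = pipe.fit_predict(X)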
The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices into 256 parts within seconds.
You could treat your 10,000 points as vertices of an undirected graph. A fully connected graph with 50 million edges would probably be too big. Therefore, you could restrict the edges to "similarity links" between points whose Hamming distance is below a certain threshold.
In general, Hamming distances for 70-bit words have values between 0 and 70. In your case, the upper limit is 20, as there are 10 true coordinates and 60 false coordinates per point; the maximum occurs when the true coordinates of the two points are all at different positions.
Creating the graph is a costly O(n^2) operation, but it might be possible to get it done within your envisaged time frame. A sketch of the graph construction follows below.
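A sketch of the graph construction with NumPy/SciPy; the threshold of 10 is an assumed cutoff you would tune, and note the full distance matrix takes significant memory at n = 10,000:

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = (rng.random((10000, 70)) < 0.15).astype(float)  # stand-in boolean data

# scipy's 'hamming' metric returns the FRACTION of differing coordinates,
# so multiply by 70 to get absolute bit differences.
# Caution: the full 10,000 x 10,000 float64 matrix takes ~800 MB.
D = squareform(pdist(X, metric="hamming")) * X.shape[1]

threshold = 10  # assumed cutoff; 20 is the maximum possible here
ii, jj = np.nonzero(np.triu(D <= threshold, k=1))
edges = list(zip(ii.tolist(), jj.tolist()))  # similarity links to feed to METIS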

How to specify the number maximum of points in the cluster using DBSCAN?

I know that DBSCAN has a parameter that specifies the minimum number of points (min points), but I would like to restrict the maximum number of points in a cluster. Do you know how I can do that? I have investigated but haven't found anything... For example, I only want DBSCAN to group a maximum of 4 points per cluster.
Thanks!
I think you will find everything you need in the link Aaron shared with you. Also, just so you know, clustering methodologies are unsupervised, so you don't train/test anything. You let the algo tell you the story, based on the data that is fed in. You don't know what will happen in advance. In short, with DBSCAN and also Hierarchical Clustering (but not K-Means), you do not pre-specify the number of clusters. The algo determines the optimal number of clusters for you. If you really want to control the number of clusters (min or max) you need to use a K-Means algo. Take a look at this link when you have a chance.
https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f
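As a hedged illustration of that suggestion: k-means lets you bound the average cluster size by choosing k from the data size, though it still does not enforce a hard per-cluster cap. All values below are made-up stand-ins:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)             # stand-in data
max_points = 4
k = int(np.ceil(len(X) / max_points))  # 25 clusters, so 4 points on average
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)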

Unsupervised high dimension clustering

I have a dataset of records where each record has 5 labels, and the importance of each label is different.
I know the order of the labels by importance but not the exact differences, so the distance between two records looks like: a*dist(label1) + b*dist(label2) + c*dist(label3), such that a + b + c = 1.
The dataset contains around 3,000 records and I want to cluster it (I don't know the number of clusters) in some way.
I thought about DBSCAN, but it is not really good with high-dimensional data.
Hierarchical clustering needs to know the number of clusters, and I also think the result depends on which record you compare to first, so maybe the result would be wrong in this case.
I also looked at graph clustering, where the difference between two records would be the weight of the edge between the two nodes, but I didn't find an algorithm that does that.
EDIT:
The data is CDR data, representing the antennas a user connected to while using his cellphone for calls, SMS and internet, so the labels are:
location (longitude, latitude), part_of_day (night, morning-noon, afternoon, evening),
workday/weekend, day_of_week, number of days of connection to this antenna.
I want to cluster it to detect this user's points of interest, such as the gym, the mall, etc., and to separate the gym from the mall even though they are close to each other, because they are different activities.
Any ideas about how to do it?
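One way to act on the weighted-distance idea in the question is to precompute the combined distance matrix and hand it to DBSCAN with metric="precomputed". This is only a sketch: the features, the per-label distances and the weights a, b, c below are made-up stand-ins for the real CDR data:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
n = 3000

loc = rng.random((n, 2))              # stand-in (longitude, latitude)
part_of_day = rng.integers(0, 4, n)   # stand-in categorical label
weekend = rng.integers(0, 2, n)       # stand-in workday/weekend flag

dist1 = squareform(pdist(loc))        # location distance
dist2 = (part_of_day[:, None] != part_of_day[None, :]).astype(float)
dist3 = (weekend[:, None] != weekend[None, :]).astype(float)

a, b, c = 0.5, 0.3, 0.2               # assumed weights, a + b + c = 1
D = a * dist1 + b * dist2 + c * dist3

# DBSCAN accepts a precomputed distance matrix; eps/min_samples need tuning
labels = DBSCAN(eps=0.2, min_samples=5, metric="precomputed").fit_predict(D)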

Python Selenium infinite loop

I'm trying to study customer behavior. Basically, I have information on customers' loyalty point activity (e.g. how many points they have earned, how many points they have used, how recently they have used/earned points, etc.). I'm using R to conduct this analysis.
I'm just wondering how I should go about segmenting customers based on the above information. I'm trying to apply the RFM concept and then use K-means to segment my customers (although I have a few more variables than just R, F, M, as I have recency, frequency and monetary value for both points earned and points used, as well as other ratios and metrics). Is this a good way to do this?
Essentially I have two objectives:
1. To segment customers
2. Via segmentation, to identify customer behavior (e.g. customers who spend all of their points before churning), provided that segmentation is the right method for such a task
Clustering <- kmeans(RFM_Values4, centers = 10)  # k-means with 10 clusters on the RFM feature matrix
Please enlighten me, I need some guidance on the best methods to tackle such problems.
Your attempts variable is always less than 5 because there is no increment, so your loop is infinite.
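A hypothetical reconstruction of the kind of loop the answer describes (do_something is a made-up stand-in for the real Selenium action):

def do_something():
    # stand-in for the real Selenium step (click, wait, page load, ...)
    raise TimeoutError("simulated failure")

attempts = 0
while attempts < 5:
    try:
        do_something()
        break              # success: leave the loop
    except Exception:
        attempts += 1      # the missing increment; without this line the
                           # condition never changes and the loop never ends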

Program, that chooses the best out of 10

I need to make a program in Python that chooses cars from an array filled with the masses of 10 cars. The idea is that it fills a barge that can hold ~8 tonnes as effectively as possible, so that minimal space is left unfilled. My idea is to generate combinations of the masses and choose the one closest to the maximum weight, but since I'm new to algorithms I don't have a clue how to do it.
I'd solve this exercise with dynamic programming. You should be able to get the optimal solution in O(m*n) operations (n being the number of cars, m being the total mass).
That will only work, however, if the masses are all integers.
In general you have a binary linear programming problem. Those are very hard in general (NP-complete).
However, both ways lead to algorithms which I wouldn't consider beginner material. You might be better off with trial and error (as you suggested) or simply trying every possible combination. A sketch of the dynamic-programming route is shown below.
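A minimal sketch of that dynamic-programming idea, assuming integer masses (the example masses and capacity below are made up):

def best_load(masses, capacity):
    # best[w] is True if some subset of the cars weighs exactly w;
    # choice[w] remembers one such subset as a list of car indices
    best = [True] + [False] * capacity
    choice = [[] for _ in range(capacity + 1)]
    for i, m in enumerate(masses):
        for w in range(capacity, m - 1, -1):  # descend so each car is used once
            if best[w - m] and not best[w]:
                best[w] = True
                choice[w] = choice[w - m] + [i]
    w = max(w for w in range(capacity + 1) if best[w])
    return w, choice[w]

masses = [1, 2, 1, 3, 1, 2, 2, 4, 1, 2]  # assumed example masses in tonnes
print(best_load(masses, 8))              # -> (8, [indices of chosen cars])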
This is a 1D bin-packing problem. It's NP-hard, so there is no known efficient algorithm that is guaranteed to find the optimal solution. However, there is a way to get a good solution with a greedy algorithm. Most likely you want to try my bin-packing solver at phpclasses.org (bin-packing).
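For comparison, a minimal greedy sketch in the spirit of this answer (sort by descending weight, load whatever still fits); it is fast but not guaranteed optimal, and the weights are the same assumed example values:

weights = [1, 2, 1, 3, 1, 2, 2, 4, 1, 2]  # assumed example weights in tonnes
capacity = 8

load, chosen = 0, []
for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
    if load + weights[i] <= capacity:
        load += weights[i]
        chosen.append(i)
print(load, chosen)  # greedy result: indices of the loaded cars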
If I have an unweighted and undirected graph where every node is connected to every other node, then I have (n^2-n)/2 pairs of nodes and overall n^2-n possibilities/combinations:
1,2,3,4,5,...,64
2,1,X,X,X,...,X
3,X,1,X,X,...,X
4,X,X,1,X,...,X
5,X,X,X,1,...,X
.,X,X,X,X,1,.,X
.,X,X,X,X,X,1,X
64,X,X,X,X,X,X,1
Isn't this the same with 10 cars (45 pairs of cars and 90 combinations/possibilities)? Did I forget something? Where is the error?
A problem like you have here is similar to the classic traveling salesman problem, which asks for the most efficient way for a salesman to visit a list of cities. The difference is that it is conceivable that you might not need every car to fill the barge, whereas the salesman must visit every city. But the problem is similar. The brute-force way to solve the problem is to investigate every possible combination of cars, from 1 car to all 10. We will assume that it is valid to have any number of each car (i.e. if car 2 is a Ford Focus, you could have three Ford Foci). This is easy to change if the car list is an exact list of specific cars, however, and you can use only 1 of each.
Now, this quickly begins to consume a lot of time. As the number of cars goes up, the number of possible combinations of cars goes up exponentially, which means that with a number smaller than you expect, it will take longer to run the program than there is time left in your life. 10 should be manageable, though (it turns out to be about 185,000 combinations if repeats are allowed, or 1,023 if you can have only one of each item).
The first thing is to define the weight of each car and the maximum weight the barge can carry.
weights = [1, 2, 1, 3, 1, 2, 2, 4, 1, 2]  # one mass per car, in tonnes
capacity = 8
Now we need some way to find each possible combination. Python's itertools module has a function that will give us every combination of a given length, but we want all lengths. So we will write one loop that goes from 1 to 10 and calls itertools.combinations_with_replacement for each length. We can then find out the total weight of each combination, and if it is higher than any weight we have already found, yet still within capacity, we will remember it as the best we have found so far.
The only real trick here is that we don't want combinations of the weights -- we want combinations of the indexes of the weights, because at the end we want to know which cars to put on the barge, not their weights. So combinations_with_replacement(range(10), ...) rather than combinations_with_replacement(weights, ...). Inside the loop you will want to get the weight of each car in the combination with weights[i] to sum them up.
I originally had code here, but took it out since this is homework. :-) (It wasn't originally tagged as such, but I should have known -- I blame the time change.)
If you wanted to allow only one of each car, you'd use itertools.combinations instead of combinations_with_replacement.
A shortcut is possible since you mention elsewhere that cars weigh from 1-2 tonnes. This means that you know you will need at least 4 of them (4 * 2 = 8), so you can skip all the combinations of 1-3 cars. However, this wouldn't generalize well if the prof changed the parameters on you.
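For reference, a minimal sketch of the combination search described above (the weights and capacity are the assumed example values; swap in itertools.combinations for the one-of-each-car variant):

from itertools import combinations_with_replacement

weights = [1, 2, 1, 3, 1, 2, 2, 4, 1, 2]  # assumed example weights in tonnes
capacity = 8

best_total, best_combo = 0, ()
for length in range(1, len(weights) + 1):
    # combinations of INDEXES, so we know which cars to load at the end
    for combo in combinations_with_replacement(range(len(weights)), length):
        total = sum(weights[i] for i in combo)
        if best_total < total <= capacity:
            best_total, best_combo = total, combo

print(best_total, best_combo)  # car indices, not their weights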
