I am trying to automatically extract analogies from a word2vec model in Python. My basic approach is as follows:
Enumerate all of the pairs of vectors (n^2) and get their difference.
For each difference, add it to every vector (n^3) and find the closest match to the result (n^4).
Subtract the difference vector from the closest match and see if we get back to the original test vector, to verify that we have a genuine relationship.
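For reference, here is a minimal numpy sketch of that brute-force loop as I understand the steps above (my own illustration, not the asker's code; variable names and the toy data at the end are made up, and only the innermost nearest-neighbour search is vectorised):

import numpy as np

def brute_force_analogies(vecs, words, threshold=0.9):
    """Exhaustive sketch: test every pair difference against every other vector."""
    n = len(vecs)
    found = []
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            diff = vecs[b] - vecs[a]                    # candidate relationship a -> b
            shifted = vecs + diff                       # apply it to every vector at once
            # argmax of the dot product against unit vectors == argmax of cosine similarity
            nearest = (shifted @ vecs.T).argmax(axis=1)
            for c in range(n):
                d = nearest[c]
                if d in (a, b, c):
                    continue
                back = vecs[d] - diff                   # reverse check: do we land back near c?
                if back @ vecs[c] / (np.linalg.norm(back) + 1e-12) > threshold:
                    found.append((words[a], words[b], words[c], words[d]))
    return found

# Tiny random example just to show the call shape; a real run would pass the model's
# unit-normalised vectors (e.g. gensim KeyedVectors.vectors) and its vocabulary.
rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 8))
toy /= np.linalg.norm(toy, axis=1, keepdims=True)
print(len(brute_force_analogies(toy, [f"w{i}" for i in range(20)])))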
There are some things that can be done to speed this up a bit; if adding a difference to a test vector produces a result that's way off of the unit hypersphere, the closest in-model vector is probably spurious, so we can skip that. And once a relationship has been found, we can skip later re-testing similar differences between all of the pairs that we already added to that relation. But it's still excruciatingly slow!
I know this brute search works in principle, as, having run it for about 12 hours, it does manage to automatically discover analogy sets like son:grandson::daughter:granddaughter and less-obvious-but-it-checks-out-when-I-google-the-words ones like scinax:oreophryne::amalda:gymnobela. But it takes between several seconds and a few minutes to check every candidate difference, and with over 4 billion vector differences in a model with a 90-ish-thousand-word vocabulary... that will take millions of hours!
So, is there any way to speed this up? Is there a non-brute-force solution to finding natural clusters of similar differences between vectors that might represent coherent analogy sets?
I have a problem that is rather simple to define, but I have not found a simple answer so far.
I have two graphs (i.e. sets of vertices and edges) which are identical, but each of them has independently labelled vertices. Look at the example below:
How can the computer detect, without prior knowledge of it, that 1 is identical to 9, 2 to 10 and so on?
Note that in the case of symmetry there may be several possible one-to-one pairings which give complete equivalence, but finding just one of them is sufficient for me.
This is in the context of a Python implementation. Does someone have a pointer towards a simple algorithm publicly available on the Internet? The problem sounds simple, but I lack the mathematical knowledge to come up with a solution myself or to find the proper keywords to search for.
EDIT: Note that I also have atom types (i.e. labels) for each graph, as well as the full distance matrix for the two graphs to align. However, the positions may be similar but not exactly equal.
This is known as the graph isomorphism problem, which is probably very hard; the exact details of how hard are still a subject of research.
(Things look better if your graphs are planar, though.)
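That said, for molecule-sized graphs an off-the-shelf VF2 matcher is often fast enough in practice. A minimal sketch using networkx (my addition, not part of the original answer; the two toy graphs are made up):

import networkx as nx
from networkx.algorithms import isomorphism

# Two copies of the same 4-cycle with independently labelled vertices.
G1 = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1)])
G2 = nx.Graph([(9, 10), (10, 11), (11, 12), (12, 9)])

# A node_match function could additionally compare atom types stored as node attributes.
gm = isomorphism.GraphMatcher(G1, G2)
if gm.is_isomorphic():
    print(gm.mapping)   # one valid vertex pairing, e.g. {1: 9, 2: 10, 3: 11, 4: 12}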
So, after searching for a bit, I think I have found a solution that works most of the time for moderate computational cost. It is a kind of genetic algorithm that uses a bit of randomness, but it seems practical enough for my purposes. I haven't seen any aberrant configuration with my samples so far, even though it is theoretically possible that one occurs.
Here is how I proceeded:
1. Determine the complete set of 2-paths, 3-paths and 4-paths.
2. Determine vertex types using both atom type and surrounding topology, creating an "identity card" for each vertex.
3. Do the following ten times:
4. Start with a random candidate set of pairings complying with the allowed vertex types.
5. Evaluate how many of the 2-paths, 3-paths and 4-paths correspond between the two pairings, scoring one point for each corresponding vertex (also using the atom type as an additional descriptor).
6. Evaluate all other shortlisted candidates for a given vertex by permuting the pairings for this candidate with its other positions in the same way.
7. Sort the scores in descending order.
8. For each score, check whether the configuration is among the excluded configurations; if it is not, take it as the new configuration and add it to the excluded configurations.
9. If the score is perfect (i.e. all of the 2-paths, 3-paths and 4-paths correspond), stop the loop and calculate the sum of absolute differences between the distance matrices of the two graphs under the selected pairing; otherwise go back to 4.
10. Stop this process after it has been done 10 times.
11. Check the differences between the distance matrices and take the pairing associated with the minimal sum of absolute differences between them.
I was given a problem in which you are supposed to write Python code that distributes a number of different weights among 4 boxes.
Logically we can't expect a perfect distribution: given weights like 10, 65, 30, 40, 50 and 60 kilograms, there is no way of grouping those numbers without making one box heavier than another. But we can aim for the most homogeneous distribution, e.g. ((60), (40, 30), (65), (50, 10)).
I can't even think of an algorithm to complete this task let alone turn it into python code. Any ideas about the subject would be appreciated.
The problem you're describing is similar to the "fair teams" problem, so I'd suggest looking there first.
Because a simple greedy algorithm where weights are added to the lightest box won't work, the most straightforward solution would be a brute force recursive backtracking algorithm that keeps track of the best solution it has found while iterating over all possible combinations.
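A minimal sketch of such a backtracking search (my own illustration, not code from any of the answers; it exhaustively tries every box for every weight and remembers the most even assignment seen):

def distribute(weights, boxes=4):
    """Brute-force backtracking over all assignments of weights to boxes, keeping the
    assignment whose heaviest-minus-lightest spread is smallest."""
    weights = sorted(weights, reverse=True)          # place big items first
    best_groups, best_spread = None, float("inf")
    loads = [0] * boxes
    groups = [[] for _ in range(boxes)]

    def backtrack(i):
        nonlocal best_groups, best_spread
        if i == len(weights):
            spread = max(loads) - min(loads)
            if spread < best_spread:
                best_spread, best_groups = spread, [list(g) for g in groups]
            return
        for b in range(boxes):
            loads[b] += weights[i]
            groups[b].append(weights[i])
            backtrack(i + 1)
            groups[b].pop()
            loads[b] -= weights[i]
            if loads[b] == 0:        # other still-empty boxes are symmetric, so skip them
                break

    backtrack(0)
    return best_groups

print(distribute([10, 65, 30, 40, 50, 60]))   # box loads end up as 70/65/60/60 for this input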
As stated in j_random_hacker's response, this is not going to be something easily done. My best idea right now is to find a baseline. I define the baseline as the object with the largest value, since it cannot be subdivided. Using that, you can start trying to match the rest of the data to that value, which would only take about three passes: the first and second build a list of every possible combination, and the third goes over that list and compares the options by taking the average of each group and keeping the one whose average is closest to your baseline.
Using your example, 65 is the baseline, and since you cannot subdivide it you know it has to be the minimum bound on your data grouping, so you would try to match all of the rest of the values to it. It won't be great, but it does give you something to start with.
As j_random_hacker notes, the partition problem is NP-complete. This problem is also NP-complete by a reduction from the 4-partition problem (the article also contains a link to a paper by Garey and Johnson that proves that 4-partition itself is NP-complete).
In particular, given a list you want to 4-partition, you could feed that list as input to a function that solves your box-distribution problem: if each box ended up with the same weight in it, a 4-partition would exist; otherwise it would not.
Your best bet is an exponential-time algorithm that uses backtracking to iterate over the 4^n possible assignments, because unless P = NP (highly unlikely), no polynomial-time algorithm exists for this problem.
I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric and some are categorical, and there are occasional missing values. It is essential that the clustering is run on all data points, and we are looking to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all-numeric attributes on the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
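For example, those preprocessing steps might look roughly like this with scikit-learn (the toy frame and column names below are mine, purely illustrative; the real pipeline would keep ~10 components as suggested above):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the 4 million x 70 frame; column names are invented for illustration.
df = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 33.0],
    "income": [30000, np.nan, 52000, 41000],
    "region": ["north", "south", "north", np.nan],
})
numeric_cols, categorical_cols = ["age", "income"], ["region"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
], sparse_threshold=0.0)              # force a dense matrix so PCA can consume it

reduced = Pipeline([("prep", preprocess), ("pca", PCA(n_components=2))]).fit_transform(df)
print(reduced.shape)                  # (n_rows, 2): dense, scaled input for the clustering step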
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you: even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters, all the way up to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around with to figure out) because it runs in Java. ELKI's output is a little strange: it writes a file for every cluster, so you then have to loop through the files in Python and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
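If you want to stay in Python while prototyping, scikit-learn also ships an OPTICS implementation (slower than ELKI's; the snippet below uses synthetic data and untuned parameters only to show the call):

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-reduced matrix described above.
X, _ = make_blobs(n_samples=2000, centers=5, n_features=10, random_state=0)
labels = OPTICS(min_samples=10).fit_predict(X)   # label -1 marks points treated as noise
print(len(set(labels) - {-1}), "clusters found")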
I have a task to find similar parts based on numeric dimensions--diameters, thickness--and categorical dimensions--material, heat treatment, etc. I have a list of 1 million parts. My approach as a programmer is to put all parts on a list, pop off the first part, and use it as a new "cluster" against which to compare the rest of the parts on the list based on those dimensions. As a part on the list matches the categorical dimensions and the numerical dimensions--within 5 percent--I will add that part to the cluster and remove it from the initial list. Once all parts in the list have been compared with the initial cluster part's dimensions, I pop the next part off the list and start again, populating clusters until no parts remain on the original list. This is a programmatic approach, and I am not sure if it is the most efficient way of categorizing parts into "clusters", or if k-means clustering would be a better approach.
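For reference, a compact sketch of that pop-and-compare procedure (my own illustration; the part fields and the toy list are invented):

def leader_cluster(parts, tol=0.05):
    """Greedy grouping as described above: the first unassigned part seeds a cluster and
    absorbs every remaining part whose categorical fields match exactly and whose numeric
    fields are all within tol (5%) of the seed's values."""
    remaining = list(parts)                 # each part: (categorical_tuple, numeric_tuple)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        members, keep = [seed], []
        for part in remaining:
            same_cat = part[0] == seed[0]
            close_num = all(abs(p - s) <= tol * abs(s) if s != 0 else p == 0
                            for p, s in zip(part[1], seed[1]))
            (members if same_cat and close_num else keep).append(part)
        remaining = keep
        clusters.append(members)
    return clusters

parts = [(("steel", "annealed"), (10.0, 2.0)),
         (("steel", "annealed"), (10.3, 2.05)),
         (("brass", "untreated"), (10.0, 2.0))]
print(len(leader_cluster(parts)))           # 2 clusters for this toy list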
Define "better".
What you are doing seems to be related to "leader" clustering, which is a very primitive form of clustering that will usually not yield competitive results. But with 1 million points your choices are limited, and k-means does not handle categorical data well.
But until you decide what is 'better', there probably is nothing 'wrong' with your greedy approach.
An obvious optimization would be to first split all the data based on the categorical attributes (since you expect them to match exactly). That requires just one pass over the data set and a hash table. If the resulting partitions are small enough, you could try k-means (though how would you choose k?) or DBSCAN (probably using the same threshold you already have) on each partition.
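A small sketch of that split-then-cluster idea with pandas and scikit-learn (the frame, eps and min_samples below are invented for illustration, not tuned values):

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the 1M-part table; columns are illustrative.
parts = pd.DataFrame({
    "material":  ["steel", "steel", "brass", "steel"],
    "treatment": ["annealed", "annealed", "none", "annealed"],
    "diameter":  [10.0, 10.2, 10.0, 25.0],
    "thickness": [2.0, 2.1, 2.0, 5.0],
})

cluster_of = {}
# One pass over the categorical columns (the hash-table split), then a density-based
# pass on the numeric columns inside each bucket.
for key, bucket in parts.groupby(["material", "treatment"]):
    X = StandardScaler().fit_transform(bucket[["diameter", "thickness"]])
    ids = DBSCAN(eps=0.5, min_samples=1).fit_predict(X)
    for row, i in zip(bucket.index, ids):
        cluster_of[row] = (*key, i)

parts["cluster"] = pd.Series(cluster_of)
print(parts)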
I'm working on a problem and one solution would require an input of every 14x10 matrix that is possible to be made up of 1's and 0's... how can I generate these so that I can input every possible 14x10 matrix into another function? Thank you!
Added March 21: It looks like I didn't word my post appropriately. Sorry. What I'm trying to do is optimize the output of 10 different production units (given different speeds and amounts of downtime) for several scenarios. My goal is to place blocks of downtime so as to minimize the differences in production on a day-to-day basis. The amount and frequency of downtime each unit is allowed is given. I am currently trying to evaluate a three-week cycle, meaning every three weeks each production unit is taken down for a given number of hours. I was asking the computer to determine the order in which the units would be taken down, under the constraints that each line comes down only once every 3 weeks and the difference in daily production is as small as possible.
My first approach was to use Excel (as I tried to describe above) and it didn't work (no surprise there): 1 means running and 0 means off, and these are summed to calculate production. The calculated production is subtracted from a set maximum daily production. Then these differences were compared going from Mon-Tues, Tues-Wed, etc. for a three-week time frame and minimized using Solver. My next approach was to write a Matlab code where the input was a tolerance (the allowed day-to-day variation). Is there a program that already does this, or an approach that does it most easily? It seems simple enough, but I'm still thinking through the different ways to go about this. Any insight would be much appreciated.
The actual implementation depends heavily on how you want to represent matrices… But assuming the matrix can be represented by a 14 * 10 = 140 element list:
from itertools import product
for matrix in product([0, 1], repeat=140):
    # matrix is a 140-element tuple of 0s and 1s; reshape it as needed
    pass  # ... do stuff with the matrix ...
Of course, as other posters have noted, this probably isn't what you want to do… But if it really is what you want to do, that's the best code (given your requirements) to do it.
Generating every possible 14x10 matrix of 1's and 0's would produce 2^140 matrices. You don't have enough lifetime for this; I doubt the sun will still be shining by the time you finish. That is why it is impossible to generate all those matrices: it is pure brute force, and you must look for some other solution.
This is absolutely impossible! The number of possible matrices is 2^140, which is around 1.4e42. However, consider the following...
If you were to generate two 14-by-10 matrices at random, the odds that they would be the same are 1 in 1.4e42.
If you were to generate 1 billion unique 14-by-10 matrices, then the odds that the next one you generate would be the same as one of those would still be exceedingly slim: 1 in 1.4e33.
The default random number stream in MATLAB uses a Mersenne twister algorithm with a period of 2^19937-1. Therefore, the random number generator shouldn't start repeating itself any time this eon.
Your approach should be thus:
Find a computer no one ever wants to use again.
Give it as much storage space as possible to save your results.
Install MATLAB on it and fire it up.
Start computing matrices at random like so:
while true
    newMatrix = randi([0 1],14,10);
    %# Process the matrix and output your results to disk
end
Walk away
Since there are so many combinations, you don't have to compare newMatrix with any of the previous matrices; the expected time before a repeat occurs is astronomically large. Your processing is more likely to stop for other reasons first, such as (in order of likely occurrence):
You run out of disk space to store your results.
There's a power outage.
Your computer suffers a fatal hardware failure.
You pass away.
The Earth passes away.
The Universe dies a slow heat death.
NOTE: Although I injected some humor into the above answer, I think I have illustrated one useful alternative. If you simply want to sample a small subset of the possible combinations (where even 1 billion could be considered "small" given the sheer number of combinations), then you don't have to go through the extra time- and memory-consuming steps of saving all of the matrices you've already processed and comparing new ones to them to make sure you aren't repeating matrices. Since the odds of repeating a combination are so low, you could safely do this:
for iLoop = 1:whateverBigNumberYouWant
    newMatrix = randi([0 1],14,10); %# Generate a new matrix
    %# Process the matrix and save your results
end
Are you sure you want every possible 14x10 matrix? There are 140 elements in each matrix, and each element can be on or off. Therefore there are 2^140 possible matrices. I suggest you reconsider what you really want.
Edit: I noticed you mentioned in a comment that you are trying to minimize something. There is an entire mathematical field called optimization devoted to doing this type of thing. The reason this field exists is because quite often it is not possible to exhaustively examine every solution in anything resembling a reasonable amount of time.
Trying this:
import numpy
for i in xrange(int(1e9)): a = numpy.random.random_integers(0,1,(14,10))
(which is much, much, much smaller than what you require) should be enough to convince you that this is not feasible. It also shows you how to generate one, or a few, such random matrices; even generating up to a million of them is pretty fast.
EDIT: changed to xrange to "improve speed and memory requirements" :)
If you really do need to enumerate them, you don't have to hold them all in memory; a generator can yield each matrix on demand:
def everyPossibleMatrix(x, y):
    N = x * y
    for i in range(2 ** N):
        b = "{:0{}b}".format(i, N)  # binary string of length N, zero-padded
        yield '\n'.join(b[j * x:(j + 1) * x] for j in range(y))
Depending on what you want to accomplish with the generated matrices, you might be better off generating a random sample and running a number of simulations. Something like:
import numpy

matrix_samples = []
# generate 10 random 14x10 matrices of 0s and 1s
for i in range(10):
    sample = numpy.random.binomial(1, .5, 14 * 10)
    sample.shape = (14, 10)
    matrix_samples.append(sample)
You could do this a number of times to see how results vary across simulations. Of course, you could also modify the code to ensure that there are no repeats in a sample set, again depending on what you're trying to accomplish.
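For example, one simple way to enforce uniqueness (my own sketch; with 2^140 possibilities the check will essentially never trigger) is to keep a set of byte fingerprints:

import numpy as np

seen, samples = set(), []
while len(samples) < 10:
    m = np.random.binomial(1, .5, (14, 10))
    key = m.tobytes()               # hashable fingerprint of the matrix contents
    if key not in seen:             # skip the (astronomically unlikely) repeat
        seen.add(key)
        samples.append(m)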
Are you saying that you have a table with 140 cells and each value can be 1 or 0 and you'd like to generate every possible output? If so, you would have 2^140 possible combinations...which is quite a large number.
Instead of just saying that this is infeasible, I would suggest considering a scheme that samples an important subset of all possible combinations rather than applying a brute-force approach. As one of the replies noted, you are doing minimization. There are numerical techniques for this, such as simulated annealing and Monte Carlo sampling, as well as traditional minimization algorithms. You might want to look into whether one is appropriate in your case.
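As a flavour of what that could look like, here is a tiny simulated-annealing sketch over 0/1 matrices (my own illustration; the objective function at the bottom is a made-up stand-in, not the asker's real production criterion):

import numpy as np

def anneal(cost, shape=(14, 10), steps=20000, t0=1.0, seed=0):
    """Toy simulated annealing over 0/1 matrices; cost is whatever you want to minimise."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, shape)
    c = cost(x)
    best, best_c = x.copy(), c
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9                       # simple linear cooling schedule
        y = x.copy()
        y[rng.integers(shape[0]), rng.integers(shape[1])] ^= 1   # flip one random cell
        cy = cost(y)
        if cy < c or rng.random() < np.exp((c - cy) / t):        # accept better, or worse with some probability
            x, c = y, cy
            if c < best_c:
                best, best_c = x.copy(), c
    return best, best_c

# Hypothetical objective, standing in for the real scheduling criterion: make every row sum to 5.
best, best_cost = anneal(lambda m: int(np.abs(m.sum(axis=1) - 5).sum()))
print(best_cost)    # usually reaches 0 for this easy toy objective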
I was actually much more pessimistic to begin with, but consider:
from math import log, e

def timeInYears(totalOpsNeeded=2**140, currentOpsPerSecond=10**9, doublingPeriodInYears=1.5):
    secondsPerYear = 365.25 * 24 * 60 * 60
    doublingPeriodInSeconds = doublingPeriodInYears * secondsPerYear
    k = log(2, e) / doublingPeriodInSeconds  # time-proportionality constant
    timeInSeconds = log(1 + k * totalOpsNeeded / currentOpsPerSecond, e) / k
    return timeInSeconds / secondsPerYear
If we assume that computer processing power continues to double every 18 months, that you can currently do a billion combinations per second (optimistic, but for the sake of argument), and that you start today, your calculation will be complete on or about April 29th, 2137.
Here is an efficient way to get started in Matlab:
First generate all 1024 possible rows of length 10 containing only zeros and ones:
dec2bin(0:2^10-1)
Now you have all 1024 possible rows, and you can sample from them as you wish, for example by calling the following line a few times (it picks 14 distinct row indices, i.e. a random 14x10 matrix with no repeated rows):
randperm(1024,14)