Quick/easy array comparison algorithm without sharing data - python

I have two arrays generated by two different systems that are independent of each other. I want to compare their similarities by comparing only a few numbers generated from the arrays.
Right now, I'm only comparing min, max, and sums of the arrays, but I was wondering if there is a better algorithm out there? Any type of hashing algorithm would need to be insensitive to small floating point differences between the arrays.
EDIT: What I'm trying to do is to verify that two algorithms generate the same data without having to compare the data directly. So the algorithm should be sensitive to shifts in the data and relatively insensitive to small differences between each element.

I wouldn't try to reduce this to one number; just pass around a tuple of values, and write a close_enough function that compares the tuples.
For example, you could use (mean, stdev) as your value, and then define close_enough as "each array's mean is within 0.25 stdev of the other array's mean".
from statistics import mean, stdev

def mean_stdev(a):
    return mean(a), stdev(a)

def close_enough(mean_stdev_a, mean_stdev_b):
    mean_a, stdev_a = mean_stdev_a
    mean_b, stdev_b = mean_stdev_b
    diff = abs(mean_a - mean_b)
    return diff < 0.25 * stdev_a and diff < 0.25 * stdev_b
Obviously the right value is something you want to tune based on your use case. And maybe you actually want to base it on, e.g., variance (square of stdev), or variance and skew, or stdev and sqrt(skew), or some completely different normalization besides arithmetic mean. That all depends on what your numbers represent, and what "close enough" means.
Without knowing anything about your application area, it's hard to give anything more specific. For example, if you're comparing audio fingerprints (or DNA fingerprints, or fingerprint fingerprints), you'll want something very different from if you're comparing JPEG-compressed images of landscapes.
In your comment, you say you want to be sensitive to the order of the values. To deal with this, you can generate some measure of how "out-of-order" a sequence is. For example:
diffs = [elem[0] - elem[1] for elem in zip(seq, sorted(seq))]
This gives you the difference between each element and the element that would be there in sorted position. You can build a stdev-like measure out of this (square each value, average, sqrt), or take the mean absolute diff, etc.
Or you could compare how far away the actual index is from the "right" index. Or how far the value is from the value expected at its index based on the mean and stdev. Or… there are countless possibilities. Again, which is appropriate depends heavily on your application area.
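As a rough sketch of the stdev-like version of that measure (the function name out_of_order_score is mine, not the answer's):

from math import sqrt

def out_of_order_score(seq):
    # Difference between each element and the element that would be
    # there in sorted position.
    diffs = [x - y for x, y in zip(seq, sorted(seq))]
    # stdev-like measure: square each value, average, take the root.
    return sqrt(sum(d * d for d in diffs) / len(diffs))

A sequence that is already sorted scores 0; the more its elements are out of order, the larger the score.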

Depends entirely on your definition of "compare their similarities".
What features do you want to compare? What features can you identify? Are there identifiable patterns? For example: in this set there are 6 critical points, there are 2 discontinuities, and so on.
You've already mentioned comparing the min/max/sum; and means and standard deviations have been talked about in comments too. These are all features of the set.
Ultimately, you should be able to take all these features and make an n-dimensional descriptor. For example [min, max, mean, std, etc...]
You can then compare these n-dimensional descriptors to define whether one is "less", "equal" or "more" than the other. If you want to classify other sets into whether they are more like "set A" or more like "set B", you could look into classifiers.
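A minimal sketch of that descriptor idea (the feature set and the tolerances here are arbitrary placeholders, not a recommendation):

import numpy as np

def descriptor(a):
    a = np.asarray(a, dtype=float)
    return np.array([a.min(), a.max(), a.mean(), a.std()])

def similar(a, b, tol=np.array([1.0, 1.0, 0.5, 0.5])):
    # "Similar" here means every feature differs by less than its
    # tolerance; tune the tolerances for your data.
    return bool(np.all(np.abs(descriptor(a) - descriptor(b)) < tol))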
See:
Classifying High-Dimensional Patterns Using a Fuzzy Logic
Support Vector Machines

Related

Efficiently find sets of pairs of points with similar differences?

I am trying to automatically extract analogies from a word2vec model in Python. My basic approach is as follows:
Enumerate all of the pairs of vectors (n^2) and get their difference.
For each difference, add it to every vector (n^3) and find the closest match to the result (n^4).
Subtract the difference vector from the closest match and see if we get back to the original test vector, to verify that we have a genuine relationship.
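A rough numpy sketch of the per-candidate check described in the last two steps (the unit-normalized matrix vecs and the tolerance are my own assumptions, not from the question):

import numpy as np

def check_analogy(vecs, a, b, c, tol=1e-2):
    # vecs: (vocab_size, dim) array of unit-normalized word vectors (assumed).
    diff = vecs[b] - vecs[a]
    target = vecs[c] + diff
    # Closest in-model vector by cosine similarity.
    sims = vecs @ (target / np.linalg.norm(target))
    d = int(np.argmax(sims))
    # Subtract the difference from the match and check we land back near c.
    return d if np.linalg.norm((vecs[d] - diff) - vecs[c]) < tol else None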
There are some things that can be done to speed this up a bit; if adding a difference to a test vector produces a result that's way off of the unit hypersphere, the closest in-model vector is probably spurious, so we can skip that. And once a relationship has been found, we can skip later re-testing similar differences between all of the pairs that we already added to that relation. But it's still excruciatingly slow!
I know this brute search works in principle, as, having run it for about 12 hours, it does manage to automatically discover analogy sets like son:grandson::daughter:granddaughter and less-obvious-but-it-checks-out-when-I-google-the-words ones like scinax:oreophryne::amalda:gymnobela. But it takes between several seconds and a few minutes to check every candidate difference, and with over 4 billion vector differences in a model with a 90-ish-thousand-word vocabulary... that will take millions of hours!
So, is there any way to speed this up? Is there a non-brute-force solution to finding natural clusters of similar differences between vectors that might represent coherent analogy sets?

Is there a form of lazy evaluation where a function (like mean) returns an approximate value when operating on arrays

For example, suppose we want to calculate the mean of a list of numbers where the list is very long, and the numbers, when sorted, are nearly linear (or we can fit a linear regression model to the data). Mathematically we can then approximate the mean by
(arr[0] + arr[length(arr) - 1]) / 2
(equivalently, intercept + slope * (length(arr) - 1) / 2 for a regression of value against index). Or, when the linear model fits almost exactly (a fit coefficient close to 1), we can approximate the mean of the whole array by the mean of a regular subsample of it:
mean(arr[::const]) ≈ mean(arr)
The same idea applies in both cases, and it is quite basic (see the sketch below).
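A rough sketch of the two shortcuts described above, on made-up data:

import numpy as np

arr = np.sort(np.random.uniform(0.0, 100.0, size=1_000_000))  # nearly linear once sorted

approx_endpoints = (arr[0] + arr[-1]) / 2   # first/last-element shortcut
approx_subsample = arr[::1000].mean()       # mean of every 1000th element
exact = arr.mean()
print(exact, approx_endpoints, approx_subsample)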
Is there a pattern, a function (hopefully in Python), or any study you could suggest that would help here? Such a pattern, if it exists, should of course be general and not only for the mean case (ideally any function, or at least aggregate functions like sum, mean, and so on). As I don't have a strong mathematical background and I'm new to machine learning, please tolerate my ignorance.
Please let me know if anything is not clear.
The Law of Large Numbers states that as sample size increases, an average of a sample of observations converges to the true population average with probability 1.
Therefore, if your hypothetical array is too big to average, you could at the very least take the average of a large sample and know that you are close to the true population mean.
You can sample from a numpy array using numpy.random.choice(arr,n) where arr is your array and n is as many elements as you wish (or are able) to sample.
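For example (illustrative numbers only):

import numpy as np

arr = np.random.exponential(scale=2.0, size=10_000_000)
sample = np.random.choice(arr, 100_000)  # random sample of the array
print(arr.mean(), sample.mean())         # the two means should be close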
There are more general solutions for such jobs, like the Dask package, for example: http://dask.pydata.org/en/latest/
It can optimize computation graphs, parallelize computations, and more.
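A minimal Dask sketch of the same kind of computation (the chunk size here is chosen arbitrarily):

import numpy as np
import dask.array as da

arr = np.random.random(50_000_000)
x = da.from_array(arr, chunks=5_000_000)  # split into chunks that fit in memory
print(x.mean().compute())                 # evaluated lazily, chunk by chunk, in parallel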

Python - multi-dimensional clustering with thresholds

Imagine I have a dataset as follows:
[{"x":20, "y":50, "attributeA":90, "attributeB":3849},
{"x":34, "y":20, "attributeA":86, "attributeB":5000},
etc.
There could be a bunch more attributes in addition to these; this is just an example. What I am wondering is: how can I cluster these points based on all of the factors, with control over the maximum separation between one point and another, per variable, for them to be considered linked? (i.e. Euclidean distance must be within 10 points, attributeA within 5 points, and attributeB within 1000 points.)
Any ideas on how to do this in Python? As I implied above, I would like to use Euclidean distance to compare the positions of two points where possible, rather than treating x and y as separate attributes. The rest of the attributes would each be compared one-dimensionally, if that makes sense.
Edit: Just to add some clarity in case this doesn't make sense: basically, I am looking for an algorithm to compare all objects with each other (or some more efficient scheme). If all of object A's attributes and its Euclidean distance are within the specified thresholds when compared to object B, then those two are considered similar and linked. This procedure continues until eventually all the linked clusters can be returned; the clusters end up separated because no point in one cluster satisfies the similarity conditions with any point in another.
The simplest approach is to build a binary "connectivity" matrix.
Let a[i, j] be 0 exactly when your conditions are fulfilled, and 1 otherwise.
Then run hierarchical agglomerative clustering with complete linkage on this matrix. If you don't need every pair of objects in every cluster to satisfy your threshold, then you can also use other linkages.
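A rough sketch of that approach with scipy, using the thresholds and field names from the question's example (treat it as illustrative, not a drop-in solution):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

points = [{"x": 20, "y": 50, "attributeA": 90, "attributeB": 3849},
          {"x": 34, "y": 20, "attributeA": 86, "attributeB": 5000},
          {"x": 22, "y": 52, "attributeA": 88, "attributeB": 3900}]

n = len(points)
d = np.ones((n, n))
np.fill_diagonal(d, 0.0)
for i in range(n):
    for j in range(i + 1, n):
        a, b = points[i], points[j]
        euclid = np.hypot(a["x"] - b["x"], a["y"] - b["y"])
        ok = (euclid <= 10
              and abs(a["attributeA"] - b["attributeA"]) <= 5
              and abs(a["attributeB"] - b["attributeB"]) <= 1000)
        d[i, j] = d[j, i] = 0.0 if ok else 1.0

# Complete linkage on the binary "connectivity" distances; cutting at 0.5
# keeps only clusters in which every pair satisfies all three thresholds.
labels = fcluster(linkage(squareform(d), method="complete"), t=0.5, criterion="distance")
print(labels)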
This isn't the best solution: the distance matrix needs O(n²) memory and time, and the clustering even O(n³), but it is the easiest to implement. Computing the distance matrix in Python code will be really slow unless you can avoid all loops and have e.g. numpy do most of the work. To improve scalability, you should consider DBSCAN and a data index.
It's fairly straightforward to replace the three different thresholds with weights, so that you can get a continuous distance; likely even a metric. Then you could use data indexes, and try out OPTICS.

Classify into one of two sets (without learning)

I am dealing with a problem where I would like to automatically divide a set into two subsets, knowing that ALMOST ALL of the objects in set A will have greater values in all of the dimensions than the objects in set B.
I know I could use machine learning, but I need it to be fully automated, since in different instances of the problem the objects of set A and set B will have different values (so values in set B of problem instance 2 might be greater than values in set A of problem instance 1!).
I imagine the solution could be something like finding objects which are the best representatives of those two sets (the density of the objects around them is the highest).
Finding N best representatives of both sets would be sufficient for me.
Does anyone know the name of the problem and/or could propose the implementation for that? (Python is preferable).
Cheers!
You could try some of the clustering methods, which belong to unsupervised machine learning. The result depends on your data and how they are distributed. Judging by your picture, I think the k-means algorithm could work. There is a Python library for machine learning, scikit-learn, which already contains a k-means implementation: http://scikit-learn.org/stable/modules/clustering.html#k-means
If your data is as easy as you explained, then there are some rather obvious approaches.
Center and count:
Center your data set, and count for each object how many values are positive. If more values are positive than negative, it will likely be in the red class.
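A tiny sketch of that idea, assuming the objects sit in a numpy array X with one object per row:

import numpy as np

def center_and_count(X):
    centered = X - X.mean(axis=0)            # center each dimension
    positives = (centered > 0).sum(axis=1)   # positive coordinates per object
    return positives > X.shape[1] / 2        # True -> likely the higher-valued set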
Length histogram:
Compute the sum of each vector. Make a histogram of these values and split at the largest gap: vectors whose sum is above the threshold go in one group, the others in the lower group.
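And a sketch of the length-histogram split, again with one object per row of X:

import numpy as np

def split_by_sum(X):
    sums = X.sum(axis=1)
    order = np.sort(sums)
    gap = int(np.argmax(np.diff(order)))       # position of the largest gap
    threshold = (order[gap] + order[gap + 1]) / 2
    return sums > threshold                    # True -> the "longer vectors" group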
I have made an IPython notebook demonstrating this approach available.

Find a random method that best fits a list of values

I have a list of many float numbers, representing the length of an operation made several times.
For each type of operation, I have a different trend in numbers.
I'm aware of many random generators presented in some python modules, like in numpy.random
For example, there are binomial, exponential, normal, Weibull, and so on...
I'd like to know if there's a way to find, for each list of numbers that I have, the random generator that best fits it,
i.e. the generator (with its parameters) that best fits the trend of the numbers in the list.
That's because I'd like to automate the generation of time lengths for each operation, so that I can simulate it over n years, without having to find by hand which method best fits which list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers. I'm trying to find the probability distribution that best fits the array of numbers I already have. The only problem I see is that each probability distribution has input parameters that affect the result, so I'll have to figure out how to choose these parameters automatically so that they best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
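A minimal sketch of such a Q-Q plot with numpy and matplotlib; the exponential candidate and the durations.txt file name are placeholders:

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("durations.txt")  # your measured operation lengths (hypothetical file)
sample = np.random.exponential(scale=data.mean(), size=len(data))

plt.scatter(np.sort(sample), np.sort(data), s=5)
plt.plot([data.min(), data.max()], [data.min(), data.max()])  # reference line
plt.xlabel("candidate distribution quantiles")
plt.ylabel("observed quantiles")
plt.show()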
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT Adding reference to numpy least squares fitting example
Given a parameterized univariate distribution (e.g. the exponential depends on lambda, and the gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called maximum likelihood estimation. It is not a least squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of the parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results by expressing the log likelihood of your data set in terms of the parameters, setting its first derivative to zero to maximize it, and using the inverse of the (negated) curvature matrix at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log likelihood.
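With scipy this can be sketched roughly as follows: fit performs maximum likelihood estimation, and the resulting log likelihoods can then be compared directly (the candidate list is just an example):

import numpy as np
from scipy import stats

data = np.random.exponential(scale=2.0, size=1000)  # stand-in for your measurements

candidates = {"expon": stats.expon, "gamma": stats.gamma,
              "norm": stats.norm, "weibull_min": stats.weibull_min}
loglik = {}
for name, dist in candidates.items():
    params = dist.fit(data)                        # maximum likelihood estimates
    # For positive data it is often sensible to pin the location: dist.fit(data, floc=0)
    loglik[name] = dist.logpdf(data, *params).sum()

best = max(loglik, key=loglik.get)
print(best, loglik[best])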
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is easy to figure out what the maximum likelihood method says the parameter should be:
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the maximum likelihood estimate for the exponential distribution solves sum_i (1/la - x_i) = 0, which gives la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
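Translated to Python, the exponential case works out to a one-liner (the sample here is just a stand-in):

import numpy as np

data = np.random.exponential(scale=1.0, size=10)  # stand-in sample
la = 1.0 / np.mean(data)                          # maximum likelihood estimate of la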
