Select array sub-sample that optimizes a function - python

I have an array of N elements and a function func() that takes as input M unique elements from the array, with M<N. I need to find the subset M* that maximizes my func().
I can't use an exhaustive search (i.e., testing every possible subset of M elements that can be drawn from the N elements in the array) because the total number of combinations is too large even for modest values of N and M.
I can't use any of the usual scipy optimization algorithms (at least none that I am aware of) since I'm not working with a continuous, or even a discrete, parameter; rather I'm trying to find a subset of elements that maximizes my function.
Is there some Python algorithm in any package that could help me with this?
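One practical heuristic, short of an exhaustive search, is a local search over subsets: start from a random M-element subset and repeatedly try swapping one chosen element for an unchosen one, keeping any swap that improves the score. A minimal sketch (arr and func here are placeholders for your array and objective, not names from any package):

import random

def hill_climb_subset(arr, M, func, iters=10_000, seed=None):
    """Heuristic search for an M-element subset of arr with a high func() value.

    Simple local search: start from a random subset and repeatedly try swapping
    one chosen element for one unchosen element, keeping improving swaps.
    """
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(arr)), M))        # indices currently in the subset
    best = func([arr[i] for i in idx])
    for _ in range(iters):
        out_i = rng.choice(tuple(idx))               # element to drop
        in_i = rng.choice([i for i in range(len(arr)) if i not in idx])  # element to add
        candidate = (idx - {out_i}) | {in_i}
        score = func([arr[i] for i in candidate])
        if score > best:                             # keep only improving swaps
            idx, best = candidate, score
    return [arr[i] for i in idx], best

This only finds a local optimum, so in practice you would run it from several random starts (or anneal it) and keep the best result.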

Related

Can a genetic algorithm optimize my NP-complete problem?

I have an array that stores a large collection of elements.
Any two elements can be compared in some function with the result being true or false.
The problem is to find the largest, or at least a relatively large, subgroup in which every element is in a true relationship with all the others in that subgroup.
Finding the largest subgroup in an array of size N by brute force means checking on the order of 2^N candidate subgroups, so the exhaustive way is out.
Randomly adding successive matching elements works, but the resulting subgroups are too small.
Can this problem be significantly optimised using a genetic algorithm and thus find much larger subgroups in a reasonable time?
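This is essentially the maximum clique problem (elements are vertices, a true comparison is an edge), which is NP-hard, so heuristics are the usual route. Before reaching for a genetic algorithm, a restart-based version of the "randomly adding matching elements" idea from the question is worth trying as a baseline; a sketch, with elements assumed to be a list and compare() standing in for the true/false relation:

import random

def greedy_subgroup(elements, compare, restarts=100, seed=None):
    """Greedy heuristic for a large subgroup where compare(a, b) is True for every pair.

    Each restart seeds the subgroup from a different shuffled order and greedily
    adds any element that is compatible with everything chosen so far.
    """
    rng = random.Random(seed)
    best = []
    for _ in range(restarts):
        order = list(elements)
        rng.shuffle(order)
        group = []
        for e in order:
            if all(compare(e, g) for g in group):   # compatible with the whole group?
                group.append(e)
        if len(group) > len(best):
            best = group
    return best

A genetic algorithm can then work on the same representation (membership vectors), using the size of the largest consistent subgroup as its fitness.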

Finding the combinations of variables to maximize (or minimize) a function

I have a 2D list. Consider it to be possibleVals. Each sub list within this 2D list contains possible values for a given variable.
i.e., if the 2D list contains m 1D lists, there are m variables, with the i-th variable having its possible values contained in possibleVals[i].
I want to find the optimal combination of all m variables to maximize a certain function based on the inputs. I understand we could enumerate all the combinations and evaluate them one by one, but that is very time consuming, especially as m grows. I was wondering whether there is a solution for this using machine learning (neural nets), as it has similar behaviour (of course involving minimizing a certain value).
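Short of machine learning, one cheap baseline is coordinate-wise hill climbing over the index chosen for each variable; a sketch, where possibleVals is the 2D list from the question and f is a placeholder for the function being maximized:

import random

def coordinate_search(possibleVals, f, sweeps=50, seed=None):
    """Hill climb over one variable at a time.

    choice[i] is an index into possibleVals[i]; each sweep tries every
    alternative value for every variable and keeps any change that raises f.
    """
    rng = random.Random(seed)
    choice = [rng.randrange(len(vals)) for vals in possibleVals]
    best = f([vals[c] for vals, c in zip(possibleVals, choice)])
    for _ in range(sweeps):
        improved = False
        for i, vals in enumerate(possibleVals):
            for j in range(len(vals)):
                if j == choice[i]:
                    continue
                trial = choice[:]
                trial[i] = j
                score = f([v[c] for v, c in zip(possibleVals, trial)])
                if score > best:
                    choice, best, improved = trial, score, True
        if not improved:           # local optimum reached
            break
    return [vals[c] for vals, c in zip(possibleVals, choice)], best

Like any local search this can get stuck, so several random restarts (or a genetic algorithm over the index vector) are the usual next step when m is large.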

Markov chains in Python - filling the transition matrix from another array and hitting probabilities

Python or R (preferably Python, since I am better with arrays and loops there)
Suppose we have a two dimensional array A (of n^2 rows and n^2 columns) whose entries we have already filled in with for loops, using some complicated rules.
Now I want to create a Markov chain on n^2 states with transition matrix A and then compute the following hitting probability: the probability that, starting from (2,3), say, I ever reach a state corresponding to an integer which is a multiple of n.
How is this possible? Are there some nice library/package functions for this?
Update: I am also fine with not formally creating a chain but just a complicated system of equations which finds the hitting probability I am chasing.
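No special Markov chain library is needed for the hitting probability itself; it is the solution of a linear system, which NumPy can handle directly. A sketch, assuming A is the n^2-by-n^2 transition matrix as a NumPy array and target is the set of state indices corresponding to multiples of n (how a pair like (2,3) maps to a row index depends on your own indexing convention):

import numpy as np

def hitting_probabilities(A, target):
    """Probability of ever hitting `target` from each state of the chain with transition matrix A.

    Solves h_i = 1 for i in target, and h_i = sum_j A[i, j] * h_j for the rest,
    i.e. (I - Q) h_rest = b, where Q restricts A to non-target states and
    b is the one-step probability of jumping straight into the target.
    """
    n_states = A.shape[0]
    target = set(target)
    rest = [i for i in range(n_states) if i not in target]
    h = np.ones(n_states)                                # h = 1 on the target set
    Q = A[np.ix_(rest, rest)]                            # transitions among non-target states
    b = A[np.ix_(rest, sorted(target))].sum(axis=1)      # one-step jump into the target
    h[rest] = np.linalg.solve(np.eye(len(rest)) - Q, b)
    return h

The returned vector gives the hitting probability from every starting state, so h[start] answers the question for the state corresponding to (2,3). This assumes (I - Q) is nonsingular; for chains where escape to a non-hitting part is possible, the minimal non-negative solution of the same system is what you want.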

Should I pickle primes in Python to produce primorials?

I produced an algorithm in Python to sieve the first n primes, then list the ordered pairs of the index n and the nth primorial p_n#.
Next I evaluate a function of n and p_n#; the objective is to determine whether the function f(n, p_n#) is monotonic, so the algorithm records where the sequence changes from rising to falling and vice versa. The code is listed here for what it's worth.
This is of course memory-intensive, and my PC can only cope with numbers up to around 2,000,000.
At any given point all I actually need is f(n-1), the ordered pair n,p_n#, the prime p_n (in order to quickly find the next prime), and a boolean indicating whether the sequence most recently rose or fell.
What are the best approaches to avoid storing a hundred thousand or more primes and primorials in memory while preserving speed?
I thought a first step would be to make a sieve that finds the one next prime above some given prime rather than every prime below some maximum. Then I can evaluate the next value of the function.
But I also wondered if it would be better to sieve batches of say 100 primes at a time. This could be supported by some "perpetual list" of ordered triples [n,p_n,p_n#] only containing n=100,200,300,... which I generate before runtime. Searching, I found the concept of "pickling" a list and wondered if this is the right scenario in which to use it, or is there a better way?
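One way to keep only the O(1) state described above is to generate primes incrementally and pickle a small checkpoint every so often, rather than storing (or pre-generating) whole lists. A sketch under those assumptions, using sympy's nextprime to step to the next prime (a segmented sieve would work just as well) and with f standing in for your function and the checkpoint filename chosen arbitrarily:

import pickle
from sympy import nextprime

def scan_monotonicity(n_max, f, checkpoint_path="primorial_state.pkl"):
    """Walk n, p_n and the primorial p_n# incrementally, keeping only running state.

    Only the previous f value and the current (n, p_n, primorial) are held in
    memory; the state is pickled every 100 primes so a long run can be resumed.
    """
    n, p, primorial = 1, 2, 2
    prev_f = f(n, primorial)
    rising = None
    changes = []                      # n values where the sequence switches direction
    while n < n_max:
        n += 1
        p = nextprime(p)
        primorial *= p
        cur_f = f(n, primorial)
        now_rising = cur_f > prev_f
        if rising is not None and now_rising != rising:
            changes.append(n)
        rising, prev_f = now_rising, cur_f
        if n % 100 == 0:              # periodic checkpoint instead of a stored list
            with open(checkpoint_path, "wb") as fh:
                pickle.dump((n, p, primorial, prev_f, rising, changes), fh)
    return changes

Pickling is then only used for the tiny resumable checkpoint, which is closer to its intended role than pickling a hundred thousand primes would be.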

Quick/easy array comparison algorithm without sharing data

I have two arrays generated by two different systems that are independent of each other. I want to compare their similarities by comparing only a few numbers generated from the arrays.
Right now, I'm only comparing min, max, and sums of the arrays, but I was wondering if there is a better algorithm out there? Any type of hashing algorithm would need to be insensitive to small floating point differences between the arrays.
EDIT: What I'm trying to do is to verify that two algorithms generate the same data without having to compare the data directly. So the algorithm should be sensitive to shifts in the data and relatively insensitive to small differences between each element.
I wouldn't try to reduce this to one number; just pass around a tuple of values, and write a close_enough function that compares the tuples.
For example, you could use (mean, stdev) as your value, and then define close_enough as "each array's mean is within 0.25 stdev of the other array's mean".
from statistics import mean, stdev

def mean_stdev(a):
    """Summarise an array by its mean and standard deviation."""
    return mean(a), stdev(a)

def close_enough(mean_stdev_a, mean_stdev_b):
    """True if each array's mean is within 0.25 stdev of the other's."""
    mean_a, stdev_a = mean_stdev_a
    mean_b, stdev_b = mean_stdev_b
    diff = abs(mean_a - mean_b)
    return diff < 0.25 * stdev_a and diff < 0.25 * stdev_b
Obviously the right value is something you want to tune based on your use case. And maybe you actually want to base it on, e.g., variance (square of stdev), or variance and skew, or stdev and sqrt(skew), or some completely different normalization besides arithmetic mean. That all depends on what your numbers represent, and what "close enough" means.
Without knowing anything about your application area, it's hard to give anything more specific. For example, if you're comparing audio fingerprints (or DNA fingerprints, or fingerprint fingerprints), you'll want something very different from if you're comparing JPEG-compressed images of landscapes.
In your comment, you say you want to be sensitive to the order of the values. To deal with this, you can generate some measure of how "out-of-order" a sequence is. For example:
diffs = [a - b for a, b in zip(seq, sorted(seq))]
This gives you the difference between each element and the element that would be there in sorted position. You can build a stdev-like measure out of this (square each value, average, sqrt), or take the mean absolute diff, etc.
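For instance, an RMS version of that measure (just one way to collapse the diffs into a single number, as described above) could be:

from math import sqrt

def disorder_rms(seq):
    """Root-mean-square difference between each element and the element in its sorted position."""
    diffs = [a - b for a, b in zip(seq, sorted(seq))]
    return sqrt(sum(d * d for d in diffs) / len(diffs))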
Or you could compare how far away the actual index is from the "right" index. Or how far the value is from the value expected at its index based on the mean and stdev. Or… there are countless possibilities. Again, which is appropriate depends heavily on your application area.
Depends entirely on your definition of "compare their similarities".
What features do you want to compare? What features can you identify? Are there identifiable patterns? E.g., in this set there are 6 critical points, there are 2 discontinuities, etc.
You've already mentioned comparing the min/max/sum; and means and standard deviations have been talked about in comments too. These are all features of the set.
Ultimately, you should be able to take all these features and make an n-dimensional descriptor. For example [min, max, mean, std, etc...]
You can then compare these n-dimensional descriptors to define whether one is "less", "equal" or "more" than the other. If you want to classify other sets into whether they are more like "set A" or more like "set B", you could look into classifiers.
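As a concrete illustration of such a descriptor (the choice of features here is just an example), you could compare the two arrays through feature vectors with a relative tolerance:

import numpy as np

def descriptor(a):
    """n-dimensional feature vector summarising an array."""
    a = np.asarray(a, dtype=float)
    return np.array([a.min(), a.max(), a.mean(), a.std(), a.sum()])

def similar(a, b, rtol=1e-3, atol=1e-8):
    """True if the two arrays' descriptors agree within the given tolerances."""
    return np.allclose(descriptor(a), descriptor(b), rtol=rtol, atol=atol)

The tolerances control how insensitive the comparison is to small floating point differences, while features like the mean and sum stay sensitive to genuine shifts in the data.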
See:
Classifying High-Dimensional Patterns Using a Fuzzy Logic
Support Vector Machines
