Python: Alternative to quasi-random sequences

Hey, I have the following problem. I have a large parameter space; in my case it has about 10 dimensions, but to simplify let's assume I have three variables x1, x2 and x3. They are discrete numbers from 1 to 10. Normally I would create all possible parameter combinations and use them for postprocessing, but in my real case that is far too many combinations. So I want to do a quasi-random sequence search to reduce the search space, while the chosen combinations still cover it as well as possible (uniformly distributed). I want to prevent the parameter combinations from clustering; they should cover the whole search space as evenly as possible. I need that to find out which parameter combinations the processing prefers.
There are many approaches to do that, such as Halton, Hammersley or Sobol sequences, but they do not work for discrete numbers. One package that generates quasi-random sequences is chaospy. If I round the numbers of the sequences, individual values of a variable will occur more than once across the different combinations. That is not what I want: every value of each variable should occur only once, and the combinations should be uniformly distributed in the search space.
Is there a way to create, from the beginning, a random multi-dimensional set of variable combinations in which every value of each variable appears only once? For example, in a two-dimensional 10x10 grid one possible combination would be the diagonal. Of course, in 3 dimensions I would need 100 combinations to cover all parameter values.
Let's take a simplified example with three variables from 1 to 10 and a Sobol sequence:
import numpy as np
import chaospy as cp
# Create a joint distribution of the three variables, each ranging from 1 to 10
distribution2 = cp.J(cp.Uniform(1, 10), cp.Uniform(1, 10), cp.Uniform(1, 10))
# Create 10 samples in the variable space using a Sobol sequence
samplesSobol = distribution2.sample(10, rule="S")
# Transpose the array to get the variable combinations as subarrays
sobolPointsTranspose = np.transpose(samplesSobol)
Example Output:
[[ 7.89886475 6.34649658 4.8336792 ]
[ 5.64886475 4.09649658 2.5836792 ]
[ 1.14886475 8.59649658 7.0836792 ]
[ 1.21917725 5.01055908 2.5133667 ]
[ 5.71917725 9.51055908 7.0133667 ]
[ 7.96917725 2.76055908 9.2633667 ]
[ 3.46917725 7.26055908 4.7633667 ]
[ 4.59417725 1.63555908 5.8883667 ]
[ 9.09417725 6.13555908 1.3883667 ]
[ 6.84417725 3.88555908 3.6383667 ]]
Here every value is unique, but the output is not discrete. I can round it and get:
[[ 8. 6. 5.]
[ 6. 4. 3.]
[ 1. 9. 7.]
[ 1. 5. 3.]
[ 6. 10. 7.]
[ 8. 3. 9.]
[ 3. 7. 5.]
[ 5. 2. 6.]
[ 9. 6. 1.]
[ 7. 4. 4.]]
Now the problem is that, for example, 1 occurs twice in the first dimension, 4 twice in the second, and 7 twice in the third.

This is a very late answer, so I assume it is no longer relevant to the original poster, but I came across the post whilst trying to find an existing implementation of what I describe below.
It sounds like you are looking for something like a Latin Hypercube: https://en.wikipedia.org/wiki/Latin_hypercube_sampling.
Essentially, if I have n variables and I want 10 samples, then the range of each variable is split into 10 intervals and the possible values for each variable are (e.g.) the midpoints of those intervals. A Latin hypercube algorithm picks samples at random in such a way that each of the 10 values of each variable appears exactly once. The example in Warren's answer is a Latin hypercube.
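If a ready-made implementation is convenient, recent SciPy versions (this assumes scipy >= 1.7, which ships the scipy.stats.qmc module) include a Latin hypercube sampler; a minimal sketch that maps the unit-cube design onto the discrete values 1 to 10:

import numpy as np
from scipy.stats import qmc

# 10 samples in 3 dimensions on the unit cube; one sample per stratum and
# per dimension is the defining property of a Latin hypercube
sampler = qmc.LatinHypercube(d=3, seed=0)
unit_sample = sampler.random(n=10)   # shape (10, 3), values in [0, 1)

# Map each of the 10 strata per dimension onto the discrete values 1..10,
# so every value appears exactly once in every column
discrete = np.floor(unit_sample * 10).astype(int) + 1
print(discrete)

Because the hypercube places exactly one point in each of the 10 strata of every dimension, the floor-based mapping hits every discrete value exactly once per column, which simply rounding a scaled continuous sample does not guarantee.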
On its own this doesn't guarantee that the search space is covered as well as possible (in other words, that the design is space-filling). There is a criterion from Morris and Mitchell's 1995 paper "Exploratory designs for computational experiments" which measures how space-filling a sample is by looking at the distances between points. You can create a large number of different Latin hypercube designs and then use the criterion to choose the best, or take an initial design and manipulate it to give a better one. The latter is implemented in the algorithm here: https://github.com/1313e/e13Tools/blob/master/e13tools/sampling/lhs.py
They give some examples in the code, e.g. for 5 points and 2 variables:
import numpy as np
from e13tools.sampling import lhd  # assuming the e13tools package linked above is installed

np.random.seed(0)
lhd(5, 2, method='fixed')
returns something like
array([[ 0.5 , 0.75],
[ 0.25, 0.25],
[ 0. , 1. ],
[ 0.75, 0.5 ],
[ 1. , 0. ]])
This gives the Latin hypercube scaled to the interval [0, 1], so you would need to rescale it to the range of your parameters using, for example,
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
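As a rough sketch of that rescaling step (using MinMaxScaler with a feature_range, though a plain affine transform such as 1 + 9 * design would do the same here):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Example unit-cube design (e.g. the output of lhd above)
unit_design = np.array([[0.5 , 0.75],
                        [0.25, 0.25],
                        [0.  , 1.  ],
                        [0.75, 0.5 ],
                        [1.  , 0.  ]])

# Map [0, 1] onto [1, 10]; equivalent to 1 + 9 * unit_design here,
# since this design already spans 0 and 1 in every column
scaler = MinMaxScaler(feature_range=(1, 10))
scaled = scaler.fit_transform(unit_design)
print(scaled)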
Here's an example of one of the outputs I get when I run the above code (plot not shown here); it is pretty good at space-filling according to the Morris-Mitchell criterion.

"Is there a possibility to create from the beginning a random multi dimensional set of variable combination, in which every variable just appears once?" For this to work, each variable must have the same number of possible values. In your examples this number is 10, so I'll use that.
One way to generate the random points is to stack random permutations of range(10). Like this, for example, with three variables:
In [180]: np.column_stack([np.random.permutation(10) for _ in range(3)])
Out[180]:
array([[6, 6, 4],
[9, 2, 0],
[0, 4, 3],
[5, 9, 5],
[2, 8, 7],
[1, 1, 9],
[8, 3, 8],
[3, 5, 1],
[4, 0, 2],
[7, 7, 6]])
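If you need the values 1 through 10 rather than 0 through 9, the same idea works with a shifted range (the Generator API used here is just a stylistic choice):

import numpy as np

rng = np.random.default_rng(0)
# One independent permutation of 1..10 per variable, stacked as columns
points = np.column_stack([rng.permutation(np.arange(1, 11)) for _ in range(3)])
print(points)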

This answer gives a function that generates value sequences for four parameters [a, b, c, d], each a natural number between 1 and 10, such that every parameter takes each value exactly once across the generated combinations.
import random

def generate_random_sequences(num_params=4, seed=0):
    random.seed(seed)
    # One shuffled list of the values 1..10 per parameter
    value_lists = [[val for val in range(1, 11)] for _ in range(num_params)]
    for values in value_lists:
        random.shuffle(values)
    # Build the output lists value by value (one list per parameter)
    ret = [[] for _ in range(num_params)]
    for value_idx in range(10):
        for param_idx in range(num_params):
            ret[param_idx].append(value_lists[param_idx][value_idx])
    return ret
I just saw that Warren's answer using numpy is way superior, and you use numpy already anyway. Still submitting this one as a pure python implementation.

Related

How can I improve the efficiency of my algorithm when it uses two nested loops?

Dear experienced friends, I came up with a method to solve an algorithmic problem. However, I found that my method becomes very time-consuming when the data size grows. Is there a better way to solve this problem? Is it possible to use matrix manipulation?
The question:
Suppose we have 1 score matrix and 3 value matrices.
Each of them is a square matrix of the same size (N x N).
An element of the score matrix is the weight between two entities. For example, S12 is the score between entity 1 and entity 2. (Weights are only meaningful when greater than 0.)
An element of a value matrix is the value between two entities. For example, V12 is the value between entity 1 and entity 2. Since we have 3 value matrices, we have 3 different V12.
The target: I want to multiply the values by the corresponding weights, so that I can finally output an (N x 3) matrix.
My solution: I solved this problem as follows. However, I use two for-loops here, which makes my program very time-consuming (e.g. when N is big or 3 becomes 100). Is there any way to improve this code? Any suggestions or hints would be much appreciated. Thank you in advance!
# generate sample data
import numpy as np
score_mat = np.random.randint(low=0, high=4, size=(2,2))
value_mat = np.random.randn(3,2,2)

# solve problem
# init the output info
output = np.zeros((2, 3))
# update the output info
for entity_1 in range(2):
    # consider meaningful score
    entity_others_list = np.where(score_mat[entity_1,:]>0)[0].tolist()
    # iterate every other entity
    for entity_2 in entity_others_list:
        vec = value_mat[:,entity_1,entity_2].copy()
        vec *= score_mat[entity_1,entity_2]
        output[entity_1] += vec
You don't need to iterate manually: just multiply score_mat by value_mat, then call sum on axis=2, and then sum again on axis=1.
As you mentioned, the score only makes sense if it is greater than zero; if that's the case, you can first replace non-positive values by 1, since multiplying by 1 leaves a value intact:
>>> score_mat[score_mat<=0] = 1
>>> (score_mat*value_mat).sum(axis=2).sum(axis=1)
array([-0.58826032, -3.08093186, 10.47858256])
Break-down:
# This is what the randomly generated numpy arrays look like:
>>> score_mat
array([[3, 3],
[1, 3]])
>>> value_mat
array([[[ 0.81935985, 0.92228075],
[ 1.07754964, -2.29691059]],
[[ 0.12355602, -0.36182607],
[ 0.49918847, -0.95510339]],
[[ 2.43514089, 1.17296263],
[-0.81233976, 0.15553725]]])
# When you multiply the matrices, each inner matrix in value_mat is multiplied
# element-wise by score_mat
>>> score_mat*value_mat
array([[[ 2.45807955, 2.76684225],
[ 1.07754964, -6.89073177]],
[[ 0.37066806, -1.08547821],
[ 0.49918847, -2.86531018]],
[[ 7.30542266, 3.51888789],
[-0.81233976, 0.46661176]]])
# Now calling sum on axis=2 will give the sum of each row in the inner-most matrices
>>> (score_mat*value_mat).sum(axis=2)
array([[ 5.22492181, -5.81318213],
[-0.71481015, -2.36612171],
[10.82431055, -0.34572799]])
# Finally calling sum on axis=1, will again sum the row values
>>> (score_mat*value_mat).sum(axis=2).sum(axis=1)
array([-0.58826032, -3.08093186, 10.47858256])
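If you want to keep the (N x 3) output shape of the original loop (one row per entity) rather than the fully reduced vector above, a vectorized sketch of the same computation could use np.einsum; the mask reproduces the loop's score > 0 filter:

import numpy as np

score_mat = np.random.randint(low=0, high=4, size=(2, 2))
value_mat = np.random.randn(3, 2, 2)

# Zero out non-positive scores, matching the loop's score > 0 condition
masked_scores = np.where(score_mat > 0, score_mat, 0)

# output[i, k] = sum_j masked_scores[i, j] * value_mat[k, i, j]  -> shape (N, 3)
output = np.einsum('ij,kij->ik', masked_scores, value_mat)
print(output.shape)  # (2, 3)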

Numpy argsort vs Scipy.stats rankdata

I've recently used both of these functions, and am looking for input from anyone who can speak to the following:
do argsort and rankdata differ fundamentally in their purpose?
are there performance advantages with one over the other? (specifically: large vs small array performance differences?)
what is the memory overhead associated with importing rankdata?
Thanks in advance.
p.s. I could not create the new tags 'argsort' or 'rankdata'. If anyone with sufficient standing feels they should be added to this question, please do.
Do argsort and rankdata differ fundamentally in their purpose?
In my opinion, they differ slightly. The first gives you the positions the data would have if it were sorted, while the second gives the rank of each data point. The difference becomes apparent in the case of ties:
import numpy as np
from scipy import stats
a = np.array([ 5, 0.3, 0.4, 1, 1, 1, 3, 42])
almost_ranks = np.empty_like(a)
almost_ranks[np.argsort(a)] = np.arange(len(a))
print(almost_ranks)
print(almost_ranks+1)
print(stats.rankdata(a))
This results in (notice 3. 4. 5. vs. 4. 4. 4.):
[6. 0. 1. 2. 3. 4. 5. 7.]
[7. 1. 2. 3. 4. 5. 6. 8.]
[7. 1. 2. 4. 4. 4. 6. 8.]
Are there performance advantages with one over the other?
(specifically: large vs small array performance differences?)
Both algorithms seem to me to have the same complexity, O(N log N). I would expect the NumPy implementation to be slightly faster as it has a bit less overhead, plus it's NumPy. But you should test this yourself... Checking the code for scipy.stats.rankdata, it seems (at the time of writing) to be calling np.unique among other functions, so I would guess it takes longer in practice...
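A quick way to check this on your own machine is timeit; a rough sketch (the argsort-based ranking follows the snippet above and ignores ties):

import timeit
import numpy as np
from scipy import stats

a = np.random.rand(100_000)

def rank_argsort(a):
    # Positions-based "ranks" (no tie handling), as in the snippet above
    out = np.empty_like(a)
    out[np.argsort(a)] = np.arange(1, len(a) + 1)
    return out

print(timeit.timeit(lambda: rank_argsort(a), number=50))
print(timeit.timeit(lambda: stats.rankdata(a), number=50))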
what is the memory overhead associated with importing rankdata?
Well, you import scipy, if you had not done so before, so it is the overhead of scipy...

Matlab range in Python

I must translate some Matlab code into Python 3 and I often come across ranges of the form start:step:stop. When these arguments are all integers, I easily translate this command with np.arange(), but when some of the arguments are floats, especially the step parameter, I don't get the same output in Python. For example,
7:8 %In Matlab
7 8
If I want to translate it to Python, I simply use:
np.arange(7,8+1)
array([7, 8])
But if I have, let's say:
7:0.3:8 %In Matlab
7.0000 7.3000 7.6000 7.9000
I can't translate it using the same logic:
np.arange(7, 8+0.3, 0.3)
array([ 7. , 7.3, 7.6, 7.9, 8.2])
In this case, I must not add the step to the stop argument.
But then, if I have :
7:0.2:8 %In Matlab
7.0000 7.2000 7.4000 7.6000 7.8000 8.0000
I can use my first idea:
np.arange(7,8+0.2,0.2)
array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
My problem comes from the fact that I am not translating hardcoded lines like these. In fact, each parameter of these ranges can change depending on the inputs of the function I am working on. Thus, I can sometimes have 0.2 or 0.3 as the step parameter. So basically, do you know if there is another numpy/scipy (or whatever) function that really acts like Matlab's range, or must I add a little bit of code myself to make sure that my Python range ends up at the same number as Matlab's?
Thanks!
You don't actually need to add your entire step size to the upper limit of np.arange, just a very tiny number, to make sure that the maximum is included. For example, the machine epsilon:
eps = np.finfo(np.float32).eps
Adding eps gives you the same result as MATLAB in all three of your scenarios:
In [13]: np.arange(7, 8+eps)
Out[13]: array([ 7., 8.])
In [14]: np.arange(7, 8+eps, 0.3)
Out[14]: array([ 7. , 7.3, 7.6, 7.9])
In [15]: np.arange(7, 8+eps, 0.2)
Out[15]: array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
Matlab docs for linspace say
linspace is similar to the colon operator, ":", but gives direct control over the number of points and always includes the endpoints. "lin" in the name "linspace" refers to generating linearly spaced values as opposed to the sibling function logspace, which generates logarithmically spaced values.
The numpy arange docs give similar advice:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use linspace for these cases.
End of interval. The interval does not include this value, except
in some cases where step is not an integer and floating point
round-off affects the length of out.
So differences in how the step size gets translated into a number of steps can produce differences in the resulting array. If you need consistency between the two codebases, linspace is the better choice (in both).
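As a rough sketch of that suggestion, here is a hypothetical helper that computes the number of points explicitly and then calls linspace, with a small tolerance for floating-point round-off:

import numpy as np

def matlab_colon(start, step, stop):
    # Number of steps that fit into [start, stop], with a small tolerance
    # so that e.g. 7:0.2:8 still includes 8.0
    n = int(np.floor((stop - start) / step + 1e-9)) + 1
    return np.linspace(start, start + (n - 1) * step, n)

print(matlab_colon(7, 1, 8))    # [7. 8.]
print(matlab_colon(7, 0.3, 8))  # [7.  7.3 7.6 7.9]
print(matlab_colon(7, 0.2, 8))  # [7.  7.2 7.4 7.6 7.8 8. ]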

Normalization using Numpy vs hard coded

import numpy as np
import math

def normalize(array):
    mean = sum(array) / len(array)
    deviation = [(float(element) - mean)**2 for element in array]
    std = math.sqrt(sum(deviation) / len(array))
    normalized = [(float(element) - mean)/std for element in array]
    numpy_normalized = (array - np.mean(array)) / np.std(array)
    print normalized
    print numpy_normalized
    print ""

normalize([2, 4, 4, 4, 5, 5, 7, 9])
normalize([1, 2])
normalize(range(5))
Outputs:
[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
[-1.5 -0.5 -0.5 -0.5 0. 0. 1. 2. ]
[0.0, 1.414213562373095]
[-1. 1.]
[-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
[-1.41421356 -0.70710678 0. 0.70710678 1.41421356]
Can someone explain to me why this code behaves differently in the second example, but similarly in the other two examples?
Did I do anything wrong in the hard coded example? What does NumPy do to end up with [-1, 1]?
As seaotternerd explains, you're using integers. And in Python 2 (unless you from __future__ import division), dividing an integer by an integer gives you an integer.
So, why aren't all three wrong? Well, look at the values. In the first one, the sum is 40 and the len is 8, and 40 / 8 = 5. And in the third one, 10 / 5 = 2. But in the second one, 3 / 2 = 1.5. Which is why only that one gets the wrong answer when you do integer division.
So, why doesn't NumPy also get the second one wrong? NumPy doesn't treat an array of integers as floats, it treats them as integers—print np.array(array).dtype and you'll see int64. However, as the docs for np.mean explain, "float64 intermediate and return values are used for integer inputs". And, although I don't know this for sure, I'd guess they designed it that way specifically to avoid problems like this.
As a side note, if you're interested in taking the mean of floats, there are other problems with just using sum / div. For example, the mean of [1, 2, 1e200, -1e200] really ought to be 0.75, but if you just do sum / div, you're going to get 0. (Why? Well, 1 + 2 + 1e200 == 1e200.) You may want to look at a simple stats library, even if you're not using NumPy, to avoid all these problems. In Python 3 (which would have avoided your problem in the first place), there's one in the stdlib, called statistics; in Python 2, you'll have to go to PyPI.
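For reference, a small Python 3 sketch of that statistics module (its mean uses exact fraction arithmetic internally, so both the integer-division and the catastrophic-cancellation issues go away):

from statistics import mean, pstdev

print(mean([1, 2]))                      # 1.5, no integer-division surprise
print(mean([1, 2, 1e200, -1e200]))       # 0.75, the huge values cancel exactly
print(pstdev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0, population standard deviation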
You aren't converting the numbers in the array to floats when calculating the mean. This isn't a problem for your first or third inputs, because they happen to work out neatly (as explained by @abarnert), but the second input does not, and since it is composed exclusively of ints, you end up calculating the mean as 1 when it should be 1.5. This propagates through, resulting in the discrepancy with the results of NumPy's functions.
If you replace the line where you calculate the mean with this, which forces Python to use float division:
mean = sum(array) / float(len(array))
you will ultimately get [-1, 1] as a result for the second set of inputs, just like NumPy.

Using and multiplying arrays in python

I have a set of tasks I have to complete. Please help me, I'm stuck on the multiplication one :(
1. np.array([0,5,10]) will create an array of integers starting at 0, finishing at 10, with step 5. Use a different command to create the same array automatically.
array_a = np.linspace(0,10,5)
print array_a
Is this correct? Also what is meant by automatically?
2. Create (automatically, not using np.array!) another array that contains 3 equally-spaced floating point numbers starting at 2.5 and finishing at 3.5.
array_b = np.linspace(2.5,3.5,3,)
print array_b
3. Use the multiplication operator * to multiply the two arrays together.
How do I multiply them? I get an error that they aren't the same shape, so do I need to slice array_a?
The answer to the first problem is wrong; it asks you to create an array with elements [0, 5, 10]. When I run your code it prints [ 0. , 2.5, 5. , 7.5, 10. ] instead. I don't want to give the answer away completely (it is homework after all), but try looking up the docs for the arange function. You can solve #1 with either linspace or arange (you'll have to tweak the parameters either way), but I think the arange function is more suited to the specific wording of the question.
Once you've got #1 returning the correct result, the error in #3 should go away because the arrays will both have length 3 (i.e. they'll have the same shape).
