Numpy argsort vs Scipy.stats rankdata - python

I've recently used both of these functions, and am looking for input from anyone who can speak to the following:
do argsort and rankdata differ fundamentally in their purpose?
are there performance advantages with one over the other? (specifically: large vs small array performance differences?)
what is the memory overhead associated with importing rankdata?
Thanks in advance.

Do argsort and rankdata differ fundamentally in their purpose?
In my opinion, they differ slightly. argsort gives you the indices that would sort the data, while rankdata gives you the rank of each value. The difference becomes apparent in the case of ties:
import numpy as np
from scipy import stats

a = np.array([5, 0.3, 0.4, 1, 1, 1, 3, 42])

# Build ranks from argsort: the i-th position of the sorted order gets rank i
almost_ranks = np.empty_like(a)
almost_ranks[np.argsort(a)] = np.arange(len(a))

print(almost_ranks)        # 0-based "ranks" derived from argsort
print(almost_ranks + 1)    # shifted to 1-based for comparison with rankdata
print(stats.rankdata(a))   # ties receive their average rank by default
The result (notice 3. 4. 5. vs. 4. 4. 4. for the tied values):
[6. 0. 1. 2. 3. 4. 5. 7.]
[7. 1. 2. 3. 4. 5. 6. 8.]
[7. 1. 2. 4. 4. 4. 6. 8.]
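If you want argsort-like behaviour for ties, rankdata also takes a method argument ('average' is the default; 'min', 'max', 'dense' and 'ordinal' are the other options). A quick sketch:
from scipy import stats
import numpy as np

a = np.array([5, 0.3, 0.4, 1, 1, 1, 3, 42])

print(stats.rankdata(a, method='average'))  # ties share their average rank (default)
print(stats.rankdata(a, method='min'))      # ties all get the lowest of their ranks
print(stats.rankdata(a, method='ordinal'))  # ties broken by order of appearance,
                                            # matching the argsort-based ranks above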
Are there performance advantages with one over the other?
(specifically: large vs small array performance differences?)
Both approaches have essentially the same complexity, O(N log N), since both are dominated by a sort. I would expect the numpy route to be slightly faster as it carries a bit less overhead, plus it's numpy. But you should test this yourself... Checking the code for scipy.stats.rankdata (at present, in my install), it calls np.unique among other functions on top of the sort, so I would guess it costs a bit more in practice...
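A minimal timing sketch (treat it only as a template; the numbers depend on your machine, array size, and library versions):
import timeit

setup = """
import numpy as np
from scipy import stats
a = np.random.rand(1_000_000)

def argsort_ranks(x):
    # 0-based ranks built from argsort, as in the snippet above
    r = np.empty_like(x)
    r[np.argsort(x)] = np.arange(len(x))
    return r
"""

n = 20
t_argsort = timeit.timeit("argsort_ranks(a)", setup=setup, number=n)
t_rankdata = timeit.timeit("stats.rankdata(a)", setup=setup, number=n)
print("argsort-based ranks: %.4f s per call" % (t_argsort / n))
print("scipy rankdata:      %.4f s per call" % (t_rankdata / n))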
what is the memory overhead associated with importing rankdata?
Well, you are importing scipy (if you had not done so already), so the overhead is essentially that of importing scipy itself...

Related

Python: Alternative to quasi random sequences

Hey, I have the following problem. I have a large parameter space; in my case about 10 dimensions, but to simplify let's assume I have 3 variables x1, x2 and x3, which are discrete numbers from 1 to 10. Now I create all possible parameter combinations and want to use them for postprocessing. In my real case that is too many combinations, so I want to do a quasi-random sequence search to reduce the search space. But the combinations in the reduced search space should cover it as well as possible (uniformly distributed). I want to prevent the parameter combinations from clustering in the search space; they should cover the whole space as well as possible. I need that to find preferences of the parameter combinations in the processing of the parameters.
There are many approaches to do that, like Halton, Hammersley or Sobol sequences, but they do not work for discrete numbers. One package which produces quasi-random sequences is chaospy. If I round the numbers of the sequences, individual values of a variable will occur more than once across the different combinations. That is not what I want: I want every value of each variable to occur only once, with the combinations uniformly distributed in the search space. Is there a possibility to create, from the beginning, a random multi-dimensional set of variable combinations in which every value appears just once? For example, in a two-dimensional 10x10 grid one possible such set would be the diagonal. Of course, in 3 dimensions I would need 100 combinations to cover all parameter values.
Let's look at a simplified example with three variables from 1 to 10 and a Sobol sequence:
import numpy as np
import chaospy as cp

# Create a joint distribution of the three variables, each ranging from 1 to 10
distribution2 = cp.J(cp.Uniform(1, 10), cp.Uniform(1, 10), cp.Uniform(1, 10))
# Draw 10 samples from the variable space using a Sobol sequence (rule="S")
samplesSobol = distribution2.sample(10, rule="S")
# Transpose the array so each row is one variable combination
sobolPointsTranspose = np.transpose(samplesSobol)
Example Output:
[[ 7.89886475 6.34649658 4.8336792 ]
[ 5.64886475 4.09649658 2.5836792 ]
[ 1.14886475 8.59649658 7.0836792 ]
[ 1.21917725 5.01055908 2.5133667 ]
[ 5.71917725 9.51055908 7.0133667 ]
[ 7.96917725 2.76055908 9.2633667 ]
[ 3.46917725 7.26055908 4.7633667 ]
[ 4.59417725 1.63555908 5.8883667 ]
[ 9.09417725 6.13555908 1.3883667 ]
[ 6.84417725 3.88555908 3.6383667 ]]
Here every value of each variable is unique, but the output is not discrete. I can round it and get:
[[ 8. 6. 5.]
[ 6. 4. 3.]
[ 1. 9. 7.]
[ 1. 5. 3.]
[ 6. 10. 7.]
[ 8. 3. 9.]
[ 3. 7. 5.]
[ 5. 2. 6.]
[ 9. 6. 1.]
[ 7. 4. 4.]]
Now the problem is that, for example, 1 occurs twice in the first dimension, 4 twice in the second, and 7 twice in the third.
This is a very late answer, so I assume it is no longer relevant to the original poster, but I came across the post whilst trying to find an existing implementation of what I describe below.
It sounds like you are looking for something like a Latin Hypercube: https://en.wikipedia.org/wiki/Latin_hypercube_sampling.
Essentially if I have n variables and I want 10 samples then the range of each variable is split into 10 intervals and the possible values for each variable are (e.g.) the middle points of each interval. A Latin hypercube algorithm picks samples at random in such a way that each of the 10 values for each variable appears only once. The example in Warren's answer is an example of a Latin Hypercube.
A Latin hypercube on its own doesn't guarantee that the search space is covered as well as possible (in other words, that the design is space-filling). There is a criterion from Morris and Mitchell's 1995 paper, Exploratory designs for computational experiments, which measures how space-filling a sample is by looking at the distances between points. You can create a large number of different Latin hypercube designs and then use the criterion to choose the best, or take an initial design and manipulate it to give a better one. The latter is implemented in the algorithm here: https://github.com/1313e/e13Tools/blob/master/e13tools/sampling/lhs.py
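As a rough illustration of the idea (a simplified maximin score, not the full Morris-Mitchell criterion), you can rate a design by its smallest pairwise distance and keep the best of several random Latin hypercubes:
import numpy as np
from scipy.spatial.distance import pdist

def random_lhd(n_samples, n_vars, rng):
    # Each column is an independent permutation of 0..n_samples-1, scaled to [0, 1]:
    # a basic, unoptimised Latin hypercube design.
    cols = [rng.permutation(n_samples) for _ in range(n_vars)]
    return np.column_stack(cols) / (n_samples - 1)

def maximin_score(design):
    # Larger minimum pairwise distance = more space-filling (simplified criterion)
    return pdist(design).min()

rng = np.random.default_rng(0)
candidates = [random_lhd(10, 3, rng) for _ in range(200)]
best = max(candidates, key=maximin_score)
print(best)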
The e13tools code linked above includes some examples, e.g. for 5 points and 2 variables:
import numpy as np
from e13tools.sampling import lhd  # import path assumed from the repository layout linked above

np.random.seed(0)
lhd(5, 2, method='fixed')
returns something like
array([[ 0.5 , 0.75],
[ 0.25, 0.25],
[ 0. , 1. ],
[ 0.75, 0.5 ],
[ 1. , 0. ]])
This gives the Latin hypercube scaled to the interval [0, 1], so you would need to rescale to the range of your parameters using, for example,
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
Here's an example of one of the outputs I get when I run the above code (plot not shown); it is pretty good at space-filling according to the Morris-Mitchell criterion.
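For the discrete 1-to-10 case in the question, a simple alternative to MinMaxScaler is to map each unit-interval column onto the integer grid by ranking it (a sketch; when the number of samples equals the number of levels, every level appears exactly once per variable):
import numpy as np

def to_discrete_levels(design01, n_levels):
    # design01: Latin hypercube samples in [0, 1], shape (n_samples, n_vars),
    # with n_samples == n_levels. Ranking each column maps it to 1..n_levels
    # with every level used exactly once per variable.
    ranks = design01.argsort(axis=0).argsort(axis=0)
    return ranks + 1

design01 = np.array([[0.5 , 0.75],
                     [0.25, 0.25],
                     [0.  , 1.  ],
                     [0.75, 0.5 ],
                     [1.  , 0.  ]])
print(to_discrete_levels(design01, n_levels=5))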
"Is there a possibility to create from the beginning a random multi dimensional set of variable combination, in which every variable just appears once?" For this to work, each variable must have the same number of possible values. In your examples this number is 10, so I'll use that.
One way to generate the random points is to stack random permutations of range(10). Like this, for example, with three variables:
In [180]: np.column_stack([np.random.permutation(10) for _ in range(3)])
Out[180]:
array([[6, 6, 4],
[9, 2, 0],
[0, 4, 3],
[5, 9, 5],
[2, 8, 7],
[1, 1, 9],
[8, 3, 8],
[3, 5, 1],
[4, 0, 2],
[7, 7, 6]])
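The permutations above run from 0 to 9; since the question asks for values 1 to 10 (and up to 10 dimensions), the same idea extends directly, e.g.:
import numpy as np

n_values, n_dims = 10, 10
# Shift each permutation by 1 so the levels run from 1 to 10
combos = np.column_stack([np.random.permutation(n_values) + 1
                          for _ in range(n_dims)])
print(combos.shape)  # (10, 10): 10 combinations, each value used once per dimension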
This answer gives a function that generates, for each of the parameters [a, b, c, d], a shuffled sequence of the natural numbers from 1 to 10, so that each parameter takes every value exactly once.
import random

def generate_random_sequences(num_params=4, seed=0):
    random.seed(seed)
    # One shuffled list of the values 1..10 per parameter
    value_lists = [[val for val in range(1, 11)] for _ in range(num_params)]
    for values in value_lists:
        random.shuffle(values)
    # Collect the values parameter by parameter
    ret = [[] for _ in range(num_params)]
    for value_idx in range(10):
        for param_idx in range(num_params):
            ret[param_idx].append(value_lists[param_idx][value_idx])
    return ret
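For example (output depends on the seed):
sequences = generate_random_sequences(num_params=3, seed=0)
for param_idx, values in enumerate(sequences):
    print(param_idx, values)  # each list is a permutation of 1..10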
I just saw that Warren's answer using numpy is way superior, and you use numpy already anyway. Still submitting this one as a pure python implementation.

Matlab range in Python

I must translate some Matlab code into Python 3 and I often come across ranges of the form start:step:stop. When these arguments are all integers, I easily translate this command with np.arange(), but when some of the arguments are floats, especially the step parameter, I don't get the same output in Python. For example,
7:8 %In Matlab
7 8
If I want to translate it in Python I simply use :
np.arange(7,8+1)
array([7, 8])
But if I have, let's say :
7:0.3:8 %In Matlab
7.0000 7.3000 7.6000 7.9000
I can't translate it using the same logic :
np.arange(7, 8+0.3, 0.3)
array([ 7. , 7.3, 7.6, 7.9, 8.2])
In this case, I must not add the step to the stop argument.
But then, if I have :
7:0.2:8 %In Matlab
7.0000 7.2000 7.4000 7.6000 7.8000 8.0000
I can use my first idea :
np.arange(7,8+0.2,0.2)
array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
My problem comes from the fact that I am not translating hardcoded lines like these. In fact, each parameter of these ranges can change depending on the inputs of the function I am working on, so I can sometimes have 0.2 or 0.3 as the step. So basically, do you guys know if there is another numpy/scipy (or whatever) function that really acts like the Matlab range, or must I add a little bit of code myself to make sure my Python range ends at the same number as Matlab's?
Thanks!
You don't actually need to add your entire step size to the upper limit of np.arange, just a very tiny number to make sure that the maximum is included, for example the machine epsilon:
eps = np.finfo(np.float32).eps
Adding eps gives you the same result as MATLAB in all three of your scenarios:
In [13]: np.arange(7, 8+eps)
Out[13]: array([ 7., 8.])
In [14]: np.arange(7, 8+eps, 0.3)
Out[14]: array([ 7. , 7.3, 7.6, 7.9])
In [15]: np.arange(7, 8+eps, 0.2)
Out[15]: array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
The Matlab docs for linspace say:
linspace is similar to the colon operator, ":", but gives direct control over the number of points and always includes the endpoints. "lin" in the name "linspace" refers to generating linearly spaced values as opposed to the sibling function logspace, which generates logarithmically spaced values.
The numpy arange documentation has similar advice:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use linspace for these cases.
And this, about the stop argument:
End of interval. The interval does not include this value, except in some cases where step is not an integer and floating point round-off affects the length of out.
So floating-point round-off in how the step size gets translated into a number of points can produce a different number of elements in the two languages. If you need consistency between the two codes, linspace is the better choice (in both).
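If you want a single drop-in replacement for Matlab's start:step:stop, one option is to compute the number of points first and build the range from that (a sketch; matlab_colon is just a hypothetical helper name):
import numpy as np

def matlab_colon(start, step, stop):
    # Number of points Matlab's colon operator would produce,
    # with a small tolerance for floating point round-off
    n = int(np.floor((stop - start) / step + 1e-9)) + 1
    return start + step * np.arange(n, dtype=float)

print(matlab_colon(7, 1, 8))    # [7. 8.]
print(matlab_colon(7, 0.3, 8))  # [7.  7.3 7.6 7.9]
print(matlab_colon(7, 0.2, 8))  # [7.  7.2 7.4 7.6 7.8 8. ]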

Python: Are numpy arrays linked when one array is transposed?

Currently I am working on a python script which extracts measurement data from a text file. I am working with iPython Notebook and Python 2.7
Now I experienced some odd behaviour when working with numpy arrays. I have no explanation for this.
import numpy

myArray = numpy.zeros((4, 3))
myArrayTransposed = myArray.transpose()

for i in range(0, 4):
    for j in range(0, 3):
        myArray[i][j] = i + j

print myArray
print myArrayTransposed
leads to:
[[ 0. 1. 2.]
[ 1. 2. 3.]
[ 2. 3. 4.]
[ 3. 4. 5.]]
[[ 0. 1. 2. 3.]
[ 1. 2. 3. 4.]
[ 2. 3. 4. 5.]]
So without working on the transposed array, values are updated in this array.
How is this possible?
From http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html:
Different ndarrays can share the same data, so that changes made in one ndarray may be visible in another. That is, an ndarray can be a “view” to another ndarray, and the data it is referring to is taken care of by the “base” ndarray. ndarrays can also be views to memory owned by Python strings or objects implementing the buffer or array interfaces.
When you do a transpose(), this returns a "view" to the original ndarray. It points to the same memory buffer, but it has a different indexing scheme:
A segment of memory is inherently 1-dimensional, and there are many different schemes for arranging the items of an N-dimensional array in a 1-dimensional block. Numpy is flexible, and ndarray objects can accommodate any strided indexing scheme.
To create an independent ndarray, make an explicit copy, for example with the copy() method (or numpy.array(), which copies by default):
myArrayTransposed = myArray.transpose().copy()
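To check whether two arrays share memory in situations like this, you can use np.shares_memory (available in modern numpy) or look at a view's .base attribute:
import numpy as np

myArray = np.zeros((4, 3))
view = myArray.transpose()           # a view on the same buffer
copied = myArray.transpose().copy()  # an independent array

print(np.shares_memory(myArray, view))    # True
print(np.shares_memory(myArray, copied))  # False
print(view.base is myArray)               # True: the view's base is the original array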

Using and multiplying arrays in python

I have a set of tasks I have to complete. Please help me, I'm stuck on the multiplication one :(
1. np.array([0,5,10]) will create an array of integers starting at 0, finishing at 10, with step 5. Use a different command to create the same array automatically.
array_a = np.linspace(0,10,5)
print array_a
Is this correct? Also what is meant by automatically?
2. Create (automatically, not using np.array!) another array that contains 3 equally-spaced floating point numbers starting at 2.5 and finishing at 3.5.
array_b = np.linspace(2.5,3.5,3,)
print array_b
3. Use the multiplication operator * to multiply the two arrays together
How do I multiply them? I get an error that they aren't the same shape, so do I need to slice array_a?
The answer to the first problem is wrong; it asks you to create an array with elements [0, 5, 10]. When I run your code it prints [ 0. , 2.5, 5. , 7.5, 10. ] instead. I don't want to give the answer away completely (it is homework after all), but try looking up the docs for the arange function. You can solve #1 with either linspace or arange (you'll have to tweak the parameters either way), but I think the arange function is more suited to the specific wording of the question.
Once you've got #1 returning the correct result, the error in #3 should go away because the arrays will both have length 3 (i.e. they'll have the same shape).
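To illustrate the shape requirement without giving the exercise away, elementwise multiplication with * works as soon as both arrays have the same length (the values below are placeholders, not the exercise's arrays):
import numpy as np

x = np.array([1, 2, 3])
y = np.linspace(10.0, 30.0, 3)  # three equally spaced floats: [10. 20. 30.]

print(x * y)  # elementwise product: [10. 40. 90.]
# x * np.linspace(0, 10, 5) would raise a ValueError
# ("operands could not be broadcast together"), because the shapes (3,) and (5,) differ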

numpy.arange divide by zero error

I have used numpy's arange function to make the following range:
a = n.arange(0,5,1/2)
This variable works fine by itself, but when I try putting it anywhere in my script I get an error that says
ZeroDivisionError: division by zero
First, your step evaluates to zero (on python 2.x that is). Second, you may want to check np.linspace if you want to use a non-integer step.
Docstring:
arange([start,] stop[, step,], dtype=None)
Return evenly spaced values within a given interval.
[...]
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use ``linspace`` for these cases.
In [1]: import numpy as np
In [2]: 1/2
Out[2]: 0
In [3]: 1/2.
Out[3]: 0.5
In [4]: np.arange(0, 5, 1/2.) # use a float
Out[4]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
If you're not using Python 3 (where / performs true division by default), the expression 1/2 evaluates to zero, since Python 2 applies integer division to two integers.
You can fix this by replacing 1/2 with 1./2 or 0.5, or by putting from __future__ import division at the top of your script.
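For example, on Python 2 either of these gives the intended half-integer steps (a quick sketch):
from __future__ import division  # must come before any other code in the script
import numpy as np

print(np.arange(0, 5, 1/2))     # 1/2 is now 0.5: [0.  0.5 1.  ... 4.5]
print(np.linspace(0, 4.5, 10))  # same values, with an explicit number of points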
