I must translate some Matlab code into Python 3 and I often come across ranges of the form start:step:stop. When these arguments are all integers, I easily translate this command with np.arange(), but when some of the arguments are floats, especially the step parameter, I don't get the same output in Python. For example,
7:8 %In Matlab
7 8
If I want to translate it into Python, I simply use:
np.arange(7,8+1)
array([7, 8])
But if I have, let's say:
7:0.3:8 %In Matlab
7.0000 7.3000 7.6000 7.9000
I can't translate it using the same logic:
np.arange(7, 8+0.3, 0.3)
array([ 7. , 7.3, 7.6, 7.9, 8.2])
In this case, I must not add the step to the stop argument.
But then, if I have :
7:0.2:8 %In Matlab
7.0000 7.2000 7.4000 7.6000 7.8000 8.0000
I can use my first idea:
np.arange(7,8+0.2,0.2)
array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
My problem comes from the fact that I am not translating hardcoded lines like these. In fact, each parameter of these ranges can change depending on the inputs of the function I am working on. Thus, I can sometimes have 0.2 or 0.3 as the step parameter. So basically, do you guys know if there is a numpy/scipy (or whatever) function that really acts like MATLAB's range, or must I add a little code myself to make sure that my Python range ends at the same number as MATLAB's?
Thanks!
You don't actually need to add your entire step size to the max limit of np.arange; a very tiny number is enough to make sure that the max is included. For example, the machine epsilon:
eps = np.finfo(np.float32).eps
Adding eps gives you the same result as MATLAB in all three of your scenarios:
In [13]: np.arange(7, 8+eps)
Out[13]: array([ 7., 8.])
In [14]: np.arange(7, 8+eps, 0.3)
Out[14]: array([ 7. , 7.3, 7.6, 7.9])
In [15]: np.arange(7, 8+eps, 0.2)
Out[15]: array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
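If you want to avoid the eps trick (it can misbehave when stop is large, since eps here is an absolute offset), a small helper can mimic MATLAB's colon semantics by computing the point count first and then generating exactly that many values. This is only a sketch; matlab_colon and its tol argument are made-up names, not a numpy API:

import numpy as np

def matlab_colon(start, step, stop, tol=1e-10):
    # Number of whole steps that fit in [start, stop]; tol absorbs float
    # round-off when (stop - start) / step lands just under an integer.
    n = int(np.floor((stop - start) / step + tol)) + 1
    return start + step * np.arange(n)

With this, matlab_colon(7, 0.3, 8) gives array([7. , 7.3, 7.6, 7.9]) and matlab_colon(7, 0.2, 8) gives array([7. , 7.2, 7.4, 7.6, 7.8, 8. ]), matching the MATLAB outputs above.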
The MATLAB docs for linspace say:
linspace is similar to the colon operator, ":", but gives direct control over the number of points and always includes the endpoints. "lin" in the name "linspace" refers to generating linearly spaced values as opposed to the sibling function logspace, which generates logarithmically spaced values.
numpy's arange docs give similar advice:
When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases.
and describe the stop parameter like this:
End of interval. The interval does not include this value, except in some cases where step is not an integer and floating point round-off affects the length of out.
So differences in how the step size gets translated into a number of steps can produce differences in the number of points. If you need consistency between the two code bases, linspace is the better choice (in both).
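For instance, a hedged translation of 7:0.2:8 via linspace, assuming the step is intended to divide the interval exactly (as it is here):

import numpy as np

start, step, stop = 7.0, 0.2, 8.0
# Compute the point count, then let linspace pin both endpoints exactly.
# round() is an assumption to absorb float noise in the division.
n = int(round((stop - start) / step)) + 1
print(np.linspace(start, stop, n))   # [7.  7.2 7.4 7.6 7.8 8. ]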
I wanted values like
1, 1.02, 1.04, 1.06, 1.08, etc.
so I used numpy in Python:
y = [x for x in numpy.arange(1,2,0.02)]
I got the values:
1.0,
1.02,
1.04,
1.0600000000000001,
1.0800000000000001,
I have three questions here:
How do I get exactly the values 1, 1.02, 1.04, 1.06, 1.08, etc.?
Why do I get correct values for 1.02 and 1.04, but not for 1.06 and 1.08 (which come out as 1.0600000000000001 and 1.0800000000000001)?
How reliable can our programs be when we can't trust such basic operations, in programs that can run to thousands of lines of code and do so many calculations? How do we cope with such scenarios?
There are very similar questions that address problems with floating points in general and numpy library in particular -
Is floating point math broken?
Why Are Floating Point Numbers Inaccurate?
While they address why such a thing happens, here I'm concerned more with how to deal with such scenarios in everyday programming, particularly with numpy in Python. Hence these questions.
First pitfall: do not confuse accuracy with printing policies.
In your example:
In [6]: [x.as_integer_ratio() for x in arange(1,1.1,0.02)]
Out[6]:
[(1, 1),
(2296835809958953, 2251799813685248),
(1170935903116329, 1125899906842624),
(2386907802506363, 2251799813685248),
(607985949695017, 562949953421312),
(2476979795053773, 2251799813685248)]
shows that only 1 has an exact float representation.
In [7]: ['{:1.18f}'.format(f) for f in arange(1,1.1,.02)]
Out[7]:
['1.000000000000000000',
'1.020000000000000018',
'1.040000000000000036',
'1.060000000000000053',
'1.080000000000000071',
'1.100000000000000089']
shows the internal accuracy.
In [8]: arange(1,1.1,.02)
Out[8]: array([ 1. , 1.02, 1.04, 1.06, 1.08, 1.1 ])
shows how numpy handles printing: rounding to at most 6 digits and discarding trailing zeros.
In [9]: [f for f in arange(1,1.1,.02)]
Out[9]: [1.0, 1.02, 1.04, 1.0600000000000001, 1.0800000000000001, 1.1000000000000001]
shows how Python handles printing: rounding to at most 16 digits and discarding trailing zeros after the first digit.
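If the goal (question 1) is just to see 1, 1.02, 1.04, ..., one pragmatic option is to round for display. This is a presentation fix, not an accuracy fix; the stored values are still the nearest binary floats:

import numpy as np

vals = np.round(np.arange(1, 1.1, 0.02), 2)
print(vals.tolist())   # [1.0, 1.02, 1.04, 1.06, 1.08, 1.1]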
Some advice for how to deal with such scenarios in everyday programming:
Be aware that each operation on floats can deteriorate accuracy. Native float64 accuracy is roughly 1e-16, which is sufficient for a lot of applications. Subtraction is the most common source of precision loss, as in this example where the exact result is 0:
In [5]: [((1+10**(-exp))-1)/10**(-exp)-1 for exp in range(0,24,4)]
Out[5]:
[0.0,
-1.1013412404281553e-13,
-6.07747097092215e-09,
8.890058234101161e-05,
-1.0,
-1.0]
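A practical consequence for everyday code: compare floats within a tolerance, never with ==. A minimal illustration using the stdlib and numpy helpers:

import math
import numpy as np

print(0.1 + 0.2 == 0.3)               # False: both sides are approximations
print(math.isclose(0.1 + 0.2, 0.3))   # True (default rel_tol is 1e-09)
print(np.isclose(np.arange(1, 1.1, 0.02), 1.06))
# [False False False  True False False]: 1.06 is found despite the ...01 tail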
import numpy as np
import math

def normalize(array):
    mean = sum(array) / len(array)
    deviation = [(float(element) - mean)**2 for element in array]
    std = math.sqrt(sum(deviation) / len(array))
    normalized = [(float(element) - mean)/std for element in array]
    numpy_normalized = (array - np.mean(array)) / np.std(array)
    print normalized
    print numpy_normalized
    print ""

normalize([2, 4, 4, 4, 5, 5, 7, 9])
normalize([1, 2])
normalize(range(5))
Outputs:
[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
[-1.5 -0.5 -0.5 -0.5 0. 0. 1. 2. ]
[0.0, 1.414213562373095]
[-1. 1.]
[-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
[-1.41421356 -0.70710678 0. 0.70710678 1.41421356]
Can someone explain to me why this code behaves differently in the second example, but similarly in the other two examples?
Did I do anything wrong in the hard coded example? What does NumPy do to end up with [-1, 1]?
As seaotternerd explains, you're using integers. And in Python 2 (unless you from __future__ import division), dividing an integer by an integer gives you an integer.
So, why aren't all three wrong? Well, look at the values. In the first one, the sum is 40 and the len is 8, and 40 / 8 = 5. And in the third one, 10 / 5 = 2. But in the second one, 3 / 2 = 1.5. Which is why only that one gets the wrong answer when you do integer division.
So, why doesn't NumPy also get the second one wrong? NumPy doesn't treat an array of integers as floats, it treats them as integers—print np.array(array).dtype and you'll see int64. However, as the docs for np.mean explain, "float64 intermediate and return values are used for integer inputs". And, although I don't know this for sure, I'd guess they designed it that way specifically to avoid problems like this.
As a side note, if you're interested in taking the mean of floats, there are other problems with just using sum / div. For example, the mean of [1, 2, 1e200, -1e200] really ought to be 0.75, but if you just do sum / div, you're going to get 0. (Why? Well, 1 + 2 + 1e200 == 1e200.) You may want to look at a simple stats library, even if you're not using NumPy, to avoid all these problems. In Python 3 (which would have avoided your problem in the first place), there's one in the stdlib, called statistics; in Python 2, you'll have to go to PyPI.
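A quick demonstration of that pathological input, using the statistics module mentioned above (it sums exactly via rationals internally, so it survives the cancellation):

import statistics

data = [1.0, 2.0, 1e200, -1e200]
print(sum(data) / len(data))    # 0.0 -- the 1 and 2 are absorbed by 1e200
print(statistics.mean(data))    # 0.75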
You aren't converting the numbers in the array to floats when calculating the mean. This isn't a problem for your first or third inputs, because they happen to work out neatly (as explained by @abarnert), but since the second input does not, and is composed exclusively of ints, you end up calculating the mean as 1 when it should be 1.5. This propagates through, resulting in your discrepancy with the results of NumPy's functions.
If you replace the line where you calculate the mean with this, which forces Python to use float division:
mean = sum(array) / float(len(array))
you will ultimately get [-1, 1] as a result for the second set of inputs, just like NumPy.
If I define
>>> y=np.linspace(1., 10, 10)
and I do
>>> np.percentile(y, [25, 50, 75])
I obtain [3.25, 5.5, 7.75].
For the series 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, I get Q2 = 5.5 (OK), but I expected Q1 = 3 (not 3.25) and Q3 = 8 (not 7.75)!
Sorry, I am a little lost with these elementary things... thanks in advance for some help.
By default, numpy uses linear interpolation for percentiles, meaning that if the "true" value of a percentile lies between two data points, it returns a value that is between them, proportionally closer to the data point that is closer to the requested percentile.
Starting in numpy 1.9.0, you can override this by passing the interpolation parameter to percentile. You have several options as documented here. "Lower" or "nearest" is likely what you're looking for.
In earlier versions of numpy there is no way to get the behavior you want. There is a function scipy.stats.scoreatpercentile in scipy which provides "lower" and "higher" interpolation methods (but not the extra "nearest" and "midpoint" methods that np.percentile offers).
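For example (note that numpy 1.22 renamed the interpolation parameter to method, so on current numpy you would write method='nearest'):

import numpy as np

y = np.linspace(1, 10, 10)
print(np.percentile(y, [25, 50, 75]))                           # [3.25 5.5  7.75]
print(np.percentile(y, [25, 50, 75], interpolation='nearest'))  # [3. 5. 8.]

which matches the Q1 = 3 and Q3 = 8 the question expected (the median comes out as 5 here, since 'nearest' has to pick one of the two middle data points).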
Because I like to understand exactly how things work... and because someone else might be like me...
First, I thank BrenBarn a lot for his help and the time he spent answering. So how does plt.boxplot(), and more generally np.percentile(), arrive at the first quartile (Q1, 25th percentile), the median (Q2, 50th percentile) and the third quartile (Q3, 75th percentile)?
BrenBarn said to read the manual, http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html, where it is written "linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j." But that was not very clear to me, because i, j and fraction were not obvious.
So let's do:
>>> x=np.linspace(1,10,10)
>>> x
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
>>> np.percentile(x,[0,1,20,25,50,75,80,99,100])
[1.0, 1.0900000000000001, 2.8000000000000003, 3.25, 5.5, 7.75, 8.1999999999999993, 9.9100000000000001, 10.0]
The way to deduce i, j and fraction, taking the definition in the numpy manual, is:
P: the percentile to calculate.
N: the total number of data points.
n = ((P/100) * (N-1)) + 1.
n = k + d, where k is the integer part and d the fractional part.
k is the (1-based) rank of a sorted datum of the array x; call that datum v_k (it is the manual's i), and the next sorted datum v_(k+1) is the manual's j. d is the fraction described in the numpy manual for percentile().
So, in the manual's notation:
n = i + fraction.
Then the result is found easily by using:
value = the first sorted datum (x[0] in this example) if P = 0.
value = the last datum (x[9] in this example) if P = 100.
value = v_k + d * (v_(k+1) - v_k) otherwise.
For the example above:
For the First percentile:
n=((1/100)*(10-1))+1=1.09
and
>>> x[0]+0.09*(x[1]-x[0])
1.0900000000000001
OK, as returned by np.percentile(x,[0,1,20,25,50,75,80,99,100]) above.
For the 20th:
n=((20/100)*(10-1))+1=2.8
and
>>> x[1]+0.8*(x[2]-x[1])
2.7999999999999998
OK, very close to the value returned by np.percentile(x,[0,1,20,25,50,75,80,99,100]) above.
For the first quartile:
n=((25/100)*(10-1))+1=3.25
and
>>> x[2]+0.25*(x[3]-x[2])
3.25
OK, as returned by np.percentile(x,[0,1,20,25,50,75,80,99,100]) above.
For the median:
n=((50/100)*(10-1))+1=5.5
and
>>> x[4]+0.5*(x[5]-x[4])
5.5
OK, as returned by np.percentile(x,[0,1,20,25,50,75,80,99,100]) above.
And so on ...
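Putting the rule above into code (percentile_linear is a hypothetical name; it restates the formula, it is not how numpy is implemented internally):

import numpy as np

def percentile_linear(x, P):
    x = np.sort(np.asarray(x, dtype=float))
    n = (P / 100.0) * (len(x) - 1) + 1   # 1-based position, n = k + d
    k = int(np.floor(n))                 # integer part (the manual's i)
    d = n - k                            # fractional part (the fraction)
    if k >= len(x):                      # P = 100 lands on the last datum
        return x[-1]
    return x[k - 1] + d * (x[k] - x[k - 1])   # v_k + d * (v_(k+1) - v_k)

x = np.linspace(1, 10, 10)
print([percentile_linear(x, P) for P in (1, 20, 25, 50)])
# [1.09..., 2.79..., 3.25, 5.5] -- matches np.percentile(x, [1, 20, 25, 50])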
I used this link, https://en.m.wikipedia.org/wiki/Percentile#Microsoft_Excel_method, to work it out; there this method is called the "Microsoft Excel method". Because I am from the free world I do not like that name, but it is the one given at that link...
Hoping this will help someone, despite my bad written English.
Je suis Charlie.
I have a set of tasks I have to complete. Please help me, I'm stuck on the multiplication one :(
1. np.array([0,5,10]) will create an array of integers starting at 0, finishing at 10, with step 5. Use a different command to create the same array automatically.
array_a = np.linspace(0,10,5)
print array_a
Is this correct? Also, what is meant by "automatically"?
2. Create (automatically, not using np.array!) another array that contains 3 equally-spaced floating point numbers starting at 2.5 and finishing at 3.5.
array_b = np.linspace(2.5, 3.5, 3)
print array_b
3. Use the multiplication operator * to multiply the two arrays together.
How do I multiply them? I get an error that they aren't the same shape, so do I need to slice array_a?
The answer to the first problem is wrong; it asks you to create an array with elements [0, 5, 10]. When I run your code it prints [ 0. , 2.5, 5. , 7.5, 10. ] instead. I don't want to give the answer away completely (it is homework after all), but try looking up the docs for the arange function. You can solve #1 with either linspace or arange (you'll have to tweak the parameters either way), but I think the arange function is more suited to the specific wording of the question.
Once you've got #1 returning the correct result, the error in #3 should go away because the arrays will both have length 3 (i.e. they'll have the same shape).
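To see why the shapes matter for #3, here is a neutral illustration (deliberately not the homework arrays): * multiplies element by element and requires matching, or at least broadcastable, shapes:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a * b)                # [10 40 90]
print(a.shape == b.shape)   # True -- same shape, so no error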
I have used numpy's arange function to make the following range:
a = n.arange(0,5,1/2)
This variable works fine by itself, but when I try putting it anywhere in my script I get an error that says
ZeroDivisionError: division by zero
First, your step evaluates to zero (on Python 2.x, that is). Second, you may want to look at np.linspace if you want to use a non-integer step.
Docstring:
arange([start,] stop[, step,], dtype=None)
Return evenly spaced values within a given interval.
[...]
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use ``linspace`` for these cases.
In [1]: import numpy as np
In [2]: 1/2
Out[2]: 0
In [3]: 1/2.
Out[3]: 0.5
In [4]: np.arange(0, 5, 1/2.) # use a float
Out[4]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
If you're not using a newer version of Python (3.0 or later), the expression 1/2 evaluates to zero, since it's integer division.
You can fix this by replacing 1/2 with 1./2 or 0.5, or by putting from __future__ import division at the top of your script.
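Putting both fixes together in one sketch (the __future__ import for Python 2, plus the linspace alternative the docs recommend for fractional steps):

from __future__ import division  # no-op on Python 3; fixes 1/2 on Python 2
import numpy as np

print(1 / 2)                   # 0.5, so the step is no longer zero
a = np.arange(0, 5, 1 / 2)
b = np.linspace(0, 4.5, 10)    # same ten values; preferred for float steps
print(np.allclose(a, b))       # True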