Normalization using Numpy vs hard coded - python

import numpy as np
import math
def normalize(array):
    mean = sum(array) / len(array)
    deviation = [(float(element) - mean)**2 for element in array]
    std = math.sqrt(sum(deviation) / len(array))
    normalized = [(float(element) - mean)/std for element in array]
    numpy_normalized = (array - np.mean(array)) / np.std(array)
    print normalized
    print numpy_normalized
    print ""
normalize([2, 4, 4, 4, 5, 5, 7, 9])
normalize([1, 2])
normalize(range(5))
Outputs:
[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
[-1.5 -0.5 -0.5 -0.5 0. 0. 1. 2. ]
[0.0, 1.414213562373095]
[-1. 1.]
[-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
[-1.41421356 -0.70710678 0. 0.70710678 1.41421356]
Can someone explain to me why this code behaves differently in the second example, but similarly in the other two examples?
Did I do anything wrong in the hard coded example? What does NumPy do to end up with [-1, 1]?

As seaotternerd explains, you're using integers. And in Python 2 (unless you from __future__ import division), dividing an integer by an integer gives you an integer.
So, why aren't all three wrong? Well, look at the values. In the first one, the sum is 40 and the len is 8, and 40 / 8 = 5. And in the third one, 10 / 5 = 2. But in the second one, 3 / 2 = 1.5. Which is why only that one gets the wrong answer when you do integer division.
So, why doesn't NumPy also get the second one wrong? NumPy doesn't treat an array of integers as floats, it treats them as integers—print np.array(array).dtype and you'll see int64. However, as the docs for np.mean explain, "float64 intermediate and return values are used for integer inputs". And, although I don't know this for sure, I'd guess they designed it that way specifically to avoid problems like this.
As a side note, if you're interested in taking the mean of floats, there are other problems with just using sum / len. For example, the mean of [1, 2, 1e200, -1e200] really ought to be 0.75, but if you just do sum / len, you're going to get 0. (Why? Well, 1 + 2 + 1e200 == 1e200.) You may want to look at a simple stats library, even if you're not using NumPy, to avoid all these problems. In Python 3 (which would have avoided your problem in the first place), there's one in the stdlib, called statistics; in Python 2, you'll have to go to PyPI.
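To make the side note concrete, here is a small Python 3 sketch comparing a naive sum-based mean with math.fsum and statistics.mean from the standard library. The variable names are only illustrative.

```python
import math
import statistics

values = [1, 2, 1e200, -1e200]

# Naive approach: 1 + 2 + 1e200 rounds to 1e200, so the small terms are lost.
naive_mean = sum(values) / len(values)

# math.fsum keeps track of exact partial sums, so the small terms survive.
fsum_mean = math.fsum(values) / len(values)

# statistics.mean (Python 3.4+) also sums exactly before dividing.
stats_mean = statistics.mean(values)

print(naive_mean, fsum_mean, stats_mean)
```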

You aren't converting the numbers in the array to floats when calculating the mean. This isn't a problem for your first or third inputs, because they happen to work out neatly (as explained by @abarnert), but since the second input does not, and is composed exclusively of ints, you end up calculating the mean as 1 when it should be 1.5. This propagates through, resulting in your discrepancy with the results of using NumPy's functions.
If you replace the line where you calculate the mean with this, which forces Python to use float division:
mean = sum(array) / float(len(array))
you will ultimately get [-1, 1] as a result for the second set of inputs, just like NumPy.
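As a rough illustration of the difference, the Python 3 sketch below reproduces both behaviours without NumPy. The py2_int_division flag is a hypothetical switch added only for this example; it uses // to imitate what / did for two ints in Python 2.

```python
import math

def normalize(values, py2_int_division=False):
    """Standardize values to zero mean and unit (population) standard deviation."""
    n = len(values)
    total = sum(values)
    # Python 2's int / int behaved like floor division; emulate it on request.
    mean = total // n if py2_int_division else total / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

print(normalize([1, 2]))                         # float mean of 1.5
print(normalize([1, 2], py2_int_division=True))  # truncated mean of 1
```

With the float mean, [1, 2] normalizes to -1 and 1; with the truncated mean, the first element comes out as 0, as in the question's hard-coded output.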

Related

Matlab range in Python

I must translate some Matlab code into Python 3 and I often come across ranges of the form start:step:stop. When these arguments are all integers, I easily translate this command with np.arange(), but when some of the arguments are floats, especially the step parameter, I don't get the same output in Python. For example,
7:8 %In Matlab
7 8
If I want to translate it in Python I simply use :
np.arange(7,8+1)
array([7, 8])
But if I have, let's say :
7:0.3:8 %In Matlab
7.0000 7.3000 7.6000 7.9000
I can't translate it using the same logic :
np.arange(7, 8+0.3, 0.3)
array([ 7. , 7.3, 7.6, 7.9, 8.2])
In this case, I must not add the step to the stop argument.
But then, if I have :
7:0.2:8 %In Matlab
7.0000 7.2000 7.4000 7.6000 7.8000 8.0000
I can use my first idea :
np.arange(7,8+0.2,0.2)
array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
My problem comes from the fact that I am not translating hardcoded lines like these. In fact, each parameters of these ranges can change depending on the inputs of the function I am working on. Thus, I can sometimes have 0.2 or 0.3 as the step parameter. So basically, do you guys know if there is another numpy/scipy or whatever function that really acts like Matlab range, or if I must add a little bit of code by myself to make sure that my Python range ends up at the same number as Matlab's?
Thanks!
You don't actually need to add your entire step size to the upper limit of np.arange, just a very tiny number to make sure that the maximum is included. For example, the machine epsilon:
eps = np.finfo(np.float32).eps
adding eps will give you the same result as MATLAB does in all three of your scenarios:
In [13]: np.arange(7, 8+eps)
Out[13]: array([ 7., 8.])
In [14]: np.arange(7, 8+eps, 0.3)
Out[14]: array([ 7. , 7.3, 7.6, 7.9])
In [15]: np.arange(7, 8+eps, 0.2)
Out[15]: array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
Matlab docs for linspace say
linspace is similar to the colon operator, ":", but gives direct control over the number of points and always includes the endpoints. "lin" in the name "linspace" refers to generating linearly spaced values as opposed to the sibling function logspace, which generates logarithmically spaced values.
The NumPy arange documentation gives similar advice:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use linspace for these cases.
End of interval. The interval does not include this value, except
in some cases where step is not an integer and floating point
round-off affects the length of out.
So differences in how the step size gets translated into a number of steps can produce differences in the number of points. If you need consistency between the two codes, linspace is the better choice (in both).
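One way to sidestep the round-off issue is to compute the number of points first and then generate them from the start value. The sketch below is only an approximation of MATLAB's colon operator, not its actual algorithm; matlab_colon and the tol parameter are hypothetical names chosen for this example, and it returns a plain list rather than a NumPy array.

```python
import math

def matlab_colon(start, step, stop, tol=1e-10):
    """Approximate MATLAB's start:step:stop, including stop when it is reachable."""
    if step == 0:
        return []
    span = (stop - start) / step
    if span < 0:
        # The step points away from stop, so the range is empty.
        return []
    # A small tolerance guards against round-off such as 4.999999999 steps.
    n_steps = math.floor(span + tol)
    return [start + k * step for k in range(n_steps + 1)]

print(matlab_colon(7, 1, 8))
print(matlab_colon(7, 0.3, 8))
print(matlab_colon(7, 0.2, 8))
```

Because every element is computed as start + k * step, errors do not accumulate from one element to the next, and the element count no longer depends on nudging the stop value.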

Resolving Zeros in Product of items in list

Given that we can easily convert between product of items in list with sum of logarithm of items in list if there are no 0 in the list, e.g:
>>> from operator import mul
>>> pn = [0.4, 0.3, 0.2, 0.1]
>>> math.pow(reduce(mul, pn, 1), 1./len(pn))
0.22133638394006433
>>> math.exp(sum(0.25 * math.log(p) for p in pn))
0.22133638394006436
How should we handle cases where there are 0s in the list in Python (in a programmatically and mathematically correct way)?
More specifically, how should we handle cases like:
>>> pn = [0.4, 0.3, 0, 0]
>>> math.pow(reduce(mul, pn, 1), 1./len(pn))
0.0
>>> math.exp(sum(1./len(pn) * math.log(p) for p in pn))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
ValueError: math domain error
Is returning 0 really the right way to handle this? What is an elegant solution that takes the 0s in the list into account without ending up with 0?
After all, this is a kind of geometric average (product of the list), and it's not exactly useful to return 0 just because there is a single 0 in the list.
Spill over from Math Stackexchange:
https://math.stackexchange.com/questions/1727497/resolving-zeros-in-product-of-items-in-list, No answer from the math people, maybe the python/code Jedis have better ideas at resolving this.
TL;DR: Yes, returning 0 is the only right way.
(But see Conclusion.)
Mathematical background
In real analysis (i.e. not for complex numbers), when logarithms are considered, we traditionally assume the domain of log are real positive numbers. We have the identity:
x = exp(log(x)), for x>0.
It can be naturally extended to x=0 since the limit of the right hand side expression is well defined at x->0+ and equal to 0. Moreover, it's legit to set log(0)=-inf and exp(-inf)=0 (again: only for real, not complex, numbers). Formally, we extend the set of real numbers by adding two elements -inf, +inf and defining consistent arithmetic etc. (For our purposes, we need to have inf + x = inf for a real x, x * inf = inf for a real x > 0, inf + inf = inf, and likewise for -inf.)
The other identity x = log(exp(x)) is less troublesome and holds for all real numbers (and even x=-inf or +inf).
Geometric mean
The geometric mean can be defined for nonnegative numbers (possibly equal to zeros). For two numbers a, b (it naturally generalizes to more numbers, so I'll be using only two further on), it is
gm(a,b) = sqrt(a*b), for a,b >= 0.
Certainly, gm(0,b)=0. Taking log, we get:
log(gm(a,b)) = (log(a) + log(b))/2
and it is well defined if a or b is zero. (We can plug in log(0) = -inf and the identity still holds true thanks to the extended arithmetic we defined earlier.)
Interpretation
Not surprisingly, the notion of the geometric mean hails from geometry and was originally (in ancient Greece) used for strictly positive numbers.
Suppose we have a rectangle with sides of lengths a and b. Find a square with the area equal to the area of the rectangle. It is easy to see that the side of the square is the geometric mean of a and b.
Now, if we take a = 0, then we don't really have a rectangle and this geometric interpretation breaks. Similar problems can arise with other interpretations. We can mitigate it by considering, for example, degenerate rectangles and squares, but it may not always be a plausible approach.
Conclusion
It's up to a user (mathematician, engineer, programmer) how she understands the meaning of a geometric mean being zero. If it causes serious problems with interpretation of the results or breaks a computer program, then in the first place, maybe the choice of the geometric mean was not justified as a mathematical model.
Python
As already mentioned in the other answers, python has infinity implemented. It raises a runtime warning (division by zero) when executing np.exp(np.log(0)) but the result of the operation is correct.
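Note that, unlike NumPy, the standard math module raises ValueError for math.log(0), so a pure-Python version has to treat zeros explicitly. The following is a minimal sketch of a log-domain geometric mean that returns 0 when any element is 0, in line with the limit argument above; the function name is just illustrative.

```python
import math

def geometric_mean(values):
    """Geometric mean of nonnegative numbers, computed through logarithms."""
    if any(v < 0 for v in values):
        raise ValueError("geometric mean needs nonnegative values")
    if any(v == 0 for v in values):
        # log(0) = -inf and exp(-inf) = 0, so a single zero forces the result to 0.
        return 0.0
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

print(geometric_mean([0.4, 0.3, 0.2, 0.1]))
print(geometric_mean([0.4, 0.3, 0, 0]))
```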
Whether or not 0 is the correct result depends on what you're trying to accomplish. ptrj did a great job with their answer, so I will only add one thing to consider.
You may want to consider using an epsilon-adjusted geometric mean. Whereas a standard geometric mean is of the form (a_1*a_2*...*a_n)^(1/n), the epsilon-adjusted geometric mean is of the form ( (a_1+e)*(a_2+e)*...*(a_n+e) )^(1/n) - e. The appropriate value for epsilon (e) depends again on your task.
Epsilon-adjusted geometric means are sometimes used in data retrieval where a 0 in the set shouldn't cause a record's score to vanish entirely, though it should still penalize the record's score. See for example Score Aggregation Techniques in Retrieval Experimentation.
For example, with your data and an epsilon adjustment of 0.01
>>> from operator import mul
>>> pn=[0.4, 0.3, 0, 0]
>>> e=0.01
>>> pow(reduce(mul, [x+e for x in pn], 1), 1./len(pn)) - e
0.04970853116594962
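The same calculation can be wrapped in a small reusable function. This is a Python 3 sketch (math.prod needs Python 3.8 or later); the function name and the default eps are assumptions for illustration, and the right epsilon still depends on the task.

```python
import math

def eps_adjusted_geometric_mean(values, eps=0.01):
    """Shift every value by eps, take the geometric mean, then shift back."""
    shifted_product = math.prod(v + eps for v in values)
    return shifted_product ** (1.0 / len(values)) - eps

pn = [0.4, 0.3, 0, 0]
print(eps_adjusted_geometric_mean(pn, eps=0.01))
```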
You should return -math.inf in Python 3.5+ or -float('inf') in older versions. This is because the logarithm of numbers very close to 0 goes to negative infinity. This float value will preserve the correct inequalities between the sums of logs of different lists; for instance, one would expect that
sumlog([5, 4, 1, 0, 2]) < sumlog([5, 1, 4, 0.0001, 1])
This inequality holds if you return negative infinity.
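The sumlog function used above is not defined in this thread; a minimal Python 3 sketch of what it might look like, with log(0) mapped to negative infinity, is:

```python
import math

def sumlog(values):
    """Sum of natural logs, treating log(0) as negative infinity."""
    return sum(-math.inf if v == 0 else math.log(v) for v in values)

# A list containing a zero compares as smaller than any list of positive numbers.
print(sumlog([5, 4, 1, 0, 2]) < sumlog([5, 1, 4, 0.0001, 1]))
```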
You can try using list comprehensions in Python. They can be very powerful for customising the way your data is handled. This example uses list comprehension and an error number of -999.
>>> [math.log(i) if i > 0 else -999 for i in pn]
[-0.916290731874155, -1.2039728043259361, -999, -999]
If you're only using the if and not the else, then the if goes after the for i in pn part.
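For comparison, here is the filtering form, where the if comes after the for clause and the zeros are simply dropped rather than replaced by a sentinel:

```python
import math

pn = [0.4, 0.3, 0, 0]

# The condition after the for clause skips non-positive entries entirely.
logs_of_positive = [math.log(i) for i in pn if i > 0]
print(logs_of_positive)
```

Note that dropping the zeros changes the length of the list, so any average taken afterwards is over the positive entries only.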

Using and multiplying arrays in python

I have a set of tasks I have to complete. Please help me, I'm stuck on the multiplication one :(
1. np.array([0,5,10]) will create an array of integers starting at 0, finishing at 10, with step 5. Use a different command to create the same array automatically.
array_a = np.linspace(0,10,5)
print array_a
Is this correct? Also what is meant by automatically?
2. Create (automatically, not using np.array!) another array that contains 3 equally-spaced floating point numbers starting at 2.5 and finishing at 3.5.
array_b = np.linspace(2.5,3.5,3,)
print array_b
3. Use the multiplication operator * to multiply the two arrays together.
How do I multiply them? I get an error that they aren't the same shape, so do I need to slice array a?
The answer to the first problem is wrong; it asks you to create an array with elements [0, 5, 10]. When I run your code it prints [ 0. , 2.5, 5. , 7.5, 10. ] instead. I don't want to give the answer away completely (it is homework after all), but try looking up the docs for the arange function. You can solve #1 with either linspace or arange (you'll have to tweak the parameters either way), but I think the arange function is more suited to the specific wording of the question.
Once you've got #1 returning the correct result, the error in #3 should go away because the arrays will both have length 3 (i.e. they'll have the same shape).

numpy.arange divide by zero error

I have used numpy's arange function to make the following range:
a = n.arange(0,5,1/2)
This variable works fine by itself, but when I try putting it anywhere in my script I get an error that says
ZeroDivisionError: division by zero
First, your step evaluates to zero (on python 2.x that is). Second, you may want to check np.linspace if you want to use a non-integer step.
Docstring:
arange([start,] stop[, step,], dtype=None)
Return evenly spaced values within a given interval.
[...]
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use ``linspace`` for these cases.
In [1]: import numpy as np
In [2]: 1/2
Out[2]: 0
In [3]: 1/2.
Out[3]: 0.5
In [4]: np.arange(0, 5, 1/2.) # use a float
Out[4]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
If you're not using Python 3, the expression 1/2 evaluates to zero, since dividing two integers performs integer division.
You can fix this by replacing 1/2 with 1./2 or 0.5, or put from __future__ import division at the top of your script.
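As a quick Python 3 check of the points above, the sketch below builds the same range once from a float step with arange and once from a point count with linspace. It assumes NumPy is installed and imported as np.

```python
import numpy as np

step = 1 / 2         # true division in Python 3 gives 0.5
floor_step = 1 // 2  # floor division gives 0, which is what Python 2 computed for 1/2

# Step-based: the endpoint 5 is excluded.
by_step = np.arange(0, 5, step)

# Count-based: ask for 10 points from 0 to 4.5, endpoint included.
by_count = np.linspace(0, 4.5, 10)

print(by_step)
print(by_count)
```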

Creating Gaussian filter of required length in python

Could anyone suggest which library supports creation of a Gaussian filter of a required length and sigma? I basically need an equivalent for the MATLAB function below:
fltr = fspecial('gaussian',[1 n],sd)
You don't need a library for a simple 1D gaussian.
from math import pi, sqrt, exp
def gauss(n=11,sigma=1):
    r = range(-int(n/2),int(n/2)+1)
    return [1 / (sigma * sqrt(2*pi)) * exp(-float(x)**2/(2*sigma**2)) for x in r]
Note: This will always return an odd-length list centered around 0. I suppose there may be situations where you would want an even-length Gaussian with values for x = [..., -1.5, -0.5, 0.5, 1.5, ...], but in that case, you would need a slightly different formula and I'll leave that to you ;)
Output example with default values n = 11, sigma = 1:
>>> g = gauss()
1.48671951473e-06
0.000133830225765
0.00443184841194
0.0539909665132
0.241970724519
0.398942280401
0.241970724519
0.0539909665132
0.00443184841194
0.000133830225765
1.48671951473e-06
>>> sum(g)
0.99999999318053079
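For the even-length case mentioned in the note, one possible variation is to centre the sample points on (n - 1) / 2, which gives half-integer offsets when n is even. The sketch below also normalizes the weights so they sum to 1, which, as far as I know, is what MATLAB's fspecial does; gaussian_kernel is a hypothetical name, and this is not a verified drop-in replacement for fspecial.

```python
from math import exp

def gaussian_kernel(n, sigma):
    """Length-n Gaussian weights centred on the middle of the window, summing to 1."""
    center = (n - 1) / 2.0
    # For even n the offsets are ..., -1.5, -0.5, 0.5, 1.5, ...
    weights = [exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

print(gaussian_kernel(4, 1.0))
print(gaussian_kernel(11, 1.0))
```

Because the weights are divided by their sum, the 1 / (sigma * sqrt(2*pi)) factor cancels and can be omitted.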
Perhaps scipy.ndimage.filters.gaussian_filter? I've never used it, but the documentation is at: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.ndimage.filters.gaussian_filter.html
Try scipy.ndimage.gaussian_filter, but do you really want the kernel or do you also want to apply it? (In which case you can just use this function.) In the former case, apply the filter on an array which is 0 everywhere but with a 1 in the center. For the easier-to-write 1d case, this would be for example:
>>> ndimage.gaussian_filter1d(np.float_([0,0,0,0,1,0,0,0,0]), 1)
array([ 1.33830625e-04, 4.43186162e-03, 5.39911274e-02,
2.41971446e-01, 3.98943469e-01, 2.41971446e-01,
5.39911274e-02, 4.43186162e-03, 1.33830625e-04])
If run-time speed is important, I highly recommend creating the filter once and then reusing it on every iteration. Optimizations are constantly being made, but a couple of years ago this significantly sped up some code I wrote. (The above answers show how to create the filter.)
