Using and multiplying arrays in Python

I have a set of tasks I have to complete. Please help me, I'm stuck on the multiplication one :(
1. np.array([0,5,10]) will create an array of integers starting at 0, finishing at 10, with step 5. Use a different command to create the same array automatically.
array_a = np.linspace(0,10,5)
print array_a
Is this correct? Also, what is meant by "automatically"?
2. Create (automatically, not using np.array!) another array that contains 3 equally-spaced floating point numbers starting at 2.5 and finishing at 3.5.
array_b = np.linspace(2.5, 3.5, 3)
print array_b
3. Use the multiplication operator * to multiply the two arrays together.
How do I multiply them? I get an error that they aren't the same shape, so do I need to slice array_a?

The answer to the first problem is wrong; it asks you to create an array with elements [0, 5, 10]. When I run your code it prints [ 0. , 2.5, 5. , 7.5, 10. ] instead. I don't want to give the answer away completely (it is homework after all), but try looking up the docs for the arange function. You can solve #1 with either linspace or arange (you'll have to tweak the parameters either way), but I think the arange function is more suited to the specific wording of the question.
Once you've got #1 returning the correct result, the error in #3 should go away because the arrays will both have length 3 (i.e. they'll have the same shape).
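To see what the elementwise * in #3 does once both arrays have the same shape, here is a small sketch with made-up three-element arrays (deliberately not the homework values):
>>> import numpy as np
>>> a = np.arange(0, 30, 10)       # start, stop (exclusive), step -> array([ 0, 10, 20])
>>> b = np.linspace(1.5, 2.5, 3)   # start, stop (inclusive), number of points -> array([ 1.5,  2. ,  2.5])
>>> a * b                          # elementwise product, works because both have shape (3,)
array([  0.,  20.,  50.])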

Related

Efficient way of column sum and reciprocal in a matrix

I am working with large matrices (up to a million by a million). I want to sum each column of the matrix and put the reciprocal of each column sum into that column's elements wherever they are non-zero. I have made two attempts at this, but I still want a faster method of computation, and since some columns sum to zero I can't use np.reciprocal directly.
Here are my attempts:
import numpy as np
import scipy as sc   # assuming this is how "sc" was imported

A = np.array([[0,1,1,1],[0,0,1,0],[0,1,0,0],[0,0,0,0]])
d = sc.shape(A)[0]
V = sc.zeros(d)
sc.sum(A, axis=0, out=V, dtype='int')
with sc.errstate(divide='ignore', invalid='ignore'):
    Vs = sc.true_divide(1, V)
    Vs[~sc.isfinite(Vs)] = 0   # -inf, inf, NaN -> 0
print Vs
Second attempt:
A = np.array([[0,1,1,1],[0,0,1,0],[0,1,0,0],[0,0,0,0]])
d = sc.shape(A)[0]
V = sc.zeros(d)
sc.sum(A, axis=0, out=V, dtype='int')
for i in range(0, d):
    if V[i] != 0:
        V[i] = 1 / V[i]
print V
Is there a faster way than this? My running time is very poor.
Thanks
edit1: Do you think changing everything to csr sparse matrix format would make it faster?
The question NumPy: Return 0 with divide by zero discusses various divide-by-zero options. The accepted answer looks a lot like your first try, but there's a newer answer that might (?) be faster:
https://stackoverflow.com/a/37977222/901925
In [240]: V=A.sum(axis=0)
In [241]: np.divide(1,V,out=np.zeros(V.shape),where=V>0)
Out[241]: array([ 0. , 0.5, 0.5, 1. ])
Your example is too small to make meaningful time tests on. I don't have any intuition about the relative speeds (beyond my comment).
A recent SO question pointed out that the out parameter is required with where in the latest release (1.13) but optional in earlier ones.
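Putting those pieces together, here is a minimal sketch of the fully vectorized version as a function (the name reciprocal_colsum is mine, not from the question):
import numpy as np

def reciprocal_colsum(A):
    # Reciprocal of each column sum, with 0 left in place for all-zero columns.
    V = A.sum(axis=0)                       # column sums, shape (n_cols,)
    out = np.zeros(V.shape, dtype=float)    # zeros stay zero where V == 0
    np.divide(1, V, out=out, where=V > 0)   # divide only where the sum is positive
    return out

A = np.array([[0,1,1,1],[0,0,1,0],[0,1,0,0],[0,0,0,0]])
print(reciprocal_colsum(A))                 # [ 0.   0.5  0.5  1. ]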

Why don't np.where & np.min seem to work right with this array?

The issues
So I have an array I imported containing values ranging from ~0.0 to ~0.76. When I started trying to find the min & max values using NumPy, I ran into some strange inconsistencies that I'd like to know how to solve if they're my fault, or avoid if they're programming errors on the NumPy developers' end.
The code
Let's start with finding the location of the maximum values using np.max & np.where.
print array.shape
print np.max(array)
print np.where(array == 0.763728955743)
print np.where(array == np.max(array))
print array[35,57]
The output is this:
(74, 145)
0.763728955743
(array([], dtype=int64), array([], dtype=int64))
(array([35]), array([57]))
0.763728955743
When I look for the locations where the array exactly equals the maximum entry's value, NumPy doesn't find any. However, when I search for the location of the maximum value without specifying what that value is, it works. Note that this doesn't happen with np.min.
Now I have a different issue regarding minima.
print array.shape
print np.min(array)
print np.where(array == 0.0)
print np.where(array == np.min(array))
print array[10,25], array[31,131]
Look at the returns.
(74, 145)
0.0
(array([10, 25]), array([ 31, 131]))
(array([10, 25]), array([ 31, 131]))
0.0769331747301 1.54220192172e-09
1.54e-09 is close enough to 0.0 that it seems like it would be the minimum value. But why is a location with the value 0.077 also listed by np.where? That's not even close to 0.0 compared to the other value.
The Questions
Why doesn't np.where seem to work when I type in the maximum value of the array, but it does when I search for np.max(array) instead? And why does np.where() mixed with np.min() return two locations, one of which is definitely not the minimum value?
You have two issues: the interpretation of floats and the interpretation of the results of np.where.
Non-integer floating point numbers are stored internally in binary, and most decimal values cannot be represented exactly in binary; the decimal string that gets printed is itself only an approximation of the stored value. This is why np.where(array == 0.763728955743) returns an empty array, while print np.where(array == np.max(array)) does the right thing: the second case compares against the exact binary number stored in the array, without any decimal round trip. The search for the minimum succeeds because 0.0 can be represented exactly in both decimal and binary. In general, it is a bad idea to compare floats using == for this and related reasons.
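As a quick illustration of the pitfall (the classic example, unrelated to the question's data):
>>> 0.1 + 0.2 == 0.3   # both sides print as 0.3 ...
False
>>> 0.1 + 0.2          # ... but the stored binary value differs slightly
0.30000000000000004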
For the single-argument version of np.where that you are using, it is equivalent to np.nonzero. You are misinterpreting its results: it returns one array per dimension of the input, not an array of individual coordinate pairs. There are a number of ways of saying this differently:
If you had three matches, you would be getting two arrays back, each with three elements.
If you had a 3D input array with two matches, you would get three arrays back, each with two elements.
The first array is row-coordinates (dim 0) and the second array is column-coordinates (dim 1).
Notice how you read the output of where in the maximum case: as a single (row, column) pair. That is correct, but it is not how you are reading it in the minimum case.
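To make the pairing concrete, here is a small sketch using the question's arrays (sticking with the question's Python 2): zipping the two index arrays recovers the actual (row, column) positions, and indexing with those shows that both entries really are the minimum.
>>> rows, cols = np.where(array == np.min(array))
>>> zip(rows, cols)          # in Python 3, use list(zip(rows, cols))
[(10, 31), (25, 131)]
>>> array[10, 31], array[25, 131]
(0.0, 0.0)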
There are a number of ways of dealing with these issues. One option is to use np.argmax and np.argmin. For a 2D array these return the flat index of the first maximum or minimum (an index into the flattened array), so combine them with np.unravel_index to turn that into (row, column) coordinates.
>>> x = np.unravel_index(np.argmax(array), array.shape)
>>> print(x)
(35, 57)
>>> print(array[x])
0.763728955743
The only possible problem here is that you may want to get all of the coordinates.
In that case, using where or nonzero is fine. The only difference from your code is that you should print
print array[10,31], array[25,131]
instead of the transposed values as you are doing.
Try using numpy.isclose() instead of ==, because floating point numbers cannot be tested for exact equality.
I.e. change this: np.where(array == 0.763728955743)
to: np.where(np.isclose(array, 0.763728955743))
np.min() and np.max() work as expected for me. Also note you can provide an axis like arr.min(axis=1) if you want to.
If this does not solve it, perhaps you could post some csv data somewhere to try to reproduce the problem? I kinda highly doubt it is a bug with numpy itself but you never know!
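For completeness, a tiny sketch of what that looks like with the max value quoted in the question (the indices are taken from the output above):
>>> mask = np.isclose(array, 0.763728955743)   # True wherever the entry is approximately equal
>>> np.where(mask)
(array([35]), array([57]))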

Matlab range in Python

I must translate some Matlab code into Python 3 and I often come across ranges of the form start:step:stop. When these arguments are all integers, I easily translate this command with np.arange(), but when some of the arguments are floats, especially the step parameter, I don't get the same output in Python. For example,
7:8 %In Matlab
7 8
If I want to translate it to Python I simply use:
np.arange(7,8+1)
array([7, 8])
But if I have, let's say :
7:0.3:8 %In Matlab
7.0000 7.3000 7.6000 7.9000
I can't translate it using the same logic:
np.arange(7, 8+0.3, 0.3)
array([ 7. , 7.3, 7.6, 7.9, 8.2])
In this case, I must not add the step to the stop argument.
But then, if I have :
7:0.2:8 %In Matlab
7.0000 7.2000 7.4000 7.6000 7.8000 8.0000
I can use my first idea:
np.arange(7,8+0.2,0.2)
array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
My problem comes from the fact that I am not translating hardcoded lines like these. In fact, each parameter of these ranges can change depending on the inputs of the function I am working on. Thus, I can sometimes have 0.2 or 0.3 as the step parameter. So basically, do you guys know if there is another numpy/scipy or whatever function that really acts like Matlab's range, or do I have to add a little bit of code myself to make sure that my Python range ends at the same number as Matlab's?
Thanks!
You don't actually need to add your entire step size to the stop limit of np.arange, just a very tiny number to make sure that the maximum is included. For example, the machine epsilon:
eps = np.finfo(np.float32).eps
Adding eps gives you the same result as MATLAB in all three of your scenarios:
In [13]: np.arange(7, 8+eps)
Out[13]: array([ 7., 8.])
In [14]: np.arange(7, 8+eps, 0.3)
Out[14]: array([ 7. , 7.3, 7.6, 7.9])
In [15]: np.arange(7, 8+eps, 0.2)
Out[15]: array([ 7. , 7.2, 7.4, 7.6, 7.8, 8. ])
The MATLAB docs for linspace say:
linspace is similar to the colon operator, ":", but gives direct control over the number of points and always includes the endpoints. "lin" in the name "linspace" refers to generating linearly spaced values as opposed to the sibling function logspace, which generates logarithmically spaced values.
The numpy arange docs give similar advice:
When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases.
and, describing the stop parameter:
End of interval. The interval does not include this value, except in some cases where step is not an integer and floating point round-off affects the length of out.
So differences in how a floating-point step interacts with the end point can produce a different number of elements. If you need consistency between the two codes, linspace is the better choice (in both).
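If you want a drop-in substitute for MATLAB's colon operator, one common trick is to work out the number of points first and let linspace place them. A rough sketch (the helper name matlab_colon is mine, and the tolerance handling is simpler than MATLAB's actual rules):
import numpy as np

def matlab_colon(start, step, stop):
    # Approximate MATLAB's start:step:stop for step > 0.
    # The small tolerance absorbs floating-point round-off in (stop - start) / step.
    n = int(np.floor((stop - start) / step + 1e-10)) + 1
    return np.linspace(start, start + step * (n - 1), n)

print(matlab_colon(7, 1, 8))     # [ 7.  8.]
print(matlab_colon(7, 0.3, 8))   # [ 7.   7.3  7.6  7.9]
print(matlab_colon(7, 0.2, 8))   # [ 7.   7.2  7.4  7.6  7.8  8. ]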

Normalization using Numpy vs hard coded

import numpy as np
import math
def normalize(array):
    mean = sum(array) / len(array)
    deviation = [(float(element) - mean)**2 for element in array]
    std = math.sqrt(sum(deviation) / len(array))
    normalized = [(float(element) - mean)/std for element in array]
    numpy_normalized = (array - np.mean(array)) / np.std(array)
    print normalized
    print numpy_normalized
    print ""

normalize([2, 4, 4, 4, 5, 5, 7, 9])
normalize([1, 2])
normalize(range(5))
Outputs:
[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
[-1.5 -0.5 -0.5 -0.5 0. 0. 1. 2. ]
[0.0, 1.414213562373095]
[-1. 1.]
[-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
[-1.41421356 -0.70710678 0. 0.70710678 1.41421356]
Can someone explain to me why this code behaves differently in the second example, but similarly in the other two examples?
Did I do anything wrong in the hard coded example? What does NumPy do to end up with [-1, 1]?
As seaotternerd explains, you're using integers. And in Python 2 (unless you from __future__ import division), dividing an integer by an integer gives you an integer.
So, why aren't all three wrong? Well, look at the values. In the first one, the sum is 40 and the len is 8, and 40 / 8 = 5. And in the third one, 10 / 5 = 2. But in the second one, 3 / 2 = 1.5. Which is why only that one gets the wrong answer when you do integer division.
So, why doesn't NumPy also get the second one wrong? NumPy doesn't treat an array of integers as floats, it treats them as integers—print np.array(array).dtype and you'll see int64. However, as the docs for np.mean explain, "float64 intermediate and return values are used for integer inputs". And, although I don't know this for sure, I'd guess they designed it that way specifically to avoid problems like this.
As a side note, if you're interested in taking the mean of floats, there are other problems with just using sum / div. For example, the mean of [1, 2, 1e200, -1e200] really ought to be 0.75, but if you just do sum / div, you're going to get 0. (Why? Well, 1 + 2 + 1e200 == 1e200.) You may want to look at a simple stats library, even if you're not using NumPy, to avoid all these problems. In Python 3 (which would have avoided your problem in the first place), there's one in the stdlib, called statistics; in Python 2, you'll have to go to PyPI.
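As a quick illustration of that last point (a sketch only, using the example values above):
>>> import math, statistics
>>> data = [1, 2, 1e200, -1e200]
>>> sum(data) / len(data)          # naive mean: the huge values swallow the 1 + 2
0.0
>>> math.fsum(data) / len(data)    # fsum tracks the partial sums exactly
0.75
>>> statistics.mean(data)          # stdlib statistics module (Python 3)
0.75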
You aren't converting the numbers in the array to floats when calculating the mean. This isn't a problem for your first or third inputs, because they happen to work out neatly (as explained by @abarnert), but since the second input does not, and is composed exclusively of ints, you end up calculating the mean as 1 when it should be 1.5. This propagates through, resulting in your discrepancy with the results of NumPy's functions.
If you replace the line where you calculate the mean with this, which forces Python to use float division:
mean = sum(array) / float(len(array))
you will ultimately get [-1, 1] as a result for the second set of inputs, just like NumPy.
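For reference, a minimal sketch of the whole hard-coded computation with the division fixed (same idea, just wrapped up; the from __future__ import is an alternative to the float() cast):
from __future__ import division   # in Python 2, makes / behave as true division
import math

def normalize_fixed(array):
    mean = sum(array) / len(array)                                    # 1.5 for [1, 2]
    std = math.sqrt(sum((x - mean) ** 2 for x in array) / len(array))
    return [(x - mean) / std for x in array]

print(normalize_fixed([1, 2]))   # [-1.0, 1.0], matching NumPy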

"Invalid input data" from SciPy's cublic spline interpolation process; bad results from interpolate.bisplrep?

I'm attempting to use scipy.interpolate.bisplrep and scipy.interpolate.bisplev to perform a 2D regression on the differences between two datasets, based on a small set of known differences. The code is:
splineRT = interp.bisplrep(diffPoints[0], diffPoints[1], RTdiffs)
allDiffs = interp.bisplev(features[0], features[1], splineRT)
When I run this, bisplev throws the inscrutable exception "ValueError: Invalid input data", which is raised in response to an error code returned by the underlying _fitpack._bisplev function. I don't know nearly enough about splines to know what qualifies as an invalid description of one, but I did look at the value of splineRT, which is:
[array([ 367.51732902, 367.51732902, 367.51732902, 367.51732902,
911.4739006 , 911.4739006 , 911.4739006 , 911.4739006 ]),
array([ 1251.8868, 1251.8868, 1251.8868, 1251.8868, 1846.2027,
1846.2027, 1846.2027, 1846.2027]),
array([ -1.36687935e+04, 3.78197089e+04, -6.83863404e+04,
-7.25568790e+04, 4.90004158e+04, -1.11701213e+05,
2.02854711e+05, -1.67569797e+05, -7.22174063e+04,
1.27574330e+05, -2.33080009e+05, 2.80073578e+05,
3.37054374e+04, 1.89380033e+04, -1.81027026e+04,
-2.51210000e+00]),
3,
3]
What strikes me is that the first two elements, which hold the spline's knots, are each eight elements long and consist of only two unique values repeated four times. Both unique values come from the corresponding diffPoints lists, but each diffPoints list contains 16 unique elements.
What's going on here? And/or is the problem this or something else? Any assistance is appreciated.
EDIT: Here's a transcript of the bug (?) in action, start-to-finish: https://www.dropbox.com/s/w758s7racfy9q4s/interpolationBug.txt
From my past experience with this problem, features[0] and features[1] must be sorted in ascending order for bisplev to work.
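A minimal sketch of what that can look like in practice, assuming features and splineRT are as in the question (the sort/unsort bookkeeping is my addition, and whether sorting alone fixes the error depends on your data):
import numpy as np
import scipy.interpolate as interp

# bisplev wants each coordinate array in ascending order, so sort both axes
# and keep the permutations so the result can be put back in the original order.
xi = np.argsort(features[0])
yi = np.argsort(features[1])
sortedDiffs = interp.bisplev(features[0][xi], features[1][yi], splineRT)

# bisplev returns values on the (sorted x) by (sorted y) grid;
# inverting the permutations lines the result up with the original ordering.
allDiffs = sortedDiffs[np.argsort(xi)][:, np.argsort(yi)]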
