I was able to optimise some operations in my program quite a bit using numpy. When I profiled a run, I noticed that most of the time is spent in numpy.nan_to_num. I'd like to improve this even further.
The sort of calculations occurring are multiplication of two arrays for which one of the arrays could contain nan values. I want these to be treated as zeros, but I can't initialise the array with zeros, as nan has a meaning later on and can't be set to 0. Is there a way of doing multiplications (and additions) with nan being treated as zero?
From the nan_to_num docstring, I can see a new array is produced which may explain why it's taking so long.
Replace nan with zero and inf with finite numbers.
Returns an array or scalar replacing Not a Number (NaN) with zero,...
A function like nansum for arbitrary arithmetic operations would be great.
Here's some example data:
import numpy as np
a = np.random.rand(1000, 1000)
a[a < 0.1] = np.nan # set some random values to nan
b = np.ones_like(a)
One option is to use np.where to set the value of the result to 0 wherever one of your arrays is equal to NaN:
result = np.where(np.isnan(a), 0, a * b)
If you have to do several operations on an array that contains NaNs, you might consider using masked arrays, which provide a more general method for dealing with missing or invalid values:
masked_a = np.ma.masked_invalid(a)
result2 = masked_a * b
Here, result2 is another np.ma.masked_array whose .mask attribute is set according to where the NaN values were in a. To convert this back to a normal np.ndarray with the masked values replaced by 0s, you can use the .filled() method, passing in the fill value of your choice:
result_filled = result2.filled(0)
assert np.all(result_filled == result)
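If the temporary arrays are the concern, another option worth profiling is the ufunc's own `out=` and `where=` arguments. This is only a sketch on the example data above (with a seeded generator so it is reproducible); whether it is actually faster than nan_to_num for your shapes is something to measure, not something I can promise.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1000, 1000))
a[a < 0.1] = np.nan          # set some random values to nan
b = np.ones_like(a)

# Pre-fill the output with zeros, then multiply only where a is not NaN.
# Positions excluded by `where` keep whatever `out` already held (here 0).
out = np.zeros_like(a)
np.multiply(a, b, out=out, where=~np.isnan(a))
```

The original array a is left untouched, so its NaN values keep their meaning for later steps.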
I am trying to calculate the inverse of a matrix via:
A = pd.read_csv("A_use.csv", header = None)
I = np.identity(len(A))
INV = np.linalg.inv(I - A)
However, the resulting array is full of np.nan.
I don't understand why that's the case.
I've tried to replace all np.nan values in A (although there shouldn't be any) via A[np.isnan(A)] = 0 but the problem persists.
Any suggestions?
Not all matrices have an inverse. A square matrix has an inverse only if its determinant is non-zero.
Check first whether
np.linalg.det(I - A) != 0
If it's non-zero, then you should be able to do
np.linalg.inv(I-A)
Second, make sure I-A does not have a single NaN value. If it does, then computing its inverse will result in a matrix of NaN values.
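Both checks can be wrapped up roughly as below. The helper name `is_invertible` is made up for illustration, and it uses `np.linalg.matrix_rank` instead of comparing a floating-point determinant with zero, which is my substitution and not what the answer above proposes.

```python
import numpy as np

def is_invertible(m):
    """Hypothetical helper: True if m is square, finite and full rank."""
    m = np.asarray(m, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        return False
    # Any NaN or inf entry makes the result meaningless, so reject early.
    if not np.isfinite(m).all():
        return False
    # Full rank is equivalent to a non-zero determinant.
    return np.linalg.matrix_rank(m) == m.shape[0]

singular = np.array([[1.0, 2.0], [2.0, 4.0]])   # second row = 2 * first row
with_nan = np.array([[1.0, np.nan], [0.0, 1.0]])

print(is_invertible(np.identity(3)))
print(is_invertible(singular))
print(is_invertible(with_nan))
```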
The problem is in A.
There could be nan values in the dataframe.
The matrix being inverted, I - A, could be singular; please check that np.linalg.det(I - A) is not 0
Then I would pass to the function np.linalg.inv a numpy array using pd.DataFrame.to_numpy (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html)
The reason is an np.inf value in at least column 293. The np.inf value can be replaced via A[np.isinf(A)] = 0. If np.inf is replaced with zero, there are no np.nan values in L.
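For anyone hitting the same thing, a small sketch of how the inf could be located and replaced. The 2x2 matrix is invented for illustration (the real A_use.csv is not shown here), and replacing inf with 0 is only appropriate if that makes sense for your data.

```python
import numpy as np

# Made-up stand-in for the data read from the CSV file.
A = np.array([[0.2, 0.1],
              [0.3, np.inf]])

# Column indices that contain at least one inf value.
bad_cols = np.where(np.isinf(A).any(axis=0))[0]
print(bad_cols)

# Replace inf with 0 on a copy, then invert I - A as in the question.
A_clean = A.copy()
A_clean[np.isinf(A_clean)] = 0
M = np.identity(len(A_clean)) - A_clean
INV = np.linalg.inv(M)
```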
I'm looping over a large masked array along a specified axis, using element-wise comparison between the current index values and the previous minima.
Below is a simplified example to illustrate the problem. If one of the values is masked and the other is valid, numpy.minimum() returns a masked value. Is it possible to change this behaviour, such that the valid value is returned instead?
x = np.random.randn(5).round(2)
x[[2,4]] = np.nan
print(x)
print(np.minimum(x[-1], x[-2]))
y = np.ma.masked_invalid(x)
print(y)
print(np.minimum(y[-1], y[-2]))
>> [-1.21 2.02 nan -0.31 nan]
>> nan
>> [-1.21 2.02 -- -0.31 --]
>> --
Instead, I would like the return value to be -0.31.
In this simple example, I could use y.compressed() or something like that, but in my case, I have several dimensions, so that doesn't work.
Currently, I use a very high fill_value, which works because there's a physical constraint on the data, but I would like to have a more general solution.
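One possible direction, offered as a sketch and not as a confirmed answer: on plain NaN arrays `np.fmin` already ignores NaN when the other operand is valid. For masked arrays, the fill-value trick can be generalised by filling with +inf and re-masking only where both inputs were masked. The helper name `masked_minimum` is made up, and it assumes floating-point data.

```python
import numpy as np

# Plain arrays: fmin returns the non-NaN operand when only one is NaN.
print(np.fmin(np.nan, -0.31))

def masked_minimum(a, b):
    """Hypothetical helper: element-wise minimum that prefers valid values."""
    # The result stays masked only where BOTH inputs are masked.
    both_masked = np.ma.getmaskarray(a) & np.ma.getmaskarray(b)
    # +inf can never win a minimum against a valid finite value.
    values = np.minimum(np.ma.filled(a, np.inf), np.ma.filled(b, np.inf))
    return np.ma.masked_array(values, mask=both_masked)

a = np.ma.masked_invalid(np.array([1.0, np.nan, np.nan]))
b = np.ma.masked_invalid(np.array([2.0, 3.0, np.nan]))
result = masked_minimum(a, b)
print(result)
```

Because everything is element-wise, the same function works unchanged on arrays with several dimensions.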
I want to combine two arrays which represent a curve where the variable is column 1, however the column 0 values do not always match:
import numpy as np
arr1= np.array([(12,1003),(17,900),(20,810)])
arr2= np.array([(10,1020),(17,902),(19,870),(21,750)])
I want to combine these into one array where the column 0 is combined and both column 1s are stacked with gaps where there is no value for the corresponding column 0 value, something like this:
arr3=np.array([((10,None,1020),(12,1003,None),(17,900,902),(19,None,870),(20,810,None),(21,None,750))])
The reason for this is that I want to be able to get mean values of the second column for each array but they are not at exactly the same column 0 value so the idea of creating this array is to then interpolate to replace all the None values, then create mean values from column 1 and 2 and have an extra column to represent that.
I have used NumPy for everything else so far but obviously have got stuck with the np.column_stack function, as it needs lists of the same length and also will be blind to stacking based on values from column 0. Lastly, I do not want to create a fit for the data, as the actual data is non-linear and possibly not consistent, so a fit will not work and interpolation seems like the most accurate method.
There may be an answer already but due to me not knowing how to describe it well I can't find it. Also I am relatively new to python so please don't make any assumptions about my knowledge other than it is very little.
Thank you.
Will this help?
import pandas
import numpy as np
arr1= np.array([(12,1003),(17,900),(20,810)])
arr2= np.array([(10,1020),(17,902),(19,870),(21,750)])
d1 = pandas.DataFrame(arr1)
d2 = pandas.DataFrame(arr2)
d1.columns = d2.columns = ['t','v']
d3 = pandas.DataFrame(np.array(d1.merge(d2, on='t',how='outer')))
print(d3.values)
# use d3.to_numpy() to convert to a numpy array (as_matrix() has been removed from pandas)
output
[[ 12. 1003. nan]
[ 17. 900. 902.]
[ 20. 810. nan]
[ 10. nan 1020.]
[ 19. nan 870.]
[ 21. nan 750.]]
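Since the stated goal is to interpolate the gaps and then average the two curves, here is a NumPy-only sketch of that step. It uses linear interpolation via `np.interp`, which is my assumption about what "interpolate" should mean here. Note that `np.interp` holds the end values constant outside each curve's range, which may or may not be what you want at the edges.

```python
import numpy as np

arr1 = np.array([(12, 1003), (17, 900), (20, 810)])
arr2 = np.array([(10, 1020), (17, 902), (19, 870), (21, 750)])

# Sorted union of all column 0 values from both arrays.
t = np.union1d(arr1[:, 0], arr2[:, 0])

# Linearly interpolate each curve onto the common grid.
v1 = np.interp(t, arr1[:, 0], arr1[:, 1])
v2 = np.interp(t, arr2[:, 0], arr2[:, 1])

# Columns: t, curve 1, curve 2, mean of the two curves.
arr3 = np.column_stack([t, v1, v2, (v1 + v2) / 2])
print(arr3)
```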
I am trying to normalize rows of a numpy matrix using L2 norm (unity length).
I am seeing a problem when I do that.
Assuming my matrix 'b' is as follows:
Now when I do the normalization of first row as below it works fine.
But when I try to do it by iterating through all the rows and converting the same matrix b as below it gives me all zeros.
Any idea why that is happening and how to get the correct normalization?
Is there a faster way of row-normalizing a matrix without having to iterate over each row? I don't want to use the scikit-learn normalization function though.
Thanks
The problem comes from the fact that b has type int, so when you fill it in row by row, numpy automatically converts the results of your computation (float) to int, hence the zeros. One way to avoid that is to define b with type float by using 0., 1. etc., or by just adding .astype(float) at definition.
This should work to do the computation in one go which also doesn't require converting to float first:
b = b / np.linalg.norm(b, axis=1, keepdims=True)
This works because you are redefining the whole array rather than changing its rows one by one, and numpy is clever enough to make it float.
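To make the difference concrete, here is a small sketch. The matrix values are made up, since the matrix from the question is not shown here.

```python
import numpy as np

# Made-up integer matrix standing in for b.
b = np.array([[1, 2, 2],
              [3, 4, 0]])

# Row-wise assignment into an int array truncates the float results.
b_int = b.copy()
b_int[0] = b_int[0] / np.linalg.norm(b_int[0])
print(b_int[0])          # the row has been truncated to integers

# Rebinding the name to a new array keeps the float results.
normalized = b / np.linalg.norm(b, axis=1, keepdims=True)
print(normalized.dtype)
print(np.linalg.norm(normalized, axis=1))
```

A row consisting only of zeros has norm 0 and would produce NaN here, so such rows need separate handling if they can occur.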
How do I null certain values in a numpy array based on a condition?
I don't understand why I end up with 0 instead of null or empty values where the condition is not met... b is a numpy array populated with 0 and 1 values, c is another fully populated numpy array. All arrays are 71x71x166
a = np.empty(((71,71,166)))
d = np.empty(((71,71,166)))
for indexes, value in np.ndenumerate(b):
    i,j,k = indexes
    a[i,j,k] = np.where(b[i,j,k] == 1, c[i,j,k], d[i,j,k])
I want to end up with an array which only has values where the condition is met and is empty everywhere else, but without changing its shape
FULL ISSUE FOR CLARIFICATION as asked for:
I start with a float populated array with shape (71,71,166)
I make an int array based on a cutoff applied to the float array basically creating a number of bins, roughly marking out 10 areas within the array with 0 values in between
What I want to end up with is an array with shape (71,71,166) which has the average values in a particular array direction (assuming vertical direction, if you think of a 3D array as a 3D cube) of a certain "bin"...
so I was trying to loop through the "bins" b == 1, b == 2 etc, sampling the float where that condition is met but being null elsewhere so I can take the average, and then recombine into one array at the end of the loop....
Not sure if I'm making myself understood. I'm using the np.where and using the indexing as I keep getting errors when I try and do it without although it feels very inefficient.
Consider this example:
import numpy as np
data = np.random.random((4,3))
mask = np.random.randint(0, 2, (4,3))  # random 0/1 values; random_integers is deprecated
data[mask==0] = np.nan
The data will be set to nan wherever the mask is 0. You can use any kind of condition you want, of course, or do something different for different values in b.
To erase everything except a specific bin, try the following:
c[b!=1] = np.nan
So, to make a copy of everything in a specific bin:
a = np.copy(c)
a[b!=1] = np.nan
To get the average of everything in a bin:
np.mean(c[b==1])
So perhaps this might do what you want (where bins is a list of bin values):
a = np.empty(c.shape)
a[b==0] = np.nan
for bin_value in bins:
    a[b==bin_value] = np.mean(c[b==bin_value])
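Here is the same idea as a tiny self-contained example, so you can see what ends up in a. The arrays and bin labels are made up, and I start from an all-NaN array via `np.full` so that no element is left uninitialised.

```python
import numpy as np

# Made-up data: c holds the float values, b holds the bin label of each value.
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
b = np.array([0, 1, 1, 2, 2, 2])
bins = [1, 2]   # label 0 means "not in any bin"

# Start with NaN everywhere, then write each bin's mean into its positions.
a = np.full(c.shape, np.nan)
for bin_value in bins:
    a[b == bin_value] = np.mean(c[b == bin_value])

print(a)
```

The same boolean indexing works unchanged on 3D arrays such as the (71, 71, 166) ones in the question.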
np.empty sometimes fills the array with 0's; the contents of an empty() array are undefined, so 0 is perfectly valid. For example, try this instead:
d = np.nan * np.empty((71, 71, 166))
But consider using numpy's strength, and don't iterate over the array:
a = np.where(b, c, d)
(since b is 0 or 1, I've excluded the explicit comparison b == 1.)
You may even want to consider using a masked array instead:
a = np.ma.masked_where(b == 0, c)
which seems to make more sense with respect to your question: "how do I null certain values in a numpy array based on a condition" (replace null with mask and you're done).
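A short sketch comparing the two approaches on made-up 2D data, including an average along one axis, which seems to be the end goal. Keep in mind that `np.ma.masked_where` masks the entries where its condition is True, so the condition below is the complement of the one given to `np.where`.

```python
import numpy as np

# Made-up data: c holds the values, b flags which entries to keep.
c = np.arange(12, dtype=float).reshape(3, 4)
b = np.array([[1, 0, 1, 0],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])

# NaN approach: keep c where b == 1, NaN elsewhere, then a NaN-aware mean.
a_nan = np.where(b == 1, c, np.nan)
mean_nan = np.nanmean(a_nan, axis=0)

# Masked-array approach: hide everything where b is not 1.
a_masked = np.ma.masked_where(b != 1, c)
mean_masked = a_masked.mean(axis=0)

print(mean_nan)
print(mean_masked)
```

Both keep the original shape; they differ only in how the excluded entries are represented.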