Using masked numpy arrays with rpy2 - python

import numpy
import rpy2
from rpy2 import robjects
import rpy2.robjects.numpy2ri
r = robjects.r
rpy2.robjects.numpy2ri.activate()
x = numpy.array( [1, 5, -99, 4, 5, 3, 7, -99, 6] )
mx = numpy.ma.masked_values( x, -99 )
print x # works, displays all values
print r.sd(x) # works, but uses -99 values in calculation
print mx # works, now -99 values are masked (--)
print r.sd(mx) # does not work - error
I am a new user of rpy2 and numpy. I am using R 2.14.1, python 2.7.1, rpy2 2.2.5, numpy 1.5.1 on RHEL5.
I need to read data into a numpy array and use rpy2 functions on it. However, I need to mask missing values prior to using the array with rpy2.
I have no problem masking values, but I can't get rpy2 to work with the resulting masked array. It looks like the numpy2ri conversion doesn't work on masked numpy arrays (see the error below).
How can I make this work? Is it possible to tell rpy2 to ignore masked values? I'd like to stick with R rather than use scipy/numpy directly, since I'll be doing more advanced stats later.
Thanks.
Traceback (most recent call last):
  File "d.py", line 16, in <module>
    print r.sd(mx) # does not work - error
  File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 82, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 30, in __call__
    new_args = [conversion.py2ri(a) for a in args]
  File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/numpy2ri.py", line 36, in numpy2ri
    vec = SexpVector(o.ravel("F"), _kinds[o.dtype.kind])
TypeError: ravel() takes exactly 1 argument (2 given)
Update: Since rpy2 can't handle masked numpy arrays, I tried converting my -99 values to numpy NaN values. Apparently rpy2 recognizes numpy NaN values as R-style NA values.
The code below works because in the r.sd() call I can tell rpy2 to not use NA values. But the initial NaN substitution is definitely slower than applying the numpy mask.
Can any of you python wizards give me a faster way to do the -99 to NaN substitution across a large numpy ndarray? Or maybe suggest another approach?
Thanks.
# 'x' is a large numpy ndarray I am working with
# ('x' in the original code above was a small test array)
for i in range(900, 950):  # random slice of numpy ndarray
    for j in range(6225):  # full extent across slice
        if x[i][j] == -99:
            x[i][j] = numpy.NaN
y = x[933]  # random piece of converted range
sd = r.sd(y, **{'na.rm': 'TRUE'})  # r.sd() call that ignores numpy NaN values
print sd

The concept of "masked values" (that is, an array of values coupled with a list of indices to be masked) does not directly exist in R.
In R, values are either set to "missing" (NA), or a subset of the original data structure is taken (so a new object containing only this subset is created).
What happens behind the scenes in rpy2 during the numpy-to-rinterface conversion is that the numpy array is copied into an R array (the other way around, exposing an R array to numpy, does not necessarily require copying). There is no reason why masks could not be handled at that stage (this may make its way into the code base quicker if someone provides a patch). The alternative is to create a numpy array without the masked values, then feed this to rpy2.
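For instance, a minimal sketch using the mx from the question (compressed() is the standard numpy.ma method returning a plain 1-D ndarray holding only the unmasked values):
clean = mx.compressed()  # regular numpy.ndarray, the two -99 entries dropped
print r.sd(clean)  # R now sees only the 7 valid values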

You can speed up the process of replacing -99 values with NaN by using masked arrays, which are natively defined in numpy.ma, as in the following code:
x_masked = numpy.ma.masked_array(x, mask=(x == -99))
x_filled = x_masked.filled(numpy.NaN)
x_masked is a numpy.ma.MaskedArray; x_filled is a regular numpy.ndarray.
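As an aside, if x already has a float dtype (it must, to hold NaN), the same replacement can be done in one vectorized step with boolean indexing, skipping the intermediate masked array:
x[x == -99] = numpy.NaN  # in-place; assumes x is a float array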

Related

How to initialise a fixed-size ListArray in pyarrow from a numpy array efficiently?

How would I efficiently initialise a fixed-size pyarrow.ListArray from a suitably prepared numpy array?
The documentation of pyarrow.array indicates that a nested iterable input structure works, but in practice that does not work if the outer iterable is a numpy array:
import numpy as np
import pyarrow as pa
n = 1000
w = 3
data = np.arange(n*w,dtype="i2").reshape(-1,w)
# this works:
pa.array(list(data),pa.list_(pa.int16(),w))
# this fails:
pa.array(data,pa.list_(pa.int16(),w))
# -> ArrowInvalid: only handle 1-dimensional arrays
It seems ridiculous to split an input array directly matching the Arrow specification into n separate arrays and then re-assemble from there.
pyarrow.ListArray.from_arrays seems to require an offsets argument, which only has a meaning for variable-size lists.
I believe you are looking for pyarrow.FixedSizeListArray.from_arrays which, regrettably, appears undocumented (I went ahead and filed a JIRA ticket)
You'll want to flatten your numpy array into a contiguous 1-D array first.
import numpy as np
import pyarrow as pa

n = 10     # number of lists
width = 3  # fixed list size
# Or just skip the initial reshape, but keeping it in to simulate real data
arr = np.arange(n * width, dtype="i2").reshape(-1, width)
arr.shape = -1  # flatten in place (requires a contiguous array)
pa.FixedSizeListArray.from_arrays(arr, width)
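A quick sanity check (a sketch; the exact repr depends on the pyarrow version):
result = pa.FixedSizeListArray.from_arrays(arr, width)
print(result.type)  # fixed_size_list<item: int16>[3]
print(result[0])    # first list: [0, 1, 2]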

Python: How to insert block matrixes along diagonal of larger matrix

I have generated a random symmetric 100 x 100 matrix. I have also generated a number of random 10 x 10 symmetric matrices. Now I want to insert these 10 blocks along the diagonal of the 100 x 100. How do I go about doing this?
I thought about getting the diagonal indices and then inserting as
B[diag1, diag2] = A
But I cannot seem to get the diagonal indices out to insert in the code.
If you are using numpy, maybe this can help (it works for both symmetric and non-symmetric matrices):
import numpy as np

# Your initial 100 x 100 matrix
a = np.zeros((100, 100))

for i in range(10):
    # the 10 x 10 generated matrix with "random" numbers
    # I'm creating it with ones for checking that the code works
    b = np.ones((10, 10)) * (i + 1)
    # The random version would be:
    # b = np.random.rand(10, 10)
    # Diagonal insertion
    a[i*10:(i+1)*10, i*10:(i+1)*10] = b
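If SciPy is available, scipy.linalg.block_diag builds the same block-diagonal structure directly; a minimal sketch, assuming the ten 10 x 10 blocks are collected in a list:
import numpy as np
from scipy.linalg import block_diag

blocks = [np.random.rand(10, 10) for _ in range(10)]  # your ten random blocks
a = block_diag(*blocks)  # 100 x 100 matrix, blocks on the diagonal, zeros elsewhere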
If you are using numpy, another available solution is np.block:
import numpy as np
x1 = np.eye(10)
A = np.block([
    [x1, np.random.rand(10, 90)],
    [np.random.rand(10, 10), x1, np.random.rand(10, 80)],
    [np.random.rand(10, 20), x1, np.random.rand(10, 70)],
    [np.random.rand(10, 30), x1, np.random.rand(10, 60)],
    [np.random.rand(10, 40), x1, np.random.rand(10, 50)],
    [np.random.rand(10, 50), x1, np.random.rand(10, 40)],
    [np.random.rand(10, 60), x1, np.random.rand(10, 30)],
    [np.random.rand(10, 70), x1, np.random.rand(10, 20)],
    [np.random.rand(10, 80), x1, np.random.rand(10, 10)],
    [np.random.rand(10, 90), x1],
])
print(A)
x1 is your small matrix, and it can come from any distribution; I used the identity matrix for testing only.
Doing this in a vectorized way would be ideal - and would, in theory, look something like this:
In [50]: a = np.ones((100,100)); b = np.ones((10,10))*2;
In [51]: np.diagonal(a)[:] = np.ravel(b)
But that doesn't work because np.diagonal() returns a read-only view of the underlying array:
In [51]: np.diagonal(a)[:] = np.ravel(b)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-51-ac0ada1b350d> in <module>()
----> 1 np.diagonal(a)[:] = np.ravel(b)
ValueError: assignment destination is read-only
Running help(np.diagonal) sheds some light on this behavior, and reveals that, at some point in the future, the vectorized expression above will work, because np.diagonal() will return a mutable slice of the array:
In versions of NumPy prior to 1.7, this function always returned a new,
independent array containing a copy of the values in the diagonal.
In NumPy 1.7 and 1.8, it continues to return a copy of the diagonal,
but depending on this fact is deprecated. Writing to the resulting
array continues to work as it used to, but a FutureWarning is issued.
Starting in NumPy 1.9 it returns a read-only view on the original array.
Attempting to write to the resulting array will produce an error.
In some future release, it will return a read/write view and writing to
the returned array will alter your original array. The returned array
will have the same type as the input array.
However, Numpy (currently on version 1.13) still returns an immutable slice.
For anyone looking for a way to jump into Numpy and contribute, this would be a great first pull request.
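In the meantime, np.fill_diagonal (or a writable einsum view) achieves the same vectorized assignment without going through np.diagonal; a sketch, using the same interpretation as above (one 10 x 10 matrix supplying all 100 diagonal entries):
import numpy as np

a = np.ones((100, 100))
b = np.ones((10, 10)) * 2
np.fill_diagonal(a, np.ravel(b))  # writes all 100 diagonal entries at once
# or, via a writable view of the diagonal:
np.einsum('ii->i', a)[:] = np.ravel(b)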
Edit: I interpreted the question as asking how to use the 100 entries from a given 10 x 10 matrix, and assign them to the 100 diagonal entries of the 100 x 100 matrix. Perhaps you meant setting 10 separate 10 x 10 blocks of the 100 x 100 matrix using 10 10x10 matrices. (In which case, it would be helpful to specify that you have 10 10x10 matrices - or include a picture.)

Differences of scipy.spatial.KDTree in python 2.7 and 3.5

I have a pandas dataframe containing a list of x,y coordinates and I am using scipy.spatial to find the nearest point in the dataframe given an additional point.
import pandas as pd
import numpy as np
import scipy.spatial as spatial
stops = pd.read_csv("stops.csv")
pt = x, y  # the additional point to query
points = np.array(zip(stops['stop_lat'],stops['stop_lon']))
nn = points[spatial.KDTree(points).query(pt)[1]]
Now, in python 2.7 this works perfectly. In python 3.5 I get the following error:
.../scipy/spatial/kdtree.py", line 231, in __init__
self.n, self.m = np.shape(self.data)
ValueError: not enough values to unpack (expected 2, got 0)
In the docs I can't find anything useful.
In Python3, zip() returns an iterator object rather than a list of tuples. points will therefore be a 0-dimensional np.object array containing a zip iterator, rather than a 2D array of x, y coordinates.
You could construct a list from the iterator:
points = np.array(list(zip(stops['stop_lat'],stops['stop_lon'])))
However, a more elegant solution might be to avoid using zip altogether by indexing multiple columns of your dataframe:
points = stops[['stop_lat','stop_lon']].values
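Putting it together (a sketch; assumes stops.csv provides the stop_lat and stop_lon columns and pt is the query point from the question):
import pandas as pd
import scipy.spatial as spatial

stops = pd.read_csv("stops.csv")
points = stops[['stop_lat', 'stop_lon']].values
nn = points[spatial.KDTree(points).query(pt)[1]]  # nearest stop to pt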

numpy: apply operation to multidimensional array

Assume I have a matrix of matrices, which is an order-4 tensor. What's the best way to apply the same operation to all the submatrices, similar to Map in Mathematica?
#!/usr/bin/python3
from pylab import *
t=random( (8,8,4,4) )
#t2=my_map(det,t)
#then shape(t2) becomes (8,8)
EDIT
Sorry for the bad English; it's not my native language.
I tried numpy.linalg.det, but it doesn't seem to cope well with 3D or 4D tensors:
>>> import numpy as np
>>> a=np.random.rand(8,8,4,4)
>>> np.linalg.det(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/linalg/linalg.py", line 1703, in det
    sign, logdet = slogdet(a)
  File "/usr/lib/python3/dist-packages/numpy/linalg/linalg.py", line 1645, in slogdet
    _assertRank2(a)
  File "/usr/lib/python3/dist-packages/numpy/linalg/linalg.py", line 155, in _assertRank2
    'two-dimensional' % len(a.shape))
numpy.linalg.linalg.LinAlgError: 4-dimensional array given. Array must be two-dimensional
EDIT2 (Solved)
The problem is that older numpy versions (<1.8) don't support the inner loop in numpy.linalg.det; updating to numpy 1.8 solves the problem.
numpy 1.8 has gufuncs that can do this in a C loop; for example, numpy.linalg.det() is a gufunc:
import numpy as np
a = np.random.rand(8, 8, 4, 4)
np.linalg.det(a)  # operates on the trailing 4 x 4 matrices, returns shape (8, 8)
First check the documentation for the operation that you intend to use. Many have a way of specifying which axis to operate on (np.sum). Others specify which axes they use (e.g. np.dot).
For np.linalg.det the documentation includes:
a : (..., M, M) array_like
Input array to compute determinants for.
So np.linalg.det(t) returns an (8,8) array, having calculated each det using the last 2 dimensions.
While it is possible to iterate over dimensions (the first is the default), it is better to write a function built from numpy operations that work on the whole array.
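When no gufunc exists for the operation, a generic fallback is to collapse the leading dimensions, apply the 2-D operation per matrix, and restore the shape; a sketch, using det purely for illustration:
import numpy as np

t = np.random.rand(8, 8, 4, 4)
# loop over the 64 trailing 4 x 4 matrices, then reshape back to (8, 8)
t2 = np.array([np.linalg.det(m) for m in t.reshape(-1, 4, 4)]).reshape(8, 8)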

Matlab filter() with SciPy lfilter()

According to the documentation for Matlab's filter() and SciPy's lfilter(), they should be "compatible". However, while porting a larger Matlab codebase to Python, I get ValueError: object of too small depth for desired array. Since I can't present my source without complicating it, I'll use the example provided in Matlab's documentation:
data = [1:0.2:4]';
windowSize = 5;
filter(ones(1,windowSize)/windowSize,1,data)
which I translate in Python to:
import numpy as np
from scipy.signal import lfilter
data = np.arange(1, 4.1, 0.2)
windowSize = 5
lfilter(np.ones((1, windowSize)) / windowSize, 1, data)
In this case I get:
ValueError: object too deep for desired array
Why do I get these errors?
Is there a reason you're adding an extra dimension when creating your array of ones? lfilter expects the filter coefficients b as a 1-D array, and the extra dimension is what triggers the depth error. Is this what you need:
lfilter(np.ones(windowSize) / windowSize, 1, data)
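With the 1-D coefficient array the call runs and reproduces Matlab's moving average; a quick check (the first outputs ramp up as the window fills):
import numpy as np
from scipy.signal import lfilter

data = np.arange(1, 4.1, 0.2)
windowSize = 5
y = lfilter(np.ones(windowSize) / windowSize, 1, data)
print(y[:3])  # [0.2  0.44 0.72], matching Matlab's filter()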
