Differences of scipy.spatial.KDTree in Python 2.7 and 3.5

I have a pandas dataframe containing a list of x,y coordinates and I am using scipy.spatial to find the nearest point in the dataframe given an additional point.
import pandas as pd
import numpy as np
import scipy.spatial as spatial
stops = pd.read_csv("stops.csv")
pt = x,y
points = np.array(zip(stops['stop_lat'],stops['stop_lon']))
nn = points[spatial.KDTree(points).query(pt)[1]]
Now, in Python 2.7 this works perfectly. In Python 3.5 I get the following error:
.../scipy/spatial/kdtree.py", line 231, in __init__
self.n, self.m = np.shape(self.data)
ValueError: not enough values to unpack (expected 2, got 0)
In the docs I can't find anything useful.

In Python 3, zip() returns an iterator object rather than a list of tuples. points will therefore be a 0-dimensional object array wrapping the zip iterator, rather than a 2D array of x, y coordinates.
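A quick illustration of that difference (a sketch of my own, not part of the original answer):
import numpy as np
# In Python 3, np.array() sees the zip iterator as a single object,
# so the result is a 0-d object array instead of an n-by-2 array.
pts = np.array(zip([1, 2], [3, 4]))
print(pts.shape)   # () -- which is why np.shape(self.data) has nothing to unpack
pts = np.array(list(zip([1, 2], [3, 4])))
print(pts.shape)   # (2, 2) -- the intended array of coordinate pairs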
You could construct a list from the iterator:
points = np.array(list(zip(stops['stop_lat'],stops['stop_lon'])))
However, a more elegant solution might be to avoid using zip altogether by indexing multiple columns of your dataframe:
points = stops[['stop_lat','stop_lon']].values

Related

Find indices of each integer group in a labelled array

I have a labelled array obtained by using scipy measure.label on a binary 2-dimensional array. For argument's sake it might look like this:
[
[1,1,0,0,2],
[1,1,1,0,2],
[1,0,0,0,0],
[0,0,0,3,3]
]
I want to get the indices of each group of labels. So in this case:
[
[(0,0),(0,1),(1,0),(1,1),(1,2),(2,0)],
[(0,4),(1,4)],
[(3,3),(3,4)]
]
I can do this using built-in Python like so (n and m are the dimensions of the array):
import itertools
import numpy as np

_dict = {}
for coords in itertools.product(range(n), range(m)):
    _dict.setdefault(labelled_array[coords], []).append(coords)
blobs = [np.array(item) for item in _dict.values()]
This is very slow (about 10 times slower than the initial labelling of the binary array using measure.label!)
Scipy also has a function find_objects:
from scipy import ndimage
objs = ndimage.find_objects(labelled_array)
From what I can gather, though, this returns the bounding box for each group (object). I don't want the bounding box; I want the exact coordinates of each value in the group.
I have also tried using np.where for each integer in the number of labels. This is very slow.
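For reference, a sketch of that per-label np.where approach (illustrative only, using the example array from above):
import numpy as np
labelled_array = np.array([
    [1, 1, 0, 0, 2],
    [1, 1, 1, 0, 2],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 3, 3],
])
blobs = []
for label in range(1, labelled_array.max() + 1):
    rows, cols = np.where(labelled_array == label)    # one full pass over the array per label
    blobs.append(np.column_stack((rows, cols)))       # (N, 2) array of coordinates for this label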
It also seems to me that what I'm trying to do here is something like the minesweeper algorithm. I suspect there must be an efficient solution using numpy or scipy.
Is there an efficient way to obtain these coordinates?

How to initialise a fixed-size ListArray in pyarrow from a numpy array efficiently?

How would I efficiently initialise a fixed-size pyarrow.ListArray from a suitably prepared numpy array?
The documentation of pyarrow.array indicates that a nested iterable input structure works, but in practice that does not work if the outer iterable is a numpy array:
import numpy as np
import pyarrow as pa
n = 1000
w = 3
data = np.arange(n*w,dtype="i2").reshape(-1,w)
# this works:
pa.array(list(data),pa.list_(pa.int16(),w))
# this fails:
pa.array(data,pa.list_(pa.int16(),w))
# -> ArrowInvalid: only handle 1-dimensional arrays
It seems ridiculous to split an input array directly matching the Arrow specification into n separate arrays and then re-assemble from there.
pyarrow.ListArray.from_arrays seems to require an offsets argument, which only has a meaning for variable-size lists.
I believe you are looking for pyarrow.FixedSizeListArray.from_arrays which, regrettably, appears undocumented (I went ahead and filed a JIRA ticket).
You'll want to flatten your numpy array to a contiguous 1D array first.
import numpy as np
import pyarrow as pa

n = 10       # number of fixed-size lists
width = 3    # size of each list
# Or just skip the initial reshape, but keep it in to simulate real data
arr = np.arange(n * width, dtype="i2").reshape(-1, width)
arr.shape = -1   # flatten back to a contiguous 1D array
pa.FixedSizeListArray.from_arrays(arr, width)
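A quick sanity check of the result (my addition, not part of the original answer):
result = pa.FixedSizeListArray.from_arrays(arr, width)
print(result.type)   # expected: fixed_size_list<item: int16>[3]
print(result[0])     # expected: the first list, [0, 1, 2]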

Grouping a numpy array

I have a huge NumPy array of size 778. I would like to pair the elements, so I'm using the following code to do so.
coordinates = coordinates.reshape(-1, 2,2)
However, if I use the following code instead, it works fine.
coordinates = coordinates[:len(coordinates)-1].reshape(-1, 2,2)
How can I do this properly, irrespective of the size?
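One way to make this size-independent (a sketch, assuming trailing leftover elements should simply be dropped, as the slicing above already does):
import numpy as np
coordinates = np.arange(10)                        # stand-in for the real 1-D data
usable = len(coordinates) - len(coordinates) % 4   # trim to a multiple of 4 (one 2x2 group = 4 values)
pairs = coordinates[:usable].reshape(-1, 2, 2)     # leftover values at the end are discarded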

counting points in grid cells in python, np.histogramdd

I have a numpy array containing the coordinates of points in 3-dimensional space:
import numpy as np
testdata=np.array([[0.5,0.5,0.5],[0.6,0.6,0.6],[0.7,0.7,0.7],[1.5,0.5,0.5],[1.5,0.6,0.6],[0.5,1.5,0.5],[0.5,1.5,1.5]])
Each row is one particle, with its three coordinates (x, y, z). There are seven points in this example. Is there any Python package for gridding the 3D space and then counting the particles in each cell?
I tried np.histogramdd in this way:
xcoord=testdata[:,0]
ycoord=testdata[:,1]
zcoord=testdata[:,2]
xedg=[0,1,2]
yedg=[0,1,2]
zedg=[0,1,2]
histo=np.histogramdd([xcoord,ycoord,zcoord],bins=(xedg,yedg,zedg),range=[[0,2],[0,2],[0,2]])
and it seems to be working, but the indexing is strange. I mean, the final array that np.histogramdd returns has no meaningful indexing with respect to the original coordinates. Is there any other way to grid the 3D space and count the number of points in each cell?
Not sure if this is what you need, but you can use pandas.
import pandas as pd
coords = [[1,2,3],[4,5,6],[7,8,9]]
df_coords = pd.DataFrame(coords)
df_coords.count()
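For what it's worth, here is a sketch (not from the original answer) of how the np.histogramdd output indexes back onto the grid: counts[i, j, k] is the number of points falling in the cell bounded by xedg[i:i+2], yedg[j:j+2] and zedg[k:k+2].
import numpy as np
testdata = np.array([[0.5,0.5,0.5],[0.6,0.6,0.6],[0.7,0.7,0.7],
                     [1.5,0.5,0.5],[1.5,0.6,0.6],[0.5,1.5,0.5],[0.5,1.5,1.5]])
edges = [0, 1, 2]
counts, (xedg, yedg, zedg) = np.histogramdd(testdata, bins=(edges, edges, edges))
print(counts[0, 0, 0])   # 3.0 -> the three points with all coordinates in [0, 1)
print(counts[1, 0, 0])   # 2.0 -> the two points with x in [1, 2) and y, z in [0, 1)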

Using masked numpy arrays with rpy2

import numpy
import rpy2
from rpy2 import robjects
import rpy2.robjects.numpy2ri
r = robjects.r
rpy2.robjects.numpy2ri.activate()
x = numpy.array( [1, 5, -99, 4, 5, 3, 7, -99, 6] )
mx = numpy.ma.masked_values( x, -99 )
print x # works, displays all values
print r.sd(x) # works, but uses -99 values in calculation
print mx # works, now -99 values are masked (--)
print r.sd(mx) # does not work - error
I am a new user of rpy2 and numpy. I am using R 2.14.1, python 2.7.1, rpy2 2.2.5, numpy 1.5.1 on RHEL5.
I need to read data into a numpy array and use rpy2 functions on it. However, I need to mask missing values prior to using the array with rpy2.
I have no problem masking values, but I can't get rpy2 to work with the resulting masked array. Looks like maybe the numpy2ri conversion doesn't work on masked numpy arrays? (see error below)
How can I make this work? Is it possible to tell rpy2 to ignore masked values? I'd like to stick with R rather than use scipy/numpy directly, since I'll be doing more advanced stats later.
Thanks.
Traceback (most recent call last):
File "d.py", line 16, in <module>
print r.sd(mx) # does not work - error
File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 82, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 30, in __call__
new_args = [conversion.py2ri(a) for a in args]
File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/numpy2ri.py", line 36, in numpy2ri
vec = SexpVector(o.ravel("F"), _kinds[o.dtype.kind])
TypeError: ravel() takes exactly 1 argument (2 given)
Update: Since rpy2 can't handle masked numpy arrays, I tried converting my -99 values to numpy NaN values. Apparently rpy2 recognizes numpy NaN values as R-style NA values.
The code below works because in the r.sd() call I can tell rpy2 to not use NA values. But the initial NaN substitution is definitely slower than applying the numpy mask.
Can any of you python wizards give me a faster way to do the -99 to NaN substitution across a large numpy ndarray? Or maybe suggest another approach?
Thanks.
# 'x' is a large numpy ndarray I am working with
# ('x' in the original code above was a small test array)
for i in range(900, 950): # random slice of numpy ndarray
    for j in range(6225): # full extent across slice
        if x[i][j] == -99:
            x[i][j] = numpy.NaN
y = x[933] # random piece of converted range
sd = r.sd( y, **{'na.rm': 'TRUE'} ) # r.sd() call that ignores numpy NaN values
print sd
The concept of "masked values" (that is, an array of values coupled with a list of indices to be masked) does not directly exist in R.
In R, values are either set to "missing" (NA), or a subset of the original data structure is taken (so a new object containing only this subset is created).
Now, what happens behind the scenes in rpy2 during the numpy-to-rinterface conversion is that the numpy array is copied into an R array (the other way around, exposing an R array to numpy, does not necessarily require copying). There is no reason why masks could not be handled at that stage (and this might make its way into the code base more quickly if someone provides a patch). The alternative is to create a numpy array without the masked values and then feed that to rpy2.
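A sketch of that last alternative (my illustration, assuming the same setup as in the question; masked arrays expose their unmasked values via compressed()):
mx = numpy.ma.masked_values(x, -99)
print(r.sd(mx.compressed()))   # only the unmasked values are handed to R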
You can speed up the process of replacing -99 values with NaN by using masked arrays, which are natively defined in numpy.ma, as in the following code:
x_masked = numpy.ma.masked_array(x, mask= (x==-99) )
x_filled = x_masked.filled( numpy.NaN )
x_masked is a numpy.ma masked array.
x_filled is a numpy.ndarray (a regular numpy array).
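Putting the two snippets together with the r.sd call from the question (a sketch; the same rpy2 setup as above is assumed):
x_filled = numpy.ma.masked_array(x, mask=(x == -99)).filled(numpy.NaN)
print(r.sd(x_filled, **{'na.rm': 'TRUE'}))   # NaN is passed to R as NA and then ignored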
