I need to recover the "raw data" from a timing histogram that a timing counter produced as a .csv file.
I've got the code below, but since the actual data has several thousand counts in each bin, the for loop takes a very long time, so I was wondering whether there is a better way.
import numpy as np
# Example histogram with 1 second bins
hist = np.array([[1., 2., 3., 4., 5., 6., 7., 8., 9., 10.], [0, 17, 3, 34, 35, 100, 101, 107, 12, 1]])
# Array for bins and counts
time_bins = hist[0]
counts = hist[1]
# Empty data to append
data = np.empty(0)
for i in range(np.size(counts)):
    for j in range(int(counts[i])):  # counts is a float array, so cast for range()
        data = np.append(data, [time_bins[i]])
I get that the resolution of the recovered data will be limited to the smallest time bin, but that is fine for my purposes.
In the end, this is so that I can produce another histogram with logarithmic bins, which I am able to do from the raw data.
EDIT
The code I'm using to load the CSV is
x = np.loadtxt(fname, delimiter=',', skiprows=1).T
a = x[0]
b = x[1]
data = np.empty(0)
for i in range(np.size(b)):
    for j in range(int(b[i])):  # np.int is removed in recent numpy; use the builtin int
        data = np.append(data, [a[i]])
You can do this with a list comprehension and np.concatenate:
import numpy as np
hist = np.array([[1., 2., 3., 4., 5., 6., 7., 8., 9., 10.], [0, 17, 3, 34, 35, 100, 101, 107, 12, 1]])
new_array = np.concatenate([[hist[0][i]]*int(hist[1][i]) for i in range(len(hist[0]))])
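For what it's worth, np.repeat performs the same expansion in a single vectorized call (assuming the counts are non-negative integers, since repeat counts must be ints):
new_array = np.repeat(hist[0], hist[1].astype(int))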
Especially if there's a lot of data, copying the array on each iteration (which is what append does -- numpy arrays can't be resized in place) will be costly. Try allocating first (one slot per count, i.e. data = np.zeros(int(counts.sum()))) and then just assigning into it.
I'm also not sure what your innermost for loop is doing, since each iteration appends the same thing?
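A minimal sketch of that preallocation idea, reusing time_bins and counts from the question and assuming the counts are non-negative integers:
import numpy as np

counts_int = counts.astype(int)
data = np.empty(counts_int.sum())    # one slot per count
pos = 0
for t, c in zip(time_bins, counts_int):
    data[pos:pos + c] = t            # write this bin's c copies in one slice assignment
    pos += c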
I have an (n, m) tensor X where I want to zero out all values smaller than some threshold t. I.e.,
X = X * tf.cast(tf.greater(X, t), X.dtype)
I was wondering, is there a more efficient way to do this? Because X in my setup is huge and, as I understand it, tf.cast(tf.greater(X, t), X.dtype) constructs another tensor that needs as much memory as X.
What is wrong with the good old
for i in range(n):
    for j in range(m):
        if X[i][j] < t: X[i][j] = 0
I am not sure if this will be more efficient:
x = tf.constant([1, 2, 3, 4, 5, 6, 7])
y = tf.where(tf.greater(x, tf.constant(5)),
             x,                 # if true
             tf.zeros_like(x))  # if false
with tf.Session() as sess:
    a = sess.run(y)
    # a is [0, 0, 0, 0, 0, 6, 7]
If X is your matrix (a numpy array, I assume) you can try:
x[x < small_value] = 0
If creating the boolean mask takes too much memory, you can try doing that through a loop over individual columns.
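A minimal sketch of that column-by-column variant, so the temporary boolean mask only ever covers a single column:
for j in range(x.shape[1]):
    col = x[:, j]               # a view into x, so the assignment modifies x itself
    col[col < small_value] = 0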
foo = tf.constant([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
threshold_map = tf.greater(foo, tf.constant(5.))
threshold_map_index = tf.reshape(tf.where(threshold_map), [-1])
foo_threshold = tf.gather(foo, threshold_map_index)
# foo_threshold = [6., 7., 8., 9., 10.]
(this won't work with more than one dimension)
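For what it's worth, tf.boolean_mask(foo, threshold_map) should produce the same extraction in one call; for higher-rank inputs it flattens the result, so the one-dimension caveat still applies.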
Is there a filter similar to ndimage's generic_filter that supports vector output? I did not manage to make scipy.ndimage.filters.generic_filter return more than a scalar. Uncomment the line in the code below to get the error: TypeError: only length-1 arrays can be converted to Python scalars.
I'm looking for a generic filter that process 2D or 3D arrays and returns a vector at each point. Thus the output would have one added dimension. For the example below I'd expect something like this:
m.shape # (10,10)
res.shape # (10,10,2)
Example Code
import numpy as np
from scipy import ndimage
a = np.ones((10, 10)) * np.arange(10)
footprint = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])
def myfunc(x):
    r = sum(x)
    #r = np.array([1, 1])  # uncomment this
    return r
res = ndimage.generic_filter(a, myfunc, footprint=footprint)
The generic_filter expects myfunc to return a scalar, never a vector.
However, there is nothing that precludes myfunc from also adding information
to, say, a list which is passed to myfunc as an extra argument.
Instead of using the array returned by generic_filter, we can generate our vector-valued array by reshaping this list.
For example,
import numpy as np
from scipy import ndimage
a = np.ones((10, 10)) * np.arange(10)
footprint = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])
ndim = 2
def myfunc(x, out):
    r = np.arange(ndim, dtype='float64')
    out.extend(r)
    return 0
result = []
ndimage.generic_filter(
    a, myfunc, footprint=footprint, extra_arguments=(result,))
result = np.array(result).reshape(a.shape + (ndim,))
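For the example above, result.shape comes out as (10, 10, 2), matching the requested output; note the reshape relies on generic_filter visiting the points in row-major order.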
I think I get what you're asking, but I'm not completely sure how ndimage.generic_filter works (how abstruse the source is!).
Here's just a simple wrapper function. It takes in an array plus all the parameters ndimage.generic_filter needs, and returns an array where each element of the former array is now represented by an array of shape (2,): the original value comes first, and the result of the filter function is stored as the second element.
def generic_expand_filter(inarr, func, **kwargs):
    shape = inarr.shape
    res = np.empty(shape + (2,))
    temp = ndimage.generic_filter(inarr, func, **kwargs)
    for row in range(shape[0]):
        for val in range(shape[1]):
            res[row][val][0] = inarr[row][val]
            res[row][val][1] = temp[row][val]
    return res
Output, where res denotes the plain generic_filter result and res2 the generic_expand_filter result:
>>> a.shape #same as res.shape
(10, 10)
>>> res2.shape
(10, 10, 2)
>>> a[0]
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> res[0]
array([ 3., 8., 16., 24., 32., 40., 48., 56., 64., 69.])
>>> print(*res2[0], sep=", ") #this is just to avoid the vertical default output
[ 0. 3.], [ 1. 8.], [ 2. 16.], [ 3. 24.], [ 4. 32.], [ 5. 40.], [ 6. 48.], [ 7. 56.], [ 8. 64.], [ 9. 69.]
>>> a[0][0]
0.0
>>> res[0][0]
3.0
>>> res2[0][0]
array([ 0., 3.])
Of course you probably don't want to save the old array, but instead have both fields as new results. Except I don't know what exactly you had in mind; if the two values you want stored are unrelated, just add a temp2 and func2, call another generic_filter with the same **kwargs, and store that as the first value.
However, if you want an actual vector quantity that is calculated using multiple inarr elements, meaning that the two newly created fields aren't independent, you are just going to have to write that kind of function: one that takes in an array plus idx, idy indices and returns a tuple/list/array value which you can then unpack and assign to the result.
I'm trying to iterate an array of values generated with numpy.linspace:
slX = numpy.linspace(obsvX, flightX, numSPts)
slY = np.linspace(obsvY, flightY, numSPts)
for index, point in slX:
    yPoint = slY[index]
    arcpy.AddMessage(yPoint)
This code worked fine on my office computer, but I sat down this morning to work from home on a different machine and this error came up:
File "C:\temp\gssm_arcpy.1.0.3.py", line 147, in AnalyzeSightLine
for index,point in slX:
TypeError: 'numpy.float64' object is not iterable
slX is just an array of floats, and the script has no problem printing the contents -- it just apparently cannot iterate through them. Any suggestions for what is causing it to break, and possible fixes?
numpy.linspace() gives you a one-dimensional NumPy array. For example:
>>> my_array = numpy.linspace(1, 10, 10)
>>> my_array
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
Therefore:
for index, point in my_array
cannot work. You would need some kind of two-dimensional array with two
elements in the second dimension:
>>> two_d = numpy.array([[1, 2], [4, 5]])
>>> two_d
array([[1, 2], [4, 5]])
Now you can do this:
>>> for x, y in two_d:
...     print(x, y)
1 2
4 5
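Applied to the original snippet, pairing the two arrays explicitly with enumerate or zip gives the iteration presumably intended; a sketch:
for index, xPoint in enumerate(slX):
    yPoint = slY[index]
    arcpy.AddMessage(yPoint)

# or, more directly:
for xPoint, yPoint in zip(slX, slY):
    arcpy.AddMessage(yPoint)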
Given a NumPy array of int32, how do I convert it to float32 in place? So basically, I would like to do
a = a.astype(numpy.float32)
without copying the array. It is big.
The reason for doing this is that I have two algorithms for the computation of a. One of them returns an array of int32, the other returns an array of float32 (and this is inherent to the two different algorithms). All further computations assume that a is an array of float32.
Currently I do the conversion in a C function called via ctypes. Is there a way to do this in Python?
Update: This function only avoids a copy if it can, hence this is not the correct answer to this question. unutbu's answer is the right one.
a = a.astype(numpy.float32, copy=False)
numpy's astype has a copy flag. Why shouldn't we use it?
You can make a view with a different dtype, and then copy in-place into the view:
import numpy as np
x = np.arange(10, dtype='int32')
y = x.view('float32')
y[:] = x
print(y)
yields
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.], dtype=float32)
To show the conversion was in-place, note that copying from x to y altered x:
print(x)
prints
array([ 0, 1065353216, 1073741824, 1077936128, 1082130432,
1084227584, 1086324736, 1088421888, 1090519040, 1091567616])
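Note that this trick relies on int32 and float32 having the same itemsize: a view with a dtype of a different width would change the length of the last axis (or raise an error) rather than reinterpret element for element.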
You can change the array type without converting like this:
a.dtype = numpy.float32
but first you have to change all the integers to something that will be interpreted as the corresponding float. A very slow way to do this would be to use python's struct module like this:
import struct

def toi(i):
    return struct.unpack('i', struct.pack('f', float(i)))[0]
...applied to each member of your array.
But perhaps a faster way would be to utilize numpy's ctypeslib tools (which I am unfamiliar with)
- edit -
Since ctypeslib doesn't seem to work, I would proceed with the conversion using the typical numpy.astype method, but in block sizes that stay within your memory limits:
a[0:10000] = a[0:10000].astype('float32').view('int32')
...then change the dtype when done.
Here is a function that accomplishes the task for any compatible dtypes (only works for dtypes with same-sized items) and handles arbitrarily-shaped arrays with user-control over block size:
import numpy

def astype_inplace(a, dtype, blocksize=10000):
    oldtype = a.dtype
    newtype = numpy.dtype(dtype)
    # only dtypes with the same itemsize can share the buffer
    assert oldtype.itemsize == newtype.itemsize
    for idx in range(0, a.size, blocksize):
        a.flat[idx:idx + blocksize] = \
            a.flat[idx:idx + blocksize].astype(newtype).view(oldtype)
    a.dtype = newtype

# int32 so the itemsize matches float32's
a = numpy.random.randint(100, size=100, dtype=numpy.int32).reshape((10, 10))
print(a)
astype_inplace(a, 'float32')
print(a)
Time spent reading data
t1=time.time() ; V=np.load ('udata.npy');t2=time.time()-t1 ; print( t2 )
95.7923333644867
V.dtype
dtype('>f8')
V.shape
(3072, 1024, 4096)
Creating new array
t1=time.time() ; V64=np.array( V, dtype=np.double); t2=time.time()-t1 ; print( t2 )
1291.669689655304
Simple in-place numpy conversion
t1=time.time() ; V64=np.array( V, dtype=np.double); t2=time.time()-t1 ; print( t2 )
205.64322113990784
Using astype
t1=time.time() ; V = V.astype(np.double) ; t2=time.time()-t1 ; print( t2 )
400.6731758117676
Using view
t1=time.time() ; x=V.view(np.double);V[:,:,:]=x ;t2=time.time()-t1 ; print( t2 )
556.5982494354248
Note that each time I cleared the variables. Thus simply letting numpy handle the conversion is the most efficient.
import numpy as np
arr_float = np.arange(10, dtype=np.float32)
arr_int = arr_float.view(np.int32)  # reinterpret the same buffer as int32
Use view() and the 'dtype' parameter to change how the array is interpreted in place.
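Be aware that view reinterprets the underlying bits rather than converting the values, so on its own this does not perform the conversion asked about; it only helps when combined with an in-place copy, as in the y[:] = x answer above.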
Use this:
In [105]: a
Out[105]:
array([[15, 30, 88, 31, 33],
[53, 38, 54, 47, 56],
[67, 2, 74, 10, 16],
[86, 33, 15, 51, 32],
[32, 47, 76, 15, 81]], dtype=int32)
In [106]: np.float32(a)
Out[106]:
array([[ 15., 30., 88., 31., 33.],
[ 53., 38., 54., 47., 56.],
[ 67., 2., 74., 10., 16.],
[ 86., 33., 15., 51., 32.],
[ 32., 47., 76., 15., 81.]], dtype=float32)
Another option is to let a ufunc produce the float32 result during a no-op arithmetic pass:
a = np.subtract(a, 0., dtype=np.float32)
In pure Python you can grow matrices column by column pretty easily:
data = []
for i in something:
    newColumn = getColumnDataAsList(i)
    data.append(newColumn)
NumPy's array doesn't have the append function. The hstack function doesn't work on zero sized arrays, thus the following won't work:
data = numpy.array([])
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data = numpy.hstack((data, newColumn))  # ValueError: arrays must have same number of dimensions
So, my options are either to handle the initialization inside the loop with an appropriate condition:
data = None
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    if data is None:
        data = newColumn
    else:
        data = numpy.hstack((data, newColumn))  # works
... or to use a Python list and convert it to an array later:
data = []
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data.append(newColumn)
data = numpy.array(data)
Both variants seem a little bit awkward to me. Are there nicer solutions?
NumPy actually does have an append function, which it seems might do what you want, e.g.,
import numpy as NP
my_data = NP.random.random_integers(0, 9, 9).reshape(3, 3)
new_col = NP.array((5, 5, 5)).reshape(3, 1)
res = NP.append(my_data, new_col, axis=1)
Your second snippet (hstack) will work if you add another line, e.g.,
my_data = NP.random.random_integers(0, 9, 16).reshape(4, 4)
# the line to add--does not depend on array dimensions
new_col = NP.zeros_like(my_data[:,-1]).reshape(-1, 1)
res = NP.hstack((my_data, new_col))
hstack gives the same result as concatenate((my_data, new_col), axis=1); I'm not sure how they compare performance-wise.
While that's the most direct answer to your question, I should mention that looping through a data source to populate a target via append, while just fine in Python, is not idiomatic NumPy. Here's why:
Initializing a NumPy array is relatively expensive, and with this conventional Python pattern, you incur that cost, more or less, at each loop iteration (i.e., each append to a NumPy array is roughly like initializing a new array with a different size).
For that reason, the common pattern in NumPy for iterative addition of columns to a 2D array is to initialize an empty target array once (or pre-allocate a single 2D NumPy array having all of the empty columns), then successively populate those empty columns by setting the desired column-wise offset (index) -- much easier to show than to explain:
>>> # initialize your skeleton array using 'empty' for lowest-memory footprint
>>> M = NP.empty(shape=(10, 5), dtype=float)
>>> # create a small function to mimic step-wise populating this empty 2D array:
>>> fnx = lambda v : NP.random.randint(0, 10, v)
>>> # populate the NumPy array as in the OP, except each iteration just
>>> # re-sets the values of M at successive column-wise offsets
>>> for index, itm in enumerate(range(5)):
...     M[:, index] = fnx(10)
>>> M
array([[ 1., 7., 0., 8., 7.],
[ 9., 0., 6., 9., 4.],
[ 2., 3., 6., 3., 4.],
[ 3., 4., 1., 0., 5.],
[ 2., 3., 5., 3., 0.],
[ 4., 6., 5., 6., 2.],
[ 0., 6., 1., 6., 8.],
[ 3., 8., 0., 8., 0.],
[ 5., 2., 5., 0., 1.],
[ 0., 6., 5., 9., 1.]])
>>> # of course if you don't know in advance what size your array should be,
>>> # just create one much bigger than you need and trim the 'unused' portions
>>> # when you finish populating it
>>> M[:3,:3]
array([[ 9., 3., 1.],
[ 9., 6., 8.],
[ 9., 7., 5.]])
Usually you don't keep resizing a NumPy array when you create it. What don't you like about your third solution? If it's a very large matrix/array, then it might be worth allocating the array before you start assigning its values:
x = len(something)
y = getColumnDataAsNumpyArray.someLengthProperty
data = numpy.zeros((x, y))
for i in something:
    data[i] = getColumnDataAsNumpyArray(i)
hstack can work on zero-sized arrays:
import numpy as np
N = 5
M = 15
a = np.ndarray(shape=(N, 0))
for i in range(M):
    b = np.random.rand(N, 1)
    a = np.hstack((a, b))
Generally it is expensive to keep reallocating the NumPy array -- so your third solution is really the best performance-wise.
However, I think hstack will do what you want -- the clue is in the error message,
ValueError: arrays must have same number of dimensions
I'm guessing that newColumn has two dimensions (rather than being a 1D vector), so you need data to also have two dimensions, for example data = np.array([[]]); or alternatively make newColumn a 1D vector (generally, if things are 1D it is better to keep them 1D in NumPy so that broadcasting etc. work better), in which case use np.squeeze(newColumn), and hstack or vstack should work with your original definition of data.
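A minimal sketch of that squeeze variant (with a hypothetical getColumnDataAsNumpyArray returning (n, 1) columns; note the result is a flat 1D array rather than a matrix):
data = numpy.array([])
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data = numpy.hstack((data, numpy.squeeze(newColumn)))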