Load data from a generator into an already allocated numpy array

I have a large array
data = np.empty((n, k))
where both n and k are large. I also have a lot of generators g, each with k elements, and I want to load each generator into a row in data. I can do:
data[i] = list(g)
or something similar, but this makes a copy of the data in g. I can load with a for loop:
for j, x in enumerate(g):
    data[i, j] = x
but I'm wondering if numpy has a way to do this already without copying or looping in Python.
I know in advance that each g has length k, and I'm happy to do some __len__ subclass patching if necessary. np.fromiter will accept something like that when creating a new array, but I'd rather load into this already existing array if possible, due to the constraints of my context.

There's not much you can do, as stated in the comments, although you can consider these two solutions:
Using numpy.fromiter
Instead of creating data = np.empty((n, k)) yourself, use numpy.fromiter with the count argument, which is made specifically for this case where you know the number of items in advance. This way numpy won't have to "guess" the size and re-allocate until the guess is large enough.
Using fromiter lets the for loop run in C instead of Python. This might be a tiny bit faster, but the real bottleneck will likely be in your generators anyway.
Note that fromiter only deals with flat (1-D) arrays, so you need to read everything flattened (e.g. using chain.from_iterable) and only then call reshape:
from itertools import chain
import numpy

n = 20
k = 4
generators = (
    (i * j for j in range(k))
    for i in range(n)
)
flat_gen = chain.from_iterable(generators)
data = numpy.fromiter(flat_gen, 'int64', count=n*k)
data = data.reshape((n, k))
"""
array([[ 0, 0, 0, 0],
[ 0, 1, 2, 3],
[ 0, 2, 4, 6],
[ 0, 3, 6, 9],
[ 0, 4, 8, 12],
[ 0, 5, 10, 15],
[ 0, 6, 12, 18],
[ 0, 7, 14, 21],
[ 0, 8, 16, 24],
[ 0, 9, 18, 27],
[ 0, 10, 20, 30],
[ 0, 11, 22, 33],
[ 0, 12, 24, 36],
[ 0, 13, 26, 39],
[ 0, 14, 28, 42],
[ 0, 15, 30, 45],
[ 0, 16, 32, 48],
[ 0, 17, 34, 51],
[ 0, 18, 36, 54],
[ 0, 19, 38, 57]])
"""
Using Cython
If you want to re-use data and avoid re-allocating the memory, you can't use numpy's fromiter anymore. IMHO the only way to avoid Python's for loop is then to implement it in Cython. Again, this is almost certainly overkill, since you still have to read the generators from Python.
For reference, the C implementation of fromiter looks like this: https://github.com/numpy/numpy/blob/v1.18.3/numpy/core/src/multiarray/ctors.c#L4001-L4118
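If you want to keep reusing the preallocated data without Cython, a middle-ground sketch (my addition, not from the answer above; it still allocates one temporary row at a time, so it is not zero-copy) is to call fromiter once per row, so that at least the element loop runs in C:
import numpy as np

n, k = 20, 4
data = np.empty((n, k))
generators = ((i * j for j in range(k)) for i in range(n))
for i, g in enumerate(generators):
    # fromiter builds a temporary 1-D row in C, then the assignment
    # copies it into the existing buffer
    data[i] = np.fromiter(g, dtype=data.dtype, count=k)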

There is no faster way than the ones you described. You have to fill in each element of the numpy array one way or another, either by iterating over the generator or by building the entire list first.

A couple of things here:
1) You can just say
for whatever in g:
    do_stuff
since g is a generator; the for loop understands how to get the data out of the generator.
2) You won't necessarily have to "copy" out of the generator (since, by design, it doesn't have the entire sequence loaded in memory), but you will need to loop through it to fill up your numpy data structure. You might be able to squeeze out some performance (since your structures are large) with tools in numpy or itertools.
So the answer is "no", since you're using generators. If you don't need all of the data available at once, you can use generators to keep the memory profile small, but I don't have any context for what you are doing with the data.

Related

This particular way of using .map() in python

I was reading an article and came across the piece of code given below. I ran it and it worked for me:
x = df.columns
x_labels = [v for v in sorted(x.unique())]
x_to_num = {p[1]:p[0] for p in enumerate(x_labels)}
#till here it is okay. But I don't understand what is going on with this map.
x.map(x_to_num)
The final result from the map is given below:
Int64Index([ 0,  3, 28,  1, 26, 23, 27, 22, 20, 21, 24, 18, 10,  7,  8, 15, 19,
            13, 14, 17, 25, 16,  9, 11,  6, 12,  5,  2,  4],
           dtype='int64')
Can someone please explain to me how .map() worked here? I searched online but could not find anything related.
PS: df is a pandas DataFrame.
Let's look at what the map() function in general does in Python.
>>> l = [1, 2, 3]
>>> list(map(str, l))
# ['1', '2', '3']
Here a list of numeric elements is converted into a list of string elements.
So whatever function we are trying to apply using map needs an iterable to work on.
You probably got confused because the general syntax of map (map(MappingFunction, IterableObject)) is not used here and things still work.
The variable x plays the role of the IterableObject, while the dictionary x_to_num takes the role of the MappingFunction: unlike the builtin map, pandas' Index.map also accepts a dict and looks each element up as a key.
Edit: note that accepting a dict as the mapper is a pandas convenience; the builtin map only takes a callable, so this particular pattern does rely on x being a pandas Index.
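A minimal, self-contained sketch of the same mechanism, with made-up labels since the original df isn't shown:
import pandas as pd

x = pd.Index(['b', 'c', 'a'])
x_labels = [v for v in sorted(x.unique())]            # ['a', 'b', 'c']
x_to_num = {p[1]: p[0] for p in enumerate(x_labels)}  # {'a': 0, 'b': 1, 'c': 2}
print(x.map(x_to_num))   # Index([1, 2, 0]) (Int64Index on older pandas)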

Raise Elements of Array to Series of Exponents

Suppose I have a numpy array such as:
a = np.arange(9)
>> array([0, 1, 2, 3, 4, 5, 6, 7, 8])
If I want to raise each element to succeeding powers of two, I can do it this way:
power_2 = np.power(a,2)
power_4 = np.power(a,4)
Then I can combine the arrays by:
np.c_[power_2, power_4]
>> array([[   0,    0],
          [   1,    1],
          [   4,   16],
          [   9,   81],
          [  16,  256],
          [  25,  625],
          [  36, 1296],
          [  49, 2401],
          [  64, 4096]])
What's an efficient way to do this if I don't know the degree of the even monomial (highest multiple of 2) in advance?
One thing to observe is that x^(2^n) = (...(((x^2)^2)^2)...^2)
meaning that you can compute each column from the previous by taking the square.
If you know the number of columns in advance you can do something like:
import functools as ft
import numpy as np

a = np.arange(5)
n = 4
out = np.empty((*a.shape, n), a.dtype)
out[:, 0] = a
# Note: this works by side effect!
# The optional second argument of np.square is "out", i.e. an
# array to write the result to (nonetheless the result is also
# returned directly)
ft.reduce(np.square, out.T)
out
# array([[    0,     0,     0,     0],
#        [    1,     1,     1,     1],
#        [    2,     4,    16,   256],
#        [    3,     9,    81,  6561],
#        [    4,    16,   256, 65536]])
If the number of columns is not known in advance, then the most efficient method is to keep a list of columns, append to it as needed, and only at the end use np.column_stack or np.c_ (if using np.c_, do not forget to cast the list to a tuple first); see the sketch below.
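For instance, a sketch of the unknown-width case, with a placeholder stopping rule (four columns here) standing in for whatever criterion you actually have:
import numpy as np

a = np.arange(5, dtype=np.int64)
cols = [a]
while len(cols) < 4:                  # placeholder: substitute your real condition
    cols.append(np.square(cols[-1]))  # each new column is the square of the last
out = np.column_stack(cols)           # same result as the fixed-width version above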
The straightforward approach is:
exponents = [2**n for n in a]
[a**e for e in exponents]
This works fine for relatively small numbers, but on larger numbers I see what looks like numerical overflow. (Although I can compute those high powers just fine using Python scalars.)
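The overflow is most likely the fixed-width integer dtype (an assumption, since the dtype isn't shown): numpy integers silently wrap around, whereas Python scalars are arbitrary-precision ints. Casting to object dtype makes numpy use Python ints and shows the difference:
import numpy as np

a = np.arange(9)
print(a ** 32)                  # int64: silently wraps for the larger bases
print(a.astype(object) ** 32)   # Python ints: exact, arbitrary precision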
The most elegant way I could think of is to not calculate the exponents beforehand. Since your exponents follow a very easy pattern, you can express everything using one list comprehension (note the parentheses around the exponent; the pattern assumed here is exponents 0, 2, 4, ...):
result = [item ** (2 * index) for index, item in enumerate(a)]
If you are working with quite large datasets, this will cause some serious overhead, since this statement does all calculations immediately and saves every computed element in one large list. To mitigate this, you could use a generator expression instead, which generates the data on the fly.
result = (item ** (2 * index) for index, item in enumerate(a))
See here for more details.

vectorize a function on 2 python objects

I would like to vectorize a function that takes 2 objects as argument such that it takes 2 ndarrays (of length m and n) and returns a matrix of shape (m x n).
Kinda like a tensor product.
I've tried to use numpy.vectorize without much success:
vFunc = np.vectorize(myFunc)
arg1 = np.asmatrix(a)
arg2 = np.transpose(np.asmatrix(b))
test = vFunc(arg1,arg2)
The above doesn't work, so for now I have to iterate over one of the arrays, which is an ugly solution. How do I fix this?
vFunc = np.vectorize(myFunc)
arg1 = np.asmatrix(a)
arg2 = np.transpose(np.asmatrix(b))
for i in range(arg1.size):
    cMat[i, :] = vFunc(arg1[i], arg2)
This is the basic vectorize setup:
In [420]: def myfunc(x,y):
...: return 10*x + y
...:
In [421]: f = np.vectorize(myfunc)
In [422]: f(np.arange(4), np.arange(3)[:,None])
Out[422]:
array([[ 0, 10, 20, 30],
       [ 1, 11, 21, 31],
       [ 2, 12, 22, 32]])
How is your case different? Don't just say 'it doesn't work'!
With this particular function, I don't even need vectorize:
In [423]: myfunc(np.arange(4), np.arange(3)[:,None])
Out[423]:
array([[ 0, 10, 20, 30],
       [ 1, 11, 21, 31],
       [ 2, 12, 22, 32]])
The operations within myfunc already work fine with broadcasting.
myfunc(np.asmatrix(np.arange(4)), np.asmatrix(np.arange(3)).T) also works, but the conversion to matrix isn't needed, and np.matrix is generally discouraged.
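If the function truly can't be expressed with broadcasting-friendly operations, another option (my addition, not from the answer above) is np.frompyfunc; the resulting ufunc, like any binary ufunc, has an .outer method that produces the (m x n) grid directly:
import numpy as np

def myfunc(x, y):
    return 10 * x + y

a = np.arange(4)
b = np.arange(3)
uf = np.frompyfunc(myfunc, 2, 1)   # 2 inputs, 1 output
out = uf.outer(b, a)               # shape (3, 4), object dtype
out = out.astype(np.int64)         # cast back if the results are numeric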

Best practices with reading and operating on fortran ordered arrays with numpy

I'm reading ascii and binary files that all specify 3 dimensional arrays in fortran order. I want to perform some arbitrary manipulations on these arrays, then export them to the same ascii or binary format.
I'm confused on the best ways to deal with these arrays in my library. My current design seems prone to error because I have to keep reshaping things from the default C order if any new array is created.
Current design:
I have a few functions that read these files and return numpy arrays. The read functions all behave in a similar way and essentially read in the data and return something like:
return array.reshape((i, j, k), order='F')
The way I understand it, I'm returning a view for fortran order onto the original array.
My code assumes all the arrays are in fortran order. This means any new operations that might create a new array I make sure to use reshape to convert it back to fortran order.
This seems very error-prone because I have to pay close attention to any operation that creates a new array and make sure to reshape it into fortran order since the default is usually C order.
I later might have to export these arrays to binary or ascii again and need to maintain the fortran ordering. So, I use numpy.nditer to write each element out in the fortran order.
Concerns:
The current approach seems very error-prone since I typically think in C order. I'm afraid that I'll always be getting bitten by missing calls to reshape that forces things in C order.
I'd like to not have to worry about the ordering of the array elements except when reading the input files or writing the data to the output files.
The current approach seems messy because the indexes can be interpreted different ways and things can get confusing.
When dealing with fortran arrays, the tuple ordering for indexes is backwards, right?
So, x[(1, 2, 3)] for a fortran array means k = 1, j = 2, and i = 3, whereas x[(1, 2, 3)] for a C order array means k = 3, j = 2, i = 1, correct?
This means that my library's users and I must always think of indexes in (k, j, i) order, not in the (i, j, k) order we C/Python programmers typically think in.
Question:
Is there a best practice for doing this type of thing? In an ideal world I'd like to read in the fortran ordered arrays, then forget about ordering until I export to a file. However, I'm afraid I'll keep misinterpreting the indexes, etc.
I've read through the only numpy documentation on this that I can find, http://docs.scipy.org/doc/numpy/reference/internals.html#multidimensional-array-indexing-order-issues, but the concept still seems as clear as mud to me. Maybe I just need a different explanation of what the numpy docs describe.
Numpy abstracts away the difference between Fortran ordering and C-ordering at the python level. (In fact, you can even have other orderings for >2d arrays with numpy. They're all treated the same at the python level.)
The only time you'll need to worry about C vs F ordering is when you're reading/writing to disk or passing the array to lower-level functions.
A Simple Example
As an example, let's make a simple 3D array in both C order and Fortran order:
In [1]: import numpy as np
In [2]: c_order = np.arange(27).reshape(3,3,3)
In [3]: f_order = c_order.copy(order='F')
In [4]: c_order
Out[4]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
In [5]: f_order
Out[5]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
Notice that they both look identical (and they are, at the level we're interacting with them). How can you tell that they're in different orderings? First off, let's take a look at the flags (pay attention to C_CONTIGUOUS vs F_CONTIGUOUS):
In [6]: c_order.flags
Out[6]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [7]: f_order.flags
Out[7]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
And if you don't trust the flags, you can effectively view the memory order by looking at arr.ravel(order='K'). The order='K' is important. Otherwise, when you call arr.ravel() the output will be in C-order regardless of the memory layout of the array. order='K' uses the memory layout.
In [8]: c_order.ravel(order='K')
Out[8]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26])
In [9]: f_order.ravel(order='K')
Out[9]:
array([ 0,  9, 18,  3, 12, 21,  6, 15, 24,  1, 10, 19,  4, 13, 22,  7, 16,
       25,  2, 11, 20,  5, 14, 23,  8, 17, 26])
The difference is actually represented (and stored) in the strides of the array. Notice that c_order's strides are (72, 24, 8), while f_order's strides are (8, 24, 72).
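Continuing the session, you can check the strides directly (the values assume the default 8-byte integer dtype):
c_order.strides   # (72, 24, 8)
f_order.strides   # (8, 24, 72)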
Just to prove that the indexing works the same way:
In [10]: c_order[0,1,2]
Out[10]: 5
In [11]: f_order[0,1,2]
Out[11]: 5
Reading and Writing
The main place where you'll run into problems with this is when you're reading from or writing to disk. Many file formats expect a particular ordering. I'm guessing that you're working with seismic data formats, and most of them (e.g. Geoprobe .vol's, and I think Petrel's volume format as well) essentially write a binary header and then a Fortran-ordered 3D array to disk.
With that in mind, I'll use a small seismic cube (snippet of some data from my dissertation) as an example.
Both of these are binary arrays of uint8s with a shape of 50x100x198. One is in C-order (c_order.dat), while the other is in Fortran-order (f_order.dat).
To read them in:
import numpy as np
shape = (50, 100, 198)
c_order = np.fromfile('c_order.dat', dtype=np.uint8).reshape(shape)
f_order = np.fromfile('f_order.dat', dtype=np.uint8).reshape(shape, order='F')
assert np.all(c_order == f_order)
Notice that the only difference is specifying the memory layout to reshape. The memory layout of the two arrays is still different (reshape doesn't make a copy), but they're treated identically at the python level.
Just to prove that the files really are written in a different order:
In [1]: np.fromfile('c_order.dat', dtype=np.uint8)[:10]
Out[1]: array([132, 142, 107, 204, 37, 37, 217, 37, 82, 60], dtype=uint8)
In [2]: np.fromfile('f_order.dat', dtype=np.uint8)[:10]
Out[2]: array([132, 129, 140, 138, 110, 88, 110, 124, 142, 139], dtype=uint8)
Let's visualize the result:
import matplotlib.pyplot as plt

def plot(data):
    fig, axes = plt.subplots(ncols=3)
    for i, ax in enumerate(axes):
        slices = [slice(None), slice(None), slice(None)]
        slices[i] = data.shape[i] // 2
        ax.imshow(data[tuple(slices)].T, cmap='gray_r')
    return fig

plot(c_order).suptitle('C-ordered array')
plot(f_order).suptitle('F-ordered array')
plt.show()
Notice that we indexed them the same way, and they're displayed identically.
Common Mistakes with IO
First off, let's try reading in the Fortran-ordered file as if it were C-ordered and then take a look at the result (using the plot function above):
wrong_order = np.fromfile('f_order.dat', dtype=np.uint8).reshape(shape)
plot(wrong_order)
Not so good!
You mentioned that you're having to use "reversed" indices. This is probably because you fixed what happened in the figure above by doing something like the following (note the reversed shape!):
c_order = np.fromfile('c_order.dat', dtype=np.uint8).reshape([50,100,198])
rev_f_order = np.fromfile('f_order.dat', dtype=np.uint8).reshape([198,100,50])
Let's visualize what happens:
plot(c_order).suptitle('C-ordered array')
plot(rev_f_order).suptitle('Incorrectly read Fortran-ordered array')
Note that the image on the far right (the timeslice) of the first plot matches a transposed version of the image on the far left of the second.
Similarly, print rev_f_order[1,2,3] and print c_order[3,2,1] both yield 140, while indexing them the same way gives a different result.
Basically, this is where your reversed indices come from. Numpy thinks it's a C-ordered array with a different shape. Notice if we look at the flags, they're both C-contiguous in memory:
In [24]: rev_f_order.flags
Out[24]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [25]: c_order.flags
Out[25]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
This is because a fortran-ordered array is equivalent to a C-ordered array with the reverse shape.
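Concretely, using the same (hypothetical) file and shape as above, these two reads produce identical arrays:
f1 = np.fromfile('f_order.dat', dtype=np.uint8).reshape(shape, order='F')
f2 = np.fromfile('f_order.dat', dtype=np.uint8).reshape(shape[::-1]).T
assert np.all(f1 == f2)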
Writing to Disk in Fortran-Order
There's an additional wrinkle when writing a numpy array to disk in Fortran-order.
Unless you specify otherwise, the array will be written in C-order regardless of its memory layout! (There's a clear note about this in the documentation for ndarray.tofile, but it's a common gotcha. The opposite behavior would be incorrect, though, IMO.)
Therefore, regardless of the memory layout of an array, to write it to disk in Fortran order, you need to do:
arr.ravel(order='F').tofile('output.dat')
If you're writing it as ascii, the same applies. Use ravel(order='F') and then write out the 1-dimensional result.
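Putting it together, a minimal round-trip sketch (made-up array and filename):
import numpy as np

arr = np.arange(24, dtype=np.int32).reshape(2, 3, 4)
arr.ravel(order='F').tofile('output.dat')   # Fortran-ordered bytes on disk
back = np.fromfile('output.dat', dtype=np.int32).reshape(arr.shape, order='F')
assert np.all(arr == back)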

Create larger list from an existing list using a list comprehension or map()

I am trying to generate a list that indexes coordinates (x, y and z), given a set of atom indices. My problem is quite simply how to elegantly go from this list:
atom_indices = [0, 4, 5, 8]
To this list:
coord_indices = [0, 1, 2, 12, 13, 14, 15, 16, 17, 24, 25, 26]
The easiest to read/understand way of doing this I've thought of so far is simply:
coord_indices = []
for atom in atom_indices:
    coord_indices += [3 * atom,
                      3 * atom + 1,
                      3 * atom + 2]
But this doesn't seem very Pythonic. Is there a better way I haven't thought of without getting a list of lists or a list of tuples?
How about:
>>> atom_indices = [0, 4, 5, 8]
>>> coords = [3*a+k for a in atom_indices for k in range(3)]
>>> coords
[0, 1, 2, 12, 13, 14, 15, 16, 17, 24, 25, 26]
We can nest loops in list comprehensions in the same order we'd write the loops, i.e. this is basically
coords = []
for a in atom_indices:
    for k in range(3):
        coords.append(3*a + k)
Don't be afraid of for loops, though, if they're clearer in the situation. For reasons I've never fully understood, some people feel like they're being more clever when they write code horizontally instead of vertically, even though it makes it harder to debug.
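Since the rest of this page leans on numpy anyway, here's a broadcasting variant as well (my addition, not from the original answer):
import numpy as np

atom_indices = np.array([0, 4, 5, 8])
coord_indices = (3 * atom_indices[:, None] + np.arange(3)).ravel()
# array([ 0,  1,  2, 12, 13, 14, 15, 16, 17, 24, 25, 26])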
