numpy style: which is preferred, array.T[x] or array[:,0]?

What is the preferred way to extract a column of data in numpy?
array[:,x]
or
array.T[x]
I find that an array with the fields along the rows and the data in columns is cleaner to manipulate in numpy:
array[x]
gets a whole series for one variable, as opposed to the options above.
But having variables ordered by column is the standard file format.
Any preferences as to what is the easiest way to work with the data?
Should I transpose all my data when I read it in and then transpose again when I output?

You should prefer slicing [:, x], for several reasons:
It is faster. Note that .T returns a view rather than copying the data, but constructing that view and then indexing into it still adds overhead. Tested in Python 3.5.1, NumPy 1.11.0:
>>> timeit.timeit('A[:,568]', setup = 'import numpy as np\nA = np.random.uniform(size=(1000,1000))')
0.21135332298581488
>>> timeit.timeit('A.T[568]', setup = 'import numpy as np\nA = np.random.uniform(size=(1000,1000))')
0.3025632489880081
It generalizes in a straightforward way to higher-dimensional arrays, like A[3, :, 4].
It reflects the NumPy way of thinking of arrays as multidimensional objects, rather than lists of lists (of lists).
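A quick illustration of the last two points (a minimal sketch with arbitrary shapes):
import numpy as np

A = np.zeros((2, 3, 4))
print(A[:, 1, :].shape)   # (2, 4): fix the middle axis
print(A[1, :, 2].shape)   # (3,): fix the first and last axes
print(A.T.base is A)      # True: .T is just a view, no data is copied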

Related

Slicing a 2D numpy array using vectors for start-stop indices

First post here, so please go easy on me. :)
I want to vectorize the following:
import numpy as np

N = 1000                                   # number of blocks
rowStart = np.random.randint(0, 509, N)    # array of length N (dummy starts)
rowStop = rowStart + 4
colStart = np.random.randint(0, 509, N)    # array of length N (dummy starts)
colStop = colStart + 4
x = np.random.rand(512, 512)               # dummy test array
output = np.zeros([N, 4, 4])
for i in range(N):
    output[i, :, :] = x[rowStart[i]:rowStop[i], colStart[i]:colStop[i]]
What I'd like to be able to do is something like:
output=x[rowStart:rowStop, colStart:colStop ]
where numpy recognizes that the slicing indices are vectors and broadcasts the slicing. I understand that this probably doesn't work because, while I know that my slice output is always the same size, numpy doesn't.
I've looked at various approaches, including "fancy" or "advanced" indexing (which seems to work for indexing, not slicing), massive boolean indexing using meshgrids (not practical from a memory standpoint, as my N can get to 50k-100k), and np.take, which just seems to be another way of doing fancy/advanced indexing.
I could see how I could potentially use fancy/advanced indexing if I could get an array that looks like:
[np.arange(rowStart[0],rowStop[0]),
np.arange(rowStart[1],rowStop[1]),
...,
np.arange(rowStart[N-1],rowStop[N-1])]
and a similar one for columns, but I'm also having trouble figuring out a vectorized approach for creating that.
I'd appreciate any advice you can provide.
Thanks!
We can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get all sliding windows and then pick out the ones we need.
from skimage.util.shape import view_as_windows

BSZ = (4, 4)                   # block size
w = view_as_windows(x, BSZ)    # shape (509, 509, 4, 4): every 4x4 window, as views
out = w[rowStart, colStart]    # fancy-index the N wanted windows -> (N, 4, 4)
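For completeness, the index arrays the question sketches can also be built without a loop by broadcasting. A pure-NumPy sketch, reusing the question's rowStart, colStart, and x:
offsets = np.arange(4)
rows = rowStart[:, None] + offsets            # shape (N, 4)
cols = colStart[:, None] + offsets            # shape (N, 4)
out = x[rows[:, :, None], cols[:, None, :]]   # broadcasts to shape (N, 4, 4)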

Dask element-wise string concatenation

I need to create a multi-index for dask by concatenating two arrays (preferably dask arrays). I found the following solution for numpy, but I'm looking for a dask solution:
import numpy as np

cols = 100000
index = np.array([x1 + x2 + x3 for x1, x2, x3 in
                  zip(np.repeat(1, cols).astype('str'),
                      np.repeat('-', cols),
                      np.repeat(1, cols).astype('str'))])
If I pass it to da.from_array(), it balks at the + of two arrays.
I have also tried np.core.defchararray.add(); this works, but it converts the dask arrays to numpy arrays (as far as I can tell).
You might want to try da.map_blocks. You can write a numpy function that does whatever you want, and da.map_blocks will apply it blockwise to each of the numpy arrays that make up your dask array.
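A minimal sketch of that suggestion, built from the question's inputs (the chunk size and the 'U3' output dtype are illustrative assumptions):
import numpy as np
import dask.array as da

cols = 100000
left = da.from_array(np.repeat(1, cols).astype('str'), chunks=10000)
right = da.from_array(np.repeat(1, cols).astype('str'), chunks=10000)

def join(x, y):
    # plain numpy string concatenation, applied to one block at a time
    out = np.core.defchararray.add(np.core.defchararray.add(x, '-'), y)
    return out.astype('U3')   # '1-1' is 3 characters

index = da.map_blocks(join, left, right, dtype='U3')
index.compute()   # numpy array of '1-1' strings, built block by block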

The efficient way of array transformation using numpy

How can I change the array U(Nz, Ny, Nx) to U(Nx, Ny, Nz) using numpy? Thanks.
Just numpy.transpose(U) or U.T.
In general, if you want to change the order of data in a numpy array, see http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.array-manipulation.html#rearranging-elements.
The np.fliplr() and np.flipud() functions can be particularly useful when the transpose is not actually what you want.
Additionally, more general element reordering can be done by creating an index mask, as partially explained here.
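A quick shape check (a sketch with dummy sizes Nz=2, Ny=3, Nx=4): for a 3-D array, .T reverses all the axes, which is exactly (Nz, Ny, Nx) -> (Nx, Ny, Nz); np.transpose also accepts an explicit axis order.
import numpy as np

U = np.zeros((2, 3, 4))                    # (Nz, Ny, Nx)
print(U.T.shape)                           # (4, 3, 2): all axes reversed
print(np.transpose(U, (2, 1, 0)).shape)    # (4, 3, 2): explicit axis order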

Save a csr_matrix and a numpy array in one file

I need to save a large sparse csr_matrix and a numpy array so that I can read them back later. Let X be the sparse csr_matrix and Y be the numpy array.
Currently I take the following slightly insane route.
from scipy.sparse import csr_matrix
import numpy as np
def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
save_sparse_csr("file1", X)
np.save("file2", Y)
Then when I want to read them in it is:
X = load_sparse_csr("file1.npz")
Y = np.load("file2.npy")
Two questions:
Is there a better way to save a csr_matrix than this?
Can I save both X and Y to the same file somehow? It seems crazy to have to make two files for this.
So you are saving the 3 array attributes of the csr along with its shape. And that is sufficient to recreate the array, right?
What's wrong with that? Even if you find a function that saves the csr for you, I bet it is doing the same thing - saving those same arrays.
The normal way in Python to save a class instance is to pickle it. But the class has to provide the appropriate pickle method. numpy does that (essentially via its save function). But as far as I know, scipy.sparse has not provided that.
Since scipy.sparse has its roots in the MATLAB sparse code (and in C/Fortran code developed for linear algebra problems), it can load/save using the loadmat/savemat functions. I'd have to double check, but I think those work with csc, the default MATLAB sparse format.
There are one or two other sparse io modules that handle sparse matrices, but I haven't worked with those. They are formats for sharing sparse arrays among different packages working on the same problems (for example PDEs or finite elements). More than likely those formats use a coo-compatible layout (data, rows, cols), either as 3 arrays, a csv with 3 columns, or a 2d array.
Mentioning the coo format raises another possibility: make a structured array with data, row, and col fields, and use np.save or even np.savetxt. I don't think it's any faster or cleaner than saving the csr attributes directly, but it does put all the data in one array (though shape might still need a separate entry).
You might also be able to pickle the dok format, since it is a dict subclass.
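On the second question, one hedged option is simply to extend the asker's own savez approach: np.savez accepts any number of named arrays, so the csr attributes and Y can share a single .npz file. A sketch, reusing the question's imports (not a scipy-provided API):
def save_both(filename, X, Y):
    # X is a csr_matrix, Y a numpy array; everything lands in one .npz file
    np.savez(filename, data=X.data, indices=X.indices,
             indptr=X.indptr, shape=X.shape, Y=Y)

def load_both(filename):
    loader = np.load(filename)
    X = csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                   shape=loader['shape'])
    return X, loader['Y']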

Python - How to construct a numpy array out of a list of objects efficiently

I am building a python application where I retrieve a list of objects, and I want to plot them (for plotting I use matplotlib). Each object in the list contains two properties.
For example, let's say I have the list rawdata and the objects stored in it have the properties timestamp and power:
rawdata[0].timestamp == 1
rawdata[1].timestamp == 2
rawdata[2].timestamp == 3
etc
rawdata[0].power == 1232.547
rawdata[1].power == 2525.423
rawdata[2].power == 1125.253
etc
I want to be able to plot those two dimensions that the two properties represent, and I want to do it in a time- and space-efficient way. That means I want to avoid iterating over the list and sequentially constructing something like a numpy array out of it.
Is there a way to apply an on-the-fly transformation of the list? Or somehow plot it as it is? Since all the information is already included in the list, I believe there should be a way.
The closest answer I found was this, but it includes sequential iteration over the list.
update
As pointed out by Antonio Ragagnin, I can use the map builtin function to construct a numpy array efficiently. But that also means I will have to create a second data structure. Can I use map to transform the list on the fly into a two-dimensional numpy array?
From the matplotlib tutorial (emphasis mine):
If matplotlib were limited to working with lists, it would be fairly useless for numeric processing. Generally, you will use numpy arrays. In fact, all sequences are converted to numpy arrays internally.
So you lose nothing by converting it to a numpy array; if you don't do it, matplotlib will.
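If the conversion is to be done explicitly, one low-overhead option is np.fromiter, which fills each 1-D array in a single pass without materializing an intermediate list. A sketch, assuming rawdata is the question's list of objects:
import numpy as np
import matplotlib.pyplot as plt

timestamps = np.fromiter((obj.timestamp for obj in rawdata), dtype=float)
power = np.fromiter((obj.power for obj in rawdata), dtype=float)
plt.plot(timestamps, power)
plt.show()
This still iterates the list once, but so would matplotlib's internal conversion.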
