Choosing correct numpy or pandas data structure - python

I have a function that generates square ndarrays (e.g. of shape (10, 10)). The values are floats.
I need to be able to say, "tell me the standard deviation of an arbitrary cell (e.g. (3, 6)) across all of the 10x10 ndarrays I just generated".
I don't know what the best structure to store these 10x10 ndarrays in is. Searching through older StackOverflow questions, I saw people warning against making "arrays of arrays", for example.
I'd like something that is efficient, but also easily manipulated (being able to do descriptive statistics on slices of the three-dimensional structure).
I'm not sure how to assemble this, or whether I should be making it a DataFrame (which the original data I have been processing was in), a numpy array, or something else.
Wisdom please?

A pandas Panel seems to fit your requirements nicely. An example of creating the data structure you describe (with n=15, filled with random numbers) and extracting the standard deviation for each data point in the 10x10 square across all squares is:
import pandas
import numpy
wp = pandas.Panel(numpy.random.randn(15, 10, 10))
wp.std(axis=0)
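Note that Panel has since been deprecated and removed from pandas; a plain 3-D numpy array does the same job and is the usual choice today. A minimal sketch (the n=15 and the random data are just stand-ins for the generated squares):

```python
import numpy as np

# Stack the generated 10x10 squares along a new first axis -> shape (15, 10, 10)
squares = [np.random.randn(10, 10) for _ in range(15)]
stack = np.stack(squares)

# Standard deviation of every cell across all squares
cell_std = stack.std(axis=0)        # shape (10, 10)

# Standard deviation of one arbitrary cell, e.g. (3, 6)
single = stack[:, 3, 6].std()
```

Any descriptive statistic on a slice of the 3-D structure works the same way, by choosing the axis to reduce over.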

Related

index million row square matrix for fast access

I have some very large matrices (say, on the order of a million rows) that I cannot keep in memory, and I need to access subsamples of them in decent time (less than a minute...).
I started looking at hdf5 and blaze in combination with numpy and pandas:
http://web.datapark.io/yves/blaze.html
http://blaze.pydata.org
But I found it a bit complicated, and I am not sure if it is the best solution.
Are there other solutions?
thanks
EDIT
Here some more specifications about the kind of data I am dealing with.
The matrices are usually sparse (less than 10%–25% of cells are non-zero)
The matrices are symmetric
And what I would need to do is:
Access for reading only
Extract rectangular sub-matrices (mostly along the diagonal, but also outside)
Did you try PyTables? It can be very useful for very large matrices. Take a look at this SO post.
Your question is lacking a bit in context, but HDF5 compressed block storage is probably as efficient as a sparse storage format for the relatively dense matrices you describe. In memory, you can always cast your views to sparse matrices if it pays off. That seems like an effective and simple solution; and as far as I know, there are no sparse matrix formats that can easily be read partially from disk.
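A sketch of the chunked, compressed HDF5 approach using h5py (a smaller matrix stands in for the million-row case; the file path and chunk size are illustrative):

```python
import os
import tempfile
import numpy as np
import h5py

n = 2000                             # stand-in for the million-row case
matrix = np.random.rand(n, n)
matrix[matrix < 0.9] = 0.0           # roughly 10% of cells non-zero, as described

path = os.path.join(tempfile.mkdtemp(), "matrix.h5")

with h5py.File(path, "w") as f:
    # Chunked, compressed storage: only the chunks overlapping a requested
    # slice are read from disk and decompressed.
    f.create_dataset("m", data=matrix, chunks=(256, 256), compression="gzip")

with h5py.File(path, "r") as f:
    # Rectangular sub-matrix, read partially from disk
    sub = f["m"][100:300, 100:300]
```

The chunk shape is a tuning knob: chunks roughly matching the shape of typical sub-matrix requests minimize the amount of data read per query.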

No space benefit using sparse Pandas dataframe despite extremely low density

I am using Python/Pandas to deal with very large and very sparse single-column data frames, but when I pickle them, there is virtually no benefit. If I try the same thing on Matlab, the difference is colossal, so I am trying to understand what is going on.
Using Pandas:
len(SecondBins)
>> 34300801
dense = pd.DataFrame(np.zeros(len(SecondBins)),columns=['Binary'],index=SecondBins)
sparse = pd.DataFrame(np.zeros(len(SecondBins)),columns=['Binary'],index=SecondBins).to_sparse(fill_value=0)
pickle.dump(dense,open('dense.p','wb'))
pickle.dump(sparse,open('sparse.p','wb'))
Looking at the sizes of the pickled files,
dense = 548.8MB
sparse = 274.4MB
However, when I look at memory usage associated with these variables,
dense.memory_usage()
>>Binary 274406408
>>dtype: int64
sparse.memory_usage()
>>Binary 0
>>dtype: int64
So, for a completely empty sparse vector, there is only slightly more than 50% savings. Perhaps it has something to do with the fact that the variable 'SecondBins' is composed of pd.Timestamp values, which I use as indices in Pandas, so I tried a similar procedure using default indices.
dense_defaultindex = pd.DataFrame(np.zeros(len(SecondBins)),columns=['Binary'])
sparse_defaultindex = pd.DataFrame(np.zeros(len(SecondBins)),columns=['Binary']).to_sparse(fill_value=0)
pickle.dump(dense_defaultindex,open('dense_defaultindex.p','wb'))
pickle.dump(sparse_defaultindex,open('sparse_defaultindex.p','wb'))
But it yields the same sizes on disk.
What is pickle doing under the hood?
If I create a similar zero-filled vector in Matlab, and save it in a .mat file, it's ~180 bytes!?
Please advise.
Regards
Remember that pandas is labelled data. The column labels and the index labels are essentially specialized arrays, and those arrays take up space. So in practice the index acts as an additional column as far as space usage goes, and the column headings act as an additional row.
In the dense case, you essentially have two columns, the data and the index. In the sparse case, you have essentially one column, the index (since the sparse data column contains close to no data). So from this perspective, you would expect the sparse case to be about half the size of the dense case. And that is what you see in your file sizes.
In the MATLAB case, however, the data is not labeled. Therefore, the sparse case takes up almost no space. The equivalent to the MATLAB case would be a sparse matrix, not a sparse dataframe structure. So if you want to take full advantage of the space savings you should use scipy.sparse, which provides sparse matrix support similar to what you get in MATLAB.
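To illustrate the point, an all-zero scipy.sparse vector of comparable length pickles to almost nothing, much like the tiny MATLAB .mat file:

```python
import pickle
import numpy as np
from scipy import sparse

n = 1_000_000
dense = np.zeros(n)
sp = sparse.csr_matrix(dense)            # stores only the non-zero entries

dense_bytes = len(pickle.dumps(dense))   # ~8 MB: every zero is stored explicitly
sparse_bytes = len(pickle.dumps(sp))     # only the (empty) index structure
```

Without labels, the size of the sparse pickle is driven by the number of non-zero entries, not the nominal length of the vector.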

Creating new numpy scalar through C API and implementing a custom view

Short version
Given a built-in quaternion data type, how can I view a numpy array of quaternions as a numpy array of floats with an extra dimension of size 4 (without copying memory)?
Long version
Numpy has built-in support for floats and complex floats. I need to use quaternions -- which generalize complex numbers, but rather than having two components, they have four. There's already a very nice package that uses the C API to incorporate quaternions directly into numpy, which seems to do all the operations perfectly fast. There are a few more quaternion functions that I need to add to it, but I think I can mostly handle those.
However, I would also like to be able to use these quaternions in other functions that I need to write using the awesome numba package. Unfortunately, numba cannot currently deal with custom types. But I don't need the fancy quaternion functions in those numba-ed functions; I just need the numbers themselves. So I'd like to be able to just re-cast an array of quaternions as an array of floats with one extra dimension (of size 4). In particular, I'd like to just use the data that's already in the array without copying, and view it as a new array. I've found the PyArray_View function, but I don't know how to implement it.
(I'm pretty confident the data are held contiguously in memory, which I assume would be required for having a simple view of them. Specifically, elsize = 8*4 and alignment = 8 in the quaternion package.)
Turns out that was pretty easy. The magic of numpy means it's already possible. While thinking about this, I just tried the following with complex numbers:
import numpy as np
a = np.array([1+2j, 3+4j, 5+6j])
a.view(np.float64).reshape(a.shape[0], 2)
And this gave exactly what I was looking for. Somehow the same basic idea works with the quaternion type. I guess the internals just rely on that elsize, divide by sizeof(float) and use that to set the new size in the last dimension???
To answer my own question then, the same idea can be applied to the quaternion module:
import numpy as np, quaternion
a = np.array([np.quaternion(1,2,3,4), np.quaternion(5,6,7,8), np.quaternion(9,0,1,2)])
a.view(np.float64).reshape(a.shape[0], 4)
The view transformation and reshaping combined seem to take about 1 microsecond on my laptop, independent of the size of the input array (presumably because there's no memory copying, other than a few members in some basic python object).
The above is valid for simple 1-d arrays of quaternions. To apply it to general shapes, I just write a function inside the quaternion namespace:
def as_float_array(a):
    "View the quaternion array as an array of floats with one extra dimension of size 4"
    return a.view(np.float64).reshape(a.shape + (4,))
Different shapes don't seem to slow the function down significantly.
Also, it's easy to convert back from a float array to a quaternion array:
def as_quat_array(a):
    "View a float array as an array of quaternions, collapsing the last dimension"
    if a.shape[-1] == 4:
        return a.view(np.quaternion).reshape(a.shape[:-1])
    return a.view(np.quaternion).reshape(a.shape[:-1] + (a.shape[-1]//4,))
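The quaternion package may not be installed everywhere, but the same view/reshape round trip can be checked with built-in complex numbers (two components instead of four), confirming that no memory is copied:

```python
import numpy as np

a = np.array([1+2j, 3+4j, 5+6j])

# Analogue of as_float_array: complex128 -> float64 with a trailing axis of size 2
f = a.view(np.float64).reshape(a.shape + (2,))

# Analogue of as_quat_array: collapse the trailing axis back to complex128
b = f.view(np.complex128).reshape(f.shape[:-1])
```

Both `f` and `b` are views onto the original buffer, which is why the transformation takes constant time regardless of array size.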

Construct huge numpy array with pytables

I generate feature vectors for examples from a large amount of data, and I would like to store them incrementally while I am reading the data. The feature vectors are numpy arrays. I do not know the number of numpy arrays in advance, and I would like to store/retrieve them incrementally.
Looking at pytables, I found two options:
Arrays: they require a predetermined size, and I am not quite sure how computationally efficient appending is.
Tables: the column types do not support lists or arrays.
If it is a plain numpy array, you should probably use Extendable Arrays (EArray) http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class
If you have a numpy structured array, you should use a Table.
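A sketch of the EArray approach, assuming fixed-length vectors of size 10; the file path and vector count are illustrative:

```python
import os
import tempfile
import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "features.h5")

with tables.open_file(path, mode="w") as f:
    # First dimension is 0: that is the extendable axis; each row is one vector.
    earr = f.create_earray(f.root, "features",
                           atom=tables.Float64Atom(), shape=(0, 10))
    for _ in range(100):                  # number of vectors not known in advance
        vec = np.random.randn(10)
        earr.append(vec[np.newaxis, :])   # append expects a (1, 10) block

with tables.open_file(path, mode="r") as f:
    stored = f.root.features.read()
```

EArray appends are cheap because PyTables writes in chunks, so the dataset grows on disk without rewriting what is already stored.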
Can't you just store them in a list? You have your code, and it should be a loop that grabs things from the data to generate each example. Create a list outside the loop and append your vector to it for storage!
array = []
for row in file:
    # here is your code that creates the vector
    array.append(vector)
Then, after you have gone through the whole file, you have an array with all of your generated vectors! Hopefully that is what you need; you were a bit unclear... next time, please provide some code.
Oh, and you did say you wanted pytables, but I don't think it's necessary, especially because of the limitations you mentioned.

implementing exotic complex numbers to use with numpy

I'm using python + numpy + scipy to do some convolution filtering over a complex-number array.
field = np.zeros((field_size, field_size), dtype=complex)
...
field = scipy.signal.convolve(field, kernel, 'same')
So, when I want to use a complex array in numpy, all I need to do is pass the dtype=complex parameter.
For my research I need to implement two other types of complex numbers: dual (i*i=0) and double (i*i=1). It's not a big deal: I just take the python source code for complex numbers and change the multiplication function.
The problem: how do I make a numpy array of those exotic numeric types?
It looks like you are trying to create a new dtype for e.g. dual numbers. It is possible to do this with the following code:
dual_type = np.dtype([("a", np.float64), ("b", np.float64)])
dual_array = np.zeros((10,), dtype=dual_type)
However this is just a way of storing the data type, and doesn't tell numpy anything about the special algebra which it obeys.
You can partially achieve the desired effect by subclassing numpy.ndarray and overriding the relevant member functions, such as __mul__ for multiply and so on. This should work fine for any python code, but I am fairly sure that any C or fortran-based routines (i.e. most of numpy and scipy) would multiply the numbers directly, rather than calling the __mul__. I suspect that convolve would fall into this basket, therefore it would not respect the rules which you define unless you wrote your own pure python version.
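A minimal sketch of the subclassing idea for dual numbers (eps**2 = 0); the class name and storage layout are illustrative, and as noted above, compiled routines will bypass `__mul__`:

```python
import numpy as np

class DualArray(np.ndarray):
    """Array of dual numbers a + b*eps (eps**2 = 0), stored as (..., 2) floats."""

    def __new__(cls, a, b):
        # The last axis holds the (a, b) components of each dual number.
        parts = [np.asarray(a, dtype=float), np.asarray(b, dtype=float)]
        return np.stack(parts, axis=-1).view(cls)

    def __mul__(self, other):
        # Work on plain ndarrays to avoid re-entering this method recursively.
        a, b = np.asarray(self)[..., 0], np.asarray(self)[..., 1]
        c, d = np.asarray(other)[..., 0], np.asarray(other)[..., 1]
        # (a + b*eps)(c + d*eps) = a*c + (a*d + b*c)*eps, since eps**2 = 0
        return DualArray(a * c, a * d + b * c)

x = DualArray([1.0], [2.0])   # 1 + 2*eps
y = DualArray([3.0], [4.0])   # 3 + 4*eps
z = x * y                     # 3 + 10*eps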
Here's my solution:
from iComplex import SplitComplex as c_split
...
ctype = c_split
constructor = np.vectorize(ctype, otypes=[object])
field = constructor(np.zeros((field_size, field_size)))
That is the easy way to create a numpy object array.
As for scipy.signal.convolve: it doesn't seem to work with my complex numbers, so I had to write my own convolution, which is deadly slow. Now I am looking for ways to speed it up.
Would it work to turn things inside-out? I mean, instead of an array as the outer container holding small containers, each holding a couple of floating-point values as a complex number, turn that around so that your complex number is the outer container. You'd have two arrays: one of plain floats for the real part, and another for the imaginary part. The basic super-fast convolver can then do its job, although you'd have to run it up to four times, for all combinations of real/imaginary parts of the two factors.
In color image processing, I have often refactored my code from using arrays of RGB values to three arrays of scalar values, and found a good speed-up due to simpler convolutions and other operations working much faster on arrays of bytes or floats.
YMMV, since locality of the components of the complex (or color) can be important.
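For dual numbers specifically, the inside-out scheme needs only three real convolutions rather than four, since eps**2 = 0 kills one term. A sketch (array sizes are illustrative):

```python
import numpy as np
from scipy.signal import convolve

# Dual-number field and kernel split into component arrays: f = fa + fb*eps
fa, fb = np.random.randn(8, 8), np.random.randn(8, 8)
ka, kb = np.random.randn(3, 3), np.random.randn(3, 3)

# (fa + fb*eps) * (ka + kb*eps) = fa*ka + (fa*kb + fb*ka)*eps,  eps**2 = 0,
# so the fb*kb cross term vanishes and three real convolutions suffice.
out_a = convolve(fa, ka, 'same')
out_b = convolve(fa, kb, 'same') + convolve(fb, ka, 'same')
```

For double numbers (i*i = 1) the fourth convolution comes back, added into the real component instead of vanishing.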
