I generate feature vectors for examples from large amount of data, and I would like to store them incrementally while i am reading the data. The feature vectors are numpy arrays. I do not know the number of numpy arrays in advance, and I would like to store/retrieve them incrementally.
Looking at pytables, I found two options:
Arrays: They require predetermined size and I am not quite sure how
much appending is computationally efficient.
Tables: The column types do not support list or arrays.
If it is a plain numpy array, you should probably use Extendable Arrays (EArray) http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class
If you have a numpy structured array, you should use a Table.
Can't you just store them into an array? You have your code and it should be a loop that will grab things from the data to generate your examples and then it generates the example. create an array outside the loop and append your vector into the array for storage!
array = []
for row in file:
#here is your code that creates the vector
array.append(vector)
then after you have gone through the whole file, you have an array with all of your generated vectors! Hopefully that is what you need, you were a bit unclear...next time please provide some code.
Oh, and you did say you wanted pytables, but I don't think it's necessary, especially because of the limitations you mentioned
Related
I would like to store a non-rectangular array in Python. The array has millions of elements and I will be applying a function to each element in the array, so I am concerned about performance. What data structure should I use? Should I use a Python list or a numpy array of type object? Is there another data structure that would work even better?
You can use the dictionary data structure to store everything. If you have ample memory, dictionaries is a good option. The hashing process makes them faster.
I'd suggest you to use scipy sparse matrices.
UPD. Some elaboration goes below.
I assume that "non-rectangular" implies there will be empty elements in plain 2D array. Having millions of elements will make these 'holes' tax on memory usage. Sparse matrix provide a way to have familiar array interface and occupy only necessary amount of memory.
Though if array-ish indexing is not required, dictionary is pretty fine storage to use.
Let's say I have a 2d python (rectangular i.e. like matrix) list which I want to sort in descending order based on its 2nd column, and I want the list to change and not interested in a copy. What is the best alternative to this approach (using numpy, ...)
arr.sort(key= lambda i:i[1],revere=True)
I've searched a lot but couldn't find an intuitive way that seemed better that the code above.
Any help would be highly appreciated.
You have to give some more informations. A 2d list can be something more than a rectangular 2d numpy array.
In this answer i asume that your data can be represented as a 2d- array.
import numpy as np
np_arr=np.array(arr) #create a numpy array from your list
idx=np.argsort(-np_arr[:,1])
np_arr=np_arr[idx,:]
I have a function that generates square ndarrays of [ex. shape (10,10)]. The values are floats.
I need to be able to say, "tell me the standard deviation of an arbitrary cell [ex. (3,6)] in all of the 10x10 ndarrays I just generated"
I don't know what the best structure to store these 10x10 ndarrays is. I was searching through older StackOverflow questions and people were warning against making "arrays of arrays" for example.
I'd like something that is efficient, but also easily manipulated (being able to do descriptive statistics on slices of the three dimensional structure).
Not sure how to assemble this, and whether I should be making it a dataFrame (which the original data that I have been processing was in) or a numpy array, or something else.
Wisdom please?
A pandas Panel seems to fit your requirements nicely. An example of creating the data structure you describe (with n=15, filled with random numbers) and extracting the standard deviation for each data point in the 10x10 square across all squares is:
import pandas
import numpy
wp = pandas.Panel(numpy.random.randn(15, 10, 10))
wp.std(axis=0)
I wrote a program using normal Python, and I now think it would be a lot better to use numpy instead of standard lists. The problem is there are a number of things where I'm confused how to use numpy, or whether I can use it at all.
In general how do np.arrays work? Are they dynamic in size like a C++ vector or do I have declare their length and type beforehand like a standard C++ array? In my program I've got a lot of cases where I create a list
ex_list = [] and then cycle through something and append to it ex_list.append(some_lst). Can I do something like with a numpy array? What if I knew the size of ex_list, could I declare and empty one and then add to it?
If I can't, let's say I only call this list, would it be worth it to convert it to numpy afterwards, i.e. is calling a numpy list faster?
Can I do more complicated operations for each element using a numpy array (not just adding 5 to each etc), example below.
full_pallete = [(int(1+i*(255/127.5)),0,0) for i in range(0,128)]
full_pallete += [col for col in right_palette if col[1]!=0 or col[2]!=0 or col==(0,0,0)]
In other words, does it make sense to convert to a numpy array and then cycle through it using something other than for loop?
Numpy arrays can be appended to (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html), although in general calling the append function many times in a loop has a heavy performance cost - it is generally better to pre-allocate a large array and then fill it as necessary. This is because the arrays themselves do have fixed size under the hood, but this is hidden from you in python.
Yes, Numpy is well designed for many operations similar to these. In general, however, you don't want to be looping through numpy arrays (or arrays in general in python) if they are very large. By using inbuilt numpy functions, you basically make use of all sorts of compiled speed up benefits. As an example, rather than looping through and checking each element for a condition, you would use numpy.where().
The real reason to use numpy is to benefit from pre-compiled mathematical functions and data processing utilities on large arrays - both those in the core numpy library as well as many other packages that use them.
I have some NumPy arrays that are are pickled and stored in MongoDB using the bson module. For instance, if x is a NumPy array, then I set a field of a MongoDB record to:
bson.binary.Binary(x.dumps())
My question is whether it is possible to recover a subset of the array x without reloading the entire array via np.loads(). So, first, how can I get MongoDB to only give me back a chunk of the binary array, and then second, how can I turn that chunk into a NumPy array. I should mention here that I also have all the NumPy metadata regarding the array already, such as it's dimensions and datatype.
A concrete example might be that I have a 2-dimensional array of size (100000,10) with datatype np.float64 and I want to retrieve just x[50,10].
I can not say for sure, but checking the api docs of BSON C++ I get the idea that it was not designed for partial retrieval...
If you can at all, consider using pytables, which is designed for large data and inter-operating nicely with numpy. Mongo is great for certain distributed applications, though, while pytables is not.
If you store the array directly inside of MongoDB, you can also try using the $slice operator to get a contiguous subset of an array. You could linearize your 2D array into an 1D array, and the $slice operator will get you matrix rows, but if you want to select columns or generally select noncontiguous indicies, then you're out of luck.
Background on $slice.