I would like to store a non-rectangular array in Python. The array has millions of elements and I will be applying a function to each element in the array, so I am concerned about performance. What data structure should I use? Should I use a Python list or a numpy array of type object? Is there another data structure that would work even better?
You can use a dictionary to store everything. If you have ample memory, dictionaries are a good option: hashed key lookups make element access fast.
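For illustration, a minimal sketch of that idea, assuming the ragged data can be keyed by (row, column) pairs (the keys and values below are made up):

ragged = {}
ragged[(0, 0)] = 1.5
ragged[(0, 1)] = 2.5
ragged[(7, 3)] = -0.25   # rows may hold different numbers of elements

# Applying a function to every stored element:
squared = {key: value ** 2 for key, value in ragged.items()}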
I'd suggest using scipy sparse matrices.
UPD: some elaboration below.
I assume that "non-rectangular" means a plain 2D array would have empty elements. With millions of elements, those 'holes' become a real tax on memory usage. A sparse matrix gives you a familiar array-like interface while occupying only the memory it actually needs.
That said, if array-style indexing is not required, a dictionary is perfectly fine storage to use.
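A rough sketch of the sparse approach, assuming the missing elements can be treated as zeros (the shape and values are made up); dok_matrix is convenient for incremental filling, and converting to CSR gives fast elementwise operations:

import numpy as np
from scipy.sparse import dok_matrix

m = dok_matrix((1_000_000, 50), dtype=np.float64)   # mostly-empty 2D layout
m[0, 3] = 1.5
m[999_999, 49] = 2.0
csr = m.tocsr()              # convert once filling is done
doubled = csr.multiply(2.0)  # apply an elementwise operation to the stored values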
Over time I have done many things that require using the list's .append() method, and also numpy.append() for numpy arrays. I noticed that both get really slow when the arrays are big.
I need an array that grows dynamically, for sizes of about 1 million elements. I could implement this myself, just like std::vector is implemented in C++, by keeping a reserve capacity that is not accessible from the outside. But do I have to reinvent the wheel? I imagine it must be implemented somewhere. So my question is: does such a thing already exist in Python?
What I mean is: is there an array type in Python that can grow dynamically, with amortized constant-time appends most of the time?
The memory layout of numpy arrays is well described in the docs, and has been discussed here a lot. List memory layout has also been discussed, though usually just in contrast to numpy.
A numpy array has a fixed size data buffer. 'growing' it requires creating a new array, and copying data to it. np.concatenate does that in compiled code. np.append as well as all the stack functions use concatenate.
A list has, as I understand it, a contiguous data buffer that contains pointers to objects elsewhere in memory. Python maintains some free space in that buffer, so additions with list.append are relatively fast and easy. But when the free space fills up, it has to create a new buffer and copy the pointers. I can see how that could get expensive with large lists.
So a list stores a pointer for each element, plus the element itself (e.g. a float) somewhere else in memory. In contrast, an array of floats stores the floats themselves as contiguous bytes in its buffer. (Object-dtype arrays are more like lists.)
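A quick way to see that difference, as a sketch (exact byte counts vary by Python version and platform):

import sys
import numpy as np

n = 1_000_000
lst = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

pointer_buffer = sys.getsizeof(lst)                  # the list's buffer of pointers
float_objects = sum(sys.getsizeof(x) for x in lst)   # the boxed floats elsewhere in memory
print(pointer_buffer + float_objects)                # roughly 32 MB in CPython
print(arr.nbytes)                                    # exactly 8 MB of contiguous float64 data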
The recommended way to create an array iteratively is to build the list with append, and create the array once at the end. Repeated np.append or np.concatenate is relatively expensive.
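For example, a minimal version of that pattern (the computation inside the loop is just a placeholder):

import numpy as np

values = []
for i in range(1_000_000):
    values.append(i * 0.5)    # list.append is amortized O(1)
arr = np.array(values)        # one conversion/copy at the very end

This is usually much faster than calling np.append inside the loop, because each np.append allocates and copies a whole new array.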
deque was mentioned. I don't know much about how it stores its data. The docs say it can add elements at the start just as easily as at the end, but random access is slower than for a list. That implies that it stores data in some sort of linked list, so that finding the nth element requires traversing the n-1 links before it. So there's a trade-off between ease of growth and access speed.
Adding elements to the start of a list requires shifting all of the existing pointers (or building a new pointer buffer) with the new one(s) at the start. So adding and removing elements at the start of a regular list is much more expensive than doing it at the end.
Recommending software is outside of the core SO purpose. Others may make suggestions, but don't be surprised if this gets closed.
There are file formats like HDF5 that are designed for large data sets. They accommodate growth with features like 'chunking'. And there are all kinds of database packages.
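As a rough illustration of chunked growth, here is a sketch using h5py (one HDF5 binding among several; the file name, dataset name, and sizes are made up):

import h5py
import numpy as np

with h5py.File('big.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(0,), maxshape=(None,),
                            dtype='float64', chunks=True)
    batch = np.random.rand(10_000)
    dset.resize(dset.shape[0] + batch.size, axis=0)  # grow the dataset on disk
    dset[-batch.size:] = batch                       # write the new chunk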
Both use an underlying array. Instead, you can use collections.deque, which is made specifically for adding and removing elements at both ends with O(1) complexity.
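A minimal sketch of deque usage:

from collections import deque

d = deque()
d.append(1)        # O(1) at the right end
d.appendleft(0)    # O(1) at the left end as well
d.pop()            # remove from the right
d.popleft()        # remove from the left

Keep in mind the trade-off mentioned above: indexing into the middle of a deque is slower than indexing into a list.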
I generate feature vectors for examples from a large amount of data, and I would like to store them incrementally while I am reading the data. The feature vectors are numpy arrays. I do not know the number of numpy arrays in advance, and I would like to store and retrieve them incrementally.
Looking at pytables, I found two options:
Arrays: they require a predetermined size, and I am not quite sure how computationally efficient appending is.
Tables: the column types do not support lists or arrays.
If it is a plain numpy array, you should probably use an Extendable Array (EArray): http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class
If you have a numpy structured array, you should use a Table.
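A minimal sketch of the EArray approach, assuming each feature vector is a 1D float64 array of length 128 (the file name, node name, and sizes are made up):

import numpy as np
import tables

with tables.open_file('features.h5', mode='w') as f:
    atom = tables.Float64Atom()
    earray = f.create_earray(f.root, 'features', atom, shape=(0, 128))
    for _ in range(3):                     # stand-in for the data-reading loop
        vec = np.random.rand(128)
        earray.append(vec.reshape(1, -1))  # extend along the first (enlargeable) axis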
Can't you just store them in an array? Your code should be a loop that grabs things from the data and generates each example. Create an array outside the loop and append each vector to it for storage!
array = []
for row in file:
    # here is your code that creates the vector
    array.append(vector)
Then, after you have gone through the whole file, you have an array with all of your generated vectors! Hopefully that is what you need; you were a bit unclear... next time please provide some code.
Oh, and you did say you wanted pytables, but I don't think it's necessary, especially because of the limitations you mentioned.
I have some NumPy arrays that are pickled and stored in MongoDB using the bson module. For instance, if x is a NumPy array, then I set a field of a MongoDB record to:
bson.binary.Binary(x.dumps())
My question is whether it is possible to recover a subset of the array x without reloading the entire array via np.loads(). So, first, how can I get MongoDB to give me back only a chunk of the binary data, and second, how can I turn that chunk into a NumPy array? I should mention that I also already have all the NumPy metadata for the array, such as its dimensions and datatype.
A concrete example might be that I have a 2-dimensional array of size (100000,10) with datatype np.float64 and I want to retrieve just x[50,10].
I cannot say for sure, but checking the API docs of the BSON C++ implementation, I get the impression that it was not designed for partial retrieval...
If you can at all, consider using pytables, which is designed for large data and inter-operating nicely with numpy. Mongo is great for certain distributed applications, though, while pytables is not.
If you store the array directly inside MongoDB, you can also try using the $slice operator to get a contiguous subset of an array. You could linearize your 2D array into a 1D array, and the $slice operator will get you matrix rows, but if you want to select columns or generally select noncontiguous indices, then you're out of luck.
Background on $slice.
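For illustration, a sketch with pymongo, assuming the matrix was stored row-major as a plain BSON list field named 'data' rather than a pickled blob, with k columns per row (the database, collection, and id names are placeholders, and a running MongoDB server is required):

from pymongo import MongoClient

client = MongoClient()
coll = client['mydb']['arrays']      # placeholder database/collection names
doc_id = 'example-id'                # placeholder document id

k, r = 10, 50                        # k columns per row; fetch row r
doc = coll.find_one({'_id': doc_id},
                    {'data': {'$slice': [r * k, k]}})
row = doc['data']                    # just that row, as a Python list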
I need to create an array of integer arrays, like [[0,1,2],[4,4,5,7]...[4,5]]. The size of the internal arrays is changeable. The maximum number of internal arrays is 2^26. So what do you recommend as the fastest way to update this array?
When I use list = [[]] * (2**26), initialization is very fast but updating is very slow. Instead I use:
list = []
for i in range(2**26): list.append([])
Now initialization is slow but updating is fast. For example, with 16777216 internal arrays and an average of 0.213827311993 elements per array, filling the 2^26-element array takes 1.67728900909 sec. That is good, but I will be working with much bigger data, hence I need the best way. Initialization time is not important.
Thank you.
What you ask is quite a problem. Different data structures have different properties. In general, if you need quick membership tests or frequent insertions and deletions in the middle, do not use lists! Those operations take linear time, which means the more you put in them, the longer they will take on average. (Plain indexed access, on the other hand, is constant time.)
You could perhaps use numpy? That library has matrices that can be accessed quite fast, and can be reshaped on the fly. However, if you want to add or delete rows, it might be a bit slow, because it generally reallocates (and thus copies) the entire data. So it is a trade-off.
If you are going to have so many internal arrays of different sizes, perhaps you could use a dictionary to hold them. Indexed by integers, lookups are fast, and you only pay for the entries you actually create. Then the internal arrays themselves could be created with numpy.
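A rough sketch of that idea, collecting values in plain lists during the update phase (since the inner arrays have unpredictable lengths) and converting to numpy arrays once at the end:

import numpy as np
from collections import defaultdict

buckets = defaultdict(list)

buckets[4].append(7)              # an update is a dict lookup plus list.append
buckets[4].append(2)
buckets[16_777_215].append(5)     # only touched indices consume memory

arrays = {i: np.array(v) for i, v in buckets.items()}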
I'm still confused about whether to use a list or a numpy array. I started with the latter, but since I have to do a lot of appends, I ended up with many vstacks slowing my code down. Using a list would solve that problem, but I also need to delete elements, which in turn works well with delete on a numpy array.
As it looks now, I'll have to write my own data type (in a compiled language, and wrap it). I'm just curious whether there isn't a way to get the job done using a Python type.
To summarize, these are the criteria my data type would have to fulfill:
2D: n (variable) rows, each row with k (fixed) elements
stored in memory in one piece (would be nice for efficient operations)
append a row (in amortized constant time, like a C++ vector; always k elements at once)
delete a set of elements (ideally in place, keeping free space at the end for later appends)
access an element given the row and column index (O(1), like data[row*k + column])
A data type like this seems generally useful to me, and not impossible to implement in C/Fortran.
What would be the closest I could get with python?
(Or maybe: do you think it would work to write a Python class for the datatype? What performance should I expect in that case?)
As I see it, if you were doing this in C or Fortran, you'd have to have an idea of the size of the array so that you can allocate the correct amount of memory (ignoring realloc!). So assuming you do know this, why do you need to append to the array?
In any case, numpy arrays have the resize method, which you can use to extend the size of the array.
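For illustration, a minimal sketch of a vector-style growable buffer built on resize (k and the growth factor are arbitrary; note that in-place resize is only safe while no other views reference the buffer, which is why refcheck=False must be used with care):

import numpy as np

k = 3                        # fixed number of columns, as in the question
data = np.zeros((4, k))      # initial capacity
rows_used = 0

def append_row(row):
    # Copy one row into the buffer, doubling capacity when it is full.
    global rows_used
    if rows_used == data.shape[0]:
        data.resize((2 * data.shape[0], k), refcheck=False)
    data[rows_used] = row
    rows_used += 1

append_row([1.0, 2.0, 3.0])
filled = data[:rows_used]    # view of the logically used part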