Selecting between shelve and sqlite for really large dictionary (Python) - python

I have a large Python dictionary of vectors (150k vectors, 10k dimensions each) of float numbers that can't be loaded into memory, so I have to use one of the two methods for storing this on disk and retrieving specific vectors when appropriate. The vectors will be created and stored once, but might be read many (thousands of) times -- so it is really important to have efficient reading. After some tests with shelve module, I tend to believe that sqlite will be a better option for this kind of task, but before I start writing code I would like to hear some more opinions on this... For example, are there any other options except of those two that I'm not aware of?
Now, assuming we agree that the best option is sqlite, another question relates to the exact form of the table. I'm thinking of using a fine-grained structure with rows of the form vector_key, element_no, value to help efficient pagination, instead of storing all 10k elements of a vector into the same record. I would really appreciate any suggestions on this issue.

You want sqlite3, then if you use an ORM like sqlalchemy then you can easily grow to expand and use other back end databases.
Shelve is more of a "toy" than actually useful in production code.
The other point you are talking about is called normalization and I have personally never been very good at it this should explain it for you.
Just as an extra note this shows performance failures in shelve vs sqlite3

As you are dealing with numeric vectors, you may find PyTables an interesting alternative.

Related

PyTables, Creating a table without opening an hdf5 file

is it possible to create a PyTables table without opening or creating an hdf5 file?
What I mean, and what I need, is to create a table (well actually very many tables) in different processes, work with these tables and store the tables into an hdf5 file only in the end after some calculations (and ensuring only one process at a time performs the storage).
In principle I could do all the calculations on normal Python data (arrays strings etc) and perform the storage in the end. However, why I would appreciate to work on PyTables right from the start are the sanity checks. I want to always ensure that the data I work with fits into the predefined tables and does not violate shape constraints etc (and since PyTables checks for those problems I don't need to implement it all by myself).
Thanks a lot and kind regards,
Robert
You are looking for pandas which has great Pytables integration. You will be working with tables all the way and in the end you will be able to save to hdf5 in the easiest possible way.
You can create a numpy-array with given shape and datatype.
my_array = num.empty(shape=my_shape, dtype=num.float)
If you need indexing by name have a look at numpy record-arrays (nee numpy recarray)
But if you work directly with the PyTable-Object it can be faster (see benchmark here).

Pickle as a performance benchmark, or, speedier containment checks

I'm loading a data structure from a CSV. The resultant structure is several collection classes composed of item objects, each of which have one or two attributes that point children or parents in the other collections.
It's not very much data. The CSV is 800KB. Loading the data structure from CSV currently takes me about 3.5 seconds. Initially, I had some O(n^2) code that was getting clobbered by containment lookups that took execution to the full n^2, but I've added a bit of indexing and really reduced that.
Now my weak points are still containment lookups, but linear. It's frustrating when the most time-consuming components of your program are repeated executions of code like:
def add_thing(thing):
if thing not in things:
things.append(thing)
I may be missing something (I hope I am!) but it appears there's nothing much I can do to speed these up. I don't want the things collection to contain the same thing more than once. To do that, I must check every single item that's already there. Perhaps there is a faster list that can do this for me? A C-based, dupe-free list?
Anyway, I thought I would pickle the resultant data structure and see how long it took to load. Admittidely, I have no idea what goes into unpickling, but my thinking is that unpickling must be closer to raw I/O since the data structure itself is already laid out in the pickle. I was surprised to learn pickle.loading this data structure takes 2.4 seconds. Only 30% faster than my CSV slop. cPickle does the same job in 0.4 seconds, but I'm trying to keep this as apples-to-apples as possible, which I realize I may have already failed at. But I hope you can see my train of thought and criticize its weak points.
So, what am I to gather, if anything, from the exercise?
The performance of pickle.load and my CSV loading code have nothing to do with each other, and I should not be comparing them?
Things like this just take a long time?
I can optimize my containment lookups further?
I've got it all wrong?
(N.B. I'm using cProfile for profiling)

Increasing efficiency of Python copying large datasets

I'm having a bit of trouble with an implementation of random forests I'm working on in Python. Bare in mind, I'm well aware that Python is not intended for highly efficient number crunching. The choice was based more on wanting to get a deeper understanding of and additional experience in Python. I'd like to find a solution to make it "reasonable".
With that said, I'm curious if anyone here can make some performance improvement suggestions to my implementation. Running it through the profiler, it's obvious the most time is being spent executing the list "append" command and my dataset split operation. Essentially I have a large dataset implemented as a matrix (rather, a list of lists). I'm using that dataset to build a decision tree, so I'll split on columns with the highest information gain. The split consists of creating two new dataset with only the rows matching some critera. The new dataset is generated by initializing two empty lista and appending appropriate rows to them.
I don't know the size of the lists in advance, so I can't pre-allocate them, unless it's possible to preallocate abundant list space but then update the list size at the end (I haven't seen this referenced anywhere).
Is there a better way to handle this task in python?
Without seeing your codes, it is really hard to give any specific suggestions since optimisation is code-dependent process that varies case by case. However there are still some general things:
review your algorithm, try to reduce the number of loops. It seems
you have a lot of loops and some of them are deeply embedded in
other loops (I guess).
if possible use higher performance utility modules such as itertools
instead of naive codes written by yourself.
If you are interested, try PyPy (http://pypy.org/), it is a
performance-oriented implementation of Python.

Loop through list from specific point?

I was wondering if there was a way to keep extremely large lists in the memory and then process those lists from specific points. Since these lists will have as many as almost 400 billion numbers before processing we need to split them up but I haven't the slightest idea (since I can't find an example) of where to start when trying to process a list from a specific point in Python. Edit: Right now we are not trying to create multiple-dimensions but if it's easier then I'll for sure do it.
Even if your numbers are bytes, 400GB (or 400TB if you use billion in the long-scale meaning) does not normally fit in RAM. Therefore I guess numpy.memmap or h5py may be what you're looking for.
Further to the #lazyr's point, if you use the numpy.memmap method, then my previous discussion on views into numpy arrays might well be useful.
This is also the way you should be thinking if you have stacks of memory and everything actually is in RAM.

python solutions for managing scientific data dependency graph by specification values

I have a scientific data management problem which seems general, but I can't find an existing solution or even a description of it, which I have long puzzled over. I am about to embark on a major rewrite (python) but I thought I'd cast about one last time for existing solutions, so I can scrap my own and get back to the biology, or at least learn some appropriate language for better googling.
The problem:
I have expensive (hours to days to calculate) and big (GB's) data attributes that are typically built as transformations of one or more other data attributes. I need to keep track of exactly how this data is built so I can reuse it as input for another transformation if it fits the problem (built with right specification values) or construct new data as needed. Although it shouldn't matter, I typically I start with 'value-added' somewhat heterogeneous molecular biology info, for example, genomes with genes and proteins annotated by other processes by other researchers. I need to combine and compare these data to make my own inferences. A number of intermediate steps are often required, and these can be expensive. In addition, the end results can become the input for additional transformations. All of these transformations can be done in multiple ways: restricting with different initial data (eg using different organisms), by using different parameter values in the same inferences, or by using different inference models, etc. The analyses change frequently and build on others in unplanned ways. I need to know what data I have (what parameters or specifications fully define it), both so I can reuse it if appropriate, as well as for general scientific integrity.
My efforts in general:
I design my python classes with the problem of description in mind. All data attributes built by a class object are described by a single set of parameter values. I call these defining parameters or specifications the 'def_specs', and these def_specs with their values the 'shape' of the data atts. The entire global parameter state for the process might be quite large (eg a hundred parameters), but the data atts provided by any one class require only a small number of these, at least directly. The goal is to check whether previously built data atts are appropriate by testing if their shape is a subset of the global parameter state.
Within a class it is easy to find the needed def_specs that define the shape by examining the code. The rub arises when a module needs a data att from another module. These data atts will have their own shape, perhaps passed as args by the calling object, but more often filtered from the global parameter state. The calling class should be augmented with the shape of its dependencies in order to maintain a complete description of its data atts.
In theory this could be done manually by examining the dependency graph, but this graph can get deep, and there are many modules, which I am constantly changing and adding, and ... I'm too lazy and careless to do it by hand.
So, the program dynamically discovers the complete shape of the data atts by tracking calls to other classes attributes and pushing their shape back up to the caller(s) through a managed stack of __get__ calls. As I rewrite I find that I need to strictly control attribute access to my builder classes to prevent arbitrary info from influencing the data atts. Fortunately python is making this easy with descriptors.
I store the shape of the data atts in a db so that I can query whether appropriate data (i.e. its shape is a subset of the current parameter state) already exists. In my rewrite I am moving from mysql via the great SQLAlchemy to an object db (ZODB or couchdb?) as the table for each class has to be altered when additional def_specs are discovered, which is a pain, and because some of the def_specs are python lists or dicts, which are a pain to translate to sql.
I don't think this data management can be separated from my data transformation code because of the need for strict attribute control, though I am trying to do so as much as possible. I can use existing classes by wrapping them with a class that provides their def_specs as class attributes, and db management via descriptors, but these classes are terminal in that no further discovery of additional dependency shape can take place.
If the data management cannot easily be separated from the data construction, I guess it is unlikely that there is an out of the box solution but a thousand specific ones. Perhaps there is an applicable pattern? I'd appreciate any hints at how to go about looking or better describing the problem. To me it seems a general issue, though managing deeply layered data is perhaps at odds with the prevailing winds of the web.
I don't have specific python-related suggestions for you, but here are a few thoughts:
You're encountering a common challenge in bioinformatics. The data is large, heterogeneous, and comes in constantly changing formats as new technologies are introduced. My advice is to not overthink your pipelines, as they're likely to be changing tomorrow. Choose a few well defined file formats, and massage incoming data into those formats as often as possible. In my experience, it's also usually best to have loosely coupled tools that do one thing well, so that you can chain them together for different analyses quickly.
You might also consider taking a version of this question over to the bioinformatics stack exchange at http://biostar.stackexchange.com/
ZODB has not been designed to handle massive data, it is just for web-based applications and in any case it is a flat-file based database.
I recommend you to try PyTables, a python library to handle HDF5 files, which is a format used in astronomy and physics to store results from big calculations and simulations. It can be used as an hierarchical-like database and has also an efficient way to pickle python objects. By the way, the author of pytables explained that ZOdb was too slow for what he needed to do, and I can confirm you that. If you are interested in HDF5, there is also another library, h5py.
As a tool for managing the versioning of the different calculations you have, you can have a try at sumatra, which is something like an extension to git/trac but designed for simulations.
You should ask this question on biostar, you will find better answers there.

Categories