Pickle as a performance benchmark, or, speedier containment checks - python

I'm loading a data structure from a CSV. The resultant structure is several collection classes composed of item objects, each of which has one or two attributes that point to children or parents in the other collections.
It's not very much data. The CSV is 800KB. Loading the data structure from CSV currently takes me about 3.5 seconds. Initially I had some O(n^2) code whose containment lookups were pushing execution to the full n^2, but I've added a bit of indexing and really reduced that.
Now my weak points are still containment lookups, but linear. It's frustrating when the most time-consuming components of your program are repeated executions of code like:
def add_thing(thing):
    if thing not in things:
        things.append(thing)
I may be missing something (I hope I am!) but it appears there's nothing much I can do to speed these up. I don't want the things collection to contain the same thing more than once. To do that, I must check every single item that's already there. Perhaps there is a faster list that can do this for me? A C-based, dupe-free list?
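For what it's worth, the closest thing I can think of is to back the list with a set of already-seen items, so the membership test becomes an average O(1) hash lookup instead of a linear scan. This is only a rough sketch and assumes the items are hashable (which mine may not be without adding a key):

things = []
seen = set()

def add_thing(thing):
    # Hash lookup instead of scanning the whole list.
    if thing not in seen:
        seen.add(thing)
        things.append(thing)

If insertion order didn't matter at all, things could presumably just be a set outright.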
Anyway, I thought I would pickle the resultant data structure and see how long it took to load. Admittedly, I have no idea what goes into unpickling, but my thinking is that unpickling must be closer to raw I/O, since the data structure itself is already laid out in the pickle. I was surprised to learn that pickle.load-ing this data structure takes 2.4 seconds, only about 30% faster than my CSV slop. cPickle does the same job in 0.4 seconds, but I'm trying to keep this as apples-to-apples as possible, which I realize I may have already failed at. But I hope you can see my train of thought and criticize its weak points.
So, what am I to gather, if anything, from the exercise?
The performance of pickle.load and my CSV loading code have nothing to do with each other, and I should not be comparing them?
Things like this just take a long time?
I can optimize my containment lookups further?
I've got it all wrong?
(N.B. I'm using cProfile for profiling)

Related

Faster repetitive uses of bz2.BZ2File for pickling

I'm pickling multiple objects repeatedly, though not consecutively. As it turned out, the pickled output files were too large (about 256MB each).
So I tried bz2.BZ2File instead of open, and each file became 1.3MB. (Yeah, wow.) The problem is that it takes too long (like 95 secs pickling one object) and I want to speed it up.
Each object is a dictionary, and most of them have similar structures (or hierarchies, if that describes it better: almost the same set of keys, and each value that corresponds to each key normally has some specific structure, and so on). Many of the dictionary values are numpy arrays, and I think many zeros will appear there.
Can you give me some advice to make it faster?
Thank you!
I ended up using lz4, which is a blazingly fast compression algorithm.
There is a python wrapper, which can be installed easily:
pip install lz4
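Roughly, the wiring looked like this (a sketch rather than my exact code; it assumes the lz4.frame API from that wrapper, and the helper names are mine):

import pickle
import lz4.frame

def dump_lz4(obj, path):
    # Serialise first, then compress the whole byte string with LZ4.
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "wb") as f:
        f.write(lz4.frame.compress(data))

def load_lz4(path):
    with open(path, "rb") as f:
        return pickle.loads(lz4.frame.decompress(f.read()))

The compression ratio is typically worse than bz2, but the time saved was what mattered here.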

Converting lists to dictionaries to check existence?

If I instantiate/update a few lists very, very few times, in most cases only once, but I check for the existence of an object in that list a bunch of times, is it worth it to convert the lists into dictionaries and then check by key existence?
Or in other words is it worth it for me to convert lists into dictionaries to achieve possible faster object existence checks?
Dictionary lookups are faster than list searches. A set would also be an option. That said:
If "a bunch of times" means "it would be a 50% performance increase" then go for it. If it doesn't but makes the code better to read then go for it. If you would have fun doing it and it does no harm then go for it. Otherwise it's most likely not worth it.
You should be using a set, since from your description I am guessing you wouldn't have a value to associate. See Python: List vs Dict for look up table for more info.
Usually it's not important to tune every line of code for utmost performance.
As a rule of thumb, if you need to look up more than a few times, creating a set is usually worthwhile.
However, consider that PyPy might do the linear search 100 times faster than CPython, in which case "a few" times might become "dozens". In other words, sometimes the constant part of the complexity matters.
It's probably safest to go ahead and use a set there. You're less likely to find that a bottleneck as the system scales than the other way around.
If you really need to microtune everything, keep in mind that the implementation, cpu cache, etc... can affect it, so you may need to remicrotune differently for different platforms, and if you need performance that badly, Python was probably a bad choice - although maybe you can pull the hotspots out into C. :)
Random access (lookup) in a dictionary is faster, but building the hash table consumes more memory.
More performance means more memory usage.
It depends on how many items are in your list.
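If you want to see the difference on your own data before deciding, a quick timeit comparison along these lines works (the sizes and numbers here are purely illustrative):

import timeit

items = list(range(10_000))
as_list = list(items)
as_set = set(items)

# Worst case for the list: the sought item is at the end.
print(timeit.timeit("9_999 in as_list", globals=globals(), number=10_000))
print(timeit.timeit("9_999 in as_set", globals=globals(), number=10_000))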

Selecting between shelve and sqlite for really large dictionary (Python)

I have a large Python dictionary of vectors (150k vectors, 10k dimensions each) of float numbers that can't be loaded into memory, so I have to use one of the two methods for storing this on disk and retrieving specific vectors when appropriate. The vectors will be created and stored once, but might be read many (thousands of) times -- so it is really important to have efficient reading. After some tests with the shelve module, I tend to believe that sqlite will be a better option for this kind of task, but before I start writing code I would like to hear some more opinions on this... For example, are there any other options except those two that I'm not aware of?
Now, assuming we agree that the best option is sqlite, another question relates to the exact form of the table. I'm thinking of using a fine-grained structure with rows of the form vector_key, element_no, value to help efficient pagination, instead of storing all 10k elements of a vector into the same record. I would really appreciate any suggestions on this issue.
You want sqlite3; then, if you use an ORM like sqlalchemy, you can easily expand later and use other back-end databases.
Shelve is more of a "toy" than actually useful in production code.
The other point you are talking about is called normalization. I have personally never been very good at it, but this should explain it for you.
Just as an extra note, this shows performance failures in shelve vs sqlite3.
As you are dealing with numeric vectors, you may find PyTables an interesting alternative.
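To make the fine-grained layout from the question concrete, here is a minimal sqlite3 sketch (the schema and names are illustrative, not a recommendation over storing one row per vector):

import sqlite3

conn = sqlite3.connect("vectors.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS vector_elements (
        vector_key INTEGER,
        element_no INTEGER,
        value      REAL,
        PRIMARY KEY (vector_key, element_no)
    )
""")

def store_vector(key, vector):
    # One row per element; the composite primary key gives an index
    # that makes reads by vector_key efficient.
    conn.executemany(
        "INSERT OR REPLACE INTO vector_elements VALUES (?, ?, ?)",
        ((key, i, float(v)) for i, v in enumerate(vector)),
    )
    conn.commit()

def load_vector(key):
    rows = conn.execute(
        "SELECT value FROM vector_elements WHERE vector_key = ? "
        "ORDER BY element_no",
        (key,),
    )
    return [value for (value,) in rows]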

Increasing efficiency of Python copying large datasets

I'm having a bit of trouble with an implementation of random forests I'm working on in Python. Bear in mind, I'm well aware that Python is not intended for highly efficient number crunching. The choice was based more on wanting to get a deeper understanding of, and additional experience in, Python. I'd like to find a solution to make it "reasonable".
With that said, I'm curious if anyone here can make some performance improvement suggestions about my implementation. Running it through the profiler, it's obvious the most time is being spent executing the list "append" command and my dataset split operation. Essentially I have a large dataset implemented as a matrix (rather, a list of lists). I'm using that dataset to build a decision tree, so I'll split on columns with the highest information gain. The split consists of creating two new datasets with only the rows matching some criteria. The new datasets are generated by initializing two empty lists and appending appropriate rows to them.
I don't know the size of the lists in advance, so I can't pre-allocate them, unless it's possible to preallocate abundant list space but then update the list size at the end (I haven't seen this referenced anywhere).
Is there a better way to handle this task in python?
Without seeing your code it is really hard to give any specific suggestions, since optimisation is a code-dependent process that varies case by case. However, there are still some general things:
Review your algorithm and try to reduce the number of loops. It seems you have a lot of loops, and some of them are deeply nested in other loops (I guess). A small example of this follows below.
If possible, use higher-performance utility modules such as itertools instead of naive code written by yourself.
If you are interested, try PyPy (http://pypy.org/), a performance-oriented implementation of Python.
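For instance, the split described in the question can often be written with list comprehensions instead of explicit append loops, which pushes the per-row work into the interpreter's internals (a sketch; the column index and threshold are stand-ins for whatever criterion you split on):

def split_dataset(rows, column, threshold):
    # Two passes over the rows; the appends happen inside the comprehension
    # machinery rather than as explicit method calls.
    left = [row for row in rows if row[column] < threshold]
    right = [row for row in rows if row[column] >= threshold]
    return left, right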

Loop through list from specific point?

I was wondering if there was a way to keep extremely large lists in memory and then process those lists from specific points. Since these lists will have as many as almost 400 billion numbers before processing, we need to split them up, but I haven't the slightest idea (since I can't find an example) of where to start when trying to process a list from a specific point in Python. Edit: Right now we are not trying to create multiple dimensions, but if it's easier then I'll for sure do it.
Even if your numbers are bytes, 400GB (or 400TB if you use billion in the long-scale meaning) does not normally fit in RAM. Therefore I guess numpy.memmap or h5py may be what you're looking for.
Further to @lazyr's point, if you use the numpy.memmap method, then my previous discussion on views into numpy arrays might well be useful.
This is also the way you should be thinking if you have stacks of memory and everything actually is in RAM.
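For instance, with numpy.memmap you can slice straight into the middle of a huge on-disk array without reading the rest (a sketch; the filename, dtype and shape are assumptions, and the flat binary file is assumed to already exist at that size):

import numpy as np

# Memory-map the file read-only; nothing is loaded until it is accessed.
numbers = np.memmap("huge_numbers.dat", dtype=np.int64, mode="r",
                    shape=(400_000_000,))

start = 150_000_000
window = numbers[start:start + 1_000_000]  # a view; only these pages are read
print(window.sum())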
