I was wondering if there is a way to keep extremely large lists in memory and then process those lists from specific points. Since these lists will have as many as almost 400 billion numbers before processing, we need to split them up, but I haven't the slightest idea (since I can't find an example) of where to start when trying to process a list from a specific point in Python. Edit: Right now we are not trying to create multiple dimensions, but if it's easier then I'll for sure do it.
Even if your numbers are single bytes, 400 GB (or 400 TB if you use "billion" in the long-scale meaning) does not normally fit in RAM. Therefore I guess numpy.memmap or h5py may be what you're looking for.
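For instance, a minimal numpy.memmap sketch; the file name, dtype and shape below are placeholder assumptions, not taken from the question:

import numpy as np

# Back the array with a file on disk so only the slices you touch are
# paged into RAM. Scale the shape up to whatever your disk can hold.
big = np.memmap('numbers.dat', dtype=np.int64, mode='w+', shape=(10000000,))

# Process the array starting from a specific point; only this slice is read.
chunk = big[5000000:5000100]
print(chunk.sum())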
Further to @lazyr's point, if you use the numpy.memmap method, then my previous discussion on views into numpy arrays might well be useful.
This is also the way you should be thinking if you have stacks of memory and everything actually is in RAM.
I have a mathematical task in which I am supposed to find some combinations, etc. That doesn't matter; the problem is that I am trying to do it with the itertools module. It worked fine for smaller combinations (6 places), but now I want to do the same for a large combination (18 places), and here I run into a problem: I only have 8 GB of RAM, this list comes to around 5 GB, and with my system running it consumes all the RAM and the program then raises a MemoryError. So my question is: what would be a good alternative to the method I'm using (code below)?
import itertools

poliedar_kom = list(itertools.combinations_with_replacement(range(0, 13), 18))
poliedar_len = len(poliedar_kom)
So once I have this list and its length, the rest of the program goes through every value in the list and checks it against a condition involving values in another, smaller list. As I already said, that's the problem, because this list gets too big for my PC, but I'm probably doing something wrong.
Note: I am using the latest Python 3.8, 64-bit.
Summary: I have a too-big list of lists through which I have to loop to check values against conditions.
EDIT: I appreciate all the answers; I have to try them now. If you have any other possible solution to the problem, please post it.
EDIT 2: Thanks everyone, you helped me a lot. I marked the answer that pointed me to the YouTube video because it made me realize that my code is already a generator. Thanks everyone!
Use generators for large data ranges; the memory used by the code then stays roughly constant instead of growing with the data size (the time to iterate is unchanged). Refer to the link for more details:
https://www.youtube.com/watch?v=bD05uGo_sVI
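For example, a minimal sketch reusing the names from the question: skip the list(...) call so the combinations are produced one at a time instead of all at once.

import itertools

# combinations_with_replacement already returns a lazy iterator, so only
# one tuple is in memory at any moment.
poliedar_kom = itertools.combinations_with_replacement(range(0, 13), 18)

matches = 0
for combo in poliedar_kom:
    # ... check combo against the smaller list here ...
    matches += 1  # len() is not available on an iterator, so count as you go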
For any application requiring more than, say, 1e4 items, you should refrain from using Python lists, which are very memory- and processor-intensive.
For such uses, I generally go to numpy arrays or pandas DataFrames.
If you aren't comfortable with these, is there some way you could refactor your algorithm so that you don't hold every value in memory at once, like with a generator?
In your case:
1) store this amount of data not in RAM but in a file or a database on your HDD/SSD (say, an SQL or NoSQL database);
2) write a generator that processes each list (or group of lists, for efficiency) inside the whole list, one after the other, until the end (a sketch follows below).
It would be good for you to use something like MongoDB or MySQL/MariaDB/PostgreSQL to store this amount of data.
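A minimal sketch of idea 2), using the standard-library sqlite3 module as a stand-in for whichever database you pick; the 'numbers' table and its columns are made up for illustration:

import sqlite3

def iter_values(db_path, batch_size=10000):
    # Yield values from the hypothetical 'numbers' table in batches,
    # so only one batch is held in RAM at a time.
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT value FROM numbers ORDER BY id")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        for (value,) in batch:
            yield value
    conn.close()

# usage: for value in iter_values('numbers.db'): check_the_condition(value)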
I have a really big, like huge, dictionary (it isn't really one, but pretend it is because that's easier and not relevant) that contains the same strings over and over again. I have verified that I can store a lot more in memory if I do poor man's compression and instead store ints that correspond to the strings.
animals = ['ape', 'butterfly', 'cat', 'dog']
exists in a list and therefore has an index value such that animals.index('cat') returns 2
This allows me to store BobsPets = {2, 3} in my object
rather than 'cat' and 'dog'.
For the number of items I have, the memory savings are astronomical. (Really, don't try to dissuade me; this is well tested.)
Currently I then convert the ints back to strings with a for loop:
tempWordList = set()
for IntegOfIndex in TempSet:
    tempWordList.add(animals[IntegOfIndex])
return tempWordList
This code works. It feels "Pythonic," but it also feels like there should be a better way. I am in Python 2.7 on App Engine, if that matters. It may, since I wonder if NumPy has something I missed.
I have about 2.5 million things in my object, each has an average of 3 of these "pets", and there are 7,500-ish ints that represent the pets. (No, they aren't really pets.)
I have considered using a dictionary mapping each string to its position instead of using index(). This doesn't seem faster, but I am interested if anyone thinks it should be. (It took more memory and seemed to be the same speed, or really close.)
I am considering running a bunch of tests with NumPy and its arrays rather than lists, but before I do, I thought I'd ask the audience and see whether I would be wasting time on something for which I have already reached the best solution.
One last thing: the solution should be picklable, since I pickle for loading and transferring data.
It turns out that since my list of strings is fixed, and I just need the index of each string, I am building what is essentially an immutable index array, which is, in short, a tuple.
Moving to a tuple rather than a list gains about a 30% improvement in speed, far more than I would have anticipated.
The bonus is largest on very large lists. It seems that each time you cross a bit threshold the bonus increases, so in lists under 1024 items there is basically no bonus, and at a million items it is pretty significant.
The Tuple also uses very slightly less memory for the same data.
As an aside, playing with the lists of integers: you can make these significantly smaller by using a NumPy array, but the advantage doesn't extend to pickling. The pickles will be about 15% larger. I think this is because the object description is stored in the pickle, but I didn't spend much time looking.
So in short, the only change was to make the animals list a tuple. I was really hoping the answer would be something more exotic.
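A minimal sketch of the change described above (the data is made up; the tuple-vs-list switch is the only point):

# The fixed vocabulary becomes a tuple instead of a list.
animals = ('ape', 'butterfly', 'cat', 'dog')

BobsPets = {2, 3}  # ints standing in for 'cat' and 'dog'

# Converting back to strings works exactly as before; a set
# comprehension is an equivalent, slightly tighter spelling.
pet_names = {animals[i] for i in BobsPets}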
If I instantiate/update a few lists very, very few times (in most cases only once), but I check for the existence of an object in those lists a bunch of times, is it worth it to convert the lists into dictionaries and then check by key existence?
Or, in other words, is it worth it for me to convert lists into dictionaries to achieve possibly faster object-existence checks?
Dictionary lookups are faster than list searches. A set would also be an option. That said:
If "a bunch of times" means "it would be a 50% performance increase" then go for it. If it doesn't but makes the code better to read then go for it. If you would have fun doing it and it does no harm then go for it. Otherwise it's most likely not worth it.
You should be using a set, since from your description I am guessing you wouldn't have a value to associate with each key. See Python: List vs Dict for look up table for more info.
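A minimal sketch of the idea (the names are illustrative):

items = ['alpha', 'beta', 'gamma']   # built once, rarely updated

lookup = set(items)                  # one-time O(n) conversion

# Membership checks are now average O(1) instead of O(n):
if 'beta' in lookup:
    print('found it')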
Usually it's not important to tune every line of code for utmost performance.
As a rule of thumb, if you need to look up more than a few times, creating a set is usually worthwhile.
However, consider that PyPy might do the linear search 100 times faster than CPython; then a "few" times might be "dozens". In other words, sometimes the constant part of the complexity matters.
It's probably safest to go ahead and use a set there. You're less likely to find that to be a bottleneck as the system scales than the other way around.
If you really need to micro-tune everything, keep in mind that the implementation, CPU cache, etc. can affect it, so you may need to re-tune differently for different platforms; and if you need performance that badly, Python was probably a bad choice, although maybe you can pull the hotspots out into C. :)
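If you do want to measure it on your own platform, here is a minimal timeit sketch (the sizes are arbitrary):

import timeit

setup = "data_list = list(range(100000)); data_set = set(data_list)"

# Worst-case membership test: an element near the end of the list.
print(timeit.timeit("99999 in data_list", setup=setup, number=1000))
print(timeit.timeit("99999 in data_set", setup=setup, number=1000))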
Random access (lookup) in a dictionary is faster, but creating the hash table consumes more memory.
More performance = more memory usage.
It depends on how many items are in your list.
I'm having a bit of trouble with an implementation of random forests I'm working on in Python. Bear in mind, I'm well aware that Python is not intended for highly efficient number crunching. The choice was based more on wanting to get a deeper understanding of, and additional experience in, Python. I'd like to find a solution to make it "reasonable".
With that said, I'm curious if anyone here can suggest some performance improvements for my implementation. Running it through the profiler, it's obvious that most of the time is spent executing the list "append" command and my dataset split operation. Essentially I have a large dataset implemented as a matrix (rather, a list of lists). I'm using that dataset to build a decision tree, so I split on the columns with the highest information gain. The split consists of creating two new datasets with only the rows matching some criteria. The new datasets are generated by initializing two empty lists and appending the appropriate rows to them.
I don't know the size of the lists in advance, so I can't pre-allocate them, unless it's possible to pre-allocate abundant list space and then trim the list size at the end (I haven't seen this referenced anywhere).
Is there a better way to handle this task in Python?
Without seeing your code it is really hard to give any specific suggestions, since optimisation is a code-dependent process that varies case by case. However, there are still some general things:
Review your algorithm and try to reduce the number of loops. It seems you have a lot of loops, and some of them are deeply nested inside other loops (I guess).
If possible, use higher-performance utility modules such as itertools instead of naive code written by yourself (a small sketch of this kind of rewrite for your split operation follows below).
If you are interested, try PyPy (http://pypy.org/), a performance-oriented implementation of Python.
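As a sketch only, not your actual code: the dataset split described in the question done with a pair of list comprehensions instead of explicit append loops. The column index and threshold names are placeholders for whatever the information-gain step picked.

def split_dataset(dataset, col, threshold):
    # dataset is a list of rows, each row itself a list.
    left = [row for row in dataset if row[col] < threshold]
    right = [row for row in dataset if row[col] >= threshold]
    return left, right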
I have a problem with my code running on Google App Engine. I don't know how to modify my code to suit GAE. The following is my problem:
for j in range(n):
    for d in range(j):
        for d1 in range(d):
            for d2 in range(d1):
                # block which runs in O(n^2)
Effectively the entire code block is O(n^6), and it will run for more than 10 minutes depending on n; thus I am using task queues. I will also need a 4-dimensional array, stored as a list (e.g. A[j][d][d1][d2]) of size n x n x n x n, i.e. it needs O(n^4) memory.
Since the limit for put() is 10 MB, I can't store the entire array, so I tried chopping it into smaller chunks, storing those, and combining them on retrieval. I used the json module for this, but it doesn't work for larger n (> 40).
Then I stored the whole matrix as individual list entities in the datastore, i.e. one entity per A[j][d][d1], so there is no large local variable. When I access A[j][d][d1][d2] in my code I call my own functions getitem and putitem to get and put data from the datastore (I used caching as well). As a result, my code takes more time for the computation. After a few iterations, I get error 203 raised by GAE and the task fails with code 500.
I know that my code may not be best suited for GAE, but what is the best way to implement it on GAE?
There may be even more efficient ways to store your data and to iterate over it.
Questions:
What data type are you storing, a list of lists ... of ints?
What range of the nested list does your innermost O(n^2) loop portion typically operate over?
When you do the putitem and getitem calls, how many values are you retrieving in a single put or get?
Ideas:
You could try compressing your JSON (and base64-encoding it for cutting and pasting): 'myjson'.encode('zlib').encode('base64')
Use divide and conquer (map-reduce) as @Robert suggested. You may be able to use a dictionary with tuples for keys; this may mean fewer lookups than A[j][d][d1][d2] in your inner loop, and it would also allow you to populate your structure sparsely. You would need to track and know the bounds of the data you loaded in another way. A[j][d][d1][d2] becomes D[(j,d,d1,d2)] or D[j,d,d1,d2]; a short sketch follows below.
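A minimal sketch of the tuple-keyed dictionary idea (the values and size are purely illustrative):

# Sparse 4-D structure keyed by index tuples instead of nested lists.
D = {}

n = 5  # illustrative size
for j in range(n):
    for d in range(j):
        for d1 in range(d):
            for d2 in range(d1):
                D[j, d, d1, d2] = 0  # only cells you actually touch exist

# Lookup in the inner loop, with a default for never-populated cells:
value = D.get((4, 3, 2, 1), 0)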
You've omitted important details, like the expected size of n, from your question. Also, does the "# block which runs in O(n^2)" need access to the entire matrix, or are you simply populating the matrix based on the index values?
Here is a general answer: you need to find a way to break this up into smaller chunks. Maybe you can use some type of divide and conquer strategy and use tasks for parallelism. How you store your matrix depends on how you split the problem up. You might be able to store submatrices, or perhaps subvectors using the index values as key-names; again, this will depend on your problem and the strategy you use.
An alternative, if for some reason you cannot figure out how to parallelize your algorithm, is to use a continuation strategy of some type. In other words, figure out roughly how many iterations you can typically do within the time constraints (leaving a safety margin), then once you hit that limit save your data and insert a new task to continue the processing. You'll just need to pass in the starting position and resume running from there. You may be able to do this easily by giving a starting parameter to the outermost range, but again it depends on the specifics of your problem.
Sam, just to give you an idea and a pointer on where to start.
If what you need is somewhere between storing the whole matrix and storing the numbers one by one, you may be interested in using pickle to serialize your lists and store them in the datastore for later retrieval.
A list is a Python object, and you should be able to serialize it.
http://appengine-cookbook.appspot.com/recipe/how-to-put-any-python-object-in-a-datastore
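A minimal sketch of the serialization step; the chunking scheme and entity layout are up to you, and the names here are made up:

import pickle

n = 40
# One sub-slice of the matrix, e.g. everything under a fixed j and d.
chunk = [[0.0] * n for _ in range(n)]

blob = pickle.dumps(chunk, protocol=2)   # bytes to store in a datastore entity
restored = pickle.loads(blob)            # later, after fetching the entity back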