I'm having a bit of trouble with an implementation of random forests I'm working on in Python. Bear in mind, I'm well aware that Python is not intended for highly efficient number crunching; the choice was based more on wanting to get a deeper understanding of, and additional experience in, Python. I'd like to find a solution to make it "reasonable".
With that said, I'm curious if anyone here can make some performance-improvement suggestions for my implementation. Running it through the profiler, it's obvious that most of the time is being spent executing the list "append" command and my dataset split operation. Essentially I have a large dataset implemented as a matrix (rather, a list of lists). I'm using that dataset to build a decision tree, so I split on the columns with the highest information gain. The split consists of creating two new datasets containing only the rows matching some criteria. The new datasets are generated by initializing two empty lists and appending the appropriate rows to them.
I don't know the size of the lists in advance, so I can't pre-allocate them, unless it's possible to pre-allocate abundant list space and then shrink the list to size at the end (I haven't seen this referenced anywhere).
Is there a better way to handle this task in Python?
Without seeing your code, it is really hard to give specific suggestions, since optimization is a code-dependent process that varies case by case. However, there are still some general things:
Review your algorithm and try to reduce the number of loops. It seems you have a lot of loops, and some of them are deeply nested inside other loops (I guess).
If possible, use higher-performance utility modules such as itertools (or built-in constructs like list comprehensions) instead of naive code written by yourself; see the sketch after this list.
If you are interested, try PyPy (http://pypy.org/); it is a performance-oriented implementation of Python.
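As a rough illustration of the second suggestion, the kind of split described in the question can often be written with list comprehensions, which avoid the per-row method-call overhead of list.append. This is only a sketch; split_dataset, the column/threshold criterion, and the sample data are assumptions, not the asker's actual code:
# Sketch: split a list-of-lists dataset on a column threshold without explicit appends.
def split_dataset(rows, col, threshold):
    # Rows whose value in `col` is below the threshold go left, the rest go right.
    left = [row for row in rows if row[col] < threshold]
    right = [row for row in rows if row[col] >= threshold]
    return left, right

# Tiny made-up example:
data = [[2.7, 1.0, 0], [1.3, 3.1, 1], [3.6, 0.4, 0]]
left, right = split_dataset(data, col=0, threshold=2.0)
The two comprehensions scan the data twice; whether that beats a single loop with two appends depends on the data, so it is worth timing both on your dataset.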
Related
I have this mathematical task in which I am supposed to find some combinations, etc. That doesn't matter; the problem is that I am trying to do it with the itertools module. It worked fine for smaller combinations (6 places), but now I want to do the same for a large combination (18 places), and here I run into a problem: I only have 8 GB of RAM, this list comes to around 5 GB, and building it consumes all the RAM, after which the program raises MemoryError. So my question is: what would be a good alternative to the method I'm using (code below)?
poliedar_kom = list(itertools.combinations_with_replacement(range(0, 13), 18))
poliedar_len = len(poliedar_kom)
Once I have this list and its length, the rest of the program goes through every value in the list and checks a condition against values in another, smaller list. As I already said, that's a problem because this list gets too big for my PC, but I'm probably doing something wrong.
Note: I am using the latest Python 3.8, 64-bit.
Summary: I have a too-big list of lists through which I have to loop to check values against conditions.
EDIT: I appreciate all the answers; I have to try them now. If you have any other possible solution to the problem, please post it.
EDIT 2: Thanks everyone, you helped me a lot. I marked the answer that pointed me to the YouTube video because it made me realize that my code already works as a generator. Thanks everyone!!!
Use generators for large data ranges: a generator yields values one at a time instead of materializing the whole list, so memory use stays essentially flat as the data size grows. Refer to the link for more details:
https://www.youtube.com/watch?v=bD05uGo_sVI
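A minimal sketch of what that looks like for the code in the question; the total count can be computed without building the list at all, and passes_condition / small_list below are placeholders for the asker's actual check:
import itertools
import math

# The length does not require materializing anything: the number of
# combinations-with-replacement of 13 values taken 18 at a time is C(13 + 18 - 1, 18).
# math.comb is available in Python 3.8+.
poliedar_len = math.comb(13 + 18 - 1, 18)

def passes_condition(combo, small_list):
    return combo[0] in small_list  # placeholder condition

small_list = {0, 1, 2}  # placeholder for the smaller list
matches = sum(
    1
    for combo in itertools.combinations_with_replacement(range(13), 18)
    if passes_condition(combo, small_list)
)
Only one combination exists in memory at a time, so RAM stays flat; the trade-off is that iterating tens of millions of combinations in pure Python still takes a while.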
For any application requiring more than, say, 1e4 items, you should refrain from using plain Python lists, which are very memory- and processor-intensive.
For such uses, I generally go to NumPy arrays or pandas DataFrames.
If you aren't comfortable with these, is there some way you could refactor your algorithm so that you don't hold every value in memory at once, like with a generator?
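To give a rough sense of the memory difference, here is a toy comparison; the shape and dtype are arbitrary illustrations, not the asker's data:
import numpy as np

# One million 18-element rows of small integers stored as a NumPy array.
values = np.zeros((1_000_000, 18), dtype=np.int8)
print(values.nbytes)  # 18_000_000 bytes (about 18 MB); an equivalent list of tuples of Python ints is roughly an order of magnitude larger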
In your case:
1) store this amount of data not in RAM but in a file or something on your HDD/SSD (say, a SQL or NoSQL database);
2) write a generator that processes each list (or group of lists, for more efficiency) inside the whole collection, one after the other, until the end.
It would be good for you to use something like MongoDB or MySQL/MariaDB/PostgreSQL to store this amount of data.
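A sketch of the on-disk idea using the sqlite3 module from the standard library; the schema, the batch size, and the text encoding of each combination are assumptions made for illustration:
import itertools
import sqlite3

conn = sqlite3.connect("combos.db")
conn.execute("CREATE TABLE IF NOT EXISTS combos (data TEXT)")

combos = itertools.combinations_with_replacement(range(13), 18)
while True:
    batch = list(itertools.islice(combos, 100_000))  # pull 100K combinations at a time
    if not batch:
        break
    conn.executemany(
        "INSERT INTO combos (data) VALUES (?)",
        ((",".join(map(str, c)),) for c in batch),
    )
    conn.commit()
conn.close()
Whether a database is worth it here depends on whether the combinations need to be revisited; if each one is checked exactly once, streaming them straight from the generator (previous answer) avoids the storage entirely.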
As the title states, what is the fastest way to convert a list of strings to a frequency dict? I've been using this, but I figured this probably isn't a good option:
return dict(Counter(re.findall(f"(?={pattern})",text)))
Or, will a simple loop be better:
freq = {}
ls = re.findall(f"(?={pattern})", text)
for l in ls:
    freq[l] = 1 if l not in freq else freq[l] + 1
return freq
Or are there better ways?
Additionally: What is the most space efficient way?
Thanks!
Using a collections.Counter will be faster than a crude manual for loop - and since Counter is already a dict subclass you don't necessarily need to make it a plain dict either.
If you want to compare execution times of two snippets, you can easily do it using the timeit module:
import re
from collections import Counter
import timeit
text = """
as the title states what is the fastest way to convert a list of strings to a frequency dict. I've been using this, but I figured this isn't probably a good option. I'm getting into Bioinformatics [Youtube, Coursera lol]. I beginning to understand that a vast majority of problems involve parsing through huge chunks of data to identify numerous patterns. As such, time/space efficiency seems to be the top priority. For simple programs like a Frequency Map such optimizations won't make a difference but I'm trying to focus on the optimizations early on for future problems. as the title states what is the fastest way to convert a list of strings to a frequency dict. I've been using this, but I figured this isn't probably a good option. I'm getting into Bioinformatics [Youtube, Coursera lol]. I beginning to understand that a vast majority of problems involve parsing through huge chunks of data to identify numerous patterns. As such, time/space efficiency seems to be the top priority. For simple programs like a Frequency Map such optimizations won't make a difference but I'm trying to focus on the optimizations early on for future problems.as the title states what is the fastest way to convert a list of strings to a frequency dict. I've been using this, but I figured this isn't probably a good option. I'm getting into Bioinformatics [Youtube, Coursera lol]. I beginning to understand that a vast majority of problems involve parsing through huge chunks of data to identify numerous patterns. As such, time/space efficiency seems to be the top priority. For simple programs like a Frequency Map such optimizations won't make a difference but I'm trying to focus on the optimizations early on for future problems.
"""
def fast():
    return dict(Counter(re.findall(r"(\w)", text)))

def slow():
    freq = {}
    ls = re.findall(r"(\w)", text)
    for l in ls:
        freq[l] = 1 if l not in freq else freq[l] + 1
    return freq
rf = timeit.timeit("fast()", "from __main__ import fast", number=10000)
rs = timeit.timeit("slow()", "from __main__ import slow", number=10000)
print("fast: {}".format(rf))
print("slow: {}".format(rs))
This being said:
I'm getting into Bioinformatics. I'm beginning to understand that the vast majority of problems involve parsing through huge chunks of data to identify numerous patterns. As such, time/space efficiency seems to be the top priority. For simple programs like a frequency map, such optimizations won't make a difference, but I'm trying to focus on optimizations early on for future problems – roy05 2 hours ago
While there are indeed some (sometimes very large) performance gains to be had from using the correct data type (e.g., a set vs. a list for containment testing) or implementation (e.g., list comprehensions instead of a for loop plus list.append), most of the time we are really bad at guessing where the real bottlenecks are, so you want to use a profiler to find out. Except for the couple of obvious cases mentioned above, trying to optimize without profiling first is a waste of time.
More importantly, when it comes to "huge chunks of data", those kinds of "optimizations" won't get you very far, and the proper way to solve both time and space issues is massive parallelization (think map-reduce frameworks), which means you have to design the whole code architecture with this in mind right from the start.
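As a toy illustration of that last point, the counting could be split across processes and the partial counts merged afterwards. This is a sketch, not a production map-reduce pipeline, and the fixed-size chunking is an assumption that would miss patterns straddling chunk boundaries:
import re
from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # "Map" step: count pattern occurrences in one chunk of the text.
    return Counter(re.findall(r"\w", chunk))

def parallel_frequency(text, n_chunks=4):
    size = max(1, len(text) // n_chunks)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with Pool() as pool:
        partials = pool.map(count_chunk, chunks)   # map: one Counter per chunk
    return sum(partials, Counter())                # reduce: merge the partial Counters

if __name__ == "__main__":
    print(parallel_frequency("ACGT" * 10000))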
I had an interview problem where I was asked to design an optimized solution that supports lookups on two keys: student number and class (only one class per student).
sn_to_class() should return the class for a given student number, and class_sns() should return the list of student numbers for a given class.
My first solution was to use a hashmap sn_to_class_map (student number as key, class as value) and a hashmap class_to_sns_map (class as key, list of student numbers as value). Each lookup is then O(1), but the data is stored twice.
pseudo code:
from collections import defaultdict

sn_map = dict()             # student number -> class
cl_map = defaultdict(list)  # class -> list of student numbers

def addStudents(sn, cl):
    sn_map[sn] = cl
    cl_map[cl].append(sn)

def getStudents(cl):
    return cl_map[cl]

def getClass(sn):
    return sn_map[sn]
Is my approach correct?
It is not always possible to optimize for everything; there's very often a tradeoff between time and space, or between consistency and availability, or between the time needed for one operation and the time needed for a different operation, . . .
In your case, you have been asked to make an "optimized" solution, and you're faced with such a tradeoff:
If you keep a map from student-numbers to classes, then getClass and addStudents are fast, and you only use the space for that one representation of the data, but getStudents is slower because it needs to read the entire map.
If you keep a map from classes to lists of student-numbers, and don't worry about the order of student-numbers in those lists, then getStudents and addStudents are fast, and you only use the space for that one representation of the data, but getClass is slower because it needs to read the entire map.
If you keep a map from classes to sorted lists of student-numbers, then getStudents is fast, getClass is a bit faster than with unsorted lists (it still needs to examine every class in the map, but at least it can do a binary search within each list), and you only use the space for that one representation of the data; but getClass is still relatively slow if classes are small, and addStudents is significantly slower because inserting a student into a sorted list takes linear time (see the sketch after this list of options).
If you keep two maps, as you propose, then all operations are pretty fast, but you now need the space for both representations of the data.
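For concreteness, here is a rough sketch of the third option (sorted lists plus binary search), using names that mirror the question's pseudocode; error handling is omitted:
import bisect

cl_map = {}  # class -> sorted list of student numbers

def addStudents(sn, cl):
    bisect.insort(cl_map.setdefault(cl, []), sn)  # O(n) insert keeps the list sorted

def getStudents(cl):
    return cl_map[cl]

def getClass(sn):
    # Still examines every class, but binary-searches inside each sorted list.
    for cl, sns in cl_map.items():
        i = bisect.bisect_left(sns, sn)
        if i < len(sns) and sns[i] == sn:
            return cl
    return None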
Your question is, what's the right tradeoff? And we can't answer that for you. Maybe memory is very limited, and one operation is only called very rarely, and only in non-time-sensitive contexts, such that it's better to make that operation slower than to waste memory; but maybe memory is not an issue at all, and speed is what matters. In a real program, I think it'd be much more likely that you'll care about speed than about a factor-of-two difference in memory usage, so your proposed two-maps solution would likely be the best one; but we can't know.
So in an interview situation like you describe, the best approach is to describe multiple options, explain the tradeoff, explain why you might choose one or the other, and optionally explain why the two-maps solution is likely to be best in a real program — but that last part is not the most important part IMHO.
My function gets combinations of numbers (the input can be small or large depending on user input) and loops through each combination to perform some operations.
Below is a line-profiling run of my function; it takes 0.336 seconds to run. While this is fast, it is only a subset of a bigger framework. In that bigger framework, I will need to run this function 50 to 20000 times, which, multiplied by 0.336, is 16.8 to 6720 seconds (I hope this is right). It used to take 0.996 seconds, but I've managed to cut that roughly in half by avoiding function calls.
The major contributor to the time is the two __getitem__ calls, which access the dictionary for information N times, depending on the number of combinations. My dictionary is a collection of data and it looks something like this:
dic = {"array1", a,
"array2", b,
"array3", c,
"listofarray", [ [list of 5 array], [list of 5 array], [list of 5 2d Array ] ]
}
I was able to cut another ~0.01 seconds when I placed the dictionary lookup outside of the loop:
x = dic["listofarray"][2]  # the third entry: the list of 5 2D arrays
So when I loop to get access to the 5 different elements I just do x[i].
Other than that, I am lost as to where to look for further performance gains.
Note: I apologize that I haven't provided any code. I'd love to show it, but it's proprietary. I just wanted to get some thoughts on whether I am looking in the right place for speed-ups.
I am willing to learn and apply new things, so if Cython or some other data structure can speed things up, I am all ears. Thanks so much.
PS: the profiler traces for the first and second __getitem__ calls were posted as images (not reproduced here).
EDIT:
I am using itertools.product(xrange(10), repeat=len(food_choices)) and iterating over this. I convert everything into NumPy arrays with np.array(i).astype(float).
The major contributor to the time is the two __getitem__ calls, which access the dictionary for information N times, depending on the number of combinations.
No it isn't. Your two posted profile traces clearly show that they're NumPy/Pandas __getitem__ functions, not dict.__getitem__. So, you're trying to optimize the wrong place.
Which explains why moving all the dict stuff out of the loop made a difference of a small fraction of a percent.
Most likely the problem is that you're looping over some NumPy object, or using some fake-vectorized function (e.g., via vectorize), rather than performing some NumPy-optimized broadcasting operation. That's what you need to fix.
For example, if you compare these two:
import numpy as np
a = np.arange(1_000_000, dtype=float)  # any sizable array will do
np.vectorize(lambda x: x*2)(a)  # fake vectorization: a Python-level call per element
a * 2                           # true NumPy broadcasting
… the second one will go at least 10x faster on any sizable array, and it's mostly because of all the time spent doing __getitem__, which includes boxing up numbers to be usable by your Python function. (There's also some additional cost in not being able to use CPU-vectorized operations, cacheable tight loops, etc., but even if you arrange things to be complicated enough that those don't enter into it, you're still going to get much faster code.)
Meanwhile:
I am using itertools.product(xrange(10), repeat=len(food_choices)) and iterating over this. I convert everything into numpy arrays np.array(i).astype(float).
So you're creating 10**n separate n-element arrays? That's not making sensible use of NumPy. Each array is tiny, and most likely you're spending as much time building and pulling apart the arrays as you are doing actual work. Do you have the memory to build a single giant array with an extra 10**n-long axis instead? Or, maybe, batch it up into groups of, say, 100K? Because then you could actually build and process the whole array in native NumPy-vectorized code.
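A rough sketch of the batching idea; n, the batch size, and process_batch are placeholders for illustration, not the asker's code:
import itertools
import numpy as np

def process_batch(arr):
    # Placeholder for the real per-combination work, written as a vectorized operation.
    return arr.sum(axis=1)

n = 5  # stands in for len(food_choices)
combos = itertools.product(range(10), repeat=n)
while True:
    flat = np.fromiter(
        itertools.chain.from_iterable(itertools.islice(combos, 100_000)),
        dtype=float,
    )
    if flat.size == 0:
        break
    batch = flat.reshape(-1, n)      # one 2-D array per batch of up to 100K combinations
    results = process_batch(batch)   # every row processed in native NumPy code
Each batch is built in one go with np.fromiter and processed with a single vectorized call, instead of creating and discarding one tiny array per combination.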
However, the first thing you might want to try is to just run your code in PyPy instead of CPython. Some NumPy code doesn't work right with PyPy/NumPyPy, but there are fewer problems with each new version, so you should definitely try it.
If you're lucky (and there's a pretty good chance of that), PyPy will JIT the repeated __getitem__ calls inside the loop, and make it much faster, with no change in your code.
If that helps (or if NumPyPy won't work on your code), Cython may be able to help more. But only if you do all the appropriate static type declarations, etc. And often, PyPy already helps enough that you're done.
I was wondering if there is a way to keep extremely large lists in memory and then process those lists from specific points. Since these lists will have as many as almost 400 billion numbers before processing, we need to split them up, but I haven't the slightest idea (since I can't find an example) of where to start when trying to process a list from a specific point in Python. Edit: Right now we are not trying to create multiple dimensions, but if it's easier then I'll for sure do it.
Even if your numbers are bytes, 400GB (or 400TB if you use billion in the long-scale meaning) does not normally fit in RAM. Therefore I guess numpy.memmap or h5py may be what you're looking for.
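A minimal numpy.memmap sketch; the file name, dtype, and sizes are assumptions, and N is kept small here so the example is runnable (the same pattern applies when the array is hundreds of gigabytes on disk):
import numpy as np

N = 10_000_000
data = np.memmap("huge_numbers.dat", dtype=np.uint8, mode="w+", shape=(N,))
data[:] = 7  # fill with something so there is data to process

start = 1_000_000   # "process the list from a specific point"
chunk = 100_000
for i in range(start, N, chunk):
    view = data[i:i + chunk]    # a view into the file, not an in-RAM copy of everything
    total = int(view.sum())     # placeholder processing step
Slicing a memmap gives exactly the "start from a specific point" behaviour asked about, because the slice is backed by the file and only the touched pages are loaded into memory.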
Further to @lazyr's point, if you use the numpy.memmap method, then my previous discussion of views into NumPy arrays might well be useful.
This is also the way you should be thinking if you have stacks of memory and everything actually is in RAM.