More optimized way to do itertools.combinations

More optimized way to do itertools.combinations - python

I'm trying to find unique combinations of ~70,000 IDs.
I'm currently doing an itertools.combinations([list name], 2) to get unique 2 ID combinations but it's been running for more than 800 minutes.
Is there a faster way to do this?
I tried converting the IDs into a matrix where the IDs are both the index and the columns and populating the matrix using itertools.product.
I tried doing it the manual way with loops too.
But after more than a full day of letting them run, none of my methods have actually finished running.
For additional information, I'm storing these into a data frame, to later run a function that compares each of the unique set of IDs.

(70_000 ** 69_000) / 2== 2.4 billion - it is not such a large number as to be not computable in a few hours (update I run a dry-run on itertools.product(range(70000), 2) and it took less than 70 seconds, on a 2017 era i7 #3GHz, naively using a single core) But if you are trying to keep this data in memory at once, them it won't fit - and if your system is configured to swap memory to disk before erroring with a MemoryError, this may slow-down the program by 2 or more orders of magnitude, and thus, that is when your problem come from.
itertools.combination does the right thing in this respect, and no need to try to change it for something else: it will yield one combination at a time. What you are doing with the result, however, do change things: if you are streaming the combination to a file and not keeping it in memory, it should be fine, and then, it is just computational time you can't speed up anyway.
If, on the other hand, you are collecting the combinations to a list or other data structure: there is your problem - don't do it.
Now. going a step further than your question, since these combinations are check-able and predictable, maybe trying to generate these is not the right approach at all - you don't give details on how these are to be used, but if used in a reactive form, or on a lazy form, you might have an instantaneous workflow instead.

Your Ram will run full. You can counter this with gc.collect() or emtpying the results but the found results have to be saved inbetween.
You could try something similar to the code below. I would create individual file names or save the results into a database since the result file will be some gb big. Additionaly range of the second loop can probably be divided by 2.
import gc
new_set=set()
for i in range(70000):
new_set.add(i)
print(new_set)
combined_set=set()
for i in range(len(new_set)):
print(i)
if i % 300 ==0:
with open("results","a") as f:
f.write(str(combined_set))
combined_set=set()
gc.collect()
for b in range(len(new_set)):
combined_set.add((i,b))

Related

How do I perform deduplication with the python record linkage toolkit with large data sets?

I am currently using Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion records pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair multi-index. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the multiindex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire multiindex into memory and causing an out of memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.

While I was able to solve this, sort of, I am going to leave it open because I suspect given my inexperience with python, my process could be improved.
Basically, I had to ditch the index function from record linkage toolkit. I pulled out the Index of the dataframe I was using, and then converted it to a list, and passed it through the itertools combinations function.
candidates = fl
candidates = candidates.index
candidates = candidates.tolist()
candidates = combinations(candidates,2)
This then gave me an iteration object full of tuples, without having to load everything in to memory. I then passed it into an islice grouper as a for loop.
for x in iter(lambda: list(islice(candidates,1000000)),[]):
I then proceeded to perform all of the necessary comparisons in the for loop, and added the resultant dataframe to a dictionary, which I then concatenate at the end for the full list. Python's memory usage hasn't risen above 3GB the entire time.
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).

Handling large list of lists in python

I have this mathematical task in which I am supposed to find some combinations, etc. That doesn't matter, the problem is that I am trying to do it with itertools module and it worked fine on smaller combinations (6 - places), but now I want to do the same for large combination (18 - places) so here I run into problem because I only have 8GB of RAM and this list comes around 5GB and with my system running it consumes all RAM and then program drops MemoryError. So my question is: what would be good alternative to the method I'm using(code below)?
poliedar_kom = list(itertools.combinations_with_replacement(range(0, 13), 18))
poliedar_len = len(poliedar_kom)
So when I have this list and it's length, the rest of program is going through every value in list and checking for condition with values in another smaller list. As I already said that's problem because this list gets too big for my PC, but I'm probably doing something wrong.
Note: I am using latest Python 3.8 64-bit
Summary: I have too big list of lists through which I have to loop to check values for conditions.
EDIT: I appreciate all answers, I have to try them now, if you have any new possible solution to the problem please post it.
EDIT 2: Thanks everyone, you helped me really much. I marked answer that pointed me to Youtube video because it made me realize that my code is already generator. Thanks everyone!!!

Use generators for large data ranges, time and space complexity of the code will not increase exponentially with large data size, refer to the link for more details:
https://www.youtube.com/watch?v=bD05uGo_sVI

For any application requiring more than say, 1e4 items, you should refrain from using python lists, which are very memory- and processor-intesive
For such uses, I generally go to numpy arrays or pandas dataframes
If you aren't comfortable with these, is there some way you could refactor your algorithm so that you don't hold every value in memory at once, like with a generator?

in your case!
1) store this amount of data not in the RAM but inside a file or something in your HDD/SDD (say some SQL databases or NoSQL databases)
2) write a generator that processes each list (group of list for more efficiency) inside the whole list one after the other until the end
it will be good for you to use something like mongodb or mysql/mariadb/postgresql to store this amount of datas.

Dask: why has CPU usage suddenly dropped?

I'm doing some Monte Carlo for a model and figured that Dask could be quite useful for this purpose. For the first 35 hours or so, things were running quite "smoothly" (apart from the fan noise giving a sense that the computer was taking off). Each model run would take about 2 seconds and there were 8 partitions running it in parallel. Activity monitor was showing 8 python3.6 instances.
However, the computer has become "silent" and CPU usage (as displayed in Spyder) hardly exceeds 20%. Model runs are happening sequentially (not in parallel) and taking about 4 seconds each. This happened today at some point while I was working on other things. I understand that depending on the sequence of actions, Dask won't use all cores at the same time. However, in this case there is really just one task to be performed (see further below), so one could expect all partitions to run and finish more or less simultaneously. Edit: the whole set up has run successfully for 10.000 simulations in the past, the difference now being that there are nearly 500.000 simulations to run.
Edit 2: now it has shifted to doing 2 partitions in parallel (instead of the previous 1 and original 8). It appears that something is making it change how many partitions are simultaneously processed.
Edit 3: Following recommendations, I have used a dask.distributed.Client to track what is happening, and ran it for the first 400 rows. An illustration of what it looks like after completing is included below. I am struggling to understand the x-axis labels, hovering over the rectangles shows about 143 s.
Some questions therefore are:
Is there any relationship between running other software (Chrome, MS Word) and having the computer "take back" some CPU from python?
Or instead, could it be related to the fact that at some point I ran a second Spyder instance?
Or even, could the computer have somehow run out of memory? But then wouldn't the command have stopped running?
... any other possible explanation?
Is it possible to "tell" Dask to keep up the hard work and go back to using all CPU power while it is still running the original command?
Is it possible to interrupt an execution and keep whichever calculations have already been performed? I have noticed that stopping the current command doesn't seem to do much.
Is it possible to inquire on the overall progress of the computation while it is running? I would like to know how many model runs are left to have an idea of how long it would take to complete in this slow pace. I have tried using the ProgressBar in the past but it hangs on 0% until a few seconds before the end of the computations.
To be clear, uploading the model and the necessary data would be very complex. I haven't created a reproducible example either out of fear of making the issue worse (for now the model is still running at least...) and because - as you can probably tell by now - I have very little idea of what could be causing it and I am not expecting anyone to be able to reproduce it. I'm aware this is not best practice and apologise in advance. However, I would really appreciate some thoughts on what could be going on and possible ways to go about it, if anyone has been thorough something similar before and/or has experience with Dask.
Running:
- macOS 10.13.6 (Memory: 16 GB | Processor: 2.5 GHz Intel Core i7 | 4 cores)
- Spyder 3.3.1
- dask 0.19.2
- pandas 0.23.4
Please let me know if anything needs to be made clearer
If you believe it can be relevant, the main idea of the script is:
# Create a pandas DataFrame where each column is a parameter and each row is a possible parameter combination (cartesian product). At the end of each row some columns to store the respective values of some objective functions are pre-allocated too.
# Generate a dask dataframe that is the DataFrame above split into 8 partitions
# Define a function that takes a partition and, for each row:
# Runs the model with the coefficient values defined in the row
# Retrieves the values of objective functions
# Assigns these values to the respective columns of the current row in the partition (columns have been pre-allocated)
# and then returns the partition with columns for objective functions populated with the calculated values
# map_partitions() to this function in the dask dataframe
Any thoughts?
This shows how simple the script is:
The dashboard:
Update: The approach I took was to:
Set a large number of partitions (npartitions=nCores*200). This made it much easier to visualise the progress. I'm not sure if setting so many partitions is good practice but it worked without much of a slowdown.
Instead of trying to get a single huge pandas DataFrame in the end by .compute(), I got the dask dataframe to be written to Parquet (in this way each partition was written to a separate file). Later, reading all files into a dask dataframe and computeing it to a pandas DataFrame wasn't difficult, and if something went wrong in the middle at least I wouldn't lose the partitions that had been successfully processed and written.
This is what it looked like at a given point:

Dask has many diagnostic tools to help you understand what is going on inside your computation. See http://docs.dask.org/en/latest/understanding-performance.html
In particular I recommend using the distributed scheduler locally and watching the Dask dashboard to get a sense of what is going on in your computation. See http://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
This is a webpage that you can visit that will tell you exactly what is going on in all of your processors.

Python: slow nested for loop

I need to find out an optimal selection of media, based on certain constraints. I am doing it in FOUR nested for loop and since it would take about O(n^4) iterations, it is slow. I had been trying to make it faster but it is still damn slow. My variables can be as high as couple of thousands.
Here is a small example of what I am trying to do:
max_disks = 5
max_ssds = 5
max_tapes = 1
max_BR = 1
allocations = []
for i in range(max_disks):
for j in range(max_ssds):
for k in range(max_tapes):
for l in range(max_BR):
allocations.append((i,j,k,l)) # this is just for example. In actual program, I do processing here, like checking for bandwidth and cost constraints, and choosing the allocation based on that.
It wasn't slow for up to hundreds of each media type but would slow down for thousands.
Other way I tried is :
max_disks = 5
max_ssds = 5
max_tapes = 1
max_BR = 1
allocations = [(i,j,k,l) for i in range(max_disks) for j in range(max_ssds) for k in range(max_tapes) for l in range(max_BR)]
This way it is slow even for such small numbers.
Two questions:
Why the second one is slow for small numbers?
How can I make my program work for big numbers (in thousands)?
Here is the version with itertools.product
max_disks = 500
max_ssds = 100
max_tapes = 100
max_BR = 100
# allocations = []
for i, j, k,l in itertools.product(range(max_disks),range(max_ssds),range(max_tapes),range(max_BR)):
pass
It takes 19.8 seconds to finish with these numbers.

From the comments, I got that you're working on a problem that can be rewritten as an ILP. You have several constraints, and need to find a (near) optimal solution.
Now, ILPs are quite difficult to solve, and brute-forcing them quickly becomes intractable (as you've already witnessed). This is why there are several really clever algorithms used in the industry that truly work magic.
For Python, there are quite a few interfaces that hook-up to modern solvers; for more details, see e.g. this SO post. You could also consider using an optimizer, like SciPy optimize, but those generally don't do integer programming.

Doing any operation in Python a trillion times is going to be slow. However, that's not all you're doing. By attempting to store all the trillion items in a single list you are storing lots of data in memory and manipulating it in a way that creates a lot of work for the computer to swap memory in and out once it no longer fits in RAM.
The way that Python lists work is that they allocate some amount of memory to store the items in the list. When you fill up the list and it needs to allocate more, Python will allocate twice as much memory and copy all the old entries into the new storage space. This is fine so long as it fits in memory - even though it has to copy all the contents of the list each time it expands the storage, it has to do so less frequently as it keeps doubling the size. The problem comes when it runs out of memory and has to swap unused memory out to disk. The next time it tries to resize the list, it has to reload from disk all the entries that are now swapped out to disk, then swap them all back out again to get space to write the new entries. So this creates lots of slow disk operations that will get in the way of your task and slow it down even more.
Do you really need to store every item in a list? What are you going to do with them when you're done? You could perhaps write them out to disk as you're going instead of accumulating them in a giant list, though if you have a trillion of them, that's still a very large amount of data! Or perhaps you're filtering most of them out? That will help.
All that said, without seeing the actual program itself, it's hard to know if you have a hope of completing this work by an exhaustive search. Can all the variables be on the thousands scale at once? Do you really need to consider every combination of these variables? When max_disks==2000, do you really need to distinguish the results for i=1731 from i=1732? For example, perhaps you could consider values of i 1,2,3,4,5,10,20,30,40,50,100,200,300,500,1000,2000? Or perhaps there's a mathematical solution instead? Are you just counting items?

How to design a memory and computationally intensive program to run on Google App Engine

I have a problem with my code running on google app engine. I dont know how to modify my code to suit GAE. The following is my problem
for j in range(n):
for d in range(j):
for d1 in range(d):
for d2 in range(d1):
# block which runs in O(n^2)
Efficiently the entire code block is O(N^6) and it will run for more than 10 mins depending on n. Thus I am using task queues. I will also be needing a 4 dimensional array which is stored as a list (eg A[j][d][d1][d2]) of n x n x n x n ie needs memory space O(N^4)
Since the limitation of put() is 10 MB, I cant store the entire array. So I tried chopping into smaller chunks and store it and when retrieve combine them. I used the json function for this but it doesnt support for larger n (> 40).
Then I stored the whole matrix as individual entities of lists in datastore ie each A[j][d][d1] entity. So there is no local variable. When i access A[j][d][d1][d2] in my code I would call my own functions getitem and putitem to get and put data from datastore (used caching also). As a result, my code takes more time for computation. After few iterations, I get the error 203 raised by GAE and task fails with code 500.
I know that my code may not be best suited for GAE. But what is the best way to implement it on GAE ?

There may be even more efficient ways to store your data and to iterate over it.
Questions:
What datatype are you storing, list of list ... of int?
What range of the nested list does your innermost loop O(n^2) portion typically operate over?
When you do the putitem, getitem how many values are you retrieving in a single put or get?
Ideas:
You could try compressing your json (and base64 for cut and pasting). 'myjson'.encode('zlib').encode('base64')
Using a divide and conquer (map reduce) as #Robert suggested. You may be able to use a dictionary with tuples for keys, this may be fewer lookups then A[j][d][d1][d2] in your inner loop. It would also allow you to sparsly populate your structure. You would need to track and know your bounds of what data you loaded in another way. A[j][d][d1][d2] becomes D[(j,d,d1,d2)] or D[j,d,d1,d2]

You've omitted important details like the expected size of n from your question. Also, does the "# block which runs in O(n^2)" need access to the entire matrix, or are you simply populating the matrix based on the index values?
Here is a general answer: you need to find a way to break this up into smaller chunks. Maybe you can use some type of divide and conquer strategy and use tasks for parallelism. How you store your matrix depends on how you split the problem up. You might be able to store submatrices, or perhaps subvectors using the index values as key-names; again, this will depend on your problem and the strategy you use.
An alternative, if for some reason you can not figure out how to parallelize your algorithm, is to use a continuation strategy of some type. In other works, figure out about how many iterations you can typically do within the time constraints (leaving a safety margin), then once you hit that limit save your data and insert a new task to continue the processing. You'll just need to pass in the starting position, then resume running from there. You may be able to do this easily by giving a starting parameter to the outermost range, but again it depends on the specifics of your problem.

Sam, just give you an idea and pointer on where to start.
If what you need is somewhere between storing the whole matrix and storing the numbers one-by-one, may be you will be interested to use pickle to serialize your list, and store them in datastore for later retrieval.
list is a python object, and you should be able to serialize it.
http://appengine-cookbook.appspot.com/recipe/how-to-put-any-python-object-in-a-datastore

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.