Itertools.combinations performance issue with large lists - python

I've got a list of tickers in tickers_list and I need to generate every unique pair of its elements.
Here's my code:
corr_list = []
i = 0
for ticker in tickers_list:
    tmp_list = tickers_list[i+1:]
    for tl_ticker in tmp_list:
        corr_list.append({'ticker1': ticker, 'ticker2': tl_ticker})
    i = i + 1
len(corr_list)
Here's the code using itertools:
from itertools import combinations
result = combinations(tickers_list, 2)
new_list = [{'ticker1':comb[0], 'ticker2':comb[1]} for comb in combinations(tickers_list, 2)]
len(new_list)
They produce exactly the same results. Of course, the itertools code is much more elegant, and both work perfectly for 10, 100, and 1,000 items. The time required is almost identical (I timed it). However, at 4,000 items my code takes about 12 seconds to run on my machine, while the itertools version crashes my computer. The resultant set is only around 20m rows, so am I doing something wrong, or is this an issue with itertools?

from itertools import combinations, product
# 20 * 20 * 20 == 8000 items
tickers_list = ["".join(chars) for chars in product("ABCDEFGHIJKLMNOPQRST", repeat=3)]
# 8000 * 7999 / 2 == 31996000 (just under 32 million) items
unique_pairs = [{"ticker1":t1, "ticker2":t2} for t1,t2 in combinations(tickers_list, 2)]
works fine on my machine (Win7Pro x64, Python 3.4) up to len(tickers_list) == 8000 (taking 38s and consuming almost 76 GiB of RAM to do so!).
Are you running 32-bit or 64-bit Python? Do you actually need to store all combinations, or can you use and discard them as you go?
Edit: I miscalculated the size; it is actually about 9.5 GB. Try it yourself:
from sys import getsizeof as size
size(unique_pairs) + len(unique_pairs) * size(unique_pairs[0])
which gives
9479596592 # total size of structure in bytes == 9.5 GB
In any case, this is not a result of using itertools; it is the memory footprint of the resulting list-of-dicts and would be the same regardless of how you generated it. Your computer can probably handle it anyway using virtual RAM (swapping to disk), though it is much slower.
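If the pairs are only consumed once, a generator avoids building the big list at all. A minimal sketch of that idea (my addition, not from the answer; corr_func is a hypothetical callable, e.g. something that looks up two price series and returns their correlation):
from itertools import combinations

def pairwise_corr(tickers_list, corr_func):
    # memory use stays constant: each pair is produced, used, and discarded
    for t1, t2 in combinations(tickers_list, 2):
        yield t1, t2, corr_func(t1, t2)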
If you are just doing this to get the pairs into a database, I suggest letting the database do the work: make a table containing the items of tickers_list, then select a self-join filtered to unique pairs, something like
SELECT a.ticker, b.ticker
FROM
tickers as a,
tickers as b
WHERE
a.ticker < b.ticker
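For instance, here is a hedged sketch of that approach using the stdlib sqlite3 module (the answer above does not name a specific database; the table and file names are illustrative):
import sqlite3

conn = sqlite3.connect("tickers.db")
conn.execute("CREATE TABLE tickers (ticker TEXT PRIMARY KEY)")
conn.executemany("INSERT OR IGNORE INTO tickers VALUES (?)",
                 ((t,) for t in tickers_list))
# let the database materialise the unique pairs
conn.execute("""
    CREATE TABLE ticker_pairs AS
    SELECT a.ticker AS ticker1, b.ticker AS ticker2
    FROM tickers AS a, tickers AS b
    WHERE a.ticker < b.ticker
""")
conn.commit()
conn.close()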

Related

Most Efficient Way of Chunking a Large Iterable in Python for Brute Forcing

I am trying to develop a way to split up large parallel tasks for brute-forcing a keyspace. I'd like to come up with a way to pass a worker a value such that, given a chunk size, that value tells the worker what to output.
Simply speaking:
Given a charset (a-z), a max length of 1 (so basically just a-z), and a chunk size of 5:
If I send worker 1 the number 0, it will take items 0-4 of the iterator (a, b, c, d, e); if I send worker 2 the number 1, it will take items 5-9, and so on. I have that code basically working:
#!/usr/bin/python
import itertools

maxlen = 5
charset = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
chunksize = 1000
chunkpart = 5

# all candidates of length 1..maxlen, in order
candidates = itertools.chain.from_iterable(
    (''.join(l) for l in itertools.product(charset, repeat=i))
    for i in range(1, maxlen + 1))

for s in itertools.islice(candidates, chunksize * chunkpart, chunksize * (chunkpart + 1)):
    print s
OK, that works great: if I send chunkpart 5 to worker 1, it will do what it needs to do on that chunk.
The issue comes into play when I need to get a small chunk (1,000 records) but very far into a large set. Let's say maxlen was 10 and chunkpart was 50,000,000. It takes a LONG time for Python to get to that point.
So, I think I know WHY this happens: it has to step through a huge amount of the iterator just to reach that position. What I am wondering is whether there is a better way to do this that shortcuts that work. My gut tells me itertools has the answer; my brain says I need to understand itertools better.
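One way to shortcut it (my own sketch, not taken from the thread's answers): because product() enumerates candidates in lexicographic order, the n-th candidate can be computed directly by treating the index as a mixed-radix number, so jumping to chunk 50,000,000 costs O(maxlen) arithmetic instead of iterating past everything before it. The helper names nth_candidate and chunk below are mine:
def nth_candidate(n, charset, maxlen):
    # locate the length block that the 0-based index n falls into,
    # then write the remaining offset in base len(charset)
    b = len(charset)
    for length in range(1, maxlen + 1):
        block = b ** length              # number of candidates of this length
        if n < block:
            digits = []
            for _ in range(length):
                n, r = divmod(n, b)
                digits.append(charset[r])
            return ''.join(reversed(digits))
        n -= block
    raise IndexError("index is past the end of the keyspace")

def chunk(chunkpart, chunksize, charset, maxlen):
    # the candidates a worker given this chunkpart should process
    start = chunkpart * chunksize
    return [nth_candidate(start + i, charset, maxlen) for i in range(chunksize)]
A worker handed chunkpart 50,000,000 can then build its 1,000 candidates immediately instead of waiting for islice to walk there.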

Python: Number ranges that are extremely large?

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.
To get a random permutation of the range [0, n) in a memory-efficient manner, you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)
If you need only a small fraction of the values from the range, e.g., to get k random values from the range [0, n):
import random
from functools import partial
def sample(n, k):
    # assume n is much larger than k
    randbelow = partial(random.randrange, n)
    # from random.py
    result = [None] * k
    selected = set()
    selected_add = selected.add
    for i in range(k):
        j = randbelow()
        while j in selected:
            j = randbelow()
        selected_add(j)
        result[i] = j
    return result
print(sample(10**100, 10))
If you don't need the full list of numbers (and if you are generating billions, it's hard to imagine why you would need them all), you might be better off taking a random.sample of your number range rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random
def get10kRandomNumbers(maximum):
    pop = range(1, maximum + 1)  # this is memory efficient in Python 3
    sample = random.sample(pop, 10000)
    return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.
An important point to note is that it will be impossible for a computer to have the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than the typical RAM size (as it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integers, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (say, smaller than 0.5 billion), then the list of elements can fit in memory thanks to the compact representation offered by the standard array module, and be shuffled:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)

Possibility of memory error?

a = raw_input()
prefix_dict = {}
for j in xrange(1, len(a) + 1):
    prefix = a[:j]
    prefix_dict[prefix] = len(prefix)
print prefix_dict
Is there any possibility of a memory error in the above code? It runs on a server, a quad-core Xeon machine running 32-bit Ubuntu (Ubuntu 12.04 LTS). For a few cases it works, and for a few it shows a memory error. FYI: I do not know the cases being tested, but the inputs are lower-case letters. Size of input <= 10,000.
The amount of memory for just the string data alone will be 1 + 2 + 3 + ... + (n-1) + n bytes, where n is the length of the input, in other words len(a). That works out to n * (n+1) / 2. If n is 10,000, this comes to about 50 MB of string data, plus however much RAM a Python dictionary uses to store 10,000 entries. Testing on my OS X box, this seems minimal; indeed, if I run this code, the process shows 53.9 MB used:
str = "a"
d = {}
for i in xrange(10000):
d[str] = i
str = str + "a"
I don't see anything overtly wrong with your code, and when I run it on a string that is 10,000 letters long it happily spits out about 50 MB of output, so something else must be going wrong.
What does top show as memory usage for the process?
Maybe a smaller piece of code will help:
prefix_dict = { a[:j]:j for j in xrange(1, len(a) + 1) }
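If you want to check the in-process footprint directly rather than reading it off top, here is a rough sketch (my addition) that sums sys.getsizeof over the dict, its keys, and its values; it ignores interpreter overhead, so treat it as a lower bound:
import sys

def prefix_dict_size(prefix_dict):
    # dict object itself plus every key string and value int it holds
    total = sys.getsizeof(prefix_dict)
    for key, value in prefix_dict.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total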

Python 2.7 memory error for large complexity

I am new to Python. I am running the following code and it is giving a memory error with Python 2.7.
Since I am using OpenCV, I am working with Python 2.7. I have read the previous posts but I did not understand much from them.
s = {}
ns = {}
ts = {}
for i in range(0, 256):          # for red component
    for j in range(0, 256):      # for green component
        for k in range(0, 256):  # for blue component
            s[(i,j,k)] = 0
            ns[(i,j,k)] = 0
            ts[(i,j,k)] = i*j*k
Please help. The code tries to store the frequency of the red, green and blue components, and for that I am initializing these values to zero.
Thing 1: use itertools instead of constructing all the range lists each time around the loop. xrange returns an iterator-like object instead of building a full list, and product returns an iterator yielding tuples of elements drawn from the given iterables.
Thing 2: use numpy for large data. It's a matrix implementation designed for this sort of thing.
>>> import numpy as np
>>> from itertools import product
>>> x=np.zeros((256,256,256))
>>> for i, j, k in product(xrange(256), repeat=3):
... x[i,j,k]= i*j*k
...
Takes about five seconds for me, and the expected amount of memory.
$ cat /proc/27240/status
Name: python
State: S (sleeping)
...
VmPeak: 420808 kB
VmSize: 289732 kB
Note that you may actually run into system-wide memory limits if you try to allocate three 256*256*256 arrays, since each one has about 17 million entries. Fortunately numpy lets you persist arrays to disk.
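As a rough sketch of that last point (my addition, not from the answer; the file name is illustrative and int32 is assumed to be wide enough for 255*255*255), numpy's memmap backs the array with a file so only the pages you touch stay resident; np.save/np.load also work for one-shot persistence:
import numpy as np

# back the 256**3 table with a file so it need not all sit in RAM
ts = np.memmap("ts.dat", dtype=np.int32, mode="w+", shape=(256, 256, 256))
for i in range(256):
    # fill one 256x256 slice at a time
    ts[i] = i * np.outer(np.arange(256), np.arange(256))
ts.flush()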
Have you come across the PIL (Python Imaging Library)? You may find it helpful.
As a matter of fact, your program needs at least(!) 300*300*300*4*3 bytes solely for the value data of the dicts. Besides, your key tuples occupy 300*300*300*3*3*4 bytes.
This is in total 1296000000 bytes, or 1.2 GiB of data.
This calculation doesn't even include the overhead of maintaining the data in the dict.
So whether it fails or not depends on the amount of memory your machine has.
You could do a first step and do
s = {}
ns = {}
ts = {}
for i in range(0, 300):
    for j in range(0, 300):
        for k in range(0, 300):
            index = (i, j, k)
            s[index] = j
            ns[index] = k
            ts[index] = i*j*k
which (in theory) will occupy only about half the memory as before - again, only for the data - since each index tuple is created once and shared by the dicts.
From what you describe (you just want counts), you don't need the full range of combinations to be pre-initialized. You can omit the initialization shown in the question and instead build a store that only holds values for the combinations you actually encounter, which are presumably far fewer than the number of possible ones.
You could either use a defaultdict() or imitate its behavior manually, as I suspect most of the possible combinations do not occur in your color "palette".
from collections import defaultdict
make0 = lambda: 0
s = defaultdict(make0)
ns = defaultdict(make0)
# what is ts? do you need it?
Now you have dict-like objects which can be written to as needed. Then, for every combination of colors that actually occurs, you can do s[index] += 1 or ns[index] += 1, respectively.
I don't know about your ts - maybe you can calculate it on the fly, or you'll have to find a different solution.
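A small usage sketch of that counting idea (my addition; the pixels variable is a stand-in for whatever sequence of (r, g, b) tuples you read from the image):
from collections import defaultdict

s = defaultdict(int)                               # equivalent to defaultdict(make0) above
pixels = [(10, 20, 30), (10, 20, 30), (0, 0, 0)]   # example data; in practice, read from your image
for r, g, b in pixels:
    s[(r, g, b)] += 1                              # only colors that occur ever get an entry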
Even if all your variables used a single byte, that program would need 405 MB of RAM.
You should use compression to store more in limited space.
Edit: If you want to make a histogram in Python, see this nice example of using the Python Imaging Library (PIL). The hard work is done by these 3 lines:
import Image
img = Image.open(imagepath)
hist = img.histogram()

Huge memory usage of loading large dictionaries in memory

I have a file on disk that's only 168MB. It's just a comma separated list of word,id.
The word can be 1-5 characters long. There's 6.5 million lines.
I created a dictionary in Python to load this into memory so I can search incoming text against that list of words. When Python loads it into memory, it shows 1.3 GB of RAM used. Any idea why that is?
So let's say my word file looks like this...
1,word1
2,word2
3,word3
Then add 6.5 million to that.
I then loop through that file and create a dictionary (python 2.6.1):
def load_term_cache():
    """will load the term cache from our cached file instead of hitting mysql. If it didn't
    preload into memory it would be 20+ million queries per process"""
    global cached_terms
    dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
    f = open(dumpfile)
    cache = csv.reader(f)
    for term_id, term in cache:
        cached_terms[term] = term_id
    f.close()
Just doing that blows up the memory. I watch Activity Monitor and it pegs the memory to all available, up to around 1.5 GB of RAM; on my laptop it then starts to swap. Any ideas on how to most efficiently store key/value pairs in memory with Python?
Update: I tried to use the anydbm module and after 4.4 million records it just dies;
the first column is the number of rows inserted and the second is the elapsed seconds since I started loading:
…        56.95
3400018  60.12
3600019  63.27
3800020  66.43
4000021  69.59
4200022  72.75
4400023  83.42
4600024  168.61
4800025  338.57
You can see it was running great, 200,000 rows inserted every few seconds, until I hit a wall and the time doubled.
import anydbm
i = 0
mark = 0
starttime = time.time()
dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
db = anydbm.open(dbfile, 'c')
# load from existing baseterm file
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
for line in open(termfile):
    i += 1
    pieces = line.split(',')
    db[str(pieces[1])] = str(pieces[0])
    if i > mark:
        print i
        print round(time.time() - starttime, 2)
        mark = i + 200000
db.close()
Lots of ideas. However, if you want practical help, edit your question to show ALL of your code. Also tell us what the "it" is that shows memory used, what it shows when you load a file with zero entries, what platform you are on, and what version of Python.
You say that "the word can be 1-5 characters long". What is the average length of the key field in BYTES? Are the ids all integers? If so, what are the min and max integers? If not, what is the average length of an ID in bytes? To enable cross-checking of all of the above, how many bytes are there in your 6.5M-line file?
Looking at your code, a 1-line file word1,1 will create a dict d['1'] = 'word1' ... isn't that bassackwards?
Update 3: More questions: How is the "word" encoded? Are you sure you are not carrying a load of trailing spaces in either of the two fields?
Update 4 ... You asked "how to most efficiently store key/value pairs in memory with python" and nobody's answered that yet with any accuracy.
You have a 168 Mb file with 6.5 million lines. That's 168 * 1.024 ** 2 / 6.5 = 27.1 bytes per line. Knock off 1 byte for the comma and 1 byte for the newline (assuming it's a *x platform) and we're left with 25 bytes per line. Assuming the "id" is intended to be unique, and as it appears to be an integer, let's assume the "id" is 7 bytes long; that leaves us with an average size of 18 bytes for the "word". Does that match your expectation?
So, we want to store an 18-byte key and a 7-byte value in an in-memory look-up table.
Let's assume a 32-bit CPython 2.6 platform.
>>> K = sys.getsizeof('123456789012345678')
>>> V = sys.getsizeof('1234567')
>>> K, V
(42, 31)
Note that sys.getsizeof(str_object) => 24 + len(str_object)
Tuples were mentioned by one answerer. Note carefully the following:
>>> sys.getsizeof(())
28
>>> sys.getsizeof((1,))
32
>>> sys.getsizeof((1,2))
36
>>> sys.getsizeof((1,2,3))
40
>>> sys.getsizeof(("foo", "bar"))
36
>>> sys.getsizeof(("fooooooooooooooooooooooo", "bar"))
36
>>>
Conclusion: sys.getsizeof(tuple_object) => 28 + 4 * len(tuple_object) ... it only allows for a pointer to each item, it doesn't allow for the sizes of the items.
A similar analysis of lists shows that sys.getsizeof(list_object) => 36 + 4 * len(list_object) ... again it is necessary to add the sizes of the items. There is a further consideration: CPython overallocates lists so that it doesn't have to call the system realloc() on every list.append() call. For sufficiently large size (like 6.5 million!) the overallocation is 12.5 percent -- see the source (Objects/listobject.c). This overallocation is not done with tuples (their size doesn't change).
Here are the costs of various alternatives to dict for a memory-based look-up table:
List of tuples:
Each tuple will take 36 bytes for the 2-tuple itself, plus K and V for the contents. So N of them will take N * (36 + K + V); then you need a list to hold them, so we need 36 + 1.125 * 4 * N for that.
Total for list of tuples: 36 + N * (40.5 + K + V)
That's 36 + 113.5 * N (about 709 MB when N is 6.5 million)
Two parallel lists:
(36 + 1.125 * 4 * N + K * N) + (36 + 1.125 * 4 * N + V * N)
i.e. 72 + N * (9 + K + V)
Note that the difference between 40.5 * N and 9 * N is about 200MB when N is 6.5 million.
Value stored as int not str:
But that's not all. If the IDs are actually integers, we can store them as such.
>>> sys.getsizeof(1234567)
12
That's 12 bytes instead of 31 bytes for each value object. That difference of 19 * N is a further saving of about 118MB when N is 6.5 million.
Use array.array('l') instead of list for the (integer) value:
We can store those 7-digit integers in an array.array('l'). No int objects, and no pointers to them -- just a 4-byte signed integer value. Bonus: arrays are overallocated by only 6.25% (for large N). So that's 1.0625 * 4 * N instead of the previous (1.125 * 4 + 12) * N, a further saving of 12.25 * N i.e. 76 MB.
So we're down to 709 - 200 - 118 - 76 = about 315 MB.
N.B. Errors and omissions excepted -- it's 0127 in my TZ :-(
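To make the parallel-lists layout analysed above concrete, here is a hedged sketch (my addition, not part of the original answer): a sorted list of word strings alongside an array.array of integer ids, with bisect for lookups. The function names are mine.
import bisect
from array import array

def build_table(pairs):
    # pairs: iterable of (word, int_id); sort once so bisect can search the words
    pairs = sorted(pairs)
    words = [w for w, _ in pairs]
    ids = array('l', (i for _, i in pairs))
    return words, ids

def lookup(words, ids, word):
    pos = bisect.bisect_left(words, word)
    if pos < len(words) and words[pos] == word:
        return ids[pos]
    return None
Lookups are O(log N) instead of a dict's amortised O(1), which is the trade-off for the smaller footprint the numbers above describe.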
Take a look (Python 2.6, 32-bit version)...:
>>> sys.getsizeof('word,1')
30
>>> sys.getsizeof(('word', '1'))
36
>>> sys.getsizeof(dict(word='1'))
140
The string (taking 6 bytes on disk, clearly) gets an overhead of 24 bytes (no matter how long it is, add 24 to its length to find how much memory it takes). When you split it into a tuple, that's a little bit more. But the dict is what really blows things up: even an empty dict takes 140 bytes -- pure overhead of maintaining a blazingly-fast hash-based lookup table. To be fast, a hash table must have low density -- and Python ensures a dict always has low density (by taking up a lot of extra memory for it).
The most memory-efficient way to store key/value pairs is as a list of tuples, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be far slower than a dict).
Consider using shelve instead -- that will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).
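A brief sketch of that shelve suggestion (my addition; the file name is just an example): the mapping lives on disk, so only the entries you touch are brought into memory.
import shelve

db = shelve.open('terms.shelf')    # opens/creates a disk-backed dict-like object
try:
    db['word1'] = '1'              # writes go to disk, not to a giant in-RAM dict
    term_id = db.get('word1')
finally:
    db.close()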
Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.
The reason memory explodes is that Python keeps extra metadata for every object, and the dict has to build a hash table (which requires yet more memory). You have created so many objects (6.5M) that the metadata becomes huge.
import bsddb
a = bsddb.btopen('a.bdb')   # you can also try bsddb.hashopen
for x in xrange(10500):
    a['word%d' % x] = '%d' % x
a.close()
This code takes only 1 second to run, so I think the speed is OK (since you said 10,500 lines per second).
btopen creates a db file 499,712 bytes long, and hashopen creates one of 319,488 bytes.
With the xrange input set to 6.5M and using btopen, I got an output file of 417,080 KB and it took around 1 to 2 minutes to complete the insertion. So I think it's entirely suitable for you.
I had the same problem, though I'm late to this. The others have answered the question well, but I'll offer an easy-to-use (well, maybe not so easy :-) ) and rather efficient alternative: pandas.DataFrame. It performs well in terms of memory usage when holding large data.
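A short sketch of that pandas suggestion (my addition; it assumes the id,word layout shown earlier in the question and the baseterms.txt file name):
import pandas as pd

# read the 6.5M-line id,word file into a DataFrame, then index it by word
terms = pd.read_csv('baseterms.txt', names=['term_id', 'term'])
lookup = terms.set_index('term')['term_id']
print(lookup.loc['word1'])   # hypothetical lookup of one term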
