I have a file on disk that's only 168 MB. It's just a comma-separated list of word,id pairs.
The words can be 1-5 characters long, and there are 6.5 million lines.
I created a dictionary in Python to load this into memory so I can search incoming text against that list of words. When Python loads it up, it shows 1.3 GB of RAM used. Any idea why that is?
So let's say my word file looks like this...
1,word1
2,word2
3,word3
Then add 6.5 million to that.
I then loop through that file and create a dictionary (python 2.6.1):
import csv
import os

cached_terms = {}

def load_term_cache():
    """Load the term cache from our cached file instead of hitting MySQL. If it didn't
    preload into memory it would be 20+ million queries per process."""
    global cached_terms
    dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
    f = open(dumpfile)
    cache = csv.reader(f)
    for term_id, term in cache:
        cached_terms[term] = term_id
    f.close()
Just doing that blows up the memory. Watching Activity Monitor, it pegs the memory to all available, up to around 1.5 GB of RAM; on my laptop it just starts to swap. Any ideas how to most efficiently store key/value pairs in memory with Python?
Update: I tried to use the anydbm module and after 4.4 million records it just dies.
the floating-point number after each row count is the elapsed seconds since the load started (the row count belonging to the first reading was cut off):

    ?         56.95
  3400018     60.12
  3600019     63.27
  3800020     66.43
  4000021     69.59
  4200022     72.75
  4400023     83.42
  4600024    168.61
  4800025    338.57
You can see it was running great, with 200,000 rows inserted every few seconds, until I hit a wall and the time doubled.
import anydbm
import os
import time

i = 0
mark = 0
starttime = time.time()
dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
db = anydbm.open(dbfile, 'c')
# load from existing baseterm file
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
for line in open(termfile):
    i += 1
    pieces = line.rstrip('\n').split(',')  # strip the trailing newline before splitting
    db[str(pieces[1])] = str(pieces[0])
    if i > mark:
        print i
        print round(time.time() - starttime, 2)
        mark = i + 200000
db.close()
Lots of ideas. However, if you want practical help, edit your question to show ALL of your code. Also tell us what the "it" is that shows memory used, what it shows when you load a file with zero entries, what platform you are on, and what version of Python.
You say that "the word can be 1-5 characters long". What is the average length of the key field in BYTES? Are the ids all integers? If so, what are the min and max? If not, what is the average length of an ID in bytes? To enable cross-checking of all of the above, how many bytes are there in your 6.5M-line file?
Looking at your code, a 1-line file word1,1 will create a dict d['1'] = 'word1' ... isn't that bassackwards?
Update 3: More questions: How is the "word" encoded? Are you sure you are not carrying a load of trailing spaces on any of the two fields?
Update 4 ... You asked "how to most efficiently store key/value pairs in memory with python" and nobody's answered that yet with any accuracy.
You have a 168 MB file with 6.5 million lines. That's 168 * 1.024 ** 2 / 6.5 = 27.1 bytes per line. Knock off 1 byte for the comma and 1 byte for the newline (assuming it's a *nix platform) and we're left with 25 bytes per line. Assuming the "id" is intended to be unique, and as it appears to be an integer, let's assume the "id" is 7 bytes long; that leaves us with an average size of 18 bytes for the "word". Does that match your expectation?
So, we want to store an 18-byte key and a 7-byte value in an in-memory look-up table.
Let's assume a 32-bit CPython 2.6 platform.
>>> import sys
>>> K = sys.getsizeof('123456789012345678')
>>> V = sys.getsizeof('1234567')
>>> K, V
(42, 31)
Note that sys.getsizeof(str_object) => 24 + len(str_object)
Tuples were mentioned by one answerer. Note carefully the following:
>>> sys.getsizeof(())
28
>>> sys.getsizeof((1,))
32
>>> sys.getsizeof((1,2))
36
>>> sys.getsizeof((1,2,3))
40
>>> sys.getsizeof(("foo", "bar"))
36
>>> sys.getsizeof(("fooooooooooooooooooooooo", "bar"))
36
>>>
Conclusion: sys.getsizeof(tuple_object) => 28 + 4 * len(tuple_object) ... it only allows for a pointer to each item, it doesn't allow for the sizes of the items.
A similar analysis of lists shows that sys.getsizeof(list_object) => 36 + 4 * len(list_object) ... again it is necessary to add the sizes of the items. There is a further consideration: CPython overallocates lists so that it doesn't have to call the system realloc() on every list.append() call. For sufficiently large size (like 6.5 million!) the overallocation is 12.5 percent -- see the source (Objects/listobject.c). This overallocation is not done with tuples (their size doesn't change).
Here are the costs of various alternatives to dict for a memory-based look-up table:
List of tuples:
Each tuple will take 36 bytes for the 2-tuple itself, plus K and V for the contents. So N of them will take N * (36 + K + V); then you need a list to hold them, so we need 36 + 1.125 * 4 * N for that.
Total for list of tuples: 36 + N * (40.5 + K + V)
That's 36 + 113.5 * N (about 709 MB when N is 6.5 million)
Two parallel lists:
(36 + 1.125 * 4 * N + K * N) + (36 + 1.125 * 4 * N + V * N)
i.e. 72 + N * (9 + K + V)
Note that the difference between 40.5 * N and 9 * N is about 200MB when N is 6.5 million.
Value stored as int not str:
But that's not all. If the IDs are actually integers, we can store them as such.
>>> sys.getsizeof(1234567)
12
That's 12 bytes instead of 31 bytes for each value object. That difference of 19 * N is a further saving of about 118MB when N is 6.5 million.
Use array.array('l') instead of list for the (integer) value:
We can store those 7-digit integers in an array.array('l'). No int objects, and no pointers to them -- just a 4-byte signed integer value. Bonus: arrays are overallocated by only 6.25% (for large N). So that's 1.0625 * 4 * N instead of the previous (1.125 * 4 + 12) * N, a further saving of 12.25 * N i.e. 76 MB.
So we're down to 709 - 200 - 118 - 76 = about 315 MB.
N.B. Errors and omissions excepted -- it's 0127 in my TZ :-(
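To make that concrete, here is a minimal sketch of the "two parallel lists plus array.array('l')" layout under the assumptions above (ids fit in a C long; the file has the id,word layout and the dumpfile/csv setup from the question; the sorted() pass needs some temporary memory during the load):

import array
import bisect
import csv

keys = []                   # the words, kept sorted, as a plain list of str
ids = array.array('l')      # parallel 4-byte signed ints (on a 32-bit build)

f = open(dumpfile)          # dumpfile as defined in the question
pairs = sorted((term, int(term_id)) for term_id, term in csv.reader(f))
f.close()
for term, term_id in pairs:
    keys.append(term)
    ids.append(term_id)

def lookup(word):
    """Binary search of the sorted keys; returns the id or None."""
    i = bisect.bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return ids[i]
    return None

Lookups are O(log N) via bisect instead of a dict's O(1), which is usually an acceptable trade for the memory saved.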
Take a look (Python 2.6, 32-bit version)...:
>>> sys.getsizeof('word,1')
30
>>> sys.getsizeof(('word', '1'))
36
>>> sys.getsizeof(dict(word='1'))
140
The string (taking 6 bytes on disk, clearly) gets an overhead of 24 bytes (no matter how long it is, add 24 to its length to find how much memory it takes). When you split it into a tuple, that's a little bit more. But the dict is what really blows things up: even an empty dict takes 140 bytes -- pure overhead of maintaining a blazingly-fast hash-based lookup table. To be fast, a hash table must have low density -- and Python ensures a dict always has low density (by taking up a lot of extra memory for it).
The most memory-efficient way to store key/value pairs is as a list of tuples, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be much slower than a dict).
Consider using shelve instead -- that will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).
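For instance, a minimal sketch of the shelve route, reusing the dumpfile path and the csv loop from the question (the 'baseterms.shelf' filename is my own; which dbm backend ends up behind the shelf is platform-dependent):

import csv
import shelve

# one-time build: keys must be plain str for the stdlib shelve module
db = shelve.open('baseterms.shelf', 'c')
f = open(dumpfile)                   # dumpfile as defined in the question
for term_id, term in csv.reader(f):
    db[term] = term_id
f.close()
db.close()

# in each worker process the data stays on disk and is read on demand
terms = shelve.open('baseterms.shelf', 'r')
print terms.get('word1')             # the stored id, or None if absent
terms.close()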
Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.
The reason it explodes is that Python keeps extra metadata for every object, and the dict has to build a hash table (which requires yet more memory). You have created so many objects (6.5M) that the metadata becomes too huge.
import bsddb

a = bsddb.btopen('a.bdb')  # you can also try bsddb.hashopen
for x in xrange(10500):
    a['word%d' % x] = '%d' % x
a.close()
This code takes only 1 second to run, so I think the speed is OK (roughly the 10,500 lines per second you mentioned).
btopen creates a db file 499,712 bytes in length, and hashopen creates one of 319,488 bytes.
With the xrange input set to 6.5M and using btopen, I got an output file of 417,080 KB and it took around 1 or 2 minutes to complete the insertion. So I think it's totally suitable for you.
I have the same problem, though I'm late to it. The others have answered this question well, so I'll offer an easy-to-use (well, maybe not so easy :-) ) and rather efficient alternative: pandas.DataFrame. It performs well on memory usage when holding large data.
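A hypothetical sketch of what that looks like for the word/id file from the first question (the filename and column names are my own):

import pandas as pd

df = pd.read_csv('baseterms.txt', names=['term_id', 'word'])
lookup = pd.Series(df['term_id'].values, index=df['word'])

print(lookup['word1'])           # the id stored for 'word1'
print('word2' in lookup.index)   # fast membership test against the index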
Related
I am working with strings that are recursively concatenated to lengths of around 80 million characters. Python slows down dramatically as the string length increases.
Consider the following loop:
s = ''
for n in range(0, r):
    s += 't'
I measure the time it takes to run to be 86ms for r = 800,000, 3.11 seconds for r = 8,000,000 and 222 seconds for r = 80,000,000
I am guessing this has something to do with how python allocates additional memory for the string. Is there a way to speed this up, such as allocating the full 80MB to the string s when it is declared?
When you change a string value in your program, the previous value remains in a part of memory and the changed string is placed in a new part of RAM.
As a result, the old values sit in RAM unused.
The garbage collector eventually cleans those old, unused values out of RAM, but that takes time.
You can watch this yourself with the gc module, which lets you inspect the collector's per-generation object counts. See this:
import gc
print(gc.get_count())
result:
(596, 2, 1)
In this example, there are 596 tracked objects in the youngest generation, two objects in the next generation, and one object in the oldest generation.
For this reason, allocation may be slow and your program may slow down.
See this link for efficient string concatenation in Python; a short sketch of that idiom follows below.
Good luck.
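For reference, the idiom that advice boils down to is roughly this sketch (using the r and s names from the question): collect the pieces in a list and join once at the end, so the string is built in a single pass instead of being re-copied on every +=.

pieces = []
for n in range(r):
    pieces.append('t')
s = ''.join(pieces)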
It can't be done in a straightforward way with text (string) objects, but it is trivial if you are dealing with bytes - in that case, you can create a bytearray object larger than your final outcome and insert your values into it.
If you need the final object as text, you can then decode it to text; that single step will be fast enough.
As you don't state what the nature of your data is, it may become harder - if no single-byte encoding can cover all the characters you need, you have to resort to a variable-length encoding such as UTF-8, or a multibyte encoding such as UTF-16 or UTF-32. In both cases, it is no problem as long as you keep proper track of your insertion index - which will also be your final data size for re-encoding. (If all you are using are genetic "GATACA" strings, just use ASCII encoding and you are done.)
data = bytearray(100_000_000)  # 100 million positions, preallocated
index = 0
for character in input_data:   # input_data: whatever source yields your characters
    v = character.encode("utf-8")
    s = len(v)
    if s == 1:
        data[index] = v[0]     # a single byte: assign its integer value
    else:
        data[index: index + s] = v
    index += s
data_as_text = data[:index].decode("utf-8")
I have an assignment to use a greedy approach to satisfy TSP. The problem has 33708 cities. because I had a lot of helpful tools for this from the previous assignment, I decided to reuse that approach and precompute the distances.
So that is barely more than half a billion entries (33708 choose 2), each comfortably fitting in a float32. The x and y coordinates, likewise, are numbers $|n| < 10000$ with no more than 4 decimal places.
My python for the same was:
def get_distance(left, right):
    """ return the euclidean distance between tuples left and right, which are coordinates"""
    return ((left[0] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5

# precompute all distances
distances = {}
for i in range(len(cities)):
    for j in range(i + 1, len(cities)):
        d = get_distance(cities[i], cities[j])
        distances[frozenset((i, j))] = d
and I expected this to occupy (3 * 32 bits) * 568M ≈ 6.7 gigabytes of memory. But in fact, watching the live runtime in my Jupyter notebook, it appears to be shooting past even 35 GB (442 s and counting). I had to kill it as I was well into my swap space and it had slowed down a lot. Anyone know why this is so surprisingly large?
update: trying again with tuple(sorted((i,j))) -- but already at 110s it is 15GB and counting
sizes
>>> import sys
>>> a = frozenset((1,2))
>>> sys.getsizeof(a)
216
>>> sys.getsizeof(tuple(sorted((1,2))))
56
>>> sys.getsizeof(1)
28
is there anything like float32 and int16 in python?? -- ans: numpy has them
updated attempt:
from numpy import float32, int16
from itertools import combinations
import sys

def get_distance(left, right):
    """ return the euclidean distance between tuples left and right, which are coordinates"""
    return float32(((left[0] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5)

# precompute all distances
distances = {}
for i, j in combinations(range(len(cities)), 2):
    distances[tuple(sorted((int16(i), int16(j))))] = get_distance(cities[i], cities[j])

print(sys.getsizeof(distances))
observed sizes:
with cities = cities[:2] : 232
with cities = cities[:3] : also 232
with cities = cities[:10] : 2272
with cities = cities[:100] : 147552
with cities = cities[:1000] : 20971608 (20MB)
with cities = cities[:10000] : 2684354656 (2.6GB)
note that the dict's size does not scale linearly with the number of entries, even as we approach 50 million entries, i.e. 10000 choose 2 (10% of the total size of the data):
2684354656 / (10000 choose 2 / 1000 choose 2 * 20971608) ≈ 1.27
20971608/(1000 choose 2 / 100 choose 2 * 147552) ≈ 1.4
I decided to halt my attempt at the full cities list, as my OS snapshot of the memory grew to well over 30GB and I was going to swap. This means that, even if the final object ends up that big, the amount of memory the notebook is requiring is much larger still.
Python objects have an overhead because of dynamic typing and reference counting. The absolute minimal object, object(), has a size of 16 bytes (on 64-bit machines): an 8-byte reference count and an 8-byte type pointer. No Python object can be smaller than that. float and int are slightly larger, at least 24 bytes each. A list is at bottom an array of pointers, which adds an additional 8 bytes per element. So the smallest possible memory footprint of a list of half a billion ints is 32 * 500_000_000 ≈ 16 GB. sets and dicts are even larger than that, since they store more than just one pointer per element.
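A quick way to check those per-object overheads (numbers from a typical 64-bit CPython; exact values can vary slightly between versions):

import sys

print(sys.getsizeof(object()))   # 16: refcount + type pointer
print(sys.getsizeof(1.0))        # 24: 16-byte header plus the 8-byte double
print(sys.getsizeof(1))          # 28: ints carry a little extra bookkeeping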
Use numpy (maybe the stdlib array module is already enough).
(Note: a standalone numpy float32 scalar object can't be smaller than 16 bytes either; it is only inside a numpy array that each value takes just 4 bytes.)
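Here is a sketch of what the numpy route can look like for this problem (it assumes cities is a list of (x, y) tuples, as in the question's code): store one float32 per unordered pair in a condensed 1-D array instead of a dict, which is about 4 bytes * 33708 * 33707 / 2 ≈ 2.3 GB with no per-entry object overhead.

import numpy as np

xy = np.asarray(cities, dtype=np.float32)
n = len(xy)
dist = np.empty(n * (n - 1) // 2, dtype=np.float32)

start = 0
for i in range(n - 1):
    # distances from city i to every city j > i, in one vectorised step
    row = np.sqrt(((xy[i] - xy[i + 1:]) ** 2).sum(axis=1))
    dist[start:start + row.size] = row
    start += row.size

def pair_index(i, j, n=n):
    """Position of the unordered pair (i, j) in the condensed array."""
    if i > j:
        i, j = j, i
    return i * (n - 1) - i * (i - 1) // 2 + (j - i - 1)

# usage: dist[pair_index(17, 42)] is the distance between cities 17 and 42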
I've got a list of tickers in tickers_list and I need to make a combination of the unique elements.
Here's my code:
corr_list = []
i = 0
for ticker in tickers_list:
    tmp_list = tickers_list[i+1:]
    for tl_ticker in tmp_list:
        corr_list.append({'ticker1': ticker, 'ticker2': tl_ticker})
    i = i + 1
len(corr_list)
Here's the code using itertools:
from itertools import combinations
result = combinations(tickers_list, 2)
new_list = [{'ticker1':comb[0], 'ticker2':comb[1]} for comb in combinations(tickers_list, 2)]
len(new_list)
They produce the exact same results. Of course, the itertools code is much more elegant, and both work perfectly for 10, 100, and 1,000 items. The time required is almost identical (I timed it). However, at 4,000 items, my code takes about 12 seconds to run on my machine while the itertools version crashes my computer. The resulting set is only around 20M rows, so am I doing something wrong or is this an issue with itertools?
from itertools import combinations, product
# 20 * 20 * 20 == 8000 items
tickers_list = ["".join(chars) for chars in product("ABCDEFGHIJKLMNOPQRST", repeat=3)]
# 8000 * 7999 / 2 == 31996000 (just under 32 million) items
unique_pairs = [{"ticker1":t1, "ticker2":t2} for t1,t2 in combinations(tickers_list, 2)]
works fine on my machine (Win7Pro x64, Python 3.4) up to len(tickers_list) == 8000 (taking 38s and consuming almost 76 GiB of RAM to do so!).
Are you running 32-bit or 64-bit Python? Do you actually need to store all combinations, or can you use and discard them as you go?
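If the pairs really can be consumed as they are produced (written straight to a database, say), a streaming sketch like this never materialises the list at all (process_pair is a hypothetical per-pair handler):

from itertools import combinations

for t1, t2 in combinations(tickers_list, 2):
    process_pair(t1, t2)   # e.g. insert the row, update a running statistic, ...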
Edit: I miscalculated the size; it is actually about 9.5 GiB. Try it yourself:
from sys import getsizeof as size
size(unique_pairs) + len(unique_pairs) * size(unique_pairs[0])
which gives
9479596592 # total size of structure in bytes == 9.5 GiB
In any case, this is not a result of using itertools; it is the memory footprint of the resulting list-of-dicts and would be the same regardless of how you generated it. Your computer can probably handle it anyway using virtual RAM (swapping to disk), though it is much slower.
If you are just doing this to get it into a database, I suggest letting the database do the work: make a table containing the items of tickers_list, then select a cross-join to produce the final table, something like
SELECT a.ticker, b.ticker
FROM
tickers as a,
tickers as b
WHERE
a.ticker < b.ticker
I'm working on a project that involves accessing data from a large list that's kept in memory. Because the list is quite voluminous (millions of lines) I keep an eye on how much memory is being used. I use OS X so I keep Activity Monitor open as I create these lists.
I've noticed that the amount of memory used by a list can vary wildly depending on how it is constructed but I can't seem to figure out why.
Now for some example code:
(I am using Python 2.7.4 on OSX 10.8.3)
The first function below creates a list and fills it with all the same three random numbers.
The second function below creates a list and fills it with all different random numbers.
import random
import sys

def make_table1(size):
    list1 = size * [(float(), float(), float())]  # initialize the list
    line = (random.random(),
            random.random(),
            random.random())
    for count in xrange(0, size):  # Now fill it
        list1[count] = line
    return list1

def make_table2(size):
    list1 = size * [(float(), float(), float())]  # initialize the list
    for count in xrange(0, size):  # Now fill it
        list1[count] = (random.random(),
                        random.random(),
                        random.random())
    return list1
(First let me say that I realize the code above could have been written much more efficiently. It's written this way to keep the two examples as similar as possible.)
Now I create some lists using these functions:
In [2]: thing1 = make_table1(6000000)
In [3]: sys.getsizeof(thing1)
Out[3]: 48000072
At this point my memory used jumps by about 46 MB, which is what I would expect from the information given above.
Now for the next function:
In [4]: thing2 = make_table2(6000000)
In [5]: sys.getsizeof(thing2)
Out[5]: 48000072
As you can see, the memory taken up by the two lists is the same. They are exactly the same length so that's to be expected. What I didn't expect is that my memory used as given by Activity Monitor jumps to over 1 GB!
I understand there is going to be some overhead but 20x as much? 1 GB for a 46MB list?
Seriously?
Okay, on to diagnostics...
The first thing I tried is to collect any garbage:
In [5]: import gc
In [6]: gc.collect()
Out[6]: 0
It made zero difference to the amount of memory used.
Next I used guppy to see where the memory is going:
In [7]: from guppy import hpy
In [8]: hpy().heap()
Out[8]:
Partition of a set of 24217689 objects. Total size = 1039012560 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 6054789 25 484821768 47 484821768 47 tuple
1 18008261 74 432198264 42 917020032 88 float
2 2267 0 96847576 9 1013867608 98 list
3 99032 0 11392880 1 1025260488 99 str
4 585 0 1963224 0 1027223712 99 dict of module
5 1712 0 1799552 0 1029023264 99 dict (no owner)
6 13606 0 1741568 0 1030764832 99 types.CodeType
7 13355 0 1602600 0 1032367432 99 function
8 1494 0 1348088 0 1033715520 99 type
9 1494 0 1300752 0 1035016272 100 dict of type
<691 more rows. Type e.g. '_.more' to view.>
okay, my memory is taken up by:
462 MB of tuple (huh?)
412 MB of float (what?)
92 MB of list (Okay, this one makes sense. 2*46MB = 92)
My lists are preallocated so I don't think that there is over-allocation going on.
Questions:
Why is the amount of memory used by these two very similar lists so different?
Is there a different way to populate a list that doesn't have so much overhead?
Is there a way to free up all that memory?
Note: Please don't suggest storing on the disk or using array.array or numpy or pandas data structures. Those are all great options but this question isn't about them. This question is about plain old lists.
I have tried similar code with Python 3.3 and the result is the same.
Here is someone with a similar problem. It contains some hints but it's not the same question.
Thank you all!
Both functions make a list of 6,000,000 references.
sizeof(thelist) ≅ sizeof(reference_to_a_python_object) * 6000000 -- indeed, 48,000,072 ≈ 8 bytes per reference * 6,000,000 plus the list header, so this is a 64-bit build.
First list contains 6000000 references to the same one tuple of three floats.
Second list contains references to 6000000 different tuples containing 18000000 different floats.
As you can see, a float takes 24 bytes and a triple takes 80 bytes (using your build of python). No, there's no way around that except numpy.
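A quick way to see that sharing at small scale, reusing the make_table functions from the question:

>>> t1 = make_table1(10)
>>> all(x is t1[0] for x in t1)    # every slot refers to the same tuple object
True
>>> t2 = make_table2(10)
>>> len(set(id(x) for x in t2))    # ten distinct tuples (and thirty distinct floats)
10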
To turn the lists into collectible garbage, you need to get rid of any references to them:
del thing1
del thing2
The program I've written stores a large amount of data in dictionaries. Specifically, I'm creating 1588 instances of a class, each of which contains 15 dictionaries with 1500 float to float mappings. This process has been using up the 2GB of memory on my laptop pretty quickly (I start writing to swap at about the 1000th instance of the class).
My question is, which of the following is using up my memory?
Some 34 million pairs of floats?
The overhead of 22,500 dictionaries?
The overhead of the ~1,500 class instances?
To me it seems like the memory hog should be the huge number of floating point numbers that I'm holding in memory. However, If what I've read so far is correct, each of my floating point numbers take up 16 bytes. Since I have 34 million pairs, this should be about 108 million bytes, which should be just over a gigabyte.
Is there something I'm not taking into consideration here?
The floats do take up 16 bytes apiece, and a dict with 1500 entries about 100k:
>>> sys.getsizeof(1.0)
16
>>> d = dict.fromkeys((float(i) for i in range(1500)), 2.0)
>>> sys.getsizeof(d)
98444
so the 22,500 dicts take over 2GB all by themselves, the 68 million floats another GB or so. Not sure how you compute 68 million times 16 equal only 100M -- you may have dropped a zero somewhere.
The class itself takes up a negligible amount, and 1500 instances thereof (net of the objects they refer to, of course, just as getsizeof gives us such net amounts for the dicts) not much more than a smallish dict each, so that's hardly the problem. I.e. (Sic here is a small sample class standing in for yours):
>>> sys.getsizeof(Sic)
452
>>> sys.getsizeof(Sic())
32
>>> sys.getsizeof(Sic().__dict__)
524
452 for the class, (524 + 32) * 1550 = 862K for all the instances, as you see that's not the worry when you have gigabytes each in dicts and floats.