First of all, here are my computer specs:
Memory - https://gist.github.com/vyscond/6425304
CPU - https://gist.github.com/vyscond/6425322
So this morning I tested the following two code snippets:
code A
a = 'a' * 1000000000
and code B
a = 'a' * 10000000000
Code A works fine, but code B gives me this error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
So I started researching methods for measuring the size of data in Python.
The first thing I found was the classic built-in function len().
For code A, len() returned the value 1000000000, but for code B the same MemoryError was raised.
After this I decided to get more precision in these tests, so I found a function from the sys module called getsizeof(). With this function I ran the same test on code A:
sys.getsizeof( 'a' * 1000000000 )
The result returned was 1000000037 (in bytes).
Question 1: does this mean the string takes 0.9313226090744 gigabytes?
So I checked the number of bytes in a string with the single character 'a':
sys.getsizeof( 'a' )
The result returned was 38 (in bytes).
Question 2: does this mean that a string composed of 1000000000 'a' characters will take 38 * 1000000000 = 38,000,000,000 bytes?
Question 3: does this mean we would need 35.390257835388 gigabytes to hold such a string?
I would like to know where the error in this reasoning is, because it makes no sense to me.
Python objects have a minimal size: the overhead of keeping several pieces of bookkeeping data attached to the object.
A Python str object is no exception. Take a look at the difference between strings with zero, one, two and three characters:
>>> import sys
>>> sys.getsizeof('')
37
>>> sys.getsizeof('a')
38
>>> sys.getsizeof('aa')
39
>>> sys.getsizeof('aaa')
40
The Python str object overhead is 37 bytes on my machine, but each character in the string only takes one byte over the fixed overhead.
Thus, a str value with 1000 million characters requires 1000 million bytes + 37 bytes overhead of memory. That is indeed about 0.931 gigabytes.
Your sample code B created ten times as many characters, so you needed nearly 10 gigabytes of memory just to hold that one string, not counting the rest of Python, the OS, and whatever else might be running on that machine.
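For the record, here is a quick sanity check of that arithmetic (a minimal sketch; the 37-byte overhead is specific to this Python 2 build and differs across versions):
overhead = 37                              # str overhead on this build
print((overhead + 10**9) / 1024.0**3)      # ~0.931 GiB for code A
print((overhead + 10**10) / 1024.0**3)     # ~9.31 GiB for code B -> MemoryError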
The following two pieces of code are equivalent, but the first one takes about 700 MB of memory while the second takes only about 100 MB (measured via the Windows task manager). What happens here?
def a():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        lst.append(t)
    return lst

_ = a()
def a():
    lst = []
    for i in range(10**7):
        t = "a" * 2
        lst.append(t)
    return lst

_ = a()
@vurmux presented the right reason for the different memory usage: string interning. But some important details seem to be missing.
The CPython implementation interns some strings during compilation, e.g. "a"*2; for more info about how/why "a"*2 gets interned, see this SO post.
Clarification: As @MartijnPieters has correctly pointed out in his comment, the important thing is whether the compiler performs constant folding (i.e. evaluates the multiplication of two constants such as "a"*2) or not. If constant folding is done, the resulting constant is used and all elements in the list are references to the same object; otherwise they are not. Even if all string constants get interned (and thus constant folding is performed => string interned), it was still sloppy to speak about interning: constant folding is the key here, as it also explains the behavior for types that have no interning at all, for example floats (if we were to use t = 42 * 2.0).
Whether constant folding has happened can easily be verified with the dis module (I call your second version a2()):
>>> import dis
>>> dis.dis(a2)
...
4 18 LOAD_CONST 2 ('aa')
20 STORE_FAST 2 (t)
...
As we can see, during run time the multiplication isn't performed; instead, the result of the multiplication (computed at compile time) is loaded directly. The resulting list consists of references to the same object (the constant loaded by 18 LOAD_CONST 2):
>>> len({id(s) for s in a2()})
1
Here, only 8 bytes per reference are needed, which means about 80 MB of memory (+ the overallocation of the list + the memory needed for the interpreter).
In Python 3.7, constant folding isn't performed if the resulting string has more than 4096 characters, so replacing "a"*2 with "a"*4097 leads to the following bytecode:
>>> dis.dis(a1)
...
4 18 LOAD_CONST 2 ('a')
20 LOAD_CONST 3 (4097)
22 BINARY_MULTIPLY
24 STORE_FAST 2 (t)
...
Now the multiplication isn't precalculated, and the references in the resulting list will be to different objects.
The optimizer is not yet smart enough to recognize that t is actually "a" in t = t * 2, otherwise it would be able to perform the constant folding. For now, the resulting bytecode for your first version (I call it a1()) is:
...
5 22 LOAD_CONST 3 (2)
24 LOAD_FAST 2 (t)
26 BINARY_MULTIPLY
28 STORE_FAST 2 (t)
...
and it will return a list with 10^7 different objects (though all of them equal) inside:
>>> len({id(s) for s in a1()})
10000000
i.e. you will need about 56 bytes per string (sys.getsizeof returns 51, but because the pymalloc memory allocator is 8-byte aligned, 5 bytes will be wasted) + 8 bytes per reference (assuming a 64-bit CPython version), thus about 610 MB (+ the overallocation of the list + the memory needed for the interpreter).
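To make that estimate concrete, here is a small back-of-the-envelope sketch (the concrete sizes are from the 64-bit CPython 3.7 build above and will vary between versions):
import sys
s = "a" * 2
obj = sys.getsizeof(s)              # 51 on that build
aligned = (obj + 7) // 8 * 8        # pymalloc rounds allocations up to 8 -> 56
per_item = aligned + 8              # plus one 8-byte list reference
print(per_item * 10**7 / 2.0**20)   # ~610 MiB, matching the figure above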
You can enforce the interning of the string via sys.intern:
import sys
def a1_interned():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        # ensure that the string object gets interned;
        # the returned value is the interned version
        t = sys.intern(t)
        lst.append(t)
    return lst
And really, we can now see not only that less memory is needed, but also that the list holds references to the same object (see it online for a slightly smaller size (10^5) here):
>>> len({id(s) for s in a1_interned()})
1
>>> all(s == "aa" for s in a1_interned())
True
String interning can save a lot of memory, but it is sometimes tricky to understand whether/why a string gets interned. Calling sys.intern explicitly eliminates this uncertainty.
The existence of additional temporary objects referenced by t is not the problem: CPython uses reference counting for memory management, so an object is deleted as soon as there are no references to it, without any involvement of the garbage collector, which in CPython is only used to break up cycles (unlike, for example, Java's GC, as Java doesn't use reference counting). Thus, temporary variables really are temporary; those objects cannot accumulate and impact memory usage.
The problem with the temporary variable t is only that it prevents the peephole optimization during compilation, which is performed for "a"*2 but not for t*2.
This difference exists because of string interning in the Python interpreter:
String interning is the method of caching particular strings in memory as they are instantiated. The idea is that, since strings in Python are immutable objects, only one instance of a particular string is needed at a time. By storing an instantiated string in memory, any future references to that same string can be directed to refer to the singleton already in existence, instead of taking up new memory.
Let me show it in a simple example:
>>> t1 = 'a'
>>> t2 = t1 * 2
>>> t2 is 'aa'
False
>>> t1 = 'a'
>>> t2 = 'a'*2
>>> t2 is 'aa'
True
When you use the first variant, Python's string interning is not used, so the interpreter creates additional internal variables to store temporary data. It can't optimize multi-line code this way.
I am not a Python guru, but I think the interpreter works this way:
t = "a"
t = t * 2
In the first line it creates an object for t. In the second line it creates a temporary object for the t to the right of the = sign and writes the result to a third place in memory (with GC invoked later). So the second variant should use at least 3 times less memory than the first.
P.S. You can read more about string interning here.
I am trying to read up on how strings work in Python, and I am having a tough time deciphering the various functionalities. Here's what I understand; I'm hoping for corrections and new perspectives on how to remember these nuances.
Firstly, I know that Unicode evolved to accommodate multiple languages and scripts across the world. But how does Python store strings?
If I define s = 'hello', what is the encoding in which the string s is stored? Is it Unicode? Or does it store plain bytes? On doing type(s) I got the answer <type 'str'>. However, when I did us = unicode(s), us was of the type <type 'unicode'>. Is us a str type, or is there actually a unicode type in Python?
Also, I know that to save space we encode strings as bytes using the encode() function. So suppose bs = s.encode('utf-8', errors='ignore') returns a bytes object. Now, when I am writing bs to a file, should I open the file in wb mode? I have seen that if it is opened in w mode, it stores the string in the file as b"<content in s>".
What does the decode() function do? (I know, the question is too open-ended.) Is it that we apply it to a bytes object and it transforms the string into our chosen encoding? Or does it always convert back to a Unicode sequence? Can any other insights be drawn from the following lines?
>>> s = 'hello'
>>> bobj = bytes(s, 'utf-8')
>>> bobj
'hello'
>>> type(bobj)
<type 'str'>
>>> bobj.decode('ascii')
u'hello'
>>> us = bobj.decode('ascii')
>>> type(us)
<type 'str'>
How does str(object) work? I read that it will try to execute the object's __str__() method. But how differently does this function act on, say, Unicode strings versus regular byte strings?
Thanks in advance.
Important: the behavior described below is Python 3 behavior. While Python 2 has some conceptual similarities, its exposed behavior differs.
In a nutshell: due to Unicode support, the string object in Python 3 is a higher-level abstraction, and it's up to the interpreter how to represent it in memory. So, when it comes to serialization (e.g. writing a string's textual representation to a file), one needs to explicitly encode it to a byte sequence first, using a specified encoding (e.g. UTF-8). The same is true for the bytes-to-string conversion, i.e. decoding. In Python 2 the same behavior can be achieved using the unicode class, while str is rather a synonym for bytes.
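A minimal round-trip sketch of that encode/decode boundary (Python 3; the file name is just illustrative):
s = 'héllo'                       # str: an abstract sequence of code points
b = s.encode('utf-8')             # bytes: b'h\xc3\xa9llo'
print(type(s), type(b))           # <class 'str'> <class 'bytes'>
with open('out.txt', 'wb') as f:  # bytes need binary mode, as in the question
    f.write(b)
print(b.decode('utf-8') == s)     # True: decoding restores the original str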
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python, and this gives the interpreter an advantage: it can decide how a string will be encoded at creation time. Let's review the numbers from above:
An empty string object has an overhead of 49 bytes.
A string of 5 ASCII symbols has size 49 + 5, i.e. the encoding uses 1 byte per symbol.
A string with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even though its length is still 8.
The difference between u and u[1:], and likewise between u and u[:-1], is 90 - 88 = 2 bytes, i.e. the encoding uses 2 bytes per symbol, even though the ASCII prefix of the string could be encoded with 1 byte per symbol. This gives us the huge advantage of constant-time indexing into strings, but we pay for it with extra memory overhead.
The memory footprint of string j is even higher, simply because its symbols can't all be encoded with 2 bytes per symbol, so the interpreter now uses 4 bytes for each symbol.
OK, let's keep checking the behavior. We already know that the interpreter stores strings using a fixed number of bytes per symbol to give us O(1) access by index. However, we also know that UTF-8 uses a variable-length representation of symbols. Let's prove it:
j = 'hello😋😋😋'
b = j.encode('utf8') # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'
print(len(b)) # 17
So we can see that the first 5 characters are encoded using 1 byte per symbol, while the remaining 3 symbols are encoded using (17 - 5) / 3 = 4 bytes per symbol. This also explains why Python uses a 4-bytes-per-symbol representation under the hood for this string.
And the other way around: when we have a sequence of bytes and decode it to a string, the interpreter decides on the internal string representation (1, 2, or 4 bytes per symbol), and this is completely opaque to the programmer. The only thing that must be transparent is the encoding of the byte sequence: we must tell the interpreter how to treat the bytes, while letting it decide on the internal representation of the string object.
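A small illustration of that opacity (the getsizeof values are from the same CPython build as the examples above and may differ elsewhere): the bytes always decode to the same str, but the interpreter alone picks the internal width:
import sys
b = 'hello平仮名'.encode('utf-8')   # 14 bytes in UTF-8
s = b.decode('utf-8')               # 8 code points
print(len(b), len(s))               # 14 8
print(sys.getsizeof(s))             # 90 here: a 2-bytes-per-symbol layout chosen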
I am trying to compare the sizes of data types in Python with sys.getsizeof(). However, for integers and floats it returns the same value, 24 (not the customary 4 or 8 bytes). Also, the size of an array declared with array.array() with 4 integer elements comes back as 72 (not 96), and with 4 float elements as 88 (not 96). What is going on?
import array, sys
arr1 = array.array('d', [1,2,3,4])
arr2 = array.array('i', [1,2,3,4])
print sys.getsizeof(arr1[1]), sys.getsizeof(arr2[1]) # 24, 24
print sys.getsizeof(arr1), sys.getsizeof(arr2) # 88, 72
The function sys.getsizeof() returns the amount of space the Python object takes, not the amount of space you would need to represent the data in that object in the memory of the underlying system.
Python objects have overhead to cover reference counting (for garbage collection) and other implementation-related bookkeeping. In addition, an array is not a naive sequence of floats or ints; the data structure keeps a fair amount of stuff under the hood to track the datatype, the number of elements, and so on. That's where the 'd' or 'i' lives, for example.
To get the answers I think you are expecting, try
print (arr1.itemsize * len(arr1))
print (arr2.itemsize * len(arr2))
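If you also want the raw buffer size without multiplying by hand, array.array exposes it via buffer_info() (a small sketch; the address part is irrelevant here):
import array
arr1 = array.array('d', [1, 2, 3, 4])
addr, n = arr1.buffer_info()  # (buffer address, number of elements)
print(n * arr1.itemsize)      # 32: payload of four C doubles, no object overhead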
I am writing Python code to do some big-number calculations, and I have serious concerns about the memory used in the calculation.
Thus, I want to count every bit of each variable.
For example, I have a variable x, which is a big number, and I want to count the number of bits needed to represent x.
The following code is obviously useless:
x=2**1000
len(x)
Thus, I turned to the following code:
x=2**1000
len(repr(x))
The variable x (in decimal) is:
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
but the above code returns 303.
The long decimal sequence above has length 302, so I believe that 303 is related only to the string length (presumably the 302 digits plus the trailing 'L' in the Python 2 repr of a long).
So, here comes my original question:
How can I know the memory size of variable x?
One more thing: in C/C++, if I define
int z=1;
this means that 4 bytes = 32 bits are allocated for z, with the bits arranged as 00..001 (31 zeros and one 1).
Here my variable x is huge, and I don't know whether it follows the same memory allocation rule.
Use sys.getsizeof to get the size of an object, in bytes.
>>> from sys import getsizeof
>>> a = 42
>>> getsizeof(a)
12
>>> a = 2**1000
>>> getsizeof(a)
146
>>>
Note that the size and layout of an object is purely implementation-specific. CPython, for example, may use totally different internal data structures than IronPython. So the size of an object may vary from implementation to implementation.
Regarding the internal structure of a Python long, check sys.int_info (or sys.long_info for Python 2.7).
>>> import sys
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)
Python stores either 30 bits in 4 bytes (on most 64-bit systems) or 15 bits in 2 bytes (on most 32-bit systems). Comparing the actual memory usage with the calculated values, I get:
>>> import math, sys
>>> a=0
>>> sys.getsizeof(a)
24
>>> a=2**100
>>> sys.getsizeof(a)
40
>>> a=2**1000
>>> sys.getsizeof(a)
160
>>> 24+4*math.ceil(100/30)
40
>>> 24+4*math.ceil(1000/30)
160
There are 24 bytes of overhead for 0 since no bits are stored. The memory requirements for larger values match the calculated values.
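As an aside, if the goal is counting the bits of the value itself rather than the object's memory footprint, int.bit_length() answers that directly:
x = 2**1000
print(x.bit_length())  # 1001: a 1 followed by 1000 zeros in binary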
If your numbers are so large that you are concerned about the 6.25% unused bits, you should probably look at the gmpy2 library. Its internal representation uses all available bits, and computations are significantly faster for large values (say, greater than 100 digits).
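A minimal sketch of the gmpy2 variant (an assumption on my part: it requires gmpy2 to be installed, and the exact getsizeof numbers depend on the build):
import sys
from gmpy2 import mpz   # pip install gmpy2
x = mpz(2)**1000
print(x.bit_length())   # 1001, same value as with a native int
print(sys.getsizeof(x)) # denser than CPython's 30-bits-per-4-byte digits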
I have a file on disk that's only 168 MB. It's just a comma-separated list of word,id pairs.
The words can be 1-5 characters long. There are 6.5 million lines.
I created a dictionary in Python to load this up into memory so I can search incoming text against that list of words. When Python loads it up into memory, it shows 1.3 GB of RAM used. Any idea why that is?
So let's say my word file looks like this...
1,word1
2,word2
3,word3
Then add 6.5 million lines like that.
I then loop through that file and create a dictionary (Python 2.6.1):
import csv, os

cached_terms = {}

def load_term_cache():
    """Will load the term cache from our cached file instead of hitting MySQL.
    If it didn't preload into memory it would be 20+ million queries per process."""
    global cached_terms
    dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
    f = open(dumpfile)
    cache = csv.reader(f)
    for term_id, term in cache:
        cached_terms[term] = term_id
    f.close()
Just doing that blows up the memory. I watch Activity Monitor and it pegs the memory to all available, up to around 1.5 GB of RAM; on my laptop it just starts to swap. Any ideas on how to most efficiently store key/value pairs in memory with Python?
Update: I tried the anydbm module, and after 4.4 million records it just dies.
The floating-point numbers are the elapsed seconds since I started loading:
56.95
3400018
60.12
3600019
63.27
3800020
66.43
4000021
69.59
4200022
72.75
4400023
83.42
4600024
168.61
4800025
338.57
You can see it was running great: 200,000 rows inserted every few seconds, until I hit a wall and the time doubled.
import anydbm, os, time

i = 0
mark = 0
starttime = time.time()
dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
db = anydbm.open(dbfile, 'c')
# load from existing baseterm file
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
for line in open(termfile):
    i += 1
    pieces = line.split(',')
    db[str(pieces[1])] = str(pieces[0])
    if i > mark:
        print i
        print round(time.time() - starttime, 2)
        mark = i + 200000
db.close()
Lots of ideas. However, if you want practical help, edit your question to show ALL of your code. Also tell us what the "it" is that shows the memory used, what it shows when you load a file with zero entries, what platform you are on, and what version of Python.
You say that "the word can be 1-5 characters long". What is the average length of the key field in BYTES? Are the ids all integers? If so, what are the min and max integers? If not, what is the average length of an ID in bytes? To enable cross-checking of all of the above, how many bytes are there in your 6.5M-line file?
Looking at your code, a 1-line file word1,1 will create a dict d['1'] = 'word1' ... isn't that bassackwards?
Update 3: More questions: How is the "word" encoded? Are you sure you are not carrying a load of trailing spaces in either of the two fields?
Update 4 ... You asked "how to most efficiently store key/value pairs in memory with python", and nobody has answered that yet with any accuracy.
You have a 168 MB file with 6.5 million lines. That's 168 * 1.024 ** 2 / 6.5 = 27.1 bytes per line. Knock off 1 byte for the comma and 1 byte for the newline (assuming it's a *x platform) and we're left with 25 bytes per line. Assuming the "id" is intended to be unique, and as it appears to be an integer, let's assume the "id" is 7 bytes long; that leaves us with an average size of 18 bytes for the "word". Does that match your expectation?
So, we want to store an 18-byte key and a 7-byte value in an in-memory look-up table.
Let's assume a 32-bit CPython 2.6 platform.
>>> K = sys.getsizeof('123456789012345678')
>>> V = sys.getsizeof('1234567')
>>> K, V
(42, 31)
Note that sys.getsizeof(str_object) => 24 + len(str_object)
Tuples were mentioned by one answerer. Note carefully the following:
>>> sys.getsizeof(())
28
>>> sys.getsizeof((1,))
32
>>> sys.getsizeof((1,2))
36
>>> sys.getsizeof((1,2,3))
40
>>> sys.getsizeof(("foo", "bar"))
36
>>> sys.getsizeof(("fooooooooooooooooooooooo", "bar"))
36
>>>
Conclusion: sys.getsizeof(tuple_object) => 28 + 4 * len(tuple_object) ... it only allows for a pointer to each item; it doesn't allow for the sizes of the items themselves.
A similar analysis of lists shows that sys.getsizeof(list_object) => 36 + 4 * len(list_object) ... again it is necessary to add the sizes of the items. There is a further consideration: CPython overallocates lists so that it doesn't have to call the system realloc() on every list.append() call. For sufficiently large sizes (like 6.5 million!) the overallocation is 12.5 percent -- see the source (Objects/listobject.c). This overallocation is not done for tuples (their size doesn't change).
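You can watch that overallocation happen by tracking getsizeof as a list grows (exact sizes are implementation specific; the jumps are the point):
import sys
lst = []
prev = sys.getsizeof(lst)
for _ in range(50):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != prev:              # a jump means a realloc with spare room
        print(len(lst), '->', size)
        prev = size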
Here are the costs of various alternatives to dict for a memory-based look-up table:
List of tuples:
Each tuple will take 36 bytes for the 2-tuple itself, plus K and V for its contents. So N of them take N * (36 + K + V); then you need a list to hold them, so we need 36 + 1.125 * 4 * N for that.
Total for a list of tuples: 36 + N * (40.5 + K + V)
That's 36 + 113.5 * N (about 709 MB when N is 6.5 million)
Two parallel lists:
(36 + 1.125 * 4 * N + K * N) + (36 + 1.125 * 4 * N + V * N)
i.e. 72 + N * (9 + K + V)
Note that the difference between 40.5 * N and 9 * N is about 200MB when N is 6.5 million.
Value stored as int not str:
But that's not all. If the IDs are actually integers, we can store them as such.
>>> sys.getsizeof(1234567)
12
That's 12 bytes instead of 31 bytes for each value object. That difference of 19 * N is a further saving of about 118MB when N is 6.5 million.
Use array.array('l') instead of list for the (integer) value:
We can store those 7-digit integers in an array.array('l'). No int objects, and no pointers to them -- just a 4-byte signed integer value. Bonus: arrays are overallocated by only 6.25% (for large N). So that's 1.0625 * 4 * N instead of the previous (1.125 * 4 + 12) * N, a further saving of 12.25 * N i.e. 76 MB.
So we're down to 709 - 200 - 118 - 76 = about 315 MB.
N.B. Errors and omissions excepted -- it's 01:27 in my TZ :-(
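For concreteness, here is a sketch of the "two parallel lists (plus an array for the integer values)" layout analysed above, using a sorted list and bisect for lookup. The id,word file layout is taken from the question; the function names are illustrative:
import array
import bisect

words = []              # sorted list of word keys
ids = array.array('l')  # C longs: no per-value int objects or pointers

def load(path):
    pairs = []
    for line in open(path):
        term_id, term = line.rstrip('\n').split(',', 1)
        pairs.append((term, int(term_id)))
    pairs.sort()                     # sort once so bisect works on words
    for term, term_id in pairs:
        words.append(term)
        ids.append(term_id)

def lookup(term):
    i = bisect.bisect_left(words, term)
    if i < len(words) and words[i] == term:
        return ids[i]
    return None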
Take a look (Python 2.6, 32-bit version):
>>> sys.getsizeof('word,1')
30
>>> sys.getsizeof(('word', '1'))
36
>>> sys.getsizeof(dict(word='1'))
140
The string (taking 6 bytes on disk, clearly) gets an overhead of 24 bytes (no matter how long it is, add 24 to its length to find how much memory it takes). When you split it into a tuple, that's a little bit more. But the dict is what really blows things up: even an empty dict takes 140 bytes -- pure overhead of maintaining a blazingly fast hash-based lookup. To be fast, a hash table must have low density -- and Python ensures a dict always has low density (by taking up a lot of extra memory for it).
The most memory-efficient way to store key/value pairs is as a list of tuples, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be much slower than a dict).
Consider using shelve instead -- that will use little memory (since the data resides on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).
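A minimal shelve sketch (the file name is illustrative; shelve keeps the data on disk behind a dict-like API):
import shelve

db = shelve.open('baseterms_shelf')  # creates the backing file(s) on disk
db['word1'] = '1'
db['word2'] = '2'
db.close()

db = shelve.open('baseterms_shelf')
print(db['word1'])                   # '1', fetched from disk, not from RAM
db.close()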
Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.
The reason it blows up is that Python keeps extra metadata for every object, and the dict needs to construct a hash table (which requires more memory). You have just created so many objects (6.5M) that the metadata becomes too huge.
import bsddb

a = bsddb.btopen('a.bdb')  # you can also try bsddb.hashopen
for x in xrange(10500):
    a['word%d' % x] = '%d' % x
a.close()
This code takes only 1 second to run, so I think the speed is OK (since you said 10500 lines per second).
btopen creates a db file 499,712 bytes long, and hashopen creates one of 319,488 bytes.
With the xrange input set to 6.5M and using btopen, I got an output file size of 417,080 KB, and it took around 1 to 2 minutes to complete the insertion. So I think it's totally suitable for you.
I have the same problem, though I'm late to the party. The others have answered this question well. But let me offer an easy-to-use (maybe not so easy :-) ) and rather efficient alternative: pandas.DataFrame. It performs well in memory usage when storing large data.
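A rough sketch of that pandas approach, assuming the id,word CSV layout from the question (the column names are illustrative):
import pandas as pd

df = pd.read_csv('baseterms.txt', header=None, names=['id', 'word'])
lookup = df.set_index('word')['id']      # Series mapping word -> id
print(lookup.get('word1'))               # 1
print(df.memory_usage(deep=True).sum())  # actual bytes used by the frame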