I just experimented with the size of Python data structures in memory. I wrote the following snippet:
import sys
lst1=[]
lst1.append(1)
lst2=[1]
print(sys.getsizeof(lst1), sys.getsizeof(lst2))
I tested the code on the following configurations:
Windows 7 64-bit, Python 3.1: the output is 52 40, so lst1 takes 52 bytes and lst2 takes 40 bytes.
Ubuntu 11.04 32-bit, Python 3.2: the output is 48 32
Ubuntu 11.04 32-bit, Python 2.7: 48 36
Can anyone explain to me why the two sizes differ although both are lists containing a 1?
In the Python documentation for the getsizeof function I found the following: "...adds an additional garbage collector overhead if the object is managed by the garbage collector." Could this be the case in my little example?
Here's a fuller interactive session that will help me explain what's going on (Python 2.6 on Windows XP 32-bit, but it doesn't matter really):
>>> import sys
>>> sys.getsizeof([])
36
>>> sys.getsizeof([1])
40
>>> lst = []
>>> lst.append(1)
>>> sys.getsizeof(lst)
52
>>>
Note that the empty list is a bit smaller than the one with [1] in it. When an element is appended, however, it grows much larger.
The reason for this lies in the implementation details in Objects/listobject.c, in the source of CPython.
Empty list
When an empty list [] is created, no space for elements is allocated - this can be seen in PyList_New. 36 bytes is the amount of space required for the list data structure itself on a 32-bit machine.
List with one element
When a list with a single element [1] is created, space for one element is allocated in addition to the memory required by the list data structure itself. Again, this can be found in PyList_New. Given size as argument, it computes:
nbytes = size * sizeof(PyObject *);
And then has:
if (size <= 0)
    op->ob_item = NULL;
else {
    op->ob_item = (PyObject **) PyMem_MALLOC(nbytes);
    if (op->ob_item == NULL) {
        Py_DECREF(op);
        return PyErr_NoMemory();
    }
    memset(op->ob_item, 0, nbytes);
}
Py_SIZE(op) = size;
op->allocated = size;
So we see that with size = 1, space for one pointer is allocated. 4 bytes (on my 32-bit box).
Appending to an empty list
When calling append on an empty list, here's what happens:
PyList_Append calls app1
app1 asks for the list's size (and gets 0 as an answer)
app1 then calls list_resize with size+1 (1 in our case)
list_resize has an interesting allocation strategy, summarized in this comment from its source.
Here it is:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
    PyErr_NoMemory();
    return -1;
} else {
    new_allocated += newsize;
}
Let's do some math
Let's see how the numbers I quoted in the session in the beginning of my article are reached.
So 36 bytes is the size required by the list data structure itself on 32-bit. With a single element, space is allocated for one pointer, so that's 4 extra bytes - total 40 bytes. OK so far.
When app1 is called on an empty list, it calls list_resize with size=1. According to the over-allocation algorithm of list_resize, the next largest available size after 1 is 4, so space for 4 pointers will be allocated. 4 * 4 = 16 bytes, and 36 + 16 = 52.
Indeed, everything makes sense :-)
what's happening is that you're looking at how lists are allocated (and i think maybe you just wanted to see how big things were - in that case, use sys.getsizeof())
when something is added to a list, one of two things can happen:
the extra item fits in spare space
extra space is needed, so a new list is made, and the contents copied across, and the extra thing added.
since (2) is expensive (copying things, even pointers, takes time proportional to the number of things to be copied, so grows as lists get large) we want to do it infrequently. so instead of just adding a little more space, we add a whole chunk. typically the size of the amount added is similar to what is already in use - that way the maths works out that the average cost of allocating memory, spread out over many uses, is only proportional to the list size.
so what you are seeing is related to this behaviour. i don't know the exact details, but i wouldn't be surprised if [] or [1] (or both) are special cases, where only enough memory is allocated (to save memory in these common cases), and then appending does the "grab a new chunk" described above that adds more.
but i don't know the exact details - this is just how dynamic arrays work in general. the exact implementation of lists in python will be finely tuned so that it is optimal for typical python programs. so all i am really saying is that you can't trust the size of a list to tell you exactly how much it contains - it may contain extra space, and the amount of extra free space is difficult to judge or predict.
ps a neat alternative to this is to make lists as (value, pointer) pairs, where each pointer points to the next tuple. in this way you can grow lists incrementally, although the total memory used is higher. that is a linked list (what python uses is more like a vector or a dynamic array).
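here's a tiny sketch of that (value, pointer) idea in python (purely illustrative - the helper names are made up):
def cons(value, rest):
    return (value, rest)

cells = cons(1, cons(2, cons(3, None)))  # the linked list 1 -> 2 -> 3

def to_pylist(cell):
    # walk the chain of cells, collecting values
    out = []
    while cell is not None:
        value, cell = cell
        out.append(value)
    return out

print(to_pylist(cells))  # prints [1, 2, 3]
notice that appending a cell never copies the existing cells - but every element now carries an extra pointer.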
[update] see Eli's excellent answer. it explains that both [] and [1] are allocated exactly, but that appending to [] allocates an extra chunk. the comment in the code is what i am saying above (this is called "over-allocation" and the amount is proportional to what we have, so that the average ("amortised") cost is proportional to size).
Here's a quick demonstration of the list growth pattern. Changing the third argument in range() will change the output so it doesn't look like the comments in listobject.c, but the results when simply appending one element seem to be perfectly accurate.
allocated = 0
for newsize in range(0, 100, 1):
    if allocated < newsize:
        new_allocated = (newsize >> 3) + (3 if newsize < 9 else 6)
        allocated = newsize + new_allocated
    print(newsize, allocated)
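Run as-is, the printed allocated value jumps at newsize = 1, 5, 9, 17, 26, 36, 47, ..., stepping through 4, 8, 16, 25, 35, 46, 58 - exactly the growth pattern promised by the comment in listobject.c.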
The formula to recover the number of allocated element slots from the reported size changes with the system architecture:
(size - 36) / 4 for 32-bit machines, and
(size - 64) / 8 for 64-bit machines
where 36 and 64 are the sizes of an empty list on each architecture, and 4 and 8 are the sizes of a single element slot (a pointer) on each architecture.
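As a sketch, the same formula can be written so it adapts to whichever build is running (allocated_slots is a made-up helper name, assuming CPython's list layout):
import sys

def allocated_slots(lst):
    # infer the constants from the running interpreter instead of
    # hard-coding 36/4 (32-bit) or 64/8 (64-bit)
    empty = sys.getsizeof([])
    ptr = 8 if sys.maxsize > 2**32 else 4
    return (sys.getsizeof(lst) - empty) // ptr

lst = []
lst.append(1)
print(allocated_slots(lst))  # 4 on typical builds: the first over-allocation step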
I'm wondering if:
a = "abcdef"
b = "def"
if a[3:] == b:
print("something")
does actually perform a copy of the "def" part of a somewhere in memory, or if the character comparison is done in place?
Note: I'm speaking about strings, not lists (for which I know the answer)
String slicing makes a copy in CPython.
Looking in the source, this operation is handled in unicodeobject.c:unicode_subscript. There is evidently a special-case to re-use memory when the step is 1, start is 0, and the entire content of the string is sliced - this goes into unicode_result_unchanged and there will not be a copy. However, the general case calls PyUnicode_Substring where all roads lead to a memcpy.
To empirically verify these claims, you can use tracemalloc, the memory profiling tool in the standard library:
# s.py
import tracemalloc
tracemalloc.start()
before = tracemalloc.take_snapshot()
a = "." * 7 * 1024**2 # 7 MB of ..... # line 6, first alloc
b = a[1:] # line 7, second alloc
after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'lineno')[:2]:
    print(stat)
You should see the top two statistics output like this:
/tmp/s.py:6: size=7168 KiB (+7168 KiB), count=1 (+1), average=7168 KiB
/tmp/s.py:7: size=7168 KiB (+7168 KiB), count=1 (+1), average=7168 KiB
This result shows two allocations of 7 meg, strong evidence of the memory copying, and the exact line numbers of those allocations will be indicated.
Try changing the slice from b = a[1:] into b = a[0:] to see that entire-string-special-case in effect: there should be only one large allocation now, and sys.getrefcount(a) will increase by one.
In theory, since strings are immutable, an implementation could re-use memory for substring slices. This would likely complicate any reference-counting based garbage collection process, so it may not be a useful idea in practice. Consider the case where a small slice from a much larger string was taken - unless you implemented some kind of sub-reference counting on the slice, the memory from the much larger string could not be freed until the end of the substring's lifetime.
For users that specifically need a standard type which can be sliced without copying the underlying data, there is memoryview. See What exactly is the point of memoryview in Python for more information about that.
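For example, a minimal sketch (using bytes rather than str, since memoryview needs a bytes-like object):
import sys

data = b"." * (7 * 1024**2)   # 7 MB of bytes
view = memoryview(data)[1:]   # no copy: the view references data's buffer

print(sys.getsizeof(view))    # small and constant, regardless of slice length
print(view[0] == data[1])     # True: same underlying byte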
I have just written this test to verify empirically what the answer to the question might be (it is not meant to be a definitive answer).
import sys
a = "abcdefg"
print("a id:", id(a))
print("a[2:] id:", id(a[2:]))
print("a[2:] is a:", a[2:] is a)
print("Empty string memory size:", sys.getsizeof(""))
print("a memory size:", sys.getsizeof(a))
print("a[2:] memory size:", sys.getsizeof(a[2:]))
Output:
a id: 139796109961712
a[2:] id: 139796109962160
a[2:] is a: False
Empty string memory size: 49
a memory size: 56
a[2:] memory size: 54
As we can see here:
the size of an empty string object is 49 bytes
a single character occupies 1 byte (Latin-1 encoding)
a and a[2:] ids are different
the memory occupied by a and by a[2:] each matches what a string of that length would occupy (49 bytes plus one byte per character)
I want to check the size of the int data type in Python:
import sys
sys.getsizeof(int)
It comes out to be "436", which doesn't make sense to me.
Anyway, I want to know how many bytes (2, 4, ...?) an int will take on my machine.
The short answer
You're getting the size of the class, not of an instance of the class. Call int() and measure the resulting instance:
>>> sys.getsizeof(int())
24
If that size still seems a little bit large, remember that a Python int is very different from an int in (for example) C. In Python, an int is a fully-fledged object. This means there's extra overhead.
Every Python object contains at least a refcount and a reference to the object's type in addition to other storage; on a 64-bit machine, that takes up 16 bytes! The int internals (as determined by the standard CPython implementation) have also changed over time, so that the amount of additional storage taken depends on your version.
Some details about int objects in Python 2 and 3
Here's the situation in Python 2. (Some of this is adapted from a blog post by Laurent Luce). Integer objects are represented as blocks of memory with the following structure:
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
PyObject_HEAD is a macro defining the storage for the refcount and the object type. It's described in some detail by the documentation, and the code can be seen in this answer.
The memory is allocated in large blocks so that there's not an allocation bottleneck for every new integer. The structure for the block looks like this:
struct _intblock {
    struct _intblock *next;
    PyIntObject objects[N_INTOBJECTS];
};
typedef struct _intblock PyIntBlock;
These are all empty at first. Then, each time a new integer is created, Python uses the memory pointed at by next and increments next to point to the next free integer object in the block.
I'm not entirely sure how this changes once you exceed the storage capacity of an ordinary integer, but once you do so, the size of an int gets larger. On my machine, in Python 2:
>>> sys.getsizeof(0)
24
>>> sys.getsizeof(1)
24
>>> sys.getsizeof(2 ** 62)
24
>>> sys.getsizeof(2 ** 63)
36
In Python 3, I think the general picture is the same, but the size of integers increases in a more piecemeal way:
>>> sys.getsizeof(0)
24
>>> sys.getsizeof(1)
28
>>> sys.getsizeof(2 ** 30 - 1)
28
>>> sys.getsizeof(2 ** 30)
32
>>> sys.getsizeof(2 ** 60 - 1)
32
>>> sys.getsizeof(2 ** 60)
36
These results are, of course, all hardware-dependent! YMMV.
The variability in integer size in Python 3 is a hint that they may behave more like variable-length types (like lists). And indeed, this turns out to be true. Here's the definition of the C struct for int objects in Python 3:
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
The comments that accompany this definition summarize Python 3's representation of integers. Zero is represented not by a stored value, but by an object with size zero (which is why sys.getsizeof(0) is 24 bytes while sys.getsizeof(1) is 28). Negative numbers are represented by objects with a negative size attribute! So weird.
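As a rough sanity check of that picture, sizes can be predicted from bit_length (expected_size is a made-up helper; the 24-byte base, 30-bit digits and 4-byte digit slots are assumptions matching the 64-bit figures above, and newer CPython versions may differ slightly):
import sys

def expected_size(n, base=24, digit_bytes=4, bits_per_digit=30):
    # zero stores no digits at all; otherwise one digit per 30 bits
    ndigits = 0 if n == 0 else -(-n.bit_length() // bits_per_digit)  # ceiling division
    return base + digit_bytes * ndigits

for n in (0, 1, 2**30 - 1, 2**30, 2**60 - 1, 2**60):
    print(n.bit_length(), expected_size(n), sys.getsizeof(n))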
A tuple takes less memory space in Python:
>>> a = (1,2,3)
>>> a.__sizeof__()
48
whereas a list takes more memory space:
>>> b = [1,2,3]
>>> b.__sizeof__()
64
What happens internally on the Python memory management?
I assume you're using CPython with a 64-bit build (I got the same results on 64-bit CPython 2.7). There could be differences in other Python implementations or on a 32-bit build.
Regardless of the implementation, lists are variable-sized while tuples are fixed-size.
So tuples can store their elements directly inside the struct; lists, on the other hand, need a layer of indirection (they store a pointer to their elements). That layer of indirection is a pointer, which on 64-bit systems is 64 bits, hence 8 bytes.
But there's another thing that lists do: they over-allocate. Otherwise list.append would always be an O(n) operation - to make it amortized O(1) (much faster!), lists over-allocate. But then they have to keep track of both the allocated size and the filled size (tuples only need to store one size, because allocated and filled size are always identical). That means each list has to store another "size", which on 64-bit systems is a 64-bit integer, again 8 bytes.
So lists need at least 16 bytes more memory than tuples. Why did I say "at least"? Because of the over-allocation. Over-allocation means it allocates more space than needed. However, the amount of over-allocation depends on "how" you create the list and the append/deletion history:
>>> l = [1,2,3]
>>> l.__sizeof__()
64
>>> l.append(4) # triggers re-allocation (with over-allocation), because the original list is full
>>> l.__sizeof__()
96
>>> l = []
>>> l.__sizeof__()
40
>>> l.append(1) # re-allocation with over-allocation
>>> l.__sizeof__()
72
>>> l.append(2) # no re-alloc
>>> l.append(3) # no re-alloc
>>> l.__sizeof__()
72
>>> l.append(4) # still has room, so no over-allocation needed (yet)
>>> l.__sizeof__()
72
Images
I decided to create some images to accompany the explanation above. Maybe they're helpful.
This is how it (schematically) is stored in memory in your example. I highlighted the differences with red (free-hand) circles:
That's actually just an approximation because int objects are also Python objects and CPython even reuses small integers, so a probably more accurate representation (although not as readable) of the objects in memory would be:
Useful links:
tuple struct in CPython repository for Python 2.7
list struct in CPython repository for Python 2.7
int struct in CPython repository for Python 2.7
Note that __sizeof__ doesn't really return the "correct" size! It only returns the size of the stored values. However when you use sys.getsizeof the result is different:
>>> import sys
>>> l = [1,2,3]
>>> t = (1, 2, 3)
>>> sys.getsizeof(l)
88
>>> sys.getsizeof(t)
72
There are 24 "extra" bytes. These are real, that's the garbage collector overhead that isn't accounted for in the __sizeof__ method. That's because you're generally not supposed to use magic methods directly - use the functions that know how to handle them, in this case: sys.getsizeof (which actually adds the GC overhead to the value returned from __sizeof__).
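You can isolate just that overhead directly (the 24 bytes here match the Python 2.7 64-bit build used above; the exact GC header size varies across versions):
import sys

l = [1, 2, 3]
print(sys.getsizeof(l) - l.__sizeof__())  # 24: the garbage collector header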
I'll take a deeper dive into the CPython codebase so we can see how the sizes are actually calculated. In your specific example, no over-allocations have been performed, so I won't touch on that.
I'm going to use 64-bit values here, as you are.
The size for lists is calculated from the following function, list_sizeof:
static PyObject *
list_sizeof(PyListObject *self)
{
    Py_ssize_t res;

    res = _PyObject_SIZE(Py_TYPE(self)) + self->allocated * sizeof(void*);
    return PyInt_FromSsize_t(res);
}
Here Py_TYPE(self) is a macro that grabs the ob_type of self (returning PyList_Type) while _PyObject_SIZE is another macro that grabs tp_basicsize from that type. tp_basicsize is calculated as sizeof(PyListObject) where PyListObject is the instance struct.
The PyListObject structure has three fields:
PyObject_VAR_HEAD # 24 bytes
PyObject **ob_item; # 8 bytes
Py_ssize_t allocated; # 8 bytes
these have comments (which I trimmed) explaining what they are, follow the link above to read them. PyObject_VAR_HEAD expands into three 8 byte fields (ob_refcount, ob_type and ob_size) so a 24 byte contribution.
So for now res is:
sizeof(PyListObject) + self->allocated * sizeof(void*)
or:
40 + self->allocated * sizeof(void*)
If the list instance has elements that are allocated, the second part calculates their contribution. self->allocated, as its name implies, holds the number of allocated elements.
Without any elements, the size of lists is calculated to be:
>>> [].__sizeof__()
40
i.e. the size of the instance struct.
tuple objects don't define a tuple_sizeof function. Instead, they use object_sizeof to calculate their size:
static PyObject *
object_sizeof(PyObject *self, PyObject *args)
{
    Py_ssize_t res, isize;

    res = 0;
    isize = self->ob_type->tp_itemsize;
    if (isize > 0)
        res = Py_SIZE(self) * isize;
    res += self->ob_type->tp_basicsize;
    return PyInt_FromSsize_t(res);
}
This, as for lists, grabs the tp_basicsize and, if the object has a non-zero tp_itemsize (meaning it has variable-length instances), multiplies the number of items in the tuple (which it gets via Py_SIZE) by tp_itemsize.
tp_basicsize again uses sizeof(PyTupleObject) where the PyTupleObject struct contains:
PyObject_VAR_HEAD # 24 bytes
PyObject *ob_item[1]; # 8 bytes
So, without any elements (that is, Py_SIZE returns 0) the size of empty tuples is equal to sizeof(PyTupleObject):
>>> ().__sizeof__()
24
huh? Well, here's an oddity which I haven't found an explanation for: the tp_basicsize of tuples is actually calculated as follows:
sizeof(PyTupleObject) - sizeof(PyObject *)
Why an additional 8 bytes is removed from tp_basicsize is something I haven't been able to find out. (See MSeifert's comment for a possible explanation.)
But this is basically the difference in your specific example. Lists also keep around a count of allocated elements, which helps determine when to over-allocate again.
Now, when additional elements are added, lists do indeed perform this over-allocation in order to achieve O(1) appends. This results in greater sizes, as MSeifert covers nicely in his answer.
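The 16-byte gap itself is easy to check directly on the same build:
# ob_item pointer (8 bytes) + allocated counter (8 bytes) = 16
print([1, 2, 3].__sizeof__() - (1, 2, 3).__sizeof__())  # 64 - 48 = 16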
MSeifert's answer covers it broadly; to keep it simple you can think of it this way:
tuple is immutable. Once set, you can't change it. So you know in advance how much memory you need to allocate for that object.
list is mutable. You can add or remove items to or from it. It has to know its current size. It resizes as needed.
There are no free meals - these capabilities come with a cost. Hence the overhead in memory for lists.
The size of a tuple is fixed up front: at tuple initialization, the interpreter allocates exactly enough space for the contained data, which is possible because the tuple is immutable (it can't be modified). A list, being a mutable object, implies dynamic memory allocation; to avoid allocating space on every append or modification (allocating enough space to contain the changed data and copying the data over), it allocates additional space for future runtime changes, such as appends and modifications.
That pretty much sums it up.
So, I'm making a game in Python 3.4. In the game I need to keep track of a map. It is a map of joined rooms, starting at (0,0) and continuing in every direction, generated in a filtered-random way (only valid matches for the next position are used in a random selection).
I have several types of rooms, which have a name, and a list of doors:
RoomType = namedtuple('Room','Type,EntranceLst')
typeA = RoomType("A",["Bottom"])
...
For the map at the moment I keep a dict of positions and the type of room:
currentRoomType = typeA
currentRoomPos = (0,0)
navMap = {currentRoomPos: currentRoomType}
I have a loop that generates 9,000,000 rooms, to test the memory usage.
I get around 600 to 800 MB when I run it.
I was wondering if there is a way to optimize that.
Instead of doing
navMap = {currentRoomPos: currentRoomType}
I tried
navMap = {currentRoomPos: "A"}
but this didn't make a real difference in usage.
Now I was wondering if I could - and should - keep a list of all the types, and for every type keep the positions on which it occurs. I do not know however if it will make a difference with the way python manages its variables.
This is pretty much a thought-experiment, but if anything useful comes from it I will probably implement it.
You can use sys.getsizeof(object) to get the size of a Python object. However, you have to be careful when calling sys.getsizeof on containers: it only gives the size of the container, not the content -- see this recipe for an explanation of how to get the total size of a container, including contents. In this case, we don't need to go quite so deep: we can just manually add up the size of the container and the size of its contents.
The sizes of the types in question are:
# room type size
>>> sys.getsizeof(RoomType("A",["Bottom"])) + sys.getsizeof("A") + sys.getsizeof(["Bottom"]) + sys.getsizeof("Bottom")
233
# position size
>>> sys.getsizeof((0,0)) + 2*sys.getsizeof(0)
120
# One character size
>>> sys.getsizeof("A")
38
Let's look at the different options, assuming you have N rooms:
Dictionary from position -> room_type. This involves keeping N*(size(position) + size(room_type)) = 353 N bytes in memory.
Dictionary from position -> 1-character string. This involves keeping N*158 bytes in memory.
Dictionary from type -> set of positions. This involves keeping N*120 bytes plus a tiny overhead with storing dictionary keys.
In terms of memory usage, the third option is clearly better. However, as is often the case, there is a CPU/memory tradeoff. It's worth thinking briefly about the computational complexity of the queries you are likely to do. To find the type of a room given its position, with each of the three choices above you have to:
Look up the position in a dictionary. This is an O(1) lookup, so you'll always have the same run time (approximately), independent of the number of rooms (for a large number of rooms).
Same
Look at each type, and for each type, ask if that position is in the set of positions for that type. This is an O(ntypes) lookup, that is, the time it takes is proportional to the number of types that you have. Note that, if you had gone for list instead of a set to store the rooms of a given type, this would grow to O(nrooms * ntypes), which would kill your performance.
As always, when optimising, it is important to consider the effect of an optimisation on both memory usage and CPU time. The two are often at odds.
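Here's a minimal sketch of the third option (the names are made up; positions are (x, y) tuples and room types are one-character strings):
from collections import defaultdict

rooms_by_type = defaultdict(set)   # room type -> set of positions
rooms_by_type["A"].add((0, 0))
rooms_by_type["A"].add((1, 0))
rooms_by_type["B"].add((0, 1))

def type_at(pos):
    # O(ntypes): one O(1) membership test per room type
    for room_type, positions in rooms_by_type.items():
        if pos in positions:
            return room_type
    return None

print(type_at((0, 1)))  # prints B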
As an alternative, you could consider keeping the types in a 2-dimensional numpy array of characters, if your map is sufficiently rectangular. I believe this would be far more efficient. Each character in a numpy array is a single byte, so the memory usage would be much less, and the CPU time would still be O(1) lookup from room position to type:
# Generate a 20 x 10 rectangular map (b'a' gives 1-byte characters)
>>> import numpy as np
>>> map = np.repeat(b'a', 200).reshape(20, 10)
>>> map.nbytes
200 # i.e. 1 byte per character
Some additionally small scale optimisations:
Encode the room type as an int rather than a string. Ints have size 24 bytes, while one-character strings have size 38.
Encode the position as a single integer, rather than a tuple. For instance:
# Random position
xpos = 5
ypos = 92
# Encode the position as a single int, using the high-order digits for x and the low-order digits for y
pos = xpos * 1000 + ypos
# Recover the x and y values of the position.
xpos = pos // 1000
ypos = pos % 1000
Note that this kills readability, so it's only worth doing if you want to squeeze out the last bits of performance. In practice, you might want to use a power of 2, rather than a power of 10, as your multiplier (but a power of 10 helps with debugging and readability). Note that this brings your number of bytes per position from 120 down to 24. If you do go down this route, consider defining a Position class using __slots__ to tell Python how to allocate memory, and add xpos and ypos properties to the class. You don't want to litter your code with pos // 1000 and pos % 1000 statements.
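Here's a sketch of that Position idea (a hypothetical class, keeping the power-of-10 multiplier from above for readability):
class Position(object):
    __slots__ = ("pos",)  # no per-instance __dict__, so instances stay small

    def __init__(self, xpos, ypos):
        self.pos = xpos * 1000 + ypos

    @property
    def xpos(self):
        return self.pos // 1000

    @property
    def ypos(self):
        return self.pos % 1000

p = Position(5, 92)
print(p.xpos, p.ypos)  # prints 5 92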
I have a file on disk that's only 168 MB. It's just a comma-separated list of word,id pairs.
The word can be 1-5 characters long. There are 6.5 million lines.
I created a dictionary in Python to load this up into memory so I can search incoming text against that list of words. When Python loads it into memory, it shows 1.3 GB of RAM used. Any idea why that is?
So let's say my word file looks like this...
1,word1
2,word2
3,word3
Then add 6.5 million to that.
I then loop through that file and create a dictionary (python 2.6.1):
import csv
import os

def load_term_cache():
    """will load the term cache from our cached file instead of hitting mysql. If it didn't
    preload into memory it would be 20+ million queries per process"""
    global cached_terms
    dumpfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt')
    f = open(dumpfile)
    cache = csv.reader(f)
    for term_id, term in cache:
        cached_terms[term] = term_id
    f.close()
Just doing that blows up the memory. I watch Activity Monitor and it pegs the memory usage at everything available, up to around 1.5 GB of RAM; on my laptop it then starts to swap. Any ideas how to most efficiently store key/value pairs in memory with Python?
Update: I tried to use the anydbm module and after 4.4 million records it just dies.
The floating-point numbers are the elapsed seconds since I started loading it:
56.95
3400018
60.12
3600019
63.27
3800020
66.43
4000021
69.59
4200022
72.75
4400023
83.42
4600024
168.61
4800025
338.57
You can see it was running great - 200,000 rows inserted every few seconds - until I hit a wall and the time doubled.
import anydbm
import os
import time

i = 0
mark = 0
starttime = time.time()
dbfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms')
db = anydbm.open(dbfile, 'c')
# load from existing baseterm file
termfile = os.path.join(os.getenv("MY_PATH"), 'datafiles', 'baseterms.txt.LARGE')
for line in open(termfile):
    i += 1
    pieces = line.split(',')
    db[str(pieces[1])] = str(pieces[0])
    if i > mark:
        print i
        print round(time.time() - starttime, 2)
        mark = i + 200000
db.close()
Lots of ideas. However, if you want practical help, edit your question to show ALL of your code. Also tell us what the "it" is that shows the memory used, what it shows when you load a file with zero entries, what platform you are on, and what version of Python.
You say that "the word can be 1-5 characters long". What is the average length of the key field in BYTES? Are the ids all integers? If so, what are the min and max integers? If not, what is the average length of an id in bytes? To enable cross-checking of all of the above, how many bytes are there in your 6.5M-line file?
Looking at your code, a 1-line file word1,1 will create a dict d['1'] = 'word1' ... isn't that bassackwards?
Update 3: More questions: How is the "word" encoded? Are you sure you are not carrying a load of trailing spaces in either of the two fields?
Update 4 ... You asked "how to most efficiently store key/value pairs in memory with python" and nobody's answered that yet with any accuracy.
You have a 168 Mb file with 6.5 million lines. That's 168 * 1.024 ** 2 / 6.5 = 27.1 bytes per line. Knock off 1 byte for the comma and 1 byte for the newline (assuming it's a *x platform) and we're left with 25 bytes per line. Assuming the "id" is intended to be unique, and as it appears to be an integer, let's assume the "id" is 7 bytes long; that leaves us with an average size of 18 bytes for the "word". Does that match your expectation?
So, we want to store an 18-byte key and a 7-byte value in an in-memory look-up table.
Let's assume a 32-bit CPython 2.6 platform.
>>> K = sys.getsizeof('123456789012345678')
>>> V = sys.getsizeof('1234567')
>>> K, V
(42, 31)
Note that sys.getsizeof(str_object) => 24 + len(str_object)
Tuples were mentioned by one answerer. Note carefully the following:
>>> sys.getsizeof(())
28
>>> sys.getsizeof((1,))
32
>>> sys.getsizeof((1,2))
36
>>> sys.getsizeof((1,2,3))
40
>>> sys.getsizeof(("foo", "bar"))
36
>>> sys.getsizeof(("fooooooooooooooooooooooo", "bar"))
36
>>>
Conclusion: sys.getsizeof(tuple_object) => 28 + 4 * len(tuple_object) ... it only allows for a pointer to each item, it doesn't allow for the sizes of the items.
A similar analysis of lists shows that sys.getsizeof(list_object) => 36 + 4 * len(list_object) ... again it is necessary to add the sizes of the items. There is a further consideration: CPython overallocates lists so that it doesn't have to call the system realloc() on every list.append() call. For sufficiently large size (like 6.5 million!) the overallocation is 12.5 percent -- see the source (Objects/listobject.c). This overallocation is not done with tuples (their size doesn't change).
Here are the costs of various alternatives to dict for a memory-based look-up table:
List of tuples:
Each tuple will take 36 bytes for the 2-tuple itself, plus K and V for the contents. So N of them will take N * (36 + K + V); then you need a list to hold them, so we need 36 + 1.125 * 4 * N for that.
Total for list of tuples: 36 + N * (40.5 + K + V)
That's 36 + 113.5 * N (about 709 MB when N is 6.5 million)
Two parallel lists:
(36 + 1.125 * 4 * N + K * N) + (36 + 1.125 * 4 * N + V * N)
i.e. 72 + N * (9 + K + V)
Note that the difference between 40.5 * N and 9 * N is about 200 MB when N is 6.5 million.
Value stored as int not str:
But that's not all. If the IDs are actually integers, we can store them as such.
>>> sys.getsizeof(1234567)
12
That's 12 bytes instead of 31 bytes for each value object. That difference of 19 * N is a further saving of about 118MB when N is 6.5 million.
Use array.array('l') instead of list for the (integer) value:
We can store those 7-digit integers in an array.array('l'). No int objects, and no pointers to them -- just a 4-byte signed integer value. Bonus: arrays are overallocated by only 6.25% (for large N). So that's 1.0625 * 4 * N instead of the previous (1.125 * 4 + 12) * N, a further saving of 12.25 * N i.e. 76 MB.
So we're down to 709 - 200 - 118 - 76 = about 315 MB.
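For illustration, a sketch combining the parallel-sequences and array.array ideas (hypothetical names; the word list is kept sorted, with ids aligned to it, and looked up with bisect in O(log N)):
import bisect
from array import array

words = sorted(["apple", "banana", "cherry"])  # keys, sorted once
ids = array("l", [17, 42, 99])                 # ids[i] belongs to words[i]

def lookup(word):
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return ids[i]
    return None

print(lookup("banana"))  # prints 42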
N.B. Errors and omissions excepted -- it's 01:27 in my TZ :-(
Take a look (Python 2.6, 32-bit version)...:
>>> sys.getsizeof('word,1')
30
>>> sys.getsizeof(('word', '1'))
36
>>> sys.getsizeof(dict(word='1'))
140
The string (taking 6 bytes on disk, clearly) gets an overhead of 24 bytes (no matter how long it is, add 24 to its length to find how much memory it takes). When you split it into a tuple, that's a little bit more. But the dict is what really blows things up: even an empty dict takes 140 bytes -- pure overhead of maintaining a blazingly-fast hash-based lookup table. To be fast, a hash table must have low density -- and Python ensures a dict always has low density (by taking up a lot of extra memory for it).
The most memory-efficient way to store key/value pairs is as a list of tuples, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be far slower than a dict).
Consider using shelve instead -- that will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).
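A minimal shelve sketch (the file name is made up; keys must be strings, and values are pickled to disk, so resident memory stays small):
import shelve

db = shelve.open("baseterms.shelf")
db["word1"] = "1"            # value is pickled and stored on disk
db["word2"] = "2"
print(db.get("word1"))       # '1'
db.close()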
Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.
The reason memory explodes is that Python keeps extra metadata for every object, and the dict needs to construct a hash table (which requires even more memory). You just created so many objects (6.5M) that the metadata becomes too large.
import bsddb

a = bsddb.btopen('a.bdb')  # you can also try bsddb.hashopen
for x in xrange(10500):
    a['word%d' % x] = '%d' % x
a.close()
This code takes only 1 second to run, so I think the speed is OK (since you said 10,500 lines per second).
btopen creates a db file 499,712 bytes in length, and hashopen creates one of 319,488 bytes.
With an xrange input of 6.5M and using btopen, I got an output file of 417,080 KB and it took around 1 or 2 minutes to complete the insertion. So I think it's totally suitable for you.
I had the same problem, though I'm late to the party. The others have answered this question well. But I'll offer an easy-to-use (well, maybe not so easy :-)) and rather efficient alternative: pandas.DataFrame. It performs well on memory usage when storing large data.