I'm quite new to python, so I'm doing my usual of going through Project Euler to work out the logical kinks in my head.
Basically, I need the largest list size possible, i.e. range(1, n), without overflowing.
Any ideas?
Look at get_len_of_range and get_len_of_range_longs in the builtin module source
Summary: you'll get an OverflowError if the list has more elements than can fit in a signed long. On 32-bit Python that's 2**31 - 1, and on 64-bit Python that's 2**63 - 1. In practice, you will hit a MemoryError for values well below that limit.
The size of your lists is only limited by your memory. Note that, depending on your version of Python, range(1, 9999999999999999) needs only a few bytes of RAM, because it only ever creates one element of the virtual list it returns at a time.
If you want to instantiate the list, use list(range(1, n)) (this materializes the virtual list in memory).
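For example, a quick sketch in Python 3, where range objects are lazy (in Python 2, xrange behaves similarly while range builds the whole list up front):

import sys

r = range(1, 10**15)        # no elements are materialized yet
print(sys.getsizeof(r))     # a few dozen bytes, regardless of the length
print(r[123456789])         # indexing computes the value on demand

big = list(range(10**6))    # list() actually allocates storage for every element
print(sys.getsizeof(big))   # several megabytes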
I am building a web scraper that stores data retrieved from four different websites into a tuple array. I later iterate through the tuple and save the entire lot as both CSV and Excel.
Are tuple arrays, or arrays in general, limited by the machine's RAM/disk space?
Thanks
According to the doc, this is given by sys.maxsize
sys.maxsize
An integer giving the maximum value a variable of type Py_ssize_t can take. It’s usually 2**31 - 1 on a 32-bit platform and 2**63 - 1 on a 64-bit platform.
And interestingly enough, the Python 3 data model documentation gives more implementation details under object.__len__.
CPython implementation detail: In CPython, the length is required to be at most sys.maxsize. If the length is larger than sys.maxsize some features (such as len()) may raise OverflowError.
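A minimal sketch that shows both halves of that, assuming a recent CPython (the exact sys.maxsize value depends on your platform):

import sys

print(sys.maxsize)          # 2**63 - 1 on a typical 64-bit build

class Huge(object):
    def __len__(self):
        # report a length bigger than Py_ssize_t can hold
        return sys.maxsize + 1

len(Huge())                 # raises OverflowError in CPython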
I believe tuples and lists are limited only by the size of the machine's virtual memory, unless you're on a 32-bit system, in which case you're limited by the smaller word size. Also, lists are dynamically over-allocated by, I believe, about 12% each time they grow too small, so there's a little overhead there as well.
If you're concerned you're going to run out of virtual memory, it might be a good idea to write to a file or files instead.
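For example, rather than accumulating every record in a list of tuples, you can stream rows straight to disk as you scrape them. A rough sketch, where fetch_rows is a hypothetical stand-in for your scraping code:

import csv

def fetch_rows():
    # hypothetical placeholder: yield one tuple per scraped record
    yield ("example.com", "some title", 42)

with open("scraped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["site", "title", "value"])  # header row
    for row in fetch_rows():
        writer.writerow(row)  # each row goes to disk; nothing piles up in memory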
I am learning how to program in Python and am also learning theory as part of a computer science course. In practice I know that I can add additional elements to a list just by using the .append method; however, in my theory classes we are told that arrays can be neither increased nor decreased in size.
How does this work in Python?
Python uses resizable vectors under the hood. They keep track of how many elements are in the list as well as what the current total capacity is. When you try to add another element beyond the current capacity, it allocates a new array with more capacity and populates it with the pointers to the items from the original backing array. This is similar to Java's ArrayList type, except that there's no way to specify the capacity in Python.
A detailed post on the implementation is here: http://www.laurentluce.com/posts/python-list-implementation/
They are not linked lists; there's no linked list type built into Python, and the performance characteristics are different.
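You can watch the resizing happen with sys.getsizeof, which reports the list header plus its pointer vector. A rough sketch; the exact byte counts and growth points vary between CPython versions and platforms:

import sys

items = []
last = sys.getsizeof(items)
print(0, last)
for i in range(32):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last:
        # a jump means a bigger backing array was just allocated
        print(len(items), size)
        last = size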
It's not specific to Python, but at some point you will see this in other languages as well. Another common way this is solved, without using a vector or a linked list, is with a dynamic array.
Essentially, you create an array with a fixed size. If the user calls append and there is no more room in the array, you create a new array that is 2x larger than the old one, copy all the elements over, and then append the new element.
The 2x growth factor is important because it keeps insertion amortized constant time (that's more advanced algorithm analysis, though).
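A toy version in Python might look like the sketch below. It is purely illustrative, since Python's built-in list already does this for you in C:

class DynamicArray(object):
    """A fixed-size backing array that doubles when it runs out of room."""

    def __init__(self):
        self._capacity = 1
        self._count = 0
        self._backing = [None] * self._capacity

    def append(self, value):
        if self._count == self._capacity:
            # out of room: allocate a 2x larger array and copy everything over
            self._capacity *= 2
            new_backing = [None] * self._capacity
            for i in range(self._count):
                new_backing[i] = self._backing[i]
            self._backing = new_backing
        self._backing[self._count] = value
        self._count += 1

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        return self._backing[index]

Each individual append is occasionally expensive (when a copy happens), but averaged over many appends the cost per element stays constant.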
A list in Python can grow dynamically and each element can point to anything, which makes it feel like a linked list, but under the hood CPython implements it as a resizable array of pointers rather than as a linked list.
If you're curious about what is dynamic and what isn't in Python, then you should read about mutability vs. immutability:
https://codehabitude.com/2013/12/24/python-objects-mutable-vs-immutable/
In the theory class, you learned about static arrays, the kind we usually see in C. In Python, however, lists are dynamic arrays, which are extensible. Search for "dynamic array" on Google and you will gain further knowledge.
I am writing a little arbitrary-precision library in C (I know such libraries exist, such as GMP, but I find it more fun to write one myself, just as an exercise), and I would like to know if arrays are the best way to represent very long integers, or if there is a better solution (maybe linked lists)? And secondly, how does Python handle big integers? (Does it use arrays or another technique?)
Thanks in advance for any response
Try reading the documentation for libgmp; it already implements bigints. From what I can see, integers are implemented as a dynamically allocated array which is realloc'd when the number needs to grow. See http://gmplib.org/manual/Integer-Internals.html#Integer-Internals.
Python long integers are implemented as a simple structure with just an object header and an array of integers (which might be either 16- or 32-bit ints, depending on platform).
Code is at http://hg.python.org/cpython/file/f8942b8e6774/Include/longintrepr.h
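The idea is straightforward to mimic: store the magnitude as an array of fixed-width "digits" in a large base, least significant digit first. A rough Python sketch of that decomposition (CPython does the same thing in C; 15-bit digits are an assumption here, matching older 32-bit builds):

DIGIT_BITS = 15        # assumed digit width; CPython uses 15 or 30 bits per digit
BASE = 1 << DIGIT_BITS

def to_digits(n):
    # decompose a non-negative integer into base-2**15 digits, least significant first
    digits = []
    while n:
        digits.append(n % BASE)
        n //= BASE
    return digits or [0]

def from_digits(digits):
    # recombine the digit array into a single integer
    n = 0
    for d in reversed(digits):
        n = n * BASE + d
    return n

assert from_digits(to_digits(12345678901234567890)) == 12345678901234567890

Arithmetic then works digit by digit with carries, much like longhand arithmetic in base 10, which is one reason a plain array tends to beat a linked structure here: you get cache-friendly sequential access and O(1) indexing.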
Is there a bound on the size of lists and dictionaries in Python?
If there is, what is the limit?
I think by boundary you mean whether there is an upper bound on the number of elements in a list or dict. Python does not define any limits on them, so they can be as big as the memory available on your machine permits.
Actually, the current hash implementation for built-in Python objects uses 32-bit hashes, so at somewhere close to 2^32 elements in a dictionary (assuming you have the memory for that much) you will start to get a lot of collisions and a significant slowdown in dictionary usage. But that won't prevent it from working.
(Python developers are looking at making this hash 64-bit in future builds, at which point this will no longer be an issue.)
As for an absolute limit, there is none; the limiting factor is the available system memory.
The amount of memory you have is the limit.
GAE has various limitations, one of which is the size of the biggest allocatable block of memory, amounting to 1 MB (now 10 times more, but that doesn't change the question). The limitation means that one cannot put more than some number of items in a list(), as CPython would try to allocate a contiguous memory block for the element pointers. Having huge list()s can be considered bad programming practice, but even if no huge structure is created in the program itself, CPython maintains some behind the scenes.
It appears that CPython is maintaining a single global list of objects or something similar, i.e. an application that has many small objects tends to allocate bigger and bigger single blocks of memory.
My first idea was the gc module, and disabling it changes the application's behavior a bit, but some structures are still maintained.
The simplest short application that experiences the issue is:
a = b = []
number_of_lists = 8000000
for i in xrange(number_of_lists):
    b.append([])
    b = b[0]
Can anyone enlighten me on how to prevent CPython from allocating huge internal structures when an application has many objects?
On a 32-bit system, each of the 8000000 lists you create will allocate 20 bytes for the list object itself, plus 16 bytes for a vector of list elements. So you are trying to allocate at least (20+16) * 8000000 = 288000000 bytes, about 288 MB. And that's in the best case, if the system malloc only allocates exactly as much memory as requested.
I calculated the size of the list object as follows:
2 pointers in the PyListObject structure itself (see listobject.h)
1 pointer and 1 Py_ssize_t for the PyObject_HEAD part of the list object (see object.h)
1 Py_ssize_t for the PyObject_VAR_HEAD (also in object.h)
The vector of list elements is slightly overallocated to avoid having to resize it at each append - see list_resize in listobject.c. The sizes are 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ... Thus, your one-element lists will allocate room for 4 elements.
Your data structure is a somewhat pathological example, paying the price of a variable-sized list object without utilizing it - all your lists have only a single element. You could avoid the 12 bytes overallocation by using tuples instead of lists, but to further reduce the memory consumption, you will have to use a different data structure that uses fewer objects. It's hard to be more specific, as I don't know what you are trying to accomplish.
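You can see that difference with sys.getsizeof; the exact byte counts depend on your Python version and platform, but a list grown by append carries the over-allocated slots while a tuple holds exactly one pointer:

import sys

grown = []
grown.append(42)               # append over-allocates (room for ~4 elements)
print(sys.getsizeof(grown))    # list header + the over-allocated pointer slots

print(sys.getsizeof((42,)))    # tuple header + exactly one pointer slot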
I'm a bit confused as to what you're asking. In that code example, nothing should be garbage collected, as you're never actually killing off any references. You're holding a reference to the top level list in a and you're adding nested lists (held in b at each iteration) inside of that. If you remove the 'a =', then you've got unreferenced objects.
Edit: In response to the first part, yes, Python holds a list of objects so it can know what to cull. Is that the whole question? If not, comment/edit your question and I'll do my best to help fill in the gaps.
What are you trying to accomplish with the
a = b = []
and
b = b[0]
statements? It's certainly odd to see statements like that in Python, because they don't do what you might naively expect: in that example, a and b are two names for the same list (think pointers in C). If you're doing a lot of manipulation like that, it's easy to confuse the garbage collector (and yourself!) because you've got a lot of strange references floating around that haven't been properly cleared.
It's hard to diagnose what's wrong with that code without knowing why you want to do what it appears to be doing. Sure, it exposes a bit of interpreter weirdness... but I'm guessing you're approaching your problem in an odd way, and a more Pythonic approach might yield better results.
So that you're aware of it, Python has its own small-object allocator (pymalloc). You can disable it using --without-pymalloc during the configure step.
However, the largest arena is 256KB so that shouldn't be the problem. You can also compile Python with debugging enabled, using --with-pydebug. This would give you more information about memory use.
I suspect your hunch is right, and I'm sure oefe's diagnosis is correct. A list uses contiguous memory, so if your list gets too large for a system arena then you're out of luck. If you're really adventurous you can reimplement PyList to use multiple blocks, but that's going to be a lot of work, since various bits of Python expect contiguous data.