I am building a web scraper that stores data retrieved from four different websites into a tuple array. I later iterate through that array and save the entire lot as both CSV and Excel.
Are tuple arrays, or arrays in general, limited by the processor's RAM/disk space?
Thanks
According to the docs, this is given by sys.maxsize:
sys.maxsize
An integer giving the maximum value a variable of type Py_ssize_t can
take. It’s usually 2**31 - 1 on a 32-bit platform and 2**63 - 1 on a
64-bit platform.
And interestingly enough, the Python 3 data model documentation gives more implementation details under object.__len__:
CPython implementation detail: In CPython, the length is required to
be at most sys.maxsize. If the length is larger than sys.maxsize
some features (such as len()) may raise OverflowError.
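For instance, a quick sanity check of this limit on your own interpreter (a minimal sketch):
import sys

# Theoretical upper bound on container lengths (Py_ssize_t); in practice
# you will exhaust RAM long before you ever reach it.
print(sys.maxsize)                 # 9223372036854775807 on a 64-bit build
print(sys.maxsize == 2**63 - 1)    # True on 64-bit platforms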
I believe tuples and lists are limited by the size of the machine's virtual memory, unless you're on a 32-bit system, in which case you're limited by the smaller word size. Also, lists are over-allocated by roughly 12% each time they outgrow their current buffer, so there's a little overhead there as well.
If you're concerned you're going to run out of virtual memory, it might be a good idea to write to a file or files instead.
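For example, here is a minimal sketch of streaming scraped rows straight to a CSV file instead of accumulating everything in one big tuple array first (the scrape_all_sites generator, file name, and column names are hypothetical placeholders, not your actual code):
import csv

def scrape_all_sites():
    # Placeholder for the real scraper: yield one tuple per scraped record.
    yield ("example.com", "https://example.com/a", "Page A")
    yield ("example.com", "https://example.com/b", "Page B")

with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["site", "url", "title"])    # header row
    for row in scrape_all_sites():
        writer.writerow(row)    # each row goes straight to disk; nothing accumulates in RAM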
Related
I have a huge number of 128-bit unsigned integers that need to be sorted for analysis (around a trillion of them!).
The research I have done on 128-bit integers has led me down a bit of a blind alley: NumPy doesn't seem to fully support them, and the internal sorting functions are memory intensive (using lists).
What I'd like to do is load, for example, a billion 128-bit unsigned integers into memory (16GB if just binary data) and sort them. The machine in question has 48GB of RAM so should be OK to use 32GB for the operation. If it has to be done in smaller chunks that's OK, but doing as large a chunk as possible would be better. Is there a sorting algorithm that Python has which can take such data without requiring a huge overhead?
I can sort 128-bit integers using the .sort method for lists, and it works, but it can't scale to the level that I need. I do have a C++ version that was custom written to do this and works incredibly quickly, but I would like to replicate it in Python to accelerate development time (and I didn't write the C++ and I'm not used to that language).
Apologies if there's more information required to describe the problem, please ask anything.
NumPy doesn't support 128-bit integers, but if you use a structured dtype composed of high and low unsigned 64-bit chunks, those will sort in the same order as the 128-bit integers would:
arr.sort(order=['high', 'low'])
As for how you're going to get an array with that dtype, that depends on how you're loading your data in the first place. I imagine it might involve calling ndarray.view to reinterpret the bytes of another array. For example, if you have an array of dtype uint8 whose bytes should be interpreted as little-endian 128-bit unsigned integers, on a little-endian machine:
arr_structured = arr_uint8.view([('low', 'uint64'), ('high', 'uint64')])
So that might be reasonable for a billion ints, but you say you've got about a trillion of these. That's a lot more than an in-memory sort on a 48GB RAM computer can handle. You haven't asked for something to handle the whole trillion-element dataset at once, so I hope you already have a good solution in mind for merging sorted chunks, or for pre-partitioning the dataset.
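As a sanity check, here is a small sketch (with randomly generated values standing in for your real data, and assuming a little-endian machine) showing that sorting the structured view gives the same order as sorting the 128-bit values themselves:
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demo data: random 128-bit values packed as little-endian bytes.
values = [int.from_bytes(rng.bytes(16), "little") for _ in range(1000)]
raw = b"".join(v.to_bytes(16, "little") for v in values)

# View the bytes as (low, high) unsigned 64-bit pairs and sort on (high, low).
arr = np.frombuffer(raw, dtype=[("low", "uint64"), ("high", "uint64")]).copy()
arr.sort(order=["high", "low"])

# Reassemble the 128-bit values and compare against Python's own sort.
recovered = [int(lo) | (int(hi) << 64) for lo, hi in zip(arr["low"], arr["high"])]
assert recovered == sorted(values)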
I was probably expecting too much from Python, but I'm not disappointed. A few minutes of coding allowed me to create something (using built-in lists) that can sort a hundred million uint128 items on an 8GB laptop in a couple of minutes.
Given a large number of items to be sorted (1 trillion), it's clear that putting them into smaller bins/files upon creation makes more sense than trying to sort huge numbers in memory. The potential issues created by appending data to thousands of files in 1MB chunks (fragmentation on spinning disks) are less of a worry, because sorting each of those fragmented files produces a sequential file that will be read many times (each fragmented file is written once and read once).
The benefits of development speed of Python seem to outweigh the performance hit versus C/C++, especially since the sorting happens only once.
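For anyone following the same route, here is a minimal sketch of such a chunk-and-merge approach (the raw 16-byte little-endian record layout and the function names are my own assumptions, not the poster's actual code):
import heapq

def read_uint128s(path):
    # Yield 128-bit values from a file of raw 16-byte little-endian records.
    with open(path, "rb") as f:
        while chunk := f.read(16):
            yield int.from_bytes(chunk, "little")

def sort_chunk(path):
    # Sort one chunk file in memory and rewrite it in place.
    values = sorted(read_uint128s(path))
    with open(path, "wb") as f:
        for v in values:
            f.write(v.to_bytes(16, "little"))

def merge_chunks(chunk_paths, out_path):
    # k-way merge of already-sorted chunk files (keeps one open file per chunk).
    with open(out_path, "wb") as out:
        for v in heapq.merge(*(read_uint128s(p) for p in chunk_paths)):
            out.write(v.to_bytes(16, "little"))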
Large numpy array (over 4GB) with npy file and memmap function
I was using the numpy package for array calculations, and I read https://docs.scipy.org/doc/numpy/neps/npy-format.html
In "Format Specification: Version 2.0" it says that, for .npy files, the "version 2.0 format extends the header size to 4 GiB".
My questions are:
What is the header size? Does that mean I can only save a numpy.array of at most 4GB into an .npy file? How large can a single array go?
I also read https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.memmap.html
where it states that "Memory-mapped files cannot be larger than 2GB on 32-bit systems".
Does that mean numpy.memmap's limitation is based on the memory of the system? Is there any way to avoid such a limitation?
Further, I read that we can choose the dtype of the array, where the best resolution is "complex128". Is there any way to "use" and "save" elements with more precision on a 64-bit computer (more precise than complex128 or float64)?
The previous header-size field was 16 bits wide, allowing headers smaller than 64 KiB. Because the header describes the structure of the data and doesn't contain the data itself, this is not a huge concern for most people. Quoting the notes, "This can be exceeded by structured arrays with a large number of columns." So to answer the first question: the header size limit is not the array size limit, because the data comes after the header. The format doesn't specify a data size limit.
Memory map capacity is dependent on operating system as well as machine architecture. Nowadays we've largely moved to flat but typically virtual address maps, so the program itself, stack, heap, and mapped files all compete for the same space, in total 4GiB for 32 bit pointers. Operating systems frequently partition this in quite large chunks, so some systems might only allow 2GiB total for user space, others 3GiB; and often you can map more memory than you can allocate otherwise. The memmap limitation is more closely tied to the operating system in use than the physical memory.
Non-flat address spaces, such as using distinct segments on OS/2, could allow larger usage. The cost is that a pointer is no longer a single word. PAE, for instance, supplies a way for the operating system to use more memory but still leaves processes with their own 32-bit limits. Typically it's easier nowadays to use a 64-bit system, allowing memory spaces up to 16 exabytes. Because data sizes have grown a lot, we also handle them in larger pieces, such as 4MiB or 16MiB allocations rather than the classic 4KiB pages or 512B sectors. Physical memory typically has more practical limits.
Yes, there are types with more precision than 64-bit floating point; in particular, 64-bit integers. This effectively uses a larger mantissa by sacrificing all of the exponent. complex128 is two 64-bit floats; it doesn't have higher precision, just a second component. There are types that can grow arbitrarily precise, such as Python's long integers (long in Python 2, int in Python 3) and fractions, but NumPy generally doesn't delve into those because they come with matching storage and computation costs. A basic property of NumPy arrays is that they can be addressed using index calculations, since the element size is consistent.
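A short hedged sketch of those points (the machine epsilon of complex128 is the same as float64's, and arbitrary precision is only available through Python objects, at the cost of NumPy's fixed element size):
import numpy as np
from fractions import Fraction

# complex128 is two float64s; its components have ordinary float64 precision.
print(np.finfo(np.float64).eps)       # ~2.22e-16
print(np.finfo(np.complex128).eps)    # same machine epsilon

# Arbitrary precision is possible with dtype=object, but such arrays lose the
# fixed element size (and speed) that ordinary NumPy arrays rely on.
exact = np.array([Fraction(1, 3), Fraction(2, 7)], dtype=object)
print(exact.sum())                    # Fraction(13, 21), computed exactly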
This is mostly a question out of curiosity. I noticed that the numpy test suite contains tests for 128 bit integers, and the numerictypes module refers to int128, float256 (octuple precision?), and other types that don't seem to map to numpy dtypes on my machine.
My machine is 64bit, yet I can use quadruple 128bit floats (but not really). I suppose that if it's possible to emulate quadruple floats in software, one can theoretically also emulate octuple floats and 128bit ints. On the other hand, until just now I had never heard of either 128bit ints or octuple precision floating point before. Why is there a reference to 128bit ints and 256bit floats in numpy's numerictypes module if there are no corresponding dtypes, and how can I use those?
This is a very interesting question, and there are probably reasons related to Python, to computing, and/or to hardware. Without trying to give a full answer, here is where I would start...
First, note that the types are defined by the language and can be different from your hardware architecture. For example, you could even have doubles on an 8-bit processor. Of course, any arithmetic then involves multiple CPU instructions, making the computation much slower. Still, if your application requires it, it might be worth it or even required (better late than wrong, especially if, say, you are running a simulation of bridge stability...). So where is 128-bit precision required? Here's the Wikipedia article on it...
One more interesting detail is that saying a computer is, say, 64-bit does not fully describe the hardware. There are a lot of pieces that can each have (and at various times have had) different widths: the computational registers in the CPU, the memory addressing scheme / memory registers, and the various buses, the most important being the bus from the CPU to memory.
-The ALU (arithmetic and logic unit) has registers that do the calculations. Your machine is 64-bit (I'm not sure whether that also means it could do two 32-bit calculations at the same time). This is clearly the most relevant quantity for this discussion. A long time ago, you could go out and buy a co-processor to speed up calculations of higher precision...
-The registers that hold memory addresses limit the memory the computer can see (directly); that is why computers with 32-bit memory registers can only see 2^32 bytes (approx. 4 GB). Notice that with 16 bits this becomes 64 KB, which is very low. The OS can find ways around this limit, but not for a single program, so no program on a 32-bit computer can normally have more than 4 GB of memory.
-Notice that those limits are about bytes, not bits. That is because when addressing and loading from memory we load whole bytes. In fact, the way this is done, loading one byte (8 bits) or eight (64 bits, the bus width on your computer) takes the same time: I ask for an address, and then get all the bits at once through the bus.
It can be that in a given architecture these quantities are not all the same number of bits.
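As a quick way to see which of these extended names your own NumPy build actually exposes (a small sketch; the exact set depends on your platform and compiler):
import numpy as np

# A typical x86-64 Linux build has float128 (80-bit extended precision, padded),
# but no int128 or float256.
for name in ("int128", "uint128", "float96", "float128", "float256", "complex256"):
    print(name, hasattr(np, name))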
NumPy is amazingly powerful and can handle numbers much bigger than the internal CPU representation (e.g. 64 bit).
In the case of a dynamic type, it stores the number in an array. It can extend the memory block too; that is why you can have an integer with 500 digits. This dynamic type is called a bignum. In older Python versions it was the type long. In newer Python (3.0+) there is only long, which is called int and supports an almost arbitrary number of digits (-> bignum).
If you specify a data type (int32, for example), then you specify the bit length and bit format, i.e. which bits in memory stand for what. Example:
dt = np.dtype(np.int32) # 32-bit integer
dt = np.dtype(np.complex128) # 128-bit complex floating-point number
Look in: https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
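A small sketch of the difference (Python's arbitrary-precision int versus a fixed-width NumPy dtype):
import numpy as np

# Python 3 ints are bignums, so they never overflow:
print(2 ** 500)                          # a 151-digit integer, no problem

# A fixed-width dtype such as int32 wraps around at its bit limit instead:
a = np.array([2**31 - 1], dtype=np.int32)
print(a + 1)                             # wraps to -2147483648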
I am iterating over 80 million lines in a 2.5 GB file to create a list of offsets for the location of the start of each line. The memory slowly increases as expected until I hit around line 40 million, and then rapidly increases by 1.5 GB in 3-5 seconds before the process exits due to lack of memory.
After some investigation, I discovered that the blow-up occurs around the time when the current offset (curr_offset) is around 2 billion, which happens to be around my sys.maxint (2^31-1).
My questions are:
Do numbers greater than sys.maxint require substantially more memory to store? If so, why? If not, why would I be seeing this behavior?
What factors (e.g. which Python, which operating system) determine sys.maxint?
On my 2010 MacBook Pro using 64-bit Python, sys.maxint is 2^63-1.
On my Windows 7 laptop using 64-bit IronPython, sys.maxint is the smaller 2^31-1. Same with 32-bit Python. For various reasons, I can't get 64-bit Python on my Windows machine right now.
Is there a better way to create this list of offsets?
The code in question:
f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    offsets.append(curr_offset)
    curr_offset += len(line)
f.close()
Integers larger than sys.maxint require more memory because they're stored as longs. If your sys.maxint is just 2GB, you're using a 32-bit build -- download, install, and use a 64-bit build, and you'll avoid the problem. Your code looks fine!
Here is a solution that works even with 32-bit Python versions: store the lengths of the lines (they are small), convert into a NumPy array of 64-bit integers, and only then calculate offsets:
import numpy
with open('some_file', 'rb') as input_file:
    lengths = list(map(len, input_file))  # list() so this also works on Python 3, where map() is lazy
offsets = numpy.array(lengths, dtype=numpy.uint64).cumsum()
where cumsum() calculates the cumulative sum of the line lengths. Note that this gives the offset just past each line (i.e. the start of the following line); if you need the offset of the start of each line, prepend a 0 and drop the last element. 80 M lines will give a manageable 8*80 = 640 MB array of offsets.
The building of the lengths list can even be bypassed by building the array of lengths with numpy.fromiter():
import numpy
with open('some_file', 'rb') as input_file:
    offsets = numpy.fromiter((len(line) for line in input_file), dtype=numpy.uint64).cumsum()
I suspect it would be hard to find a much faster method, because using a single numeric type (64-bit integers) makes NumPy arrays faster than Python lists.
An offset in a 2.5 GB file should never need to be more than eight bytes. Indeed, a signed 64-bit integer is maximally 9223372036854775807, much greater than 2.5G.
If you have 80 million lines, you should require no more than 640 MB to store an array of 80M offsets.
I would investigate whether something is buggy in your code or in Python. Perhaps try a different container (such as an explicit numpy array of 64-bit integers), a preinitialized list, or even a different language altogether to store and process your offsets, such as an off_t in C with appropriate compile flags.
(If you're curious and want to look at demo code, I wrote a program in C called sample that stores 64-bit offsets to newlines in an input file, to be able to do reservoir sampling on a scale larger than GNU sort.)
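If you want to stay in the standard library, a hedged alternative is the array module (Python 3.3+ for the 'q' typecode), which stores every offset in a fixed 8 bytes even on a 32-bit build:
from array import array

offsets = array('q')              # signed 64-bit integers, 8 bytes each
curr = 0
with open('some_file', 'rb') as f:
    for line in f:
        offsets.append(curr)      # start-of-line offset
        curr += len(line)
print(len(offsets), "offsets,", len(offsets) * offsets.itemsize, "bytes")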
Appending to a list will reallocate the list's buffer once it passes the capacity of the current buffer. I don't know specifically how Python does it, but a common method is to allocate 1.5x to 2x the previous size of the buffer. Because growth is geometric, it's normal to see the memory requirement jump by large amounts near the end. It might also be that the size of the list is simply too large overall; a quick test would be to replace curr_offset += len(line) with curr_offset += 1 and see whether you get the same behavior.
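You can watch this over-allocation happen with a small sketch using sys.getsizeof (which reports the allocated size of the list object itself, not of its elements):
import sys

# The list's buffer grows in jumps as elements are appended.
lst = []
last = sys.getsizeof(lst)
print("len=0   allocated bytes =", last)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:              # a reallocation just happened
        print(f"len={len(lst):<3} allocated bytes = {size}")
        last = size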
If you really can't use a 64-bit version of Python, you can hold your calculated offsets in a NumPy array of numpy.uint64 numbers (maximum value of 2**64-1). This is a little inconvenient, because you have to dynamically extend the array when it reaches capacity, but this would solve your problem and would run on any version of Python.
PS: A more convenient NumPy-based solution, which does not require dynamically resizing the array of offsets, is given in my other answer.
Yes. Above some threshold, Python represents large numbers as bignums, and those take more space.
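For a concrete sense of the cost, a small sketch (Python 3, where every int is a bignum whose storage grows with its magnitude; sizes are for a 64-bit CPython build):
import sys

print(sys.getsizeof(1))          # ~28 bytes
print(sys.getsizeof(2**31))      # slightly larger
print(sys.getsizeof(2**100))     # larger still: more internal digits are stored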
I'm quite new to python, so I'm doing my usual of going through Project Euler to work out the logical kinks in my head.
Basically, I need the largest list size possible, i.e. range(1, n), without overflowing.
Any ideas?
Look at get_len_of_range and get_len_of_range_longs in the CPython builtin module source.
Summary: You'll get an OverflowError if the list has more elements than can fit in a signed long. On 32-bit Python that's 2**31 - 1, and on 64-bit Python that's 2**63 - 1. Of course, you will hit a MemoryError for values well below that.
The size of your lists is only limited by your memory. Note that, depending on your version of Python, range(1, 9999999999999999) needs only a few bytes of RAM, since it only ever creates one element of the virtual sequence it returns at a time.
If you want to instantiate the list, use list(range(1, n)) (this materializes the virtual sequence into a real list).
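A small sketch of that laziness in Python 3 (the exact sizes depend on the build):
import sys

# range stores only start, stop, and step, not the elements themselves.
r = range(1, 10**15)
print(sys.getsizeof(r))          # ~48 bytes, regardless of the length
print(len(r), r[123456789])      # length and indexing are computed on demand
# list(r) would materialize every element, so memory becomes the real limit.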