Memory errors and list limits? - python

I need to produce very large matrices (Markov chains) for scientific purposes. I perform calculations whose results I put in a list of 20301 elements (= one row of my matrix). I need all those data in memory to proceed to the next Markov step, but I can store them elsewhere (e.g. a file) if needed, even if that slows down my Markov chain walk-through. My computer (scientific lab): bi-Xeon, 6 cores/12 threads each, 12 GB memory, OS: Win64
Traceback (most recent call last):
File "my_file.py", line 247, in <module>
ListTemp.append(calculus)
MemoryError
Example of calculus results: 9.233747520008198e-102 (yes, it's over 1/9000)
The error is raised when storing the 19766th element:
ListTemp[19766]
1.4509421012263216e-103
If I go further
Traceback (most recent call last):
File "<pyshell#21>", line 1, in <module>
ListTemp[19767]
IndexError: list index out of range
So this list hit a memory error on the 19767th iteration.
Questions:
Is there a memory limit for a list?
Is it a per-list limit or a global per-script limit?
How can I bypass those limits? Any possibilities in mind?
Would it help to use numpy or 64-bit Python? What are the memory limits with them? What about other languages?

First off, see How Big can a Python Array Get? and Numpy, problem with long arrays
Second, the only real limit comes from the amount of memory you have and how your system stores memory references. There is no per-list limit, so Python will go until it runs out of memory. Two possibilities:
If you are running on an older OS or one that forces processes to use a limited amount of memory, you may need to increase the amount of memory the Python process has access to.
Break the list apart using chunking. For example, do the first 1000 elements of the list, pickle and save them to disk, and then do the next 1000. To work with them, unpickle one chunk at a time so that you don't run out of memory. This is essentially the same technique that databases use to work with more data than will fit in RAM.
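The chunking approach described above might look like this minimal sketch (the chunk size, file path, and function names are my own, not from the question):

```python
import pickle

CHUNK_SIZE = 1000  # arbitrary; tune to your available RAM

def save_chunks(results, path):
    # Write the results list to disk in fixed-size pickled chunks.
    with open(path, "wb") as f:
        for start in range(0, len(results), CHUNK_SIZE):
            pickle.dump(results[start:start + CHUNK_SIZE], f)

def load_chunks(path):
    # Generator: yield one chunk at a time, so only CHUNK_SIZE
    # elements are ever held in memory at once.
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

Working through the chunks one at a time with the generator keeps peak memory bounded by the chunk size rather than the total data size.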

The MemoryError exception that you are seeing is the direct result of running out of available RAM. This could be caused either by the 2 GB per-process limit Windows imposes on 32-bit programs, or by a lack of available RAM on your computer. (This link is to a previous question.)
You should be able to get past the 2 GB limit by using a 64-bit copy of Python, provided you are using a 64-bit copy of Windows.
The IndexError is raised because Python hit the MemoryError before it finished building the entire list. Again, this is a memory issue.
To get around this problem you could try a 64-bit copy of Python, or better still find a way to write your results to a file. To this end, look at numpy's memory-mapped arrays.
You should be able to run your entire set of calculations into one of these arrays, as the actual data will be written to disk with only a small portion of it held in memory.
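A minimal sketch of what such a memory-mapped result store could look like (the file name, the number of steps, and the random stand-in for the real calculation are all assumptions; the row length of 20301 comes from the question):

```python
import numpy as np

n_steps, row_len = 100, 20301  # row_len from the question; n_steps is made up

# Results live on disk; only the pages currently being written stay in RAM.
results = np.memmap("markov_rows.dat", dtype=np.float64,
                    mode="w+", shape=(n_steps, row_len))

for step in range(n_steps):
    row = np.random.rand(row_len)  # stand-in for the real calculation
    results[step, :] = row

results.flush()  # ensure everything is written to disk
```

Each completed row is pushed out to the backing file, so the memory footprint stays roughly constant regardless of how many Markov steps you run.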

There is no memory limit imposed by Python. However, you will get a MemoryError if you run out of RAM. You say you have 20301 elements in the list. This seems too small to cause a memory error for simple data types (e.g. int), but if each element itself is an object that takes up a lot of memory, you may well be running out of memory.
The IndexError, however, is probably raised because your ListTemp has only 19767 elements (indexed 0 to 19766), and you are trying to access past the last element.
It is hard to say what you can do to avoid hitting the limit without knowing exactly what it is that you are trying to do. Using numpy might help. It looks like you are storing a huge amount of data. It may be that you don't need to store all of it at every stage. But it is impossible to say without knowing.

If you want to circumvent this problem you could also use the shelve module. You would create files sized to your machine's capacity to handle them, and only put them in RAM when necessary, basically writing to the hard disk and pulling the information back in pieces so you can process it.
Create a binary file and check whether the information you need is already in it: if yes, load it into a local variable; if not, write the data you need.
import shelve

Data = shelve.open('File01')
for i in range(0, 100):
    Matrix_Shelve = 'Matrix' + str(i)
    if Matrix_Shelve in Data:
        Matrix_local = Data[Matrix_Shelve]
    else:
        Data[Matrix_Shelve] = 'somethingforlater'
Data.close()
Hope it doesn't sound too archaic.

Related

Numpy's memmap acting strangely?

I am dealing with large numpy arrays and I am trying out memmap as it could help.
big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000))
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
for i in tqdm(range(len(big_matrix))):
    modified_row = get_row(i)
    big_matrix[i, :] = modified_row
At this point now, I have a big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example I want to multiply column wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights has the same length.
It just hangs and throws an out-of-memory error as my RAM and swap are all used up.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
    temp = big_matrix[:, i].tolist()
    temp = np.array(temp) * weights
The above loads only one column into memory and multiplies it with the weights.
Then I would save that column back into big_matrix.
But even with one column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insight into why my code is not working.
Do I need to flush the memmap every time I change it?
np.memmap maps a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for fast reuse. The memory is generally not flushed unless it is reclaimed (e.g. by another process or the same process). When this happens, the OS typically (partially) flushes data to the storage device and (partially) frees the physical memory used for the mapping. That being said, this behaviour depends on the actual OS. It works that way on Windows. On Linux, you can use madvise to tune this behaviour, but madvise is a low-level C function not yet supported by Numpy (though it is apparently supported for Python; see this issue for more information). Actually, Numpy does not even support closing the memmapped space (which is leaky). The solution is generally to flush data manually so as not to lose it. There are alternative solutions, but none of them is great yet.
big_matrix * weights[:, np.newaxis]
It just hangs and throws an out-of-memory error as my RAM and SWAP are all used
This is normal, since Numpy creates a new temporary array stored in RAM. There is no way to tell Numpy to store a temporary array on the storage device. That being said, you can tell Numpy where the output data is stored using the out parameter of some functions (e.g. np.multiply supports it). The output array can be created with memmap so as not to use too much memory (subject to the behaviour of the OS).
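For illustration, a small sketch of the out= approach with a memmapped destination (the sizes and file names are toy placeholders, far smaller than the matrix in the question):

```python
import numpy as np

n = 1000  # toy size; the point is that no full in-RAM temporary is created
big_matrix = np.memmap("big.dat", dtype=np.float16, mode="w+", shape=(n, n))
big_matrix[:] = 1.0
weights = np.arange(n, dtype=np.float16)

# The destination is memmapped too, so the product is written to disk
# instead of materialising a new array in RAM.
out = np.memmap("out.dat", dtype=np.float16, mode="w+", shape=(n, n))
np.multiply(big_matrix, weights[:, np.newaxis], out=out)
out.flush()
```

Since np.memmap is an ndarray subclass, it is accepted anywhere out= expects an array.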
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use an HDD and not an SSD. Indeed, the array is stored (virtually) contiguously on the storage device, so big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS performs an IO request to the storage device. Storage devices are optimized for contiguous reads, so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally fetch at least a page (typically 4096 bytes, that is, 512 times more than what is actually needed). Moreover, there is a limit on the number of IO requests that can be completed per second: HDDs can typically do about 20-200 IO requests per second, while the fastest NVMe SSDs reach 100_000-600_000 IO requests per second.
Note that the cache helps avoid reloading data for the next column, unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000, 1_000_000) causes up to 1_000_000*1_000_000 = 1_000_000_000_000 fetches, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient, since the processor cannot do that (due to the limited number of entries in the TLB). This typically results in TLB misses, that is, expensive kernel calls for each item to be read. Because a kernel call typically takes (at least) about ~1 µs on a mainstream PC, this means more than a week to do the whole computation.
If you want to efficiently read columns, then you need to read large chunks of columns at once. For example, you certainly need at least several hundred columns per read, even on a fast NVMe SSD. For an HDD, it is at least several tens of thousands of columns to get a proper throughput. This means you certainly cannot read full-length columns efficiently due to the high amount of RAM required. Using another data layout (tiling + transposed data) is critical in this case.
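A sketch of the column-block idea (block size and matrix dimensions are toy values; the real block size should be tuned to the storage device as discussed above):

```python
import numpy as np

rows, cols, block = 1000, 1000, 200  # toy stand-ins for the real sizes
m = np.memmap("mat.dat", dtype=np.float16, mode="w+", shape=(rows, cols))
m[:] = 1.0
weights = np.full(rows, 2.0, dtype=np.float16)

# Process `block` columns per pass: each row then contributes `block`
# contiguous values per read instead of a single strided 2-byte element.
for start in range(0, cols, block):
    chunk = np.asarray(m[:, start:start + block])  # one bulk read into RAM
    chunk *= weights[:, np.newaxis]
    m[:, start:start + block] = chunk               # one bulk write back
m.flush()
```

The per-pass RAM cost is rows * block * itemsize, so the block width is the knob that trades memory against IO efficiency.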

Load large data without iterators or chunks

I want to train on a large file of data (7 GB): 800 rows, 5 million columns. So I want to load these data and put them in a form I can use (2D list or array).
The problem is that when I load the data and try to store them, they use all my memory (12 GB) and the process just stops at row 500.
I heard a lot about how to use this kind of data, like using chunks and iterators, but I would like to load them entirely in the memory so I can do cross-validation.
I tried to use pandas to help me but the problem is the same.
Are there issues with loading and storing the entire 7 GB of data as I want to? Or any other ideas that could help me?
You can try getting a swap or page file. Depending on your operating system, you can use virtual memory to allow your system to address more objects in a single process than will fit in physical memory. Depending on how large the working set is, performance may not suffer that much, or it may be completely dreadful.
That said, it is almost certain that getting more memory or using some partitioning strategy (similar to what you are calling chunking) is a better solution for your problem.
On Windows, take a look here for information on how to adjust the page file size. For Red Hat Linux, try this link for information on adding swap.

Error saving and loading a list of matrices

I have a list "data_list", and I would save it in order to load it in another script.
First of all I converted it in an array, in this way:
data_array = np.array(data_list)
Then I saved it:
np.savez("File", data_array)
Then, in another script I want to access to "File"; so:
a = np.load("File.npz")
b = a['arr_0']
I used this code until two weeks ago and it worked fine. These days I am trying to run my program, but it ends with an error at the line
b = a['arr_0']
"File" is a 300 MB file. The strangest thing is that it suddenly stopped working.
Any idea about what can be happened?
PS: some more information. My list contains 180 matrices of size 511x511. Each matrix contains decimal numbers (I tried creating 180 matrices of zeros, and the error occurs the same way). If I reduce the number of matrices, the script works fine: in particular, down to 130 matrices it is OK, while above that the program doesn't work.
Here I report the error message:
b = a['arr_0']
  File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 241, in __getitem__
    return format.read_array(value)
  File "C:\Python27\lib\site-packages\numpy\lib\format.py", line 459, in read_array
    array = numpy.fromstring(data, dtype=dtype, count=count)
MemoryError
MemoryError is an out-of-memory condition. This explains why it happens with objects of at least a certain size: more and bigger arrays, as you would expect, require more memory. What the maximum size is, and why it seems to have changed, is harder to say. This can be highly specific to your system, especially in regard to considerations like:
How much memory (physical RAM and swap space) exists and is available to the operating system
How much virtual memory the OS gives to Python
How much of that you're already using
The implementation of the C library, especially of its malloc function, which can affect how Python uses the memory it is allocated
And possibly quite a few other things.
Per the comments, it seems the biggest problem here is that you are running a 32-bit build of Python. On Windows, 32-bit processes apparently have an effective maximum memory address space of around 2 GB. By my tests, the list of arrays you are using might by itself take around a quarter of that. The fact that your error only comes up when reading the file back in suggests that numpy deserialisation is relatively memory intensive, but I don't know enough about its implementation to be able to say why that would be. In any case, it seems that installing a 64-bit build of Python is your best bet.
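As a quick sanity check on the sizes involved, the 180 matrices of 511x511 float64 values from the question work out to roughly 360 MiB, before counting any temporary copies made during deserialisation:

```python
import numpy as np

# 180 matrices of 511x511 float64 values, as in the question
n_matrices, side = 180, 511
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per value
bytes_needed = n_matrices * side * side * itemsize
mib_needed = bytes_needed / 2**20  # roughly 360 MiB of raw data
```

If reading the file back requires even one extra full-size temporary on top of the already-loaded data, that quickly becomes significant against a ~2 GB 32-bit address space.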

Maximum size of pandas dataframe

I'm trying to read in a somewhat large dataset using pandas read_csv or read_stata functions, but I keep running into Memory Errors. What is the maximum size of a dataframe? My understanding is that dataframes should be okay as long as the data fits into memory, which shouldn't be a problem for me. What else could cause the memory error?
For context, I'm trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200MB as dta and around 1.2GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.
I'm going to post this answer since it was discussed in the comments. I've seen this come up numerous times without an accepted answer.
The MemoryError itself is intuitive: you are out of memory. But sometimes debugging this error is frustrating, because you seem to have enough memory and yet the error remains.
1) Check for code errors
This may be a "dumb step" but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something in the os module that will search your entire computer and put the output in an Excel file).
2) Make your code more efficient
This goes along the lines of Step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open source languages!
3) Check The Total Memory of the object
The first step is to check the memory of an object. There are a ton of threads on Stack about this, so you can search them. Popular answers are here and here
To find the size of an object in bytes you can always use sys.getsizeof():
import sys
print(sys.getsizeof(OBJECT_NAME_HERE))
Now the error might happen before anything is created, but if you read the csv in chunks you can see how much memory is being used per chunk.
4) Check the memory while running
Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes memory to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but can be done. IPython is good for that; check their documentation.
use the code below to see the documentation straight in Jupyter Notebook:
%mprun?
%memit?
Sample use:
%load_ext memory_profiler
def lol(x):
    return x
%memit lol(500)
#output --- peak memory: 48.31 MiB, increment: 0.00 MiB
If you need help with magic functions, this is a great post.
5) This one may be first... but check for simple things like the bit version
As in your case, simply switching the version of Python you were running solved the issue.
Usually the above steps solve my issues.

Numpy memory error creating huge matrix

I am using numpy and trying to create a huge matrix.
While doing this, I receive a memory error.
Because the matrix itself is not important, I will just show how to easily reproduce the error:
import numpy as np

a = 10000000000
data = np.array([float('nan')] * a)
Not surprisingly, this throws a MemoryError.
There are two things I would like to tell:
I really need to create and to use a big matrix
I think I have enough RAM to handle this matrix (I have 24 GB of RAM)
Is there an easy way to handle big matrices in numpy?
Just to be on the safe side, I previously read these posts (which sounds similar):
Very large matrices using Python and NumPy
Python/Numpy MemoryError
Processing a very very big data set in python - memory error
P.S. Apparently I have some problems with multiplication and division of numbers, which made me think that I had enough memory. So I think it is time for me to go to sleep, review my math, and maybe buy some memory.
Maybe during this time some genius will come up with an idea for how to actually create this matrix using only 24 GB of RAM.
Why I need this big matrix
I am not going to do any manipulations with this matrix. All I need to do with it is to save it into pytables.
Assuming each floating point number is 4 bytes (float32), you'd have
(10000000000 * 4) / (2**30.0) = 37.25290298461914
or about 37.3 gigabytes to store in memory. In fact np.array([float('nan')] * a) produces float64 values (8 bytes each), which doubles that to about 74.5 GB. Either way, 24 GB of RAM is not enough.
If you can't afford creating such a matrix, but still wish to do some computations, try sparse matrices.
If you wish to pass it to another Python package that uses duck typing, you may create your own class with __getitem__ implementing dummy access.
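The duck-typing suggestion could look like the following sketch; the class name and the NaN fill value are my own illustration, not part of the original answer:

```python
class LazyNaNMatrix:
    """Duck-typed stand-in: indexing works, but nothing is allocated."""

    def __init__(self, shape):
        self.shape = shape

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, index):
        # Every element reads as NaN; there is no storage behind it.
        return float("nan")

big = LazyNaNMatrix((10_000_000_000,))
```

Any consumer that only indexes and asks for the length will accept this object, yet its memory cost is constant regardless of the nominal size.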
If you use the PyCharm editor for Python, you can change its memory settings in
C:\Program Files\JetBrains\PyCharm 2018.2.4\bin\pycharm64.exe.vmoptions
You can reduce the memory reserved for PyCharm itself in this file, so that your program can allocate more megabytes. You must edit these lines:
-Xms1024m
-Xmx2048m
-XX:ReservedCodeCacheSize=960m
For example, change them to -Xms512m and -Xmx1024m, and your program will have more memory to work with,
but it will affect debugging performance in PyCharm.
