Error saving and loading a list of matrices - python

I have a list "data_list" that I would like to save so I can load it in another script.
First of all I converted it to an array, like this:
data_array = np.array(data_list)
Then I saved it:
np.savez("File", data_array)
Then, in another script, I want to access "File"; so:
a = np.load("File.npz")
b = a['arr_0']
I used this code until two weeks ago and it worked fine. These days I am trying to run my program, but it fails with an error at the line
b = a['arr_0']
"File" is a 300 MB file. The strangest thing is that it has stopped suddenly to work.
Any idea about what can be happened?
Ps: I give you some information. My list contains 180 matrices 511x511. Each matrix contains decimal numbers (I tried to create 180 matrices of zeros, and the error occurs in the same way). If I reduce the number of matrices, the script works fine: in particular down to 130 matrices it is ok, while up to the program doesn't work.
Here is the error message I get:
b = a['arr_0']
  File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 241, in __getitem__
    return format.read_array(value)
  File "C:\Python27\lib\site-packages\numpy\lib\format.py", line 459, in read_array
    array = numpy.fromstring(data, dtype=dtype, count=count)
MemoryError

MemoryError is an out-of-memory condition. This explains why it happens with objects above a certain size: more and bigger arrays, as you would expect, require more memory. What the maximum size is, and why it seems to have changed, is harder to say. It can be highly specific to your system, especially with regard to considerations like:
How much memory (physical RAM and swap space) exists and is available to the operating system
How much virtual memory the OS gives to Python
How much of that you're already using
The implementation of the C library, especially of its malloc function, which can affect how Python uses the memory it is allocated
And possibly quite a few other things.
Per the comments, it seems the biggest problem here is that you are running a 32-bit build of Python. On Windows, 32-bit processes apparently have an effective maximum memory address space of around 2 GB. By my tests, the list of arrays you are using might by itself take around a quarter of that. The fact that your error only comes up when reading the file back in suggests that numpy deserialisation is relatively memory-intensive, but I don't know enough about its implementation to say why that would be. In any case, installing a 64-bit build of Python seems to be your best bet.
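If you want to confirm which build you are actually running before reinstalling, a quick check using only the standard library (nothing here is specific to your script) is:
import sys
import platform

# A 32-bit CPython tops out near 2**31 for sys.maxsize; a 64-bit build goes to 2**63 - 1.
print(platform.architecture()[0])   # e.g. '32bit' or '64bit'
print(sys.maxsize > 2**32)          # True only on a 64-bit build
If switching to 64-bit is not an option right away, saving each matrix to its own .npy file with np.save and reading it back with np.load("file.npy", mmap_mode="r") may also keep peak memory lower, since the data is then read lazily from disk rather than deserialised all at once.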

Related

Indexing very large hex file with python

I'm trying to write a program that parses data from a (very) large file whose rows each contain 8 sets of 16-bit hex values. For instance, one row would look like this:
edfc b600 edfc 2102 81fb 0000 d1fe 0eff
The data files are expected to be anywhere between 1 and 4 TB, so I wasn't sure what the best approach would be. If I load this file using Python's open() function, could this turn out badly? I'm worried about how much of an impact this will have on my memory if I'm loading such a large file just to index through it. Alternatively, if there's a method I can use to load just the section of data I want from the file, that would be ideal, but as far as I know I don't think that's even possible. Is this correct?
Anyway, some idea of how to approach this fairly general problem would be much appreciated!
Found an answer on GitHub. In numpy, there's a function called memmap that works for what I'm doing.
samples = np.memmap("hexdump_samples", mode="r", dtype=np.int16)[100:159]
This didn't seem to cause any issues with the smaller data set I was using, and as far as I understand it shouldn't cause memory problems with the larger files either, since only the parts you actually index get read into memory.
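To make the indexing idea a bit more concrete, here is a rough sketch; it assumes the file is raw binary 16-bit words in native byte order (not ASCII hex text) and that its length is an exact multiple of 8 values:
import numpy as np

# Map the whole file without reading it; only the pages you touch get loaded.
samples = np.memmap("hexdump_samples", mode="r", dtype=np.int16)

# View the flat data as rows of 8 values each.
rows = samples.reshape(-1, 8)

# Indexing one row only pulls that row's bytes in from disk.
print(rows[1000000])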
It depends on your computer hardware and how much RAM you have. Python is an interpreted language with a bunch of safeguards, but I wouldn't risk trying to open that file with it. I would recommend using C or C++; they are good with large amounts of data and memory management. You can then parse the data in bite-sized chunks, maybe 16 MB per chunk. Python is extremely slow and memory-inefficient compared to C.

Maximum size of pandas dataframe

I'm trying to read in a somewhat large dataset using pandas read_csv or read_stata functions, but I keep running into Memory Errors. What is the maximum size of a dataframe? My understanding is that dataframes should be okay as long as the data fits into memory, which shouldn't be a problem for me. What else could cause the memory error?
For context, I'm trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200MB as dta and around 1.2GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.
I'm going to post this answer since it was discussed in the comments, and I've seen this question come up numerous times without an accepted answer.
The MemoryError is intuitive: you are out of memory. But sometimes the solution, or the debugging of this error, is frustrating because you seem to have enough memory and yet the error remains.
1) Check for code errors
This may be a "dumb step" but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something the os module that will search your entire computer and put the output in an excel file)
2) Make your code more efficient
This goes along with Step 1. If something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory-efficient. That's the beauty of Python and open-source languages!
3) Check the total memory of the object
The first step is to check the memory of an object. There are a ton of threads on Stack Overflow about this, so you can search them.
To find the size of an object in bytes you can always use sys.getsizeof():
import sys
print(sys.getsizeof(OBJECT_NAME_HERE))
Now the error might happen before anything is created, but if you read the csv in chunks you can see how much memory is being used per chunk.
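As a rough illustration of that chunked approach (the file name and chunk size below are placeholders, not something from the question):
import pandas as pd

chunks = []
for chunk in pd.read_csv("big_file.csv", chunksize=100000):
    # memory_usage(deep=True) also counts the contents of object/string columns
    print(chunk.memory_usage(deep=True).sum())
    chunks.append(chunk)   # only keep the pieces if you really need the full frame

df = pd.concat(chunks, ignore_index=True)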
4) Check the memory while running
Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes usage to spike beyond the actual size of the finished object and makes the code/process error out. Checking memory in real time is lengthy, but it can be done. IPython is good for that; check its documentation.
Use the code below to see the documentation right in a Jupyter notebook:
%mprun?
%memit?
Sample use:
%load_ext memory_profiler
def lol(x):
    return x
%memit lol(500)
# output --- peak memory: 48.31 MiB, increment: 0.00 MiB
If you need help with magic functions, this is a great post.
5) This one could perhaps come first... but check simple things like your bit version
As in your case, simply switching from a 32-bit to a 64-bit build of Python solved the issue.
Usually the above steps solve my issues.

Python list implementation and pympler measurement

I need to parse a file (~500 MB) and partially load it into a list; I don't need the entire file.
I had a feeling that Python allocates much more memory for the list than the size of the data it contains.
I tried to use pympler's asizeof in order to estimate the overhead, but it fails with MemoryError, which is strange to me; I thought that if I have a list in memory, asizeof should just walk over it, sum the sizes of all entries, and that's it.
Then I took a chunk of the initial file, and I was shocked by the size asizeof reported for the list: it was three times bigger than the file size.
The questions are: if the size given by asizeof is correct, what is a more memory-efficient way to use a list in Python? And how can I check the size of the bigger list when asizeof fails with MemoryError?
It would be helpful to see the code you use for reading/parsing the file and also how you invoke pympler.asizeof.
asizeof and all other facilities in Pympler work inside the profiled process (using Python's introspection facilities to navigate reference graphs). That means that the profiling overhead might become a problem when sizing reference graphs with a large number of nodes (objects), especially if you are already tight on memory before you start profiling. Be sure to set all=False and code=False when calling asizeof. In any case, please file a bug on GitHub. Maybe one can avoid running out of memory in this scenario.
To the best of my knowledge, the sizes reported by asizeof are accurate as long as sys.getsizeof returns the correct size for the individual objects (assuming Python >= 2.6). You could set align=1 when calling asizeof and see if the numbers are more in line with what you expect.
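For example, with a toy list (the keyword arguments are the pympler options mentioned above; the numbers you get back will differ on your system):
from pympler import asizeof

data = [[float(i)] * 100 for i in range(1000)]

# Skip sizing code objects and extra referents to keep the traversal lighter.
print(asizeof.asizeof(data, all=False, code=False))

# align=1 drops the alignment assumption, which often lands closer to a
# naive sum of sys.getsizeof() over the individual objects.
print(asizeof.asizeof(data, align=1))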
You could also check the virtual size of your process via your platform's tools or pympler.process:
from pympler.process import ProcessMemoryInfo
pmi = ProcessMemoryInfo()
print ("Process virtual size [Byte]: " + str(pmi.vsz))
This metric should always be higher than what asizeof reports when sizing objects.

Numpy memory error creating huge matrix

I am using numpy and trying to create a huge matrix.
While doing this, I receive a MemoryError.
Because the matrix itself is not important, I will just show how to easily reproduce the error.
a = 10000000000
data = np.array([float('nan')] * a)
Not surprisingly, this throws a MemoryError.
There are two things I would like to point out:
I really need to create and to use a big matrix
I think I have enough RAM to handle this matrix (I have 24 GB of RAM)
Is there an easy way to handle big matrices in numpy?
Just to be on the safe side, I previously read these posts (which sound similar):
Very large matrices using Python and NumPy
Python/Numpy MemoryError
Processing a very very big data set in python - memory error
P.S. Apparently I have some problems with multiplying and dividing numbers, which made me think that I have enough memory. So I think it is time for me to go to sleep, review my math, and maybe buy some memory.
Maybe in the meantime some genius will come up with an idea of how to actually create this matrix using only 24 GB of RAM.
Why I need this big matrix
I am not going to do any manipulations with this matrix. All I need to do with it is to save it into pytables.
Assuming each floating-point number takes 4 bytes, you'd have
(10000000000 * 4) / (2**30.0) = 37.25290298461914
or roughly 37 gigabytes that you need to hold in memory (and numpy's default float64 actually uses 8 bytes per element, which would double that). So I don't think 24 GB of RAM is enough.
If you can't afford creating such a matrix, but still wish to do some computations, try sparse matrices.
If you wish to pass it to another Python package that uses duck typing, you may create your own class with __getitem__ implementing dummy access.
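A minimal sketch of that duck-typing idea (the class name, its behaviour, and the NaN fill value are made up purely for illustration):
import numpy as np

class LazyNaNVector(object):
    # Quacks like a huge 1-D array of NaNs without ever allocating it.
    def __init__(self, length):
        self.shape = (length,)
        self.dtype = np.dtype('float64')

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Work out how many elements the slice covers (positive step assumed).
            start, stop, step = index.indices(self.shape[0])
            n = max(0, (stop - start + step - 1) // step)
            return np.full(n, np.nan)
        return float('nan')   # single-element access

vec = LazyNaNVector(10000000000)
print(vec[5])        # nan
print(vec[10:14])    # [ nan  nan  nan  nan]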
If you use the PyCharm editor for Python, you can change its memory settings in
C:\Program Files\JetBrains\PyCharm 2018.2.4\bin\pycharm64.exe.vmoptions
You can reduce the amount of memory PyCharm itself takes through this file, so that more megabytes are left for your program. You need to edit these settings:
-Xms1024m
-Xmx2048m
-XX:ReservedCodeCacheSize=960m
For example, you can change them to -Xms512m and -Xmx1024m, and then your program should work,
but it will affect debugging performance in PyCharm.

Memory errors and list limits?

I need to produce very large matrices (Markov chains) for scientific purposes. I perform calculations that I put into a list of 20301 elements (= one row of my matrix). I need all of that data in memory to proceed to the next Markov step, but I can store it elsewhere (e.g. a file) if needed, even if that slows down my Markov chain walk-through. My computer (scientific lab): bi-Xeon, 6 cores/12 threads each, 12 GB memory, OS: Win64.
Traceback (most recent call last):
File "my_file.py", line 247, in <module>
ListTemp.append(calculus)
MemoryError
Example of a calculation result: 9.233747520008198e-102 (yes, it's over 1/9000)
The error is raised when storing the 19766th element:
ListTemp[19766]
1.4509421012263216e-103
If I go further
Traceback (most recent call last):
File "<pyshell#21>", line 1, in <module>
ListTemp[19767]
IndexError: list index out of range
So this list hit a memory error at the 19767th iteration of the loop.
Questions:
Is there a memory limit to a list?
Is it a "by-list limit" or a
"global-per-script limit"?
How to bypass those limits?
Any possibilites in mind?
Will it help to use numpy, python64? What
are the memory limits with them? What
about other languages?
First off, see How Big can a Python Array Get? and Numpy, problem with long arrays
Second, the only real limit comes from the amount of memory you have and how your system stores memory references. There is no per-list limit, so Python will go until it runs out of memory. Two possibilities:
If you are running on an older OS or one that forces processes to use a limited amount of memory, you may need to increase the amount of memory the Python process has access to.
Break the list apart using chunking. For example, do the first 1000 elements of the list, pickle and save them to disk, and then do the next 1000. To work with them, unpickle one chunk at a time so that you don't run out of memory. This is essentially the same technique that databases use to work with more data than will fit in RAM.
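A bare-bones sketch of that chunk-and-pickle pattern (the file names, chunk size, and row function are all placeholders):
import pickle

CHUNK = 1000

def save_in_chunks(make_row, n_rows):
    # Compute rows in groups of CHUNK and pickle each group to its own file.
    buf = []
    for i in range(n_rows):
        buf.append(make_row(i))
        if len(buf) == CHUNK:
            with open('chunk_%d.pkl' % (i // CHUNK), 'wb') as f:
                pickle.dump(buf, f)
            buf = []   # drop the finished chunk from memory
    if buf:   # whatever is left over at the end
        with open('chunk_%d.pkl' % (n_rows // CHUNK), 'wb') as f:
            pickle.dump(buf, f)

def load_chunk(k):
    # Bring one saved group back into memory only when you need it.
    with open('chunk_%d.pkl' % k, 'rb') as f:
        return pickle.load(f)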
The MemoryError exception that you are seeing is the direct result of running out of available RAM. It could be caused either by the 2 GB per-program limit imposed by Windows on 32-bit programs, or by a lack of available RAM on your computer.
You should be able to go beyond the 2 GB by using a 64-bit copy of Python, provided you are using a 64-bit copy of Windows.
The IndexError is raised because Python hit the MemoryError before calculating the entire array, so the list is shorter than you expect. Again, this is a memory issue.
To get around this problem you could try to use a 64-bit copy of Python, or better still, find a way to write your results to a file. To this end, look at numpy's memory-mapped arrays.
You should be able to run your entire set of calculations into one of these arrays, as the actual data will be written to disk with only a small portion of it held in memory.
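A rough sketch of what that could look like for rows of 20301 values (the file name, the row count, and the random data standing in for the real calculation are placeholders):
import numpy as np

n_rows, n_cols = 1000, 20301   # 20301 is the row length from the question

# A disk-backed array: pages are written out to the file as you go, so only a
# small working set stays in RAM at any one time.
rows = np.memmap('markov_rows.dat', dtype='float64', mode='w+', shape=(n_rows, n_cols))

for i in range(n_rows):
    rows[i, :] = np.random.random(n_cols)   # stand-in for the real row calculation

rows.flush()   # the file can later be reopened read-only with mode='r'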
There is no memory limit imposed by Python. However, you will get a MemoryError if you run out of RAM. You say you have 20301 elements in the list. This seems too small to cause a memory error for simple data types (e.g. int), but if each element itself is an object that takes up a lot of memory, you may well be running out of memory.
The IndexError however is probably caused because your ListTemp has got only 19767 elements (indexed 0 to 19766), and you are trying to access past the last element.
It is hard to say what you can do to avoid hitting the limit without knowing exactly what it is that you are trying to do. Using numpy might help. It looks like you are storing a huge amount of data. It may be that you don't need to store all of it at every stage. But it is impossible to say without knowing.
If you want to circumvent this problem you could also use shelve. You would then create files sized to what your machine can handle, and only pull them into RAM when necessary, basically writing to the hard disk and reading the information back in pieces so you can process it.
Create a binary file and check whether the information is already in it; if it is, make a local variable to hold it, otherwise write whatever data you deem necessary.
import shelve

Data = shelve.open('File01')          # disk-backed dictionary of pickled objects
for i in range(0, 100):
    Matrix_Shelve = 'Matrix' + str(i)
    if Matrix_Shelve in Data:
        Matrix_local = Data[Matrix_Shelve]      # pull the stored matrix back into RAM
    else:
        Data[Matrix_Shelve] = 'somethingforlater'   # store a placeholder (or the real data) on disk
Data.close()
Hope it doesn't sound too archaic.
