Python Pandas MemoryError on first run, goes away on rerun - python

I'm running a long Python program that includes a step of reading a file into a Pandas dataframe. The program consistently fails with a MemoryError when it first tries to read the file into memory. When I rerun the failing step (without rerunning the previous parts of the program), there is no MemoryError.
It may be a problem of accumulating lots of previous objects in memory, which aren't present on the rerun. But the amount of memory in play is below the 2 GB limit where Windows starts having problems. In particular, the previous steps of the program leave only about 400 MB in RAM, and the file I'm trying to read takes only about 400 MB.
Any ideas what's causing the MemoryError the first time around?
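One way to test the hypothesis above (that objects left over from earlier steps are what make the first read fail) is to drop references to those objects and force a garbage collection right before the read. The following is a minimal sketch using only the standard library, not a confirmed fix. The names `release_before_read`, `traced_bytes` and `earlier_results` are hypothetical, and the bytearray is just a stand-in for whatever the earlier steps leave behind.

```python
import gc
import tracemalloc


def release_before_read(namespace, names):
    """Drop the named objects from `namespace` and run a full collection.

    Returns the number of names that were actually removed.
    """
    removed = 0
    for name in names:
        if name in namespace:
            del namespace[name]
            removed += 1
    gc.collect()  # also reclaims objects kept alive only by reference cycles
    return removed


def traced_bytes():
    """Return the bytes currently allocated by Python, as seen by tracemalloc."""
    current, _peak = tracemalloc.get_traced_memory()
    return current


tracemalloc.start()
# Hypothetical stand-in for data left over from the earlier steps (20 MB).
state = {"earlier_results": bytearray(20 * 1024 * 1024)}
before = traced_bytes()
removed = release_before_read(state, ["earlier_results", "not_present"])
after = traced_bytes()
tracemalloc.stop()
print(f"removed {removed} object(s), freed about {(before - after) / 1e6:.0f} MB")
```

If the MemoryError disappears once the leftovers are released, the earlier steps are the likely culprit. If it does not, the cause lies elsewhere, for example in how much temporary memory the read itself needs.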

Related

python process creating file with inflated size

I have a Python process which takes a file containing streamed data and converts it into a format ready to load into a database. I have just migrated this process from one Linux GCP VM to another running exactly the same code, but the final output file is nearly 4 times as big: 500 MB vs 2 GB.
When I download the files and manually inspect them, they look exactly the same to the eye.
Any ideas what could be causing this?
Edit: Thanks for the feedback. I traced it back to the input file, which is slightly different (as my stream recording process has also been migrated).
I am now trying to work out why a marginally different file creates such a different output file once it has been processed.
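Since the files look the same to the eye, a byte-level comparison may show what differs. The sketch below does not claim to identify the cause. It only reports the size, a few byte counts (carriage returns, line feeds, NUL bytes) and the offset of the first differing byte. The function names and the two sample files are hypothetical, and the choice of bytes to count is an assumption about what might be invisible in an editor.

```python
import os
import tempfile


def byte_profile(path, chunk_size=1 << 20):
    """Count total bytes, CR, LF and NUL bytes, reading the file in chunks."""
    totals = {"bytes": 0, "cr": 0, "lf": 0, "nul": 0}
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            totals["bytes"] += len(chunk)
            totals["cr"] += chunk.count(b"\r")
            totals["lf"] += chunk.count(b"\n")
            totals["nul"] += chunk.count(b"\x00")
    return totals


def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended without a difference
            offset += len(a)


# Demo with two small hypothetical files that differ only in line endings.
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "old.txt")
    new_path = os.path.join(tmp, "new.txt")
    with open(old_path, "wb") as fh:
        fh.write(b"a,b\nc,d\n")
    with open(new_path, "wb") as fh:
        fh.write(b"a,b\r\nc,d\r\n")
    old_profile = byte_profile(old_path)
    new_profile = byte_profile(new_path)
    diff_offset = first_difference(old_path, new_path)
    same_offset = first_difference(old_path, old_path)

print(old_profile, new_profile, diff_offset)
```

Running the same two functions on the old and new input files, and then on the two output files, narrows down where the extra bytes come from.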

Splitting file into small chunks and processing

I have three files, each containing close to 300k records. I have written a Python script that processes those files with some business logic and creates the output file successfully. This process completes in 5 minutes.
I am using the same script to process a much higher volume of data (all three input files contain about 30 million records). Now the processing takes hours and keeps running for a very long time.
So I am thinking of breaking the file into 100 small chunks based on the last two digits of the unique id and processing them in parallel. Are there any data pipeline packages that I could use to do this?
BTW, I am running this process on my VDI machine.
I am not aware of a dedicated API for this, but you can try multiprocessing or multithreading to process a large volume of data.
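The chunking idea from the question can be sketched with the standard library's multiprocessing module. This is a minimal illustration under stated assumptions: records are `(unique_id, payload)` tuples, `process_bucket` is a placeholder for the real business logic, and every function name here is hypothetical. For CPU-bound work in CPython, separate processes usually help more than threads, because threads share one interpreter lock.

```python
from collections import defaultdict
from multiprocessing import Pool


def bucket_key(unique_id):
    """Return the last two digits of the id as a two-character string."""
    return str(unique_id)[-2:].zfill(2)


def split_into_buckets(records):
    """Group (unique_id, payload) records by the last two digits of the id."""
    buckets = defaultdict(list)
    for unique_id, payload in records:
        buckets[bucket_key(unique_id)].append((unique_id, payload))
    return dict(buckets)


def process_bucket(bucket):
    """Placeholder for the business logic: here it just sums the payloads."""
    return sum(payload for _unique_id, payload in bucket)


def process_in_parallel(records, workers=4):
    """Process each bucket in a worker process; return {bucket_key: result}."""
    buckets = split_into_buckets(records)
    keys = sorted(buckets)
    with Pool(processes=workers) as pool:
        results = pool.map(process_bucket, [buckets[k] for k in keys])
    return dict(zip(keys, results))


if __name__ == "__main__":
    # Hypothetical sample data: ids 0..999, each with a payload of 1.
    sample = [(i, 1) for i in range(1000)]
    totals = process_in_parallel(sample)
    print(len(totals), "buckets processed")
```

With 30 million records it may be better to write each bucket to its own file first and pass file paths to the workers, so that the parent process does not have to hold and pickle all the data at once.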

Spyder hangs after executing Python code

I have a few variables with length greater than 200k items with each item being a dictionary. I extracted this from even more complex variables.
Every time I performed some operation on them and executed the code, Spyder would hang, but after a few minutes it would show/print the result.
Now I have executed another operation and Spyder has hung again. This time, however, it is still not responding. It's been 15+ minutes.
Previously memory usage was 23%, now it is 67%. My laptop has 16GB RAM.
Will I lose my variables if Spyder is closed and opened again? Will I lose my file? The changes in the file haven't been saved.

How can I force Python code to read input files again without rebooting my computer

I am scanning through a large number of files looking for some markers. I am starting to be really confident that once I have run through the code one time Python is not rereading the actual files from disk. I find this behavior strange because I was told that one reason I needed to structure my file access in the manner I have is so that the handle and file content is flushed. But that can't be.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code
import re
test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n',re.IGNORECASE)
no7a = []
for path in path_list:
    path = path.strip()
    with open(path,'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re,string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons, one is that I was thinking about using multi-processing. But if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about the file being modified and not having the most recent version of the file available.
I am tagging this 2.7 because I have no idea if this behavior is persistent across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer) it still takes 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same - I believe given the timing tests I have run and the fact that I do not hear the disk spinning that somehow the files are still accessible to Python.
This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads:
Disable paging file in Windows and fill the RAM up to 90%
Use some tool to disable file caching in Windows like this one.
Run your code on a Linux VM on your Windows machine that has limited RAM. In Linux you can control the caching much better
Make the files much bigger, so that they won't fit in cache
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access then closing it right away, Windows will hold sufficient doubts over the file content and invalidate the cache.
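The timing difference described in the question, and the worry about missing a modified file, can both be checked from Python. The sketch below times two passes over the same paths and compares modification times against an earlier snapshot. It says nothing about how Windows manages its cache internally. The helper names and the temporary demo file are hypothetical, and the demo bumps the modification time with `os.utime` only to simulate a later edit.

```python
import os
import tempfile
import time


def timed_read(paths):
    """Read every file once; return (elapsed_seconds, total_bytes)."""
    start = time.perf_counter()
    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            total += len(fh.read())
    return time.perf_counter() - start, total


def snapshot_mtimes(paths):
    """Record the current modification time (in nanoseconds) of each path."""
    return {path: os.stat(path).st_mtime_ns for path in paths}


def changed_since(paths, snapshot):
    """Return the paths whose modification time differs from the snapshot."""
    return [p for p in paths if os.stat(p).st_mtime_ns != snapshot.get(p)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc.txt")
    with open(path, "wb") as fh:
        fh.write(b"ITEM 7A. QUANT\n")
    paths = [path]

    first_pass = timed_read(paths)   # may hit the disk
    second_pass = timed_read(paths)  # likely served from the OS file cache

    snapshot = snapshot_mtimes(paths)
    unchanged = changed_since(paths, snapshot)

    # Simulate a later edit by moving the modification time forward 10 seconds.
    stamp = snapshot[path] + 10_000_000_000
    os.utime(path, ns=(stamp, stamp))
    changed = changed_since(paths, snapshot)

print(f"first pass {first_pass[0]:.6f}s, second pass {second_pass[0]:.6f}s")
```

On the real file list, a large gap between the first and second pass points to the operating system's cache rather than to Python keeping the files open, which matches the observation that closing IDLE does not slow the rerun down.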

Volatile Loading Times

I am loading large files into memory using Python.
df = pd.read_hdf( filename, 'data' )
Sometimes it takes about a minute. Sometimes it takes a few seconds. It seems to take a few seconds if I run it twice in a row, or if I am sequentially going through all the files in a directory.
I suspect this might have something to do with the way the hard drive works and caching. Is there a way to make this less erratic?
