I am loading large files into memory using Python.
import pandas as pd
df = pd.read_hdf(filename, 'data')
Sometimes it takes about a minute; sometimes only a few seconds. It tends to take a few seconds if I run it twice in a row, or if I am going through all the files in a directory sequentially.
I suspect this has something to do with the way the hard drive works and caching. Is there a way to make this less erratic?
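One way to make the timing more predictable is to control when the slow, cold-cache read happens. A minimal sketch (assuming the file fits in RAM): pre-read the raw bytes once so the OS page cache is warm before pd.read_hdf runs.

```python
def warm_cache(path, chunk_size=1 << 20):
    """Read the file once in large chunks, discarding the data.

    This pulls the file into the OS page cache, so a following
    pd.read_hdf() call is served from RAM at a predictable speed.
    chunk_size is illustrative; anything in the megabyte range works.
    """
    with open(path, 'rb') as f:
        while f.read(chunk_size):
            pass

# warm_cache(filename)                 # cold read, runs at disk speed
# df = pd.read_hdf(filename, 'data')   # warm read, served from cache
```

This does not make the cold read faster; it just moves the variability to a step you control.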
I made a program that edits some files in a game; it reads byte data from files and stores the data in a Python dict.
The program reads 30+ files, all containing the same type of data.
I had the program process the files one by one: for each it saves the byte data into a Python array, then, following a structure, reads all the bytes in that file, then continues to the next file, and so on.
That process takes about 1 minute to complete.
So I created threads: one thread per file, each doing the same work, all at the same time.
But it takes the same time, about 1 minute.
How can this be possible?
The problem is not the read speed from the HDD, because the reading is done in less than 1 second; after that, the byte data each thread processes is already in memory. Why does it take so long, and why is it not faster with threading?
Another question: is it normal that the Python dict where I store all the info afterwards is very large in memory, like 3 GB? There are around 10k keys in the dict and each has many, many values.
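A likely explanation is CPython's global interpreter lock (GIL): only one thread executes Python bytecode at a time, so CPU-bound parsing gains nothing from threads. Processes sidestep the GIL. A minimal sketch, where the file location and the parsing body are hypothetical stand-ins for your own:

```python
import glob
import multiprocessing as mp

def parse_file(path):
    """Stand-in for the per-file byte parsing; returns (path, result)."""
    with open(path, 'rb') as f:
        data = f.read()          # the I/O part is fast, as you observed
    # ... unpack the structure from `data` here (the CPU-bound part) ...
    return path, len(data)

if __name__ == '__main__':
    paths = glob.glob('gamedata/*.bin')   # hypothetical location of the 30+ files
    with mp.Pool() as pool:               # one worker per CPU core by default
        results = dict(pool.map(parse_file, paths))
```

With threads, the parsing steps serialize on the GIL; with a process pool each worker has its own interpreter, so CPU-bound work can actually run in parallel.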
Is there no limit to how many files can be created by Python code, the way there is a recursion limit? I have this bit of code here,
each = 0
while True:
    with open('filenamewhtever' + str(each) + '.txt', 'a') as file:
        file.write(str(each))
    each += 1
which seems to work just fine, although it quickly filled a lot of space in the folder. Could this, if unchecked, potentially have crashed my PC? Also, shouldn't the interpreter have a failsafe switch to prevent this?
There is typically an operating-system-defined limit on how many files you can have open at the same time. But because the with statement closes each file after you've written to it, you don't run into this limit.
There might also be limits imposed by the file system on how many files can be in a single directory. You might see certain operations (like listing the files) become slow, well before you even come close to that limit.
And finally, you are obviously limited by disk space.
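On Unix-like systems you can inspect the per-process open-file limit directly; a sketch using the stdlib resource module (not available on Windows):

```python
import resource

# Soft/hard limits on simultaneously open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
```

Because the with statement closes each file before the next iteration opens one, the loop above holds only one descriptor at a time, far below the soft limit.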
I am scanning through a large number of files looking for some markers. I am becoming confident that once I have run the code once, Python is not rereading the actual files from disk. I find this behavior strange because I was told that one reason I needed to structure my file access this way was so that the handle and file contents are flushed. But that can't be.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code:
import re

test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)

no7a = []
for path in path_list:
    path = path.strip()
    with open(path, 'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re, string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons. One is that I was thinking about using multiprocessing, but if the bottleneck is reading in the files I don't see that I will gain much. I am also worried about a file being modified and not having the most recent version available.
I am tagging this 2.7 because I have no idea if this behavior is persistent across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer), it still takes 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same. Given the timing tests I have run and the fact that I do not hear the disk spinning, I believe the files are somehow still accessible to Python.
This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads, you could:
- Disable the paging file in Windows and fill the RAM up to 90%.
- Use some tool to disable file caching in Windows, like this one.
- Run your code in a Linux VM on your Windows machine with limited RAM. In Linux you can control the caching much better.
- Make the files much bigger, so that they won't fit in the cache.
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access then closing it right away, Windows will hold sufficient doubts over the file content and invalidate the cache.
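If you do need to force a fresh read, bumping the modification time should be enough. A sketch; the assumption that Windows invalidates its cached pages when the file's metadata changes follows the reasoning above and is worth verifying:

```python
import os

def touch(path):
    """Set the file's access/modification times to 'now'.

    Assumption: a changed mtime makes Windows treat its cached
    pages as stale, so the next read goes back to disk.
    """
    os.utime(path, None)
```

This avoids actually opening the file for write access, which carries the risk of truncating or locking it.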
I'm running a long Python program that includes a step of reading a file into a Pandas dataframe. The program consistently fails with a MemoryError when it first tries to read the file into memory. When I rerun the failing step (without rerunning the previous parts of the program), there is no MemoryError.
It may be a problem of accumulating lots of previous objects in memory, which aren't present on the rerun. But the amount of memory in play is below the 2 GB limit where Windows starts having problems. In particular, the previous steps of the program only leave around ~400 MB in RAM, and the file I'm trying to read takes only ~400 MB.
Any ideas what's causing the MemoryError the first time around?
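One explanation is heap fragmentation in a 32-bit process: after the earlier steps there may be no single contiguous free block large enough, even though total free memory suffices. Two hedged workarounds, sketched here assuming the file is CSV-like (the actual format isn't stated, and `chunksize` is illustrative):

```python
import gc
import pandas as pd

def load_frame(path, chunksize=100_000):
    """Collect leftover garbage, then parse the file in chunks.

    gc.collect() releases unreferenced objects left by earlier steps;
    chunked parsing keeps the peak intermediate allocation smaller
    than a single monolithic read.
    """
    gc.collect()
    chunks = pd.read_csv(path, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)
```

The final concat still needs room for the whole frame, but the parser's temporary buffers no longer peak at the same moment.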
I have a python function that downloads a file from S3 to some temp location on a local drive and then processes it. The download part looks like this:
def processNewDataFile(key):
    # templocation below is just some temp local path
    key.get_contents_to_filename(templocation)
    # further processing
Here key is the AWS key for the file to download. What I've noticed is that occasionally get_contents_to_filename seems to freeze. In other parts of my code I have some solution that interrupts blocks of code (and raises an exception) if these blocks do not complete in a specified amount of time. This solution is hard to use here since files that I need to download vary in size a lot and sometimes S3 responds slower than other times.
So is there any reliable way of interrupting/timing out get_contents_to_filename that does NOT involve a hard predetermined time limit?
thanks
You could use a callback function with get_contents_to_filename:
http://boto.cloudhackers.com/en/latest/ref/gs.html#boto.gs.key.Key.get_contents_to_file
The callback function takes two parameters: the number of bytes transmitted so far and the total size of the file.
You can also specify the granularity (the maximum number of times the callback will be called), although I've only used it with small files (less than 10 KB) and it usually only gets called twice - once at the start and once at the end.
The important thing is that it passes the size of the file to the callback at the start of the transfer; the callback can then start a timer based on the size of the file.
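Building on that, here is a sketch of a size-proportional deadline: the first callback invocation sees the total size and derives a deadline from an assumed minimum throughput, and a later invocation past the deadline raises. Whether boto aborts the transfer when the callback raises is an assumption worth verifying, and `min_bytes_per_sec` is an illustrative parameter.

```python
import time

class TransferWatchdog:
    """Progress callback for key.get_contents_to_filename(..., cb=watchdog).

    The deadline scales with the file size instead of being one hard
    fixed limit: total_size / min_bytes_per_sec seconds, plus slack.
    """
    def __init__(self, min_bytes_per_sec=50_000, slack=10.0):
        self.min_rate = min_bytes_per_sec
        self.slack = slack
        self.deadline = None
        self.timed_out = False

    def __call__(self, bytes_sent, total_size):
        now = time.monotonic()
        if self.deadline is None:
            # First invocation: derive the deadline from the file size.
            self.deadline = now + total_size / self.min_rate + self.slack
        elif now > self.deadline:
            self.timed_out = True
            raise IOError("download slower than the assumed minimum rate")

# watchdog = TransferWatchdog()
# key.get_contents_to_filename(templocation, cb=watchdog, num_cb=100)
```

A higher num_cb gives the watchdog more chances to fire, so a stalled transfer is noticed sooner.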