I have a small Python script that splits a large data file into many smaller files (some 100,000 or so), basically map tiling.
Anyway, after running for a long time (20+ hours), the script dies with:
IOError: [Errno 2] No such file or directory: ......
Now this strikes me as odd, as NONE of the files existed before the script ran, and f.close() is called after every file is written, so FD limits don't seem to be responsible... (and there is plenty of disk space)
The other odd thing was that all the files it had already created/written (about 55,000) were deleted along with the containing directory when the script died.
I used the multiprocessing module to create a process for each CPU core, and all 4 spat out the same message when they died, each naming a file from the section of data it was processing (perhaps this is relevant?)
I can hopefully work around this, but am just curious as to why this might have occurred?
Edit: For more context, the script is splitting the GSHHS geo data into small 'tiles'. I have successfully done this for the lower-resolution sections of the DB, into files covering larger areas of the globe; it fell over when trying to split the high-resolution data into 1x1° tiles.
I have a Python process that takes a file containing streamed data and converts it into a format ready to load into a database. I have just migrated this process from one Linux GCP VM to another running exactly the same code, but the final output file is nearly 4 times as big: 500 MB vs 2 GB.
When I download the files and inspect them manually, they look exactly the same to the eye.
Any ideas what could be causing this?
Edit: Thanks for the feedback. I traced it back to the input file, which is slightly different (as my stream recording process has also been migrated).
I am now trying to work out why a marginally different input file creates such a different output file once it's been processed.
I have a mounted directory containing several large files. I would like to move those files onto a local directory. However, the local machine has very limited disk space, and I've run into an issue where moving these files failed due to lack of disk space and the files were subsequently lost. I'm looking for a pythonic way to:
Attempt to move all files from the source directory to the destination
If we run out of space, move them all back and raise an error (or just return false) without changing ownership or permissions
I do not want to move the directory itself, only its contents. It's okay to overwrite existing files. I see plenty of "How do I move files" questions, but no "what happens if we run out of space" questions.
The local machine is running CentOS 7; the remote machine is running Solaris 10.
I think the most pythonic way would be to move the file chunk by chunk, checking for errors in between.
I'm going to assume that the file is too large to store entirely in memory at once, so something like this would probably be best:
import os

chunk_size = 2048
successful = True

with open("original.file", "rb") as f_in:
    chunk = f_in.read(chunk_size)            # Only read a small chunk of the old file at a time
    while chunk:
        try:
            f_out = open("new_file", "ab+")  # Write the small chunk to the end of the new file,
            f_out.write(chunk)               # then close it so as not to run out of memory
            f_out.close()
            chunk = f_in.read(chunk_size)
        except OSError:
            chunk = False                    # Stop the loop and remember the failure
            successful = False

if successful:
    os.remove("original.file")               # Only delete the source once the copy completed
I'm fairly certain this would work for you, as you would stop early in the event of an ENOSPC error, meaning you ran out of disk space. The original file is only deleted if the whole write completed (I assume that was your intent, since you wanted to move it, not copy it).
There's an easier-looking solution of renaming the file with os.rename("old/path/file.txt", "new/path/file.txt"), but os.rename only works when the source and destination are on the same filesystem; moving from a mounted disk to the local disk will raise an OSError (cross-device link). shutil.move handles that case by falling back to copy-and-delete.
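If you want to see the distinction in code, here is a minimal sketch with hypothetical paths; note that shutil.move already performs this rename-then-copy fallback internally, and the copy path can still fail with ENOSPC and leave a partial file behind:

import errno
import os
import shutil

src = "/mnt/remote/bigfile.dat"    # hypothetical source on the mounted filesystem
dst = "/local/data/bigfile.dat"    # hypothetical destination on the local disk

try:
    os.rename(src, dst)            # atomic, but only within a single filesystem
except OSError as e:
    if e.errno == errno.EXDEV:     # "Invalid cross-device link": different filesystems
        shutil.move(src, dst)      # falls back to copy + delete (can still hit ENOSPC)
    else:
        raise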
I am scanning through a large number of files looking for some markers. I am starting to be fairly confident that, once I have run through the code one time, Python is not rereading the actual files from disk. I find this behavior strange because I was told that one reason I needed to structure my file access the way I have is so that the handle and file contents are flushed. But that can't be the case.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code
import re

test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)

no7a = []
for path in path_list:                    # path_list holds the 9,568 file paths
    path = path.strip()
    with open(path, 'r') as fh:           # file is closed as soon as the block exits
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re, string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons; one is that I was thinking about using multiprocessing (see the sketch below), but if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about a file being modified and not having the most recent version available.
I am tagging this 2.7 because I have no idea whether this behavior is the same across versions.
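For what it's worth, a minimal sketch of parallelising the scan with multiprocessing.Pool, reusing the regex from the code above and assuming path_list is defined as in the question:

import re
from multiprocessing import Pool

test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)

def has_no_7a(path):
    # Return the path if the marker is NOT found in the file, otherwise None.
    path = path.strip()
    with open(path, 'r') as fh:
        return path if test_7A_re.search(fh.read()) is None else None

if __name__ == '__main__':
    pool = Pool(processes=4)                  # one worker per core, say
    results = pool.map(has_no_7a, path_list)  # path_list: the 9,568 paths from the question
    no7a = [p for p in results if p is not None]

Whether this helps at all depends on whether a run is disk-bound (the cold, ~6-minute case) or CPU-bound on the regex (the warm, ~36-second case).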
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer), it still takes only 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same - I believe given the timing tests I have run and the fact that I do not hear the disk spinning that somehow the files are still accessible to Python.
This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads:
Disable paging file in Windows and fill the RAM up to 90%
Use some tool to disable file caching in Windows like this one.
Run your code on a Linux VM on your Windows machine that has limited RAM. In Linux you can control the caching much better
Make the files much bigger, so that they won't fit in cache
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access then closing it right away, Windows will hold sufficient doubts over the file content and invalidate the cache.
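If you want to test that theory from Python, a minimal sketch (with a hypothetical file path) is to bump the modification time with os.utime and then re-run your timing:

import os

path = r"E:\filings\example.txt"   # hypothetical path to one of the scanned files

# "Touch" the file: set its access and modification times to now without
# changing its contents. If the cache really is keyed off the modified time,
# the next read of this file should go back to the disk.
os.utime(path, None)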
I have a RAM tester that tests RAM modules and generates a test report for each module after the test is finished (whether it passes or fails). The test report is a typical .txt file and contains all the data I need to know about the RAM module, particularly the speed and the pass/fail condition. I am having a hard time figuring out a way to have Python read the contents of the test report without blocking the RAM testing software from writing a test report.
My goal is to have Python run in the background and read the file; if the file contains the RAM's speed AND the word 'pass', I want Python to write over the serial port, where I will have an Arduino waiting for a key character of my choosing to arrive on the serial line (that character will depend on the speed of the RAM detected). After the test report is read and Python has written a character over serial to the Arduino, I need Python to clear/truncate the .txt test report so it is empty and ready for the next time the file is read. That cycle will then go on indefinitely.
To give a bigger picture of the whole project, I will explain the ultimate goal. The RAM tester is a fully automated tester that loads itself, tests, and ejects each module onto a conveyor belt: if the module fails, the conveyor goes to the left, and if it passes, it goes to the right. I want to use an Arduino to create an extra conveyor that sorts the RAM that passed by speed. So far everything seems doable; I'm just having a hard time with Python reading the test report and clearing it without blocking the RAM tester from writing it. Someone suggested using a pipe, but I'm still not certain how to do that. I will also add that the software that writes the test report is third-party software that came with the RAM tester, and I have no idea what language it is written in. Thanks ahead of time for taking the time to read through this; any suggestions are greatly appreciated.
If I move or delete the file it will just generate another in its place with the same name.
Do this:
Rename the file: mv foo foo-for-report
Generate the report from it.
Python can rename a file using os.rename.
import os
os.rename("foo", "foo-for-report")
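Putting the pieces together, here is a rough sketch of the loop the question describes, assuming the pyserial package is installed and using a hypothetical COM port, report path, and speed markers (it also glosses over the chance of catching the report mid-write):

import os
import time

import serial                               # pyserial, assumed to be installed

REPORT = r"C:\ramtester\report.txt"         # hypothetical path written by the tester
STAGED = REPORT + ".processing"             # hypothetical name used while we read it

ser = serial.Serial("COM3", 9600)           # hypothetical port and baud rate for the Arduino

while True:
    if os.path.exists(REPORT) and os.path.getsize(REPORT) > 0:
        os.rename(REPORT, STAGED)           # move the report aside; the tester recreates it
        with open(STAGED, "r") as fh:
            text = fh.read().lower()
        if "pass" in text:
            if "1600" in text:              # hypothetical speed marker -> sort character
                ser.write(b"A")
            elif "1333" in text:
                ser.write(b"B")
        os.remove(STAGED)                   # done with this copy of the report
    time.sleep(1)                           # poll once per second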
I'm working on a Python script to parse Squid (http://www.squid-cache.org/) log files. While the logs are rotated every day to stop them getting too big, they do reach between 40-90 MB by the end of each day.
Essentially what I'm doing is reading the file line by line, parsing out the data I need (IP, requested URL, time) and adding it to an sqlite database. However, this seems to be taking a very long time (it's been running for over 20 minutes now).
So obviously, re-reading the entire file each time can't be done. What I would like to do is read the file and then detect new lines as they are written. Or, even better, from the start of the day the script would simply read the data in real time as it is added, so there would never be any long processing times.
How would I go about doing this?
One way to achieve this is by emulating tail -f. The script would constantly monitor the file and process each new line as it appears.
For a discussion and some recipes, see tail -f in python with no time.sleep
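A minimal sketch of the idea (this version uses a short time.sleep between polls, unlike the linked recipes, and a hypothetical log path):

import time

def follow(path):
    # Yield new lines appended to the file, roughly like `tail -f`.
    with open(path, "r") as fh:
        fh.seek(0, 2)                 # jump to the end of the file
        while True:
            line = fh.readline()
            if not line:
                time.sleep(0.5)       # nothing new yet; wait briefly and retry
                continue
            yield line

# for line in follow("/var/log/squid/access.log"):   # hypothetical log path
#     handle(line)                                   # hypothetical parse-and-insert function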
One way to do this is to use filesystem monitoring with pyinotify (http://pyinotify.sourceforge.net/) and set a callback function to be executed whenever the log file changes.
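A minimal sketch of that pattern, assuming pyinotify is installed and using a hypothetical log path (the callback is where you would read and parse whatever was appended):

import pyinotify

LOG_PATH = "/var/log/squid/access.log"          # hypothetical log path

class LogChanged(pyinotify.ProcessEvent):
    def process_IN_MODIFY(self, event):
        # Called whenever the watched file is modified; read the new data here.
        print("log file changed: %s" % event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch(LOG_PATH, pyinotify.IN_MODIFY)
notifier = pyinotify.Notifier(wm, LogChanged())
notifier.loop()                                 # blocks, dispatching events to the handler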
Another way to do it, without requiring external modules, is to record somewhere (possibly in your sqlite database itself) the offset of the end of the last line read from the log file (which you get with file.tell()), and then read only the newly added lines from that offset onwards, with a simple call to file.seek(offset) before looping through the lines.
The main difference between keeping track of the offset and the "tail" emulation described in the other post is that this one allows your script to be run multiple times, i.e. there is no need for it to run continually, and it can recover in case of a crash.
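A minimal sketch of the offset approach (using a plain text file for the saved offset and a hypothetical log path; the question's sqlite database would work just as well for storing the offset):

import os

LOG_FILE = "/var/log/squid/access.log"    # hypothetical log path
OFFSET_FILE = "squid.offset"              # hypothetical file holding the last position

# Load the offset saved by a previous run; start from 0 on the first run.
offset = 0
if os.path.exists(OFFSET_FILE):
    with open(OFFSET_FILE) as fh:
        offset = int(fh.read() or 0)

with open(LOG_FILE, "r") as log:
    log.seek(offset)                      # skip everything processed on earlier runs
    line = log.readline()
    while line:
        # parse the line and insert it into sqlite here
        line = log.readline()
    offset = log.tell()                   # remember where we stopped

with open(OFFSET_FILE, "w") as fh:
    fh.write(str(offset))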