Reading updated files on the fly in Python

I'm writing two Python scripts that both parse files. One is a standard unix logfile and the other is a binary file. I'm trying to figure out the best way to monitor these so I can read data as soon as they're updated. Most of the solutions I've found thus far are linux specific, but I need this to work in FreeBSD.
Obviously one approach would be to just run my script every X amount of time, but this seems gross and inefficient. If I want my Python app running continuously in the background monitoring a file and acting on it once it's changed/updated, what's my best bet?

Have you tried KQueue events?
http://docs.python.org/library/select.html#kqueue-objects
kqueue is the FreeBSD / OS X equivalent of inotify (a file-change notification service). I haven't used it myself, but I think it's what you want.
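For reference, watching a single file might look roughly like this with the select module (an untested sketch; the log path is just a placeholder):
import os
import select

# Sketch only: watch one file for writes/appends with kqueue (FreeBSD / OS X)
fd = os.open("/var/log/messages", os.O_RDONLY)   # placeholder path
kq = select.kqueue()
watch = select.kevent(
    fd,
    filter=select.KQ_FILTER_VNODE,
    flags=select.KQ_EV_ADD | select.KQ_EV_CLEAR,
    fflags=select.KQ_NOTE_WRITE | select.KQ_NOTE_EXTEND,
)
while True:
    # block until the kernel reports a change, then go read the new data
    for event in kq.control([watch], 1, None):
        if event.fflags & (select.KQ_NOTE_WRITE | select.KQ_NOTE_EXTEND):
            print("file changed, read the new data here")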

I once had to make a sort of daemon process for a parser built in Python. I needed to watch a series of files and process them in Python, and it had to be a truly multi-OS solution (Windows & Linux in this case). I wrote a program that watches over a list of files by checking their modification times. The program sleeps for a while and then checks the modification time of each file being watched. If the modification time is newer than the one previously registered, then the file has changed, and so stuff has to be done with it.
Something like this:
import os
import time

path = os.path.dirname(__file__)
print "Looking for files in", path, "..."

# get interesting files
files = [{"file": f} for f in os.listdir(path) if os.path.isfile(f) and os.path.splitext(f)[1].lower() == ".src"]
for f in files:
    f["output"] = os.path.splitext(f["file"])[0] + ".out"
    f["modtime"] = os.path.getmtime(f["file"]) - 10
    print "  watching:", f["file"]

while True:
    # sleep for a while
    time.sleep(0.5)
    # check if anything changed
    for f in files:
        # is the mod time of the file newer than the one registered?
        if os.path.getmtime(f["file"]) > f["modtime"]:
            # store the new time and...
            f["modtime"] = os.path.getmtime(f["file"])
            print f["file"], "has changed..."
            # do your stuff here
It does not look like top notch code, but it works quite well.
There are other SO questions about this, but I don't know if they'll provide a direct answer to your question:
How to implement a pythonic equivalent of tail -F?
How do I watch a file for changes?
How can I "watch" a file for modification / change?
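For the plain logfile half of the question, a tail -F style polling loop may already be enough; a minimal sketch (illustrative only, no log-rotation handling):
import time

def follow(path):
    # yield lines appended to a file, roughly like tail -f
    with open(path, "r") as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)  # nothing new yet, wait a bit
                continue
            yield line

# for line in follow("mylog.log"):
#     handle the new line here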
Hope this helps!

Related

Windows - file opened by another process, still can rename it in Python

On Windows, just before doing some actions on my file, I need to know if it's in use by another process. After seriously researching all the other questions with a similar problem, I wasn't able to find a working solution.
os.rename("my_file.csv", "my_file.csv") still works even when I have the file open in, let's say, Notepad.
psutil: it took too much time and it doesn't work (it can't find my file path in nt.path):
for proc in psutil.process_iter():
    try:
        flist = proc.open_files()
        if flist:
            for nt in flist:
                if my_file_path == nt.path:
                    print("it's here")
    except psutil.NoSuchProcess as err:
        print(err)
Is there any other solution for this?
UPDATE 1
I have to do 2 actions on this file: 1. check if the filename corresponds to a pattern; 2. copy it over SFTP.
UPDATE 2 + solution
Thanks to @Eryk Sun, I found out that Notepad "reads the contents into memory and then closes the handle". After opening my file with Word, os.rename and psutil are working like a (py)charm.
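As an illustration of the rename check discussed above (not from the thread itself; it assumes Windows and Python 3, and it will miss programs such as Notepad that close their handle right after reading):
import os

def is_file_in_use(path):
    # Heuristic: renaming a file onto itself fails with PermissionError while
    # another process holds it open without delete sharing
    try:
        os.rename(path, path)
        return False
    except PermissionError:
        return True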
If the program you use opens the file by importing it (as Excel would, for example), that means it transforms your data into a readable form for itself without keeping a handle on the actual file afterwards. If you save the file from there, it either saves it in the program's own format or exports (and transforms) the file back.
What do you want to do with the file? Maybe you can simply copy it?

OneDrive free up space with Python

I have been using OneDrive to store a large number of images which I now need to process, so I have synced my OneDrive folder to my computer, which takes virtually no space on disk. However, since I have to open() them in my code, they all get downloaded, which would take far more space than is available on my computer. I can manually use the Free up space action in the right-click contextual menu, which keeps them synced without taking space.
I'm looking for a way to do the same thing but in my code instead, after every image I process.
Trying to find how to get the commands of contextual menu items led me to these two places in the registry:
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\Directory\shell
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\*\shellex\ContextMenuHandlers
However I couldn't find anything related to it and those trees have way too many keys to check blindly. Also this forum post (outside link) shows a few ways to free up space automatically, but it seems to affect all files and is limited to full days intervals.
So is there any way to either access that command or to free up the space in Python?
According to this Microsoft post, it is possible to call attrib.exe to do that sort of manipulation on files.
This little snippet does the job for a per-file usage. As shown in the linked post, it's also possible to do it on the full contents of a folder using the /s argument, and much more.
import subprocess

def process_image(path):
    # Open the file, which downloads it automatically
    with open(path, 'r') as img:
        print(img)
    # Free up space (OneDrive) after usage
    subprocess.run('attrib +U -P "' + path + '"')
The download and freeing up space are fairly quick, but in the case of running this heavily in parallel, it is possible that some disk space will be consumed for a short amount of time. In general though, this is pretty instantaneous.
In addition to Mat's answer. If you are working on a Mac then you can replace Attrib.exe with "/Applications/OneDrive.App/Contents/MacOS/OneDrive /unpin" to make the file online only.
import subprocess
path = "/Users/OneDrive/file.png"
subprocess.run(["/Applications/OneDrive.App/Contents/MacOS/OneDrive", "/unpin", path])
Free up space for multiple files:
import os
import subprocess

path = r"C:\Users\yourUser\Folder"
diret = os.listdir(path)
for di in diret:
    dir_atual = path + "\\" + di
    for root, dirs, files in os.walk(dir_atual):
        for file in files:
            arquivos = os.path.join(root, file)
            print(arquivos)
            subprocess.run('attrib +U -P "' + arquivos + '"')

Windows disk usage issues with python

I am executing the python code that follows.
I am running it on a folder ("articles") which has a couple hundred subfolders and 240,226 files in all.
I am timing the execution. At first the times were pretty stable, but they went non-linear after 100,000 files. Now the times (I am timing at 10,000-file intervals) can go non-linear after 30,000 or so (or not).
I have the Task Manager open and can correlate the slow-downs to 99% disk usage by python.exe. I have done gc.collect(), dels, etc., and turned off Windows indexing. I have restarted Windows and emptied the trash (I have a few hundred GBs free). Nothing helps; the disk usage seems to be getting more erratic if anything.
Sorry for the long post - Thanks for the help
import os
import glob

# module-level lists used by the functions below
dirs = []
nxml_files = []

def get_filenames():
    for (dirpath, dirnames, filenames) in os.walk("articles/"):
        dirs.extend(dirnames)
    for dir in dirs:
        path = "articles" + "\\" + dir
        nxml_files.extend(glob.glob(path + "/*.nxml"))
    return nxml_files

def extract_text_from_files(nxml_files):
    for nxml_file in nxml_files:
        fast_parse(nxml_file)

def fast_parse(infile):
    file = open(infile, "r")
    filetext = file.read()
    tag_breaks = filetext.split('><')
    paragraphs = [tag_break.strip('p>').strip('</') for tag_break in tag_breaks if tag_break.startswith('p>')]

def run_files():
    nxml_files = get_filenames()
    extract_text_from_files(nxml_files)

if __name__ == "__main__":
    run_files()
There are some things that could be optimized.
First, if you open files, close them as well. A with open(...) as name: block does that easily. By the way, in Python 2, file is a bad choice for a variable name; it's the name of a built-in.
You can remove one disk read by doing string comparisons instead of the glob.
And last but not least: os.walk yields its results lazily, so don't buffer them into a list; process everything inside one loop. This will save a lot of memory.
That is what I can advise from the code. For more details on what is causing the I/O you should use profiling. See https://docs.python.org/2/library/profile.html for details.
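As an illustration only (not the answerer's code), the three suggestions above could be combined into a single streaming pass, keeping the question's crude '><' splitting unchanged:
import os

def extract_paragraphs(path):
    # Read one file, closing it promptly, and pull out the 'p>' fragments
    with open(path, "r") as f:
        filetext = f.read()
    tag_breaks = filetext.split('><')
    return [tb.strip('p>').strip('</') for tb in tag_breaks if tb.startswith('p>')]

def run_files(top="articles/"):
    # Single pass: os.walk already recurses, so no separate glob or buffering is needed
    for root, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(".nxml"):
                extract_paragraphs(os.path.join(root, name))

if __name__ == "__main__":
    run_files()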

Limitation to Python's glob?

I'm using glob to feed file names to a loop like so:
inputcsvfiles = glob.iglob('NCCCSM*.csv')
for x in inputcsvfiles:
    csvfilename = x
    # do stuff here
The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.
Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959) is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?
PS: initially I was using glob.glob, but glob.iglob fares no better.
Edit:
An expansion of above for more context...
# typical input files look like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
inputcsvfiles = glob.iglob('NCCCSM*.csv')

# loop over individual input files
for x in inputcsvfiles:
    csvfile = x
    modelname = x[0:5]
    # ArcPy
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # do more stuff after
The script fails at the ArcPy line, where the "csvfile" variable gets passed into the command. The error reported is that it can't find a specified csv file (e.g., "NCCSM20110101.csv"), when in fact, the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times as I have above? Again, this will work fine if the directory being glob'd only has 100 or so files, but if there's a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.
Try doing an ls * in a shell on those 10,000 entries and the shell would fail too. How about walking the directory and yielding those files one by one for your purpose?
# credit - @dabeaz - generators tutorial
import os
import fnmatch

def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Example use
if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print name
One issue that arose was not with Python per se, but rather with ArcPy and/or MS handling of CSV files (more the latter, I think). As the loop iterates, it creates a schema.ini file whereby information on each CSV file processed in the loop gets added and stored. Over time, the schema.ini gets rather large and I believe that's when the performance issues arise.
My solution, although perhaps inelegant, was to delete the schema.ini file during each loop to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and BASH scripting in the end.
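For reference, the per-iteration workaround described above amounts to something like this (assuming schema.ini is created in the same folder as the CSVs being joined):
import os

def remove_schema_ini(csv_dir):
    # Delete the schema.ini that each AddJoin appends to, so it never
    # grows large enough to slow later iterations down
    schema = os.path.join(csv_dir, "schema.ini")
    if os.path.exists(schema):
        os.remove(schema)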
If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.
There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
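If you'd rather check that limit from inside Python than via the shell, the resource module exposes it on POSIX systems:
import resource

# current soft and hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)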

Python programming - Windows focus and program process

I'm working on a python program that will automatically combine sets of files based on their names.
Being a newbie, I wasn't quite sure how to go about it, so I decided to just brute force it with the win32api.
So I'm attempting to do everything with virtual keys. I run the script, it selects the top file (after arranging them by name), sends a right-click command, selects 'Combine as Adobe PDF', and then presses Enter. This launches the Acrobat combine window, where I send another Enter command. Here's where I hit the problem.
The folder where I'm converting these things loses focus and I'm unsure how to get it back. Sending alt+tab commands seems somewhat unreliable. It sometimes switches to the wrong thing.
A much bigger issue for me: different combinations of files take different times to combine. Though I haven't gotten this far in my code, my plan was to set some arbitrarily long time.sleep() before finally sending the last Enter command to confirm the file name and complete the combination. Is there a way to monitor another program's progress? Is there a way to have Python not execute any more code until something else has finished?
I would suggest using a command-line tool like pdftk http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ - it does exactly what you want, it's cross-platform, it's free, and it's a small download.
You can easily call it from Python with (for example) subprocess.Popen.
Edit: sample code as below:
import subprocess
import os

def combine_pdfs(infiles, outfile, basedir=''):
    """
    Accept a list of pdf filenames,
    merge the files,
    save the result as outfile

    @param infiles: list of string, names of PDF files to combine
    @param outfile: string, name of merged PDF file to create
    @param basedir: string, base directory for PDFs (if filenames are not absolute)
    """
    # From the pdftk documentation:
    # Merge Two or More PDFs into a New Document:
    #   pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
    if basedir:
        infiles = [os.path.join(basedir, i) for i in infiles]
        outfile = os.path.join(basedir, outfile)
    pdftk = [r'C:\Program Files (x86)\Pdftk\pdftk.exe']  # or wherever you installed it
    op = ['cat']
    outcmd = ['output']
    args = pdftk + infiles + op + outcmd + [outfile]
    res = subprocess.call(args)

combine_pdfs(
    ['p1.pdf', 'p2.pdf'],
    'p_total.pdf',
    'C:\\Users\\Me\\Downloads'
)
