I just started working with the Watchdog library in Python on Mac, and am doing some basic tests to make sure things are working like I would expect. Unfortunately, they're not -- I can only seem to obtain the path to the folder containing the file where an event was registered, not the path to the file itself.
Below is a simple test program (slightly modified from the example provided by Watchdog) to print out the event type, path, and time whenever an event is registered.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class TestEventHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        print("event noticed: " + event.event_type +
              " on file " + event.src_path + " at " + time.asctime())

if __name__ == "__main__":
    event_handler = TestEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path='~/test', recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
The src_path attribute should contain the path of the file the event happened to.
However, in my testing, when I modify a file, src_path only prints the path to the folder containing the file, not the path to the file itself. For example, when I modify the file moon.txt in the folder europa, the program prints the following output:
event noticed: modified on file ~/test/europa at Mon Jul 8 15:32:07 2013
What do I need to change in order to obtain the full path to the modified file?
Problem solved. As it turns out, FSEvents on OS X returns only the directory for file-modified events, leaving you to scan the directory yourself to find out which file was modified. This is not mentioned in the Watchdog documentation, though it's easy to find in the FSEvents documentation.
To get the full path to the file, I added the following snippet of code (inspired by this StackOverflow thread) to find the most recently modified file in a directory, to be used whenever event.src_path is a directory.
import os  # at the top of the file

if event.is_directory:
    files_in_dir = [os.path.join(event.src_path, f)
                    for f in os.listdir(event.src_path)]
    mod_file_path = max(files_in_dir, key=os.path.getmtime)
mod_file_path contains the full path to the modified file.
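Putting it together, a minimal sketch of the full handler with the workaround folded in might look like this (treat it as illustrative: the most recently modified file is only a heuristic for which file actually triggered the event):

import os
import time
from watchdog.events import FileSystemEventHandler

class TestEventHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        path = event.src_path
        # On OS X, FSEvents may report only the containing directory;
        # fall back to the most recently modified file inside it.
        if event.is_directory:
            files_in_dir = [os.path.join(path, f) for f in os.listdir(path)]
            if not files_in_dir:
                return  # empty directory, nothing to report
            path = max(files_in_dir, key=os.path.getmtime)
        print("event noticed: " + event.event_type +
              " on file " + path + " at " + time.asctime())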
Thanks, ekl, for providing your solution. I just stumbled across the same problem. However, I had been using PatternMatchingEventHandler, which requires small changes to your solution:
subclass from FileSystemEventHandler
create an attribute pattern where you store your pattern matching. This is not as flexible as the original PatternMatchingEventHandler, but should suffice for most needs, and you will get the idea anyway if you want to extend it.
Here's the code for your FileSystemEventHandler subclass:
import os
import fnmatch
from watchdog.events import FileSystemEventHandler

class MidiEventHandler(FileSystemEventHandler):
    def __init__(self, pattern='*'):
        super(MidiEventHandler, self).__init__()
        self.pattern = pattern

    def on_modified(self, event):
        super(MidiEventHandler, self).on_modified(event)
        if event.is_directory:
            files_in_dir = [os.path.join(event.src_path, f)
                            for f in os.listdir(event.src_path)]
            if len(files_in_dir) > 0:
                modifiedFilename = max(files_in_dir, key=os.path.getmtime)
            else:
                return
        else:
            modifiedFilename = event.src_path
        if fnmatch.fnmatch(os.path.basename(modifiedFilename), self.pattern):
            print("Modified MIDI file: %s" % modifiedFilename)
One other thing I changed is that I check whether the directory is empty before running max() on the file list; max() raises a ValueError on an empty sequence.
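For completeness, here's a minimal sketch of how one might wire this handler up; the '*.mid' pattern and the watch path are made-up placeholders:

import time
from watchdog.observers import Observer

if __name__ == "__main__":
    event_handler = MidiEventHandler(pattern='*.mid')  # hypothetical pattern
    observer = Observer()
    observer.schedule(event_handler, path='.', recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()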
Related
I have a Python script that creates a directory on a local NTFS SSD disk on a computer running Windows.
os.makedirs(path)
After creating the directory
os.path.isdir(path)
returns True.
But when I start a process using multiprocessing.Pool's starmap, os.path.isdir(path) sometimes returns False for the same path in the function executed in the pool process. The failure occurs quite randomly, and I haven't spotted any obvious reason for it.
The directory is created in the same thread that starts the processes. There are multiple processes writing files to the directory, but each of them has its own file and none of them modifies the directory in any other way.
I've tried working around this with retries and delays (a sketch of that workaround appears after the example below), but I'd like to know whether there's a correct solution to this problem.
Are the filesystem's records somehow outdated in the other processes at this time? And if so, is there any way to ensure they're updated before attempting to access any files or directories?
Here's a minimal example with all the relevant bits. The system processes "works": whenever a new work arrives, the system launches a thread that schedules processing in a process pool. The processing itself consists of writing a file to the directory that causes problems, i.e. the directory the processes sometimes don't seem to find.
Note: I haven't been able to reproduce this problem in the original environment or with this example; I just see in the logs that it occurs every now and then.
#!/usr/bin/python
import os
import time
import random
import shutil
import threading
import multiprocessing

def create_dir(path):
    os.makedirs(path)
    # This never fails
    if not os.path.isdir(path):
        raise Exception(f"Unable to create directory '{path}'")

def clean_dir(path):
    for f in os.listdir(path):
        filepath = os.path.join(path, f)
        if os.path.isdir(filepath):
            shutil.rmtree(filepath, ignore_errors=True)
        else:
            os.remove(filepath)

# Creates a file in the problem directory. Crashes if
# the directory doesn't exist (or isn't visible).
# This function is executed in a pool process.
def process(i, tmp_dir):
    print(f"Processing {i} / {tmp_dir}")
    # Most of the work is done here before accessing the path
    time.sleep(1 + 3 * random.random())
    # Check Foo/Work-id/Temp
    # This fails every now and then, and when
    # it fails, it fails for every child process
    if not os.path.isdir(tmp_dir):
        raise Exception(f"Directory '{tmp_dir}' doesn't exist")
    filepath = os.path.join(tmp_dir, f"File-{i}.txt")
    with open(filepath, 'w') as file:
        file.write(f"{i}")

# Starts 3 processes (from a pool of 8), each writing a single file.
# This function itself is executed in a thread.
def start_processing(target_dir, pool):
    tmp_dir = os.path.join(target_dir, "Temp")
    # This never fails; create Foo/Work-id/Temp
    if not os.path.exists(tmp_dir):
        create_dir(tmp_dir)
    num_processes = 3
    results = pool.starmap(process, enumerate([tmp_dir] * num_processes))

# Creates directory Foo in the current directory and starts 20 works
# to be executed in a process pool, 8 at a time.
if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=8)
    root = "Foo"
    seed = int(time.time())
    print(f"Seed {seed}")
    # This never fails
    if not os.path.exists(root):
        create_dir(root)
    else:
        clean_dir(root)
        time.sleep(2)
    num_works = 20
    for work_id in range(1, num_works + 1):
        name = f"Work-{work_id}"
        print(f"Starting {name}/{num_works}")
        target_dir = os.path.join(root, name)
        # Create target directory Foo/Work-id
        if not os.path.exists(target_dir):
            create_dir(target_dir)
        thread = threading.Thread(
            target=start_processing,
            args=(target_dir, pool),
            name="Processing Thread"
        )
        thread.start()
        time.sleep(0.07)
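As mentioned above, I've been using retries and delays as a workaround. A minimal sketch of that approach (wait_for_dir is a made-up helper name, and this masks the symptom rather than fixing the root cause):

import os
import time

def wait_for_dir(path, retries=10, delay=0.1):
    # Hypothetical helper: poll until the directory becomes
    # visible or the retries run out.
    for _ in range(retries):
        if os.path.isdir(path):
            return True
        time.sleep(delay)
    return False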
I am working on a Python script where I will be passing a directory, and I need to get all log files from it. Currently, I have a small script that watches for any changes to these files and then processes that information.
It works well, but only for a single, hardcoded file path. How can I pass a directory to it and still watch all the files? My confusion is that, since I am reading these files in a while loop that should always stay running, how can I do that for n files inside a directory?
Current code:
import time

f = open('/var/log/nginx/access.log', 'r')
while True:
    line = ''
    while len(line) == 0 or line[-1] != '\n':
        tail = f.readline()
        if tail == '':
            time.sleep(0.1)  # avoid busy waiting
            continue
        line += tail
    print(line)
    _process_line(line)
The question was already flagged as a duplicate, but the requirement is to get changes line by line from all files inside a directory. The other questions cover a single file, which is already working.
Try this library: watchdog.
Python API library and shell utilities to monitor file system events.
https://pythonhosted.org/watchdog/
Simple example:
import sys
import time
import logging
from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    event_handler = LoggingEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
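Run it with the directory to watch as the first argument, e.g. python the_script.py /var/log/nginx (the script filename here is hypothetical); it falls back to the current directory if no argument is given.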
I'm not completely sure I understand what you're trying to do, but maybe use:
import os

while True:
    files = os.listdir(directory)
    for file in files:
        # ... your code for checking the contents of the file ...
        pass
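Combining this idea with the polling loop from the question, a minimal sketch that tails every log file in a directory might look like the following (the directory path and the .log suffix filter are illustrative assumptions):

import os
import time

directory = '/var/log/nginx'  # assumed path; use the directory passed to your script
handles = {}  # maps file path -> open file object

while True:
    # Open a handle for any new log file that appears in the directory
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.endswith('.log') and path not in handles and os.path.isfile(path):
            handles[path] = open(path, 'r')
    # Drain any newly appended data from each file
    for path, f in handles.items():
        for line in iter(f.readline, ''):
            print(path, line, end='')  # a line mid-write may lack its trailing newline
    time.sleep(0.1)  # avoid busy waiting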
I have written a script that observes a directory for the creation of new files. I set up a split function to split the event.src_path on the target directory that I was giving the observer. This allowed me to get the file_name successfully.
See the script below:
def on_created(event):
    source_path = event.src_path
    file_name = source_path.split(TargetDir, 1)[1]
    print(f"{file_name} was just Created")

if __name__ == "__main__":
    for dir in range(len(TargetDir)):
        event_handler = FileSystemEventHandler()
        event_handler.on_created = on_created
        observer = Observer()
        observer.schedule(event_handler, path=TargetDir[0], recursive=True)
        observer.start()
However, now I am trying to feed in a list of target directories, and am looping through each one and calling the on_created() method. Now, obviously, the target directory is no longer a global variable, and I need to try to pass each directory into the function. I'm using watchdog, and I don't think it's possible to add extra arguments to the on_created() function. If I'm wrong, please let me know how to do this. Otherwise, is there a simpler way to just get the name of the file that was created, without passing in the target directory just for that reason? I can get event.src_path, but this gives the full path, and then I wouldn't know where to split it if it were scanning multiple directories.
Well, one simple way is to pass in a different function for each directory, for example:
def create_callback(dir):
    def on_create(event):
        source_path = event.src_path
        file_name = source_path.split(dir, 1)[1]
        print(f"{file_name} was just Created")
    return on_create

if __name__ == "__main__":
    for dir in TargetDir:  # iterate over the directory paths themselves, not indices
        event_handler = FileSystemEventHandler()
        event_handler.on_created = create_callback(dir)
        observer = Observer()
        observer.schedule(event_handler, path=dir, recursive=True)
        observer.start()
The dir variable is captured in the scope of the on_create function (a closure) and can therefore be used from within it.
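Note that the loop above starts the observers and then falls off the end of the script; as in the other examples in this thread, you'd keep the main thread alive and shut the observers down cleanly. A sketch, with a hypothetical TargetDir list:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

TargetDir = ['/tmp/watch_a', '/tmp/watch_b']  # hypothetical directories

if __name__ == "__main__":
    observers = []
    for dir in TargetDir:
        event_handler = FileSystemEventHandler()
        event_handler.on_created = create_callback(dir)  # from the answer above
        observer = Observer()
        observer.schedule(event_handler, path=dir, recursive=True)
        observer.start()
        observers.append(observer)
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        for observer in observers:
            observer.stop()
        for observer in observers:
            observer.join()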
I am trying to find a solution to a problem that I have...
I have a big share that contains hundreds of thousands (if not millions) of files, with new files arriving every second.
I am trying to write an application that will make it faster to find files in the share. My idea was to insert all the file names into a Redis DB in the format of
{'file_name': 'file_path'}, and then, when a file is needed, pull its path from the DB...
The problem starts when I try to index all the old files (I assume it will take at least a few hours) while simultaneously listening for new files that arrive during the process.
This is an example of what I'm trying to do:
import redis
import os

r = redis.StrictRedis(host='localhost', port=6379, db=0)
for root, dirs, files in os.walk(r'D:\\'):
    for file in files:
        r.set(os.path.basename(file), os.path.join(root, file))
        print('The file %s was successfully added' % os.path.basename(file))
How am I supposed to modify the code so that it keeps listening for new files?
Thanks for your help! =)
You should take a look at the watchdog library. It does exactly what you're looking for. Here's an example of using it:
import sys
import time
import logging
from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    event_handler = LoggingEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
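To tie this back to the Redis indexing, a minimal sketch of a handler that adds newly created files to the DB might look like this (the handler class name is made up; it simply reuses the r.set() call from the question):

import os
import time
import redis
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

r = redis.StrictRedis(host='localhost', port=6379, db=0)

class IndexingEventHandler(FileSystemEventHandler):  # hypothetical name
    def on_created(self, event):
        if not event.is_directory:
            # Same key/value scheme as the indexing loop in the question
            r.set(os.path.basename(event.src_path), event.src_path)

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(IndexingEventHandler(), path=r'D:\\', recursive=True)
    observer.start()
    # Run the one-time os.walk() indexing here, in parallel with the observer,
    # so files created during the initial scan are still picked up.
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()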
I have an example piece of (Python) code to check if a directory has changed:
import os

def watch(path, fdict):
    """Checks a directory and children for changes"""
    changed = []
    for root, dirs, files in os.walk(path):
        for f in files:
            abspath = os.path.abspath(os.path.join(root, f))
            new_mtime = os.stat(abspath).st_mtime
            if abspath not in fdict or new_mtime > fdict[abspath]:
                changed.append(abspath)
                fdict[abspath] = new_mtime
    return fdict, changed
But the accompanying unittest randomly fails unless I add at least a 2 second sleep:
import unittest
import project_creator
import os
import time

class tests(unittest.TestCase):
    def setUp(self):
        os.makedirs('autotest')
        f = open(os.path.join('autotest', 'new_file.txt'), 'w')
        f.write('New file')
        f.close()

    def tearDown(self):
        os.unlink(os.path.join('autotest', 'new_file.txt'))
        os.rmdir('autotest')

    def test_amend_file(self):
        changed = project_creator.watch('autotest', {})
        time.sleep(2)
        f = open(os.path.join('autotest', 'new_file.txt'), 'a')
        f.write('\nA change!')
        f.close()
        changed = project_creator.watch('autotest', changed[0])
        self.assertEqual(changed[1], [os.path.abspath(os.path.join('autotest', 'new_file.txt'))])

if __name__ == '__main__':
    unittest.main()
Is stat really limited to worse than 1 second accuracy? (Edit: apparently so, with FAT)
Is there any (cross platform) way of detecting more rapid changes?
The proper way is to watch a directory instead of polling for changes.
Check out the FindFirstChangeNotification function.
Watch a Directory for Changes is a Python implementation.
If directory watching isn't accurate enough, then probably the only alternative is to intercept file system calls.
Watchdog (http://packages.python.org/watchdog/quickstart.html) is a good project for multi-platform change notification.
If this were Linux, I'd use inotify. There's apparently a Windows inotify equivalent (the Java jnotify library has implemented it), but I don't know if there's a Python implementation.
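(For what it's worth, the Watchdog library mentioned above uses inotify on Linux under the hood and native APIs on the other platforms, so it may already cover this case.)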