Listening for new files with Python

I am trying to find a solution to a problem that I have...
I have a big share that contains hundreds of thousands (if not millions) of files, with new files arriving every second.
I am trying to write an application that will make it faster to find files in the share. My idea was to insert all the file names into a Redis DB in the format:
{'file_name': 'file_path'} and then, when a file is needed, pull its path from the DB...
The problem starts when I try to index all the old files (I assume it will take at least a few hours) while simultaneously listening for new files that arrive during the process.
This is an example of what I'm trying to do:
import redis
import os

r = redis.StrictRedis(host='localhost', port=6379, db=0)

for root, dirs, files in os.walk(r'D:\\'):
    for file in files:
        r.set(os.path.basename(file), os.path.join(root, file))
        print 'The file %s was successfully added' % os.path.basename(file)
How am I supposed to modify the code so that it keeps listening for new files?
Thanks for your help! =)

You should take a look at the watchdog library. It does exactly what you're looking for. Here's an example of using it:
import sys
import time
import logging
from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    event_handler = LoggingEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
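To tie this back to your Redis idea: a rough sketch (not a drop-in solution; the handler class and its Redis connection are illustrative) would start an observer with a custom handler before the bulk os.walk indexing, so files that arrive while the old files are being indexed still get added:

import os
import redis
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

r = redis.StrictRedis(host='localhost', port=6379, db=0)

class IndexHandler(FileSystemEventHandler):
    # illustrative handler: store every newly created file as name -> full path
    def on_created(self, event):
        if not event.is_directory:
            r.set(os.path.basename(event.src_path), event.src_path)

observer = Observer()
observer.schedule(IndexHandler(), r'D:\\', recursive=True)
observer.start()                                 # start listening *before* the bulk indexing

for root, dirs, files in os.walk(r'D:\\'):       # bulk-index the files that already exist
    for name in files:
        r.set(name, os.path.join(root, name))

A file that arrives during the walk may be indexed twice, but since r.set() simply overwrites the key, that is harmless.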

Related

Python: Read all files in directory to watch inside while loop

I am working on a Python script to which I will be passing a directory, and I need to get all log files from it. Currently, I have a small script which watches for any changes to these files and then processes that information.
It's working well, but it's only for a single file, with a hardcoded file path. How can I pass a directory to it and still watch all the files? My confusion is this: since I am working on these files in a while loop that should always stay running, how can I do that for n files inside a directory?
Current code:
import time

f = open('/var/log/nginx/access.log', 'r')
while True:
    line = ''
    while len(line) == 0 or line[-1] != '\n':
        tail = f.readline()
        if tail == '':
            time.sleep(0.1)  # avoid busy waiting
            continue
        line += tail
    print(line)
    _process_line(line)
The question was already tagged as a duplicate, but the requirement is to get changes line by line from all files inside a directory. The other questions cover a single file, which is already working.
Try this library: watchdog.
Python API library and shell utilities to monitor file system events.
https://pythonhosted.org/watchdog/
Simple example:
import sys
import time
import logging
from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    event_handler = LoggingEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
I'm not completely sure if I understand what you're trying to do, but maybe use:
while True:
    files = os.listdir(directory)
    for file in files:
        # your code for checking the contents of the file
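If you want to keep the line-by-line processing from your current code but cover every file in the directory, a rough sketch using watchdog could look like the following (the directory path is an assumption, partially written last lines are not handled, and files created after startup would need an on_created branch as well):

import os
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class TailHandler(FileSystemEventHandler):
    # illustrative handler: keep one open handle per log file and emit newly appended lines
    def __init__(self, directory):
        self.handles = {}
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                f = open(path, 'r')
                f.seek(0, os.SEEK_END)          # start at the end, like `tail -f`
                self.handles[path] = f

    def on_modified(self, event):
        f = self.handles.get(event.src_path)    # directory events return None here
        if f is not None:
            for line in f:                      # read whatever was appended since last time
                print(line, end='')             # or: _process_line(line)

directory = '/var/log/nginx'                    # assumed directory, as in the question
observer = Observer()
observer.schedule(TailHandler(directory), directory, recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()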

Python Watchdog get the name of the file that was just created?

I have written a script that observes a directory for the creation of new files. I set up a split function to split event.src_path on the target directory that I was giving the observer. This allowed me to get the file_name successfully.
See the script below:
def on_created(event):
    source_path = event.src_path
    file_name = source_path.split(TargetDir, 1)[1]
    print(f"{file_name} was just Created")

if __name__ == "__main__":
    for dir in range(len(TargetDir)):
        event_handler = FileSystemEventHandler()
        event_handler.on_created = on_created
        observer = Observer()
        observer.schedule(event_handler, path=TargetDir[0], recursive=True)
        observer.start()
However, now I am trying to feed in a list of target directories, looping through each one and registering the on_created() callback. Obviously the target directory is no longer a global variable, and I need to pass each directory into the function. I'm using watchdog and don't think it's possible to add extra arguments to the on_created() function. If I'm wrong, please let me know how to do this. Otherwise, is there a simpler way to get the name of the file that was created, without passing in the target directory just for that reason? I can get event.src_path, but this gives the full path, and then I wouldn't know where to split it if it were scanning multiple directories.
Well, one simple way is to pass in a different function for each directory, for example:
def create_callback(dir):
    def on_create(event):
        source_path = event.src_path
        file_name = source_path.split(dir, 1)[1]
        print(f"{file_name} was just Created")
    return on_create

if __name__ == "__main__":
    for dir in TargetDir:  # iterate over the directories themselves, not their indices
        event_handler = FileSystemEventHandler()
        event_handler.on_created = create_callback(dir)
        observer = Observer()
        observer.schedule(event_handler, path=dir, recursive=True)
        observer.start()
The dir variable is captured in the scope of the on_create function (a closure) and can therefore be used from within the function.
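If all you actually need is the bare file name, a simpler alternative (not what the answer above does, just an option) is to skip the split entirely and use os.path.basename, which works regardless of which watched directory fired the event:

import os
from watchdog.events import FileSystemEventHandler

class NameOnlyHandler(FileSystemEventHandler):
    # illustrative handler: report only the file name, whatever directory the event came from
    def on_created(self, event):
        if not event.is_directory:
            print(f"{os.path.basename(event.src_path)} was just created")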

Forcing os.walk to stop if it is taking too long

I want to find all files in a directory tree with a given file extension. However, some folders are really large, so I want to stop this process if it takes too long (say, 1 second). My current code looks something like this:
import os
import time

start_time = time.time()
file_ext = '.txt'
path = 'C:/'

file_list = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(file_ext):
            relDir = os.path.relpath(root, path)
            relFile = os.path.join(relDir, file)
            file_list.append(relFile)
        if time.time() - start_time > 1:
            break
    if time.time() - start_time > 1:
        break
The problem with this code is that when I get to a really large subfolder, it does not break until that folder has been completely traversed. If that folder contains many files, it might take much longer than I would like. Is there any way I can make sure that the code does not run for much longer than the allotted time?
Edit: Note that while it is certainly helpful to find ways to speed up the code (for instance by using os.scandir), this question deals primarily with how to kill a process that is running.
You can do the walk in a subprocess and kill that. Options include multiprocessing.Process, but on Windows the multiprocessing machinery may do a fair amount of work that you don't need. Instead, you can just pipe the walker code into a python subprocess and go from there.
import os
import sys
import threading
import subprocess as subp

walker_script = """
import os
import sys

path = os.environ['TESTPATH']
file_ext = os.environ['TESTFILEEXT']

# let parent know we are going
print('started')

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(file_ext):
            relDir = os.path.relpath(root, path)
            relFile = os.path.join(relDir, file)
            print(relFile)
"""

file_ext = '.txt'
path = 'C:/'
encoding = sys.getdefaultencoding()

# subprocess reads directories... additional python flags seek to
# speed python initialization. If a linuxy system, forking would
# be a good option.
env = {'TESTPATH': path, 'TESTFILEEXT': file_ext}
env.update(os.environ)
proc = subp.Popen([sys.executable, '-E', '-s', '-S', '-'], stdin=subp.PIPE,
                  stdout=subp.PIPE,  # stderr=open(os.devnull, 'wb'),
                  env=env)

# write walker script
proc.stdin.write(walker_script.encode('utf-8'))
proc.stdin.close()

# wait for start marker
next(proc.stdout)

# timer kills directory traversal when bored
threading.Timer(1, proc.kill).start()
file_list = [line.decode(encoding).strip() for line in proc.stdout]
print(file_list)
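For comparison, a rough sketch of the multiprocessing.Process option mentioned above might look like this (the walk_worker name and the streaming-via-Queue approach are illustrative choices, not the only way to do it):

import os
import time
from queue import Empty
import multiprocessing as mp

def walk_worker(path, file_ext, q):
    # runs in a child process; stream matching paths back one at a time
    for root, dirs, files in os.walk(path):
        for f in files:
            if f.endswith(file_ext):
                q.put(os.path.join(os.path.relpath(root, path), f))
    q.put(None)                       # sentinel: the walk finished on its own

if __name__ == '__main__':
    q = mp.Queue()
    proc = mp.Process(target=walk_worker, args=('C:/', '.txt', q))
    proc.start()

    deadline = time.time() + 1        # allow at most one second
    file_list = []
    while time.time() < deadline:
        try:
            item = q.get(timeout=max(0.0, deadline - time.time()))
        except Empty:
            break                     # ran out of time
        if item is None:
            break                     # walk completed before the deadline
        file_list.append(item)

    if proc.is_alive():
        proc.terminate()              # kill the traversal if it is still running
    proc.join()
    print(file_list)

Streaming one path at a time means the parent keeps whatever was found before the deadline, rather than waiting for the whole walk to finish.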

Watchdog library in Python on OS X -- not showing full event path

I just started working with the Watchdog library in Python on Mac, and am doing some basic tests to make sure things are working like I would expect. Unfortunately, they're not -- I can only seem to obtain the path to the folder containing the file where an event was registered, not the path to the file itself.
Below is a simple test program (slightly modified from the example provided by Watchdog) to print out the event type, path, and time whenever an event is registered.
import time
from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler
from watchdog.events import FileSystemEventHandler

class TestEventHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        print("event noticed: " + event.event_type +
              " on file " + event.src_path + " at " + time.asctime())

if __name__ == "__main__":
    event_handler = TestEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path='~/test', recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
The src_path variable should contain the path of the file that had the event happen to it.
However, in my testing, when I modify a file, src_path only prints the path to the folder containing the file, not the path to the file itself. For example, when I modify the file moon.txt in the folder europa, the program prints the following output:
event noticed: modified on file ~/test/europa at Mon Jul 8 15:32:07 2013
What do I need to change in order to obtain the full path to the modified file?
Problem solved. As it turns out, FSEvents on OS X returns only the directory for file-modified events, leaving you to scan the directory yourself to find out which file was modified. This is not mentioned in the Watchdog documentation, though it is found easily in the FSEvents documentation.
To get the full path to the file, I added the following snippet of code (inspired by this StackOverflow thread) to find the most recently modified file in a directory, to be used whenever event.src_path returns a directory.
if event.is_directory:
    files_in_dir = [event.src_path + "/" + f for f in os.listdir(event.src_path)]
    mod_file_path = max(files_in_dir, key=os.path.getmtime)
mod_file_path contains the full path to the modified file.
Thanks ekl for providing your solution. I just stumbled across the same problem. However, I used to use PatternMatchingEventHandler, which requires small changes to your solution:
Subclass from FileSystemEventHandler.
Create an attribute pattern where you store your pattern matching. This is not as flexible as the original PatternMatchingEventHandler, but should suffice for most needs, and you will get the idea anyway if you want to extend it.
Here's the code you have to put in your FileSystemEventHandler subclass:
def __init__(self, pattern='*'):
    super(MidiEventHandler, self).__init__()
    self.pattern = pattern

def on_modified(self, event):
    super(MidiEventHandler, self).on_modified(event)

    if event.is_directory:
        files_in_dir = [event.src_path + "/" + f for f in os.listdir(event.src_path)]
        if len(files_in_dir) > 0:
            modifiedFilename = max(files_in_dir, key=os.path.getmtime)
        else:
            return
    else:
        modifiedFilename = event.src_path

    if fnmatch.fnmatch(os.path.basename(modifiedFilename), self.pattern):
        print "Modified MIDI file: %s" % modifiedFilename
One other thing I changed is that I check whether the directory is empty before running max() on the file list; max() raises an error on an empty sequence.
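For reference, a self-contained version of that handler wired up to an observer might look roughly like this (the class name MidiEventHandler comes from the snippet above; the watched path and the '*.mid' pattern are just examples):

import os
import time
import fnmatch
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class MidiEventHandler(FileSystemEventHandler):
    # condensed version of the handler described above
    def __init__(self, pattern='*'):
        super(MidiEventHandler, self).__init__()
        self.pattern = pattern

    def on_modified(self, event):
        if event.is_directory:
            files_in_dir = [os.path.join(event.src_path, f) for f in os.listdir(event.src_path)]
            if not files_in_dir:
                return
            modified = max(files_in_dir, key=os.path.getmtime)
        else:
            modified = event.src_path
        if fnmatch.fnmatch(os.path.basename(modified), self.pattern):
            print("Modified MIDI file: %s" % modified)

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(MidiEventHandler('*.mid'), path='.', recursive=True)  # example path and pattern
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()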

Python script to recursively search FTP for specific filenames newer than 24 hours

Our storage area ran into trouble with SMB connections, and we have now been forced to use FTP to access files on a regular basis. So rather than using Bash, I am trying to use Python, but I am running into a few problems. The script needs to recursively search through the FTP directory, find all files matching "*1700_m30.mp4" that are newer than 24 hours, and then copy all these files locally.
This is what I have so far, but I can't seem to get the script to download the files or get the stats from the files that tell me whether they are newer than 24 hours.
#!/usr/bin/env python
# encoding: utf-8

import sys
import os
import ftplib
import ftputil
import fnmatch
import time

dir_dest = '/Volumes/VoigtKampff/Temp/TEST1/'  # Directory where the files need to be downloaded to
pattern = '*1700_m30.mp4'  # filename pattern for what the script is looking for

print 'Looking for this pattern :', pattern  # print pattern
print "logging into GSP"  # print

host = ftputil.FTPHost('xxx.xxx', 'xxx', 'xxxxx')  # ftp host info
recursive = host.walk("/GSPstor/xxxxx/xxx/xxx/xxx/xxxx", topdown=True, onerror=None)  # recursive search
for root, dirs, files in recursive:
    for name in files:
        print 'Files :', files  # print all files it finds
        video_list = fnmatch.filter(files, pattern)
        print 'Files to be moved :', video_list  # print list of files to be moved
        if host.path.isfile(video_list):  # check whether the file is valid
            host.download(video_list, video_list, 'b')  # download file list
host.close
Here is the modified script based on the excellent recommendations from ottomeister (thank you!!). The last issue now is that it downloads, but it keeps re-downloading the files and overwriting the existing ones:
import sys
import os
import ftplib
import ftputil
import fnmatch
import time
from time import mktime
import datetime
import os.path, time
from ftplib import FTP

dir_dest = '/Volumes/VoigtKampff/Temp/TEST1/'  # Directory where the files need to be downloaded to
pattern = '*1700_m30.mp4'  # filename pattern for what the script is looking for

print 'Looking for this pattern :', pattern  # print pattern

utc_datetime_less24H = datetime.datetime.utcnow() - datetime.timedelta(seconds=86400)  # UTC time minus 24 hours
print 'UTC time less than 24 Hours is: ', utc_datetime_less24H.strftime("%Y-%m-%d %H:%M:%S")  # print UTC time minus 24 hours

print "logging into GSP FTP"  # print

with ftputil.FTPHost('xxxxxxxx', 'xxxxxx', 'xxxxxx') as host:  # ftp host info
    recursive = host.walk("/GSPstor/xxxx/com/xxxx/xxxx/xxxxxx", topdown=True, onerror=None)  # recursive search
    for root, dirs, files in recursive:
        for name in files:
            print 'Files :', files  # print all files it finds
            video_list = fnmatch.filter(files, pattern)  # collect all files that match pattern into video_list
            statinfo = host.stat(root, video_list)  # get the stats from files in video_list
            file_mtime = datetime.datetime.utcfromtimestamp(statinfo.st_mtime)
            print 'Files with pattern: %s and epoch mtime is: %s ' % (video_list, statinfo.st_mtime)
            print 'Last Modified: %s' % datetime.datetime.utcfromtimestamp(statinfo.st_mtime)
            if file_mtime >= utc_datetime_less24H:
                for fname in video_list:
                    fpath = host.path.join(root, fname)
                    if host.path.isfile(fpath):
                        host.download_if_newer(fpath, os.path.join(dir_dest, fname), 'b')
    host.close()
This line:
video_list = fnmatch.filter(files, pattern)
gets you a list of filenames that match your glob pattern. But this line:
if host.path.isfile(video_list): # check whether the file is valid
is bogus, because host.path.isfile() does not want a list of filenames as its argument. It wants a single pathname. So you need to iterate over video_list constructing one pathname at a time, passing each of those pathnames to host.path.isfile(), and then possibly downloading that particular file. Something like this:
import os.path

for fname in video_list:
    fpath = host.path.join(root, fname)
    if host.path.isfile(fpath):
        host.download(fpath, os.path.join(dir_dest, fname), 'b')
Note that I'm using host.path.join() to manage remote pathnames and os.path.join() to manage local pathnames. Also note that this puts all of the downloaded files into a single directory. If you want to put them into a directory hierarchy that mirrors the remote layout (you'll have to do something like that if the filenames in different remote directories can clash) then you'll need to construct a different destination path, and you'll probably have to create the local destination directory hierarchy too.
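If you do need to mirror the remote layout locally, a rough sketch along those lines (using posixpath.relpath for the remote side, and assuming remote_root is the same directory you passed to host.walk()) might be:

import os
import posixpath

remote_root = "/GSPstor/xxxxx/xxx/xxx/xxx/xxxx"      # the same path that was passed to host.walk()

for fname in video_list:
    fpath = host.path.join(root, fname)
    if host.path.isfile(fpath):
        rel_dir = posixpath.relpath(root, remote_root)   # e.g. 'sub/dir' relative to the walk root
        local_dir = os.path.join(dir_dest, rel_dir)
        if not os.path.isdir(local_dir):
            os.makedirs(local_dir)                       # create the local hierarchy as needed
        host.download(fpath, os.path.join(local_dir, fname), 'b')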
To get timestamp information use host.lstat() or host.stat() depending on how you want to handle symlinks.
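A rough sketch of folding that per-file stat check into the download loop (the utc_cutoff name is just illustrative, and the server's timestamps are assumed to be UTC) could look like:

import os.path
import datetime

utc_cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=24)

for fname in video_list:
    fpath = host.path.join(root, fname)
    if host.path.isfile(fpath):
        mtime = datetime.datetime.utcfromtimestamp(host.stat(fpath).st_mtime)
        if mtime >= utc_cutoff:   # only fetch files modified within the last 24 hours
            host.download(fpath, os.path.join(dir_dest, fname), 'b')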
And yes, that should be host.close(). Without it the connection will be closed after the host variable goes out of scope and is garbage-collected, but it's better to close it explicitly. Even better, use a with clause to ensure that the connection gets closed even if an exception causes this code to be abandoned before it reaches the host.close() call, like this:
with ftputil.FTPHost('xxx.xxx', 'xxx', 'xxxxx') as host:  # ftp host info
    recursive = host.walk(...)
    ...
