I am working on a Python script that will be passed a directory, and I need to watch all the log files inside it. Currently, I have a small script that watches a single, hardcoded file for changes and then processes the new lines.
It works well, but only for one file. How can I pass a directory to it and still watch all the files? My confusion is that I read each file in a while loop that should always stay running, so how can I do that for n files inside a directory?
Current code:

import time

f = open('/var/log/nginx/access.log', 'r')
while True:
    line = ''
    while len(line) == 0 or line[-1] != '\n':
        tail = f.readline()
        if tail == '':
            time.sleep(0.1)  # avoid busy waiting
            continue
        line += tail
    print(line)
    _process_line(line)
This question was already tagged as a duplicate, but the requirement is to get changes line by line from all files inside a directory. The other questions cover a single file, which already works.
Try this library: watchdog.
Python API library and shell utilities to monitor file system events.
https://pythonhosted.org/watchdog/
Simple example:
import sys
import time
import logging
from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    event_handler = LoggingEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
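
If you need the new lines themselves rather than just an event log, you can combine watchdog with per-file read offsets. Here is a minimal sketch (the LogTailHandler name is mine, and _process_line is assumed to be your handler from the question):

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class LogTailHandler(FileSystemEventHandler):
    def __init__(self):
        self._offsets = {}  # path -> position after the last complete line

    def on_modified(self, event):
        if event.is_directory:
            return
        pos = self._offsets.get(event.src_path, 0)
        with open(event.src_path, 'r') as f:
            f.seek(pos)
            while True:
                line = f.readline()
                if not line.endswith('\n'):
                    break  # nothing new, or a partial line: retry on the next event
                _process_line(line)  # your handler from the question
                pos = f.tell()
        self._offsets[event.src_path] = pos

observer = Observer()
observer.schedule(LogTailHandler(), '/var/log/nginx', recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()

Note that log rotation would need extra handling (the offset should reset when a file shrinks), but this covers the basic case of n growing files in one directory.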
I'm not completely sure I understand what you're trying to do, but maybe poll the directory listing:

import os

while True:
    files = os.listdir(directory)
    for file in files:
        # your code for checking the contents of the file
        ...
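
If you go this polling route, you also need to remember how far you have read in each file between iterations, otherwise you will reprocess old lines. A rough sketch under that assumption (directory and _process_line come from the question):

import os
import time

offsets = {}  # path -> how far we have read so far

while True:
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'r') as f:
            f.seek(offsets.get(path, 0))
            while True:
                line = f.readline()
                if not line.endswith('\n'):
                    break  # nothing new, or a partial line
                _process_line(line)
                offsets[path] = f.tell()
    time.sleep(0.1)  # avoid busy waiting

Compared to watchdog this burns some CPU rescanning the directory, but it has no extra dependencies and behaves the same on every platform.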
Related
I am using the following function to let the OS open the third-party application associated with the file type in question. For example: if the variable 'fileToOpen' holds the full path of a file called flower.psd, this function would open Photoshop on Windows and GIMP on Linux (typically).
import os
import platform
import subprocess

def launchFile(fileToOpen):
    if platform.system() == 'Darwin':      # macOS
        subprocess.call(('open', fileToOpen))
    elif platform.system() == 'Windows':   # Windows
        os.startfile(fileToOpen)
    else:                                  # Linux variants
        subprocess.call(('xdg-open', fileToOpen))
While it is running, I want the same Python script to monitor the use of that file and delete it once the third-party app is done using it (meaning the third-party app closed the psd file, or the app itself closed and released the file from use).
I've tried using psutil and pywin32, but neither seems to work on Windows 10 with Python 3.9. Has anyone had success with this? If so, how did you go about getting the process of the third-party app without getting a permission error from Windows?
Ideally, I would like a solution that works across Windows, Mac, and Linux, but I'll take any help with Windows 10 for now, since Mac and Linux can be handled more easily from the command line with ps -ax | grep %filename%.
Keep in mind, this would ideally track any file. TIA for your help.
Update by request:
I tried adding this code to mine (from a previous suggestion). Even this alone, in a test.py file, spits out permission errors:
import psutil

for proc in psutil.process_iter():
    try:
        # this returns the list of files opened by the current process
        flist = proc.open_files()
        if flist:
            print(proc.pid, proc.name())
            for nt in flist:
                print("\t", nt.path)
    # This catches a race condition where a process ends
    # before we can examine its files
    except psutil.NoSuchProcess as err:
        print("****", err)
The following code does not raise an error, but it also never detects a file in use:
import psutil
from pathlib import Path

def has_handle(fpath):
    for proc in psutil.process_iter():
        try:
            for item in proc.open_files():
                if fpath == item.path:
                    return True
        except Exception:
            pass
    return False

thePath = Path("C:\\Users\\someUser\\Downloads\\Book1.xlsx")
fileExists = has_handle(thePath)
if fileExists:
    print("This file is in use!")
else:
    print("This file is not in use")
Found it!
The original recommendation from another post was missing one call: Path(). The item.path value from the process list is returned as a string, so it needs to be converted to a Path object before comparing it with your own Path object.
Therefore this line:
if fpath == item.path:
Should be:
if fpath == Path(item.path):
and here is the full code:
import psutil
from pathlib import Path

def has_handle(fpath):
    for proc in psutil.process_iter():
        try:
            for item in proc.open_files():
                print(item.path)
                if fpath == Path(item.path):
                    return True
        except Exception:
            pass
    return False

thePath = Path("C:\\Users\\someUser\\Downloads\\Book1.xlsx")
fileExists = has_handle(thePath)
if fileExists:
    print("This file is in use!")
else:
    print("This file is not in use")
Note: The reason to use Path objects rather than strings is to stay OS independent.
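
For instance, pathlib normalizes separators (and, for Windows paths, case) when comparing, which a plain string comparison would miss:

from pathlib import PureWindowsPath

# Three spellings of the same file; string comparison would treat them as different
a = PureWindowsPath(r'C:\Users\someUser\Downloads\Book1.xlsx')
b = PureWindowsPath('C:/Users/someUser/Downloads/Book1.xlsx')
c = PureWindowsPath(r'c:\users\someuser\downloads\book1.xlsx')
print(a == b, a == c)  # True True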
Based on @Frankie's answer I put together this script. The script above took 16.1 seconds per file, as proc.open_files() is quite slow.
The script below checks all files in a directory and returns the pid associated with each open file. Checking 17 files took only 2.9 seconds, because proc.open_files() is only called when the file's default app is already in memory.
As this is used to check whether a folder can be moved, the pid can later be used to force-close the locking application, but BEWARE that that application could have other documents open, and all unsaved data would be lost.
This does not detect open txt files, and it may not detect files that don't have a default application.
from pathlib import Path
import psutil
import os
import shlex
import winreg
from pprint import pprint as pp
from collections import defaultdict

class CheckFiles():

    def check_locked_files(self, path: str):
        '''Check all files recursively in a directory and return a dict with the
        locked files associated with each pid (process id)

        Args:
            path (str): root directory

        Returns:
            dict: dict(pid: [filenames])
        '''
        fnames = []
        apps = set()
        for root, _, f_names in os.walk(path):
            for f in f_names:
                f = Path(os.path.join(root, f))
                if self.is_file_in_use(f):
                    default_app = self.get_default_windows_app(f.suffix)
                    if default_app:  # files with no default app cannot be matched to a process
                        apps.add(Path(default_app).name)
                        fnames.append(str(f))
        if apps:
            return self.find_process(fnames, apps)

    def find_process(self, fnames: list[str], apps: set[str]):
        '''Find the process holding each locked file

        Args:
            fnames (list[str]): list of filepaths
            apps (set[str]): set of default apps

        Returns:
            dict: dict(pid: [filenames])
        '''
        open_files = defaultdict(list)
        for p in psutil.process_iter(['name']):
            name = p.info['name']
            if name in apps:
                try:
                    for x in p.open_files():
                        if x.path in fnames:
                            open_files[p.pid].append(x.path)
                except Exception:
                    continue
        return dict(open_files)

    def is_file_in_use(self, file_path: str):
        '''Check if a file is in use by renaming it to its own name (nothing
        changes), which fails if the file is locked

        Args:
            file_path (str): path of the file to check

        Returns:
            bool: True if the file is locked by a process
        '''
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(file_path)
        try:
            path.rename(path)
        except PermissionError:
            return True
        else:
            return False

    def get_default_windows_app(self, suffix: str):
        '''Find the default app dedicated to a file extension (suffix)

        Args:
            suffix (str): e.g. ".jpg"

        Returns:
            None | str: path of the default app's exe
        '''
        try:
            class_root = winreg.QueryValue(winreg.HKEY_CLASSES_ROOT, suffix)
            with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, r'{}\shell\open\command'.format(class_root)) as key:
                command = winreg.QueryValueEx(key, '')[0]
            return shlex.split(command)[0]
        except OSError:
            return None

old_dir = r"C:\path_to_dir"
c = CheckFiles()
r = c.check_locked_files(old_dir)
pp(r)
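
If you then want to force-close a locking application by pid, psutil can do that too; a minimal sketch, bearing in mind the data-loss warning above:

import psutil

locked = r or {}  # r comes from the snippet above; it is None when nothing is locked
for pid in locked:
    try:
        p = psutil.Process(pid)
        p.terminate()      # ask the process to exit
        p.wait(timeout=5)  # give it a few seconds to comply
    except psutil.TimeoutExpired:
        p.kill()           # last resort: force kill
    except psutil.NoSuchProcess:
        pass               # already gone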
I have two scripts. The first script writes a list of folders and files to a file, then calls a second Python script to read that file and print it to the screen. The second script is called, but nothing ever prints to the screen, and I'm not sure why. No error message is thrown.
First Script:
#!/bin/python
from subprocess import call
import os.path
import os

def main():
    userRequest = raw_input("""Type the path and folder name that you'd like to list all files for.
The format should begin with a slash '/' and not have an ending slash '/'
Example (/var/log) *Remember capital vs. lower case does matter* :""")
    userInputCheck(userRequest)

def userInputCheck(userRequest):
    lastCharacter = userRequest[-1:]
    if lastCharacter == "/":
        userRequest = userRequest[:-1]
    folderCheck = os.path.isdir(userRequest)
    if folderCheck != True:
        print("\nSorry, '" + userRequest + "' does not exist, please try again.\n")
        requestUserInput()
    else:
        extractFileList(userRequest)

def extractFileList(userRequest):
    fileList = open('/tmp/fileList.txt', 'a')
    for folderName, subFolderName, listFiles in os.walk(userRequest):
        fileList.write(folderName + ":\n")
        for fileName in listFiles:
            fileList.write(fileName + "\n")
        fileList.write("\n")
    fileList.close
    os.system("readFile.py /tmp/fileList.txt")
    if os.path.isfile("/tmp/fileList.txt"):
        os.remove("/tmp/fileList.txt")

if __name__ == "__main__":
    main()
Second Script:
#!/bin/python
import sys

userFile = sys.argv[1]
f = open(userFile, 'r')
fileInfo = f.read()
sys.stdout.write(fileInfo)
sys.stdout.flush()
f.close
I want to find all files in a directory tree with a given file extension. However, some folders are really large, and I therefore want to stop this process if it takes too long (say, 1 second). My current code looks something like this:
import os
import time

start_time = time.time()
file_ext = '.txt'
path = 'C:/'
file_list = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(file_ext):
            relDir = os.path.relpath(root, path)
            relFile = os.path.join(relDir, file)
            file_list.append(relFile)
        if time.time() - start_time > 1:
            break
    if time.time() - start_time > 1:
        break
The problem with this code is that when I get to a really large subfolder, it does not break until that folder has been completely traversed. If that folder contains many files, it might take much longer than I would like. Is there any way I can make sure the code does not run for much longer than the allotted time?
Edit: Note that while it is certainly helpful to find ways to speed up the code (for instance by using os.scandir), this question deals primarily with how to kill a process that is running.
You can do the walk in a subprocess and kill that. Options include multiprocessing.Process, but the multiprocessing library on Windows may need to do a fair amount of work that you don't need. Instead, you can just pipe the walker code into a Python subprocess and go from there.
import os
import sys
import threading
import subprocess as subp

walker_script = """
import os
import sys

path = os.environ['TESTPATH']
file_ext = os.environ['TESTFILEEXT']

# let parent know we are going
print('started')

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(file_ext):
            relDir = os.path.relpath(root, path)
            relFile = os.path.join(relDir, file)
            print(relFile)
"""

file_ext = '.txt'
path = 'C:/'
encoding = sys.getdefaultencoding()

# subprocess reads directories... additional python flags seek to
# speed python initialization. If a linuxy system, forking would
# be a good option.
env = {'TESTPATH': path, 'TESTFILEEXT': file_ext}
env.update(os.environ)
proc = subp.Popen([sys.executable, '-E', '-s', '-S', '-'],
                  stdin=subp.PIPE, stdout=subp.PIPE,
                  # stderr=open(os.devnull, 'wb'),
                  env=env)

# write walker script
proc.stdin.write(walker_script.encode('utf-8'))
proc.stdin.close()

# wait for start marker
next(proc.stdout)

# timer kills directory traversal when bored
threading.Timer(1, proc.kill).start()
file_list = [line.decode(encoding).strip() for line in proc.stdout]
print(file_list)
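
For comparison, a multiprocessing.Process version is shorter, though as noted it does more start-up work on Windows, and the multiprocessing docs warn that terminating a process that is still writing to a queue can leave the queue in a broken state. A rough sketch:

import os
import multiprocessing as mp

def walker(path, file_ext, queue):
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(file_ext):
                queue.put(os.path.join(os.path.relpath(root, path), file))

if __name__ == '__main__':
    queue = mp.Queue()
    proc = mp.Process(target=walker, args=('C:/', '.txt', queue))
    proc.start()
    proc.join(timeout=1)  # let the walk run for at most one second
    proc.terminate()      # kill it if it is still going
    file_list = []
    while not queue.empty():  # drain whatever made it through
        file_list.append(queue.get())
    print(file_list)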
I am trying to find a solution to a problem I have...
I have a big share that contains hundreds of thousands (if not millions) of files, with new files arriving every second.
I am trying to write an application that will make it faster to find files on the share. My idea was to insert all the file names into a Redis DB in the format {'file_name': 'file_path'}, and then, when a file is needed, pull its path from the DB...
The problem starts when I try to index all the old files (I assume it will take at least a few hours) while simultaneously listening for new files that arrive during the process.
This is an example of what I'm trying to do:
import redis
import os

r = redis.StrictRedis(host='localhost', port=6379, db=0)

for root, dirs, files in os.walk(r'D:\\'):
    for file in files:
        r.set(os.path.basename(file), os.path.join(root, file))
        print('The file %s was successfully added' % os.path.basename(file))
How am I supposed to modify the code so that it keeps listening for new files?
Thanks for your help! =)
You should take a look at the watchdog library. It does exactly what you're looking for. Here's an example of using it:
import sys
import time
import logging
from watchdog.observers import Observer
from watchdog.events import LoggingEventHandler

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    event_handler = LoggingEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
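
Wired into your Redis snippet, the handler could look something like this (a sketch: start the observer first, then walk the old files, so nothing that arrives during indexing is missed):

import os
import time
import redis
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

r = redis.StrictRedis(host='localhost', port=6379, db=0)

class IndexingHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            r.set(os.path.basename(event.src_path), event.src_path)

if __name__ == '__main__':
    observer = Observer()
    observer.schedule(IndexingHandler(), r'D:\\', recursive=True)
    observer.start()
    # index the pre-existing files while the observer catches new arrivals
    for root, dirs, files in os.walk(r'D:\\'):
        for file in files:
            r.set(file, os.path.join(root, file))
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()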
I just started working with the Watchdog library in Python on Mac, and am doing some basic tests to make sure things work as I would expect. Unfortunately, they don't -- I can only obtain the path to the folder containing the file where an event was registered, not the path to the file itself.
Below is a simple test program (slightly modified from the example provided by Watchdog) to print out the event type, path, and time whenever an event is registered.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class TestEventHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        print("event noticed: " + event.event_type +
              " on file " + event.src_path + " at " + time.asctime())

if __name__ == "__main__":
    event_handler = TestEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path='~/test', recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
The src_path variable should contain the path of the file the event happened to.
However, in my testing, when I modify a file, src_path only prints the path to the folder containing the file, not the path to the file itself. For example, when I modify the file moon.txt in the folder europa, the program prints the following output:
event noticed: modified on file ~/test/europa at Mon Jul 8 15:32:07 2013
What do I need to change in order to obtain the full path to the modified file?
Problem solved. As it turns out, FSEvents on OS X returns only the directory for file-modified events, leaving you to scan that directory yourself to find out which file was modified. This is not mentioned in the Watchdog documentation, though it is easily found in the FSEvents documentation.
To get the full path to the file, I added the following snippet of code (inspired by this StackOverflow thread) to find the most recently modified file in a directory, to be used whenever event.src_path returns a directory.
if event.is_directory:
    files_in_dir = [event.src_path + "/" + f for f in os.listdir(event.src_path)]
    mod_file_path = max(files_in_dir, key=os.path.getmtime)
mod_file_path contains the full path to the modified file.
Thanks ekl for providing your solution. I just stumbled across the same problem. However, I had been using PatternMatchingEventHandler, which requires small changes to your solution:

- subclass FileSystemEventHandler instead
- create an attribute pattern where you store your matching pattern. This is not as flexible as the original PatternMatchingEventHandler, but it should suffice for most needs, and you will get the idea anyway if you want to extend it.

Here's the code you have to put in your FileSystemEventHandler subclass:
# (needs "import fnmatch" and "import os" at the top of the module)

def __init__(self, pattern='*'):
    super(MidiEventHandler, self).__init__()
    self.pattern = pattern

def on_modified(self, event):
    super(MidiEventHandler, self).on_modified(event)
    if event.is_directory:
        files_in_dir = [event.src_path + "/" + f for f in os.listdir(event.src_path)]
        if len(files_in_dir) > 0:
            modifiedFilename = max(files_in_dir, key=os.path.getmtime)
        else:
            return
    else:
        modifiedFilename = event.src_path

    if fnmatch.fnmatch(os.path.basename(modifiedFilename), self.pattern):
        print("Modified MIDI file: %s" % modifiedFilename)
One other thing I changed: I check whether the directory is empty before running max() on the file list, since max() does not work with empty lists.
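
On Python 3.4+, the empty-list check can also be avoided with max()'s default argument:

files_in_dir = [os.path.join(event.src_path, f) for f in os.listdir(event.src_path)]
modifiedFilename = max(files_in_dir, key=os.path.getmtime, default=None)
if modifiedFilename is None:
    return  # the directory was empty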