updating file path list in a fast changing directory - python

I have a directory that I visit from time to time to check its contents, so I wrote this code to retrieve the paths of all the files within this directory and its subdirectories:
import os

our_dir = 'c:\\mydocs'
walk = os.walk(our_dir)
for path, folders, files in walk:
    for f in files:
        file_path = os.path.join(path, f)
        print(file_path)
This directory has 200K+ files and sees frequent changes and additions, so by the time the code finishes running, more files will have been added or changed. The question is how to do the following:
1. Conduct an initial run of the code that lists the paths of all files in this directory that were created/changed before the initial run's start time.
2. Somehow store the files added/changed during the initial run (between the initial run's start time and end time).
3. With every subsequent run, list only the paths created/changed after the end time of the initial run and before the current run, plus those created/changed during the current run (between the current run's start time and end time).
Any idea on how to do this? I just want to make it clear that I am not "watching/monitoring" the directory, but I am visiting it from time to time.

Here's a really basic structure idea: each folder gets its own thread. You would have two classes, one that gathers the data ("Helper") and one that stores it ("Directory").
Two classes are required because a thread can only be started once, and you need to be able to generate a new thread for a directory that has already been listed without losing its data.
The root directory would be a Directory instance that lists its given path ('C:\mydocs'). It would store the file list in self.files and create a new Directory instance for every directory it contains (without forgetting to store them in self.dirs so they can be accessed later).
Refreshing could be timed, and would check the directory's modification date, as you suggested.
Here's some code to help you understand my idea:
import os
import threading
import time

class Helper(threading.Thread):
    """Worker thread that walks a Directory's path and fills in its lists."""
    def __init__(self, directory):
        super(Helper, self).__init__()
        self.directory = directory
        self.start()
    def run(self):
        for path, folders, files in os.walk(self.directory.path):
            for f in files:
                self.directory.files.append(os.path.join(path, f))
            for d in folders:
                self.directory.dirs.append(Directory(os.path.join(path, d), self.directory.interval, self.directory.do))

class Directory(threading.Thread):
    """Stores the listing for one path and decides when to refresh it."""
    def __init__(self, path, interval=5, do=None):
        super(Directory, self).__init__()
        self.path = path
        self.files, self.dirs = ([], [])
        self.interval = interval
        self.last_update = 0
        self.helper = None
        self.do = do # One flag to stop refreshing all instances
        if do is None:
            self.do = True
    def run(self):
        while self.do:
            self.refresh()
            time.sleep(self.interval)
    def refresh(self):
        # Only start a refresh if the previous helper is done and the directory was changed
        if not (self.helper and self.helper.is_alive()) and self.has_changed():
            self.last_update = int(time.time())
            self.helper = Helper(self)
    def has_changed(self):
        return int(os.path.getmtime(self.path)) > self.last_update
I think this should be enough to get you started!
Edit: I changed the code a bit to actually be in a working state. Or at least I hope it is (I haven't tested it)!
Edit 2: I actually took the time to test this, and fix it. I ran:
if __name__ == '__main__':
    root = Directory('/home/plg')
    root.refresh()
    root.helper.join()
    for d in [root] + root.dirs:
        for f in d.files:
            print(f)
And:
$ time python bin/dirmon.py | wc -l # wc -l == len(sys.stdout.readlines())
7805
real 0m0.078s
user 0m0.048s
sys 0m0.028s
That's 7805 / 0.078 = 100,064 files per second. Not too bad! :)
Edit 3 (last one!):
I ran the test on '/', first run (without cache):
147551 / 4.103 = 35,961 files per second
Second and third:
$ time python bin/dirmon.py | wc -l
147159
real 0m1.213s
user 0m0.940s
sys 0m0.272s
$ time python bin/dirmon.py | wc -l
147159
real 0m1.209s
user 0m0.928s
sys 0m0.284s
147159 / 1.213 = 121,318 files per second
147159 / 1.209 = 121,719 files per second
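For the visit-from-time-to-time case in the question, a non-threaded alternative is to remember when the previous run started and compare each file's modification time against it. Below is a minimal sketch of that idea, not the threaded design above. The function name list_changed_files, the JSON state file and its last_start key are all made up for illustration. It also assumes the state file lives outside the scanned directory and that modification times are a good enough signal (files copied in with preserved timestamps would be missed).

```python
import json
import os
import time


def list_changed_files(root, state_file):
    """Return paths under root modified since the previous run started."""
    # Load the start time of the previous run (0.0 means "never ran").
    try:
        with open(state_file) as fh:
            last_start = json.load(fh)["last_start"]
    except (OSError, ValueError, KeyError):
        last_start = 0.0

    # Record the start time *before* walking, so files that change while
    # this run is in progress are reported again by the next run.
    this_start = time.time()

    changed = []
    for path, folders, files in os.walk(root):
        for name in files:
            file_path = os.path.join(path, name)
            try:
                if os.path.getmtime(file_path) >= last_start:
                    changed.append(file_path)
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue

    # Persist this run's start time for the next visit.
    with open(state_file, "w") as fh:
        json.dump({"last_start": this_start}, fh)
    return changed
```

The first call returns every file; later calls return only what changed since the previous call began. Because the stored timestamp is the start of the run rather than the end, a file touched mid-run may be listed twice across two runs, but it will not be missed.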


pyinotify: execute a command with args

This might be a duplicate, but I couldn't find exactly what I was looking for. Feel free to link any previous answer.
I need to write a Python script (bash would also be OK) that continuously watches a directory. When the content of this directory changes (because another program creates a new directory inside it), I want to automatically run a command that takes the name of the newly created directory as an argument.
Example:
I need to watch the directory /home/tmp/.
The current content of the directory is:
$ ls /home/tmp
Patient Patient2 Patient3
Suddenly, a Patient4 directory arrives in /home/tmp.
I want code that automatically runs
$ my_command --target_dir /home/tmp/Patient4/
I hope I've explained clearly what I need.
Thanks
The answer that I found works on Linux only, and it makes use of the pyinotify wrapper. Below is the working code:
import os
import pyinotify

# MODEL, OUTPUT and HISTORY are not defined in this snippet; they are assumed to be set elsewhere.

class EventProcessor(pyinotify.ProcessEvent):
    _methods = ["IN_CREATE",
                # "IN_OPEN",
                # "IN_ACCESS",
                ]

def process_generator(cls, method):
    def _method_name(self, event):
        # Only react to newly created directories, not to new files
        if event.maskname == "IN_CREATE|IN_ISDIR":
            print(f"Starting pipeline for {event.pathname}")
            os.system(f"clearlung --single --automatic --base_dir {event.pathname} --target_dir CT " + \
                      f"--model {MODEL} --subroi --output_dir {OUTPUT} --tag 0 --history_path {HISTORY}")
    _method_name.__name__ = "process_{}".format(method)
    setattr(cls, _method_name.__name__, _method_name)

# Attach one process_<EVENT> handler to the class per listed event name
for method in EventProcessor._methods:
    process_generator(EventProcessor, method)

class PathWatcher():
    """Class to watch for changes"""
    def __init__(self, path_to_watch) -> None:
        """Base constructor"""
        self.path = path_to_watch
        if not os.path.isdir(self.path):
            raise FileNotFoundError()
    def watch(self):
        """Main method of the PathWatcher class"""
        print(f"Waiting for changes in {self.path}...")
        watch_manager = pyinotify.WatchManager()
        event_notifier = pyinotify.Notifier(watch_manager, EventProcessor())
        watch_this = os.path.abspath(self.path)
        watch_manager.add_watch(watch_this, pyinotify.ALL_EVENTS)
        event_notifier.loop()

Python file path in Windows

I am using Windows 10.
I have this code,
script_dir = os.path.dirname(__file__)
temp = cs(os.path.join(script_dir+"first.txt"),
          os.path.join(script_dir+"second.text"),
          os.path.join(script_dir+"third.txt"))
It executes in Git Bash, but it throws an error in PowerShell and cmd.
How can I fix this code so that I can run it from anywhere?
============================================================
Edit:
It says it cannot find .first.txt and the files after it.
It also throws this error:
DLL load failed: The specified module could not be found.
============================================================
Edit2:
cs is a class I created.
class cs:
    info = {}
    result = {}
    def __init__(self, first, second, third, output=None):
        self.output = ""
        self.first = first
        self.second = second
        self.third = third
    def decrypt(self):
        pass
I don't know why it works in git bash, but not in powershell and cmd
The correct code is
script_dir = os.path.dirname(__file__)
temp = cs(os.path.join(script_dir, "first.txt"),
          os.path.join(script_dir, "second.text"),
          os.path.join(script_dir, "third.txt"))
What you are doing wrong is concatenating "first.txt" etc. onto script_dir and passing the single resulting string to os.path.join. os.path.join, however, takes any number of arguments and joins them with the correct separator. Your code simply glued the strings together, producing script_dirfirst.txt (no separator between the directory and the file name), which explains why the file couldn't be found.
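To see the difference side by side, here is a small sketch. It uses ntpath, the Windows flavour of os.path, so the result is the same on any operating system; the directory C:\proj is a made-up value for illustration only.

```python
import ntpath  # the Windows flavour of os.path; importable on any OS

script_dir = r"C:\proj"  # hypothetical directory, for illustration only

# The concatenation happens first, so join() receives a single argument
# and has nothing left to join.
concatenated = ntpath.join(script_dir + "first.txt")

# Passing the parts separately lets join() insert the separator.
joined = ntpath.join(script_dir, "first.txt")

print(concatenated)  # C:\projfirst.txt
print(joined)        # C:\proj\first.txt
```

In real code you would keep calling os.path.join, which picks the right flavour for the platform the script runs on.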

Parsing a directory tree that is not on my local computer using multithreading

So I'm trying to parse a directory tree, but the directory tree is not on my local computer. I was able to do it, but the problem is that when I try to parse a path that has a lot of directories inside it, the process takes a long time to complete. I thought that making my code run on multiple threads would speed it up, but I'm having some trouble figuring out where to start with multithreading. I've provided my existing code below. If anyone has any advice, please comment.
Thank you
NOTE
Some of the code uses pyqt5 to display the directory tree to a ui that I created.
def populate_values(self):
    list_of_dir = GenVar.device.shell("ls {} -p | grep \"/\"".format(GenVar.path)).split('\n')[0:-1] # Find all directories in the root path
    while len(list_of_dir) > 0:
        try:
            dir, parent = list_of_dir[0]
            list_of_dir.pop(0)
        except:
            dir = GenVar.path + list_of_dir.pop(0)
        dir_parent = QTreeWidgetItem(parent)
        dir_parent.setText(0, dir[len(GenVar.path):])
        dir_items = self._get_sysfs_nodes(path = dir)
        for item in dir_items:
            child = QTreeWidgetItem(dir_parent)
            child.setText(0, item)
        new_list = GenVar.device.shell("ls {} -p | grep \"/\"".format(dir)).split('\n')[0:-1] # Checking if there are any directories within the current directory
        # If there are any directories clean them up and add them to the list_of_dir list
        if len(new_list) > 0:
            for i, child_path in enumerate(new_list):
                if child_path[-1] != '/':
                    child_path += '/'
                new_list[i] = (dir + child_path, dir_parent)
            list_of_dir += new_list
EDIT
GenVar.device is an adb client. So the directory path is not on a shared computer or my local computer; it's on an adb server. The code is running on my local machine, but I'm communicating with an adb server.

Subprocess.Popen only runs second time

I have a boot controller which runs a boot.py file contained in the folder of each tool I am trying to deploy. I want my boot controller to run all of these boot files simultaneously. The config file has the tool names and the desired versions, which are used to generate the path to each boot.py.
def run_boot():
    config_file = get_config_file()
    parse_config_file.init(config_file)
    tools = parse_config_file.get_tools_to_deploy()
    #tools is now a list of tool names
    top_dir = os.getcwd()
    for tool in tools:
        ver = parse_config_file.get_tool_version(tool).strip()
        boot_file_path = "{0}\\Deploy\\{1}\\{2}".format(os.getcwd(),tool,ver)
        try:
            subprocess.Popen('boot.py', shell=True, cwd=boot_file_path)
        except:
            print ("{0} failed to open".format(tool))
        print(tool, boot_file_path)
    os.chdir(top_dir)
The first time I run this, the print(tool, boot_file_path) executes but the processes do not start. The second time it is run, the processes do open. I cannot find a reason for this.

FileNotFoundError When file exists (when created in current script)

I am trying to create a secure (e.g., SSL/HTTPS) XML-RPC Client Server. The client-server part works perfectly when the required certificates are present on my system; however, when I try to create the certificates during execution, I receive a FileNotFoundError when opening the ssl-wrapped socket even though the certificates are clearly present (because the preceding function created them.)
Why is the FileNotFoundError given when the files are present? (If I simply close and restart the python script no error is produced when opening the socket and everything works with no issue whatsoever.)
I've searched elsewhere for solutions, but the best/closest answer I've found is, perhaps, "race conditions" between creating the certificates and opening them. However, I've tried adding "sleep" to alleviate the possibility of race conditions (as well as running each function individually via a user input menu) with the same error every time.
What am I missing?
Here is a snippet of my code:
import os
import threading
import time
import ssl
from xmlrpc.server import SimpleXMLRPCServer
import certs.gencert as gencert # <---- My python module for generating certs
...
rootDomain = "mydomain"
CERTFILE = "certs/mydomain.cert"
KEYFILE = "certs/mydomain.key"
...
def listenNow(ipAdd, portNum, serverCert, serverKey):
    # Create XMLRPC Server, based on ipAdd/port received
    server = SimpleXMLRPCServer((ipAdd, portNum))
    # **THIS** is what causes the FileNotFoundError ONLY if
    # the certificates are created during THE SAME execution
    # of the program.
    server.socket = ssl.wrap_socket(server.socket,
                                    certfile=serverCert,
                                    keyfile=serverKey,
                                    do_handshake_on_connect=True,
                                    server_side=True)
    ...
    # Start server listening [forever]
    server.serve_forever()
...
# Verify Certificates are present; if not present,
# create new certificates
def verifyCerts():
    # If cert or key file not present, create new certs
    if not os.path.isfile(CERTFILE) or not os.path.isfile(KEYFILE):
        # NOTE: This [gencert] will create certificates matching
        # the file names listed in CERTFILE and KEYFILE at the top
        gencert.gencert(rootDomain)
        print("Certfile(s) NOT present; new certs created.")
    else:
        print("Certfiles Verified Present")
# Start a thread to run server connection as a daemon
def startServer(hostIP, serverPort):
    # Verify certificates present prior to starting server
    verifyCerts()
    # Now, start thread
    t = threading.Thread(name="ServerDaemon",
                         target=listenNow,
                         args=(hostIP,
                               serverPort,
                               CERTFILE,
                               KEYFILE
                               )
                         )
    t.daemon = True
    t.start()
if __name__ == '__main__':
    startServer("127.0.0.1", 12345)
    time.sleep(60) # <--To allow me to connect w/client before closing
When I run the above, with NO certificates present, this is the error I receive:
$ python3 test.py
Certfile(s) NOT present; new certs created.
Exception in thread ServerDaemon:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "test.py", line 41, in listenNow
server_side=True)
File "/usr/lib/python3.5/ssl.py", line 1069, in wrap_socket
ciphers=ciphers)
File "/usr/lib/python3.5/ssl.py", line 691, in __init__
self._context.load_cert_chain(certfile, keyfile)
FileNotFoundError: [Errno 2] No such file or directory
When I simply re-run the script a second time (i.e., the cert files are already present when it starts), everything runs as expected with NO errors, and I can connect my client just fine.
$ python3 test.py
Certfiles Verified Present
What is preventing the ssl.wrap_socket function from seeing/accessing the files that were just created (and thus producing the FileNotFoundError exception)?
EDIT 1:
Thanks for the comments, John Gordon. Here is a copy of gencert.py, courtesy of Atul Varma, found here: https://gist.github.com/toolness/3073310
import os
import sys
import hashlib
import subprocess
import datetime
OPENSSL_CONFIG_TEMPLATE = """
prompt = no
distinguished_name = req_distinguished_name
req_extensions = v3_req
[ req_distinguished_name ]
C = US
ST = IL
L = Chicago
O = Toolness
OU = Experimental Software Authority
CN = %(domain)s
emailAddress = varmaa@toolness.com
[ v3_req ]
# Extensions to add to a certificate request
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = %(domain)s
DNS.2 = *.%(domain)s
"""
MYDIR = os.path.abspath(os.path.dirname(__file__))
OPENSSL = '/usr/bin/openssl'
KEY_SIZE = 1024
DAYS = 3650
CA_CERT = 'ca.cert'
CA_KEY = 'ca.key'
# Extra X509 args. Consider using e.g. ('-passin', 'pass:blah') if your
# CA password is 'blah'. For more information, see:
#
# http://www.openssl.org/docs/apps/openssl.html#PASS_PHRASE_ARGUMENTS
X509_EXTRA_ARGS = ()
def openssl(*args):
    cmdline = [OPENSSL] + list(args)
    subprocess.check_call(cmdline)

def gencert(domain, rootdir=MYDIR, keysize=KEY_SIZE, days=DAYS,
            ca_cert=CA_CERT, ca_key=CA_KEY):
    def dfile(ext):
        return os.path.join('domains', '%s.%s' % (domain, ext))
    # Note: this changes the working directory of the whole process
    os.chdir(rootdir)
    if not os.path.exists('domains'):
        os.mkdir('domains')
    if not os.path.exists(dfile('key')):
        openssl('genrsa', '-out', dfile('key'), str(keysize))
    # EDIT 3: mydomain.key gets output here during execution
    config = open(dfile('config'), 'w')
    config.write(OPENSSL_CONFIG_TEMPLATE % {'domain': domain})
    config.close()
    # EDIT 3: mydomain.config gets output here during execution
    openssl('req', '-new', '-key', dfile('key'), '-out', dfile('request'),
            '-config', dfile('config'))
    # EDIT 3: mydomain.request gets output here during execution
    # md5() needs bytes on Python 3, hence the encode()
    openssl('x509', '-req', '-days', str(days), '-in', dfile('request'),
            '-CA', ca_cert, '-CAkey', ca_key,
            '-set_serial',
            '0x%s' % hashlib.md5((domain +
                                  str(datetime.datetime.now())).encode('utf-8')).hexdigest(),
            '-out', dfile('cert'),
            '-extensions', 'v3_req', '-extfile', dfile('config'),
            *X509_EXTRA_ARGS)
    # EDIT 3: mydomain.cert gets output here during execution
    print("Done. The private key is at %s, the cert is at %s, and the "
          "CA cert is at %s." % (dfile('key'), dfile('cert'), ca_cert))

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: %s <domain-name>" % sys.argv[0])
        sys.exit(1)
    gencert(sys.argv[1])
EDIT 2:
Regarding John's comment, "this might mean that those files are being created, but not in the directory [I] expect":
When I have the directory open in another window, I see the files pop up in the correct location during execution. In addition, when running the test.py script a second time with no changes, the files are identified as present in the correct (the same) location. This leads me to believe that file location is not the problem. Thanks for the suggestion. I'll keep looking.
EDIT 3:
I stepped through the gencert.py program, and each of the files are correctly output at the right time during execution. I indicated when exactly they were output within the above file, labeled with "EDIT 3"
While gencert is paused awaiting my input (raw_input), I can open/view/edit the mentioned files in another program with no problem.
In addition, with the first test.py instance running (paused, waiting for user input, just after mydomain.cert appears), I can run a second instance of test.py in another terminal and it sees/uses the files just fine.
Within the first instance, however, if I continue the program it outputs "FileNotFoundError."
The problem contained in the above stems from the use of os.chdir(rootdir) as suggested by John; however, the specifics are slightly different than the created files being in the wrong location. The problem is the current working directory (cwd) of the running program being changed by gencert(). Here are the specifics:
1. The program is started with test.py, which calls verifyCerts(). At this point the program is running in the current directory of whichever folder test.py is running inside of. Use os.getcwd() to find the current directory at this point. In this case (as an example), it is running in:
/home/name/testfolder/
2. Next, os.path.isfile() looks for the files "certs/mydomain.cert" and "certs/mydomain.key"; based on the file path above [e.g., the cwd], it is looking for the following files:
/home/name/testfolder/certs/mydomain.cert
/home/name/testfolder/certs/mydomain.key
3. The above files are not present, so the program executes gencert.gencert(rootDomain) which correctly creates both files as expected in the exact locations mentioned above in number 2.
4. The problem is indeed the os.chdir() call: When gencert() executes, it uses os.chdir() to change the cwd to "rootdir," which is os.path.abspath(os.path.dirname(__file__)), which is the directory of the current file (gencert.py). Since this file is located a folder deeper, the new cwd becomes:
/home/name/testfolder/certs/
5. When gencert() finishes execution and the control returns to test.py, the cwd never changes again; the cwd remains as /home/name/testfolder/certs/ even when execution returns to test.py.
6. Later, when ssl.wrap_socket() tries to find the serverCert and serverKey, it looks for "certs/mydomain.cert" and "certs/mydomain.key" in the cwd, so here is the full path of what it is looking for:
/home/name/testfolder/certs/certs/mydomain.cert
/home/name/testfolder/certs/certs/mydomain.key
These two files are NOT present, so the program correctly returns "FileNotFoundError".
Solution
A) Move the "gencert.py" file to the same directory as "test.py"
B) At the beginning of "gencert.py" add cwd = os.getcwd() to record the original cwd of the program; then, at the very end, add os.chdir(cwd) to change back to the original cwd before ending and giving control back to the calling program.
I went with option 'B', and my program now works flawlessly. I appreciate the assistance from John Gordon to point me toward finding the source of my problem.
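For anyone who cannot edit the module that calls os.chdir(), option B can also be applied from the caller's side. The following is only a sketch of that idea: preserve_cwd and chdir_like_gencert are hypothetical names invented for this example, and the latter merely stands in for a function like gencert() that changes the working directory and never changes it back.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def preserve_cwd():
    """Restore the working directory on exit, even if the body raises."""
    original = os.getcwd()
    try:
        yield
    finally:
        os.chdir(original)


def chdir_like_gencert(target):
    """Hypothetical stand-in for a helper that calls os.chdir() internally."""
    os.chdir(target)


before = os.getcwd()
with preserve_cwd():
    chdir_like_gencert(tempfile.gettempdir())
    # Inside the block, relative paths resolve against the temp directory.
after = os.getcwd()

print(before == after)  # True
```

Keep in mind that the working directory is shared by the whole process, so other threads see the change while the block is running. Building CERTFILE and KEYFILE as absolute paths once at startup would avoid depending on the cwd at all.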
