I want to monitor a directory whose subdirectories contain some .md files (along with other files, such as *.swp). I only want to monitor the .md files. I have read the docs, and there is only an ExcludeFilter; this issue, https://github.com/seb-m/pyinotify/issues/31, says only directories can be filtered, not files.
What I do now is filter inside the process_* functions by checking event.name with fnmatch.
So if I only want to monitor files with certain suffixes, is there a better way? Thanks.
This is the main code I have written:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pyinotify
import fnmatch

def suffix_filter(fn):
    suffixes = ["*.md", "*.markdown"]
    for suffix in suffixes:
        if fnmatch.fnmatch(fn, suffix):
            return False
    return True

class EventHandler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        if not suffix_filter(event.name):
            print "Creating:", event.pathname

    def process_IN_DELETE(self, event):
        if not suffix_filter(event.name):
            print "Removing:", event.pathname

    def process_IN_MODIFY(self, event):
        if not suffix_filter(event.name):
            print "Modifying:", event.pathname

    def process_default(self, event):
        print "Default:", event.pathname
I think you basically have the right idea, but it could be implemented more easily.
The ProcessEvent class in the pyinotify module already has a hook you can use to filter the processing of events. It's specified via an optional pevent keyword argument to the constructor and is saved in the instance's self.pevent attribute. The default value is None. Its value is used in the class's __call__() method, as shown in the following snippet from the pyinotify.py source file:
def __call__(self, event):
    stop_chaining = False
    if self.pevent is not None:
        # By default methods return None so we set as guideline
        # that methods asking for stop chaining must explicitly
        # return non None or non False values, otherwise the default
        # behavior will be to accept chain call to the corresponding
        # local method.
        stop_chaining = self.pevent(event)
    if not stop_chaining:
        return _ProcessEvent.__call__(self, event)
So you could use it to allow only events for files with certain suffixes (aka extensions), with something like this:
import os.path

SUFFIXES = {".md", ".markdown"}

def suffix_filter(event):
    # return True to stop processing of the event (to "stop chaining")
    return os.path.splitext(event.name)[1] not in SUFFIXES

processevent = pyinotify.ProcessEvent(pevent=suffix_filter)
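For context, here's a minimal sketch (my addition, not part of the original answer) of how this hook might be wired into a recursive watch, reusing the question's EventHandler; the watched path is a placeholder:

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CREATE | pyinotify.IN_DELETE | pyinotify.IN_MODIFY

handler = EventHandler(pevent=suffix_filter)  # the filter runs before any process_* method
notifier = pyinotify.Notifier(wm, handler)
# rec=True watches subdirectories; auto_add=True picks up newly created ones
wm.add_watch('/path/to/dir', mask, rec=True, auto_add=True)
notifier.loop()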
There's nothing particularly wrong with your solution, but you want your inotify handler to be as fast as possible, so there are a few optimizations you can make.
You should move the suffixes you match against out of the function, so they are only built once:
EXTS = set([".md", ".markdown"])
I made them a set so you can do a more efficient match:
def suffix_filter(fn):
    ext = os.path.splitext(fn)[1]
    if ext in EXTS:
        return False
    return True
I'm only presuming that os.path.splitext and a set search are faster than an iterative fnmatch, but this may not be true for your really small list of extensions - you should test it.
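If you do want to measure it, a rough timeit comparison (my sketch, not part of the original answer) might look like this:

import timeit

setup = '''
import fnmatch, os.path
EXTS = set([".md", ".markdown"])
SUFFIXES = ["*.md", "*.markdown"]
fn = "notes.markdown"
'''
# set lookup on the extension vs. iterating over fnmatch patterns
print timeit.timeit('os.path.splitext(fn)[1] in EXTS', setup=setup)
print timeit.timeit('any(fnmatch.fnmatch(fn, s) for s in SUFFIXES)', setup=setup)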
(Note: I've mirrored your code above where you return False when you make a match, but I'm not convinced that's what you want - it is at the very least not very clear to someone reading your code)
You can use the __call__ method of ProcessEvent to centralize the call to suffix_filter:
class EventHandler(pyinotify.ProcessEvent):
    def __call__(self, event):
        if not suffix_filter(event.name):
            super(EventHandler, self).__call__(event)

    def process_IN_CREATE(self, event):
        print "Creating:", event.pathname

    def process_IN_DELETE(self, event):
        print "Removing:", event.pathname

    def process_IN_MODIFY(self, event):
        print "Modifying:", event.pathname
This working code brings up a QFileDialog prompting the user to select a .csv file:
def load(self, fileName=None):
    if not fileName:
        fileName = fileDialog.getOpenFileName(caption="Load Existing Radio Log", filter="csv (*.csv)")[0]
    ...
    ...
Now, I'd like to change that filter to be more selective. The program saves each project as a set of three .csv files (project.csv, project_fleetsync.csv, project_clueLog.csv) but I only want the file dialog to display the first one (project.csv) in order to avoid presenting the user with too many choices when only a third of them can be handled by the rest of the load() function.
According to this post, it looks like the solution is to use a proxy model. So, I changed the code to the following (all of the commented lines in load() are things I've tried in various combinations):
def load(self, fileName=None):
    if not fileName:
        fileDialog = QFileDialog()
        fileDialog.setProxyModel(CSVFileSortFilterProxyModel(self))
#        fileDialog.setNameFilter("CSV (*.csv)")
#        fileDialog.setOption(QFileDialog.DontUseNativeDialog)
#        fileName = fileDialog.getOpenFileName(caption="Load Existing Radio Log", filter="csv (*.csv)")[0]
#        fileName = fileDialog.getOpenFileName(caption="Load Existing Radio Log")[0]
#        fileDialog.exec_()
    ...
    ...
# code for CSVFileSortFilterProxyModel partially taken from
# https://github.com/ZhuangLab/storm-control/blob/master/steve/qtRegexFileDialog.py
class CSVFileSortFilterProxyModel(QSortFilterProxyModel):
    def __init__(self, parent=None):
        print("initializing CSVFileSortFilterProxyModel")
        super(CSVFileSortFilterProxyModel, self).__init__(parent)

    # filterAcceptsRow - return True if row should be included in the model, False otherwise
    #
    # do not list files named *_fleetsync.csv or *_clueLog.csv
    # do a case-insensitive comparison just in case
    def filterAcceptsRow(self, source_row, source_parent):
        print("CSV filterAcceptsRow called")
        source_model = self.sourceModel()
        index0 = source_model.index(source_row, 0, source_parent)
        # Always show directories
        if source_model.isDir(index0):
            return True
        # filter files
        filename = source_model.fileName(index0)
#        filename = self.sourceModel().index(row, 0, parent).data().lower()
        print("testing lowercased filename:" + filename)
        if filename.count("_fleetsync.csv") + filename.count("_clueLog.csv") == 0:
            return True
        else:
            return False
When I call the load() function, I do get the "initializing CSVFileSortFilterProxyModel" output, but apparently filterAcceptsRow is not getting called: there is no "CSV filterAcceptsRow called" output, and, the _fleetsync.csv and _clueLog.csv files are still listed in the dialog. Clearly I'm doing something wrong...?
Found the solution at another stackoverflow question here.
From that solution:
The main thing to watch out for is to call dialog.setOption(QFileDialog::DontUseNativeDialog) before dialog.setProxyModel.
Also, it looks like you then have to use fileDialog.exec_() rather than fileDialog.getOpenFileName. The value you set with setNameFilter does show up in the filter field of the non-native dialog, but it is effectively just decoration, since the proxy model filter overrides it. In my opinion that is a good thing, since you can put wording in the filter field that indicates to the user what type of filtering is going on.
Thanks to users Frank and ariwez.
UPDATE: to clarify, here's the full final code I'm using:
def load(self, fileName=None):
    if not fileName:
        fileDialog = QFileDialog()
        fileDialog.setOption(QFileDialog.DontUseNativeDialog)
        fileDialog.setProxyModel(CSVFileSortFilterProxyModel(self))
        fileDialog.setNameFilter("CSV Radio Log Data Files (*.csv)")
        fileDialog.setDirectory(self.firstWorkingDir)
        if fileDialog.exec_():
            fileName = fileDialog.selectedFiles()[0]
        else:  # user pressed cancel on the file browser dialog
            return
    ... (the rest of the load function processes the selected file)
    ...
# code for CSVFileSortFilterProxyModel partially taken from
# https://github.com/ZhuangLab/storm-control/blob/master/steve/qtRegexFileDialog.py
class CSVFileSortFilterProxyModel(QSortFilterProxyModel):
    def __init__(self, parent=None):
        # print("initializing CSVFileSortFilterProxyModel")
        super(CSVFileSortFilterProxyModel, self).__init__(parent)

    # filterAcceptsRow - return True if row should be included in the model, False otherwise
    #
    # do not list files named *_fleetsync.csv or *_clueLog.csv
    # do a case-insensitive comparison just in case
    def filterAcceptsRow(self, source_row, source_parent):
        # print("CSV filterAcceptsRow called")
        source_model = self.sourceModel()
        index0 = source_model.index(source_row, 0, source_parent)
        # Always show directories
        if source_model.isDir(index0):
            return True
        # filter files
        filename = source_model.fileName(index0).lower()
        # print("testing lowercased filename:" + filename)
        # never show non- .csv files
        if filename.count(".csv") < 1:
            return False
        if filename.count("_fleetsync.csv") + filename.count("_cluelog.csv") == 0:
            return True
        else:
            return False
As far as I know, a luigi.Target either exists or it doesn't.
Therefore, if a luigi.Target exists, the task that produces it won't be recomputed.
I'm looking for a way to force recomputation of the task, if one of its dependencies is modified, or if the code of one of the tasks changes.
One way you could accomplish your goal is by overriding the complete(...) method.
The documentation for complete is straightforward.
Simply implement a function that checks your constraint, and returns False if you want to recompute the task.
For example, to force recomputation when a dependency has been updated, you could do:
def complete(self):
    """Flag this task as incomplete if any requirement is incomplete or has been updated more recently than this task."""
    import os
    import time

    def mtime(path):
        return time.ctime(os.path.getmtime(path))

    # assuming 1 output
    if not os.path.exists(self.output().path):
        return False

    self_mtime = mtime(self.output().path)

    # the below assumes a list of requirements, each with a list of outputs. YMMV
    for el in self.requires():
        if not el.complete():
            return False
        for output in el.output():
            if mtime(output.path) > self_mtime:
                return False

    return True
This will return False when any requirement is incomplete, when any requirement has been modified more recently than this task, or when this task's output does not exist.
Detecting when code has changed is harder. You could use a similar scheme (checking mtime), but it'd be hit-or-miss unless every task has its own file.
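For example, one rough sketch (my addition; the sidecar-file naming and mixin name are arbitrary) is to hash each task's source with inspect.getsource, store the digest next to the output, and report the task incomplete whenever the digest changes:

import hashlib
import inspect
import os

class CodeHashMixin:
    """Flag a task as incomplete when the source code of its class has changed."""
    def _code_digest(self):
        source = inspect.getsource(self.__class__)
        return hashlib.md5(source.encode('utf-8')).hexdigest()

    def _digest_path(self):
        return self.output().path + '.codehash'  # sidecar file; assumes one output

    def complete(self):
        if not os.path.exists(self.output().path):
            return False
        if not os.path.exists(self._digest_path()):
            return False
        with open(self._digest_path()) as f:
            return f.read().strip() == self._code_digest()

You would then write the digest to the sidecar file at the end of run().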
Because of the ability to override complete, any logic you want for recomputation can be implemented. If you want a particular complete method for many tasks, I'd recommend sub-classing luigi.Task, implementing your custom complete there, and then inheriting your tasks from the sub-class.
I'm late to the game, but here's a mixin that improves the accepted answer to support multiple input / output files.
import os
import time

class MTimeMixin:
    """
    Mixin that flags a task as incomplete if any requirement
    is incomplete or has been updated more recently than this task.
    This is based on http://stackoverflow.com/a/29304506, but extends
    it to support multiple input / output dependencies.
    """
    def complete(self):
        def to_list(obj):
            if type(obj) in (type(()), type([])):
                return obj
            else:
                return [obj]

        def mtime(path):
            return time.ctime(os.path.getmtime(path))

        if not all(os.path.exists(out.path) for out in to_list(self.output())):
            return False

        self_mtime = min(mtime(out.path) for out in to_list(self.output()))

        # the below assumes a list of requirements, each with a list of outputs. YMMV
        for el in to_list(self.requires()):
            if not el.complete():
                return False
            for output in to_list(el.output()):
                if mtime(output.path) > self_mtime:
                    return False

        return True
To use it, you would just declare your class using, for example, class MyTask(MTimeMixin, luigi.Task).
The above code works well for me, except that I believe for proper timestamp comparison mtime(path) must return a float instead of a string ("Sat " > "Mon "...[sic]). Thus, simply:
def mtime(path):
    return os.path.getmtime(path)
instead of:
def mtime(path):
    return time.ctime(os.path.getmtime(path))
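To see why, note that time.ctime() produces strings like 'Sat Mar  3 10:00:00 2018', and strings compare lexicographically, so a later date can compare as "older":

>>> 'Sat Mar  3 10:00:00 2018' > 'Thu Jan  1 00:00:00 1970'
False

('S' sorts before 'T', so the 2018 timestamp looks smaller than the 1970 one.)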
Regarding the Mixin suggestion from Shilad Sen, consider this example:
# Filename: run_luigi.py
import luigi
from MTimeMixin import MTimeMixin

class PrintNumbers(luigi.Task):
    def requires(self):
        return []

    def output(self):
        return luigi.LocalTarget("numbers_up_to_10.txt")

    def run(self):
        with self.output().open('w') as f:
            for i in range(1, 11):
                f.write("{}\n".format(i))

class SquaredNumbers(MTimeMixin, luigi.Task):
    def requires(self):
        return [PrintNumbers()]

    def output(self):
        return luigi.LocalTarget("squares.txt")

    def run(self):
        with self.input()[0].open() as fin, self.output().open('w') as fout:
            for line in fin:
                n = int(line.strip())
                out = n * n
                fout.write("{}:{}\n".format(n, out))

if __name__ == '__main__':
    luigi.run()
where MTimeMixin is as in the post above. I run the task once using
luigi --module run_luigi SquaredNumbers
Then I touch the file numbers_up_to_10.txt and run the task again. Luigi then gives the following complaint:
File "c:\winpython-64bit-3.4.4.6qt5\python-3.4.4.amd64\lib\site-packages\luigi-2.7.1-py3.4.egg\luigi\local_target.py", line 40, in move_to_final_destination
os.rename(self.tmp_path, self.path)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'squares.txt-luigi-tmp-5391104487' -> 'squares.txt'
This may just be a Windows problem; it is not an issue on Linux, where "mv a b" simply replaces an existing b if it is not write-protected. We can fix this with the following patch to luigi/local_target.py:
def move_to_final_destination(self):
    if os.path.exists(self.path):
        os.rename(self.path, self.path + time.strftime("_%Y%m%d%H%M%S.txt"))
    os.rename(self.tmp_path, self.path)
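As an alternative (my suggestion, untested against Luigi's internals): on Python 3.3+, os.replace() overwrites the destination atomically on both POSIX and Windows, so the method could arguably be reduced to:

import os

def move_to_final_destination(self):
    os.replace(self.tmp_path, self.path)  # silently replaces self.path if it already exists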
Also for completeness here is the Mixin again as a separate file, from the other post:
import os

class MTimeMixin:
    """
    Mixin that flags a task as incomplete if any requirement
    is incomplete or has been updated more recently than this task.
    This is based on http://stackoverflow.com/a/29304506, but extends
    it to support multiple input / output dependencies.
    """
    def complete(self):
        def to_list(obj):
            if type(obj) in (type(()), type([])):
                return obj
            else:
                return [obj]

        def mtime(path):
            return os.path.getmtime(path)

        if not all(os.path.exists(out.path) for out in to_list(self.output())):
            return False

        self_mtime = min(mtime(out.path) for out in to_list(self.output()))

        # the below assumes a list of requirements, each with a list of outputs. YMMV
        for el in to_list(self.requires()):
            if not el.complete():
                return False
            for output in to_list(el.output()):
                if mtime(output.path) > self_mtime:
                    return False

        return True
I have the following problem: my application receives arbitrary archive files, e.g. rar, zip, 7z, and I have different processors to extract and save them locally.
Right now everything looks like this:
if extension == 'zip':
    archive = zipfile.ZipFile(file_contents)
    file_name = archive.namelist()[0]
    file_contents = ContentFile(archive.read(file_name))
elif extension == '7z':
    archive = py7zlib.Archive7z(file_contents)
    file_name = archive.getnames()[0]
    file_contents = ContentFile(
        archive.getmember(file_name).read())
elif extension == '...':
And I want to switch to a more object-oriented approach, with one main Processor class and subclasses responsible for specific archive types.
E.g. I was thinking about:
class Processor(object):
    def __init__(self, filename, contents):
        self.filename = filename
        self.contents = contents

    def get_extension(self):
        return self.filename.split(".")[-1]

    def process(self):
        raise NotImplementedError("Need to implement something here")

class ZipProcessor(Processor):
    def process(self):
        archive = zipfile.ZipFile(self.contents)
        file_name = archive.namelist()[0]
        file_contents = ContentFile(archive.read(file_name))

etc
But I am not sure that's the correct way. E.g. I can't come up with a way to call the needed processor based on the file extension when following this approach.
A rule of thumb is that if you have a class with two methods, one of which is __init__(), then it's not a class but a function in disguise.
Writing classes is overkill in this case, because you still have to pick the correct class manually.
Since the handling of each kind of archive will be subtly different, wrap each in a function:
def handle_zip(name):
    print name, 'is a zip file'
    return 'zip'

def handle_7z(name):
    print name, 'is a 7z file'
    return '7z'
Et cetera. Since functions are first-class objects in Python, you can use a dictionary with the extension as key to call the right function:
import os.path

filename = 'foo.zip'
dispatch = {'.zip': handle_zip, '.7z': handle_7z}

_, extension = os.path.splitext(filename)
try:
    rv = dispatch[extension](filename)
except KeyError:
    print 'Unknown extension', extension
    rv = None
It is important to handle the KeyError here, since dispatch doesn't contain all possible extensions.
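Equivalently (a small variation on the code above), dict.get() with a fallback handler avoids the try/except:

def handle_unknown(name):
    print 'Unknown extension for', name

rv = dispatch.get(extension, handle_unknown)(filename)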
An idea that might make sense before (or instead of) writing a custom class to perform your operations generally is making sure you offer a consistent interface to archives - wrapping zipfile.ZipFile and py7zlib.Archive7z in classes with, for example, a getfilenames method.
This ensures that you don't repeat yourself, without needing to "hide" your operations in a class if you don't want to.
You may want to use an abc as a base class, to make things extra clear (see the sketch after the snippet below).
Then, you can simply:
archive_extractors = {'zip': MyZipExtractor, '7z': My7zExtractor}

extractor = archive_extractors[extension](file_contents)
file_name = extractor.getfilenames()[0]
#...
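For illustration, here is one possible sketch of that common interface using abc (the class and method names are mine, not an established API), wrapping the two archive types from the question:

import abc
import zipfile

import py7zlib  # from the pylzma package

class Extractor(object):
    __metaclass__ = abc.ABCMeta  # Python 2 style, matching the question's code

    @abc.abstractmethod
    def getfilenames(self):
        """Return the list of member names in the archive."""

    @abc.abstractmethod
    def read(self, name):
        """Return the contents of the named member."""

class MyZipExtractor(Extractor):
    def __init__(self, contents):
        self.archive = zipfile.ZipFile(contents)

    def getfilenames(self):
        return self.archive.namelist()

    def read(self, name):
        return self.archive.read(name)

class My7zExtractor(Extractor):
    def __init__(self, contents):
        self.archive = py7zlib.Archive7z(contents)

    def getfilenames(self):
        return self.archive.getnames()

    def read(self, name):
        return self.archive.getmember(name).read()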
If you want to stick to OOP, you could give Processor a static method that decides whether a class can handle a certain file, and implement it in every subclass. Then, when you need to unpack a file, use the base class's __subclasses__() method to iterate over the subclasses and create an instance of the appropriate one:
class Processor(object):
    @staticmethod
    def is_appropriate_for(name):
        raise NotImplementedError()

    def process(self, name):
        raise NotImplementedError()

class ZipProcessor(Processor):
    @staticmethod
    def is_appropriate_for(name):
        return name[-4:] == ".zip"

    def process(self, name):
        print ".. handling ", name
name = "test.zip"
handler = None
for cls in Processor.__subclasses__():
if cls.is_appropriate_for(name):
handler = cls()
print name, "handled by", handler
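One caveat worth adding: __subclasses__() only returns direct subclasses. If your processors form a deeper hierarchy, you would need to walk it recursively, for example:

def all_subclasses(cls):
    """Collect direct and indirect subclasses of cls."""
    result = []
    for sub in cls.__subclasses__():
        result.append(sub)
        result.extend(all_subclasses(sub))
    return result

for cls in all_subclasses(Processor):
    if cls.is_appropriate_for(name):
        handler = cls()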
I'm a total beginner and only started doing classes today. I'm trying to make a sort of 'spinner' object I can call something like this. One of the things I'm confused about is whether to use 'thread', 'threading' or 'processes'. I just read somewhere that an instance of a thread costs 8 MB; as this is a simple text spinner, it doesn't warrant using a huge amount of memory. My first question is which module should I use, and my second is how do I implement this in a class so I can call it like this:
spin.start() - starts it
spin.stop() - stops it
spin.cursor_invisible() - turns the cursor invisible
spin.cursor_visible() - cursor visible!
I copied some code and read some books, but I'm a bit confused. What I have so far is below; I put some comments in to show how ignorant I am. I have been reading a lot though, honest! It's kind of a large thing to get your head around.
spinner = "▏▎▍▌▋▊▉█▉▊▌▍▎" #utf8
#convert the utf8 spinner string to a list
chars = [c.encode("utf-8") for c in unicode(spinner, "utf-8")]

class spin(): # not sure what to put in the brackets (was threading.Thread, but now I'm not sure whether to use processes or not)
    def __init__(self):
        super(spin, self).__init__() # don't understand what this does
        self._stop = threading.Event()

    def run(self):
        threading.Thread(target=self).run()
        pos = 0
        while not self._stop:
            sys.stdout.write("\r" + chars[pos])
            sys.stdout.flush()
            time.sleep(.15)
            pos += 1
            pos %= len(chars)

    def cursor_visible(self):
        os.system("tput cvvis")

    def cursor_invisible(self):
        os.system("tput civis")

    def stop(self):
        self._stop.set() # the underscore makes this a private variable?

    def stopped(self):
        return self._stop.isSet()
I have altered your code slightly. Now it runs! First a commented version:
The first line tells Python that this source file contains utf-8 characters:
# -*- coding: utf-8 -*-
Then you need to import all the stuff that you will eventually use. You don't have to do it at the top of the file like this, but I'm a C guy and this is how I like it...
import threading
import sys
import time
import os

spinner = "▏▎▍▌▋▊▉█▉▊▌▍▎" #utf8
#convert the utf8 spinner string to a list
chars = [c.encode("utf-8") for c in unicode(spinner, "utf-8")]

class spin(threading.Thread):
Threading is fine for this
    def __init__(self):
        super(spin, self).__init__()
Since you are overriding the __init__ method of threading.Thread with your own, you need to call the parent class's __init__ to make sure the object is properly initialized.
        self._stop = False
I changed this to a boolean. The threading.Event is overkill for this.
    def run(self):
        pos = 0
        while not self._stop:
            sys.stdout.write("\r" + chars[pos])
            sys.stdout.flush()
            time.sleep(.15)
            pos += 1
            pos %= len(chars)
    def cursor_visible(self):
        os.system("tput cvvis")

    def cursor_invisible(self):
        os.system("tput civis")
    def stop(self):
        self._stop = True # the underscore makes this a private variable?
Sort of. It's not actually private; the underscore just tells everyone that it's bad form to access it.
    def stopped(self):
        return self._stop == True
And finally a small test of the code:
if __name__ == "__main__":
    s = spin()
    s.cursor_invisible()
    s.start()
    a = raw_input("")
    s.stop()
    s.cursor_visible()
And here is the uncommented version...
# -*- coding: utf-8 -*-
import threading
import sys
import time
import os

spinner = "▏▎▍▌▋▊▉█▉▊▌▍▎" #utf8
#convert the utf8 spinner string to a list
chars = [c.encode("utf-8") for c in unicode(spinner, "utf-8")]

class spin(threading.Thread):
    def __init__(self):
        super(spin, self).__init__()
        self._stop = False

    def run(self):
        pos = 0
        while not self._stop:
            sys.stdout.write("\r" + chars[pos])
            sys.stdout.flush()
            time.sleep(.15)
            pos += 1
            pos %= len(chars)

    def cursor_visible(self):
        os.system("tput cvvis")

    def cursor_invisible(self):
        os.system("tput civis")

    def stop(self):
        self._stop = True

    def stopped(self):
        return self._stop == True

if __name__ == "__main__":
    s = spin()
    s.cursor_invisible()
    s.start()
    a = raw_input("")
    s.stop()
    s.cursor_visible()
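One note beyond the original answer: this is Python 2 code (print statements, raw_input, unicode). Under Python 3, str is already Unicode, so the encode/decode step disappears and the setup reduces to something like:

# Python 3 sketch: no explicit UTF-8 handling needed
spinner = "▏▎▍▌▋▊▉█▉▊▌▍▎"
chars = list(spinner)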
Consider this scenario:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

walk = os.walk('/home')

for root, dirs, files in walk:
    for pathname in dirs + files:
        print os.path.join(root, pathname)

for root, dirs, files in walk:
    for pathname in dirs + files:
        print os.path.join(root, pathname)
I know this example is kind of redundant, but consider that we need to use the same walk data more than once. I have a benchmark scenario, and using the same walk data is mandatory to get meaningful results.
I've tried walk2 = walk to clone it and use it in the second iteration, but it didn't work. The question is: how can I copy it? Is it even possible?
Thank you in advance.
You can use itertools.tee():

import itertools

walk, walk2 = itertools.tee(walk)

Note that this might "need significant extra storage", as the documentation points out.
If you know you are going to iterate through the whole generator for every usage, you will probably get the best performance by unrolling the generator to a list and using the list multiple times.
walk = list(os.walk('/home'))
Define a function:

def walk_home():
    for r in os.walk('/home'):
        yield r

Or even this:

def walk_home():
    return os.walk('/home')

Both are used like this:

for root, dirs, files in walk_home():
    for pathname in dirs + files:
        print os.path.join(root, pathname)
This is a good use case for functools.partial() to make a quick generator factory:

from functools import partial
import os

walk_factory = partial(os.walk, '/home')

walk1, walk2, walk3 = walk_factory(), walk_factory(), walk_factory()

What functools.partial() does is hard to describe in words, but this is what it's for: it partially fills in a function's parameters without executing the function, so it acts as a function/generator factory.
This answer aims to extend/elaborate on what the other answers have expressed. The solution will necessarily vary depending on what exactly you aim to achieve.
If you want to iterate over the exact same result of os.walk multiple times, you will need to initialize a list from the os.walk iterable's items (i.e. walk = list(os.walk(path))).
If you must guarantee the data remains the same, that is probably your only option. However, there are several scenarios in which this is not possible or desirable.
It will not be possible to list() an iterable if the output is of sufficient size (i.e. attempting to list() an entire filesystem may freeze your computer).
It is not desirable to list() an iterable if you wish to acquire "fresh" data prior to each use.
In the event that list() is not suitable, you will need to run your generator on demand. Note that generators are exhausted after each use, so this poses a slight problem. In order to "rerun" your generator multiple times, you can use the following pattern:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

class WalkMaker:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        for root, dirs, files in os.walk(self.path):
            for pathname in dirs + files:
                yield os.path.join(root, pathname)

walk = WalkMaker('/home')

for path in walk:
    pass

# do something...

for path in walk:
    pass
The aforementioned design pattern will allow you to keep your code DRY.
This "Python Generator Listeners" code allows you to have many listeners on a single generator, like os.walk, and even have someone "chime in" later.
def walkme():
os.walk('/home')
m1 = Muxer(walkme)
m2 = Muxer(walkme)
then m1 and m2 can run in threads even and process at their leisure.
See: https://gist.github.com/earonesty/cafa4626a2def6766acf5098331157b3
import queue
from threading import Lock
from collections import namedtuple

class Muxer():
    Entry = namedtuple('Entry', 'genref listeners lock')

    already = {}
    top_lock = Lock()

    def __init__(self, func, restart=False):
        self.restart = restart
        self.func = func
        self.queue = queue.Queue()
        with self.top_lock:
            if func not in self.already:
                self.already[func] = self.Entry([func()], [], Lock())
            ent = self.already[func]
        self.genref = ent.genref
        self.lock = ent.lock
        self.listeners = ent.listeners
        self.listeners.append(self)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            e = self.queue.get_nowait()
        except queue.Empty:
            with self.lock:
                try:
                    e = self.queue.get_nowait()
                except queue.Empty:
                    try:
                        e = next(self.genref[0])
                        for other in self.listeners:
                            if other is not self:
                                other.queue.put(e)
                    except StopIteration:
                        if self.restart:
                            self.genref[0] = self.func()
                        raise
        return e

    def __del__(self):
        with self.top_lock:
            try:
                self.listeners.remove(self)
            except ValueError:
                pass
            if not self.listeners and self.func in self.already:
                del self.already[self.func]
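A minimal usage sketch (my illustration; it assumes the Muxer class above, and note that walkme must return a generator):

import os
import threading

def walkme():
    return os.walk('/home')

def consume(mux):
    for root, dirs, files in mux:
        pass  # each listener processes the shared walk at its own pace

m1 = Muxer(walkme)
m2 = Muxer(walkme)

t1 = threading.Thread(target=consume, args=(m1,))
t2 = threading.Thread(target=consume, args=(m2,))
t1.start(); t2.start()
t1.join(); t2.join()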