Consider this scenario:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

walk = os.walk('/home')

for root, dirs, files in walk:
    for pathname in dirs + files:
        print os.path.join(root, pathname)

for root, dirs, files in walk:
    for pathname in dirs + files:
        print os.path.join(root, pathname)
I know this example is somewhat redundant, but the point is that we need to use the same walk data more than once. I have a benchmark scenario, and reusing the same walk data is mandatory to get meaningful results.
I've tried walk2 = walk to clone it and use the copy in the second iteration, but it didn't work. The question is: how can I copy it? Is that even possible?
Thank you in advance.
You can use itertools.tee():
walk, walk2 = itertools.tee(walk)
Note that this might "need significant extra storage", as the documentation points out.
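A minimal, self-contained sketch of that; note that after calling tee() you should iterate the two returned copies rather than the original iterator:

import itertools
import os

walk1, walk2 = itertools.tee(os.walk('/home'))

for root, dirs, files in walk1:
    pass  # first pass; tee buffers each item it yields

for root, dirs, files in walk2:
    pass  # second pass replays the buffered items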
If you know you are going to iterate through the whole generator for every usage, you will probably get the best performance by unrolling the generator to a list and using the list multiple times.
walk = list(os.walk('/home'))
Define a function:
def walk_home():
    for r in os.walk('/home'):
        yield r
Or even this:
def walk_home():
    return os.walk('/home')
Both are used like this:
for root, dirs, files in walk_home():
    for pathname in dirs + files:
        print os.path.join(root, pathname)
This is a good use case for functools.partial() to make a quick generator factory:
from functools import partial
import os
walk_factory = partial(os.walk, '/home')
walk1, walk2, walk3 = walk_factory(), walk_factory(), walk_factory()
What functools.partial() does is partially fill in a function's parameters without actually calling the function, handing you back a new callable. Consequently it acts as a function/generator factory: every call to walk_factory() starts a fresh os.walk generator.
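For comparison, the partial above is roughly equivalent to this plain function (a sketch, not part of the original answer):

def walk_factory():
    # Each call starts a brand-new os.walk generator over '/home'.
    return os.walk('/home')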
This answer aims to extend/elaborate on what the other answers have expressed. The solution will necessarily vary depending on what exactly you aim to achieve.
If you want to iterate over the exact same result of os.walk multiple times, you will need to initialize a list from the os.walk iterable's items (i.e. walk = list(os.walk(path))).
If you must guarantee the data remains the same, that is probably your only option. However, there are several scenarios in which this is not possible or desirable.
It will not be possible to list() an iterable if its output is sufficiently large (e.g., attempting to list() an entire filesystem may exhaust memory and freeze your computer).
It is not desirable to list() an iterable if you wish to acquire "fresh" data prior to each use.
In the event that list() is not suitable, you will need to run your generator on demand. Note that generators are exhausted after one full pass, so this poses a slight problem. In order to "rerun" your generator multiple times, you can use the following pattern:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
class WalkMaker:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        for root, dirs, files in os.walk(self.path):
            for pathname in dirs + files:
                yield os.path.join(root, pathname)

walk = WalkMaker('/home')
for path in walk:
    pass

# do something...

for path in walk:
    pass
The aforementioned design pattern will allow you to keep your code DRY.
This "Python Generator Listeners" code allows you to have many listeners on a single generator, like os.walk, and even have someone "chime in" later.
def walkme():
    return os.walk('/home')

m1 = Muxer(walkme)
m2 = Muxer(walkme)
m1 and m2 can then even run in separate threads, each processing items at its own pace.
See: https://gist.github.com/earonesty/cafa4626a2def6766acf5098331157b3
import queue
from threading import Lock
from collections import namedtuple
class Muxer():
    Entry = namedtuple('Entry', 'genref listeners lock')
    already = {}
    top_lock = Lock()

    def __init__(self, func, restart=False):
        self.restart = restart
        self.func = func
        self.queue = queue.Queue()
        with self.top_lock:
            if func not in self.already:
                self.already[func] = self.Entry([func()], [], Lock())
            ent = self.already[func]
            self.genref = ent.genref
            self.lock = ent.lock
            self.listeners = ent.listeners
            self.listeners.append(self)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            e = self.queue.get_nowait()
        except queue.Empty:
            with self.lock:
                try:
                    e = self.queue.get_nowait()
                except queue.Empty:
                    try:
                        e = next(self.genref[0])
                        for other in self.listeners:
                            if other is not self:
                                other.queue.put(e)
                    except StopIteration:
                        if self.restart:
                            self.genref[0] = self.func()
                        raise
        return e

    def __del__(self):
        with self.top_lock:
            try:
                self.listeners.remove(self)
            except ValueError:
                pass
            if not self.listeners and self.func in self.already:
                del self.already[self.func]
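As a sketch of the threaded usage described above (the consume helper and thread setup are illustrative assumptions, not part of the gist):

import os
import threading

def walkme():
    return os.walk('/home')

def consume(mux):
    # Every Muxer sees each item the shared generator yields, at its own pace.
    for root, dirs, files in mux:
        pass  # process (root, dirs, files) here

m1 = Muxer(walkme)
m2 = Muxer(walkme)

t1 = threading.Thread(target=consume, args=(m1,))
t2 = threading.Thread(target=consume, args=(m2,))
t1.start(); t2.start()
t1.join(); t2.join()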
I'm building an interactive file explorer inside the Python console: when I pass in a path, I get an object; then with a dot (.) the auto-completion starts suggesting the contents of that path. I repeat this to get to the contents of each subfolder, and so on until I reach a file, at which point it returns the path.
I have achieved my goal, except for this little nagging thing: I wanted a __repr__ method, but it never worked.
Here's my code:
import os
from glob import glob

path = r'C:\Users\eng_a\Downloads'

def browse(path):
    my_dict = {'_path': path}
    tmp = os.listdir(path)
    key_contents = []
    for akey in tmp:
        key_contents.append(akey.replace(".", "_").replace(" ", "_").replace("-", "_"))
    val_paths = glob(path + '//*')
    for akey, avalue in zip(key_contents, val_paths):
        if os.path.isfile(avalue):
            my_dict[akey] = avalue
        else:
            my_dict[akey] = browse(avalue)
    def func(self):
        return self._path
    my_dict["__repr__"] = func
    my_dict["__str__"] = func
    obj = type(os.path.basename(path), (), dict(zip(my_dict.keys(), my_dict.values())))
    return obj
>>> b = browse(path)
>>> b
Unfortunately it keeps printing the default class repr (something like <class '__main__.Downloads'>) instead of the path.
As noted in the comments, obj is a class, not an instance. It contains a function __repr__ that will be bound to an instance as soon as you create it.
A simple and elegant solution to this would be to replace the function browse with a class of the same name. Calling a class creates an instance (unless you really mess with metaclasses or __new__), so the interface you have now would not have to change. Internally, however, you would instantiate your class for every directory that you delved into.
Another thing that this would allow you to do is to have a truly dynamic solution. Right now you actually recurse into all the children of your root. This can be very expensive in both time and memory. Ideally, you would only want to list the current directory, and recurse into children only when asked to.
from os import listdir
from os.path import isdir, join
import re
class browse:
    def __init__(self, path, directory=True):
        # Create an attribute in __dict__ for each child
        self.__path__ = path
        if directory:
            for file in listdir(path):
                full = join(path, file)
                key = re.sub(r'^(?=\d)|\W', '_', file)
                setattr(self, key, full if isdir(full) else browse(full, False))

    def __getattribute__(self, name):
        if name == '__path__':
            return super().__getattribute__(name)
        d = super().__getattribute__('__dict__')
        if name in d:
            child = d[name]
            if isinstance(child, str):
                child = browse(child)
                setattr(self, name, child)
            return child
        return super().__getattribute__(name)

    def __repr__(self):
        return self.__path__

    def __str__(self):
        return self.__path__
This solution adds an attribute for each entry in the root path. Files are recorded as browse objects, while directories are recorded as strings. Overriding __getattribute__ allows you to swap a requested string for a full browse object on the fly, instead of having to expand all your folders up front.
A possible improvement, given the intended use case, would be to remove the line setattr(self, name, child). This way, you would not retain unnecessary references to directories that you accidentally browsed into, for example.
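Usage then matches the question's session; the subfolder name below is a hypothetical example:

>>> b = browse(r'C:\Users\eng_a\Downloads')
>>> b
C:\Users\eng_a\Downloads
>>> b.some_subfolder        # expanded into a browse object on first access
C:\Users\eng_a\Downloads\some_subfolder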
I have the following problem: my application receives files of various archive types, e.g. rar, zip, 7z, and I have different processors to extract and save them locally.
At the moment everything looks like this:
if extension == 'zip':
    archive = zipfile.ZipFile(file_contents)
    file_name = archive.namelist()[0]
    file_contents = ContentFile(archive.read(file_name))
elif extension == '7z':
    archive = py7zlib.Archive7z(file_contents)
    file_name = archive.getnames()[0]
    file_contents = ContentFile(
        archive.getmember(file_name).read())
elif extension == '...':
And I want to switch to more object oriented approach, with one main Processor class and subclasses responsible for specific archives.
E.g. I was thinking about:
class Processor(object):
    def __init__(self, filename, contents):
        self.filename = filename
        self.contents = contents

    def get_extension(self):
        return self.filename.split(".")[-1]

    def process(self):
        raise NotImplementedError("Need to implement something here")

class ZipProcessor(Processor):
    def process(self):
        archive = zipfile.ZipFile(self.contents)
        file_name = archive.namelist()[0]
        file_contents = ContentFile(archive.read(file_name))
etc
But I am not sure that's the correct way. For example, I can't come up with a way to call the needed processor based on the file extension when following this approach.
A rule of thumb is that if you have a class with two methods, one of which is __init__(), then it's not a class but a function in disguise.
Writing classes is overkill in this case, because you still have to use the correct class manually.
Since the handling of all kinds of archives will be subtly different, wrap each in a function;
def handle_zip(name):
    print name, 'is a zip file'
    return 'zip'

def handle_7z(name):
    print name, 'is a 7z file'
    return '7z'
Et cetera. Since functions are first-class objects in Python, you can use a dictionary with the extension as a key to call the right function:
import os.path

filename = 'foo.zip'
dispatch = {'.zip': handle_zip, '.7z': handle_7z}

_, extension = os.path.splitext(filename)
try:
    rv = dispatch[extension](filename)
except KeyError:
    print 'Unknown extension', extension
    rv = None
It is important to handle the KeyError here, since dispatch doesn't contain all possible extensions.
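If you prefer to avoid the try/except, dict.get() with a None default works just as well (a stylistic variant, not part of the original answer):

handler = dispatch.get(extension)
if handler is None:
    print 'Unknown extension', extension
    rv = None
else:
    rv = handler(filename)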
An idea that might make sense before (or instead of) writing a custom class to perform your operations generally is making sure you offer a consistent interface to archives - wrapping zipfile.ZipFile and py7zlib.Archive7z in classes with, for example, a getfilenames method.
This approach ensures that you don't repeat yourself, without needing to "hide" your operations in a class if you don't want to.
You may want to use an ABC (abstract base class) as the base class, to make things extra clear.
Then, you can simply:
archive_extractors = {'zip': MyZipExtractor, '7z': My7zExtractor}
extractor = archive_extractors[extension](file_contents)
file_name = extractor.getfilenames()[0]
#...
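For instance, a sketch of what such an ABC could look like (the class and method names mirror the hypothetical extractors above; they are not from any real library):

import abc
import zipfile

class Extractor(object):
    __metaclass__ = abc.ABCMeta  # on Python 3: class Extractor(abc.ABC)

    @abc.abstractmethod
    def getfilenames(self):
        """Return the names of the members in the archive."""

class MyZipExtractor(Extractor):
    def __init__(self, file_contents):
        self.archive = zipfile.ZipFile(file_contents)

    def getfilenames(self):
        return self.archive.namelist()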
If you want to stick to OOP, you could give Processor a static method to decide if a class can handle a certain file, and implement it in every subclass. Then, if you need to unpack a file, use the base class'es __subclasses__() method to iterate over the subclasses and create an instance of the appropriate one:
class Processor(object):
    @staticmethod
    def is_appropriate_for(name):
        raise NotImplementedError()

    def process(self, name):
        raise NotImplementedError()

class ZipProcessor(Processor):
    @staticmethod
    def is_appropriate_for(name):
        return name.endswith(".zip")

    def process(self, name):
        print ".. handling ", name

name = "test.zip"
handler = None
for cls in Processor.__subclasses__():
    if cls.is_appropriate_for(name):
        handler = cls()
print name, "handled by", handler
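One caveat: __subclasses__() only sees subclasses whose defining modules have already been imported, so make sure every processor module is imported before this loop runs.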
I want to monitor a directory that has subdirectories, and in those subdirectories there are some files with the .md suffix (there may be other files too, such as *.swp).
I only want to monitor the .md files. I have read the docs, and there is only an ExcludeFilter; according to this issue: https://github.com/seb-m/pyinotify/issues/31, only directories can be filtered, not files.
What I do now is filter in the process_* functions, checking event.name with fnmatch.
So, if I only want to monitor files with specific suffixes, is there a better way? Thanks.
This is the main code I have written:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pyinotify
import fnmatch

def suffix_filter(fn):
    suffixes = ["*.md", "*.markdown"]
    for suffix in suffixes:
        if fnmatch.fnmatch(fn, suffix):
            return False
    return True

class EventHandler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        if not suffix_filter(event.name):
            print "Creating:", event.pathname

    def process_IN_DELETE(self, event):
        if not suffix_filter(event.name):
            print "Removing:", event.pathname

    def process_IN_MODIFY(self, event):
        if not suffix_filter(event.name):
            print "Modifying:", event.pathname

    def process_default(self, event):
        print "Default:", event.pathname
I think you basically have the right idea, but that it could be implemented more easily.
The ProcessEvent class in the pyinotify module already has a hook you can use to filter the processing of events. It's specified via an optional pevent keyword argument given in the call to the constructor and is saved in the instance's self.pevent attribute. The default value is None. Its value is used in the class's __call__() method, as shown in the following snippet from the pyinotify.py source file:
def __call__(self, event):
    stop_chaining = False
    if self.pevent is not None:
        # By default methods return None so we set as guideline
        # that methods asking for stop chaining must explicitly
        # return non None or non False values, otherwise the default
        # behavior will be to accept chain call to the corresponding
        # local method.
        stop_chaining = self.pevent(event)
    if not stop_chaining:
        return _ProcessEvent.__call__(self, event)
So you could use it to allow only events for files with certain suffixes (a.k.a. extensions), with something like this:
import os

SUFFIXES = {".md", ".markdown"}

def suffix_filter(event):
    # Return True to stop processing of the event (i.e. to "stop chaining").
    return os.path.splitext(event.name)[1] not in SUFFIXES

handler = EventHandler(pevent=suffix_filter)
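For completeness, the handler can then be wired into a running watcher like this (the watched path and the event mask are assumptions based on the question):

import pyinotify

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CREATE | pyinotify.IN_DELETE | pyinotify.IN_MODIFY
notifier = pyinotify.Notifier(wm, handler)
wm.add_watch('/path/to/notes', mask, rec=True, auto_add=True)
notifier.loop()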
There's nothing particularly wrong with your solution, but you want your inotify handler to be as fast as possible, so there are a few optimizations you can make.
You should move the suffixes you match against out of your function, so they are only built once, at import time:

EXTS = set([".md", ".markdown"])

I made them a set so you can do a more efficient membership test:
def suffix_filter(fn):
    ext = os.path.splitext(fn)[1]
    if ext in EXTS:
        return False
    return True
I'm only presuming that os.path.splitext and a set search are faster than an iterative fnmatch, but this may not be true for your really small list of extensions - you should test it.
(Note: I've mirrored your code above, where you return False when you make a match, but I'm not convinced that's what you want; at the very least it is not very clear to someone reading your code.)
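If you do want to test it, a rough micro-benchmark along these lines would settle the question (the sample filename is made up):

import timeit

setup = '''
import fnmatch, os.path
EXTS = set([".md", ".markdown"])
SUFFIXES = ["*.md", "*.markdown"]
fn = "2013-01-notes.markdown"
'''
print timeit.timeit('os.path.splitext(fn)[1] in EXTS', setup=setup)
print timeit.timeit('any(fnmatch.fnmatch(fn, s) for s in SUFFIXES)', setup=setup)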
You can use the __call__ method of ProcessEvent to centralize the call to suffix_filter:
class EventHandler(pyinotify.ProcessEvent):
    def __call__(self, event):
        if not suffix_filter(event.name):
            super(EventHandler, self).__call__(event)

    def process_IN_CREATE(self, event):
        print "Creating:", event.pathname

    def process_IN_DELETE(self, event):
        print "Removing:", event.pathname

    def process_IN_MODIFY(self, event):
        print "Modifying:", event.pathname
A simple question for those who know; pretty hard for me, as I suspect it may not be practically possible.
After writing a simple Python program, it is possible to run it from the computer's command prompt. I was wondering whether it is possible to let someone who runs it that way add an element to a list (list.insert) and have it still be there the next time the program is run (thus editing the predefined list and saving it that way).
EDIT: just giving a bit more information:
All the program has to do is let you choose a list, and from that list it returns a random item. I was just hoping to allow adding items to this list while running the program, keeping the list updated afterwards.
The most basic way is to use the pickle module to save and load your data to disk:
http://docs.python.org/2/library/pickle.html
http://docs.python.org/2/library/pickle.html#example
Here's how I would use it in a simple program
try:
    import cPickle as pickle
except ImportError:
    import pickle

class MyClass(object):
    def __init__(self, file_name):
        self.array = []
        self.file_name = file_name
        self.load_data()

    def add_element(self, element):
        self.array.append(element)
        self.save_data()

    def load_data(self):
        try:
            # Binary mode is required for pickle files.
            with open(self.file_name, "rb") as f:
                self.array = pickle.load(f)
        except IOError:
            pass

    def save_data(self):
        with open(self.file_name, "wb") as f:
            pickle.dump(self.array, f)

def main():
    FILE_NAME = "test.pkl"
    a = MyClass(FILE_NAME)
    print "elements in array are", a.array
    for i in range(5):
        a.add_element(i)

if __name__ == "__main__":
    main()
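For instance, if the script were saved as persist.py (a hypothetical name) and run twice, the output would look something like:

$ python persist.py
elements in array are []
$ python persist.py
elements in array are [0, 1, 2, 3, 4]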
I have a program depending on a large code base that prints a lot of irrelevant and annoying messages. I would like to clean them up a bit, but since their content is dynamically generated, I can't just grep for them.
Is there a way to place a hook on the print statement? (I use python 2.4, but I would be interested in results for any version). Is there another way to find from which "print" statement the output comes?
For CPython2.5 or older:
import sys
import inspect
import collections

_stdout = sys.stdout
Record = collections.namedtuple(
    'Record',
    'frame filename line_number function_name lines index')

class MyStream(object):
    def __init__(self, target):
        self.target = target

    def write(self, text):
        if text.strip():
            record = Record(*inspect.getouterframes(inspect.currentframe())[1])
            self.target.write(
                '{f} {n}: '.format(f=record.filename, n=record.line_number))
        self.target.write(text)

sys.stdout = MyStream(sys.stdout)

def foo():
    print('Hi')

foo()
yields
/home/unutbu/pybin/test.py 20: Hi
For CPython2.6+ we can import the print function with
from __future__ import print_function
and then redirect it as we wish:
from __future__ import print_function
import sys
import inspect
import collections

Record = collections.namedtuple(
    'Record',
    'frame filename line_number function_name lines index')

def myprint(text):
    if text.strip():
        record = Record(*inspect.getouterframes(inspect.currentframe())[1])
        sys.stdout.write('{f} {n}: '.format(f=record.filename, n=record.line_number))
    sys.stdout.write(text + '\n')

def foo():
    print('Hi')

print = myprint
foo()
Note that inspect.currentframe uses sys._getframe which is not part of all implementations of Python. So the solution above may only work for CPython.
Strictly speaking, a code base that you depend on (as in libraries) shouldn't contain any print statements, so really you should just remove all of them.
Other than that, you can monkey-patch stdout: Adding a datetime stamp to Python print
A very gross hack to make this work:
Use your favorite text editor and its search/find feature to find all the print statements, then insert a number or identifier into each of them manually (or automatically, if you do it with a script).
A script to do this would be simple: have it look for print with a regex and replace it with print ID, so every statement stays the same except that each now carries its own number.
Cheers.
Edit
Barring any strange formatting, the following code should do it for you. Note that this is just an example of one way you could do it, not really a full answer.
import re

class inc():
    def __init__(self):
        self.x = 0

    def get(self):
        self.x += 1
        return self.x

def replacer(filename_in, filename_out):
    i = inc()
    with open(filename_in) as f, open(filename_out, 'w') as out:
        for line in f:
            # Tag each print statement with a unique number.
            out.write(re.sub(r'print', 'print %d,' % i.get(), line))
I used a basic incrementer class in case you wanted some kind of more complex ID, instead of just having a counter.
In harsh circumstances (output done by some weird binary libraries) you could also use strace -e write (and more options). If you do not read strace's output, the traced program blocks until you do, so you can send it a signal and see where it dies.
Here is a trick that Jeeeyul came up with for Java: replace the output stream (i.e. sys.stdout) with something that notices when a line feed has been written.
If that flag is set, take a stack trace when the next byte is written (in Java, by throwing and catching an exception; in Python, traceback.extract_stack does it directly), then walk up the trace until you find code that doesn't belong to your "debug stream writer".
Pseudocode:
import traceback

class DebugPrintln:
    def __init__(self, target):
        self.target = target
        self.wasLF = False

    def write(self, x):
        if self.wasLF:
            self.wasLF = False
            frames = traceback.extract_stack()
            # ... find the calling code in `frames` and output it ...
        if x.endswith('\n'):
            self.wasLF = True
        self.target.write(x)
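Installing it is then just a matter of swapping out the stream (assuming the sketch above):

import sys
sys.stdout = DebugPrintln(sys.stdout)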