I have a function to port from another language. Could you please help me make it "pythonic"?
Here is the function, ported in a "non-pythonic" way (this is a bit of an artificial example: every task is associated with a project or with None, and we need a list of distinct projects, distinct meaning no duplication of the .identifier property, starting from a list of tasks):
@staticmethod
def get_projects_of_tasks(task_list):
    projects = []
    project_identifiers_seen = {}
    for task in task_list:
        project = task.project
        if project is None:
            continue
        project_identifier = project.identifier
        if project_identifiers_seen.has_key(project_identifier):
            continue
        project_identifiers_seen[project_identifier] = True
        projects.append(project)
    return projects
I have deliberately not even started to make it "pythonic", so as not to start off on the wrong foot (e.g. a list comprehension with "if project.identifier is not None", filter() based on a predicate that looks up the dictionary-based registry of identifiers, using set() to strip duplicates, etc.).
EDIT:
Based on the feedback, I have this:
@staticmethod
def get_projects_of_tasks(task_list):
    projects = []
    project_identifiers_seen = set()
    for task in task_list:
        project = task.project
        if project is None:
            continue
        project_identifier = project.identifier
        if project_identifier in project_identifiers_seen:
            continue
        project_identifiers_seen.add(project_identifier)
        projects.append(project)
    return projects
There's nothing massively unPythonic about this code. A couple of possible improvements:
project_identifiers_seen could be a set, rather than a dictionary.
foo.has_key(bar) is better spelled bar in foo
I'm suspicious that this is a staticmethod of a class. Usually there's no need for a class in Python unless you're actually doing data encapsulation. If this is just a normal function, make it a module-level one.
What about:
project_list = {task.project.identifier:task.project for task in task_list if task.project is not None}
return project_list.values()
For 2.6- use dict constructor instead:
return dict((x.project.identifier, x.project) for x in task_list if x.project).values()
def get_projects_of_tasks(task_list):
    seen = set()
    return [seen.add(task.project.identifier) or task.project  # add always returns None
            for task in task_list
            if task.project is not None and task.project.identifier not in seen]
This works because (a) add returns None (and or returns the value of the last expression evaluated) and (b) the mapping clause (the first clause) is only executed if the if clause is True.
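The two facts this relies on can be checked in isolation. The following is only a small illustration with made-up string keys, not part of the original answer:

```python
seen = set()

# set.add mutates the set in place and returns None.
assert seen.add("a") is None

# `None or x` evaluates to x, so the expression both records the key
# and produces the value to collect.
value = seen.add("b") or "payload"
assert value == "payload"
assert seen == {"a", "b"}

# The `if` clause of a comprehension runs before the output expression,
# so a key that is already recorded never reaches the add() call.
keys = ["x", "y", "x", "z", "y"]
seen_keys = set()
unique = [seen_keys.add(k) or k for k in keys if k not in seen_keys]
assert unique == ["x", "y", "z"]
```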
There is no reason that it has to be in a list comprehension - you could just as well set it out as a loop, and indeed you may prefer to. This way has the advantage that it is clear that you are just building a list, and what is supposed to be in it.
I've not used staticmethod because there is rarely a need for it. Either have this as a module-level function, or a classmethod.
An alternative is a generator (thanks to @delnan for pointing this out):
def get_projects_of_tasks(task_list):
    seen = set()
    for task in task_list:
        if task.project is not None and task.project.identifier not in seen:
            identifier = task.project.identifier
            seen.add(identifier)
            yield task.project
This eliminates the need for a side-effect in comprehension (which is controversial), but keeps clear what is being collected.
For the sake of avoiding another if/continue construction, I have left in two accesses to task.project.identifier. This could be conveniently eliminated with the use of a promise library.
This version uses promises to avoid repeated access to task.project.identifier without the need to include an if/continue:
from peak.util.proxies import LazyProxy, get_cache  # pip install ProxyTypes
def get_projects_of_tasks(task_list):
    seen = set()
    for task in task_list:
        identifier = LazyProxy(lambda: task.project.identifier)  # a transparent promise
        if task.project is not None and identifier not in seen:
            seen.add(identifier)
            yield task.project
This is safe from AttributeErrors because task.project.identifier is never accessed before task.project is checked.
Some say EAFP is pythonic, so:
@staticmethod
def get_projects_of_tasks(task_list):
    projects = {}
    for task in task_list:
        try:
            if not task.project.identifier in projects:
                projects[task.project.identifier] = task.project
        except AttributeError:
            pass
    return projects.values()
Of course, an explicit check wouldn't be wrong either, and would be better if many tasks have no project.
Also, a single dict is enough to keep track of both the seen identifiers and the projects. If the order of the projects matters, an OrderedDict (Python 2.7+) could come in handy.
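As a rough sketch of the single-dict idea with ordering preserved, the version below uses an explicit None check and an OrderedDict. Project and Task are hypothetical namedtuple stand-ins for the question's objects; only the .project and .identifier attributes come from the question.

```python
from collections import OrderedDict, namedtuple

# Hypothetical stand-ins for the question's task and project objects.
Project = namedtuple("Project", "identifier name")
Task = namedtuple("Task", "project")

def get_projects_of_tasks(task_list):
    projects = OrderedDict()
    for task in task_list:
        project = task.project
        if project is not None:
            # setdefault keeps the first project seen for each identifier.
            projects.setdefault(project.identifier, project)
    return list(projects.values())

a, b = Project(1, "alpha"), Project(2, "beta")
tasks = [Task(b), Task(None), Task(a), Task(Project(2, "beta again"))]
result = get_projects_of_tasks(tasks)
assert result == [b, a]  # first-seen order, duplicates of identifier 2 dropped
```

On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict is mainly needed on older versions.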
There are already a lot of good answers, and, indeed, you have accepted one! But I thought I would add one more option. A number of people have seen that your code could be made more compact with generator expressions or list comprehensions. I'm going to suggest a hybrid style that uses generator expressions to do the initial filtering, while maintaining your for loop in the final filter.
The advantage of this style over the style of your original code is that it simplifies the flow of control by eliminating continue statements. The advantage of this style over a single list comprehension is that it avoids multiple accesses to task.project.identifier in a natural way. It also handles mutable state (the seen set) transparently, which I think is important.
def get_projects_of_tasks(task_list):
    projects = (task.project for task in task_list)
    ids_projects = ((p.identifier, p) for p in projects if p is not None)
    seen = set()
    unique_projects = []
    for id, p in ids_projects:
        if id not in seen:
            seen.add(id)
            unique_projects.append(p)
    return unique_projects
Because these are generator expressions (enclosed in parentheses instead of brackets), they don't build temporary lists. The first generator expression creates an iterable of projects; you could think of it as performing the project = task.project line from your original code on all the tasks at once. The second generator expression creates an iterable of (project_id, project) tuples. The if clause at the end filters out the None values; (p.identifier, p) is only evaluated if p passes through the filter. Together, these two generator expressions do away with your first two if blocks. The remaining code is essentially the same as yours.
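To see that chained generator expressions really do no work until something pulls from them, here is a small illustration. The tracked helper and the sample values are made up for the demonstration:

```python
pulled = []

def tracked(items):
    """Yield items one by one, recording each one as it is pulled."""
    for item in items:
        pulled.append(item)
        yield item

source = tracked([1, None, 2, 3])
values = (x for x in source if x is not None)  # nothing runs yet
doubled = (2 * x for x in values)              # still nothing

assert pulled == []

first = next(doubled)          # pulls exactly one item through both stages
assert first == 2
assert pulled == [1]

remaining = list(doubled)      # the None is pulled but filtered out
assert remaining == [4, 6]
assert pulled == [1, None, 2, 3]
```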
Note also the excellent suggestion from Marcin/delnan that you create a generator using yield. This cuts down further on the verbosity of your code, boiling it down to its essentials:
def get_projects_of_tasks(task_list):
    projects = (task.project for task in task_list)
    ids_projects = ((p.identifier, p) for p in projects if p is not None)
    seen = set()
    for id, p in ids_projects:
        if id not in seen:
            seen.add(id)
            yield p
The only disadvantage -- in case this isn't obvious -- is that if you want to permanently store the projects, you have to pass the resulting iterable to list.
projects_of_tasks = list(get_projects_of_tasks(task_list))
I have some use cases in which I need to run generator functions without caring about the yielded items.
I cannot make them non-generator functions because in other use cases I certainly need the yielded values.
I am currently using a trivial self-made function to exhaust the generators.
def exhaust(generator):
    for _ in generator:
        pass
I wondered whether there is a simpler way to do that, which I'm missing?
Edit
Here is a use case:
def create_tables(fail_silently=True):
    """Create the respective tables."""
    for model in MODELS:
        try:
            model.create_table(fail_silently=fail_silently)
        except Exception:
            yield (False, model)
        else:
            yield (True, model)
In some context, I care about the error and success values…
for success, table in create_tables():
    if success:
        print('Creation of table {} succeeded.'.format(table))
    else:
        print('Creation of table {} failed.'.format(table), file=stderr)
… and in some I just want to run the function "blindly":
exhaust(create_tables())
Setting up a for loop for this could be relatively expensive, keeping in mind that a for loop in Python is fundamentally successive execution of simple assignment statements; you'll be executing n (number of items in generator) assignments, only to discard the assignment targets afterwards.
You can instead feed the generator to a zero length deque; consumes at C-speed and does not use up memory as with list and other callables that materialise iterators/generators:
from collections import deque
def exhaust(generator):
    deque(generator, maxlen=0)
Taken from the consume itertools recipe.
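For reference, the consume recipe from the itertools documentation looks roughly like this. The create_all generator below is a made-up stand-in for create_tables(), used only to show that the side effects still happen:

```python
from collections import deque
from itertools import islice

def consume(iterator, n=None):
    """Advance the iterator n steps ahead. If n is None, consume it entirely."""
    if n is None:
        # A zero-length deque discards every item as it arrives.
        deque(iterator, maxlen=0)
    else:
        # Advance to the empty slice starting at position n.
        next(islice(iterator, n, n), None)

calls = []

def create_all():
    # Stand-in for a generator that is run for its side effects.
    for name in ["a", "b", "c"]:
        calls.append(name)
        yield (True, name)

consume(create_all())
assert calls == ["a", "b", "c"]

it = iter(range(10))
consume(it, 3)          # skip the first three items only
assert next(it) == 3
```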
One very simple and possibly efficient solution could be
def exhaust(generator): all(generator)
if we can assume that generator will always return True (as in your case where a tuple of 2 elements (success,table) is true even if success and table both are False), or: any(generator) if it will always return False, and in the "worst case", all(x or True for x in generator).
Being that short & simple, you might not even need a function for it!
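The assumption about truthiness matters because all() and any() short-circuit. A small illustration with a made-up generator shows what happens when it is violated:

```python
def results():
    yield (True, "a")
    yield ()            # an empty tuple is falsy
    yield (False, "b")

gen = results()
all(gen)                # stops at the first falsy item
leftover = list(gen)
assert leftover == [(False, "b")]   # the generator was NOT exhausted

gen = results()
all(x or True for x in gen)         # every item is coerced to a truthy value
assert list(gen) == []              # now it is fully consumed
```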
Regarding the "why?" comment (I dislike these...): There are many cases where one may want to exhaust a generator. To cite just one, it's a way of doing a for loop as an expression, e.g., any(print(i,x) for i,x in enumerate(S)) - of course there are less trivial examples.
Based on your use case it's hard to imagine that there would be sufficiently many tables to create that you would need to consider performance.
Additionally, table creation is going to be much more expensive than iteration.
So the for loop that you already have would seem the simplest and most Pythonic solution - in this case.
You could just have two functions that each do one thing and call the appropriate one at the appropriate time?
def create_table(model, fail_silently=True):
    """Create the table."""
    try:
        model.create_table(fail_silently=fail_silently)
    except Exception:
        return (False, model)
    else:
        return (True, model)
def create_tables(MODELS):
    for model in MODELS:
        create_table(model)
def iter_create_tables(MODELS):
    for model in MODELS:
        yield create_table(model)
When you care about the returned values do:
for success, table in iter_create_tables(MODELS):
    if success:
        print('Creation of table {} succeeded.'.format(table))
    else:
        print('Creation of table {} failed.'.format(table), file=stderr)
When you don't, just do:
create_tables(MODELS)
My aim is to improve the speed of my Python code that has been successfully accepted in a leetcode problem, Course Schedule.
I am aware of the algorithm but even though I am using O(1) data-structures, my runtime is still poor: around 200ms.
My code uses dictionaries and sets:
from collections import defaultdict
class Solution:
    def canFinish(self, numCourses: int, prerequisites: List[List[int]]) -> bool:
        course_list = []
        pre_req_mapping = defaultdict(list)
        visited = set()
        stack = set()
        def dfs(course):
            if course in stack:
                return False
            stack.add(course)
            visited.add(course)
            for neighbor in pre_req_mapping.get(course, []):
                if neighbor in visited:
                    no_cycle = dfs(neighbor)
                    if not no_cycle:
                        return False
            stack.remove(course)
            return True
        # for course in range(numCourses):
        #     course_list.append(course)
        for pair in prerequisites:
            pre_req_mapping[pair[1]].append(pair[0])
        for course in range(numCourses):
            if course in visited:
                continue
            no_cycle = dfs(course)
            if not no_cycle:
                return False
        return True
What else can I do to improve the speed?
You are calling dfs() for a given course multiple times.
But its return value won't change.
So we have an opportunity to memoize it.
Change your algorithmic approach (here, to dynamic programming)
for the big win.
It's a space vs time tradeoff.
EDIT:
Hmmm, you are already memoizing most of the computation
with visited, so lru_cache would mostly improve clarity
rather than runtime.
It's just a familiar idiom for caching a result.
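One common way to make the "never recompute a finished course" idea explicit is a three-state depth-first search. This is only a sketch under my own naming (can_finish, has_cycle and the state constants are not from the question), and it makes no claim about the resulting LeetCode runtime:

```python
from collections import defaultdict
from typing import List

def can_finish(num_courses: int, prerequisites: List[List[int]]) -> bool:
    """Return True if the prerequisite graph contains no cycle."""
    graph = defaultdict(list)
    for course, prereq in prerequisites:
        graph[prereq].append(course)

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = [UNSEEN] * num_courses

    def has_cycle(course: int) -> bool:
        if state[course] == IN_PROGRESS:  # back edge: we looped onto our own path
            return True
        if state[course] == DONE:         # already fully explored, reuse the result
            return False
        state[course] = IN_PROGRESS
        for neighbor in graph.get(course, ()):
            if has_cycle(neighbor):
                return True
        state[course] = DONE
        return False

    return not any(has_cycle(c) for c in range(num_courses))

assert can_finish(2, [[1, 0]]) is True
assert can_finish(2, [[1, 0], [0, 1]]) is False
```

Each course is explored at most once, so the work is proportional to the number of courses plus prerequisites. Very deep prerequisite chains could still hit Python's recursion limit.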
It would be helpful to add a # comment citing a reference
for the algorithm you implemented.
This is a very nice expression, with defaulting:
pre_req_mapping.get(course, [])
If you use timeit you may find that the generated bytecode
for an empty tuple () is a tiny bit more efficient than that
for an empty list [], as it involves fewer allocations.
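A quick way to measure this yourself is sketched below. The numbers depend on the interpreter and machine, so no particular result is implied:

```python
import timeit

d = {}  # empty dict, so .get always falls back to the default

list_default = timeit.timeit("d.get(0, [])", globals={"d": d}, number=100_000)
tuple_default = timeit.timeit("d.get(0, ())", globals={"d": d}, number=100_000)

print(f"[] default: {list_default:.4f}s, () default: {tuple_default:.4f}s")
```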
Ok, some style nits follow, unrelated to runtime.
As an aside, youAreMixingCamelCase and_snake_case.
PEP-8 asks you to please stick with just snake_case.
This is a fine choice of identifier name:
for pair in prerequisites:
But instead of the cryptic [0], [1] dereferences,
it would be easier to read a tuple unpack:
for course, prereq in prerequisites:
if not no_cycle: is clumsy.
Consider inverting the meaning of dfs' return value,
or rephrasing the assignment as:
cycle = not dfs(course)
I think that you are doing it in a good way, but since Python is an interpreted language, it's normal for it to run slowly compared with compiled languages like C/C++ and Java, especially for large inputs.
Try to write the same code in C/C++ for example and compare the speed between them.
Oftentimes I find that, when working with Python sets, the Pythonic way seems to be absent.
For example, doing something like Dijkstra or A*:
openSet, closedSet = set(nodes), set(nodes)
while openSet:
    walkSet, openSet = openSet, set()
    for node in walkSet:
        for dest in node.destinations():
            if dest.weight() < constraint:
                if dest not in closedSet:
                    closedSet.add(dest)
                    openSet.add(dest)
This is a weakly contrived example, the focus is the last three lines:
if not value in someSet:
    someSet.add(value)
    doAdditionalThings()
Given the Python way of working with, for example, accessing/using values of a dict, I would have expected to be able to do:
try:
    someSet.add(value)
except KeyError:
    continue  # well, that's ok then.
doAdditionalThings()
As a C++ programmer, my skin crawls a bit that I can't even do:
if someSet.add(value):
    # add wasn't blocked by the value already being present
    doAdditionalThings()
Is there a more Pythonic (and if possible more efficient) way to work with this sort of set-as-guard usage?
The add operation is not supposed to also tell you if the item was already in the set; it just makes sure it is in there after you add it. Or put another way, what you want is not "add an item and check if it worked"; you want to first check if the item is there, and if not, then do some special stuff. If all you wanted to do was add the item, you wouldn't do the check at all. There is nothing unpythonic about this pattern:
if item not in someSet:
    someSet.add(item)
    doStuff()
else:
    doOtherStuff()
It is true that the API could have been designed so that .add returned whether the item was already in there, but in my experience that's not a particularly common use case. Part of the point of sets is that you can freely add items without worrying about whether they were already in there (since adding an already-included item has no effect). Also, having .add return None is consistent with the general convention for Python builtin types that methods that mutate their arguments return None. It is really things like dict.setdefault (which gets an item but first adds it if it isn't there) that are the unusual case.
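If the check-then-add pattern comes up often, it can be wrapped once in a small helper that gives the C++-style boolean result. add_if_absent is a hypothetical name, not something the set type provides:

```python
def add_if_absent(items, value):
    """Add value to the set; return True only if it was not already present."""
    if value in items:
        return False
    items.add(value)
    return True

closed_set = {"a"}
processed = []
for node in ["a", "b", "b", "c"]:
    if add_if_absent(closed_set, node):
        # Reached only the first time a node is seen.
        processed.append(node)

assert processed == ["b", "c"]
assert closed_set == {"a", "b", "c"}
```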
Update: What I really wanted all along were greenlets.
Note: This question mutated a bit as people answered and forced me to "raise the stakes", as my trivial examples had trivial simplifications; rather than continue to mutate it here, I will repose the question when I have it clearer in my head, as per Alex's suggestion.
Python generators are a thing of beauty, but how can I easily break one down into modules (structured programming)? I effectively want PEP 380, or at least something comparable in syntax burden, but in existing Python (e.g. 2.6)
As an (admittedly stupid) example, take the following:
def sillyGenerator():
    for i in xrange(10):
        yield i*i
    for i in xrange(12):
        yield i*i
    for i in xrange(8):
        yield i*i
Being an ardent believer in DRY, I spot the repeated pattern here and factor it out into a method:
def quadraticRange(n):
    for i in xrange(n):
        yield i*i
def sillyGenerator():
    quadraticRange(10)
    quadraticRange(12)
    quadraticRange(8)
...which of course doesn't work. The parent must call the new function in a loop, yielding the results:
def sillyGenerator():
    for i in quadraticRange(10):
        yield i
    for i in quadraticRange(12):
        yield i
    for i in quadraticRange(8):
        yield i
...which is even longer than before!
If I want to push part of a generator into a function, I always need this rather verbose, two-line wrapper to call it. It gets worse if I want to support send():
def sillyGeneratorRevisited():
    g = subgenerator()
    v = None
    try:
        while True:
            v = yield g.send(v)
    except StopIteration:
        pass
    if v < 4:
        # ...
    else:
        # ...
And that's not taking into account passing in exceptions. The same boilerplate every time! Yet one cannot apply DRY and factor this identical code into a function, because...you'd need the boilerplate to call it! What I want is something like:
def sillyGenerator():
    yield from quadraticRange(10)
    yield from quadraticRange(12)
    yield from quadraticRange(8)
def sillyGeneratorRevisited():
    v = yield from subgenerator()
    if v < 4:
        # ...
    else:
        # ...
Does anyone have a solution to this problem? I have a first-pass attempt, but I'd like to know what others have come up with. Ultimately, any solution will have to tackle examples where the main generator performs complex logic based on the result of data sent into the generator, and potentially makes a very large number of calls to sub-generators: my use-case is generators used to implement long-running, complex state machines.
However, I'd like to make my
reusability criteria one notch harder:
what if I need a control structure
around my repeated generation?
itertools often helps even there - you need to provide concrete examples where you think it doesn't.
For
instance, I might want to call a
subgenerator forever with different
parameters.
itertools.chain.from_iterable.
Or my subgenerators might
be very costly, and I only want to
start them up as and when they are
reached.
Both chain and chain.from_iterable do that -- no sub-iterator is "started up" until the very instant the first item from it is needed.
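A small Python 3 illustration of that laziness follows. The started list is instrumentation added only for the demonstration, and quadratic_range mirrors the question's quadraticRange:

```python
from itertools import chain

started = []

def quadratic_range(n):
    started.append(n)          # runs only when the first item is requested
    for i in range(n):
        yield i * i

# The generator expression defers even the creation of each sub-generator.
combined = chain.from_iterable(quadratic_range(n) for n in [2, 3])
assert started == []

assert next(combined) == 0
assert started == [2]          # the second sub-generator has not started yet

rest = list(combined)
assert rest == [1, 0, 1, 4]
assert started == [2, 3]
```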
Or (and this is a real
desire) I might want to vary what I do
next based on what my controller
passes me using send().
A specific example would be greatly appreciated. Anyway, worst case, you'll be coding for x in blargh: yield x where the suspended PEP 380 would let you code yield from blargh -- about 4 extra characters (not a tragedy;-).
And if some sophisticated coroutine-version of some itertools functionality (itertools mostly supports iterators - there's no equivalent coroutools module yet) becomes warranted, because a certain pattern of coroutine composition is often repeated in your code, then it's not too hard to code it yourself.
For example, suppose we often find ourselves doing something like: first yield a certain value; then, repeatedly, if we're sent 'foo', yield the next item from fooiter, if 'bla', from blaiter, if 'zop', from zopiter, anything else, from defiter. As soon as we spot the second occurrence of this compositional pattern, we can code:
def corou_chaiters(initsend, defiter, val2itermap):
    currentiter = iter([initsend])
    while True:
        val = yield next(currentiter)
        currentiter = val2itermap.get(val, defiter)
and call this simple compositional function as and when needed. If we need to compose other coroutines, rather than general iterators, we'll have a slightly different composer using the send method instead of the next built-in function; and so forth.
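A usage sketch in Python 3 follows, assuming val2itermap is a plain dict looked up with .get and that all the iterators are infinite (itertools.count here). With finite iterators, a StopIteration raised inside the generator would surface as a RuntimeError on Python 3.7+.

```python
from itertools import count

def corou_chaiters(initsend, defiter, val2itermap):
    currentiter = iter([initsend])
    while True:
        val = yield next(currentiter)
        # Assumption: val2itermap is a dict; unknown keys fall back to defiter.
        currentiter = val2itermap.get(val, defiter)

fooiter, blaiter, defiter = count(0), count(10), count(100)
g = corou_chaiters("init", defiter, {"foo": fooiter, "bla": blaiter})

outputs = [next(g)]                 # priming call yields the initial value
for key in ["foo", "bla", "foo", "zzz", "zzz"]:
    outputs.append(g.send(key))     # each send selects the iterator to draw from

assert outputs == ["init", 0, 10, 1, 100, 101]
```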
If you can offer an example that's not easily tamed by such techniques, I suggest you do so in a separate question (specifically targeted to coroutine-like generators), as there's already a lot of material on this one that will have little to do with your other, much more complex/sophisticated, example.
You want to chain several iterators together:
from itertools import chain
def sillyGenerator(a,b,c):
    return chain(quadraticRange(a),quadraticRange(b),quadraticRange(c))
Impractical (unfortunately) answer:
from __future__ import PEP0380
def sillyGenerator():
    yield from quadraticRange(10)
    yield from quadraticRange(12)
    yield from quadraticRange(8)
Potentially practical reference: Syntax for delegating to a subgenerator
Unfortunately making this impractical: Python language moratorium
UPDATE Feb 2011:
The moratorium has been lifted, and PEP 380 is on the TODO list for Python 3.3. Hopefully this answer will be practical soon.
Read Guido's remarks on comp.python.devel
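PEP 380 did land in Python 3.3, so on any current Python 3 the desired syntax works as written. The sketch below uses range instead of xrange; averager and controller are made-up names that illustrate the send() forwarding and the return value of the sub-generator:

```python
def quadratic_range(n):
    for i in range(n):
        yield i * i

def silly_generator():
    yield from quadratic_range(3)
    yield from quadratic_range(2)

assert list(silly_generator()) == [0, 1, 4, 0, 1]

def averager():
    """Sub-generator: receives numbers via send() until None, then returns the mean."""
    total, n = 0, 0
    while True:
        value = yield
        if value is None:
            return total / n
        total += value
        n += 1

def controller(results):
    # send() calls on the outer generator are forwarded to averager();
    # its return value becomes the value of the `yield from` expression.
    mean = yield from averager()
    results.append(mean)

results = []
gen = controller(results)
next(gen)                 # prime the generator
gen.send(2)
gen.send(4)
try:
    gen.send(None)        # finishes averager, then controller
except StopIteration:
    pass
assert results == [3.0]
```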
import itertools
def quadraticRange(n):
    for i in xrange(n):
        yield i*i
def sillyGenerator():
    return itertools.chain(
        quadraticRange(10),
        quadraticRange(12),
        quadraticRange(8),
    )
def sillyGenerator2():
    return itertools.chain.from_iterable(
        quadraticRange(n) for n in [10, 12, 8])
The last is useful if you want to make sure one iterator is exhausted before another starts (including its initialization code).
There is a Python Enhancement Proposal for providing a yield from statement for "delegating generation". Your example would be written as:
def sillyGenerator():
    sq = lambda i: i * i
    yield from map(sq, xrange(10))
    yield from map(sq, xrange(12))
    yield from map(sq, xrange(8))
Or better, in the spirit of DRY:
def sillyGenerator():
    for i in [10, 12, 8]:
        yield from quadraticRange(i)
The proposal is in draft status and its eventual inclusion is not certain, but it shows that other developers share your thoughts about generators.
For an arbitrary number of calls to quadraticRange:
from itertools import chain
def sillyGenerator(*args):
    return chain(*map(quadraticRange, args))
This code uses map and itertools.chain. It takes an arbitrary number of arguments and passes them in order to quadraticRange. The resulting iterators are then chained.
There is a pattern which I call "generator kernel" where generators don't yield directly to the user but to some "kernel" loop that treats (some of) their yields as "system calls" with special meaning.
You can apply it here by an intermediate function that accepts yielded generators and unrolls them automatically. To make it easy to use, we'll create that intermediate function in a decorator:
import functools, types
def flatten(values_or_generators):
    for x in values_or_generators:
        if isinstance(x, types.GeneratorType):
            for y in x:
                yield y
        else:
            yield x
# Better name anyone?
def subgenerator(g):
    """Decorator making ``yield <gen>`` mean ``yield from <gen>``."""
    @functools.wraps(g)
    def flat_g(*args, **kw):
        return flatten(g(*args, **kw))
    return flat_g
and then you can just write:
def quadraticRange(n):
    for i in xrange(n):
        yield i*i
@subgenerator
def sillyGenerator():
    yield quadraticRange(10)
    yield quadraticRange(12)
    yield quadraticRange(8)
Note that subgenerator() is unrolling exactly one level of the hierarchy. You could easily make it multi-level (by managing a manual stack, or just by replacing the inner loop with for y in flatten(x):), but I think it's better as it is, so that every generator that wants to use this non-standard syntax has to be explicitly wrapped with @subgenerator.
Note also that detection of generators is imperfect! It will detect things written as generators, but that's an implementation detail. As a caller of a generator, all you care about is that it returns an iterator. It could be a function returning some itertools object, and then this decorator would fail.
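The limitation is easy to demonstrate in Python 3: an itertools object is an iterator but not a generator, so a types.GeneratorType check does not recognize it.

```python
import itertools
import types

def quadratic_range(n):
    for i in range(n):
        yield i * i

from_generator = quadratic_range(3)
from_itertools = itertools.chain(range(3), range(2))

assert isinstance(from_generator, types.GeneratorType)
assert not isinstance(from_itertools, types.GeneratorType)

# Both are iterators all the same.
assert iter(from_generator) is from_generator
assert iter(from_itertools) is from_itertools
```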
Checking whether the object has a .next() method is too broad - you won't be able to yield strings without them being taken apart. So the most reliable way would be to check for some explicit marker, so you'd write e.g.:
@subgenerator
def sillyGenerator():
    yield 'from', quadraticRange(10)
    yield 'from', quadraticRange(12)
    yield 'from', quadraticRange(8)
Hey, that's almost like the PEP!
[credits: this answer gives a similar function - but it's deep (which I consider wrong) and is not framed as a decorator]
class Communicator:
    def __init__(self, inflow):
        self.outflow = None
        self.inflow = inflow
Then you do:
c = Communicator(something)
yield c
response = c.outflow
And instead of the boilerplate code you can simply do:
for i in run():
    something = i.inflow
    # ...
    i.outflow = value_to_return_back
It's simple enough code that works without much brain.
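Put together as a runnable sketch, the pattern might look like this. The body of run() and the doubling done by the driver are invented for illustration; only the Communicator class and the inflow/outflow protocol come from the answer:

```python
class Communicator:
    def __init__(self, inflow):
        self.outflow = None
        self.inflow = inflow

def run(collected):
    """Hypothetical generator that asks its driver to process each number."""
    for n in [1, 2, 3]:
        c = Communicator(n)
        yield c                      # hand the request to the driver
        collected.append(c.outflow)  # read the driver's reply after resuming

collected = []
for message in run(collected):
    # The driver answers by writing to outflow instead of calling send().
    message.outflow = message.inflow * 2

assert collected == [2, 4, 6]
```

Because the reply travels through a shared object, the driver can be an ordinary for loop and no send() boilerplate is needed.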