I wrote a program that extracts the text from an HTML file. It recurses down the HTML document and returns a list of the text tokens. For example,
input: <li>no way <b>you</b> are doing this</li>
output: ['no', 'way', 'you', 'are', ...]
Here is a highly simplified pseudocode for this:
def get_leaves(node):
    kids=getchildren(node)
    for i in kids:
        if leafnode(i):
            get_leaves(i)
        else:
            a=process_leaf(i)
            list_of_leaves.append(a)
def calling_fn():
    list_of_leaves=[] #which is now in global scope
    get_leaves(rootnode)
    print list_of_leaves
I am now using list_of_leaves in a global scope from the calling function. The calling_fn() declares this variable, get_leaves() appends to this.
My question is: how do I modify my function so that I can do something like list_of_leaves=get_leaves(rootnode), i.e. without using a global variable?
I don't want each call of the function to duplicate the list, as the list can get quite big.
Please don't criticize the design of this particular pseudocode, as I simplified it. It is meant for another purpose: extracting tokens along with their associated tags using BeautifulSoup.
You can pass the result list as optional argument.
def get_leaves(node, list_of_leaves=None):
    # create the accumulator only on the outermost call
    if list_of_leaves is None:
        list_of_leaves = []
    kids=getchildren(node)
    for i in kids:
        if leafnode(i):
            get_leaves(i, list_of_leaves)
        else:
            a=process_leaf(i)
            list_of_leaves.append(a)
    # returning the list also allows result = get_leaves(rootnode)
    return list_of_leaves
def calling_fn():
    result = []
    get_leaves(rootnode, list_of_leaves=result)
    print result
Python passes arguments by object reference: the function receives a reference to the caller's object, not a copy. This has been discussed before here. Some of the built-in types are immutable (e.g. int, str), so you cannot modify them in place (concatenating two strings creates a new string). Instances of mutable types (e.g. list) can be modified in place. We take advantage of this by passing the original list down the recursive calls to accumulate the result.
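As a minimal sketch of the point above (Python 3; the nested-tuple tree and the names collect and acc are made up for illustration and are not from the question), the following shows that every recursive call appends to the same list object rather than to a copy:

```python
def collect(node, acc=None):
    """Append every leaf of a nested-tuple tree to one shared list."""
    if acc is None:               # create the list only on the outermost call
        acc = []
    for child in node:
        if isinstance(child, tuple):
            collect(child, acc)   # the same list object is handed down
        else:
            acc.append(child)
    return acc


# Hypothetical stand-in for a parsed HTML tree
tree = ("no", "way", ("you",), "are", "doing", "this")
result = []
returned = collect(tree, result)
```

Here returned and result are the same object, so nothing is duplicated no matter how deep the recursion goes.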
For extracting text from HTML in a real application, using a mature library like BeautifulSoup or lxml.html is always a much better option (as others have suggested).
There is no need to pass an accumulator to the function or access it through a global name if you turn get_leaves() into a generator:
def get_leaves(node):
    for child in getchildren(node):
        if leafnode(child):
            for each in get_leaves(child):
                yield each
        else:
            yield process_leaf(child)
def calling_fn():
    list_of_leaves = list(get_leaves(rootnode))
    print list_of_leaves
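On Python 3.3 or later, the inner for loop can be replaced with yield from. A small self-contained sketch, assuming a nested-list tree in place of the real HTML nodes (iter_leaves and tree are illustrative names only):

```python
def iter_leaves(node):
    """Lazily yield the leaves of a nested-list tree, depth first."""
    for child in node:
        if isinstance(child, list):
            yield from iter_leaves(child)   # delegate to the sub-generator
        else:
            yield child


# Hypothetical stand-in for a parsed HTML tree
tree = ["no", "way", ["you"], "are", "doing", "this"]
leaves = list(iter_leaves(tree))
```

Because the generator is lazy, the caller decides whether to build the full list or to consume the leaves one at a time.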
Use a decent HTML parser like BeautifulSoup instead of trying to be smarter than existing software.
@pillmincher's generator answer is the best, but as another alternative, you can turn your function into a class:
class TagFinder:
    def __init__(self):
        # must match the attribute name used in get_leaves()
        self.list_of_leaves = []
    def get_leaves(self, node):
        kids = getchildren(node)
        for i in kids:
            if leafnode(i):
                self.get_leaves(i)
            else:
                a = process_leaf(i)
                self.list_of_leaves.append(a)
def calling_fn():
    finder = TagFinder()
    finder.get_leaves(rootnode)
    print finder.list_of_leaves
Your code likely involves a number of helper functions anyway, like leafnode, so a class also helps group them all together into one unit.
As a general question about recursion, this is a good one. It is common to have a recursive function that accumulates data into some collection. Either the collection needs to be a global variable (bad) or it is passed to the recursive function. In almost every language, passing a collection passes only a reference, so you do not have to worry about space. Someone just posted an answer showing how to do this.
Related
So I came across a recursive solution to a problem that keeps track of a global variable differently than I've seen before. I am aware of two ways:
One being by using the global keyword:
count = 0
def global_rec(counter):
    global count
    count += 1
    # do stuff
print(count)
And another using default variables:
def variable_recursive(counter, count=0):
    count += 1
    if counter <= 0:
        return count
    return variable_recursive(counter-1, count)
The new way:
# driver function
def driver(counter):
    # recursive function being called here
    rec_utility.result = 0  # <--- initializing
    rec_utility(counter)  # <--- calling the recursive function
    print(rec_utility.result)
def rec_utility(counter):
    if counter <= 0:
        return
    rec_utility.result += 1  # <--- what is happening here?
    rec_utility(counter-1)
I find this way a lot simpler, as in the default-variable method we have to return the variables we want to keep track of, and the code gets really messy really fast. Can someone please explain why attaching a variable to a function, like an object property, works? I understand that Python functions are nothing but objects, but is this a hacky way of keeping track of the variables or is it common practice? If so, why do we have so many ways to achieve the same task? Thanks!
This isn't as magical as you might think. It might be poor practice.
rec_utility is just a variable in your namespace which happens to be a function. dir() will show it listed when it is in scope. As an object it can have new fields set. dir(rec_utility) will show these new fields, along with __code__ and others.
Like any object, you can set a new field value, as you are doing in your code. There is only one rec_utility function, even though you call it recursively, so it's the same field when you initialize it and when you modify it.
Once you understand it, you can decide if it is a good idea. It might be less confusing or error prone to use a parameter.
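To make the mechanics concrete, here is a small Python 3 sketch of the same idea. It reuses the rec_utility name from the question; the stored variable is added only for illustration. The attribute lives in the function object's __dict__, and every recursive call reaches it through the same global name:

```python
def rec_utility(counter):
    """Count the recursive steps in an attribute on the function object."""
    if counter <= 0:
        return
    rec_utility.result += 1      # looks up the global name, then the attribute
    rec_utility(counter - 1)


rec_utility.result = 0           # the attribute is stored in rec_utility.__dict__
rec_utility(5)
stored = dict(rec_utility.__dict__)   # snapshot of the function's attributes
```

After the call, rec_utility.result is 5. The value persists between calls, so it has to be reset before every new top-level call, which is one reason a parameter may be less error prone.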
In some sense, this question has nothing to do with recursive functions. Suppose a function requires an item of information to operate correctly, then do you:
provide it via a global; or
pass it in as a parameter; or
set it as a function attribute prior to calling it.
In the final case, it’s worth considering that it is not entirely robust:
def f():
    # f is not this function!!
    return f.x + 1
f.x = 100
for f in range(10): pass
Generally, we would consider the second option the best one. There’s nothing special really about its recursive nature, other than the need to provide state, which is information, to the next invocation.
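The fragility of the function-attribute option can be shown in a few lines. This is only a sketch in Python 3; the names original, first, failure and g are introduced here for illustration. Once the global name f is rebound, the function body can no longer find its own attribute, while the parameter version is unaffected:

```python
def f():
    # 'f' is looked up in the global namespace at call time
    return f.x + 1


f.x = 100
original = f          # keep a second reference to the function object
first = f()           # works while the name still points at the function

f = 9                 # rebinding the name, as the for loop above does

try:
    original()
    failure = None
except AttributeError as exc:
    failure = exc     # the int 9 has no attribute 'x'


def g(x):
    # passing the state as a parameter does not depend on any global name
    return x + 1
```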
Let's say I have a code like this:
def read_from_file(filename):
    list = []
    for i in filename:
        value = i[0]
        list.append(value)
    return list
def other_function(other_filename):
    """
    That's where my question comes in. How can I get the list
    from the other function if I do not know the value "filename" will get?
    I would like to use the "list" in this function
    """
read_from_file("apples.txt")
other_function("pears.txt")
I'm aware that this code might not work or might not be perfect. But the only thing I need is the answer to my question in the code.
You have two general options. You can make your list a global variable that all functions can access (usually this is not the right way), or you can pass it to other_function (the right way). So
def other_function(other_filename, anylist):
    pass # your code here
somelist = read_from_file("apples.txt")
other_function("pears.txt", somelist)
You need to "catch" the value return from the first function, and then pass that to the second function.
file_name = read_from_file('apples.txt')
other_function(file_name)
You need to store the returned value in a variable before you can pass it onto another function.
a = read_from_file("apples.txt")
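Putting these answers together, a self-contained Python 3 sketch could look like the following. It makes a few assumptions: io.StringIO objects stand in for the real files, read_first_chars is a hypothetical replacement for read_from_file, and the list is named first_chars instead of list to avoid shadowing the built-in.

```python
import io


def read_first_chars(lines):
    """Return the first character of every non-empty line."""
    return [line[0] for line in lines if line.strip()]


def other_function(other_lines, first_chars):
    """Use the list produced elsewhere alongside a second input."""
    return first_chars + read_first_chars(other_lines)


# io.StringIO stands in for open("apples.txt") and open("pears.txt")
apples = io.StringIO("alpha\nbeta\n")
pears = io.StringIO("gamma\n\ndelta\n")

first_chars = read_first_chars(apples)          # catch the return value
combined = other_function(pears, first_chars)   # hand it to the next function
```

The second function never needs to know which file produced the list; it only receives the result.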
There are at least three reasonable ways to achieve this and two which a beginner will probably never need:
Store the returned value of read_from_file and give it as a parameter to other_function (so adjust the signature to other_function(other_filename, whatever_list))
Make whatever_list a global variable.
Use an object and store whatever_list as a property of that object
(Use nested functions)
(Search for the value via garbage collector gc ;-) )
Nested functions
def foo():
    bla = "OK..."
    def bar():
        print(bla)
    bar()
foo()
Global variables
What are the rules for local and global variables in Python? (official docs)
Global and Local Variables
Very short example
Misc
You should not use list as a variable name as you're overriding a built-in function.
You should use a descriptive name for your variables. What is the content of the list?
Global variables can sometimes be avoided cleanly by creating objects. While I'm not always a fan of OOP, it sometimes is just what you need. Just have a look at one of the many tutorials (e.g. here), get familiar with it, and figure out if it fits your task. (And don't use it all the time just because you can. Python is not Java.)
I am running into a problem writing recursive member functions in Python. I can't initialize the default value of a function parameter to be the same value as a member variable. I am guessing that Python doesn't support that capability as it says self isn't defined at the time I'm trying to assign the parameter. While I can code around it, the lack of function overloading in Python knocks out one obvious solution I would try.
For example, trying to recursively print a linked list I get the following code for my display function;
def display(self, head=-1):
    if head == -1:
        head = self.head
    if not head:
        return
    print head,
    self.display(head.link)
While this code works, it is ugly.
The main function looks like this:
def main():
    l = List()
    l.insert(3)
    l.insert(40)
    l.insert(43)
    l.insert(45)
    l.insert(65)
    l.insert(76)
    l.display()
if __name__ == "__main__":
    main()
If I could set the display function parameter to default to self.head if it is called without parameters then it would look much nicer. I initially tried to create two versions of the function, one that takes two parameters and one that takes one but as I said, Python doesn't support overloading. I could pass in an argument list and check for the number of arguments but that would be pretty ugly as well (it would make it look like Perl!). The trouble is, if I put the line
head = self.head
inside the function body, it will be called during every recursive call, that's definitely not the behavior I need. None is also a valid value for the head variable so I can't pass that in as a default value. I am using -1 to basically know that I'm in the initial function call and not a recursive call. I realize I could write two functions, one driving the other but I'd rather have it all self contained in one recursive function. I'm pretty sure I'm missing some basic pythonic principle here, could someone help me out with the pythonic approach to the problem?
Thanks!
I don't really see what's wrong with your code. If you chose a falsy default value for head, you could do: head = head or self.head which is more concise.
Otherwise, this is pretty much what you have to do to handle default arguments. Alternatively, use kwargs:
def display(self, **kwargs):
    head = kwargs.get("head", self.head)
    if not head:
        return
    print head,
    self.display(head=head.link)  # you should always name an optional argument,
                                  # and you must name it if **kwargs is used.
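Since None is a legitimate value for head here, another common idiom is a private sentinel object as the default. The sketch below is in Python 3 and makes several assumptions: Node, LinkedList and values are hypothetical stand-ins for the asker's List class, insert is assumed to push at the front, and the method returns a list instead of printing so the result is easy to check.

```python
_MISSING = object()   # unique sentinel; cannot collide with None or any node


class Node:
    def __init__(self, value, link=None):
        self.value = value
        self.link = link


class LinkedList:
    def __init__(self):
        self.head = None

    def insert(self, value):
        """Push a value at the front of the list."""
        self.head = Node(value, self.head)

    def values(self, head=_MISSING):
        """Recursively collect values, starting at self.head by default."""
        if head is _MISSING:      # only true when the caller omits the argument
            head = self.head
        if head is None:          # None now simply means "end of list"
            return []
        return [head.value] + self.values(head.link)


lst = LinkedList()
for v in (3, 40, 43):
    lst.insert(v)
```

Calling lst.values() starts from self.head, while the recursive calls pass head.link explicitly, so the default is only resolved once.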
The idea is that when a new function is written, its variable name is appended to a list automatically.
Just to note, I realise I can just use mylist.append(whatever) but I'm specifically looking for a way to automatically append, rather than manually.
So, if we start with...
def function1(*args):
    print "string"
def function2(*args):
    print "string 2"
mylist = []
...is there a way to append 'function1' and 'function2' to mylist automatically so that it would end up like this...
mylist = [function1, function2]
Specifically, I'd like to have the variable name listed, not a string (e.g. "function1").
I'm learning Python and just experimenting, so this doesn't serve any particular purpose at the moment, I just want to know if it's possible.
Thanks in advance for any suggestions. I'm happy to answer any questions if I've not been clear.
Just add the function object to the list:
mylist = [function1, function2]
or use .append():
mylist.append(function1)
mylist.append(function2)
Python functions are first-class objects. They are values, just like classes and strings and integers.
If you want to automate this for a whole module, you can use the globals() function to quickly list all functions defined in the module so far, with a little help from the inspect.isfunction() predicate:
import inspect
mylist = [v for v in globals().itervalues() if inspect.isfunction(v) and v.__module__ == __name__]
The v.__module__ == __name__ test ensures we only list functions from the current module, not anything we imported.
However, explicit is still better than implicit. Either add mylist.append(functionname) below each function, or use a decorator:
mylist = []
def listed(func):
    mylist.append(func)
    return func
@listed
def function1():
    pass
@listed
def function2():
    pass
Each function you 'mark' with the @listed decorator is added to the mylist list.
In principle, you could do that with a decorator, which would probably qualify as a semi-automatic solution:
@gather
def function1():
    print "function 1"
@gather
def function2():
    print "function 2"
One implementation of such a decorator is essentially a function which gets a function as a parameter:
function_list = []
def gather(func):
    function_list.append(func) # or .append(func.__name__)
    return func
In this simple incarnation it is probably not useful at all, but popular libraries and frameworks often employ a somewhat enhanced version of this technique. As an example, see Flask's @app.route decorator for specifying functions that handle specific HTTP requests.
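One way such an enhanced version might look is a decorator that takes an argument and stores each function under a key. This is a Python 3 sketch only: registry and register are hypothetical names, and it is loosely inspired by the route-decorator idea rather than a reproduction of how Flask implements it.

```python
registry = {}


def register(name):
    """Decorator factory: store the function in `registry` under `name`."""
    def decorator(func):
        registry[name] = func
        return func        # return the function unchanged
    return decorator


@register("hello")
def function1():
    return "function 1"


@register("bye")
def function2():
    return "function 2"
```

A caller can then look up and invoke a handler by key, for example registry["bye"]().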
Is there a pythonic preferred way to do this that I would do in C++:
for s in str:
    if r = regex.match(s):
        print r.groups()
I really like that syntax, imo it's a lot cleaner than having temporary variables everywhere. The only other way that's not overly complex is
for s in str:
    r = regex.match(s)
    if r:
        print r.groups()
I guess I'm complaining about a pretty pedantic issue. I just miss the former syntax.
How about
for r in [regex.match(s) for s in str]:
    if r:
        print r.groups()
or a bit more functional
for r in filter(None, map(regex.match, str)):
    print r.groups()
Perhaps it's a bit hacky, but using a function object's attributes to store the last result allows you to do something along these lines:
def fn(regex, s):
    fn.match = regex.match(s) # save result
    return fn.match
for s in strings:
    if fn(regex, s):
        print fn.match.groups()
Or more generically:
def cache(value):
    cache.value = value
    return value
for s in strings:
    if cache(regex.match(s)):
        print cache.value.groups()
Note that although the "value" saved can be a collection of a number of things, this approach is limited to holding only one such at a time, so more than one function may be required to handle situations where multiple values need to be saved simultaneously, such as in nested function calls, loops or other threads. So, in accordance with the DRY principle, rather than writing each one, a factory function can help:
def Cache():
    def cache(value):
        cache.value = value
        return value
    return cache
cache1 = Cache()
for s in strings:
    if cache1(regex.match(s)):
        # use another at same time
        cache2 = Cache()
        if cache2(somethingelse) != cache1.value:
            process(cache2.value)
        print cache1.value.groups()
        ...
There's a recipe to make an assignment expression but it's very hacky. Your first option doesn't compile so your second option is the way to go.
## {{{ http://code.activestate.com/recipes/202234/ (r2)
import sys
def set(**kw):
    assert len(kw)==1
    a = sys._getframe(1)
    a.f_locals.update(kw)
    return kw.values()[0]
#
# sample
#
A=range(10)
while set(x=A.pop()):
    print x
## end of http://code.activestate.com/recipes/202234/ }}}
As you can see, production code shouldn't touch this hack with a ten foot, double bagged stick.
This might be an overly simplistic answer, but would you consider this:
for s in str:
    if regex.match(s):
        print regex.match(s).groups()
There is no pythonic way to do something that is not pythonic. It's that way for a reason. First, allowing statements in the conditional part of an if statement would make the grammar pretty ugly. For instance, if you allowed assignment statements in if conditions, why not also allow if statements? How would you actually write that? C-like languages don't have this problem, because they don't have assignment statements. They make do with just assignment expressions and expression statements.
The second reason is that
if foo = bar:
    pass
looks very similar to
if foo == bar:
    pass
Even if you are clever enough to type the correct one, and even if most of the members of your team are sharp enough to notice it, are you sure that the one you are looking at now is exactly what is supposed to be there? It's not unreasonable for a new dev to see this and just fix it (one way or the other), and now it's definitely wrong.
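For readers on Python 3.8 or later, which this discussion predates, assignment expressions (PEP 572) offer a form close to what the question asks for. They use := rather than =, which keeps the assignment visually distinct from the == comparison. A minimal sketch, with a sample pattern and sample strings made up for illustration:

```python
import re

regex = re.compile(r"(\w+)@(\w+)")       # sample pattern, chosen for illustration
strings = ["alice@example", "no match here", "bob@test"]

found = []
for s in strings:
    if (r := regex.match(s)):            # bind and test in one expression
        found.append(r.groups())
```

The name r stays bound after the if, just as it would with a separate assignment on the previous line.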
Whenever I find that my loop logic is getting complex I do what I would with any other bit of logic: I extract it to a function. In Python it is a lot easier than some other languages to do this cleanly.
So extract the code that just generates the items of interest:
def matching(strings, regex):
    for s in strings:
        r = regex.match(s)
        if r:
            yield r
and then when you want to use it, the loop itself is as simple as they get:
for r in matching(strings, regex):
    print r.groups()
Yet another answer is to use the "Assign and test" recipe for allowing assigning and testing in a single statement published in O'Reilly Media's July 2002 1st edition of the Python Cookbook and also online at Activestate. It's object-oriented, the crux of which is this:
# from http://code.activestate.com/recipes/66061
class DataHolder:
    def __init__(self, value=None):
        self.value = value
    def set(self, value):
        self.value = value
        return value
    def get(self):
        return self.value
This can optionally be modified slightly by adding the custom __call__() method shown below to provide an alternative way to retrieve instances' values -- which, while less explicit, seems like a completely logical thing for a 'DataHolder' object to do when called, I think.
    def __call__(self):
        return self.value
Allowing your example to be re-written:
r = DataHolder()
for s in strings:
    if r.set(regex.match(s)):
        print r.get().groups()
        # or
        print r().groups()
As also noted in the original recipe, if you use it a lot, adding the class and/or an instance of it to the __builtin__ module to make it globally available is very tempting despite the potential downsides:
import __builtin__
__builtin__.DataHolder = DataHolder
__builtin__.data = DataHolder()
As I mentioned in my other answer to this question, it must be noted that this approach is limited to holding only one result/value at a time, so more than one instance is required to handle situations where multiple values need to be saved simultaneously, such as in nested function calls, loops or other threads. That doesn't mean you should use it or the other answer, just that more effort will be required.