Maximum recursion depth exceeded. Multiprocessing and bs4 - python

I'm trying to write a parser using BeautifulSoup and multiprocessing, but I get this error:
RecursionError: maximum recursion depth exceeded
My code is:
import bs4, requests, time
from multiprocessing.pool import Pool

html = requests.get('https://www.avito.ru/moskva/avtomobili/bmw/x6?sgtd=5&radius=0')
soup = bs4.BeautifulSoup(html.text, "html.parser")
divList = soup.find_all("div", {'class': 'item_table-header'})

def new_check():
    with Pool() as pool:
        pool.map(get_info, divList)

def get_info(each):
    pass

if __name__ == '__main__':
    new_check()
Why do I get this error, and how can I fix it?
UPDATE:
The full text of the error is:
Traceback (most recent call last):
File "C:/Users/eugen/PycharmProjects/avito/main.py", line 73, in <module> new_check()
File "C:/Users/eugen/PycharmProjects/avito/main.py", line 67, in new_check
pool.map(get_info, divList)
File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 644, in get
raise self._value
File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 424, in _handle_tasks
put(task)
File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
RecursionError: maximum recursion depth exceeded

When you use multiprocessing, everything you pass to a worker has to be pickled.
Unfortunately, many BeautifulSoup trees can't be pickled.
There are a few different reasons for this. Some of them are bugs that have since been fixed, so you could try making sure you have the latest bs4 version, and some are specific to different parsers or tree builders… but there's a good chance nothing like this will help.
But the fundamental problem is that many elements in the tree contain references to the rest of the tree.
Occasionally, this leads to an actual infinite loop, because the circular references are too indirect for pickle's circular-reference detection. But that's usually a bug that gets fixed.
More importantly, even when the loop isn't infinite, pickling one element can still drag in more than 1000 elements from all over the rest of the tree, and that's already enough to cause a RecursionError.
And I think the latter is what's happening here. If I take your code and try to pickle divList[0], it fails. (If I bump the recursion limit way up and count the frames, it needs a depth of 23080, which is way, way past the default of 1000.) But if I take that exact same div and parse it separately, it succeeds with no problem.
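If you want to reproduce that kind of check yourself, here is a minimal sketch, reusing bs4 and divList from the question's code (the exact depth needed will vary from page to page):
import pickle

# An element still attached to the full tree typically blows the recursion limit...
try:
    pickle.dumps(divList[0])
except RecursionError:
    print("pickling the attached div exceeded the recursion limit")

# ...while re-parsing the same markup in isolation produces a small, picklable tree.
isolated = bs4.BeautifulSoup(str(divList[0]), "html.parser")
print(len(pickle.dumps(isolated)), "bytes pickled without error")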
So, one possibility is to just call sys.setrecursionlimit(25000). That will solve the problem for this exact page, but a slightly different page might need even more than that. (Plus, it's usually not a great idea to set the recursion limit that high: not so much because of the wasted memory, but because it means actual infinite recursion takes 25x as long, and wastes 25x as many resources, before it's detected.)
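For completeness, that workaround is just:
import sys

sys.setrecursionlimit(25000)  # raises the ceiling for this page; see the caveats above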
Another trick is to write code that "prunes the tree", eliminating any upward links from the div before/as you pickle it. This is a great solution, except that it might be a lot of work, and requires diving into the internals of how BeautifulSoup works, which I doubt you want to do.
The easiest workaround is a bit clunky, but… you can convert the soup to a string, pass that to the child, and have the child re-parse it:
def new_check():
    divTexts = [str(div) for div in divList]
    with Pool() as pool:
        pool.map(get_info, divTexts)

def get_info(each):
    div = bs4.BeautifulSoup(each, 'html.parser')

if __name__ == '__main__':
    new_check()
The performance cost for doing this is probably not going to matter; the bigger worry is that if you had imperfect HTML, converting to a string and re-parsing it might not be a perfect round trip. So, I'd suggest that you do some tests without multiprocessing first to make sure this doesn't affect the results.
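One way to do that sanity check is a rough round-trip comparison; a minimal sketch, again reusing bs4 and divList from the question (string equality here is only a heuristic, not a guarantee):
# Re-parse each div's string form and compare the serialized markup.
# Differences here would mean the str()/re-parse step is changing the
# HTML that the worker processes will see.
for i, div in enumerate(divList):
    reparsed = bs4.BeautifulSoup(str(div), "html.parser")
    if str(reparsed) != str(div):
        print(f"div {i} did not round-trip cleanly")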

Related

What is this iteration error in the Wolfram Alpha API and how can I fix it?

So I'm working on making a simple research bot, but I've run into a problem. I was following a guide on using Wolfram Alpha in Python, and when I test it I sometimes get this error:
Traceback (most recent call last):
File "python", line 6, in <module>
StopIteration
Here is my code:
import wolframalpha
import wikipedia
client = wolframalpha.Client('my_id')
q=input('Problem: ')
res = client.query(q)
print(next(res.results).text)
It only happens with some queries, and it often works, but it's still rather annoying. I looked online but didn't find any help, so I don't know if this is new or if something is wrong with my code. Anyway, here is a link to a repl I made where it's not working: here. Try it with "uranium"; I know that one triggers the error, and so do a few others I've tried. Thanks!
This error is telling you that your query had no results.
This line:
print(next(res.results).text)
… calls next on an iterator, res.results, without a default value:
Retrieve the next item from the iterator by calling its __next__() method. If default is given, it is returned if the iterator is exhausted, otherwise StopIteration is raised.
If res had no results to show you, res.results is an empty iterator. Meaning it's exhausted right from the start, so when you call next on it, you're going to get StopIteration.
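You can see the same behavior with any empty iterator:
empty = iter([])   # an iterator with nothing in it
next(empty)        # raises StopIteration immediately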
And just passing a default isn't going to do much good here. Consider this:
print(next(res.results, None).text)
Now, if there are no results, next will return your default value None, and you'll immediately try to do None.text, which will just raise an AttributeError.
One way to fix this is to just handle the error:
try:
    print(next(res.results).text)
except StopIteration:
    print('No results')
Another is to break that compound expression up into simpler ones, so you can use a default:
result = next(res.results, None)
print(result.text if result else 'No results')
However, res can include 2 or 3 results just as easily as 0; that's the whole reason it returns an iterator. And usually, you're going to want all of them, or at least a few of them. If that's the case, the best solution is to use a for loop. Iterators are born hoping they'll be used in a for loop, because it makes everyone's life easier:
for result in res.results:
    print(result.text)
This will do nothing if results is empty, print one result if there's only one, or print all of the results if there are multiple.
If you're worried about getting 500 results when you only wanted a few, you can stop at, say, 3:
import itertools

for result in itertools.islice(res.results, 3):
    print(result.text)
… or:
for i, result in enumerate(res.results):
    print(result.text)
    if i >= 2:
        break

Python HTMLParser - stop parsing

I am using Python's HTMLParser from the html.parser module.
I am looking for a single tag and when it is found it would make sense to stop the parsing. Is this possible? I tried to call close() but I am not sure if this is the way to go.
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        login_form = False
        if tag == "form":
            print("finished")
            self.close()
However, this seems to have recursive effects, ending with
File "/usr/lib/python3.4/re.py", line 282, in _compile
p, loc = _cache[type(pattern), pattern, flags]
RuntimeError: maximum recursion depth exceeded in comparison
According to the docs, the close() method does this:
Force processing of all buffered data as if it were followed by an end-of-file mark.
You're still inside the handle_starttag and haven't finished working with the buffer yet, so you definitely do not want to process all the buffered data - that's why you're getting stuck with recursion. You can't stop the machine from inside the machine.
From the description of reset() this sounds more like what you want:
Reset the instance. Loses all unprocessed data.
but reset() also can't be called from within the methods it itself triggers, so this leads to the same recursion.
It sounds like you have two options:
raise an exception (for example StopIteration, or a custom one) and catch it around your call to the parser's feed(); see the sketch after this list. Depending on what else you're doing in the parsing, this may retain the information you need. You may need to do some checks to make sure files aren't left open.
use a simple flag (True/False) to record whether you have aborted. At the very start of handle_starttag, just return if aborted. The machinery will still go through all the tags of the HTML, but it will do nothing for each one. Obviously, if you're processing handle_endtag as well, it should also check the flag. You can set the flag back to normal either when you receive an <html> tag or by overriding the feed method.
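A minimal sketch of the first option, assuming html_text holds the document you are scanning (the StopParsing name is just illustrative, not part of the library):
from html.parser import HTMLParser

class StopParsing(Exception):
    """Raised to abandon parsing once the interesting tag is found."""

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "form":
            print("finished")
            raise StopParsing()

parser = MyHTMLParser()
try:
    parser.feed(html_text)
except StopParsing:
    pass  # found the tag we wanted; ignore the rest of the input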

Why does my generator hang instead of throwing exception?

I have a generator that returns lines from a number of files, through a filter. It looks like this:
def line_generator(self):
    # Find the relevant files
    files = self.get_files()
    # Read lines
    input_object = fileinput.input(files)
    for line in input_object:
        # Apply filter and yield if it is not *None*
        filtered = self.__line_filter(input_object.filename(), line)
        if filtered is not None:
            yield filtered
    input_object.close()
The method self.get_files() returns a list of file paths or an empty list.
I have tried to do s = fileinput.input([]) and then call s.next(). This is where it hangs, and I cannot understand why. I'm trying to be Pythonic and not handle every error myself, but I guess this is one case where there is no way around it. Or is there?
Unfortunately I have no means of testing this on Linux right now, but could someone please try the following on Linux, and comment what they get?
import fileinput
s = fileinput.input([])
s.next()
I'm on Windows with Python 2.7.5 (64 bit).
All in all, I'd really like to know:
Is this a bug in Python, or me that is doing something wrong?
Shouldn't .next() always return something, or raise a StopIteration?
fileinput defaults to stdin if the list is empty, so it's just waiting for you to type something.
An obvious fix would be to get rid of fileinput (it's not terribly useful anyway) and to be explicit, as the Zen of Python suggests:
for path in self.get_files():
    with open(path) as fp:
        for line in fp:
            filtered = self.__line_filter(path, line)
            if filtered is not None:
                yield filtered
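If you would rather keep fileinput, a minimal guard for the empty-list case might look like this (a sketch, not something from the original answer):
files = self.get_files()
if not files:
    return  # nothing to read; avoid fileinput falling back to stdin
input_object = fileinput.input(files)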
As others have already answered the main question, I'll try to answer one specific sub-item:
Shouldn't .next() always return something, or raise a StopIteration?
Yes, but it is not specified when that return is supposed to happen: it could take milliseconds, seconds, or even longer.
If you have a blocking iterator, you can define some wrapper around it so that it runs inside a different thread, filling a list or something, and the originating thread gets an interface to determine if there are data, if there are currently no data or if the source is exhausted.
I can elaborate on this even more if needed.
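For example, here is a minimal sketch of such a wrapper (the BackgroundIterator name is just illustrative; on Python 2 the module is spelled Queue rather than queue):
import queue
import threading

_DONE = object()  # sentinel marking exhaustion of the source iterator

class BackgroundIterator(object):
    """Drain a possibly-blocking iterator in a worker thread."""

    def __init__(self, iterator):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, args=(iterator,))
        self._thread.daemon = True
        self._thread.start()

    def _drain(self, iterator):
        for item in iterator:
            self._queue.put(item)
        self._queue.put(_DONE)

    def get(self, timeout=None):
        """Return the next item; raise queue.Empty if nothing arrived in time."""
        item = self._queue.get(timeout=timeout)
        if item is _DONE:
            raise StopIteration
        return item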

Python - Looping through functions throws errors in the background until maximum recursion depth is exceeded

So I think I am just fundamentally doing something wrong, but here is a basic example of what I am doing:
# some variables here
# some code here to run once

def runfirst():
    # do some stuff
    runsecond()

def runsecond():
    # do some different stuff
    runthird()

def runthird():
    # do some more stuff
    runfirst()

runfirst()
So it basically pulls some info I need at the beginning, and then runs through three different functions. What I am doing is pulling info from the db, then watching some counts on the db, and if any of those counts goes over a certain number within a time period, it sends an email. This is for monitoring purposes, and I need it to run all the time. The problem is that, the whole time it is running, it keeps throwing errors in the background like "File "asdf.py", line blah, in firstrun".
I think it is complaining because it sees that I am looping through the functions, but for what I need this to do it works perfectly, except for the errors and for eventually killing my script with maximum recursion depth exceeded. Any help?
You have infinite recursion here. Because you call runfirst from runthird, it keeps going deeper and deeper and none of the functions ever return. You might want to consider putting the functions in a while True loop instead of calling them from each other.
def runfirst():
    pass  # do some stuff

def runsecond():
    pass  # do some different stuff

def runthird():
    pass  # do some more stuff

while True:
    runfirst()
    runsecond()
    runthird()
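Since this is a polling/monitoring job, you will probably also want a pause between passes so it doesn't hammer the database; a minimal sketch (the 60-second interval is just an example, not from the original answer):
import time

while True:
    runfirst()
    runsecond()
    runthird()
    time.sleep(60)  # wait before the next polling pass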
You're not looping.
You're calling a function that calls another function that calls a third function that calls the first function, which calls the second function, which calls the third function, which again calls the first function... and so on, until you exceed the maximum recursion depth.

Twisted sometimes throws (seemingly incomplete) 'maximum recursion depth exceeded' RuntimeError

Because the Twisted getPage function doesn't give me access to headers, I had to write my own getPageWithHeaders function.
import traceback

from twisted.web.client import HTTPClientFactory, _makeGetterFactory

def getPageWithHeaders(url, contextFactory=None, *args, **kwargs):
    try:
        return _makeGetterFactory(url, HTTPClientFactory,
                                  contextFactory=contextFactory,
                                  *args, **kwargs)
    except:
        traceback.print_exc()
This is exactly the same as the normal getPage function, except that I added the try/except block and return the factory object instead of returning factory.deferred.
For some reason, I sometimes get a maximum recursion depth exceeded error here. It happens consistently a few times out of 700, usually on different sites each time. Can anyone shed any light on this? I'm not clear why or how this could be happening, and the Twisted codebase is large enough that I don't even know where to look.
EDIT: Here's the traceback I get, which seems bizarrely incomplete:
Traceback (most recent call last):
File "C:\keep-alive\utility\background.py", line 70, in getPageWithHeaders
factory = _makeGetterFactory(url, HTTPClientFactory, timeout=60 , contextFactory=context, *args, **kwargs)
File "c:\Python26\lib\site-packages\twisted\web\client.py", line 449, in _makeGetterFactory
factory = factoryFactory(url, *args, **kwargs)
File "c:\Python26\lib\site-packages\twisted\web\client.py", line 248, in __init__
self.headers = InsensitiveDict(headers)
RuntimeError: maximum recursion depth exceeded
This is the entire traceback, which clearly isn't long enough to have exceeded our max recursion depth. Is there something else I need to do in order to get the full stack? I've never had this problem before; typically when I do something like
def f(): return f()
try: f()
except: traceback.print_exc()
then I get the kind of "maximum recursion depth exceeded" stack that you'd expect, with a ton of references to f()
The specific traceback that you're looking at is a bit mystifying. You could try traceback.print_stack rather than traceback.print_exc to get a look at the entire stack above the problematic code, rather than just the stack going back to where the exception is caught.
Without seeing more of your traceback I can't be certain, but you may be running into the problem where Deferreds will raise a recursion limit exception if you chain too many of them together.
If you turn on Deferred debugging (from twisted.internet.defer import setDebugging; setDebugging(True)) you may get more useful tracebacks in some cases, but please be aware that this may also slow down your server quite a bit.
You should look at the traceback you're getting together with the exception -- that will tell you which function(s) are recursing too deeply, "below" _makeGetterFactory. Most likely you'll find that your own getPageWithHeaders is involved in the recursion, exactly because instead of properly returning a deferred it tries to return a factory that isn't ready yet. What happens if you go back to returning the deferred? (One way to do that while still getting at the headers is sketched below.)
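For instance, a minimal sketch that returns the deferred, as getPage does, but also hands back the factory's headers; this assumes the old HTTPClientFactory exposes a response_headers attribute once the headers arrive, which you should verify against your Twisted version:
from twisted.web.client import HTTPClientFactory, _makeGetterFactory

def getPageWithHeaders(url, contextFactory=None, *args, **kwargs):
    factory = _makeGetterFactory(url, HTTPClientFactory,
                                 contextFactory=contextFactory,
                                 *args, **kwargs)
    # Fire with (body, headers) instead of just the body, keeping the
    # deferred-based flow that the rest of Twisted expects.
    return factory.deferred.addCallback(
        lambda body: (body, factory.response_headers))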
The URL opener is likely following a never-ending series of 301 or 302 redirects.
