BeautifulSoup timing out on instantiation?

I'm just doing some web scraping with BeautifulSoup and I'm running into a weird error. Code:
print "Running urllib2"
g = urllib2.urlopen(link + "about", timeout=5)
print "Finished urllib2"
about_soup = BeautifulSoup(g, 'lxml')
Here's the output:
Running urllib2
Finished urllib2
Error
Traceback (most recent call last):
File "/Users/pspieker/Documents/projects/ThePyStrikesBack/tests/TestSpringerOpenScraper.py", line 10, in test_strip_chars
for row in self.instance.get_entries():
File "/Users/pspieker/Documents/projects/ThePyStrikesBack/src/JournalScrapers.py", line 304, in get_entries
about_soup = BeautifulSoup(g, 'lxml')
File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
markup = markup.read()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 355, in read
data = self._sock.recv(rbufsize)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 588, in read
return self._read_chunked(amt)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 648, in _read_chunked
value.append(self._safe_read(amt))
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 703, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 384, in read
data = self._sock.recv(left)
timeout: timed out
I understand that the urllib2.urlopen could be causing problems, but the exception occurs on the line instantiating BeautifulSoup. I did some googling but couldn't find anything about BeautifulSoup timeout issues.
Any ideas on what is happening?

It is the urllib2 part that is causing the timeout.
The reason you see it failing on the BeautifulSoup instantiation line is that g, the file-like object, is only read by BeautifulSoup internally, at construction time. This is the part of the stack trace that proves it:
File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__
markup = markup.read()
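One way to make the real failure point explicit is to do the read yourself, with the timeout handled, and hand BeautifulSoup a string; parsing a string cannot time out. A minimal sketch under the question's setup (link is the variable from the question):
import socket
import urllib2
from bs4 import BeautifulSoup
try:
    g = urllib2.urlopen(link + "about", timeout=5)
    markup = g.read()  # the timeout fires here, while the body is being downloaded
except socket.timeout:
    print "Timed out while reading the response body"
    raise
about_soup = BeautifulSoup(markup, 'lxml')  # pure parsing, no network I/O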


Beautiful Soup Traceback on First Attempt

Hello, I'm new to Python and Beautiful Soup. I have installed BS4 with pip and am attempting to do some web scraping. I have looked through a lot of help guides but haven't been able to get my BeautifulSoup() call to work when running from the cmd prompt. Here is my code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
This is the traceback I get with a URL input:
C:\Users\aaron\OneDrive\Desktop\Coding>python urllinks_get.py
Enter - http://www.dr-chuck.com/page1.htm
Traceback (most recent call last):
File "C:\Users\aaron\OneDrive\Desktop\Coding\urllinks_get.py", line 21, in <module>
soup = BeautifulSoup(html, 'html.parser')
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 215, in __init__
self._feed()
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 239, in _feed
self.builder.feed(self.markup)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\builder\_htmlparser.py", line 164, in feed
parser.feed(markup)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 110, in feed
self.goahead(0)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 170, in goahead
k = self.parse_starttag(i)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 344, in parse_starttag
self.handle_starttag(tag, attrs)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\builder\_htmlparser.py", line 62, in handle_starttag
self.soup.handle_starttag(name, None, None, attr_dict)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 404, in handle_starttag
self.currentTag, self._most_recent_element)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1001, in __getattr__
return self.find(tag)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1238, in find
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1259, in find_all
return self._find_all(name, attrs, text, limit, generator, **kwargs)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 516, in _find_all
strainer = SoupStrainer(name, attrs, text, **kwargs)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1560, in __init__
self.text = self._normalize_search_value(text)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1565, in _normalize_search_value
if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value, 'match')
AttributeError: module 'collections' has no attribute 'Callable'
I'd really like to continue my online classes, so any help would be much appreciated!
Thanks!
Found my issue. I had installed beautifulsoup4 with pip, but I also had a bs4 folder in the same directory my program ran from. I didn't realize they would interfere with one another: the local folder shadowed the installed package, so Python imported a stale copy of bs4 that still referenced collections.Callable (removed from the top-level collections module in Python 3.10). As soon as I removed the bs4 folder from the directory, my program ran fine :)
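A quick diagnostic sketch for this kind of shadowing: before constructing the soup, print where bs4 is actually being imported from.
import bs4
print(bs4.__file__)     # points into the local folder when it shadows the pip install
print(bs4.__version__)  # a stale vendored copy is typically far older than the pip release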

Multi-thread in Python [duplicate]

This question already has answers here:
No schema supplied and other errors with using requests.get()
(6 answers)
Closed 6 years ago.
I'm following the book "Automate the Boring Stuff with Python" and I'm trying to create a program that downloads multiple comics from http://xkcd.com simultaneously, but I've run into some problems. I'm copying the program exactly as it appears in the book.
Here's my code:
# multidownloadXkcd.py - Downloads XKCD comics using multiple threads.
import requests, os, bs4, threading
os.chdir('c:\\users\\patty\\desktop')
os.makedirs('xkcd', exist_ok=True)  # store comics in ./xkcd
def downloadXkcd(startComic, endComic):
    for urlNumber in range(startComic, endComic):
        # Download the page.
        print('Downloading page http://xkcd.com/%s...' % (urlNumber))
        res = requests.get('http://xkcd.com/%s' % (urlNumber))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        # Find the URL of the comic image.
        comicElem = soup.select('#comic img')
        if comicElem == []:
            print('Could not find comic image.')
        else:
            comicUrl = comicElem[0].get('src')
            # Download the image.
            print('Downloading image %s...' % (comicUrl))
            res = requests.get(comicUrl, "html.parser")
            res.raise_for_status()
            # Save the image to ./xkcd.
            imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)
            imageFile.close()
downloadThreads = []  # a list of all the Thread objects
for i in range(0, 1400, 100):  # loops 14 times, creates 14 threads
    downloadThread = threading.Thread(target=downloadXkcd, args=(i, i + 99))
    downloadThreads.append(downloadThread)
    downloadThread.start()
# Wait for all threads to end.
for downloadThread in downloadThreads:
    downloadThread.join()
print('Done.')
I'm getting the following exception:
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Python\Python35\lib\threading.py", line 914, in _bootstrap_inner
self.run()
File "C:\Python\Python35\lib\threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\PATTY\PycharmProjects\CH15_TASKS\practice.py", line 13, in downloadXkcd
res.raise_for_status()
File "C:\Python\Python35\lib\site-packages\requests\models.py", line 862, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://xkcd.com/0
Exception in thread Thread-2:
Traceback (most recent call last):
File "C:\Python\Python35\lib\threading.py", line 914, in _bootstrap_inner
self.run()
File "C:\Python\Python35\lib\threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\PATTY\PycharmProjects\CH15_TASKS\practice.py", line 25, in downloadXkcd
res = requests.get(comicUrl, "html.parser")
File "C:\Python\Python35\lib\site-packages\requests\api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "C:\Python\Python35\lib\site-packages\requests\api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python\Python35\lib\site-packages\requests\sessions.py", line 461, in request
prep = self.prepare_request(req)
File "C:\Python\Python35\lib\site-packages\requests\sessions.py", line 394, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "C:\Python\Python35\lib\site-packages\requests\models.py", line 294, in prepare
self.prepare_url(url, params)
File "C:\Python\Python35\lib\site-packages\requests\models.py", line 354, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/family_circus.jpg': No schema supplied. Perhaps you meant http:////imgs.xkcd.com/comics/family_circus.jpg?
It says that the URL is invalid, but whenever I copy and paste that URL into the web browser it seems to be valid. Does anyone know how I would fix this? Thanks
Yeah, as @spectras said, just because your browser fixes the URL for you doesn't mean it is valid.
The src you scraped is scheme-relative (it starts with //), so prepend a scheme like "http:" to it and see if it works.
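A minimal sketch of that fix inside the download loop, using the variable names from the question's code:
comicUrl = comicElem[0].get('src')
if comicUrl.startswith('//'):      # scheme-relative URL scraped from the page
    comicUrl = 'http:' + comicUrl  # give requests a complete, absolute URL
res = requests.get(comicUrl)       # also drop the stray "html.parser" argument
res.raise_for_status()
The separate 404 in Thread-1 is a different problem: range(0, 1400, 100) makes the first thread start at comic 0, which doesn't exist, so starting that range at 1 avoids it.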

HTML Link parsing using BeautifulSoup

Here is the Python code I'm using to extract specific HTML from the page links I pass in as a parameter. I'm using BeautifulSoup. This code works fine sometimes, and sometimes it gets stuck!
import urllib
from bs4 import BeautifulSoup
rawHtml = ''
url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
for i in range(1, 49):
    # iterate url and capture content
    sock = urllib.urlopen(url + str(i))
    html = sock.read()
    sock.close()
    rawHtml += html
    print i
Here I'm printing the loop variable to find out where it gets stuck. It shows me that it gets stuck at random points in the loop.
soup = BeautifulSoup(rawHtml, 'html.parser')
t = ''
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    #t += str(link) + "</br>"
f = open("Link.txt", 'w+')
f.write(t)
f.close()
What could the issue be? Is it a problem with the socket configuration, or something else?
This is the error I got. I checked these links for a solution - python-gaierror-errno-11004, ioerror-errno-socket-error-errno-11004-getaddrinfo-failed - but I didn't find them much help.
d:\python>python ext.py
Traceback (most recent call last):
File "ext.py", line 8, in <module>
sock = urllib.urlopen(url+ str(i))
File "d:\python\lib\urllib.py", line 87, in urlopen
return opener.open(url)
File "d:\python\lib\urllib.py", line 213, in open
return getattr(self, name)(url)
File "d:\python\lib\urllib.py", line 350, in open_http
h.endheaders(data)
File "d:\python\lib\httplib.py", line 1049, in endheaders
self._send_output(message_body)
File "d:\python\lib\httplib.py", line 893, in _send_output
self.send(msg)
File "d:\python\lib\httplib.py", line 855, in send
self.connect()
File "d:\python\lib\httplib.py", line 832, in connect
self.timeout, self.source_address)
File "d:\python\lib\socket.py", line 557, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
It runs perfectly fine on my personal laptop, but it gives this error on my office desktop. Also, my version of Python is 2.7. Hope this information helps.
Finally, guys... it worked! The same script worked when I checked it on other PCs too, so the problem was probably the firewall or proxy settings of my office desktop blocking this website.
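If a corporate proxy is the culprit, Python 2's urllib can be pointed at it explicitly rather than relying on the environment. A sketch using the question's url and i; the proxy host and port are placeholders for your office settings:
import urllib
proxies = {'http': 'http://proxy.example.com:8080'}  # hypothetical proxy address
sock = urllib.urlopen(url + str(i), proxies=proxies)
html = sock.read()
sock.close()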

RuntimeError: maximum recursion depth exceeded while trying to scrape data from a website [duplicate]

I am following a tutorial to try to learn how to use BeautifulSoup. I am trying to pull the names out of the URLs on an HTML page I downloaded. I have it working great up to this point.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    print link
But when I enter this next part:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print names
    print fullLink
I get this error
Traceback (most recent call last):
File "C:/Python27/python tutorials/soupexample.py", line 13, in <module>
print names
File "C:\Python27\lib\idlelib\PyShell.py", line 1325, in write
return self.shell.write(s, self.tags)
File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
seq = self.asynccall(oid, methodname, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
self.putmessage((seq, request))
File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
s = pickle.dumps(message)
File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
This is a buggy interaction between IDLE and BeautifulSoup's NavigableString objects (which subclass unicode). See issue 1757057; it's been around for a while.
The work-around is to convert the object to a plain unicode value first:
print unicode(names)
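Applied to the loop above, the work-around looks like this:
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print unicode(names)  # a plain unicode value pickles cleanly over IDLE's RPC
    print fullLink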

What is the best way to handle a bad link given to BeautifulSoup?

I'm working on something that pulls in urls from delicious and then uses those urls to discover associated feeds.
However, some of the bookmarks in delicious are not links to HTML and cause BS to barf. Basically, I want to throw away a link if what I fetch does not look like HTML.
Right now, this is what I'm getting.
trillian:Documents jauderho$ ./d2o.py "green data center"
processing http://www.greenm3.com/
processing http://www.eweek.com/c/a/Green-IT/How-to-Create-an-EnergyEfficient-Green-Data-Center/?kc=rss
Traceback (most recent call last):
File "./d2o.py", line 53, in <module>
get_feed_links(d_links)
File "./d2o.py", line 43, in get_feed_links
soup = BeautifulSoup(html)
File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 314, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u'</b />', at line 739, column 1
Update:
Jehiah's answer does the trick. For reference, here's some code to get the content type:
def check_for_html(link):
    out = urllib.urlopen(link)
    return out.info().getheader('Content-Type')
I simply wrap my BeautifulSoup processing and look for the HTMLParser.HTMLParseError exception:
import HTMLParser, BeautifulSoup
try:
    soup = BeautifulSoup.BeautifulSoup(raw_html)
    for a in soup.findAll('a'):
        href = a['href']
        ....
except HTMLParser.HTMLParseError:
    print "failed to parse", url
But beyond that, you can check the Content-Type of the response when you crawl a page and make sure that it's something like text/html or application/xhtml+xml before you even try to parse it. That should head off most errors.
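Putting the two together, a minimal sketch of a guard that skips non-HTML links before parsing (the helper name soup_if_html is illustrative):
import urllib
import HTMLParser
import BeautifulSoup
def soup_if_html(link):
    out = urllib.urlopen(link)
    content_type = out.info().getheader('Content-Type') or ''
    if 'text/html' not in content_type and 'xhtml' not in content_type:
        return None  # not HTML, throw the link away
    try:
        return BeautifulSoup.BeautifulSoup(out.read())
    except HTMLParser.HTMLParseError:
        return None  # claimed to be HTML but would not parse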
