I am new to web scraping, and I am having a few difficulties using BeautifulSoup that seem more related to installation than to the code itself. I have installed bs4 and want to get data from web pages. I started with a simple exercise as follows:
import requests
import urllib
from BeautifulSoup import BeautifulSoup
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
which gives me the following error message:
Traceback (most recent call last):
File "<ipython-input-62-a9912850b0dc>", line 1, in <module>
soup = BeautifulSoup(page.content, 'html.parser')
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
self._feed(isHTML=isHTML)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
SGMLParser.feed(self, markup)
File "/Users/../anaconda/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/Users/../anaconda/lib/python2.7/sgmllib.py", line 174, in goahead
k = self.parse_declaration(i)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1463, in parse_declaration
j = SGMLParser.parse_declaration(self, i)
File "/Users/../anaconda/lib/python2.7/markupbase.py", line 109, in parse_declaration
self.handle_decl(data)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1448, in handle_decl
self._toStringSubclass(data, Declaration)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1381, in _toStringSubclass
self.endData(subclass)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1251, in endData
(not self.parseOnlyThese.text or \
AttributeError: 'str' object has no attribute 'text'
If I remove 'html.parser' and use
soup = BeautifulSoup(page.content)
the code works, but, of course, it does not give me what I need.
Any clues as to how to solve this? I am on OS X El Capitan and use Spyder as my editor. I have re-installed bs4 a few times.
Thanks
You are using an old version of BeautifulSoup. Uninstall it, install BeautifulSoup4 with pip install BeautifulSoup4, and then adjust your code as follows:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
s = BeautifulSoup(r.content, 'html.parser')
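From there the soup object works as expected. As a quick sanity check you could pull the forecast period names; note that the 'seven-day-forecast' id and 'period-name' class below are assumptions about that page's current markup, so inspect the page and adjust if it has changed:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
s = BeautifulSoup(r.content, 'html.parser')

# confirm parsing worked
print(s.title.get_text())

# 'seven-day-forecast' is a guess at the forecast container's id
forecast = s.find(id='seven-day-forecast')
if forecast is not None:
    for period in forecast.select('.period-name'):
        print(period.get_text())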
Try this snippet, using page.text in place of page.content:
soup = BeautifulSoup(page.text, 'html.parser')
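For context, page.content is the raw response as bytes, while page.text is the same response decoded to text using the encoding requests inferred from the headers; BeautifulSoup accepts either. A quick way to see the difference:
import requests

page = requests.get('http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
print(type(page.content))  # bytes (str on Python 2)
print(type(page.text))     # str (unicode on Python 2)
print(page.encoding)       # the encoding requests inferred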
Hello, I'm new to Python and Beautiful Soup. I have installed bs4 with pip and am attempting to do some web scraping. I have looked through a lot of help guides and haven't been able to get my BeautifulSoup() call to work when running from the cmd prompt. Here is my code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
This is the traceback I get with a URL input:
C:\Users\aaron\OneDrive\Desktop\Coding>python urllinks_get.py
Enter - http://www.dr-chuck.com/page1.htm
Traceback (most recent call last):
File "C:\Users\aaron\OneDrive\Desktop\Coding\urllinks_get.py", line 21, in <module>
soup = BeautifulSoup(html, 'html.parser')
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 215, in __init__
self._feed()
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 239, in _feed
self.builder.feed(self.markup)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\builder\_htmlparser.py", line 164, in feed
parser.feed(markup)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 110, in feed
self.goahead(0)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 170, in goahead
k = self.parse_starttag(i)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2032.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 344, in parse_starttag
self.handle_starttag(tag, attrs)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\builder\_htmlparser.py", line 62, in handle_starttag
self.soup.handle_starttag(name, None, None, attr_dict)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\__init__.py", line 404, in handle_starttag
self.currentTag, self._most_recent_element)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1001, in __getattr__
return self.find(tag)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1238, in find
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1259, in find_all
return self._find_all(name, attrs, text, limit, generator, **kwargs)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 516, in _find_all
strainer = SoupStrainer(name, attrs, text, **kwargs)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1560, in __init__
self.text = self._normalize_search_value(text)
File "C:\Users\aaron\OneDrive\Desktop\Coding\bs4\element.py", line 1565, in _normalize_search_value
if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value, 'match')
AttributeError: module 'collections' has no attribute 'Callable'
I'd really like to continue my online classes, so any help would be much appreciated!
Thanks!
Found my issue. I had installed beautifulsoup4, but I also had a bs4 folder in the same directory my program ran in, and I didn't realize they would interfere with one another. The local folder shadowed the installed package, and that older copy still referenced collections.Callable, which was removed in Python 3.10 (it now lives in collections.abc). As soon as I removed the bs4 folder from the directory, my program ran fine :)
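For anyone who hits something similar: a quick way to check which copy of bs4 Python is actually importing is to print the module's path (a minimal check, nothing project-specific):
import bs4

# If this prints a path inside your project directory instead of
# site-packages, a local copy is shadowing the installed package.
print(bs4.__file__)
print(bs4.__version__)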
I am new to Python and its available libraries, and I am trying to make a script to scrape a website. I want to read all the links on a parent page, and have my script parse out and read data from every child link of that parent page.
For some reason, I am getting this sequence of errors for my code:
python ./scrape.py
/
Traceback (most recent call last):
File "./scrape.py", line 27, in <module>
a = requests.get(url)
File "/Library/Python/2.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 494, in request
prep = self.prepare_request(req)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 437, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/Library/Python/2.7/site-packages/requests/models.py", line 305, in prepare
self.prepare_url(url, params)
File "/Library/Python/2.7/site-packages/requests/models.py", line 379, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/': No schema supplied. Perhaps you meant http:///?
From my Python script here:
from bs4 import BeautifulSoup
import requests
#somesite = 'https://www.somesite.com/'
page = 'https://www.investopedia.com/terms/s/stop-limitorder.asp'
count = 0
#url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(page) #requests html document
data = r.text #set data = to html text
soup = BeautifulSoup(data, "html.parser") #parse data with BS
#count = 0;
#souplist = []
#list
A = []
#loop to search for all <a> tags that hold urls, store page data in array
for link in soup.find_all('a'):
    #print(link.get('href'))
    url = link.get('href')
    print(url)
    a = requests.get(url)
    #a = requests.get(url)
    #data1 = a.text
    #souplist.insert(0, BeautifulSoup[data1])
    #++count
#
#for link in soup.find_all('p'):
#    print(link.getText())
Some of the links on the page you are scraping are relative URLs on the website (https://www.investopedia.com), so you have to resolve such URLs against the site before requesting them:
from urlparse import urlparse, urljoin
# Python 3:
# from urllib.parse import urlparse, urljoin

site = urlparse(page).scheme + "://" + urlparse(page).netloc
for link in soup.find_all('a'):
    url = link.get('href')
    if not url:
        continue  # skip <a> tags that have no href attribute
    if not urlparse(url).scheme:
        url = urljoin(site, url)
    print(url)
    a = requests.get(url)
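To see what urljoin does with links like the ones on that page (the paths below are illustrative, not necessarily real URLs on the site):
from urlparse import urljoin  # Python 3: from urllib.parse import urljoin

site = 'https://www.investopedia.com'
# a relative link is resolved against the site
print(urljoin(site, '/terms/l/limitorder.asp'))
# -> https://www.investopedia.com/terms/l/limitorder.asp
# an absolute link passes through unchanged
print(urljoin(site, 'https://www.example.com/page'))
# -> https://www.example.com/page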
I need to get the column headers from the second tbody at this URL:
http://bepi.mpob.gov.my/index.php/statistics/price/daily.html
Specifically, I would like to see "September", "October", etc.
I am getting the following error:
runfile('C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py', wdir='C:/Python27/Lib/site-packages/xy/workspace/webscrape')
Traceback (most recent call last):
File "<ipython-input-8-ab4005f51fa3>", line 1, in <module>
runfile('C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py', wdir='C:/Python27/Lib/site-packages/xy/workspace/webscrape')
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py", line 26, in <module>
soup.findAll('tbody', limit=2)[1].findAll('tr').findAll('th')]
IndexError: list index out of range
Can anyone here please help me out? I shall be eternally grateful! I have posted my code below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://bepi.mpob.gov.my/index.php/statistics/price/daily.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
column_headers = [th.getText() for th in
soup.findAll('tbody', limit=2)[1].findAll('tr').findAll('th')]
When you click the "View Price" button, a POST request is sent to the http://bepi.mpob.gov.my/admin2/price_local_daily_view3.php endpoint. Simulate this POST request and parse the resulting HTML:
import requests
from bs4 import BeautifulSoup
with requests.Session() as session:
    session.get("http://bepi.mpob.gov.my/index.php/statistics/price/daily.html")
    response = session.post("http://bepi.mpob.gov.my/admin2/price_local_daily_view3.php", data={
        "tahun": "2016",
        "bulan": "9",
        "Submit2222": "View Price"
    })
    soup = BeautifulSoup(response.content, 'lxml')
    table = soup.find("table", id="hor-zebra")
    headers = [td.get_text() for td in table.find_all("tr")[2].find_all("td")]
    print(headers)
Prints the headers of the table:
[u'Tarikh', u'September', u'October', u'November', u'December', u'September', u'October', u'November', u'December', u'September', u'October', u'November', u'December']
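Since the question already imports pandas, you could also hand the same response to pandas.read_html. This is a sketch that reuses the hor-zebra table id from above; the exact header row pandas infers may need adjusting:
import pandas as pd

# parse only the table whose id matches
tables = pd.read_html(response.text, attrs={"id": "hor-zebra"})
df = tables[0]
print(df.head())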
I am following a tutorial to try to learn how to use BeautifulSoup. I am trying to pull names and URLs out of the links on an HTML page I downloaded. I have it working great up to this point.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    print link
but when I enter this next part
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print names
    print fullLink
I get this error
Traceback (most recent call last):
File "C:/Python27/python tutorials/soupexample.py", line 13, in <module>
print names
File "C:\Python27\lib\idlelib\PyShell.py", line 1325, in write
return self.shell.write(s, self.tags)
File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
seq = self.asynccall(oid, methodname, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
self.putmessage((seq, request))
File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
s = pickle.dumps(message)
File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
This is a buggy interaction between IDLE and BeautifulSoup's NavigableString objects (which subclass unicode). See issue 1757057; it's been around for a while.
The work-around is to convert the object to a plain unicode value first:
print unicode(names)
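Applied to the loop from the question (Python 2, matching the original code):
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    # a plain unicode object avoids the IDLE pickling bug
    print unicode(names)
    print fullLink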
I'm making a web page scraper using the BeautifulSoup4 and requests libraries. I had some trouble getting BeautifulSoup working, but got some help and was able to fix that. Now I've run into a new problem and I'm not sure how to fix it. I'm using requests 2.2.1 and trying to run this program on Python 3.1.2, and when I do I get a traceback error.
Here is my code:
from bs4 import BeautifulSoup
import requests
url = input("Enter a URL (start with www): ")
link = "http://" + url
page = requests.get(link).content
soup = BeautifulSoup(page)
for url in soup.find_all('a'):
    print(url.get('href'))
    print()
and the error:
Enter a URL (start with www): www.google.com
Traceback (most recent call last):
File "/Users/user/Desktop/project.py", line 8, in <module>
page = requests.get(link).content
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/sessions.py", line 349, in request
prep = self.prepare_request(req)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/sessions.py", line 287, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/models.py", line 287, in prepare
self.prepare_url(url,params)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/models.py", line 321, in prepare_url
url = str(url)
TypeError: 'tuple' object is not callable
I've done some looking, and when others have gotten this error (mostly in Django) there was a comma missing, but I'm not sure where I would put a comma. Any help will be appreciated.