I need to get the column headers from the second tbody at this URL:
http://bepi.mpob.gov.my/index.php/statistics/price/daily.html
Specifically, I would like to see "September", "October", etc.
I am getting the following error:
runfile('C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py', wdir='C:/Python27/Lib/site-packages/xy/workspace/webscrape')
Traceback (most recent call last):
File "<ipython-input-8-ab4005f51fa3>", line 1, in <module>
runfile('C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py', wdir='C:/Python27/Lib/site-packages/xy/workspace/webscrape')
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py", line 26, in <module>
soup.findAll('tbody', limit=2)[1].findAll('tr').findAll('th')]
IndexError: list index out of range
Can anyone here please help me out? I shall be eternally grateful!
I have posted my code below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://bepi.mpob.gov.my/index.php/statistics/price/daily.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
column_headers = [th.getText() for th in
soup.findAll('tbody', limit=2)[1].findAll('tr').findAll('th')]
When you click the "View Price" button, a POST request is sent to the http://bepi.mpob.gov.my/admin2/price_local_daily_view3.php endpoint. Simulate this POST request and parse the resulting HTML:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    # visit the page first so the session picks up any cookies it needs
    session.get("http://bepi.mpob.gov.my/index.php/statistics/price/daily.html")

    # simulate the "View Price" form submission
    response = session.post("http://bepi.mpob.gov.my/admin2/price_local_daily_view3.php", data={
        "tahun": "2016",
        "bulan": "9",
        "Submit2222": "View Price"
    })

soup = BeautifulSoup(response.content, 'lxml')

table = soup.find("table", id="hor-zebra")
headers = [td.get_text() for td in table.find_all("tr")[2].find_all("td")]
print(headers)
Prints the headers of the table:
[u'Tarikh', u'September', u'October', u'November', u'December', u'September', u'October', u'November', u'December', u'September', u'October', u'November', u'December']
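Since the original script already imports pandas, the same table can also be loaded in one step with pandas.read_html. This is only a sketch: it reuses the response object from the snippet above and assumes pandas (with lxml or html5lib available) is installed:

import pandas as pd

# header=2 mirrors find_all("tr")[2] above: the third row holds the column names.
tables = pd.read_html(response.text, attrs={"id": "hor-zebra"}, header=2)
df = tables[0]
print(df.columns.tolist())  # Tarikh, September, October, ...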
I am new to Python and its available libraries, and I am trying to make a script to scrape a website. I want to read all the links on a parent page, and have my script parse and read data from all of the child pages linked from the parent page.
For some reason, I am getting this sequence of errors for my code:
python ./scrape.py
/
Traceback (most recent call last):
File "./scrape.py", line 27, in <module>
a = requests.get(url)
File "/Library/Python/2.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 494, in request
prep = self.prepare_request(req)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 437, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/Library/Python/2.7/site-packages/requests/models.py", line 305, in prepare
self.prepare_url(url, params)
File "/Library/Python/2.7/site-packages/requests/models.py", line 379, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/': No schema supplied. Perhaps you meant http:///?
From my Python script here:
from bs4 import BeautifulSoup
import requests

#somesite = 'https://www.somesite.com/"
page = 'https://www.investopedia.com/terms/s/stop-limitorder.asp'
count = 0
#url = raw_input("Enter a website to extract the URL's from: ")

r = requests.get(page)                      #requests html document
data = r.text                               #set data = to html text
soup = BeautifulSoup(data, "html.parser")   #parse data with BS

#count = 0;
#souplist = []
#list
A = []

#loop to search for all <a> tags that hold urls, store page data in array
for link in soup.find_all('a'):
    #print(link.get('href'))
    url = link.get('href')
    print(url)
    a = requests.get(url)
    #a = requests.get(url)
    #data1 = a.text
    #souplist.insert(0, BeautifulSoup[data1])
    #++count

#for link in soup.find_all('p'):
    #print(link.getText())
Some of the links on the page you are scraping are relative URLs to the website (https://www.investopedia.com), so you have to turn them into absolute URLs by prepending the site before requesting them.
from urlparse import urlparse, urljoin
# Python 3
# from urllib.parse import urlparse
# from urllib.parse import urljoin

site = urlparse(page).scheme + "://" + urlparse(page).netloc

for link in soup.find_all('a'):
    #print(link.get('href'))
    url = link.get('href')
    if not urlparse(url).scheme:
        url = urljoin(site, url)
    print(url)
    a = requests.get(url)
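One more caveat, not covered by the answer above: link.get('href') can return None for anchors without an href, and pages often contain mailto: or javascript: links that requests cannot fetch. A slightly more defensive version of the same loop (a sketch that reuses site, urljoin, requests, and soup from the code above):

for link in soup.find_all('a'):
    url = link.get('href')
    if not url:                       # skip <a> tags with no href at all
        continue
    if not urlparse(url).scheme:      # relative URL: make it absolute
        url = urljoin(site, url)
    if urlparse(url).scheme not in ("http", "https"):
        continue                      # skip mailto:, javascript:, etc.
    print(url)
    a = requests.get(url)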
I am new to web scraping, and I am having a few difficulties using BeautifulSoup, which seem more related to installation than to the code itself. I have installed bs4 and want to get data from web pages. I started with a simple exercise as follows:
import requests
import urllib
from BeautifulSoup import BeautifulSoup
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
which gives me the following error message:
Traceback (most recent call last):
File "<ipython-input-62-a9912850b0dc>", line 1, in <module>
soup = BeautifulSoup(page.content, 'html.parser')
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
self._feed(isHTML=isHTML)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
SGMLParser.feed(self, markup)
File "/Users/../anaconda/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/Users/../anaconda/lib/python2.7/sgmllib.py", line 174, in goahead
k = self.parse_declaration(i)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1463, in parse_declaration
j = SGMLParser.parse_declaration(self, i)
File "/Users/../anaconda/lib/python2.7/markupbase.py", line 109, in parse_declaration
self.handle_decl(data)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1448, in handle_decl
self._toStringSubclass(data, Declaration)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1381, in _toStringSubclass
self.endData(subclass)
File "/Users/../anaconda/lib/python2.7/site-packages/BeautifulSoup.py", line 1251, in endData
(not self.parseOnlyThese.text or \
AttributeError: 'str' object has no attribute 'text'
If I remove 'html.parser' and use
soup = BeautifulSoup(page.content)
the code works, but, of course, it does not give me what I need.
Any clues as to how to solve this? I am on OS X El Capitan and use Spyder as my editor. I have reinstalled bs4 a few times.
Thanks
You are using an old version of BeautifulSoup. Uninstall it, install BeautifulSoup4 with pip install BeautifulSoup4, and then adjust your code as follows:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
s = BeautifulSoup(r.content, 'html.parser')
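A quick sanity check (my addition, not part of the fix itself) to confirm the new package is the one actually being imported; it assumes the snippet above has run:

import bs4
print(bs4.__version__)   # should start with 4.
print(type(s))           # bs4.BeautifulSoup, not the old BeautifulSoup.BeautifulSoup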
Try this snippet:
soup = BeautifulSoup(page.content, 'html.parser')
# in place of page.content use page.text
soup = BeautifulSoup(page.text, 'html.parser')
Hi, my code won't work when actually running online: it returns None when I use find. How can I fix this?
This is my code:
import time
import sys
import urllib
import re
from bs4 import BeautifulSoup, NavigableString

print "Initializing Python Script"
print "The passed arguments are "

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
word = "tweakers"
alternate = "alternate"

while i < len(urls):
    dataraw = urllib.urlopen(urls[i])
    data = dataraw.read()
    soup = BeautifulSoup(data)
    table = soup.find("table", {"class" : "spec-detail"})
    print table
    i += 1
Here is the outcome:
Initializing Python Script
The passed arguments are
None
None
None
None
Script finalized
I have tried using findAll and other methods, but I don't seem to understand why it works on my command line but not on the server itself...
Any help?
Edit
Traceback (most recent call last):
File "python_script.py", line 35, in
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
I suspect you are experiencing differences between parsers.
Specifying the parser explicitly works for me:
import urllib2
from bs4 import BeautifulSoup

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

for url in urls:
    soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table
In this case, I'm using html.parser, but you can play around and specify lxml or html5lib, for example.
Note that the third URL doesn't contain a table with class="spec-detail", so it prints None for that one.
I've also introduced a few improvements:
removed unused imports
replaced the while loop and manual indexing with a for loop
removed unrelated code
replaced urllib with urllib2
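To see why the parser choice matters, here is a small self-contained sketch (the broken markup is made up purely for illustration); each parser repairs malformed HTML slightly differently, which is why find can succeed with one parser and return None with another:

from bs4 import BeautifulSoup

broken = "<table><tr><td>a<td>b"   # deliberately unclosed tags

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        print("%s -> %s" % (parser, BeautifulSoup(broken, parser)))
    except Exception as exc:       # lxml / html5lib may not be installed
        print("%s not available: %s" % (parser, exc))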
You can also use the requests module and set an appropriate User-Agent header, pretending to be a real browser:
from bs4 import BeautifulSoup
import requests

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}

for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table
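One optional hardening step (my addition, not part of the original answer): calling raise_for_status() right after the request makes a 403 or 404 surface as an exception instead of being parsed as though it were the product page. A minimal sketch using one of the URLs above:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
url = "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"

response = requests.get(url, headers=headers)
response.raise_for_status()   # raises requests.exceptions.HTTPError on 4xx/5xx responses
print(response.status_code)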
Hope that helps.
I want to create a script that returns all the URLs found in a page (a Google search results page, for example), so I created this script (using BeautifulSoup):
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("https://www.google.dz/search?q=see")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print link["href"]
and it returns this 403 Forbidden error:
Traceback (most recent call last):
File "C:\Python27\sql\sql.py", line 3, in <module>
page = urllib2.urlopen("https://www.google.dz/search?q=see")
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 438, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Any idea how to avoid this error, or another method to get the URLs from the search results?
No problem using requests:
import requests
from BeautifulSoup import BeautifulSoup
page = requests.get("https://www.google.dz/search?q=see")
soup = BeautifulSoup(page.content)
links = soup.findAll("a")
Some of the links contain two URLs run together (for example the webcache entries like search%3Fq%3Dcache:...:http://..., where the end of one URL joins the start of the next), so we need to split them using re:
import re
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.google.dz/search?q=see")
soup = BeautifulSoup(page.content)

links = soup.findAll("a")
for link in soup.find_all("a", href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
    print re.split(":(?=http)", link["href"].replace("/url?q=", ""))
['https://www.see.asso.fr/&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CBIQFjAA&usg=AFQjCNF2_I8jB98JwR3jcKniLZekSrRO7Q']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:f7M8NX1XmDsJ', 'https://www.see.asso.fr/%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CBUQIDAA&usg=AFQjCNF8WJButjMNXQXvXBbtyXnF1SgiOg']
['https://www.see.asso.fr/3ei&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CBgQ0gIoADAA&usg=AFQjCNGnPL1RiX5TekI_yMUc-w_f2oVXtw']
['https://www.see.asso.fr/node/9587&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CBkQ0gIoATAA&usg=AFQjCNHX-6AzBgLQUF0s8TxFcZjIhxz_Hw']
['https://www.see.asso.fr/ree&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CBoQ0gIoAjAA&usg=AFQjCNGkkd8e1JjiNrhSM4HQYE-M6g6j-w']
['https://www.see.asso.fr/node/130&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CBsQ0gIoAzAA&usg=AFQjCNEkVdpcbXDz5-cV9u2NNYoV6aM8VA']
['http://www.wordreference.com/enfr/see&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CB0QFjAB&usg=AFQjCNHQGwcsGpro26dhxFP6q-fQvwbB0Q']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:ooK-I_HuCkwJ', 'http://www.wordreference.com/enfr/see%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CCAQIDAB&usg=AFQjCNFRlV5Zv_n48Wivr4LeOkTQsA0D1Q']
['http://fr.wikipedia.org/wiki/S%25C3%25A9e&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CCMQFjAC&usg=AFQjCNGmtqmcXPqYZ_nwa0RWL0uYf5PMJw']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:GjcgkyzsUigJ', 'http://fr.wikipedia.org/wiki/S%2525C3%2525A9e%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CCYQIDAC&usg=AFQjCNHesOIBU3OXBspARcONbK_k_8-gnw']
['http://fr.wikipedia.org/wiki/Camille_S%25C3%25A9e&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CCkQFjAD&usg=AFQjCNGO-WIDl4TrBeo88WY9QsopWmsMyQ']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:izhQjC85nOoJ', 'http://fr.wikipedia.org/wiki/Camille_S%2525C3%2525A9e%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CCwQIDAD&usg=AFQjCNEfcIKsKbf026xgWT7NkrAueZvL0A']
['http://de.wikipedia.org/wiki/Zugersee&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CDEQ9QEwBA&usg=AFQjCNHpfJW5-XdsgpFUSP-jEmHjXQUWHQ']
['http://commons.wikimedia.org/wiki/File:Champex_See.jpg&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CDMQ9QEwBQ&usg=AFQjCNEordFWr2QIaob45WlR5Yi-ZvZSiA']
['http://www.all-free-photos.com/show/showphotop.php%3Fidtop%3D4%26lang%3Dfr&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CDUQ9QEwBg&usg=AFQjCNEC24FOIE5cvF4zmEDgq5-5xubM3w']
['http://www.allbestwallpapers.com/travel-zell_am_see,_kaprun,_austria_wallpapers.html&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CDcQ9QEwBw&usg=AFQjCNFkzMZDuthZHvnF-JvyksNUqjt1dQ']
['http://www.see-swe.org/&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CDkQFjAI&usg=AFQjCNF1zbcLfjanxgCXtHoOQXOdMgh_AQ']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:lzh6JxvKUTIJ', 'http://www.see-swe.org/%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CDwQIDAI&usg=AFQjCNFYN6tzzVaHsAc5aOvYNql3Zy4m3A']
['http://fr.wiktionary.org/wiki/see&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CD8QFjAJ&usg=AFQjCNFWYIGc1gj0prytowzqI-0LDFRvZA']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:G9v8lXWRCyQJ', 'http://fr.wiktionary.org/wiki/see%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CEIQIDAJ&usg=AFQjCNENzi4E1n-9qHYsNahY6lQzaW5Xvg']
['http://en.wiktionary.org/wiki/see&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CEUQFjAK&usg=AFQjCNECGZjw-rBUALO43WaTh2yB9BUhDg']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:ywc4URuPdIQJ', 'http://en.wiktionary.org/wiki/see%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CEgQIDAK&usg=AFQjCNE0pykIqXXRl08E-uTtoj03QEpnbg']
['http://see-concept.com/&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CEsQFjAL&usg=AFQjCNGFWjhiH7dEBhITJt01ob_JENlz1Q']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:jHTkOVEoRsAJ', 'http://see-concept.com/%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CE4QIDAL&usg=AFQjCNECPgxt9ZSFmZzK_ker9Hw_FoCi_A']
['http://www.theconjugator.com/la/conjugaison/du/verbe/see.html&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CFEQFjAM&usg=AFQjCNETCTQ0vPDIdV_2Q57qq11dyN0d8Q']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:xD7_Qo7roS8J', 'http://www.theconjugator.com/la/conjugaison/du/verbe/see.html%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CFQQIDAM&usg=AFQjCNF_hBCyDZncivYGnL7je5kYme9hEg']
['http://www.zellamsee-kaprun.com/fr&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CFcQFjAN&usg=AFQjCNFVDeBWrZMDSjK9jKYF4AQlIXa9lA']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:BFBEUp05w7YJ', 'http://www.zellamsee-kaprun.com/fr%252Bsee%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ei=ryv6U6PvEKzA7AaB4ICwCA&ved=0CFoQIDAN&usg=AFQjCNHtrOeEpYWqvT3f0M1p-gxUkYT1IA']
The best way to do this is to use the Google search API (pip install google). GeeksforGeeks has an article about it.
from googlesearch import search

# to search
query = "see"

links = []
for j in search(query, tld="co.in", num=10, stop=10, pause=2):
    links.append(j)
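For completeness, a minimal usage sketch of the collected list (assuming the search call above succeeded and the google package is installed):

# Print whatever the search returned; the exact URLs depend on Google's results.
for url in links:
    print(url)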
# Python 3 version using urllib.request
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("https://www.google.dz/search?q=see")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
I am using Python 2.7 + BeautifulSoup 4.3.2.
I am trying to use Python and BeautifulSoup to pick up information from a web page. Because the page is on the company website and requires login and redirection, I copied the target page's source code into a file and saved it as "example.html" in C:\ for convenience while practicing.
This is a part of the original code:
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span>port_new_cape</td>
<td class="position">452</td>
<td class="details"><div>South</div></td>
<td>May 09, 1997</td>
<td>Jan 23, 2009 12:05 pm </td>
</tr>
The code I worked out so far is:
from bs4 import BeautifulSoup
import re
import urllib2

url = "C:\example.html"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

cities = soup.find_all('span', {'class' : 'city-sh'})
for city in cities:
    print city
This is just the first stage of testing, so it's somewhat incomplete.
However, when I run it, it gives an error message. Seems it’s improper to use urllib2.urlopen to open a local file.
Traceback (most recent call last):
File "C:\Python27\Testing.py", line 8, in <module>
page = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 404, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 427, in _open
'unknown_open', req)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1247, in unknown_open
raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>
How can I practice using a local file?
The best way to open a local file with BeautifulSoup is to pass it an open file handle directly: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
from bs4 import BeautifulSoup

with open("C:\\example.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

    for city in soup.find_all('span', {'class' : 'city-sh'}):
        print(city)
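One unrelated pitfall worth flagging (my note, not part of the answer): in a plain string literal, a backslash can start an escape sequence such as "\t" or "\n", so Windows paths are safest written as raw strings or with forward slashes:

path_raw = r"C:\example.html"    # raw string: backslashes are left alone
path_fwd = "C:/example.html"     # forward slashes also work on Windows
path_esc = "C:\\example.html"    # explicitly escaped backslash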
With Chandan's help, the problem has been solved. All the credit goes to him. :)
The urllib2.urlopen call is not needed here.
from bs4 import BeautifulSoup
import re
# import urllib2

url = "C:\example.html"
page = open(url)
soup = BeautifulSoup(page.read())

cities = soup.find_all('span', {'class' : 'city-sh'})
for city in cities:
    print city
You can also try using the lxml parser. Here is an example for your HTML data:
from lxml.html import fromstring
import lxml.html as PARSER

data = open('example.html').read()
root = PARSER.fromstring(data)

for ele in root.getiterator():
    if ele.tag == "td":
        print ele.text_content()
Output:
port_new_cape
452
South
May 09, 1997
Jan 23, 2009 12:05 pm
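As a small follow-up (my addition; it assumes the same local example.html), lxml's XPath support can select the cells directly instead of iterating over every element:

import lxml.html as PARSER

root = PARSER.fromstring(open('example.html').read())
for td in root.xpath("//td"):
    print(td.text_content().strip())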