I'm using Python3, BeautifulSoup4
When I run code below, it gives just url "www.google.com" not XML.
I couldn't find it What is wrong.
from bs4 import BeautifulSoup
import urllib
html = "www.google.com"
soup = BeautifulSoup(html)
print (soup.prettify())
You need to use urllib2 or a similar library to fetch the HTML
import urllib2
html = urllib2.urlopen("www.google.com")
soup = BeautifulSoup(html)
print (soup.prettify())
EDIT: Just as a side note to clarify why I suggested urllib2. If you read the urllib documentation, you'll find "The urlopen() function has been removed in Python 3 in favor of urllib2.urlopen()." Given that you have tagged Python3, urllib2 would probably be your best option.
Related
I've been trying to get this to work, but keep getting the same TypeError object has no len(). The BeautifulSoup documentation hasn't been any help. This seems to work on every tutorial I watch, and read, but not for me. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
print(http)
This returns Response [200], but if I try to add soup... I get the len error:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
soup = BeautifulSoup(http, 'lxml')
print(soup)
As the docs say:
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
A Response object is neither a string nor an open filehandle.
The simplest way to get one of the two, as shown in the first example in the requests docs, is the .text attribute. So:
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
soup = BeautifulSoup(http.text, 'lxml')
For other options see Response Content—e.g., you can get the bytes with .content to let BeautifulSoup guess at the encoding instead of reading it from the headers, or get the socket (which is an open filehandle) with .raw.
My final code. It just prints out the title, year and summary, which was all I wanted. Thank all of you for your help.
import requests
import lxml
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
soup = BeautifulSoup(http.content, 'lxml')
title = soup.find("div", class_="title_wrapper").find()
summary = soup.find(class_="summary_text")
print(title.text)
print(summary.text)
The Response-200 that you getting from the following code:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
print(http)
shows that the your request is succeeded and returned a response. In-order to parse the HTML code there are two ways:
Direct print the text/String format
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
print(http.text)
Use a HTML parser
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
soup = BeautifulSoup(http.text, 'lxml')
print(soup)
it is better to use BeautifulSoup as using this will allow you to extract the required data from HTML, in-case you need it
Hi I'm trying to parse html from this website
However it takes for ever for the soup to load the whole html (about 17 seconds to print to terminal), I do realize this is only because of the website itself (as other directories seem to load instantly), but here is my code just in case:
import urllib2
from bs4 import BeautifulSoup
url1 = 'http://www.ukpets.co.uk/ukp/?sf=1716769780&rtn=temp87_224_76_126_at_1456&display_profile=§ion=Commercial&sub=Search_&rws=&method=search&tb=comdir1_8&class=comdir1_8&search_form=on&rf=coname&st=Food'
soup = BeautifulSoup(urllib2.urlopen(url1), 'lxml')
print soup
So my question, is there any other parser that could get this job done faster or can i use something along with bs
P.S. also tried selenium
I dunno what the problem is for you but this series of statements executed in the blink of an eye on my old computer. You could try doing this.
>>> from bs4 import BeautifulSoup
>>> from urllib.request import urlopen
>>> URL = 'http://www.ukpets.co.uk/ukp/?sf=1716769780&rtn=temp87_224_76_126_at_1456&display_profile=§ion=Commercial&sub=Search_&rws=&method=search&tb=comdir1_8&class=comdir1_8&search_form=on&rf=coname&st=Food'
>>> HTML = urlopen ( URL )
>>> soup = BeautifulSoup ( HTML )
C:\Python34\lib\site-packages\bs4\__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
I would like to retrieve data given in a SDMX file (like https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried to use BeautifulSoup, but it seems, it does not see the tags. In the following the code
import urllib2
from bs4 import BeautifulSoup
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")
which gives me an empty object.
Is BS4 the wrong tool, or (more likely) what am I doing wrong?
Thanks in advance
soup.findAll("bbk:series") would return the result.
In fact, in this case, even you use lxml as the parser, BeautifulSoup still parse it as html, since html tags are case insensetive, BeautifulSoup downcases all the tags, thus soup.findAll("bbk:series") works. See Other parser problems from the official doc.
If you want to parse it as xml, use soup = BeautifulSoup(html_source, 'xml') instead. It also uses lxml since lxml is the only xml parser BeautifulSoup has. Now you can use ts_series = soup.findAll("Series") to get the result as beautifulSoup will strip the namespace part bbk.
I am trying to get all the urls on a website using python. At the moment I am just copying the websites html into the python program and then using code to extract all the urls. Is there a way I could do this straight from the web without having to copy the entire html?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters I suggest to have a look at the requests library.
The most straightforward would probably be urllib.urlopen if you're using python2, or urllib.request.urlopen if you're using python3 (you have to do import urllib or import urllib.request first of course). That way you get an file like object from which you can read (ie f.read()) the html document.
Example for python 2:
import urllib
f = urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part which is analyzing the html document for links.
You might want to use the bs4(BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can download bs4 with the followig command at the cmd line. pip install BeautifulSoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
print urlparse.urljoin(url, link['href'])
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and supply it into the BeautifulSoup, which has done all the job to extract the DOM, and get all URLs, i.e. <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, parseOnlyThese=SoupStrainer('a'))
for a_element in bs:
if a_element.has_attr('href'):
print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...
I need to parse a url to get a list of urls that link to a detail page. Then from that page I need to get all the details from that page. I need to do it this way because the detail page url is not regularly incremented and changes, but the event list page stays the same.
Basically:
example.com/events/
Event 1
Event 2
example.com/events/1
...some detail stuff I need
example.com/events/2
...some detail stuff I need
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
soup.prettify()
for anchor in soup.findAll('a', href=True):
print anchor['href']
It will give you the list of urls. Now You can iterate over those urls and parse the data.
inner_div = soup.findAll("div", {"id": "y-shade"})
This is an example. You can go through the BeautifulSoup tutorials.
For the next group of people that come across this, BeautifulSoup has been upgraded to v4 as of this post as v3 is no longer being updated..
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
To use in Python...
import bs4 as BeautifulSoup
Use urllib2 to get the page, then use beautiful soup to get the list of links, also try scraperwiki.com
Edit:
Recent discovery: Using BeautifulSoup through lxml with
from lxml.html.soupparser import fromstring
is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector') which is a life saver. Just make sure you have a good version of BeautifulSoup installed. 3.2.1 works a treat.
dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in htm.cssselect('#navigation a')]
FULL PYTHON 3 EXAMPLE
Packages
# urllib (comes with standard python distribution)
# pip3 install beautifulsoup4
Example:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('https://www.wikipedia.org/') as f:
data = f.read().decode('utf-8')
d = BeautifulSoup(data)
d.title.string
The above should print out 'Wikipedia'