Read HEAD contents from HTML - python

I need a small Python script that reads one specific block from a web page.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # This prints the whole page source with HTML tags, but
# I only need the section from <head> to </head>.
# For example, the source of http://target.com is:
# <html>
# <body>
# <head>
# ... need to read this section ...
# </head>
# ... page source ...
# </body>
# </html>
How do I read only that section?

To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. That it works in some cases does not make it the standard or proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!
Just to give you a heads up, you have the_page which contains the HTML data.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.

from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>

One solution would be to use the awesome Python library Beautiful Soup. It lets you parse HTML/XML pretty easily, and it will try to help out when the documents are broken or invalid.
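For readers on Python 3 with Beautiful Soup 4, the same idea can be sketched as below. This is a minimal sketch under assumptions, not the answerers' code: the inline `the_page` string stands in for the downloaded page so the example runs without network access, and `html.parser` is simply the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page source (assumption: in real use this
# would come from urllib.request.urlopen(url).read()).
the_page = """<html>
<head>
<meta charset="utf-8">
<title>Example Web Page</title>
</head>
<body><p>page source</p></body>
</html>"""

soup = BeautifulSoup(the_page, "html.parser")

# soup.head is shorthand for soup.find("head").
head = soup.head

# The whole element, including the <head> and </head> tags.
print(head)

# Only what sits between <head> and </head>.
inner = head.decode_contents()
print(inner)

# Individual pieces are reachable by tag name.
print(head.title.string)
```

`str(head)` keeps the surrounding tags, while `decode_contents()` returns just the inner markup, so the choice depends on whether the tags themselves are wanted.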

Related

Why python requests module not pulling the whole html?

The link: https://www.hyatt.com/explore-hotels/service/hotels
code:
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
Tried also this:
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
output:
<!DOCTYPE html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag that contains the data; only a single script tag is shown.
How can I access the data (shown in the browser view, where it looks like JSON)?
I don't believe this can be done...That data simply isn't in the r.text
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see there is no <pre> tag for whatever reason. So you're unable to access that.
I also get a 429 error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? Because this site doesn't seem to be willing to do anything. Some sites are unable to be parsed, for various reasons. If you're wanting to play with JSON data I would look into using an API instead.
If you google https://www.hyatt.com and manually go to the URL you mentioned you get a 404 error.
I would say Hyatt don't want you parsing their site. So don't!
The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a useable form with the standard json package:
import json
import requests
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
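Since the answers above disagree about what the server actually returns, one cautious approach is to look at the Content-Type header before choosing how to parse the body. The following is only a sketch: `parse_body` is a hypothetical helper, and the sample header and body are made-up illustrative values, not real output from the site.

```python
import json


def parse_body(content_type, body):
    """Return parsed JSON if the Content-Type announces JSON, else None."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    return None


# Illustrative values only (assumptions, not a captured response).
sample_type = "application/json; charset=utf-8"
sample_body = '{"hotels": [{"name": "Example Hotel"}]}'

data = parse_body(sample_type, sample_body)
print(data["hotels"][0]["name"])

# An HTML response is left for an HTML parser instead.
html_result = parse_body("text/html", "<html></html>")
print(html_result)
```

With requests, the two arguments would be `r.headers.get("Content-Type", "")` and `r.text`. If the helper returns None, the body is probably the HTML stub shown earlier rather than the data.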

Scraping Amazon deals page not returning html code - python

I am currently trying to scrape this Amazon page "https://www.amazon.com/b/?ie=UTF8&node=11552285011&ref_=sv_kstore_5" with the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.amazon.com/b/?ie=UTF8&node=11552285011&ref_=sv_kstore_5'
r = requests.get(url)
soup = BeautifulSoup(r.content)
print(soup.prettify)
However when I run it instead of getting the simple html source code I get a bunch of lines which don't make much sense to me starting like this:
<bound method Tag.prettify of <!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><!-- emit CSM JS -->
<style>
[class*=scx-line-clamp-]{overflow:hidden}.scx-offscreen-truncate{position:relative;left:-1000000px}.scx-line-clamp-1{max-height:16.75px}.scx-truncate-medium.scx-line-clamp-1{max-height:20.34px}.scx-truncate-small.scx-line-clamp-1{max-height:13px}.scx-line-clamp-2{max-height:35.5px}.scx-truncate-medium.scx-line-clamp-2{max-height:41.67px}.scx-truncate-small.scx-line-clamp-2{max-height:28px}.scx-line-clamp-3{max-height:54.25px}.scx-truncate-medium.scx-line-clamp-3{max-height:63.01px}.scx-truncate-small.scx-line-clamp-3{max-height:43px}.scx-line-clamp-4{max-height:73px}.scx-truncate-medium.scx-line-clamp-4{max-height:84.34px}.scx-truncate-small.scx-line-clamp-4{max-height:58px}.scx-line-clamp-5{max-height:91.75px}.scx-truncate-medium.scx-line-clamp-5{max-height:105.68px}.scx-truncate-small.scx-line-clamp-5{max-height:73px}.scx-line-clamp-6{max-height:110.5px}.scx-truncate-medium.scx-line-clamp-6{max-height:127.01
And even when I scroll down, there is nothing that really resembles structured HTML code with all the info I need. What am I doing wrong? (I am a beginner, so it could be anything really.) Thank you very much!
print(soup.prettify)
does not call the method; it prints the repr of the bound method object itself (soup.prettify.__repr__()). The output is
<bound method Tag.prettify of <!DOCTYPE html><html class="a-no-js" data-19ax5a9jf="dingo"><head>...
while you need to call the prettify method:
print(soup.prettify())
The output:
<html class="a-no-js" data-19ax5a9jf="dingo">
<head>
<script>
var aPageStart = (new Date()).getTime();
</script>
<meta charset="utf-8"/>
<!-- emit CSM JS -->
<style>
...
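The difference between naming a method and calling it is not specific to Beautiful Soup. A minimal sketch with a hypothetical `Page` class (a stand-in, not part of bs4) shows the same effect:

```python
class Page:
    """Hypothetical stand-in for a parsed document."""

    def prettify(self):
        return "<html>\n <head>\n </head>\n</html>"


page = Page()

# Without parentheses we get the method object; printing it shows its repr.
as_object = repr(page.prettify)
print(as_object)

# With parentheses the method runs and returns the formatted string.
as_text = page.prettify()
print(as_text)
```

The first print shows a "bound method" description, the second shows the markup.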

Parsing MS specific html tags in BeautifulSoup

When trying to parse an email sent using MS Outlook, I want to be able to strip the annoying Microsoft XML tags that it has added. One such example is the o:p tag. When trying to use Python's BeautifulSoup to parse an email as HTML, it can't seem to find these specialty tags.
For example:
from bs4 import BeautifulSoup
textToParse = """
<html>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "html5lib")
body = soup.find('body')
for otag in body.find_all('o'):
    print(otag)
for otag in body.find_all('o:p'):
    print(otag)
This will output no text to the console, but if I switched the find_all call to search for p then it would output the p node as expected.
How come these custom tags do not seem to work?
It's a namespace issue. Apparently, BeautifulSoup does not consider custom namespaces valid when parsed with "html5lib".
You can work around this with a regular expression, which – strangely – does work correctly!
import re
print(soup.find_all(re.compile('o:p')))
>>> [<o:p>This should go</o:p>]
but the "proper" solution is to change the parser to "lxml-xml" and introduce o: as a valid namespace.
from bs4 import BeautifulSoup
textToParse = """
<html xmlns:o='dummy_url'>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "lxml-xml")
body = soup.find('body')
print ('this should find nothing')
for otag in body.find_all('o'):
    print(otag)
print ('this should find o:p')
for otag in body.find_all('o:p'):
    print(otag)
>>>
this should find nothing
this should find o:p
<o:p>This should go</o:p>
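The question's actual goal was to strip these tags, which the answer stops short of showing. A possible sketch follows, reusing the regular-expression match from the workaround above. Note the assumptions: it uses Python's built-in `html.parser` rather than the "html5lib" parser from the question, and it removes every tag whose name starts with `o:` together with its contents.

```python
import re

from bs4 import BeautifulSoup

text_to_parse = (
    "<html><body>"
    "<p><o:p>This should go</o:p>Paragraph</p>"
    "</body></html>"
)

# Assumption: html.parser keeps "o:p" as the literal tag name.
soup = BeautifulSoup(text_to_parse, "html.parser")

# Match any tag whose name begins with the "o:" prefix.
for otag in soup.find_all(re.compile(r"^o:")):
    # decompose() deletes the tag and everything inside it.
    otag.decompose()

result = str(soup.p)
print(result)
```

If the text inside the Outlook tags should survive and only the wrapper should disappear, `otag.unwrap()` can be used in place of `otag.decompose()`.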

Beautiful Soup, Python and the swedish characters ÅÄÖ

I'm using BeautifulSoup to scrape a Swedish web page. On the web page, the information I want to extract looks like this:
"Öhman Företagsobligationsfond"
When I print the information from the Python script it looks like this:
"Ã–hman FÃ¶retagsobligationsfond"
I'm new to Python. I have searched for answers and tried putting # -*- coding: utf-8 -*- at the beginning of the code, but it does not work.
I'm thinking of moving from Sweden to solve this issue.
When using # -*- coding: utf-8 -*- you only specify the encoding of the source code file. The page that you are parsing has probably declared a faulty encoding (or none at all), and therefore Beautiful Soup guesses wrong. Try to specify the encoding when building the soup. Here's a small example:
markup = '''
<html>
<head>
<title>Övriga fakta</title>
<meta charset="latin-1" />
</head>
<body>
<h1>Öhman Företagsobligationsfond</h1>
<p>Detta är en svensk sida.</p>
</body>
</html>
'''
soup = BeautifulSoup(markup)
print soup.find('h1')
try:
    # Version 4
    soup = BeautifulSoup(markup, from_encoding='utf-8')
except TypeError:
    # Version 3
    soup = BeautifulSoup(markup, fromEncoding='utf-8')
print soup.find('h1')
The output from this is:
<h1>Ãhman Företagsobligationsfond</h1>
<h1>Öhman Företagsobligationsfond</h1>
In Beautiful Soup 4, the parameter is from_encoding, while in version 3, the parameter is fromEncoding.
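The garbling itself can be reproduced without Beautiful Soup, which may help show why the encoding hint matters. This Python 3 sketch assumes the page bytes are UTF-8 and that they get decoded as Windows-1252 by mistake; the exact wrong codec in a real case may differ.

```python
original = "Öhman Företagsobligationsfond"

# What the server sends: the text encoded as UTF-8 bytes.
raw = original.encode("utf-8")

# Decoding with the wrong codec turns each two-byte character into two
# unrelated characters (assumed wrong codec: cp1252).
garbled = raw.decode("cp1252")
print(garbled)

# Decoding with the codec that was actually used gives the text back.
correct = raw.decode("utf-8")
print(correct)

# Already-garbled text can sometimes be rescued by reversing the mistake.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

Passing the right encoding to the parser, as in the answer above, avoids the wrong decode in the first place.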

Issues with BeautifulSoup parsing

I am trying to parse an HTML page with BeautifulSoup, but it appears that BeautifulSoup doesn't like the HTML of that page at all. When I run the code below, prettify() returns only the script block of the page (see below). Does anybody have an idea why this happens?
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = "".join(urllib2.urlopen(url).readlines())
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()
This is the output produced by BeautifulSoup.
-- BeautifulSoup ---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script language="JavaScript">
<!--
function highlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
}
function unhighlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
}
//-->
</script>
Thanks!
UPDATE: I am using the following version, which appears to be the latest.
__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "3.1.0.1"
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
__license__ = "New-style BSD"
Try with version 3.0.7a as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0 so they had to change the parser from SGMLParser to HTMLParser which seems more vulnerable to bad HTML.
From the changelog for BeautifulSoup 3.1:
"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"
Try lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup, so it might work better for you. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
BeautifulSoup isn't magic: if the incoming HTML is too horrible then it isn't going to work.
In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance it contains markup like:
SCRIPT type=""javascript""
(Notice the double quoting.)
The BeautifulSoup docs contain a section on what you can do if BeautifulSoup can't parse your markup. You'll need to investigate those alternatives.
Samj: If I get things like
HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>"
I just remove the culprit from markup before I serve it to BeautifulSoup and all is dandy:
html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>","")
soup = BeautifulSoup(html)
I had problems parsing the following code too:
<script>
function show_ads() {
document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
}
</script>
HTMLParseError: bad end tag: u'', at line 26, column 127
Sam
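The one-off replace above can be generalised a little so that both quoting styles seen in this thread are covered. This is only a sketch of the same workaround in Python 3: `scrub` and `SPLIT_SCRIPT_END` are hypothetical names, and removing the fragment changes the embedded JavaScript, so the cleaned text is suitable for parsing only, not for serving.

```python
import re

# Matches a closing script tag that was split by string concatenation,
# e.g.  </scr' + 'ipt>  or  </scr"+"ipt>
SPLIT_SCRIPT_END = re.compile(r"""</scr["']\s*\+\s*["']ipt>""")


def scrub(html):
    """Remove split </script> fragments that trip up strict parsers."""
    return SPLIT_SCRIPT_END.sub("", html)


sample = 'document.write("<sc"+"ript src=\'ads.js\'></scr"+"ipt>");'
cleaned = scrub(sample)
print(cleaned)
```

The cleaned string would then be passed to BeautifulSoup in place of the raw page source.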
I tested this script on BeautifulSoup version '3.0.7a' and it returns what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1' but give it a try.
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()
In my case by executing the above statements, it returns the entire HTML page.
