Python BeautifulSoup double question mark in xml definition - python

I thought it must be a bug, so I filed a bug report here.
On the other hand, I might be missing something, so I would like another pair of eyes on the code.
The problem is that when I initialize BeautifulSoup with the contents of an .xhtml file, the XML declaration ends up with two question marks at the end.
Can you reproduce the problem? Is there a way to avoid it? Am I missing a function, a method, an argument or something?
Edit0: It's BeautifulSoup 4 on Python 2.x.
Edit1: Why downvote?
The problem:
<?xml version="1.0" encoding="UTF-8"??>
Terminal Output:
>>> from bs4 import BeautifulSoup as bs
>>> with open('example.xhtml', 'r') as f:
...     txt = f.read()
...     soup = bs(txt)
...
>>> print txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
>>> print soup
<?xml version="1.0" encoding="UTF-8"??>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>

This is a bug. I've committed a fix which will be in the next release of Beautiful Soup.
The root cause:
The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
In general, you'll have better results parsing XHTML with the "xml" parser, as ThiefMaster suggested.
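To see the root cause in isolation, here is a minimal sketch using Python 3's standard-library html.parser (the thread itself is about Python 2, so treat this as an illustration rather than the exact code path bs4 took). PICollector is a hypothetical helper name. It only records what the parser hands to handle_pi; the trailing '?' is part of that data, which is presumably how a second '?' ended up in the serialized output.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper: records the data of each processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        # Called with everything between '<?' and '>'.
        self.instructions.append(data)


collector = PICollector()
collector.feed('<?xml version="1.0" encoding="UTF-8"?><html></html>')
collector.close()

# The SGML-style rule ends the instruction at '>', so the XML-style '?'
# is still part of the data that the handler receives.
print(collector.instructions)
```

A tree builder that writes this data back out between "<?" and "?>" would therefore produce "??>".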

Consider using an XML parser:
soup = bs(txt, 'xml')

Using the contents of your 'txt' variable in an example.xhtml file, I can't reproduce the issue with Python 2.7 and the corresponding BeautifulSoup module (not bs4). It works fine for me.
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> print soup
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
</head>
<body>
</body>
</html>
What issue are you encountering with it, as in, what is your end aim? Maybe then someone can suggest a workaround.

Related

How do I distinguish between XML and HTML programmatically in Python?

I am sending an http request and get an http response, but I'd like to be able to extract the body of the response and know whether it contains XML or HTML.
Ideally, this method should work even if the content type isn't clear in the response (ie. it should work for websites where content type isn't necessarily specified).
Currently, I'm using lxml to parse the html/xml, but don't know at parse time whether I'm dealing with HTML or XML.
You can check the Content-Type header to see which type of response you got:
import requests
response = requests.get(URL)  # URL is the address you are requesting
file_type = response.headers['content-type']
print(file_type)
# prints: text/html; charset=utf-8
You can also do
print(file_type.split(';')[0].split('/')[1])
to get "html" or "xml" as output.
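Since the question mentions that the header may be missing, a slightly more defensive version of the same idea might look like the sketch below. classify_content_type is a hypothetical helper name, and the list of media types it recognizes is an assumption on my part, not an exhaustive registry.

```python
def classify_content_type(header_value):
    """Return 'html', 'xml', or None for a Content-Type header value.

    Hypothetical helper: the set of media types checked here is an
    assumption, not a complete list.
    """
    if not header_value:
        return None  # header missing or empty
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = header_value.split(';', 1)[0].strip().lower()
    if media_type == 'text/html':
        return 'html'
    if media_type in ('text/xml', 'application/xml') or media_type.endswith('+xml'):
        return 'xml'
    return None


print(classify_content_type('text/html; charset=utf-8'))
print(classify_content_type('application/xhtml+xml'))
print(classify_content_type(None))
```

With requests you would pass it response.headers.get('content-type'), which returns None instead of raising KeyError when the header is absent.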
I don't understand why you would like to do it, and I'm sure there is a better way to do it. But...
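The most visible difference between XML and HTML is the declaration at the top: an HTML document normally starts with <!DOCTYPE html>, while an XML document usually starts with <?xml version="1.0"?>. Note that the XML declaration is optional, so its absence alone does not prove the document is HTML.
Example of XML
<?xml version="1.0"?>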
<address>
<name> Krishna Rungta</name>
<contact>9898613050</contact>
<email>krishnaguru99#gmail.com </email>
<birthdate>1985-09-27</birthdate>
</address>
Example of HTML
<!DOCTYPE html>
<html>
<head>
<title> Page title </title> </head>
<body>
<h1> First Heading</h1> <p> First paragraph.</p> </body>
</html>
If I were you, I would use BeautifulSoup to select the DOCTYPE; if you can't find it, the document is probably XML. You can see how to do that here.
If this doesn't answer your question try reading this or try using this library
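If you would rather not parse anything first, the declaration check described above can be sketched directly on the raw response bytes. This is only a heuristic under stated assumptions: sniff_markup is a hypothetical name, it looks only at the first kilobyte, and an XHTML page that begins with an XML declaration will be reported as 'xml'.

```python
import re

# Assumption: the declaration, if any, appears near the very start of the body.
_XML_DECL = re.compile(rb'\s*<\?xml\b')
_HTML_START = re.compile(rb'\s*<(?:!doctype\s+html|html)\b', re.IGNORECASE)


def sniff_markup(body):
    """Guess 'xml', 'html', or 'unknown' from the start of a bytes body."""
    if body.startswith(b'\xef\xbb\xbf'):
        body = body[3:]  # skip a UTF-8 byte order mark
    head = body[:1024]
    if _XML_DECL.match(head):
        return 'xml'
    if _HTML_START.match(head):
        return 'html'
    return 'unknown'


print(sniff_markup(b'<?xml version="1.0"?>\n<address></address>'))
print(sniff_markup(b'<!DOCTYPE html>\n<html></html>'))
```

With requests you could call sniff_markup(response.content) as a fallback when the Content-Type header is missing or unhelpful.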

Scraping Amazon deals page not returning html code - python

I am currently trying to scrape this Amazon page "https://www.amazon.com/b/?ie=UTF8&node=11552285011&ref_=sv_kstore_5" with the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.amazon.com/b/?ie=UTF8&node=11552285011&ref_=sv_kstore_5'
r = requests.get(url)
soup = BeautifulSoup(r.content)
print(soup.prettify)
However when I run it instead of getting the simple html source code I get a bunch of lines which don't make much sense to me starting like this:
<bound method Tag.prettify of <!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/><!-- emit CSM JS -->
<style>
[class*=scx-line-clamp-]{overflow:hidden}.scx-offscreen-truncate{position:relative;left:-1000000px}.scx-line-clamp-1{max-height:16.75px}.scx-truncate-medium.scx-line-clamp-1{max-height:20.34px}.scx-truncate-small.scx-line-clamp-1{max-height:13px}.scx-line-clamp-2{max-height:35.5px}.scx-truncate-medium.scx-line-clamp-2{max-height:41.67px}.scx-truncate-small.scx-line-clamp-2{max-height:28px}.scx-line-clamp-3{max-height:54.25px}.scx-truncate-medium.scx-line-clamp-3{max-height:63.01px}.scx-truncate-small.scx-line-clamp-3{max-height:43px}.scx-line-clamp-4{max-height:73px}.scx-truncate-medium.scx-line-clamp-4{max-height:84.34px}.scx-truncate-small.scx-line-clamp-4{max-height:58px}.scx-line-clamp-5{max-height:91.75px}.scx-truncate-medium.scx-line-clamp-5{max-height:105.68px}.scx-truncate-small.scx-line-clamp-5{max-height:73px}.scx-line-clamp-6{max-height:110.5px}.scx-truncate-medium.scx-line-clamp-6{max-height:127.01
And even when I scroll down, there is nothing that really resembles structured HTML code with all the info I need. What am I doing wrong? (I am a beginner, so it could be anything really.) Thank you very much!
print(soup.prettify)
does not call the method; it prints the representation of the bound method object, as if you had called soup.prettify.__repr__(). The output is
<bound method Tag.prettify of <!DOCTYPE html><html class="a-no-js" data-19ax5a9jf="dingo"><head>...
while you need to call the prettify method:
print(soup.prettify())
The output:
<html class="a-no-js" data-19ax5a9jf="dingo">
<head>
<script>
var aPageStart = (new Date()).getTime();
</script>
<meta charset="utf-8"/>
<!-- emit CSM JS -->
<style>
...
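The same thing happens with any Python method, not just BeautifulSoup's. Here is a minimal sketch with a made-up Document class (a hypothetical stand-in, not part of bs4) that shows the difference between referencing a method and calling it:

```python
class Document:
    """Hypothetical stand-in for a parsed document with a prettify() method."""

    def __init__(self, markup):
        self.markup = markup

    def prettify(self):
        # Toy formatting: put each tag on its own line.
        return self.markup.replace('><', '>\n<')


doc = Document('<html><head></head></html>')

method_repr = repr(doc.prettify)  # what print(doc.prettify) would display
pretty = doc.prettify()           # the parentheses actually run the method

print(method_repr)
print(pretty)
```

The first print shows a "<bound method ...>" description, the second shows the formatted markup.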

Parse XHTML5 with undefined entities

Please consider this:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser.entity['nbsp'] = ' '
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
which prints a nice text representation of the XHTML string.
But for the same XHTML document with an HTML5 doctype:
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
I get Exception:
xml.etree.ElementTree.ParseError: undefined entity: line 5, column 19
so the parser can't handle it, although I added nbsp to entities dict.
Same happens if I use lxml:
from lxml import etree
parser = etree.XMLParser(resolve_entities=False)
tree = etree.fromstring(xhtml, parser=parser)
print(etree.tostring(tree, method='xml'))
raises:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 26
although I've set the parser not to resolve entities.
Why is this, and how can I parse XHTML files that have an HTML5 doctype declaration?
A partial solution for lxml is to use the recover option:
parser = etree.XMLParser(resolve_entities=False, recover=True)
but I'm still waiting for a better one.
The problem here is, the Expat parser used behind the scenes won't usually report unknown entities - it will rather throw an error, so the fallback code in xml.etree.ElementTree you were trying to trigger won't even run. You can use the UseForeignDTD method to change this behavior, it will make Expat ignore the doctype declaration and pass all entity declarations to xml.etree.ElementTree. The following code works correctly:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser._parser.UseForeignDTD(True)
parser.entity['nbsp'] = u'\u00A0'
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
The side-effect of this approach: as I said, the doctype declaration is completely ignored. This means that you have to declare all entities, even the ones supposedly covered by the doctype.
Note that the values you put into the ElementTree.XMLParser.entity dictionary have to be regular strings, that is, the text the entity will be replaced by; you can no longer refer to other entities there. So it should be u'\u00A0' for &nbsp;.
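If reaching into the parser's private attributes is not an option, a different technique (not the UseForeignDTD approach described above) is to declare the entities yourself in an internal DTD subset before parsing. The sketch below assumes Python 3 and a document that starts with a bare <!DOCTYPE html>. parse_xhtml5 is a hypothetical helper name, and the entity table comes from the standard html.entities module.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML predefines these, so they are not declared again.
_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}


def parse_xhtml5(text):
    """Parse XHTML that has a bare HTML5 doctype and uses HTML named entities."""
    # Declare each named entity as a numeric character reference,
    # e.g. <!ENTITY nbsp "&#160;">.
    declarations = ''.join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in _PREDEFINED
    )
    doctype = '<!DOCTYPE html [%s]>' % declarations
    # Swap the bare doctype for one that carries the declarations.
    patched, count = re.subn(
        r'<!DOCTYPE\s+html\s*>', lambda match: doctype, text,
        count=1, flags=re.IGNORECASE)
    if count == 0:
        raise ValueError('expected a bare <!DOCTYPE html> declaration')
    return ET.fromstring(patched)


root = parse_xhtml5(
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>&nbsp;Sample text</p></body></html>'
)
paragraph = root.find('.//{http://www.w3.org/1999/xhtml}p')
print(repr(paragraph.text))
```

The trade-off is the same as noted above: only the entities you declare are known, and the document text is modified before it reaches the parser.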

HTML encoding and lxml parsing

I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:
1.
<!DOCTYPE html>
<html lang='en'>
<head>
<title>Unicode Chars: 은 —’</title>
<meta charset='utf-8'>
</head>
<body></body>
</html>
2.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
<title>Unicode Chars: 은 —’</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>
3.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>
My basic script:
from lxml.html import fromstring
...
doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title
The results are:
Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’
So, obviously an issue with sample 1 and the missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8 and so it is functionally equivalent to my original code.
The lxml docs appear conflicted:
From here the example seems to suggest we should use UnicodeDammit to encode the markup as unicode.
from BeautifulSoup import UnicodeDammit
def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode
root = lxml.html.fromstring(decode_html(tag_soup))
However here it says:
[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
If I try to follow the first suggestion in the lxml docs, my code is now:
from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title
I now get the following results:
Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.
Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.
Is there a correct way to handle all of these cases? Is there a better solution than the following?
dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)
lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:
#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit
for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()
        doc = UnicodeDammit(content, is_html=True)
    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)
Output
Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
The problem probably stems from the fact that <meta charset> is a relatively new standard (HTML5 if I'm not mistaken, or it wasn't really used before it.)
Until such a time when the lxml.html library is updated to reflect it, you will need to handle that case specially.
If you only care about ISO-8859-* and UTF-8, and can afford to throw away non-ASCII compatible encodings (such as UTF-16 or the East Asian traditional charsets), you can do a regular expression substitution on the byte string, replacing the newer <meta charset> with the older http-equiv format.
Otherwise, if you need a proper solution, your best bet is to patch the library yourself (and contribute the fix while you're at it). You might want to ask the lxml developers if they have any half-baked code lying around for this particular bug, or whether they are tracking it in their bug tracker in the first place.
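For the regular-expression route, a rough sketch could look like the following. It works on the raw byte string, so it assumes an ASCII-compatible encoding as described above. downgrade_meta_charset is a hypothetical name, and the pattern only covers the simple <meta charset=...> form where charset is the first attribute.

```python
import re

# Matches <meta charset="utf-8">, <meta charset='utf-8' /> or <meta charset=utf-8>.
META_CHARSET = re.compile(
    rb'''<meta\s+charset\s*=\s*["']?([\w.:-]+)["']?\s*/?>''',
    re.IGNORECASE,
)


def downgrade_meta_charset(raw_html):
    """Rewrite the first <meta charset> tag into the older http-equiv form."""
    def replacement(match):
        charset = match.group(1)
        return (b'<meta http-equiv="Content-Type" '
                b'content="text/html; charset=' + charset + b'" />')
    return META_CHARSET.sub(replacement, raw_html, count=1)


sample = b"<head><title>t</title><meta charset='utf-8'></head>"
print(downgrade_meta_charset(sample))
```

The rewritten bytes can then be passed to the parser unchanged otherwise; documents without a <meta charset> tag come back untouched.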

Only Firefox displays HTML Code and not the page

I have this complicated problem that I can't find an answer to.
I have a Python HTTPServer running that serves webpages. These webpages are created at runtime with the help of Beautiful Soup. The problem is that Firefox shows the HTML code for the webpage and not the actual page. I really don't know what is causing this problem:
- Python HTTPServer
- Beautiful Soup
- HTML Code
Any case, I have copied parts of the webpage HTML:-
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>
My title
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script src="./123_ui.js">
</script>
</head>
<body>
<div>
Hellos
</div>
</body>
</html>
Just to help you, here are the things that I have already tried-
- I have made sure that Python HTTPServer is sending the MIME header as text/html
- Copying and pasting the HTML code into a static file shows the correct page. From this I can tell that the problem is on the HTTPServer side
- The Firebug shows that is empty and "This element has no style rules. You can create a rule for it." is displayed
I just want to know if the error is in Beautiful Soup or HTTPServer or HTML?
Thanks,
Amit
Why are you adding this at the top of the document?
<?xml version="1.0" encoding="iso-8859-1"?>
That will make the browser think the entire document is XML and not XHTML. Removing that line should make it render correctly. I assume Firefox is displaying a page with a bunch of elements which you can expand/collapse to see the content like it normally would for an XML document, even though the HTTP headers might say it's text/html.
So guys,
I have finally solved this problem. The reason was that I wasn't sending the MIME header with content type "text/html" (even though I thought I was).
In Python's HTTPServer, before writing anything to the response you always send the status line and headers first, like this:
self.send_response(301)
self.send_header("Location", self.path + "/")
self.end_headers()
# Once you have called the above methods, you can send the HTML to Client
self.wfile.write('ANY HTML CODE YOU WANT TO WRITE')
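The snippet above shows the general order of calls using a redirect. For the Content-Type fix itself, a minimal sketch might look like this. It assumes Python 3's http.server (the thread predates it), and PageHandler, PAGE, and fetch_once are hypothetical names introduced only for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = (b"<!DOCTYPE html><html><head><title>My title</title></head>"
        b"<body><div>Hellos</div></body></html>")


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Status line first, then headers, then the body.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=iso-8859-1")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def fetch_once():
    """Serve one request on a free local port; return (content type, body)."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        body = response.read()
        conn.close()
        return response.getheader("Content-Type"), body
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_once() lets you confirm the header the browser will actually see, which is the check that would have caught the original problem.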
