I am sending an HTTP request and getting an HTTP response, but I'd like to be able to extract the body of the response and know whether it contains XML or HTML.
Ideally, this method should work even if the content type isn't clear in the response (i.e. it should work for websites where the content type isn't necessarily specified).
Currently, I'm using lxml to parse the HTML/XML, but I don't know at parse time whether I'm dealing with HTML or XML.
You can check the Content-Type header to see which type of response you got:
import requests
response = requests.get(URL)
file_type = response.headers['content-type']
print(file_type)
>>>'text/html; charset=utf-8'
You can also do
print(file_type.split(';')[0].split('/')[1])
to get "html" or "xml" as output
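If the server omits the Content-Type header entirely (the situation the question worries about), one fallback is to sniff the first non-whitespace characters of the body. This is only a heuristic sketch, and sniff_type is a hypothetical helper:
def sniff_type(body):
    # Inspect only the first couple of hundred characters
    head = body.lstrip()[:200].lower()
    if head.startswith('<?xml'):
        return 'xml'
    if head.startswith('<!doctype html') or '<html' in head:
        return 'html'
    return 'unknown'

print(sniff_type(response.text))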
I don't understand why you would like to do it, and I'm sure there is a better way to do it. But...
The difference between XML and HTML is the declaration: HTML should start with <!DOCTYPE html>, and XML with <?xml version="1.0"?>.
Example of XML
<?xml version="1.0"?>
<address>
<name> Krishna Rungta</name>
<contact>9898613050</contact>
<email>krishnaguru99@gmail.com</email>
<birthdate>1985-09-27</birthdate>
</address>
Example of HTML
<!DOCTYPE html>
<html>
<head>
<title> Page title </title> </head>
<body>
<h1> First Heading</h1> <p> First paragraph.</p> </body>
</html>
If I were you, I would use BeautifulSoup to look for the DOCTYPE, and if you can't find one, that means it is XML; a sketch of that check follows.
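A minimal sketch of that DOCTYPE check, assuming BeautifulSoup 4 (bs4) is installed; looks_like_html is a hypothetical helper name:
from bs4 import BeautifulSoup
from bs4.element import Doctype

def looks_like_html(text):
    soup = BeautifulSoup(text, 'html.parser')
    # A Doctype node among the top-level children suggests HTML
    return any(isinstance(node, Doctype) for node in soup.contents)

A document that begins with <?xml ...?> instead will have no Doctype node, so the function returns False and you can fall back to treating it as XML.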
Related
I'm trying to parse some HTML using the xml Python library. The HTML I'm trying to parse is from download.docker.com, which looks like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</title>
</head>
<body>
<h1>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</h1>
<hr>
<pre>../
containerd.io_1.5.10-1_amd64.deb
...
</pre><hr></body></html>
Parsing the HTML with the following code,
import urllib.request
import xml.etree.ElementTree as ET
html_doc = urllib.request.urlopen(<MY_URL>).read()
root = ET.fromstring(html_doc)
>>> ParseError: mismatched tag: line 6, column 2
Unless I'm mistaken, this is because of the unclosed <meta charset="UTF-8"> tag. Using something like lxml, I can make this work with,
import urllib.request
from lxml import html
html_doc = urllib.request.urlopen(<MY_URL>).read()
root = html.fromstring(html_doc)
Is there any way to parse this html using the xml python library instead of lxml?
Is there any way to parse this html using the xml python library instead of lxml?
The answer is no.
An XML library (for example xml.etree.ElementTree) cannot be used to parse arbitrary HTML. It can be used to parse HTML that also happens to be well-formed XML. But your HTML document is not well-formed.
lxml on the other hand can be used for both XML and HTML.
By the way, note that "the xml python library" is ambiguous. There are several submodules in the xml package in the standard library (https://docs.python.org/3/library/xml.html). All of them will reject the HTML document in the question.
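If staying inside the standard library is a hard requirement, the event-driven html.parser module will happily consume this document, at the cost of not giving you a tree. A rough sketch (DebLinkCollector is a hypothetical name, MY_URL stands for the placeholder in the question, and the .deb filter is an illustrative assumption):
from html.parser import HTMLParser
import urllib.request

class DebLinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect href attributes from anchor tags
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.endswith('.deb'):
                    self.links.append(value)

html_doc = urllib.request.urlopen(MY_URL).read().decode('utf-8')
parser = DebLinkCollector()
parser.feed(html_doc)
print(parser.links)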
I'm using Selenium for functional testing of a Django application and thought I'd try html5lib as a way of validating the html output. One of the validations is that the page starts with a <!DOCTYPE ...> tag.
The unit tests that check response.content.decode() all worked fine, correctly flagging errors, but I found that the Selenium driver.page_source output starts with an <html> tag. I have double-checked that I'm using the correct template by modifying the title and making sure that the change is reflected in page_source. There is also a missing newline and indentation between the <html> tag and the <title> tag.
This is what the first few lines looks like in the Firefox browser.
<!DOCTYPE html>
<html>
<head>
<title>NetLog</title>
</head>
Here's the Python code.
self.driver.get(f"{self.live_server_url}/netlog/")
print(self.driver.page_source)
And here's the first few lines of the print when run under the Firefox web driver.
<html><head>
<title>NetLog</title>
</head>
The page body looks fine, and the newline between </body> and </html> is missing as well. Is this expected behaviour? I suppose I could just stuff the DOCTYPE tag in front of the string as a workaround, but would prefer to have it behave as intended.
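For reference, that workaround would only be a couple of lines (a sketch, assuming the DOCTYPE merely needs to be restored before validation):
page = self.driver.page_source
if not page.lstrip().lower().startswith('<!doctype'):
    page = '<!DOCTYPE html>\n' + page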
Chris
I have this complicated problem that I can't find an answer to.
I have a Python HTTPServer running that serves webpages. These webpages are created at runtime with the help of Beautiful Soup. The problem is that Firefox shows the HTML code for the webpage and not the rendered page. I really don't know what is causing this problem:
- Python HTTPServer
- Beautiful Soup
- HTML Code
In any case, I have copied parts of the webpage HTML:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>
My title
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script src="./123_ui.js">
</script>
</head>
<body>
<div>
Hellos
</div>
</body>
</html>
Just to help you, here are the things that I have already tried:
- I have made sure that the Python HTTPServer is sending the MIME header as text/html
- Just copying and pasting the HTML code into a file shows the correct page, as it's static. I can tell from this that the problem is on the HTTPServer side
- Firebug shows that the <body> is empty and "This element has no style rules. You can create a rule for it." is displayed
I just want to know if the error is in Beautiful Soup or HTTPServer or HTML?
Thanks,
Amit
Why are you adding this at the top of the document?
<?xml version="1.0" encoding="iso-8859-1"?>
That will make the browser think the entire document is XML and not XHTML. Removing that line should make it render correctly. I assume Firefox is displaying a page with a bunch of elements which you can expand/collapse to see the content like it normally would for an XML document, even though the HTTP headers might say it's text/html.
So guys,
I have finally solved this problem. The reason was that I wasn't sending the MIME header (even though I thought I was) with content type "text/html".
In the Python HTTPServer, you always send the status line and headers before writing anything to the response:
self.send_response(200)
self.send_header("Content-type", "text/html")
self.end_headers()
# Once you have called the above methods, you can send the HTML to the client
self.wfile.write('ANY HTML CODE YOU WANT TO WRITE')
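A minimal self-contained version of the same idea in modern Python 3 terms (http.server replaces the old Python 2 BaseHTTPServer module, and the body must be bytes; the handler below is just a sketch):
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The status line and headers must be sent before the body
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"<!DOCTYPE html><html><body>Hello</body></html>")

HTTPServer(("localhost", 8000), Handler).serve_forever()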
I need a small script in Python. It needs to read a custom block in a web file.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # Here is the whole page source with HTML tags, but
# I need to read only the section from <head> to </head>
# example the http://target.com source is:
# <html>
# <body>
# <head>
# ... need to read this section ...
# </head>
# ... page source ...
# </body>
# </html>
How do I read the custom section?
To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. Just because it works for some cases doesn't mean it is the standard way of doing it or is the proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!
Just to give you a heads up, you have the_page which contains the HTML data.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
One solution would be to use the awesome Python library Beautiful Soup. It allows you to parse the HTML/XML pretty easily, and will try to help out when the documents are broken or invalid.
I am trying to parse and submit a form on a website using mechanize, but it appears that the built-in form parser cannot detect the form and its elements. I suspect that it is choking on poorly formed HTML, and I'd like to try pre-parsing it with a parser better designed to handle bad HTML (say lxml or BeautifulSoup) and then feeding the prettified, cleaned-up output to the form parser. I need mechanize not only for submitting the form but also for maintaining sessions (I'm working this form from within a login session.)
I'm not sure how to go about doing this, if it is indeed possible. I'm not that familiar with the various details of the HTTP protocol or how to get the various parts to work together, etc. Any pointers?
I had a problem where a form field was missing from a form. I couldn't find any malformed HTML, but I figured that was the cause, so I ran the response through BeautifulSoup's prettify function and it worked.
resp = br.open(url)
soup = BeautifulSoup(resp.get_data())
resp.set_data(soup.prettify())
br.set_response(resp)
I'd love to know how to do this automatically.
Edit: I found out how to do it automatically.
class PrettifyHandler(mechanize.BaseHandler):
def http_response(self, request, response):
if not hasattr(response, "seek"):
response = mechanize.response_seek_wrapper(response)
# only use BeautifulSoup if response is html
if response.info().dict.has_key('content-type') and ('html' in response.info().dict['content-type']):
soup = BeautifulSoup(response.get_data())
response.set_data(soup.prettify())
return response
# also parse https in the same way
https_response = http_response
br = mechanize.Browser()
br.add_handler(PrettifyHandler())
br will now use BeautifulSoup to parse all responses where html is contained in the content type (MIME type), e.g. text/html.
Reading from the big example on the first page of the mechanize website:
# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)
so it seems very possible to preprocess the response with another parser which will regenerate well-formed HTML, then feed it back to mechanize for further processing.
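As a sketch of that idea, here is an lxml-based equivalent of the BeautifulSoup handler above, reusing only the mechanize response methods already shown (url is assumed to be the form page):
import mechanize
from lxml import etree

br = mechanize.Browser()
resp = br.open(url)
# Re-serialize the body through lxml's forgiving HTML parser
root = etree.HTML(resp.get_data())
resp.set_data(etree.tostring(root))
br.set_response(resp)
br.select_form(nr=0)  # the cleaned-up HTML should now parse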
What you're looking for can be done with lxml.etree, which is the xml.etree.ElementTree emulator (and replacement) provided by lxml.
First we take some bad, malformed HTML:
% cat bad.html
<html>
<HEAD>
<TITLE>this HTML is awful</title>
</head>
<body>
<h1>THIS IS H1</H1>
<A HREF=MYLINK.HTML>This is a link and it is awful</a>
<img src=yay.gif>
</body>
</html>
(Observe the mixed case between the opening and closing tags, and the missing quotation marks.)
And then parse it:
>>> from lxml import etree
>>> bad = open('bad.html').read()
>>> html = etree.HTML(bad)
>>> print etree.tostring(html)
<html><head><title>this HTML is awful</title></head><body>
<h1>THIS IS H1</h1>
<a href="MYLINK.HTML">This is a link and it is awful</a>
<img src="yay.gif"/></body></html>
Observe that the tagging and quotation have been corrected for us.
If you were having problems parsing the HTML before, this might be the answer you're looking for. As for the details of HTTP, that is another matter entirely.