Using Python - Get a table out of some HTML and display it?

There's a lot of help on here, but some of it goes over my head, so hopefully by asking my question and getting a tailored answer I will understand better.
So far I have managed to connect to a website, authenticate as a user, fill in a form and then pull down the HTML. The HTML contains a table I want. I just want to say something like:
read the HTML... when you reach the table start tag, keep going until you reach the table end tag, then display that, or write it to a new HTML file and open it, keeping the tags so it's formatted for me.
Here is the code I have so far.
import requests

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    s.post(LOGINURL, data=login)
    r = s.get(LOGINURL)
    print(r.url)
    # An authorised request.
    r = s.get(APURL)
    print(r.url)
    # etc...
    r = s.post(APURL, data=findaps)
    r = s.get(APURL)
    # print(r.text)
    with open("makethisfile.html", "w") as f:
        f.write('\n'.join([
            '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">',
            '<html>',
            '  <head>',
            '    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">',
            '    <title>THE TITLE</title>',
            '    <link rel="stylesheet" href="css/displayEventLists.css" type="text/css">',
            r.text,  # this just writes everything; I need to get only the table
        ]))

Although it's best to parse the file properly, a quick-and-dirty method uses a regex.
import re

# Non-greedy (.+?) stops at the first closing tag instead of the last one in the file.
m = re.search("<table.*?>(.+?)</table>", r.text, re.S)
if m:
    print(m.group())
else:
    print("Error: table not found")
As an example of why parsing is better, the regex as written will fail with the following (rather contrived!) example:
<!-- <table> -->
blah
blah
<table>
this is the actual
table
</table>
As written it grabs the first table in the file, but you could iterate (with re.finditer, say) to get the second and so on, or make the regex specific to the table you want if possible, so that's not a problem.
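For completeness, here is a sketch of the "parse it properly" route using BeautifulSoup (a hedged example: it assumes r.text holds the fetched page, as in the question's code, and that bs4 is installed):
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, "html.parser")
# find() returns the first <table>; use find_all("table")[1] for the second, etc.
table = soup.find("table")
if table is not None:
    print(table)  # str(table) keeps the tags, so the output stays formatted
else:
    print("Error: table not found")
You could then write str(table) into the skeleton HTML file from the question instead of the whole of r.text.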

Related

Why is the Python requests module not pulling the whole HTML?

The link: https://www.hyatt.com/explore-hotels/service/hotels
code:
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I also tried this:
r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
output:
<!DOCTYPE html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag the data are in, only showing a single script tag. How can I access the data (shown in the browser view, where it looks like JSON)?
I don't believe this can be done... that data simply isn't in r.text.
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see, there is no <pre> tag, for whatever reason, so you're unable to access that.
I also get a 429 error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? This site doesn't seem willing to hand anything over. Some sites can't be parsed, for various reasons. If you want to play with JSON data, I would look into using an API instead.
If you google https://www.hyatt.com and manually go to the URL you mentioned, you get a 404 error.
I would say Hyatt don't want you parsing their site. So don't!
The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a usable form with the standard json package:
import json
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
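As a shortcut, requests can decode the JSON body itself, and checking the Content-Type header first confirms what the server actually returned. A minimal sketch of the same request:
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
print(r.headers.get('Content-Type'))  # expect application/json; charset=utf-8
data = r.json()  # equivalent to json.loads(r.text)
print(data)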

Insert variables into an html file

I am trying to send an email using SendGrid. For this I created an HTML file, and I want to format some variables into it.
Very basic test.html example:
<html>
<head>
</head>
<body>
Hello World, {name}!
</body>
</html>
Now in my Python code I am trying to do something like this:
html = open("test.html", "r")
text = html.read()
msg = MIMEText(text, "html")
msg['name'] = 'Boris'
and then proceed to send the email
Sadly, this does not seem to be working. Any way to make this work?
There are a few ways to approach this, depending on how dynamic it must be and how many elements you need to insert. (Note that msg['name'] = 'Boris' sets an email header on the MIMEText object; it does not substitute anything into the body.) If it is one single value, name, then @furas is correct and you can simply put
html = open("test.html", "r")
text = html.read().format(name="skeletor")
print(text)
And get:
<html>
<head>
</head>
<body>
Hello World, skeletor!
</body>
</html>
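One caveat with str.format here: any literal braces in the HTML (inline CSS or JavaScript, say) must be doubled as {{ and }}, or format() will raise a KeyError. A small illustration with a made-up template string:
# Literal braces in the template must be escaped for str.format().
template = "<style>body {{ color: red; }}</style>Hello World, {name}!"
print(template.format(name="skeletor"))
# <style>body { color: red; }</style>Hello World, skeletor!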
Alternatively you can use Jinja2 templates.
import jinja2
html = open("test.html", "r")
text = html.read()
t = jinja2.Template(text)
print(t.render(name="Skeletor"))
Helpful links: the Jinja website, Real Python's primer on Jinja, and Python Programming's Jinja tutorial.

Parsing MS-specific HTML tags in BeautifulSoup

When parsing an email sent using MS Outlook, I want to be able to strip the annoying Microsoft XML tags that it adds, such as the o:p tag. When I use Python's BeautifulSoup to parse the email as HTML, it can't seem to find these specialty tags.
For example:
from bs4 import BeautifulSoup
textToParse = """
<html>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "html5lib")
body = soup.find('body')
for otag in body.find_all('o'):
    print(otag)
for otag in body.find_all('o:p'):
    print(otag)
This outputs no text to the console, but if I switch the find_all call to search for p, it outputs the p node as expected.
How come these custom tags do not seem to work?
It's a namespace issue. Apparently, BeautifulSoup does not consider custom namespaces valid when parsing with "html5lib".
You can work around this with a regular expression, which – strangely – does work correctly:
import re

print(soup.find_all(re.compile('o:p')))
>>> [<o:p>This should go</o:p>]
but the "proper" solution is to change the parser to "lxml-xml" and introduce o: as a valid namespace:
from bs4 import BeautifulSoup
textToParse = """
<html xmlns:o='dummy_url'>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "lxml-xml")
body = soup.find('body')
print('this should find nothing')
for otag in body.find_all('o'):
    print(otag)
print('this should find o:p')
for otag in body.find_all('o:p'):
    print(otag)
>>>
this should find nothing
this should find o:p
<o:p>This should go</o:p>
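Since the original goal was stripping these tags, a possible follow-up once find_all works (a sketch using bs4's unwrap(), which removes a tag but keeps its contents; decompose() would drop the contents too):
from bs4 import BeautifulSoup

soup = BeautifulSoup(textToParse, "lxml-xml")
for otag in soup.find_all('o:p'):
    otag.unwrap()  # drop the <o:p> tag, keep "This should go"
print(soup.body)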

Read HEAD contents from HTML

I need a small script in Python that reads a custom block from a web page.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # Here is all page source with html tags, but
# i need read only section from <head> to </head>
# example the http://target.com source is:
# <html>
# <body>
# <head>
# ... need to read this section ...
# </head>
# ... page source ...
# </body>
# </html>
How do I read this custom section?
To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. Just because it works in some cases doesn't mean it is the standard or proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!
Just to give you a heads up, you have the_page which contains the HTML data.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
One solution would be to use the awesome Python library Beautiful Soup. It allows you to parse HTML/XML pretty easily and will try to help out when the documents are broken or invalid.
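For reference, a minimal sketch of the same idea with today's packages (requests and bs4 in place of urllib2 and the old BeautifulSoup 3 import; example.com stands in for the real URL):
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.example.com')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.head)  # the whole <head> element, tags included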

Is it possible to hook up a more robust HTML parser to Python mechanize?

I am trying to parse and submit a form on a website using mechanize, but it appears that the built-in form parser cannot detect the form and its elements. I suspect that it is choking on poorly formed HTML, and I'd like to try pre-parsing it with a parser better designed to handle bad HTML (say lxml or BeautifulSoup) and then feeding the prettified, cleaned-up output to the form parser. I need mechanize not only for submitting the form but also for maintaining sessions (I'm working this form from within a login session.)
I'm not sure how to go about doing this, if it is indeed possible. I'm not that familiar with the various details of the HTTP protocol or how to get the various parts to work together. Any pointers?
I had a problem where a form field was missing from a form. I couldn't find any malformed HTML, but I figured that was the cause, so I used BeautifulSoup's prettify function to parse it, and it worked:
resp = br.open(url)
soup = BeautifulSoup(resp.get_data())
resp.set_data(soup.prettify())
br.set_response(resp)
I'd love to know how to do this automatically.
Edit: I found out how to do this automatically:
class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if the response is html
        if response.info().dict.has_key('content-type') and ('html' in response.info().dict['content-type']):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

    # also parse https in the same way
    https_response = http_response

br = mechanize.Browser()
br.add_handler(PrettifyHandler())
br will now use BeautifulSoup to parse all responses whose content type (MIME type) contains html, e.g. text/html.
Reading from the big example on the first page of the mechanize website:
# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)
So it seems very possible to preprocess the response with another parser that regenerates well-formed HTML and then feed it back to mechanize for further processing.
What you're looking for can be done with lxml.etree, which is the xml.etree.ElementTree emulator (and replacement) provided by lxml.
First we take some badly malformed HTML:
% cat bad.html
<html>
<HEAD>
<TITLE>this HTML is awful</title>
</head>
<body>
<h1>THIS IS H1</H1>
<A HREF=MYLINK.HTML>This is a link and it is awful</a>
<img src=yay.gif>
</body>
</html>
(Observe the mixed case between opening and closing tags and the missing quotation marks.)
And then parse it:
>>> from lxml import etree
>>> bad = file('bad.html').read()
>>> html = etree.HTML(bad)
>>> print etree.tostring(html)
<html><head><title>this HTML is awful</title></head><body>
<h1>THIS IS H1</h1>
<a href="MYLINK.HTML">This is a link and it is awful</a>
<img src="yay.gif"/></body></html>
Observe that the tagging and quotation have been corrected for us.
If you were having problems parsing the HTML before, this might be the answer you're looking for. As for the details of HTTP, that is another matter entirely.
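Putting the two answers together, a hedged sketch of how the lxml cleanup could be fed back to mechanize using the same set_data/set_response pattern shown above (url is a placeholder for the page with the troublesome form):
import mechanize
from lxml import etree

br = mechanize.Browser()
response = br.open(url)  # 'url' is a placeholder, not a real address

# Re-serialise the document through lxml, which repairs bad HTML as it parses.
html = etree.HTML(response.get_data())
response.set_data(etree.tostring(html))

# Hand the cleaned copy back so mechanize's form parser sees valid markup.
br.set_response(response)
br.select_form(nr=0)  # the form should now be detectable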
