Get content from meta in external website - python

I need to extract the meta description of an external website. I've already searched and maybe the simple answer is already out there, but I wasn't able to apply it to my code.
Currently, I can get its title doing the following:
from urllib.request import urlopen
from lxml.html import parse
url = "https://technofall.com/how-to-report-traffic-violation-pay-vehicle-fines-e-challan/"
page = urlopen(url)
p = parse(page)
print(p.find(".//title").text)
This correctly prints the page title.
However, the description is a bit trickier. It can come in the form of:
<meta name="og:description" content="blabla"
<meta property="og:description" content="blabla"
<meta name="description" content="blabla"
So what I want is to extract the first of these that appears inside the HTML.
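A minimal sketch of one way to do this, reusing the tree parsed above; the XPath union checks all three attribute variants and returns matches in document order, so the first match wins:
# Collect every candidate meta tag in document order.
metas = p.xpath("//meta[@name='og:description' or "
                "@property='og:description' or @name='description']")
if metas:
    print(metas[0].get("content"))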

Related

parse html using Python's "xml" module ParseError on meta tag

I'm trying to parse some HTML using the xml python library. The HTML I'm trying to parse is from download.docker.com and looks like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</title>
</head>
<body>
<h1>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</h1>
<hr>
<pre>../
containerd.io_1.5.10-1_amd64.deb
...
</pre><hr></body></html>
Parsing the html with the following code,
import urllib.request
import xml.etree.ElementTree as ET
html_doc = urllib.request.urlopen(<MY_URL>).read()
root = ET.fromstring(html_doc)
>>> ParseError: mismatched tag: line 6, column 2
Unless I'm mistaken, this is because of the <meta charset="UTF-8">. Using something like lxml, I can make this work with:
import urllib.request
from lxml import html
html_doc = urllib.request.urlopen(<MY_URL>).read()
root = html.fromstring(html_doc)
Is there any way to parse this html using the xml python library instead of lxml?
The answer is no.
An XML library (for example xml.etree.ElementTree) cannot be used to parse arbitrary HTML. It can be used to parse HTML that also happens to be well-formed XML, but your HTML document is not well-formed: as you suspected, the void <meta charset="UTF-8"> element has no closing tag, which is legal HTML but invalid XML.
lxml on the other hand can be used for both XML and HTML.
By the way, note that "the xml python library" is ambiguous. There are several submodules in the xml package in the standard library (https://docs.python.org/3/library/xml.html). All of them will reject the HTML document in the question.
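A minimal sketch illustrating the difference, assuming lxml is installed:
import xml.etree.ElementTree as ET
from lxml import html

# A void element such as <meta charset="UTF-8"> has no closing tag,
# which is legal HTML but not well-formed XML.
snippet = '<html><head><meta charset="UTF-8"><title>t</title></head></html>'
try:
    ET.fromstring(snippet)
except ET.ParseError as e:
    print("xml.etree rejects it:", e)  # mismatched tag
root = html.fromstring(snippet)  # lxml's lenient HTML parser copes fine
print(root.findtext(".//title"))  # t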

Why is the python requests module not pulling the whole HTML?

The link: https://www.hyatt.com/explore-hotels/service/hotels
code:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I also tried this:
import json
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.dumps(r.text)
print(data)
output:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="SOME_value">
</script>
</body>
</html>
It's printing the HTML without the tag the data are in, only showing a single script tag.
How can I access the data shown in the browser view (it looks like JSON)?
I don't believe this can be done... that data simply isn't in r.text.
If you do this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.hyatt.com/explore-hotels/service/hotels")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
You get this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?tkrm_alpekz_s1.3=0EOFte3LjRKv3iJhEEV2hrnisE5M3Lwy3ac3UPZ19zdiB49A6ZtBjtiwBqgKQN3q2MEQ3NbFjTWfmP9GqArOIAML6zTvSb4lRHD7FsmJFVWNkSwuTNWUNuJWv6hEXBG37DhBtTXFEO50999RihfPbTjsB">
</script>
</body>
</html>
As you can see there is no <pre> tag for whatever reason. So you're unable to access that.
I also get a 429 error when accessing the URL:
GET https://www.hyatt.com/explore-hotels/service/hotels 429
What is the end goal here? This site doesn't seem willing to serve anything to a scraper, and some sites simply can't be scraped, for various reasons. If you want to play with JSON data I would look into using an API instead.
If you google https://www.hyatt.com and manually go to the URL you mentioned you get a 404 error.
I would say Hyatt don't want you parsing their site. So don't!
The response is JSON, not HTML. You can verify this by opening the Network tab in your browser's dev tools. There you will see that the content-type header is application/json; charset=utf-8.
You can parse this into a useable form with the standard json package:
import json
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
data = json.loads(r.text)
print(data)
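Alternatively, requests can decode the JSON body itself; a minimal sketch, with raise_for_status() added to surface HTTP errors such as the 429 mentioned in the other answer:
import requests

r = requests.get('https://www.hyatt.com/explore-hotels/service/hotels')
r.raise_for_status()  # raises on 4xx/5xx responses, e.g. the 429 above
data = r.json()       # shortcut for json.loads(r.text)
print(data)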

Parsing MS specific html tags in BeautifulSoup

When trying to parse an email sent using MS Outlook, I want to be able to strip the annoying Microsoft XML tags that it has added. One such example is the o:p tag. When trying to use Python's BeautifulSoup to parse an email as HTML, it can't seem to find these specialty tags.
For example:
from bs4 import BeautifulSoup
textToParse = """
<html>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "html5lib")
body = soup.find('body')
for otag in body.find_all('o'):
    print(otag)
for otag in body.find_all('o:p'):
    print(otag)
This will output no text to the console, but if I switched the find_all call to search for p then it would output the p node as expected.
How come these custom tags do not seem to work?
It's a namespace issue. Apparently, BeautifulSoup does not consider custom namespaces valid when parsed with "html5lib".
You can work around this with a regular expression (after import re), which, strangely, does work correctly:
print(soup.find_all(re.compile('o:p')))
>>> [<o:p>This should go</o:p>]
but the "proper" solution is to change the parser to "lxml-xml" and introducing o: as a valid namespace.
from bs4 import BeautifulSoup
textToParse = """
<html xmlns:o='dummy_url'>
<head>
<title>Something to parse</title>
</head>
<body>
<p><o:p>This should go</o:p>Paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(textToParse, "lxml-xml")
body = soup.find('body')
print('this should find nothing')
for otag in body.find_all('o'):
    print(otag)
print('this should find o:p')
for otag in body.find_all('o:p'):
    print(otag)
>>>
this should find nothing
this should find o:p
<o:p>This should go</o:p>
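Since the original goal was to strip these tags, here is a minimal follow-up sketch, reusing the body found above, that removes each o:p wrapper while keeping its text:
for otag in body.find_all('o:p'):
    otag.unwrap()  # unwrap() drops the tag itself but keeps its contents
print(body)
>>>
<body><p>This should goParagraph</p></body>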

Urllib2 get garbled string instead of page source [duplicate]

This question already has answers here:
Does python urllib2 automatically uncompress gzip data fetched from webpage?
(4 answers)
Closed 7 years ago.
When I crawl the webpage using urllib2, I don't get the page source but a garbled string that I can't make sense of. My code is as follows:
import urllib2

url = 'http://finance.sina.com.cn/china/20150905/065523161502.shtml'
conn = urllib2.urlopen(url)
content = conn.read()
print content
Can anyone help me find out what's wrong? Thank you so much.
Update: you can run the code above to reproduce what I get. This is what I see in Python:
{G?0????l???%ߐ?C0 ?K?z?%E
|?B ??|?F?oeB?'??M6?
y???~???;j????H????L?mv:??:]0Z?Wt6+Y+LV? VisV:캆P?Y?,
O?m?p[8??m/???Y]????f.|x~Fa]S?op1M?H?imm5??g?????k?K#?|??? ???????p:O
??(? P?FThq1??N4??P???X??lD???F???6??z?0[?}??z??|??+?pR"s?Lq??&g#?v[((J~??w1#-?G?8???'?V+ks0?????%???5)
And this is what I expected (using curl):
<html>
<head>
<link rel="mask-icon" sizes="any" href="http://www.sina.com.cn/favicon.svg" color="red">
<meta charset="gbk"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
Here is a possible way to get the source information using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
#Url to request
url = "http://finance.sina.com.cn/china/20150905/065523161502.shtml"
r = requests.get(url)
#Use BeautifulSoup to organise the 'requested' content
soup = BeautifulSoup(r.content, "lxml")
print soup
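The garbled string is the gzip-compressed response body: urllib2 does not decompress it automatically (that is what the linked duplicate covers), while requests does. A minimal urllib2 sketch, assuming the server responds with gzip as the garbled output suggests, and decoding with the charset=gbk declared in the meta tag above:
import gzip
import urllib2
from StringIO import StringIO

url = 'http://finance.sina.com.cn/china/20150905/065523161502.shtml'
response = urllib2.urlopen(url)
body = response.read()
# urllib2 returns the raw bytes; decompress manually if they are gzipped.
if response.info().get('Content-Encoding') == 'gzip':
    body = gzip.GzipFile(fileobj=StringIO(body)).read()
print body.decode('gbk')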

Read HEAD contents from HTML

I need a small script in Python that reads a custom block from a web page.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # Here is all page source with html tags, but
# i need read only section from <head> to </head>
# example the http://target.com source is:
# <html>
# <head>
# ... need to read this section ...
# </head>
# <body>
# ... page source ...
# </body>
# </html>
How do I read this custom section?
To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. Just because it works for some cases doesn't mean it is the standard way of doing it or is the proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!
Just to give you a heads up, you have the_page which contains the HTML data.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
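BeautifulSoup also exposes the first occurrence of a tag as an attribute, so the same result can be had with this shorthand:
print soup.head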
One solution would be to use the awesome Python library Beautiful Soup. It allows you to parse HTML/XML pretty easily, and it will try to help out even when the documents are broken or invalid.
