Extract title from url with python

I want to use urllib to extract the title from the following html document. I have provided the beginning part below:
html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<title>Three Little Pigs</title>
<meta name="generator" content="Amaya, see http://www.w3.org/Amaya/">
</head>
<body>
I used urlopen from urllib.request, but it seems like the URL type in the HTML document does not allow me to extract anything.
I have tried:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def get_title():
    soup = urlopen(html_doc)
    print(soup.title.string)

get_title()
I got this error:
ValueError: unknown url type: '!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n "http://www.w3.org/TR/html4/loose.dtd">\n<html>\n<head>\n <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">\n <title>Three Little Pigs</title>\n <meta name="generator" content="Amaya, see http://www.w3.org/Amaya/">\n</head>\n\n<body'
Can anyone help with this problem?

html_doc is not a URL, it's the actual source code string. You can use BeautifulSoup's html.parser to parse it and then extract the title from it:
from bs4 import BeautifulSoup

def get_title():
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.title.string)

get_title()
Output:
Three Little Pigs
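If you actually start from a URL rather than a source string, the same idea applies after downloading the page; a minimal sketch (the URL here is just a placeholder):
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_title(url):
    html = urlopen(url).read()                 # download the page source
    soup = BeautifulSoup(html, 'html.parser')  # parse the downloaded bytes
    return soup.title.string

print(get_title('http://www.example.com/'))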

Related

BeautifulSoup parser adds unnecessary closing html tags

For example, you have HTML like:
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
python:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')
print(soup.prettify())
And if you parse it using BeautifulSoup in Python and print it with prettify(), you get the following output:
<html>
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</meta>
</meta>
</meta>
</meta>
</meta>
</head>
But if you have an HTML meta tag like
<meta name="description" content="Free Web tutorials" />
It will give the output as it is; it won't add an ending tag.
So how do I stop BeautifulSoup from adding unnecessary ending tags?
To solve this you just need to change your HTML parser to the lxml parser.
Then your Python script will be:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'lxml')
print(soup.prettify())
you just need to change soup = bs(page.data, 'html.parser') to soup = bs(page.data, 'lxml')
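A practical note (not from the original answer): lxml is a third-party parser, so it has to be installed separately (for example with pip install lxml), while html.parser ships with Python. If you are not sure it is available, you can fall back to the bundled parser; a minimal sketch:
from bs4 import BeautifulSoup, FeatureNotFound

html = '<meta name="author" content="John Doe">'
try:
    soup = BeautifulSoup(html, 'lxml')          # preferred, per the answer above
except FeatureNotFound:
    soup = BeautifulSoup(html, 'html.parser')   # bundled fallback
print(soup.prettify())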

Inserting an HTML file into a Python file

I'm using PyCharm on Windows 10 and I'd like to use an HTML file inside a Python file, so what should I do? I have my code already written, but the webpage doesn't seem to load this HTML file.
To visualize this, I share my code:
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def home():
    return render_template("home.html")

@app.route('/about/')
def about():
    return render_template("about.html")

if __name__ == "__main__":
    app.run(debug=True)
After running this Python file locally, I'd like these HTML files to work, but the program doesn't seem to see them. Where should I put these HTML files, or what should I do with them? I have them all in one folder on my PC.
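A side note (not part of the original thread): Flask's render_template() looks for templates in a folder named "templates" next to the application file, so the usual layout would be something like this (the file name app.py is just an assumption):
project/
    app.py          (the Flask code above)
    templates/
        home.html
        about.html
With that layout in place, render_template("home.html") can find the file.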
Use BeautifulSoup. Here's an example where a meta tag is inserted right after the title tag using insert_after():
from bs4 import BeautifulSoup as Soup
html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div>test</div>
</html>
"""
soup = Soup(html, 'html.parser')
title = soup.find('title')
meta = soup.new_tag('meta')
meta['content'] = "text/html; charset=UTF-8"
meta['http-equiv'] = "Content-Type"
title.insert_after(meta)
print(soup)
prints:
<html>
<head>
<title>Test Page</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body>
<div>test</div>
</body>
</html>
You can also find the head tag and use insert() with a specified position:
head = soup.find('head')
head.insert(1, meta)
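For clarity, here is a small self-contained illustration (not from the original answer) of how insert() positions work inside head: position 0 would put the new tag before the title, position 1 right after it.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>Test Page</title></head>", "html.parser")
meta = soup.new_tag("meta", charset="utf-8")
soup.head.insert(1, meta)   # index 1 = right after <title>
print(soup)                 # <head><title>Test Page</title><meta charset="utf-8"/></head>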
Also see:
Add parent tags with beautiful soup
How to append a tag after a link with BeautifulSoup

keeping html entities when using BeautifulSoup in python

I'm trying to keep HTML entities while parsing an HTML page.
Here is the html code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div>{"key": "my "value"" }</div>
</body>
</html>
Here's my Python (2.7.11) code:
from bs4 import BeautifulSoup
page = open("test.html", "r").read()
soup = BeautifulSoup(page, "html.parser")
For print soup.div the result will be: <div>{"key": "my "value"" }</div>
For print soup.div.get_text() the result will be: {"key": "my "value"" }
For both cases I'm losing &quot;. Is there any way to keep it while using BeautifulSoup, especially when using get_text()?
Next I want to parse the text into JSON, so when I use json.dumps(soup.div.get_text()) it's not working.
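One possible workaround, as a sketch (not from the original thread, written for Python 3): BeautifulSoup decodes entities while it parses, so you can re-escape the extracted text afterwards with the standard library's html.escape(). Note that this escapes every quote, not only the ones that were &quot; in the source:
import html
from bs4 import BeautifulSoup

page = open("test.html", "r").read()
soup = BeautifulSoup(page, "html.parser")

text = soup.div.get_text()   # entities are already decoded at this point
print(html.escape(text))     # {&quot;key&quot;: &quot;my &quot;value&quot;&quot; }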

BeautifulSoup - proper way of dealing with self-closing tags

I have an html file with some self-closing tags, but BeautifulSoup doesn't like them.
from bs4 import BeautifulSoup
html = '<head><meta content="text/html" http-equiv="Content-Type"><meta charset="utf-8"></head>'
doc = BeautifulSoup(html, 'html.parser')
print(doc.prettify())
prints
<head>
<meta content="text/html" http-equiv="Content-Type">
<meta charset="utf-8"/>
</meta>
</head>
Must I manually check if each tag is self-closing and modify appropriately, or is there a better way of handling this?
As you may already know, you can specify different parsers for BeautifulSoup to use internally. And, as noted in the BeautifulSoup docs:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document. But if the document is not perfectly-formed, different parsers will give different results.
In this particular case, both lxml and html5lib produce two separate meta tags:
In [4]: doc = BeautifulSoup(html, 'lxml')
In [5]: print(doc.prettify())
<html>
<head>
<meta content="text/html" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
</head>
</html>
In [6]: doc = BeautifulSoup(html, 'html5lib')
In [7]: print(doc.prettify())
<html>
<head>
<meta content="text/html" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
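As a quick follow-up, you can verify that the two meta tags really end up as siblings under head rather than nested (a small sketch, assuming lxml is installed):
from bs4 import BeautifulSoup

html = '<head><meta content="text/html" http-equiv="Content-Type"><meta charset="utf-8"></head>'
doc = BeautifulSoup(html, 'lxml')

metas = doc.find_all('meta')
print(len(metas))                       # 2
print([m.parent.name for m in metas])   # ['head', 'head']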

Get document DOCTYPE with BeautifulSoup

I've just started tinkering with Scrapy in conjunction with BeautifulSoup, and I'm wondering if I'm missing something very obvious, but I can't seem to figure out how to get the doctype of a returned HTML document from the resulting soup object.
Given the following html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta charset=utf-8 />
<meta name="viewport" content="width=620" />
<title>HTML5 Demos and Examples</title>
<link rel="stylesheet" href="/css/html5demos.css" type="text/css" />
<script src="js/h5utils.js"></script>
</head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Can anyone tell me if there's a way of extracting the declared doctype from it using BeautifulSoup?
Beautiful Soup 4 has a class for DOCTYPE declarations, so you can use that to extract all the declarations at top level (though you're no doubt expecting one or none!)
import bs4

def doctype(soup):
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None
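A hypothetical usage example (assuming html holds the document shown above):
import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
print(doctype(soup))
# HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"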
You can go through top-level elements and check each to see whether it is a declaration. Then you can inspect it to find out what kind of declaration it is:
for child in soup.contents:
    if isinstance(child, BS.Declaration):
        declaration_type = child.string.split()[0]
        if declaration_type.upper() == 'DOCTYPE':
            declaration = child
You could also just fetch the first item in soup.contents:
>>> soup.contents[0]
u'DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"'
