Keeping HTML entities when using BeautifulSoup in Python

I'm trying to keep HTML entities while parsing an HTML page.
Here is the HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div>{"key": "my &quot;value&quot;" }</div>
</body>
</html>
Here's my Python (2.7.11) code:
from bs4 import BeautifulSoup
page = open("test.html", "r").read()
soup = BeautifulSoup(page, "html.parser")
For print soup.div the result will be <div>{"key": "my "value"" }</div>
For print soup.div.get_text() the result will be {"key": "my "value"" }
In both cases I'm losing &quot;. Is there any way to keep it while using BeautifulSoup, especially when using get_text()?
I need this because the next step is to parse the text as JSON, so when I use json.dumps(soup.div.get_text()) it doesn't work.
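One possible approach, sketched under the assumption that &quot; only ever appears inside JSON string values on the page: rewrite each &quot; as a JSON-escaped quote before handing the markup to BeautifulSoup. The parser then has no entity left to decode, and the text can be read with json.loads (json.dumps goes the other way, from Python objects to text). The page string below is a hypothetical stand-in for the contents of test.html.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of test.html.
page = '<div>{"key": "my &quot;value&quot;" }</div>'

# Turn each HTML-escaped quote into a JSON-escaped quote (\") before parsing,
# so entity decoding cannot merge it with the structural JSON quotes.
prepared = page.replace("&quot;", '\\"')

soup = BeautifulSoup(prepared, "html.parser")
text = soup.div.get_text()   # the JSON text, with inner quotes backslash-escaped
data = json.loads(text)      # parse the text into a Python dict

print(data["key"])
```

This is a blunt global replace: it would also rewrite &quot; inside attribute values, so it only fits pages where the entity is confined to the embedded JSON.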

Related

Prevent BS4 to add duplicate tags

I'm appending HTML snippets/elements to an existing HTML document and BS4 duplicates the element(s) inside it. How can I prevent this?
Simplified code
from bs4 import BeautifulSoup as bs4
html = bs4("<!DOCTYPE html>", "html5lib")
message = bs4("<span>Complete all required fields.<span>", "html.parser")
html.select("body")[0].append(message)
print(html.prettify())
Output
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<span>
Complete all required fields.
<span>
</span>
</span>
</body>
</html>
Expected
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<span>
Complete all required fields.
</span>
</body>
</html>
You did everything right, but you forgot to close the span:
from bs4 import BeautifulSoup as bs4
html = bs4("<!DOCTYPE html>", "html5lib")
message = bs4("<span>Complete all required fields.</span>", "html.parser")  # changed: the span is now closed
html.select("body")[0].append(message)
print(html.prettify())
Output:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<span>
Complete all required fields.
</span>
</body>
</html>

Extract title from url with python

I want to use urllib to extract the title from the following html document. I have provided the beginning part below:
html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<title>Three Little Pigs</title>
<meta name="generator" content="Amaya, see http://www.w3.org/Amaya/">
</head>
<body>
I used urlopen from urllib.request, but it seems the URL type in the HTML document does not allow me to extract anything.
I have tried:
from bs4 import BeautifulSoup
from urllib.request import urlopen
def get_title():
    soup = urlopen(html_doc)
    print(soup.title.string)

get_title()
I got the result of:
ValueError: unknown url type: '!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n "http://www.w3.org/TR/html4/loose.dtd">\n<html>\n<head>\n <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">\n <title>Three Little Pigs</title>\n <meta name="generator" content="Amaya, see http://www.w3.org/Amaya/">\n</head>\n\n<body'
Can anyone help with this problem?
html_doc is not a URL; it is the actual source code as a string. You can parse it with BeautifulSoup's html.parser and then extract the title from it:
from bs4 import BeautifulSoup
def get_title():
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.title.string)

get_title()
Output:
Three Little Pigs
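If the page does live at a real URL, the two cases can be kept apart as in the sketch below. The helper names title_from_markup and title_from_url are made up for illustration, and the sample string is a shortened stand-in for html_doc; urlopen is only called when an actual URL is passed in.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def title_from_markup(markup):
    """Return the <title> text of an HTML string (or bytes), or None."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.title.string if soup.title else None


def title_from_url(url):
    """Download the page at `url` first, then parse the response body."""
    with urlopen(url) as response:
        return title_from_markup(response.read())


# Shortened stand-in for the html_doc string from the question.
sample = "<html><head><title>Three Little Pigs</title></head><body></body></html>"
print(title_from_markup(sample))
```

urlopen expects something like "http://..." or "file:///...", which is why passing the markup itself raises ValueError: unknown url type.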

Inserting a html file into python file

I'm using PyCharm on Windows 10 and I'd like to use an HTML file inside a Python file. What should I do? My code is already written, but the web page doesn't seem to load the HTML file.
To illustrate, here is my code:
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def home():
    return render_template("home.html")

@app.route('/about/')
def about():
    return render_template("about.html")

if __name__ == "__main__":
    app.run(debug=True)
After running this Python file locally, I'd like these HTML files to work, but the program doesn't seem to see them. Where should I put them, or what should I do with them? I have them all in one folder on my PC.
Use BeautifulSoup. Here's an example where a meta tag is inserted right after the title tag using insert_after():
from bs4 import BeautifulSoup as Soup
html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div>test</div>
</html>
"""
soup = Soup(html)
title = soup.find('title')
meta = soup.new_tag('meta')
meta['content'] = "text/html; charset=UTF-8"
meta['http-equiv'] = "Content-Type"
title.insert_after(meta)
print(soup)
prints:
<html>
<head>
<title>Test Page</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body>
<div>test</div>
</body>
</html>
You can also find the head tag and use insert() with a specified position:
head = soup.find('head')
head.insert(1, meta)
Also see:
Add parent tags with beautiful soup
How to append a tag after a link with BeautifulSoup

How to access html data in python flask [duplicate]

I have an HTML file that contains various information. Example:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>welcome</h1>
</body>
</html>
I want to access the text "welcome" from a Python file that runs on the Flask server. How can I access that data?
Use the package BeautifulSoup:
from bs4 import BeautifulSoup
html_doc = """
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>welcome</h1>
</body>
</html> """
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.h1.text)
# 'welcome'

BeautifulSoup - proper way of dealing with self-closing tags

I have an html file with some self-closing tags, but BeautifulSoup doesn't like them.
from bs4 import BeautifulSoup
html = '<head><meta content="text/html" http-equiv="Content-Type"><meta charset="utf-8"></head>'
doc = BeautifulSoup(html, 'html.parser')
print doc.prettify()
prints
<head>
<meta content="text/html" http-equiv="Content-Type">
<meta charset="utf-8"/>
</meta>
</head>
Must I manually check if each tag is self-closing and modify appropriately, or is there a better way of handling this?
As you may already know, you can specify different parsers that BeautifulSoup would use internally. And, as noted in BeautifulSoup docs:
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won’t matter.
One parser will be faster than another, but they’ll all give you a
data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will
give different results.
In this particular case, both lxml and html5lib produce two separate meta tags:
In [4]: doc = BeautifulSoup(html, 'lxml')
In [5]: print(doc.prettify())
<html>
<head>
<meta content="text/html" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
</head>
</html>
In [6]: doc = BeautifulSoup(html, 'html5lib')
In [7]: print(doc.prettify())
<html>
<head>
<meta content="text/html" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
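To check how the parsers installed on a given machine treat these tags, a small comparison along the following lines can help. It is only a sketch: meta_nesting is a made-up helper name, and lxml and html5lib are optional packages, so any parser that is not installed is simply recorded as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = '<head><meta content="text/html" http-equiv="Content-Type"><meta charset="utf-8"></head>'


def meta_nesting(parser):
    """Return (total meta tags, meta tags nested inside another meta)."""
    doc = BeautifulSoup(html, parser)
    metas = doc.find_all("meta")
    nested = [m for m in metas if m.find_parent("meta") is not None]
    return len(metas), len(nested)


results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = meta_nesting(parser)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        results[parser] = None
    print(parser, results[parser])
```

A nested count of 0 means the parser produced sibling meta tags; a nonzero count reproduces the nesting shown in the question.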
