How to concatenate two html file bodies with BeautifulSoup? - python

I need to concatenate the bodies of two html files into one html file, with a bit of arbitrary html as a separator in between. I have code that used to work for this, but stopped working when I upgraded from Xubuntu 11.10 (or was it 11.04?) to 12.10, probably due to a BeautifulSoup update (I'm currently using 3.2.1; I don't know what version I had previously) or to a vim update (I use vim to auto-generate the html files from plaintext ones). This is the stripped-down version of the code:
from BeautifulSoup import BeautifulSoup
soup_original_1 = BeautifulSoup(''.join(open('test1.html')))
soup_original_2 = BeautifulSoup(''.join(open('test2.html')))
contents_1 = soup_original_1.body.renderContents()
contents_2 = soup_original_2.body.renderContents()
contents_both = contents_1 + "\n<b>SEPARATOR\n</b>" + contents_2
soup_new = BeautifulSoup(''.join(open('test1.html')))
while len(soup_new.body.contents):
    soup_new.body.contents[0].extract()
soup_new.body.insert(0, contents_both)
The bodies of the two input files used for the test case are very simple: contents_1 is '\n<pre>\nFile 1\n</pre>\n' and contents_2 is '\n<pre>\nFile 2\n</pre>\n'.
I would like soup_new.body.renderContents() to be a concatenation of those two with the separator text in between, but instead all the <'s get turned into &lt; etc. The desired result is '\n<pre>\nFile 1\n</pre>\n\n<b>SEPARATOR\n</b>\n<pre>\nFile 2\n</pre>\n', which is what I used to get prior to the OS update; the current result is '\n&lt;pre&gt;\nFile 1\n&lt;/pre&gt;\n\n&lt;b&gt;SEPARATOR\n&lt;/b&gt;\n&lt;pre&gt;\nFile 2\n&lt;/pre&gt;\n', which is pretty useless.
How do I make BeautifulSoup stop turning < into &lt; etc. when inserting html as a string into a soup object's body? Or should I just be doing this in an entirely different way? (This is my only experience with BeautifulSoup and most other html parsing, so I'm guessing this may well be the case.)
The html files are automatically generated from plaintext files with vim (the real cases I use are obviously more complicated, and involve custom syntax highlighting, which is why I'm doing it this way at all). The full test1.html file looks like this, and test2.html is identical except for contents and title.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>~/programs/lab_notebook_and_printing/concatenate-html_problem_2013/test1.txt.html</title>
<meta name="Generator" content="Vim/7.3" />
<meta name="plugin-version" content="vim7.3_v10" />
<meta name="syntax" content="none" />
<meta name="settings" content="ignore_folding,use_css,pre_wrap,expand_tabs,ignore_conceal" />
<style type="text/css">
pre { white-space: pre-wrap; font-family: monospace; color: #000000; background-color: #ffffff; white-space: pre-wrap; word-wrap: break-word }
body { font-family: monospace; color: #000000; background-color: #ffffff; font-size: 0.875em }
</style>
</head>
<body>
<pre>
File 1
</pre>
</body>
</html>

Trying to read the HTML as text just to insert it into other HTML, fighting the encoding and decoding in both directions, creates a whole lot of extra work that's very difficult to get right.
The easy thing to do is just not do that. You want to insert everything in the body of test2 after everything in the body of test1, right? So just do that:
for element in soup_original_2.body:
    soup_original_1.body.append(element)
To append a separator first, just do the same thing with the separator:
b = soup_original_1.new_tag('b')
b.append('SEPARATOR')
soup_original_1.body.append(b)
for element in soup_original_2.body:
    soup_original_1.body.append(element)
That's it.
See the documentation section Modifying the tree for a tutorial that covers all of this.

As mentioned in the comments on the answer by abarnert, there is a problem with append: appending an element moves it out of its original tree, which disturbs the iteration over the source body.
This answer by Martijn Pieters does the job.
From BeautifulSoup 4.4 onwards (released July 2015), you can copy each element before appending it:
import copy
document2.body.append(copy.copy(element))
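Putting the pieces together, here is a minimal sketch of the copy-based merge, assuming BeautifulSoup 4.4+ and the two test files from the question:
import copy
from bs4 import BeautifulSoup

with open('test1.html') as f1, open('test2.html') as f2:
    soup1 = BeautifulSoup(f1.read(), 'html.parser')
    soup2 = BeautifulSoup(f2.read(), 'html.parser')

# Append the separator first.
b = soup1.new_tag('b')
b.append('SEPARATOR')
soup1.body.append(b)

# copy.copy() produces a detached copy, so appending it never
# re-parents the original element and the iteration stays intact.
for element in soup2.body.contents:
    soup1.body.append(copy.copy(element))

print(soup1.body)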

I was having trouble with my html documents and looping through the elements. I found that BeautifulSoup just wasn't successfully parsing some of my HTML files. I ended up inserting a tag around all of the elements inside the body tag:
<body><span id="entirebody">
:
</span></body>
This meant that all of the elements were encompassed in one span element and processed successfully. I want to dig into exactly what is happening when I don't do this, but it is one way to get around problems you might encounter.
import re

def insertSpan(htmlString):
    '''
    Insert a span tag around all of the body contents:
    <body><span id="entirebody">....</span></body>
    '''
    subRe = re.compile(r'(<body>)(.*)(<\/body>)', re.DOTALL)
    htmlString = subRe.sub(r'\g<1><span id="entirebody">\g<2></span>\g<3>', htmlString)
    return htmlString
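Hypothetical usage (the file name here is a placeholder) would be to wrap the body before handing the markup to BeautifulSoup:
from bs4 import BeautifulSoup

html = insertSpan(open('page.html').read())
soup = BeautifulSoup(html, 'html.parser')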

Related

How do I distinguish between XML and HTML programmatically in Python?

I am sending an http request and get an http response, but I'd like to be able to extract the body of the response and know whether it contains XML or HTML.
Ideally, this method should work even if the content type isn't clear in the response (i.e. it should work for websites where the content type isn't necessarily specified).
Currently, I'm using lxml to parse the html/xml, but don't know at parse time whether I'm dealing with HTML or XML.
You can check the content-type header to know which type of response you got:
import requests
respond = requests.get(URL)
file_type = respond.headers['content-type']
print(file_type)
>>>'text/html; charset=utf-8'
You can also do
print(file_type.split(';')[0].split('/')[1])
to get "html" or "xml" as output
I don't understand why you would like to do it, and I'm sure there is a better way to do it. But...
The difference between XML and HTML is the declaration: HTML must start with <!DOCTYPE HTML>, and XML with <?xml version="1.0"?>.
Example of XML
<?xml version="1.0"?>
<address>
<name> Krishna Rungta</name>
<contact>9898613050</contact>
<email>krishnaguru99@gmail.com </email>
<birthdate>1985-09-27</birthdate>
</address>
Example of HTML
<!DOCTYPE html>
<html>
<head>
<title> Page title </title> </head>
<body>
<h1> First Heading</h1> <p> First paragraph.</p> </body>
</html>
If I were you, I would use BeautifulSoup to select the DOCTYPE, and if you can't find/select it, that means it is XML. You can see how to do that here.
If this doesn't answer your question, try reading this or try using this library.
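A rough sketch of that doctype-sniffing idea, assuming bs4 (the heuristics are illustrative, not exhaustive):
from bs4 import BeautifulSoup
from bs4.element import Doctype

def sniff_markup_type(text):
    # An XML declaration at the top is a strong signal for XML.
    if text.lstrip().startswith('<?xml'):
        return 'xml'
    soup = BeautifulSoup(text, 'html.parser')
    # Look for an HTML doctype among the top-level nodes.
    for item in soup.contents:
        if isinstance(item, Doctype) and 'html' in item.lower():
            return 'html'
    # Fall back: an <html> root tag also suggests HTML.
    return 'html' if soup.find('html') else 'xml'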

How to remove unwanted html code from strings in python list?

I'm currently trying to clean items on my python list which look like this at the moment:
emails = ['<div dir="auto">Hi,</div><div dir="auto"><br></div><div dir="auto"> I would like to ask about. Some more text here...<br></div><div dir="auto"><br></div><div dir="auto">Regards</div> <br> <span style="color:rgb(34,34,34);font-family:Arial,sans-serif;font-size:12.8px;background-color:rgb(255,255,255)"></span><html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <title>My Title</title> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <style type="text/css">/* e-mail bugfixes */ #outlook a {padding:0 0 0 0;} .ReadMsgBody {width:100%;} .ExternalClass {width:100%; line-height:100%;} sup, sub {vertical-align:baseline; position:relative; top:-0.4em;} sub {top:0.4em;} /* General classes */ body {width:100% !important; margin:0;...',...]
My goal is to ideally keep only the text that is in between the html tags and remove all of the html code for each item on the list.
I've been trying to use BeautifulSoup; however, it mostly only removed the brackets from the html tags and kept the rest, instead of giving me only the actual email content:
noHtml = []
for x in emails:
    soup = BeautifulSoup(x)
    noHtml.append(soup.get_text())
Would someone be able to help with this? Any possible way I can achieve it in Python? (not necessarily with BS) Thanks in advance!
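One likely reason get_text() returns so much noise is that the CSS inside <style> tags is itself text. A minimal sketch, assuming bs4, that removes the code-bearing tags before extracting the text:
from bs4 import BeautifulSoup

noHtml = []
for x in emails:
    soup = BeautifulSoup(x, 'html.parser')
    # Drop tags whose text content is code, not prose.
    for tag in soup(['style', 'script', 'head', 'title']):
        tag.decompose()
    noHtml.append(soup.get_text(separator=' ', strip=True))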

Regex in Python - find all stylesheets in html

This is part of my html code:
<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />
I have to find all hrefs of stylesheets.
I tried to use regular expression like
<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>
The full code is
body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />'''
real_viraz = '''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''
r = re.findall(real_viraz, body, re.I|re.DOTALL)
print r
But the problem is that rel='stylesheet' and href='' can be in any order in <link ...>, and there can be almost anything between them.
Please help me to find the right regular expression. Thanks.
Somehow, your name looks like a power automation tool Sikuli :)
If you are trying to parse HTML/XML-based text in Python, BeautifulSoup (see its documentation) is an extremely powerful library to help you with that. Otherwise, you are indeed reinventing the wheel (an interesting story from Randy Sargent).
from bs4 import BeautifulSoup
# in case you need to get the page first.
#import urllib2
#url = "http://selenium-python.readthedocs.org/en/latest/"
#text = urllib2.urlopen("url").read()
text = """<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" /><link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' /><link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />"""
soup = BeautifulSoup(text)
links = soup.find_all("link", {"rel":"stylesheet"})
for link in links:
    try:
        print link['href']
    except KeyError:
        pass
the output is:
catalog/view/theme/default/stylesheet/stylesheet.css
http://1
http://2
Learn beautifulsoup well and you are 100% ready for parsing anything in HTML or XML.
(You might also want to put Selenium, Scrapy into your toolbox in the future.)
Short answer: Don't use regular expressions to parse (X)HTML, use a (X)HTML parser.
In Python, this would be lxml. You could parse the HTML using lxml's HTML Parser, and use an XPath query to get all the link elements, and collect their href attributes:
from lxml import etree
parser = etree.HTMLParser()
doc = etree.parse(open('sample.html'), parser)
links = doc.xpath("//head/link[@rel='stylesheet']")
hrefs = [l.attrib['href'] for l in links]
print hrefs
Output:
['catalog/view/theme/default/stylesheet/stylesheet.css', 'http://1', 'http://2']
I'm amazed by the many developers here on Stack Exchange who insist on using outside modules over the RE module for obtaining data and parsing strings, HTML and CSS. Nothing works more efficiently or faster than RE.
These two lines not only grab the CSS stylesheet path but also grab several if there is more than one CSS stylesheet, placing them into a nice Python list for processing and/or for a urllib request method.
a = re.findall('link rel="stylesheet" href=".*?"', t)
a = str(a)
Also, for those unaware of native C's use of what most developers know as the HTML comment-out lines:
<!-- stuff here -->
This allows anything in RE to process and grab data at will from HTML or CSS, and/or to remove chunks of pesky JavaScript for testing browser capabilities in a single iteration, as shown below.
txt=re.sub('<script>', '<!--', txt)
txt=re.sub('</script>', '-->', txt)
txt=re.sub('<!--.*?-->', '', txt)
Python retains all the regular expressions from native C, so use them, people. That's what they're for, and nothing is as slow as Beautiful Soup and HTMLParser.
Use the RE module to grab all your data from HTML tags as well as CSS, or from anything a string can contain. And if you have a problem with a variable not being of type string, then make it a string with a single tiny line of code:
var=str(var)

Django output word files(.doc),only show raw html in the contents

I am writing a web app using Django 1.4. I want one of my views to output Microsoft Word docs using the following code:
response = HttpResponse(view_data, content_type='application/vnd.ms-word')
response['Content-Disposition'] = 'attachment; filename=file.doc'
return response
Then I can download file.doc successfully, but when I open the .doc file, I only find raw html like this:
<h1>some contents</h1>
not a heading 1 title.
I am new to Python & Django. I know this is maybe some problem with html escaping; can someone please help me with this?
Thank you! :)
Unless you have some method of converting your response (HTML here, I assume) to a .doc file, all you will get is a text file containing your response with the extension .doc. If you are willing to go for .docx files, there is a wonderful Python library called python-docx, built on the lxml library, that you should look into; it allows you to generate well-formed docx files.
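A minimal sketch of that route, assuming python-docx is installed (pip install python-docx) and using a hypothetical view name:
from io import BytesIO
from docx import Document
from django.http import HttpResponse

def download_docx(request):  # hypothetical view name
    document = Document()
    document.add_heading('some contents', level=1)
    document.add_paragraph('Body text goes here.')
    # Save into memory and hand the bytes to the response.
    buf = BytesIO()
    document.save(buf)
    response = HttpResponse(
        buf.getvalue(),
        content_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document')
    response['Content-Disposition'] = 'attachment; filename=file.docx'
    return response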
Alternatively, use a template such as:
<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
<style>
@page Section1 {size:595.45pt 841.7pt; margin:1.0in 1.25in 1.0in 1.25in;mso-header-margin:.5in;mso-footer-margin:.5in;mso-paper-source:0;}
div.Section1 {page:Section1;}
@page Section2 {size:841.7pt 595.45pt;mso-page-orientation:landscape;margin:1.25in 1.0in 1.25in 1.0in;mso-header-margin:.5in;mso-footer-margin:.5in;mso-paper-source:0;}
div.Section2 {page:Section2;}
</style>
</head>
<body>
<div class=Section2>
<!-- Section1: Portrait, Section2: Landscape -->
[your text here]
</div>
</body>
</html>
According to this asp.net forum post, this should make a valid .doc file when returned with mime type application/msword and the UTF-8 charset (so make sure your strings are all unicode).

Only Firefox displays HTML Code and not the page

I have this complicated problem that I can't find an answer to.
I have a Python HTTPServer running that serves webpages. These webpages are created at runtime with the help of Beautiful Soup. The problem is that Firefox shows the HTML code for the webpage and not the actual page. I really don't know what is causing this problem:
- Python HTTPServer
- Beautiful Soup
- HTML Code
In any case, I have copied parts of the webpage HTML:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>
My title
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script src="./123_ui.js">
</script>
</head>
<body>
<div>
Hellos
</div>
</body>
</html>
Just to help you, here are the things that I have already tried:
- I have made sure that Python HTTPServer is sending the MIME header as text/html
- Just copying and pasting the HTML code shows the correct page, as it's static. I can tell from this that the problem is on the HTTPServer side
- Firebug shows the element as empty, and "This element has no style rules. You can create a rule for it." is displayed
I just want to know if the error is in Beautiful Soup or HTTPServer or HTML?
Thanks,
Amit
Why are you adding this at the top of the document?
<?xml version="1.0" encoding="iso-8859-1"?>
That will make the browser think the entire document is XML and not XHTML. Removing that line should make it render correctly. I assume Firefox is displaying a page with a bunch of elements which you can expand/collapse to see the content like it normally would for an XML document, even though the HTTP headers might say it's text/html.
So guys,
I have finally solved this problem. The reason was that I wasn't sending the MIME header (even though I thought I was) with content type "text/html".
In Python HTTPServer, before writing anything to the file you should always do this:
self.send_response(200)
self.send_header("Content-type", "text/html")
self.end_headers()
# Once you have called the above methods, you can send the HTML to the client
self.wfile.write('ANY HTML CODE YOU WANT TO WRITE')