I have an HTML document that contains:
<b>
<p align="left">TXT1</p>
</b>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>
When I do:
import urllib
from BeautifulSoup import BeautifulSoup
# Fetch the page and parse it (Python 2, BeautifulSoup 3)
html = urllib.urlopen('url')
htmlr = html.read()
soup = BeautifulSoup(htmlr)
print soup
I get something different:
<p align="left">TXT1</p>
<p align="left">NR1 <b>TXT2</b> TXT3 <b>TXT4</b>
TXT5</p>
I am analyzing HTML document layout, so losing tags is quite frustrating. Why is this happening, and what's the best way to stop it? Help much appreciated!
EDIT: I need to handle badly formed HTML documents for information extraction purposes. If their creator wanted some text to be rendered bold, I have to take that into account, even if the person created invalid HTML.
The HTML is invalid. You can't have a <p> inside a <b>. BeautifulSoup is attempting to perform error recovery (as do browsers).
The best way to stop it is to fix the HTML.
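If fixing the source HTML is not an option, one possible workaround is a parser that does not rearrange this particular nesting. The sketch below is an assumption on my part, not what the question uses: it relies on Beautiful Soup 4 (bs4) with Python's built-in "html.parser", and is_bold is a hypothetical helper name. It checks whether a paragraph sits inside a <b> element.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question.
html = '<b><p align="left">TXT1</p></b><p align="left"><b>NR1</b> TXT3</p>'

# Assumption: bs4 with the built-in "html.parser" backend, which keeps
# the <b> wrapper around the first <p> instead of dropping it.
soup = BeautifulSoup(html, "html.parser")


def is_bold(tag):
    """Hypothetical helper: True if the tag is a <b> or has a <b> ancestor."""
    return tag.name == "b" or tag.find_parent("b") is not None


# Pair each paragraph's text with whether the whole paragraph is bold.
bold_flags = [(p.get_text(), is_bold(p)) for p in soup.find_all("p")]
print(bold_flags)
```

Other parser backends may repair the nesting differently, so it is worth checking the tree you actually get before relying on this.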
HTML Tidy appears to correctly repair the invalid HTML. They have a web implementation of it here: http://infohound.net/tidy/
I entered:
<b><p>hello world</p></b>
and got this result:
<p><b>hello world</b></p>
There appears to be a Python version here:
http://www.egenix.com/products/python/mxExperimental/mxTidy/
You could try html5lib instead of BeautifulSoup. Html5lib implements the HTML5 parser algorithm, so it should produce the same DOM as a modern browser does.
Disclaimer: I've not tried the html5lib parser myself, so I don't know its current stability level.
Same as Quentin suggested.
If you want the <p> element to be bold, then use inline CSS instead of the <b> tag.
<p style='font-weight:bold;' align="left">TXT1</p>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>
I'm trying to understand if there's a relatively simple way to take an HTML string and "insert" it inside a different HTML string. I tried converting the HTML into a simple DIV and putting it in the first HTML, but that didn't work and caused weird failures.
Some more info: I'm creating a report using bokeh, and have some figures. My code creates some figures and appends them to a list, which is eventually parsed into HTML and saved on my PC. What I want to do is read a different HTML string and append it in its entirety to my report.
You can do that with BeautifulSoup. See this example:
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<html><body><p>my paragraph</p></body></html>", "html.parser")
body = soup.find("body")
new_tag = soup.new_tag("a", href="http://www.example.com")
body.append(new_tag)
another_new_tag = soup.new_tag("p")
another_new_tag.insert(0, NavigableString("bla bla, and more bla"))
body.append(another_new_tag)
print(soup.prettify())
The result is:
<html>
<body>
<p>
my paragraph
</p>
<a href="http://www.example.com">
</a>
<p>
bla bla, and more bla
</p>
</body>
</html>
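The same approach can take an existing HTML string rather than building tags by hand. This is a minimal sketch, assuming the extra HTML has a single top-level element; the variable names and the sample fragment are made up for illustration.

```python
from bs4 import BeautifulSoup

# The report we already have, and a second HTML string to embed in it.
report = BeautifulSoup(
    "<html><body><p>my paragraph</p></body></html>", "html.parser"
)
extra_html = "<div><h2>Extra section</h2><p>more text</p></div>"

# Parse the second string on its own, then move its top-level <div>
# (assumed to be the only top-level element) into the report's <body>.
fragment = BeautifulSoup(extra_html, "html.parser")
report.body.append(fragment.div)

result = str(report)
print(result)
```

Appending a tag that belongs to another soup moves it, so fragment no longer contains the <div> afterwards.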
So what I was looking for, and what solves my problem, is simply using an iframe with the srcdoc attribute.
iframe = '<iframe srcdoc="%s"></iframe>' % raw_html
Then I can push this iframe into the original HTML wherever I want.
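One caveat with the srcdoc approach: if raw_html contains double quotes or ampersands, the attribute value will be cut short or misread unless it is escaped first. The sketch below uses html.escape from the standard library; make_srcdoc_iframe is a hypothetical helper name and the sample raw_html is made up.

```python
from html import escape


def make_srcdoc_iframe(raw_html):
    """Hypothetical helper: wrap an HTML string in an iframe via srcdoc.

    The markup is escaped so that quotes and ampersands inside it
    cannot terminate or corrupt the attribute value.
    """
    return '<iframe srcdoc="%s"></iframe>' % escape(raw_html, quote=True)


raw_html = '<p class="note">Fish & chips</p>'
iframe = make_srcdoc_iframe(raw_html)
print(iframe)
```

The browser unescapes the attribute value before rendering it, so the embedded document still displays as the original markup.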
I am trying an example from the Beautiful Soup docs and found it acting weird. When I try to access the next_sibling value, I get '\n' instead of the "body" tag.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.head.next_sibling
u'\n'
I am using the latest version of BeautifulSoup 4, i.e. 4.3.2. Please help me out.
Thanks in advance.
There are 3 kinds of objects that BeautifulSoup "sees" in the HTML:
Tag
NavigableString
Comment
When you access .next_sibling, it returns the next object after the current one, which in your case is a text node (NavigableString). This is explained in the documentation.
If you want to find the next Tag after the current one, use find_next_sibling(), or, to specify the tag name, find_next_sibling("body").
You can also use the "next sibling" CSS Selector:
soup.select("head + *")
Try this:
soup.head.find_next_sibling()
or
soup.head.next_sibling.next_sibling
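To see the difference side by side, here is a small sketch on a shortened document. It assumes bs4 with the built-in "html.parser"; the sample markup is made up.

```python
from bs4 import BeautifulSoup

# A newline sits between </head> and <body>, as in the docs example.
html_doc = "<html><head><title>T</title></head>\n<body><p>text</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# .next_sibling returns whatever node comes next, here the newline text node.
raw_sibling = soup.head.next_sibling

# find_next_sibling() skips text nodes and returns the next Tag.
tag_sibling = soup.head.find_next_sibling()

print(repr(raw_sibling), tag_sibling.name)
```

Chaining .next_sibling.next_sibling only works when exactly one text node separates the two tags, so find_next_sibling() is the less fragile choice.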
I have a Tag which is available to me as a string only.
Example: tag_str = '<a>hello</a>'
When I do the following:
template_logo_h1_tag.insert(0, tag_str)
where template_logo_h1_tag is an h1 tag, the resulting template_logo_h1_tag is
<h1 id="logo">&lt;a&gt;hello&lt;/a&gt;</h1>
I want to avoid this HTML escaping
and the resulting tag to be
<h1 id="logo"><a>hello</a></h1>
Is there anything I am missing?
I tried BeautifulSoup.HTML_ENTITIES, but that is for unescaping strings that are already HTML-escaped.
It would be great if you could help me out!
I found a dirty hack:
template_logo_h1_tag.insert(0, BeautifulSoup('<a>hello</a>').a)
I think you are looking for Beautiful Soup's .append method: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#append
Coupled with the factory method for creating a new tag: soup.new_tag()
Updating with code:
soup=BeautifulSoup('<h1 id="logo"></h1>')
template_logo_h1_tag=soup.h1
newtag=soup.new_tag("a")
newtag.append("hello")
template_logo_h1_tag.append(newtag)
Then
print(soup.prettify())
yields
<h1 id="logo">
<a>
hello
</a>
</h1>
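When the tag really is only available as a string, parsing it first avoids the escaping. The sketch below assumes bs4 with "html.parser" and assumes the string is '<a>hello</a>', as the desired output suggests. It contrasts inserting the raw string with inserting the parsed tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1 id="logo"></h1>', "html.parser")
h1 = soup.h1
tag_str = "<a>hello</a>"  # assumed value of the string from the question

# Inserting the raw string: it is treated as text, so it is escaped on output.
h1.insert(0, tag_str)
escaped = str(h1)

# Start over, then parse the string and insert the resulting Tag instead.
h1.clear()
fragment = BeautifulSoup(tag_str, "html.parser")
h1.insert(0, fragment.a)
parsed = str(h1)

print(escaped)
print(parsed)
```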
OK. I have a massive HTML file, and I only want the text that occurs between the tags
<center><span style="font-size: 144%;"></span></center>
and
<dl> <dd><i></i></dd> </dl>
I am using Python 2.6 and BeautifulSoup, but I have no idea where to begin. I assume it's not difficult?
Try something like:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(YOUR_HTML)
# Collect every text node in the document
texts = soup.findAll(text=True)
print texts
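Note that the snippet above collects every text node in the document. To keep only the text between the two marker elements, one option is to walk the document from the first marker until the second is reached. This is a rough sketch assuming bs4 with "html.parser" and assuming the first <center> and the first <dl> are the markers; the sample HTML is made up.

```python
from bs4 import BeautifulSoup, NavigableString

html = (
    '<center><span style="font-size: 144%;">Title</span></center>'
    "<p>First paragraph.</p><p>Second <i>one</i>.</p>"
    "<dl><dd><i>Footer</i></dd></dl><p>After</p>"
)
soup = BeautifulSoup(html, "html.parser")

start = soup.find("center")  # assumed start marker
end = soup.find("dl")        # assumed end marker

texts = []
for node in start.next_elements:
    if node is end:
        break  # stop as soon as the end marker is reached
    # Keep text nodes, but skip the ones inside the start marker itself.
    inside_start = any(parent is start for parent in node.parents)
    if isinstance(node, NavigableString) and not inside_start:
        texts.append(str(node))

print(texts)
```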
<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want the description after scraping it:
My Name is Alis I am a python learner...
I tried a lot of things but could not figure out the best way. Can you give a general solution for this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Your sample html here")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (the unicode string, with any applicable whitespace, it should be noted).
This works by parsing the html, grabbing the first td tag and its contents, grabbing any div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
In BeautifulSoup, there are A LOT of ways to do this, so this answer probably hasn't taught you much and I genuinely recommend you read the tutorial that David suggested.
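As one of those alternatives, the lookup can be anchored on the "My Description" label instead of on positions in the tree, which tends to survive small layout changes. This is a sketch assuming bs4 (version 4.4 or later, for the string argument) with "html.parser", on a trimmed copy of the sample HTML.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>"""
soup = BeautifulSoup(html, "html.parser")

# Locate the <b> label, then take the first non-empty text node after it.
label = soup.find("b", string="My Description")
description = next(
    sibling.strip()
    for sibling in label.next_siblings
    if isinstance(sibling, NavigableString) and sibling.strip()
)

print(description)
```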
Have you tried reading the examples provided in the documentation? The quick start is located here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find the div with class "class-a", you would load your HTML and search for it like this:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("My html here")
myDiv = soup.find("div", { "class" : "class-a" })
Also remember that you can do most of this in the Python console, using dir() and help() to walk through what you're trying to do. It might make life easier to try IPython or Python's IDLE, which have very friendly consoles for beginners.