I'm trying to understand whether there is a relatively simple way to take an HTML string and insert it inside a different HTML string. I tried converting the HTML into a simple div and putting it in the first HTML, but that didn't work and caused weird failures.
Some more info: I'm creating a report using Bokeh and have some figures. My code creates the figures and appends them to a list, which is eventually rendered to HTML and saved on my PC. What I want to do is read a different HTML string and append it in its entirety to my report.
You can do that with BeautifulSoup. See this example:
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<html><body><p>my paragraph</p></body></html>", "html.parser")  # name the parser explicitly
body = soup.find("body")
new_tag = soup.new_tag("a", href="http://www.example.com")
body.append(new_tag)
another_new_tag = soup.new_tag("p")
another_new_tag.insert(0, NavigableString("bla bla, and more bla"))
body.append(another_new_tag)
print(soup.prettify())
The result is:
<html>
 <body>
  <p>
   my paragraph
  </p>
  <a href="http://www.example.com">
  </a>
  <p>
   bla bla, and more bla
  </p>
 </body>
</html>
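If the content to add already exists as an HTML string, it may be simpler to parse that string and move its nodes into the report than to rebuild each tag by hand. A minimal sketch, assuming bs4 is installed; extra_html is a made-up sample string, not something from the original report:

```python
from bs4 import BeautifulSoup

report = BeautifulSoup("<html><body><p>my paragraph</p></body></html>", "html.parser")

# Hypothetical HTML string to embed (stands in for the second report).
extra_html = '<div id="extra"><h2>Extra</h2><p>more text</p></div>'
fragment = BeautifulSoup(extra_html, "html.parser")

# append() moves a node out of its old tree, so iterate over a copy of the list.
for node in list(fragment.contents):
    report.body.append(node)

print(report)
```

If the second string is a complete document with its own html and body tags, moving fragment.body.contents instead of fragment.contents avoids nesting one document inside the other.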
What I was looking for, and what solved my problem, is simply an iframe with the srcdoc attribute:
import html
iframe = '<iframe srcdoc="%s"></iframe>' % html.escape(raw_html)  # escape quotes and & so the attribute value stays intact
I can then put this iframe into the original HTML wherever I want.
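One thing to watch with srcdoc: an attribute value delimited by double quotes ends at the first unescaped double quote, so the embedded document generally needs escaping before it is interpolated. A minimal sketch using the standard library's html.escape; raw_html and report are made-up sample strings:

```python
import html

# Hypothetical sample documents, not taken from the original report.
raw_html = '<p class="note">Tom & Jerry</p>'
report = "<html><body><h1>Report</h1>{embedded}</body></html>"

# html.escape() escapes &, <, > and, by default, both kinds of quotes.
iframe = '<iframe srcdoc="%s"></iframe>' % html.escape(raw_html)

full_report = report.format(embedded=iframe)
print(full_report)
```

The browser unescapes the attribute value before rendering it, so the iframe still shows the original markup.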
I want to fetch the source of only one section of a web page instead of the whole page, since that would be faster than loading the whole page and then parsing the section out. I tried passing the section link as the URL, but I still get the whole page.
import requests
url = 'https://stackoverflow.com/questions/19012495/smooth-scroll-to-div-id-jquery/#answer-19013712'
response = requests.get(url)
print(response.text)
You cannot fetch a specific section directly with the requests API, but you can use BeautifulSoup to extract it once the page is downloaded.
A small sample from the Dataquest website:
import requests
from bs4 import BeautifulSoup
# Download the page first, then parse it.
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Running the above script outputs this HTML string:
<html>
<head>
<title>A simple example page
</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p><p class="inner-text">
Second paragraph.
</p></div>
<p class="outer-text first-item" id="second"><b>
First outer paragraph.
</b></p><p class="outer-text"><b>
Second outer paragraph.
</b>
</p>
</body>
</html>
You can get specific section by finding it through tag type, class or id.
By tag-type:
soup.find_all('p')
By class:
soup.find_all('p', class_='outer-text')
By Id:
soup.find_all(id="first")
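HTTP will not let you do that: the fragment (the part of the URL after #) is handled by the browser and is never sent to the server, so the server always returns the whole page.
You can use the Stack Exchange API instead. Pass the answer id, 19013712, and you get only that specific answer back.
Note that you may still have to register for an app key.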
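A rough sketch of what such an API request could look like, using only the standard library. The API version (2.3), the withbody filter, and the helper names build_answer_url and fetch_answer_body are assumptions made for illustration; check the API documentation before relying on them:

```python
import gzip
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.stackexchange.com/2.3"  # assumed API version


def build_answer_url(answer_id, site="stackoverflow"):
    """Build the URL that asks for a single answer, including its body."""
    query = urllib.parse.urlencode({"site": site, "filter": "withbody"})
    return "%s/answers/%d?%s" % (API_ROOT, answer_id, query)


def fetch_answer_body(answer_id):
    """Download one answer and return its HTML body (needs network access)."""
    with urllib.request.urlopen(build_answer_url(answer_id)) as response:
        raw = response.read()
        # The API compresses its responses; urllib does not decompress them.
        if response.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    payload = json.loads(raw.decode("utf-8"))
    return payload["items"][0]["body"]


print(build_answer_url(19013712))
```

Calling fetch_answer_body(19013712) would then return just that answer's HTML, which can be parsed further if needed.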
Here is my output from a webpage. After
soup = BeautifulSoup(data)
I have this:
<html>
<body>
<p>EXCHANGE%3DNSE
MARKET_OPEN_MINUTE=555
MARKET_CLOSE_MINUTE=930
INTERVAL=900 COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME,CDAYS DATA=TIMEZONE_OFFSET=330 a1497240000,1634.7,1648.85,1633.85,1641.95,171301,0,1,1635.7,1644.45,1634.35,1634.7,50969,02,1640.05,1640.4,1635.5,1635.5,131752,0
The entire text is inside the <p> tag, so I took that tag, used data.split(), and sliced the resulting strings into lines again. I'm not sure this is the most efficient way, but I only need one particular value. I will look at the regex. Thanks for the input.
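For pulling a single value out of that text, a regular expression might look roughly like this. It is only a sketch: the sample text is copied from the output above, and header_value is a hypothetical helper name:

```python
import re

# Header portion of the <p> text shown above.
text = """EXCHANGE%3DNSE
MARKET_OPEN_MINUTE=555
MARKET_CLOSE_MINUTE=930
INTERVAL=900 COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME,CDAYS"""


def header_value(key, text):
    """Return the value that follows 'KEY=' up to the next whitespace, or None."""
    match = re.search(r"\b%s=(\S+)" % re.escape(key), text)
    return match.group(1) if match else None


open_minute = int(header_value("MARKET_OPEN_MINUTE", text))
columns = header_value("COLUMNS", text).split(",")
print(open_minute, columns)
```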
I have a tag that is available to me only as a string.
Example: tag_str = '<a>hello</a>'
When I do the following:
template_logo_h1_tag.insert(0, tag_str)
where template_logo_h1_tag is an h1 tag, the resulting template_logo_h1_tag is
<h1 id="logo">&lt;a&gt;hello&lt;/a&gt;</h1>
I want to avoid this HTML escaping, so that the resulting tag is
<h1 id="logo"><a>hello</a></h1>
Is there anything I am missing?
I tried BeautifulSoup.HTML_ENTITIES, but that is for unescaping strings that are already HTML-escaped.
It would be great if you could help me out!
I found a dirty hack:
template_logo_h1_tag.insert(0, BeautifulSoup('<a>hello</a>').a)
I think you are looking for Beautiful Soup's .append method: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#append
Coupled with the factory method for creating a new tag: soup.new_tag()
Updating with code:
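from bs4 import BeautifulSoup
soup = BeautifulSoup('<h1 id="logo"></h1>', 'html.parser')
template_logo_h1_tag = soup.h1
newtag = soup.new_tag("a")   # create an empty <a> tag
newtag.append("hello")       # add its text content
template_logo_h1_tag.append(newtag)
Then
print(soup.prettify())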
yields
<h1 id="logo">
 <a>
  hello
 </a>
</h1>
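When the tag really is only available as a string, parsing that string first and appending the resulting element avoids the escaping. A small sketch contrasting the two cases, assuming bs4 is installed and using the tag string from the question:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1 id="logo"></h1>', "html.parser")
tag_str = "<a>hello</a>"

# A plain string is inserted as text, so its angle brackets are escaped on output.
soup.h1.append(tag_str)
escaped = str(soup)

# Parsing the string first yields a real <a> element that can be appended as a tag.
soup.h1.clear()
parsed = BeautifulSoup(tag_str, "html.parser")
soup.h1.append(parsed.a)
unescaped = str(soup)

print(escaped)
print(unescaped)
```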
I have a html that contains:
<b>
<p align="left">TXT1</p>
</b>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>
When I do:
import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
html = urllib.urlopen('url')
htmlr = html.read()
soup = BeautifulSoup(htmlr)
print soup
I get something different:
<p align="left">TXT1</p>
<p align="left">NR1 <b>TXT2</b> TXT3 <b>TXT4</b>
TXT5</p>
I am analyzing HTML document layout, so losing tags is quite frustrating. Why is this happening, and what is the best way to stop it? Help much appreciated!
EDIT: I need to handle badly formed HTML documents for information extraction purposes. If their creator wanted some text to be rendered bold, I have to take that into account, even if the person wrote invalid HTML.
The HTML is invalid. You can't have a <p> inside a <b>. BeautifulSoup is attempting to perform error recovery (as do browsers).
The best way to stop it is to fix the HTML.
HTML Tidy appears to correctly repair the invalid HTML. They have a web implementation of it here: http://infohound.net/tidy/
I entered:
<b><p>hello world</p></b>
and got this result:
<p><b>hello world</b></p>
There appears to be a Python version here:
http://www.egenix.com/products/python/mxExperimental/mxTidy/
You could try html5lib instead of BeautifulSoup. Html5lib implements the HTML5 parser algorithm, so it should result in producing the same DOM as a modern browser does.
Disclaimer: I've not tried the html5lib parser myself, so I don't know its current stability level.
Same as Quentin suggested.
If you want the <p> element to be bold, use inline CSS instead of the <b> tag.
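As a point of comparison, the newer bs4 package combined with Python's built-in html.parser does not seem to rearrange this kind of nesting. Note that this sketch swaps in bs4 and html.parser rather than using html5lib directly, and it assumes bs4 is installed:

```python
from bs4 import BeautifulSoup

# First fragment from the question: a <p> nested inside a <b>.
html = '<b><p align="left">TXT1</p></b>'

soup = BeautifulSoup(html, "html.parser")

# The <b> wrapper is still the parent of the <p> after parsing.
bold_paragraph = soup.b.p
print(soup)
print(bold_paragraph.get_text())
```

With html5lib installed, passing "html5lib" instead of "html.parser" as the second argument selects that parser.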
<p style='font-weight:bold;' align="left">TXT1</p>
<p align="left">
<b>NR1</b>
<b>TXT2</b>
TXT3
<b>TXT4</b>
TXT5
</p>
OK. I have a massive HTML file, and I only want the text that occurs between the tags
<center><span style="font-size: 144%;"></span></center>
and
<dl> <dd><i></i></dd> </dl>
I am using Python 2.6 and BeautifulSoup, but I have no idea where to begin. I'm assuming it's not difficult?
Try something like:
import BeautifulSoup  # BeautifulSoup 3 (Python 2)
soup = BeautifulSoup.BeautifulSoup(YOUR_HTML)
texts = soup.findAll(text=True)  # every text node in the document
print texts
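That returns every text node in the file. If the goal is only the text inside those two kinds of elements, something along these lines might work with the newer bs4 package. This is a sketch under that reading of the question; the sample HTML and its text are invented:

```python
from bs4 import BeautifulSoup

# Invented sample that mirrors the two tag patterns from the question.
html = """
<p>ignore me</p>
<center><span style="font-size: 144%;">Title text</span></center>
<dl> <dd><i>Detail text</i></dd> </dl>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of every <span> that sits inside a <center>.
titles = [span.get_text(strip=True)
          for center in soup.find_all("center")
          for span in center.find_all("span")]

# Text of every <i> that sits inside a <dl>.
details = [i.get_text(strip=True)
           for dl in soup.find_all("dl")
           for i in dl.find_all("i")]

print(titles, details)
```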