How to highlight text in complex html - python

I have an html that looks like this:
<h3>
Heading 3
</h3>
<ol>
<li>
<ol>
....
</li>
</ol>
I need to highlight the entire HTML starting from the first ol. I have found this solution:
soup = bs4.BeautifulSoup(open('temp.html').read(), 'lxml')
new_h1 = soup.new_tag('h1')
new_h1.string = 'Hello '
mark = soup.new_tag('mark')
mark.string = 'World'
new_h1.append(mark)
h1 = soup.h1
h1.replace_with(new_h1)
print(soup.prettify())
Is there any way to highlight entire html without having to find out the specific text?
Edit:
This is what I mean by highlighted text
Edit:
I have tried this code but it only highlights the very innermost li:
for node in soup2.findAll('li'):
    if not node.string:
        continue
    value = node.string
    mark = soup2.new_tag('mark')
    mark.string = value
    node.replace_with(mark)

This will highlight all the <li> content.
As I have no clear idea of what your HTML looks like, I have tried to highlight all the <li> content. You can modify this code to suit your requirements.
from bs4 import BeautifulSoup

with open('index.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

tag = soup.findAll('li')
# Highlights the <li> content
for li in tag:
    # li.string is None when the <li> contains nested tags, so skip those
    if li.string:
        newtag = soup.new_tag('mark')
        li.string.wrap(newtag)
print(soup)
After Highlighting: https://i.stack.imgur.com/iIbXk.jpg
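To highlight everything from the first ol onwards without naming specific text, one option is to wrap every non-empty text node under that ol in a mark tag. This is only a sketch: the sample HTML below is an assumption standing in for the elided markup in the question, and it uses find_all(string=True), which needs a reasonably recent bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real document in the question is elided.
html = """<h3>Heading 3</h3>
<ol><li>Outer item<ol><li>Inner item</li></ol></li></ol>"""

soup = BeautifulSoup(html, "html.parser")
first_ol = soup.find("ol")

# Collect the text nodes first: wrapping changes the tree being searched.
text_nodes = [s for s in first_ol.find_all(string=True) if s.strip()]
for s in text_nodes:
    s.wrap(soup.new_tag("mark"))

highlighted = str(soup)
print(highlighted)
```

Because the text nodes are wrapped in place, nested li elements keep their structure, and the heading before the ol is left untouched.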

Related

Is there a way to replace text in NavigableString object to tag object in beautifulsoup?

I have a sample html document.
html_doc = '''<html><body><div>
<h5>This is my heading 1</h5>
<p>I have some content here</p>
I am point one.\n\nI am point two.
<h5>Some more text here</h5> Some more text outside a tag.</div></body></html>'''
I'm trying to extract the text on lines 4 and 5 of the document, which sits outside any HTML tag, and convert it into p elements. I have tried this:
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(html_doc, 'html.parser')
div_tag = soup.div
for idx in range(len(div_tag.contents)):
    if type(div_tag.contents[idx]) == NavigableString:
        count = 0
        for a_str in div_tag.contents[idx].split('\n'):
            if a_str == '':
                continue
            else:
                count += 1
                tag = soup.new_tag("p")
                tag.string = a_str
                div_tag.contents[idx+count].insert_before(tag)
With the above code, I'm not able to convert the last NavigableString to a p tag. Also, the original NavigableString text stays in the tree. The desired output is:
<html><body><div>
<h5>This is my heading 1</h5>
<p>I have some content here</p>
<p>I am point one.</p>
<p>I am point two.</p>
<h5>Some more text here</h5>
<p>Some more text outside a tag.
</p></div></body></html>
You can use this example to wrap all lines that are outside html tags into <p>...</p>:
from bs4 import BeautifulSoup, NavigableString
html_doc = """<html><body><div>
<h5>This is my heading 1</h5>
<p>I have some content here</p>
I am point one.\n\nI am point two.
<h5>Some more text here</h5> Some more text outside a tag.</div></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")
# root tag of the text:
root_tag = soup.find("div")
# replace all strings that are "outside" in the root tag:
# iterate over a copy, because replace_with() changes root_tag.contents
for c in list(root_tag.contents):
    if isinstance(c, NavigableString) and c.strip():
        to_replace = [
            "<p>{}</p>".format(line)
            for line in map(str.strip, c.split("\n"))
            if line
        ]
        c.replace_with(
            BeautifulSoup("\n" + "\n".join(to_replace) + "\n", "html.parser")
        )
print(soup)
Prints:
<html><body><div>
<h5>This is my heading 1</h5>
<p>I have some content here</p>
<p>I am point one.</p>
<p>I am point two.</p>
<h5>Some more text here</h5>
<p>Some more text outside a tag.</p>
</div></body></html>
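A variant of the same idea builds the p tags with new_tag() instead of formatting an HTML string and re-parsing it, so characters such as < or & in the loose text are not interpreted as markup. This is a sketch on a shortened, hypothetical version of the sample document, not a replacement for the answer above.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical version of the sample document.
html_doc = """<div>
<h5>Heading</h5>
I am point one.\n\nI am point two.
<h5>More</h5> Trailing text.</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
root = soup.find("div")

# Iterate over a copy because the loop inserts and removes children.
for node in list(root.contents):
    if isinstance(node, NavigableString) and node.strip():
        for line in filter(None, map(str.strip, node.split("\n"))):
            p = soup.new_tag("p")
            p.string = line
            node.insert_before(p)
        node.extract()  # drop the original loose text

paragraphs = [p.get_text() for p in root.find_all("p")]
print(paragraphs)
```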

How to check if <a href> element exist in a <div> element?

The HTML looks like this:
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
I want to parse the text of every div and check whether it contains a URL; if it does, I also want to extract the URL and display it in the output.
The output should look like this:
Text of AAA
Display text of URL A
......AAA/url
Text of BBB
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
I tried nesting a find_all('a') loop inside the find_all('div') loop, but that messed up my output.
from bs4 import BeautifulSoup
html="""
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""
soup = BeautifulSoup(html, "lxml")
for div in soup.findAll('div'):
    print(div.text)
    try:
        print(div.find('a').text)
        print(div.find('a')["href"])
    except AttributeError:
        # div.find('a') returned None: this div has no link
        pass
Output
Text of AAADisplay text of URL A
Display text of URL A
......AAA/url
Text of BBBDisplay text of URL B
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
Don't know what your code looks like, but the basic idea is something like this:
data = soup.findAll('div')
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
        print(a.text)
will give you the URL and the text.
You can loop over the divs and then print the elements of each div's contents:
s = """
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a> .
</div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a> .
</div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""
from bs4 import BeautifulSoup as soup
for _text, *_next in map(lambda x:x.contents, soup(s, 'html.parser').find_all('div')):
    print(_text)
    if _next:
        print(_next[0].text)
        print(_next[0]['href'])
Output:
Text of AAA
Display text of URL A
......AAA/url
Text of BBB
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
This is just easier to read. You can also use it to get the expected output:
divs = soup.find_all('div')
for div in divs:
    print(div.contents[0]) # Text of AAA
    link = div.find('a')
    if link:
        print(link.text) # Display text of URL A
        print(link['href']) # ......AAA/url
Thanks all, I worked out the solution:
for h in ans_kin:
    links = ""
    link = h.find('a')
    if link:
        for l in link:
            links = h.text + link.get('href')
    else:
        links = h.text
    answer_kin.append(links)
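The answers above share one pattern: look for an a tag inside each div and branch on whether it was found. A minimal sketch of that pattern, collecting the results into a list of dicts, might look like this. The URLs, the dict keys, and the div with an anchor that has no href are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the URL and the href-less anchor are made up.
html = """
<div class="AAA">Text of AAA<a href="https://example.com/a">Link A</a></div>
<div class="BBB">Text of BBB<a name="no-href">Anchor only</a></div>
<div class="CCC">Text of CCC</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for div in soup.find_all("div"):
    # href=True only matches <a> tags that actually carry an href attribute
    link = div.find("a", href=True)
    rows.append({
        "text": str(div.contents[0]),
        "link_text": link.get_text() if link else None,
        "href": link["href"] if link else None,
    })

for row in rows:
    print(row)
```

Filtering on href=True avoids a KeyError on anchors without an href, which a plain div.find('a') followed by link['href'] would raise.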

Extracting li element and assigning it to variable with beautiful soup

Given the following element
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
How do I extract each li element and assign it to a variable with beautiful soup?
Currently, my code looks like this:
detail = car.find('ul', {'class': 'listing-key-specs'}).get_text(strip=True)
and it produces the following output:
2005 (05 reg)Saloon66,038 milesManual1.8L118 bhpPetrol
Please refer to the following question for more context: "None" returned during scraping.
Check online DEMO
from bs4 import BeautifulSoup
html_doc="""
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
lst = [_.get_text(strip=True) for _ in soup.find('ul', {'class': 'listing-key-specs'}).find_all('li')]
print(lst)
Currently, you are calling get_text() on the ul tag, which simply returns all its contents as one string. So
<div>
<p>Hello </p>
<p>World </p>
</div>
would become Hello World.
To extract each matching sub tag and store them as separate elements, find the ul first and then call find_all() on it, like this.
tag_list = car.find('ul', class_='listing-key-specs').find_all('li')
my_list = [i.get_text() for i in tag_list]
This will give you a list with the text of every li tag inside the ul with class 'listing-key-specs'. Now you're free to assign variables, e.g. carType = my_list[1]
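If positional indexes like my_list[1] feel fragile, the values can be paired with names. The sketch below assumes the li elements always appear in the order shown in the question; the field names are hypothetical and not part of the page.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")
items = soup.find("ul", class_="listing-key-specs").find_all("li")
specs = [li.get_text(strip=True) for li in items]

# Hypothetical field names, assuming the order of the <li> tags is fixed.
field_names = ["year", "body_type", "mileage", "transmission",
               "engine_size", "power", "fuel"]
car_spec = dict(zip(field_names, specs))

print(car_spec["body_type"])
print(car_spec["fuel"])
```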

How to extract the contents of a div tag containing particular text using BeautifulSoup

I am new to BeautifulSoup and am looking to extract text from a list inside a div tag. This is the HTML:
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
I would like to extract the text "Awaiting bone marrow transplant". This is the code which I use now which gives me an empty list:
for link in soup.findAll('div', text = re.compile('Description Synonyms ')):
    print(link)
Sorry for not adding this earlier: I do have other divs with the same class name, and I am interested only in the Description Synonyms one. The other div is listed below:
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
soup.find(text='...') doesn't work if there's other text or tags inside that tag.
Try:
[i.find('ul', {'class': "definitionList"}).find('li').text
 for i in soup.find_all('div', {'class': "contentBlurb"})
 if 'Description Synonyms' in str(i.text)][0]
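The reason the original search returns nothing is that, when a tag name is combined with a text filter, Beautiful Soup compares the filter against the tag's .string, which is None for a div that holds both text and a ul. One alternative, sketched below, is to search for the text node itself and then step up to its parent. It uses the string= argument (the newer name for text=) and assumes the label text sits directly inside the div, as in the question.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the filter is matched against text nodes directly.
label = soup.find(string=re.compile(r"Description Synonyms"))

# The parent of that text node is the div we are interested in.
items = [li.get_text(strip=True) for li in label.parent.find_all("li")]
print(items)
```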
You can do this:
# coding: utf-8
from bs4 import BeautifulSoup
html = """
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
"""
# name the parser explicitly to avoid the "no parser was explicitly specified" warning
souped = BeautifulSoup(html, 'html.parser')
matching_divs = [div for div in souped.find_all(
    'div', {'class': 'contentBlurb'}) if 'Description Synonyms' in div.getText()]
li_elements = []
matching_uls = []
for mdiv in matching_divs:
    matching_uls.extend(mdiv.findAll('ul', {'class': 'definitionList'}))
for muls in matching_uls:
    li_elements.extend(muls.findAll('li'))
for li in li_elements:
    print(li.getText())
EDIT: Updated for matching particular div.
Try this, changing the string in the if clause as required. The snippet below prints the li text if the text of the first contentBlurb div contains Description Synonyms; you can change it to suit your requirement.
val = soup.find('div', {'class': 'contentBlurb'}).text
if "Description Synonyms" in val:
    print(soup.find('div', {'class': 'contentBlurb'}).find('ul', {'class': 'definitionList'}).find('li').text)

Extract content with BeautifulSoup and Python

I'm trying to scrape a forum, but I can't deal with the comments, because the users use emoticons and bold font, cite previous messages, and so on.
For example, here's one of the comments that I have a problem with:
<div class="content">
<blockquote>
<div>
<cite>User write:</cite>
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
</div>
</blockquote>
<br/>
THIS IS THE COMMENT THAT I NEED!
</div>
I have been searching for help for the last 4 days and couldn't find anything, so I decided to ask here.
This is the code that I'm using:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_messages(url):
    soup = make_soup(url)
    msg = soup.find("div", {"class" : "content"})
    # msg holds the whole message, exactly as shown above
    print(msg)
    # Iterating over the children gives:
    # 1. <blockquote> ... </blockquote>
    # 2. <br/>
    # 3. THIS IS THE COMMENT THAT I NEED!
    for item in msg.children:
        print(item)
I'm looking for a way to deal with messages in a general way, whatever they look like. Sometimes users put emoticons in the middle of the text, and I need to remove them and get the whole message (in this situation, BeautifulSoup puts each part of the message (first part, emoticon, second part) in a different item).
Thanks in advance!
Use decompose http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
decompose() removes a tag you don't want, together with its contents, from the tree. In your case:
soup.blockquote.decompose()
or all unwanted tags:
for tag in ['blockquote', 'img', ... ]:
    for match in soup.find_all(tag):
        match.decompose()
Your example:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
... <blockquote>
... <div>
... <cite>User write:</cite>
... I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
... </div>
... </blockquote>
... <br/>
... THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
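For comparison, extract() also removes a tag from the tree but returns it, so the removed part stays usable, whereas decompose() destroys it. The snippet below is a small sketch on a made-up one-line document, not the forum page from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document for illustration.
html = "<div><blockquote>quoted</blockquote><img alt=':)'>kept text</div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag and hands it back.
removed = soup.blockquote.extract()

# decompose() detaches and destroys; nothing is returned.
for img in soup.find_all("img"):
    img.decompose()

remaining = soup.div.get_text()
print(removed.get_text())
print(remaining)
```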
Update
Sometimes all you have is a tag starting point but you are actually interested in the content before or after that starting point. You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
...     txt += elm.string
...     elm = elm.next_sibling
...
>>> print(txt)
Yes.Yes!Yes?
BeautifulSoup has a get_text method. Maybe this is what you want.
From their documentation:
markup = '\nI linked to <i>example.com</i>\n'
soup = BeautifulSoup(markup, 'html.parser')
soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'
If the text you want is never within any additional tags, as in your example, you can use extract() to get rid of all the tags and their contents:
html = '<div class="content">\
<blockquote>\
<div>\
<cite>User write:</cite>\
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">\
</div>\
</blockquote>\
<br/>\
THIS IS THE COMMENT THAT I NEED!\
</div>'
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
tags = div.findAll(recursive=False)
for tag in tags:
    tag.extract()
text = div.get_text(strip=True)
print(text)
This gives:
THIS IS THE COMMENT THAT I NEED!
To deal with emoticons, you'll have to do something more complicated. You'll probably have to define a list of emoticons to recognize yourself, and then parse the text to look for them.
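When the emoticons are img tags sitting between pieces of the comment, as in the question, a simpler route may be to keep only the text nodes that are direct children of the div and join them, leaving the tree itself unchanged. This is a sketch on a hypothetical comment with an emoticon in the middle; it assumes the wanted text is never wrapped in a child tag such as b or i.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical comment with an emoticon between two parts of the text.
html = """<div class="content">
<blockquote><div><cite>User write:</cite> I DO NOT WANT THIS</div></blockquote>
<br/>
FIRST PART <img class="smilies" alt=":116:" title="116"> SECOND PART
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="content")

# Keep only text that is a direct child of the div; text nested inside
# the blockquote is skipped and the img tag contributes nothing.
direct_strings = [c for c in div.children if isinstance(c, NavigableString)]

# Join the pieces and collapse runs of whitespace.
comment = " ".join(" ".join(direct_strings).split())
print(comment)
```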
