How to add elements to beautiful soup element - python

I have a bs4 element, called tab_window_uls[1], that looks like this:
<ul>
<li><b>Cut:</b> Sits low on the waist.</li>
<li><b>Fit:</b> Skinny through the leg.</li>
<li><b>Leg opening:</b> Skinny.</li>
</ul>
How can I add a new <li> to the <ul>?
Currently my code looks like:
lines = ['a', 'b']
li_tag = tab_window_uls[1].new_tag('li')
for i in lines:
    li_tag.string = i
    tab_window_uls[1].b.string.insert_before(li_tag)

You have to create a new tag, like I did below, and insert it into the <ul>. In the code I load the soup, create an <li> tag, create a <b> tag and append it inside the <li>, then grab the <ul> tag and insert the new <li> into the tree at the position I want. NOTE: insert() takes a position; if you want the new tag to be the last <li> in the list, use append() instead.
from bs4 import BeautifulSoup
htmlText = '''
<ul>
<li><b>Cut:</b> Sits low on the waist.</li>
<li><b>Fit:</b> Skinny through the leg.</li>
<li><b>Leg opening:</b> Skinny.</li>
</ul>
'''
bs = BeautifulSoup(htmlText, 'html.parser')
li_new_tag = bs.new_tag('li')
li_new_tag.string = 'Size:'
b_new_tag = bs.new_tag('b')
b_new_tag.string = '0 through 12.'
li_new_tag.append(b_new_tag)
tags = bs.ul
tags.insert(1, li_new_tag)
print(bs.prettify())
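To add one <li> per entry of a list, as the question's lines = ['a', 'b'] suggests, a minimal sketch could look like the following. It assumes a shortened copy of the question's markup; the names html_text and items are only illustrative. A new tag is created on every pass because a tag can live in only one place in the tree, so re-inserting the same object would move it rather than copy it.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, assumed for illustration
html_text = """<ul>
<li><b>Cut:</b> Sits low on the waist.</li>
<li><b>Fit:</b> Skinny through the leg.</li>
</ul>"""

soup = BeautifulSoup(html_text, "html.parser")
ul = soup.ul

lines = ["a", "b"]
for text in lines:
    # Create a fresh tag per item; reusing one tag object would just move it
    li = soup.new_tag("li")
    li.string = text
    ul.append(li)  # append() adds the tag as the last child of the <ul>

# Collect the text of every <li> to check the result
items = [li.get_text() for li in ul.find_all("li")]
print(items)
```

Note that new_tag() is called on the BeautifulSoup object itself, not on an individual tag such as tab_window_uls[1].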

Related

python beautifulsoup4 how to get span text in div tag

This is the html code
<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>
I used this code:
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    print(item.find('div', class_='salary-snippet'))
and got a list of results such as:
<div aria-label="RM 3,500 to RM 8,000 a month" class="salary-snippet"><span>RM 3,500 - RM 8,000 a month</span></div>
If I use
print(item.find('div', class_='salary-snippet').text.strip())
it raises the error
AttributeError: 'NoneType' object has no attribute 'text'
So how can I get only the span text? It's my first time web scraping.
Maybe this is what you are looking for.
First, select all the <div> tags with class salary-snippet using .find_all(), since each one is the parent of the <span> you are looking for.
Then iterate over the selected <div> tags and find the <span> in each one.
Based on your question, I assume that not all of these <div> tags contain a <span>. In that case, print the text only if the <div> contains a <span> tag, as below:
# Find all the divs
d = soup.find_all('div', class_='salary-snippet')
# Iterate over the <div> tags
for item in d:
    # Find the <span> in each item. If it does not exist, x will be None
    x = item.find('span')
    # Print only if x is not None
    if x:
        print(x.text.strip())
Here is the complete code.
from bs4 import BeautifulSoup
s = """<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>"""
soup = BeautifulSoup(s, 'lxml')
d = soup.find_all('div', class_='salary-snippet')
for item in d:
    x = item.find('span')
    if x:
        print(x.text.strip())
Output:
RM 6,000 a month
I believe the line should be:
print(item.find('div', {'class':'salary-snippet'}).text.strip())
Alternatively, if there is only the span you can simply use:
item.find("span").text.strip()
Since you used the .find_all() method, make sure that every div returned by
soup.find_all('div', class_='job_seen_beacon')
contains the element you are looking for; this error arises if even one of them doesn't.
For example:
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    try:
        print(item.find('div', {'class':'salary-snippet'}).text.strip())
    except AttributeError:
        print("Item Not available")
This tries to get the text and, if that fails, prints a placeholder message instead of raising, so you can see which items failed and work out why... perhaps they don't have the element you are searching for.
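An explicit None check is an alternative to try/except. The sketch below assumes two job cards, the second of which is a hypothetical card without a salary; select_one() with a CSS selector returns None when nothing matches, so the missing case can be handled directly.

```python
from bs4 import BeautifulSoup

# Two sample cards; the second one (no salary) is hypothetical
html = """
<div class="job_seen_beacon"><div class="salary-snippet"><span>RM 6,000 a month</span></div></div>
<div class="job_seen_beacon"><p>No salary listed</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

salaries = []
for item in soup.find_all("div", class_="job_seen_beacon"):
    # select_one() returns None if the card has no salary <span>
    span = item.select_one("div.salary-snippet span")
    salaries.append(span.get_text(strip=True) if span is not None else None)

print(salaries)
```

Keeping None in the list preserves the one-to-one mapping between job cards and salaries, which helps if the values are later combined with other fields.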

How to highlight text in complex html

I have an html that looks like this:
<h3>
Heading 3
</h3>
<ol>
<li>
<ol>
....
</li>
</ol>
I need to highlight the entire HTML starting from the first <ol>. I have found this solution:
import bs4
soup = bs4.BeautifulSoup(open('temp.html').read(), 'lxml')
new_h1 = soup.new_tag('h1')
new_h1.string = 'Hello '
mark = soup.new_tag('mark')
mark.string = 'World'
new_h1.append(mark)
h1 = soup.h1
h1.replace_with(new_h1)
print(soup.prettify())
Is there any way to highlight the entire HTML without having to find the specific text?
Edit:
I have tried this code, but it only highlights the innermost li:
for node in soup2.findAll('li'):
    if not node.string:
        continue
    value = node.string
    mark = soup2.new_tag('mark')
    mark.string = value
    node.replace_with(mark)
This will highlight all the <li> content.
Since I don't know exactly what your HTML looks like, I have tried to highlight all the <li> content. You can modify this code to suit your requirements.
from bs4 import BeautifulSoup
with open('index.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
tag = soup.find_all('li')
# Highlights the <li> content
for li in tag:
    # li.string is None when the <li> has several children (e.g. a nested list)
    if li.string is None:
        continue
    newtag = soup.new_tag('mark')
    li.string.wrap(newtag)
print(soup)
After Highlighting: https://i.stack.imgur.com/iIbXk.jpg
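If the goal is to highlight everything from the first <ol> onward without touching individual text nodes, one option is to wrap the whole <ol> element in a single <mark>. This is a sketch under the assumption that the markup resembles the question's outline; the sample string is made up. Browsers generally render it, although <mark> around a block element is not strictly valid HTML.

```python
from bs4 import BeautifulSoup

# Made-up markup shaped like the question's outline
html = "<h3>Heading 3</h3><ol><li>Outer<ol><li>Inner</li></ol></li></ol>"
soup = BeautifulSoup(html, "html.parser")

first_ol = soup.find("ol")
if first_ol is not None:
    # wrap() puts the new <mark> around the element, nested lists included
    first_ol.wrap(soup.new_tag("mark"))

result = str(soup)
print(result)
```

Because the nested lists stay inside the wrapped element, no per-<li> loop is needed.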

Extracting li element and assigning it to variable with beautiful soup

Given the following element
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
How do I extract each li element and assign it to a variable with beautiful soup?
Currently, my code looks like this:
detail = car.find('ul', {'class': 'listing-key-specs'}).get_text(strip=True)
and it produces the following output:
2005 (05 reg)Saloon66,038 milesManual1.8L118 bhpPetrol
Please refer to the following question for more context: "None" returned during scraping.
from bs4 import BeautifulSoup
html_doc="""
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
lst = [li.get_text(strip=True) for li in soup.find('ul', {'class': 'listing-key-specs'}).find_all('li')]
print(lst)
Currently, you are calling get_text() on the ul tag, which simply returns all its contents as one string. So
<div>
<p>Hello </p>
<p>World </p>
</div>
would become Hello World.
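To extract each matching sub tag and store them as separate elements, find the ul first and then call find_all() on it, like this. (The class belongs to the ul, not to the li tags, so searching for li tags with that class would return an empty list.)
tag_list = car.find('ul', class_='listing-key-specs').find_all('li')
my_list = [i.get_text() for i in tag_list]
This will give you a list with the text of every li tag inside the ul with class 'listing-key-specs'. Now you're free to assign variables, e.g. carType = my_list[1]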
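When the number of <li> items is known in advance, the list can be unpacked straight into named variables. The sketch below reuses the question's markup; the variable names (year, body_type and so on) are assumptions chosen for illustration, and unpacking raises ValueError if a listing has a different number of items.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Text of each direct <li> child of the spec list
specs = [li.get_text(strip=True) for li in soup.select("ul.listing-key-specs > li")]

# Illustrative names; this assumes exactly seven items per listing
year, body_type, mileage, gearbox, engine, power, fuel = specs
print(body_type, fuel)
```

If listings can omit fields, indexing into specs after checking its length is the safer choice.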

Using BeautifulSoup to extract an li element based on a string contained within

I have been attempting to use BeautifulSoup to retrieve any <li> element that contains any form of the word Ottawa. The problem is that Ottawa is never within a tag of its own, such as <p>. So I want to print only the li elements that contain Ottawa.
The HTML formatting is like this:
<html>
<body>
<blockquote>
<ul><li><b>name</b>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><b>name</b>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><b>name</b>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>
</blockquote>
</body>
</html>
My code is as follows:
from bs4 import BeautifulSoup
import re
import urllib2,sys
url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
re1='.*?'
re2='(Ottawa)'
ottawa = soup.findAll(text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL))
search = soup.findAll('li')
The above code finds Ottawa correctly, and it also finds the li elements, but it gives me every single li on the page.
I understand that the two searches are not currently combined, but trying search = soup.findAll('li', text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL)) results in []
My end goal is basically to get every <li> element that contains any mention of Ottawa and give me the entire <li> element with the name, description, link, etc.
Use the text attribute to filter the results of findAll:
elems = [elem for elem in soup.findAll('li') if 'Ottawa' in elem.text]
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen
url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# Match text nodes such as "(National: Ottawa, ON)"
for item in soup.find_all(string=re.compile(r'\(.+: Ottawa', re.IGNORECASE)):
    # Nearest preceding sibling tag that carries an href attribute
    link = item.find_previous_sibling(lambda tag: tag.has_attr('href'))
    if link is None:
        continue
    print('{} [{}]: {}'.format(link.text,
                               item.strip(),
                               link['href']))
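The combined call findAll('li', text=...) returns [] because the text filter is compared against each tag's .string, and .string is None for an <li> that has several children. A case-insensitive filter on the full text of each <li> avoids that. The sketch below runs offline on the question's sample markup instead of the live URL; the names pattern and matches are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question, used instead of the live page
html = """<ul><li><b>name</b>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><b>name</b>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><b>name</b>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# Match "Ottawa" in any letter case anywhere in the element's text
pattern = re.compile(r"ottawa", re.IGNORECASE)
matches = [li for li in soup.find_all("li") if pattern.search(li.get_text())]

for li in matches:
    print(li)  # the whole <li>, including name and description
```

Each entry in matches is the full Tag object, so links or other child elements can still be pulled out of it afterwards.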

lxml: how to discard all <li> elements containing a link with particular class?

As is often the case, I'm struggling with the lack of proper lxml documentation (note to self: should write a proper lxml tutorial and get lots of traffic!).
I want to find all <li> items that do not contain an <a> tag with a particular class.
For example:
<ul>
<li><small>pudding</small>: peaches and cream</li>
<li><small>cheese</small>: Epoisses and St Marcellin</li>
</ul>
I'd like to get hold of only the <li> that does not contain a link with class new, and I'd like to get hold of the text inside <small>. In other words, 'pudding'.
Can anyone help?
thanks!
import lxml.html as lh
content='''\
<ul>
<li><small>pudding</small>: peaches and cream</li>
<li><small>cheese</small>: Epoisses and St Marcellin</li>
</ul>
'''
tree=lh.fromstring(content)
for elt in tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()'):
    print(elt)
# pudding
The XPath has the following meaning:
//                        # from the root node, look at all descendants
li[                       # select nodes of type <li> that
  not(descendant::a[      # do not have a descendant of type <a>
    @class="new"])]       # with a class="new" attribute
/small                    # select the node of type <small>
/text()                   # return the text of that node
Quickly hacked together this code:
from lxml import etree
from lxml.cssselect import CSSSelector
content = r"""
<ul>
<li><small>pudding</small>: peaches and cream</li>
<li><small>cheese</small>: Epoisses and St Marcellin</li>
</ul>"""
html = etree.HTML(content)
bad_sel = CSSSelector('li > a.new')
good_sel = CSSSelector('li > small')
# Parent <li> of every link with class "new"
bad = [item.getparent() for item in bad_sel(html)]
# Keep only the <small> elements whose <li> is not in the bad list
good = [item for item in good_sel(html) if item.getparent() not in bad]
for item in good:
    print(item.text)
It first builds a list of items you do not want, and then it builds the ones you do want by excluding the bad ones.
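To see the exclusion in action, the sample needs an actual link with class new. In the sketch below the <a class="new" href="#"> element is a hypothetical stand-in, since the link's original position and target are not shown above. It uses the XPath approach with the @class attribute test.

```python
import lxml.html as lh

# The <a class="new"> element is a hypothetical stand-in for the real link
content = """<ul>
<li><small>pudding</small>: peaches and cream</li>
<li><small>cheese</small>: <a class="new" href="#">Epoisses</a> and St Marcellin</li>
</ul>"""
tree = lh.fromstring(content)

# Text of <small> inside every <li> that has no descendant <a class="new">
names = tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()')
print(names)
```

Note that @class="new" compares the whole attribute value, so a link with several classes (for example class="new external") would not be excluded by this expression.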
