I am having trouble extracting the text of a div while excluding the text of a div nested inside it. Here is the HTML:
<div style="width:100%">
<div class="status_p">
ACTIVE
</div>
Name
</div>
I want to extract Name without the inner div that contains ACTIVE. Whenever I print the text of the outer div, I always get ACTIVEName.
You can use the .children attribute of a bs4 tag, which iterates over all of the tag's direct children. Convert it to a list and take the last element:
from bs4 import BeautifulSoup
html = """<div style="width:100%">
<div class="status_p">
ACTIVE
</div>
Name
</div>"""
soup = BeautifulSoup(html, "html.parser")
print(list(soup.find("div").children)[-1].strip())
Output:
Name
OR
you can use stripped_strings
print(list(soup.find("div").stripped_strings)[-1])
OR
you can delete the inner div and get only the name.
soup.find("div",class_="status_p").extract()
print(soup.find("div").get_text(strip=True))
I found a solution and used:
soup.find("div", class_="status_p").decompose()
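Another option, if you would rather not modify the tree, is to ask only for the outer div's direct text nodes. This is a minimal sketch that assumes the same HTML as above; the names outer and direct_text are only for illustration.

```python
from bs4 import BeautifulSoup

html = """<div style="width:100%">
<div class="status_p">
ACTIVE
</div>
Name
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")

# string=True matches text nodes; recursive=False limits the search to
# direct children, so the text inside the nested div is skipped.
direct_text = [
    s.strip()
    for s in outer.find_all(string=True, recursive=False)
    if s.strip()
]
print(direct_text)
```

Unlike extract() or decompose(), this leaves the status_p div in the soup, which may matter if you still need it later.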
I want to extract the 'span' tag from 'p', but I don't know how to do it.
from bs4 import BeautifulSoup
html = """
<div id="tab-description" class="plugin-description section">
<h2 id="description-header">Description</h2>
<p><span class="embed-youtube" style="text-align:center; display: block;"><iframe class="youtube-player" src="https://www.youtube.com/"></iframe></span></p>
</div>
"""
soup = BeautifulSoup(html,'lxml')
description = soup.find(id="tab-description").find('p')
I tried calling decompose() on it, but it raises an error.
To get the <span>, select it directly:
soup.find(id="tab-description").p.span
or
soup.find(id="tab-description").find('span')
or
soup.select_one('#tab-description p > span')
Be aware that scraping the contents of the <iframe> this way is not an option, if that is the intention. If it is, that would be better asked as a new question with exactly that focus.
To delete <span> and its contents from soup:
soup.find(id="tab-description").p.span.decompose()
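If you want to remove the <span> from the tree but still keep it around, extract() may be a better fit than decompose(), since it returns the removed tag. A small sketch, assuming a trimmed-down version of the HTML from the question; the names p and removed are only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="tab-description" class="plugin-description section">
<h2 id="description-header">Description</h2>
<p><span class="embed-youtube"><iframe class="youtube-player" src="https://www.youtube.com/"></iframe></span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find(id="tab-description").p

# extract() detaches the tag from the tree and returns it,
# whereas decompose() detaches it and destroys it.
removed = p.span.extract()

print(removed.iframe["src"])  # the detached span is still usable
print(p)                      # the paragraph no longer contains the span
```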
I have an element in my soup whose class marks it as the description of a unit.
<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>
I can easily grab this part with soup.select(".ats-description")[0].
Now I want to remove the <div class="ats-description"> wrapper and keep only the inner tags (to retain the text structure). How can I do that?
soup.select(".ats-description")[0].getText() gives me all the texts within, like this:
'\nHere is a paragraph\ninner div\nAnother div\n\nItem1\nItem2\nItem3\n\n\n'
But that removes all the inner tags, so it is just unstructured text. I want to keep the tags as well.
To get the inner HTML, use the .decode_contents() method:
innerHTML = soup.select_one('.ats-description').decode_contents()
print(innerHTML)
Try this: pass a list of tag names to soup.find_all() to match any of them.
from bs4 import BeautifulSoup
html="""<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one("div.ats-description").find_all(['p','div','ul']))
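If the goal is to drop the wrapper from the tree itself rather than to get a string, unwrap() is another possibility: it removes a tag but leaves its children in place. This is a sketch under the assumption that the HTML is the same as in the question; wrapper and inner_html are illustrative names.

```python
from bs4 import BeautifulSoup

html = """<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.select_one(".ats-description")

# decode_contents() returns the children as an HTML string
# without changing the tree.
inner_html = wrapper.decode_contents()

# unwrap() removes the wrapper tag from the tree and keeps its children.
wrapper.unwrap()

print(inner_html)
print(soup.select_one(".ats-description"))  # the wrapper is gone
```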
I am trying to scrape a web page (the one below, for example). I want the product names, but my find_all call does not find all of the tags I specify.
Here is what I am doing:
from urllib import request
from bs4 import BeautifulSoup
url = 'https://www.toysrus.fi/nallet-ja-pehmolelut/interaktiiviset-pehmolelut'
soup = BeautifulSoup(request.urlopen(url).read(), 'html.parser')
print(len(soup.findAll('div', {'class' : 'inner-wrapper'})))
The page actually contains 4 divs with class='inner-wrapper', but only 1 is found. How can I scrape the product names from the page, and how can I get the correct number of div tags with class 'inner-wrapper'? Thanks.
Beautiful Soup only finds real HTML div tags; those that happen to be inside scripts are ignored. Unfortunately, Beautiful Soup does not evaluate scripts.
Open the page's HTML source and you will see one HTML div with that class, plus a number of script/JS templates like the one below:
<script type="text/x-jsrender" id="product-list-skuid-template">
<div class="product-list-component type-{{:TemplateInfo.type}} outer-wrapper">
<div class="inner-wrapper">
<ul class="product-list-container">
{{for Data}} {{include tmpl="#product-template"/}} {{/for}}
</ul>
</div>
</div>
{{!-- SHADOW --}} {{if TemplateInfo.divider=='roundshadow'}}
<div class="round-shadow"></div>
{{else TemplateInfo.divider=='simple'}}
<hr /> {{/if}}
</script>
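To see the effect on a small scale, here is a sketch using a made-up page that has one real div and one div inside a template script. The HTML is a simplified assumption, not the actual page. The script body can be parsed a second time as HTML, but that only gives you the unrendered template, not the product data that JavaScript would fill in.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical page: one real div plus one div inside a template.
html = """<div class="inner-wrapper">real content</div>
<script type="text/x-jsrender" id="product-list-skuid-template">
<div class="outer-wrapper"><div class="inner-wrapper"><ul class="product-list-container"></ul></div></div>
</script>"""

soup = BeautifulSoup(html, "html.parser")

# The script body is kept as plain text, so only the real div is found.
real_divs = soup.find_all("div", class_="inner-wrapper")
print(len(real_divs))

# Parsing the script text separately exposes the template markup.
template = soup.find("script", id="product-list-skuid-template")
template_soup = BeautifulSoup(template.string, "html.parser")
template_divs = template_soup.find_all("div", class_="inner-wrapper")
print(len(template_divs))
```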
The grayed-out area below contains some text that I am trying to extract from a page.
I don't know how to access the text in the gray area. I tried the following, but it did not work. The element does not have an id; how do I get the text inside it?
comment = soup.find("div", {"class", "GCARQJCDEXD"})
You can locate the element by matching a class attribute to an empty string:
from bs4 import BeautifulSoup
data = """
<div class="GCARQJCDEXD">
<div class="clearfix hidden">something here</div>
<div class>
desired text
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
comment = soup.find("div", {"class": "GCARQJCDEXD"}).find("div", {"class": ""})
print(comment.get_text(strip=True))
Prints desired text.
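If matching on an empty class feels fragile, a possible alternative is to walk the direct child divs and skip the ones marked hidden. This is only a sketch and assumes, as in the sample above, that the unwanted sibling carries the class hidden; outer, visible and text are illustrative names.

```python
from bs4 import BeautifulSoup

data = """
<div class="GCARQJCDEXD">
<div class="clearfix hidden">something here</div>
<div class>
desired text
</div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
outer = soup.find("div", {"class": "GCARQJCDEXD"})

# Look only at direct child divs and keep those not marked as hidden.
# "or []" guards against a missing or empty class attribute.
visible = [
    div
    for div in outer.find_all("div", recursive=False)
    if "hidden" not in (div.get("class") or [])
]
text = visible[0].get_text(strip=True)
print(text)
```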
I have HTML like this:
<div id="all-stories" class="book">
<ul>
<li title="Book1" ><a href="book1_url">Book1</a></li>
<li title="Book2" ><a href="book2_url">Book2</a></li>
</ul>
</div>
I want to get the books and their respective URLs using XPath, but my approach is not working. For simplicity, I tried to extract all the elements under the "li" tags as follows:
lis = tree.xpath('//div[@id="all-stories"]/div/text()')
import lxml.html as LH
content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1" ><a href="book1_url">Book1</a></li>
<li title="Book2" ><a href="book2_url">Book2</a></li>
</ul>
</div>
'''
root = LH.fromstring(content)
# // descends to any depth, so the li tags are found even though they sit inside ul
for atag in root.xpath('//div[@id="all-stories"]//li/a'):
    print(atag.attrib['href'], atag.text_content())
yields
book1_url Book1
book2_url Book2
The XPath //div[@id="all-stories"]/div does not match anything because there is no child div inside the outer div tag.
The XPath //div[@id="all-stories"]/li would not match either, because there is no direct child li tag inside the div tag. However, //div[@id="all-stories"]//li does match the li tags, because // tells XPath to search recursively, as deeply as necessary, to find them.
Now, the content you are looking for is not in the li tag. It is inside the a tag. So instead use the XPath
'//div[@id="all-stories"]//li/a' to reach the a tags.
The value of the href attribute can be accessed with atag.attrib['href'], and the text with atag.text_content().
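To make the difference between / and // concrete, the following sketch counts the matches for each of the XPaths discussed above. It assumes the li elements contain a tags with href attributes, as in the answer's sample; the variable names are only for illustration.

```python
import lxml.html as LH

content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1" ><a href="book1_url">Book1</a></li>
<li title="Book2" ><a href="book2_url">Book2</a></li>
</ul>
</div>
'''

root = LH.fromstring(content)

# "/" only looks at direct children; "//" descends to any depth.
child_divs = root.xpath('//div[@id="all-stories"]/div')
child_lis = root.xpath('//div[@id="all-stories"]/li')
nested_lis = root.xpath('//div[@id="all-stories"]//li')

# An XPath can also select attribute values directly with @name.
hrefs = root.xpath('//div[@id="all-stories"]//li/a/@href')
titles = root.xpath('//div[@id="all-stories"]//li/@title')

print(len(child_divs), len(child_lis), len(nested_lis))
print(hrefs, titles)
```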