Find elements which have a specific child with BeautifulSoup

With BeautifulSoup, how can I access a <li> that has a specific div as a child?
Example: how can I get the text (i.e. info#blah.com) of the li that has the Email div as a child?
<li>
<div>Country</div>
Germany
</li>
<li>
<div>Email</div>
info#blah.com
</li>
I tried to do it manually: looping over all li elements and, for each of them, looping over its child divs to check whether the text is Email, and so on. But I'm sure there is a cleverer way with BeautifulSoup.

There are multiple ways to approach the problem.
One option is to locate the Email div by text and get the next sibling:
soup.find("div", text="Email").next_sibling.strip() # evaluates to "info#blah.com"

Your question is about getting the whole <li> element that has "Email" inside its <div> tag, right? That is, you want the following result:
<li>
<div>Email</div>
info#blah.com
</li>
If I understand your question correctly, you can do the following:
soup.find("div", text="Email").parent
or, if you need "info#blah.com" as the result, do the following (the returned text still includes the surrounding newlines):
soup.find("div", text="Email").next_sibling

If only a single div has the content "Email", you can do it this way:
soup.find("div", text="Email").find_parent('li')
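
The answers above can be tried together on the sample markup from the question. This is a minimal sketch, assuming BeautifulSoup 4 with the built-in html.parser; it passes string= rather than text=, since BeautifulSoup 4.4.0 and later prefer that name for the same argument. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<li>
<div>Country</div>
Germany
</li>
<li>
<div>Email</div>
info#blah.com
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the <div> whose only content is the string "Email".
email_div = soup.find("div", string="Email")

# The text node right after the div holds the value; strip the newlines.
email = email_div.next_sibling.strip()

# Walk up to the enclosing <li>. Here .parent would give the same element,
# because the div is a direct child of the li.
li = email_div.find_parent("li")

print(email)
print(li.name)
```

find_parent('li') keeps working if the div is later wrapped in another element, whereas .parent always returns the immediate parent.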

Related

XPath, nested conditions

I have the following HTML code, and I need to have an XPath expression, which finds the table element.
<div>
<div>Dezember</div>
<div>
<div class="dash-table-container">more divs</div>
</div>
</div>
My current Xpath expression:
//div[./div[1]/text() = "Dezember"]/preceding::div[./div[2][@class=dash-table-container]
I don't know how to check whether the dash table container is the last one loaded, since I have many of them. So I need to check that it is under the div with "Dezember" as its text, because the divs for the earlier months are loaded faster.
I want the XPath to select the "dash table container" div.
Thanks in advance

To select the div with the text content of "more divs", you can use
//div/div[@class="dash-table-container" and ../preceding-sibling::div[1]="Dezember"]
and to select its parent div element, use
//div[div/@class="dash-table-container"][preceding-sibling::div[1]="Dezember"]/..

I figured it out.
//div[preceding-sibling::div="Dezember"]/div[@class="dash-table-container"]
worked perfectly for me.
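
To check expressions like these outside the browser, they can be run against the sample markup. This is a rough sketch that assumes the lxml library is installed and that the snippet is well-formed enough to parse as XML; the variable names are made up for the example.

```python
from lxml import etree

snippet = """
<div>
<div>Dezember</div>
<div>
<div class="dash-table-container">more divs</div>
</div>
</div>
"""

root = etree.fromstring(snippet.strip())

# Selects the container whose parent div directly follows the "Dezember" div.
container_xpath = (
    '//div[preceding-sibling::div="Dezember"]'
    '/div[@class="dash-table-container"]'
)
# Selects the outer div that holds both the month label and the container's parent.
parent_xpath = (
    '//div[div/@class="dash-table-container"]'
    '[preceding-sibling::div[1]="Dezember"]/..'
)

containers = root.xpath(container_xpath)
parents = root.xpath(parent_xpath)

print([el.text for el in containers])
print([el.tag for el in parents])
```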

Get Element using BeautifulSoup that does not have a class in HTML

I am trying to get the text of the user's rank from this webpage. By "rank" I mean the text you see in the top right corner of the user's info:
In this example rank is "Competitions Master". So how do I get that text?

If you inspect the rank closely, you'll see an <a href="/progression"> element that contains a <p> tag, which in turn contains another <p> tag holding the rank.
First find the <a href="/progression"> container, then find the <p> tags inside it, and then the <p> nested inside those. Print the text of that inner <p>; it is the only <p> inside the <p> under <a href="/progression">.
Alternatively, there's a button below "Home" labeled with the name of the rank. You can try scraping that element instead.
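
The first approach might look like the sketch below. The markup here is hypothetical: the real page is not shown in the question, so the snippet only mirrors the structure described above, and it assumes the rank is present in the HTML that BeautifulSoup receives.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the described nesting:
# <a href="/progression"> containing a <p> that contains another <p>.
html = '<a href="/progression"><p><p>Competitions Master</p></p></a>'

soup = BeautifulSoup(html, "html.parser")

# Find the link by its href attribute instead of a class.
link = soup.find("a", href="/progression")

# The rank is the only text inside the link, so collecting the link's text
# avoids depending on how the nested <p> tags end up being parsed.
rank = link.get_text(strip=True) if link is not None else None

print(rank)
```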

how to extract text using Beautifulsoup

Can you please show me how to extract the title text (Inna) using BeautifulSoup in this situation:
<div class="wallpapers-box-300x180-2 wallpapers-margin-2">
<div class="wallpapers-box-300x180-2-img"><a title="Inna" href="/photo.jpg" alt="Inna" width="300" height="188" /></a></div>
<div class="wallpapers-box-300x180-2-title"><a title="Inna" href="/wallpapers/inna/">Inna</a></div>
Thanks.

There are many ways to locate the element in this case, and it's difficult to tell which would work best for you, since we don't know the scope of the problem, how unique the element is, or what you know and can rely on.
The most practical approach here, I think, is to use the following CSS selector:
for elm in soup.select('div[class^="wallpapers-box"] > a[href*=wallpapers]'):
    print(elm.get_text())
Here we check that the parent div element's class starts with wallpapers-box and find its direct a child whose href attribute value contains the text wallpapers.
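
As a quick check, the selector can be run against a tidied copy of the markup from the question. This is only a sketch: the closing tags were added for the example, and it assumes BeautifulSoup 4.7 or later, where select() is backed by the soupsieve package.

```python
from bs4 import BeautifulSoup

# Tidied version of the question's markup (closing tags added for this sketch).
html = """
<div class="wallpapers-box-300x180-2 wallpapers-margin-2">
<div class="wallpapers-box-300x180-2-img"><a title="Inna" href="/photo.jpg"></a></div>
<div class="wallpapers-box-300x180-2-title"><a title="Inna" href="/wallpapers/inna/">Inna</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Direct <a> children of any div whose class attribute starts with
# "wallpapers-box", keeping only links whose href contains "wallpapers".
selector = 'div[class^="wallpapers-box"] > a[href*="wallpapers"]'
titles = [a.get_text() for a in soup.select(selector)]

print(titles)
```

The image link is skipped because its href is /photo.jpg, which does not contain wallpapers.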

Python/Beautiful Soup find particular heading output full div

I'm attempting to parse a very extensive HTML document that looks something like:
<div class="reportsubsection n" ><br>
<h2> part 1 </h2><br>
<p> insert text here </p><br>
<table> crazy table thing here </table><br>
</div>
<div class="reportsubsection n"><br>
<h2> part 2 </h2><br>
<p> insert text here </p><br>
<table> crazy table thing here </table><br>
</div>
I need to parse out the second div based on its h2 having the text "part 2". I was able to break out all divs with:
divTag = soup.find("div", {"id": "reportsubsection"})
but didn't know how to narrow it down from there. In other posts I found how to locate the specific text "part 2", but I need to be able to output the whole div section it is contained in.
EDIT/UPDATE
Ok sorry but I'm still a little lost. Here is what I've got now. I feel like this should be so much simpler than I'm making it. Thanks again for all the help
divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
print(divTag)

You can always go back up after finding the right h2, or you can test all subsections:
for subsection in soup.select('div#reportsubsection #subsection'):
    if not subsection.find('h2', text=re.compile('part 2')):
        continue
    # do something with this subsection
This uses a CSS selector to locate all subsections.
Or, going back up with the .parent attribute:
for header in soup.find_all('h2', text=re.compile('part 2')):
    section = header.parent
The trick is to narrow down your search as early as possible; the second option has to find all h2 elements in the whole document, while the former narrows the search down quicker.
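
Both options can be compared on a trimmed copy of the sample document. This sketch makes two assumptions that differ from the snippets above: it selects by class (div.reportsubsection), because the sample markup uses class="reportsubsection n" rather than an id, and it uses string= instead of text=, the name newer BeautifulSoup releases prefer.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the sample document (tables and <br> tags left out).
html = """
<div class="reportsubsection n">
<h2> part 1 </h2>
<p> insert text here </p>
</div>
<div class="reportsubsection n">
<h2> part 2 </h2>
<p> insert text here </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"part 2")

# Option 1: narrow to the subsections first, then test each one's heading.
by_section = [
    div
    for div in soup.select("div.reportsubsection")
    if div.find("h2", string=pattern)
]

# Option 2: find the heading anywhere, then walk back up to its div.
by_header = [
    h2.find_parent("div") for h2 in soup.find_all("h2", string=pattern)
]

# Either list now holds the whole matching div, ready to print or process.
print(by_section[0])
```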

How to surround an html element with another tag using lxml in Python

What I want to do is something like this. In my page I have an html document which has this tag
<p class="pretty">
Some text
</p>
And I want to replace it with
<blockquote>
<p>
Some Text
</p>
</blockquote>
I can strip the class of the tag using tag.attrib.pop('class'), but I can't figure out how to wrap another html tag around a particular tag.
Any help is appreciated.

I believe you are thinking about it the wrong way: you cannot wrap an element around another. What you need to do is copy the contents of the <p> into a variable, delete the <p> element, create a <blockquote> element where the <p> element used to be, and then add the contents of the <p> element to the <blockquote>.
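
One way this could look with lxml is sketched below. It is a variation on the steps above rather than a literal transcription: instead of copying the contents, it puts a new <blockquote> in the position of the <p> and then moves the existing <p> element into it. The surrounding <div> is an assumption added so the <p> has a parent.

```python
from lxml import etree

# The wrapping <div> is assumed here so that the <p> has a parent element.
root = etree.fromstring('<div><p class="pretty">Some text</p></div>')
p = root.find("p")

# Drop the class attribute, as in the question.
p.attrib.pop("class", None)

blockquote = etree.Element("blockquote")

# Keep any text that followed the <p> outside the new wrapper.
blockquote.tail = p.tail
p.tail = None

# Put the <blockquote> where the <p> was, then move the <p> inside it.
p.getparent().replace(p, blockquote)
blockquote.append(p)

result = etree.tostring(root, encoding="unicode")
print(result)
```

Moving the element keeps its children and remaining attributes intact, so nothing has to be copied by hand.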
