How to extract text using BeautifulSoup - Python

Can you please show me how to extract the title text (Inna) using BeautifulSoup in this situation:
<div class="wallpapers-box-300x180-2 wallpapers-margin-2">
<div class="wallpapers-box-300x180-2-img"><a title="Inna" href="/photo.jpg" alt="Inna" width="300" height="188" /></a></div>
<div class="wallpapers-box-300x180-2-title"><a title="Inna" href="/wallpapers/inna/">Inna</a></div>
Thanks.

There are many ways to locate the element here, and it is hard to say which would work best for you without knowing the scope of the problem, how unique the element is, and what you know and can rely on.
The most practical approach here, I think, is the following CSS selector:
for elm in soup.select('div[class^="wallpapers-box"] > a[href*=wallpapers]'):
    print(elm.get_text())
Here we check that the parent div element's class starts with wallpapers-box and find its direct a child element whose href attribute value contains wallpapers.
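As a minimal sketch, the selector can be tried against a trimmed copy of the markup from the question. The first anchor is simplified here (the stray image attributes are dropped), and the html.parser backend is an assumption; any installed parser should behave the same way for this snippet.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; the first <a> is simplified.
html = """
<div class="wallpapers-box-300x180-2 wallpapers-margin-2">
  <div class="wallpapers-box-300x180-2-img"><a title="Inna" href="/photo.jpg"></a></div>
  <div class="wallpapers-box-300x180-2-title"><a title="Inna" href="/wallpapers/inna/">Inna</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# div whose class attribute starts with "wallpapers-box", then a direct
# <a> child whose href contains "wallpapers".
selector = 'div[class^="wallpapers-box"] > a[href*=wallpapers]'
titles = [a.get_text() for a in soup.select(selector)]
print(titles)
```

The image link is skipped because its href does not contain "wallpapers", so only the title link is returned.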

Related

Elements able to be found using XPATH but not using CSS Selector. Am I searching using the correct value?

I am trying to extract data from multiple pages of search results where the HTML in question looks like so:
<ul>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
</ul>
I want to extract the text from the "li" tags, so I have:
text_data = WebDriverWait(driver,10).until(EC.visibility_of_all_element_located((By.XPATH,'Card___StyledLi4-ulg8ho-7.jmevwM')
print(text_data.text)
to wait and target "li" item. However, I get a "TimeoutException" error.
However, if I try to locate a single "li" item using the XPATH under the same conditions, the data is returned which leads me to question if I am inputting the class correctly?
Can anyone tell me what I'm doing wrong? Please let me know if there is any further information, you'd like me to provide.
I believe the XPath for these list items would be //li[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] (or //*[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] if you want all elements with that class rather than just li tags). You can take a look at this cheatsheet and this tutorial for further rules and examples of XPath.
You can also just use CSS Selectors like (By.CSS_SELECTOR, '.Card___StyledLi4-ulg8ho-7.jmevwM') in this case.
You used the wrong locator type: it should be CSS_SELECTOR, with a dot '.' in front of each class name, because the value is a class:
text_data = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.Card___StyledLi4-ulg8ho-7.jmevwM')))
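The XPath form can be checked offline before wiring it into Selenium. The sketch below uses lxml as a stand-in for the browser, which is an assumption for illustration only; the item texts and the extra "other" item are hypothetical placeholders, since the question elides the real content. Note that in Selenium the wait for all elements returns a list, so the text has to be read from each element rather than from the list itself.

```python
from lxml import html

# Placeholder list; "item 1" etc. are hypothetical, the question shows "...".
doc = html.fromstring("""
<ul>
  <li class="Card___StyledLi4-ulg8ho-7 jmevwM">item 1</li>
  <li class="Card___StyledLi4-ulg8ho-7 jmevwM">item 2</li>
  <li class="other">item 3</li>
</ul>
""")

# Exact match on the full class attribute string, as in the answer.
cards = doc.xpath('//li[@class="Card___StyledLi4-ulg8ho-7 jmevwM"]')

# The result is a list, so read the text element by element.
texts = [li.text_content() for li in cards]
print(texts)
```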

How to bring back 1st div child in python using bs4 soup.select within a dynamic table

In the below html elements, I have been unsuccessful using beautiful soup.select to only obtain the first child after div class="wrap-25PNPwRV"> (i.e. -11.94M and 2.30M) in list format
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>‪−11.94M‬</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>‪2.30M‬</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
Above are just two examples from the HTML I'm attempting to scrape. They sit inside a dynamic, JavaScript-rendered table; the page has many more div elements, and the table has many more div class "wrap-25PNPwRV" blocks.
I currently have the below code which allows me to scrape all the contents within div class ="wrap-25PNPwRV"
data_list = [elem.get_text() for elem in soup.select("div.wrap-25PNPwRV")]
Output:
['-11.94M', '-119.94%', '2.30M', '-80.17%']
However, I would like to use soup.select to yield the desired output :
['-11.94M', '2.30M']
I tried following this guide https://www.crummy.com/software/BeautifulSoup/bs4/doc/ but have been unable to apply it to my code above.
Please note, if soup.select is not possible to perform the above, I am happy to use an alternative providing it generates the same list format/output
You can use the :nth-of-type CSS selector:
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
I'd suggest not using the .wrap-25PNPwRV class. It looks auto-generated and will almost certainly change in the future.
Instead, select the <div> element that has an element with class="change..." as its next sibling. For example:
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
Prints:
['−11.94M', '2.30M']
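Both selectors can be compared side by side on the sample markup. This is a sketch that assumes the html.parser backend and a BeautifulSoup version whose select() supports :has() with a relative sibling selector (the bundled soupsieve package provides this).

```python
from bs4 import BeautifulSoup

# Sample markup from the question.
html = """
<div class="value-25PNPwRV">
  <div class="wrap-25PNPwRV">
    <div>−11.94M</div>
    <div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
  <div class="wrap-25PNPwRV">
    <div>2.30M</div>
    <div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Position-based: the first <div> inside each wrapper.
by_position = [d.get_text(strip=True)
               for d in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]

# Sibling-based: any <div> immediately followed by a "change..." element.
by_sibling = [d.get_text(strip=True)
              for d in soup.select('div:has(+ [class^="change"])')]

print(by_position)
print(by_sibling)
```

The sibling-based form does not depend on the generated wrapper class, so it should survive a rebuild of the site's CSS as long as the "change" prefix stays.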

XPath, nested conditions

I have the following HTML code, and I need to have an XPath expression, which finds the table element.
<div>
<div>Dezember</div>
<div>
<div class="dash-table-container">more divs</div>
</div>
</div>
My current Xpath expression:
//div[./div[1]/text() = "Dezember"]/preceding::div[./div[2][@class=dash-table-container]
I don't know how to check whether the dash table container is the last one loaded, since I have many of them. So I need to check that it's under the div with "Dezember" as its text, because the divs before it, for the other months, load faster.
I want the XPATH to select the "dash table container" div.
Thanks in advance
To select the div with the text content of "more divs", you can use
//div/div[@class="dash-table-container" and ../preceding-sibling::div[1]="Dezember"]
and to select its parent div element, use
//div[div/@class="dash-table-container"][preceding-sibling::div[1]="Dezember"]/..
I figured it out.
//div[preceding-sibling::div="Dezember"]/div[@class="dash-table-container"]
worked perfectly for me.
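A quick way to check these expressions is to run them with lxml on a page that has more than one month. In the sketch below, the "November" block, the container texts, and the wrapper div with id "months" are hypothetical additions for illustration.

```python
from lxml import html

# Two month blocks; only the "Dezember" container should be selected.
doc = html.fromstring("""
<div id="months">
  <div>
    <div>November</div>
    <div><div class="dash-table-container">november table</div></div>
  </div>
  <div>
    <div>Dezember</div>
    <div><div class="dash-table-container">dezember table</div></div>
  </div>
</div>
""")

# The asker's final expression: the container's parent must have a
# preceding sibling <div> whose text is exactly "Dezember".
own = doc.xpath(
    '//div[preceding-sibling::div="Dezember"]'
    '/div[@class="dash-table-container"]'
)

# The first expression from the answer, written from the container's side.
alt = doc.xpath(
    '//div/div[@class="dash-table-container" '
    'and ../preceding-sibling::div[1]="Dezember"]'
)

print([d.text for d in own])
print([d.text for d in alt])
```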

Python xpath - Exclude content with style="display:none"

I'm using XPath to get some information from a website and I came across a block of code that contains style="display: none;" or style="display: block;", and I want to include only the code that has display: block. I looked at some examples but couldn't get them working in my code. I want to use an if statement to run the code only if it has display: block, but I don't know if that is possible. This is what I have:
if guide_page.xpath(".//div[@class='build-box']/@style/text()") == "display: block;":
    for build_names in guide_page.xpath(".//div[@class='build-gradient']"):
        for title in build_names.xpath("div/h2/text()"):
            print("\n")
            print(title)
And this is the div that has it:
<div class="build-box" style="display: block;">
I'm not sure if I should paste more of the html or if that's enough, otherwise, please tell me and thanks for any help :)
You can do this without an if statement. Just add not(...condition...) to the predicate to exclude elements matching a condition. For example, the following XPath returns div elements with a given class attribute value that don't have the attribute style="display: block;":
.//div[@class='build-box' and not(@style='display: block;')]
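Applied to the asker's goal of keeping only the visible boxes, the same not(...) technique can exclude display: none instead. The sketch below assumes lxml and a hypothetical page layout (the nesting under build-box and the h2 titles are made up for illustration); contains() is used so that small differences in the rest of the style string do not matter.

```python
from lxml import html

# Hypothetical page: one visible and one hidden build box.
page = html.fromstring("""
<div>
  <div class="build-box" style="display: block;">
    <div class="build-gradient"><div><h2>Visible build</h2></div></div>
  </div>
  <div class="build-box" style="display: none;">
    <div class="build-gradient"><div><h2>Hidden build</h2></div></div>
  </div>
</div>
""")

# Keep build boxes whose style does not contain "display: none".
visible_boxes = page.xpath(
    ".//div[@class='build-box' and not(contains(@style, 'display: none'))]"
)

# Collect the titles from the visible boxes only.
titles = [t for box in visible_boxes for t in box.xpath(".//h2/text()")]
print(titles)
```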

scrapy xpath selector without same DIV ID

I want to select all divs whose ID starts with "edit", using Scrapy/XPath.
For example:
<div id="edit3423432">...</div>
<div id="edit0036594">...</div>
For divs that share the same id, I use this code:
hxs.select('.//div[contains(@id,"testid")]')
But now, how can I select all divs whose first four characters are "edit"?
XPath has a function called starts-with that fits well here. Here's an example of how to use it:
hxs.select('.//div[starts-with(@id, "edit")]')
Hope that helps, let me know if you have any questions.
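The expression itself is plain XPath 1.0, so it can be tried outside Scrapy. The sketch below uses lxml as an assumption for illustration; the wrapper div and the "testid42" element are hypothetical, added to show a non-matching case.

```python
from lxml import html

# Two "edit..." divs from the question plus one hypothetical non-match.
doc = html.fromstring("""
<div id="root">
  <div id="edit3423432">...</div>
  <div id="edit0036594">...</div>
  <div id="testid42">...</div>
</div>
""")

# starts-with() compares the beginning of the id attribute value.
matches = doc.xpath('.//div[starts-with(@id, "edit")]')
ids = [d.get("id") for d in matches]
print(ids)
```

Using double quotes inside the single-quoted Python string keeps the XPath literal from closing the Python string early.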
