scrapy xpath selector without same DIV ID - python

I want to select all divs whose ID starts with "edit" using Scrapy/XPath.
For example:
<div id="edit3423432">...</div>
<div id="edit0036594">...</div>
For divs that share the same ID, I use this code:
hxs.select('.//div[contains(@id,"testid")]')
But how can I select all divs whose ID has "edit" as its first four characters?

XPath has a function called starts-with that is ideal here. Here's an example of how to use it:
hxs.select('.//div[starts-with(@id, "edit")]')
Hope that helps; let me know if you have any questions.
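The starts-with predicate can be checked outside Scrapy as well. The following is a minimal sketch that assumes lxml is installed; the sample markup, including the third div with a non-matching ID, is made up for illustration and is not from the question.

```python
# Assumption: lxml is available; SAMPLE is illustrative markup, not the real page.
from lxml import html

SAMPLE = """
<section>
  <div id="edit3423432">first</div>
  <div id="edit0036594">second</div>
  <div id="other123">ignored</div>
</section>
"""

tree = html.fromstring(SAMPLE)

# starts-with(@id, "edit") keeps only divs whose id attribute begins with "edit".
matches = tree.xpath('.//div[starts-with(@id, "edit")]')
ids = [div.get("id") for div in matches]
print(ids)
```

The same XPath string should work unchanged when passed to a Scrapy selector, since the predicate is plain XPath 1.0.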

Related

Elements able to be found using XPATH but not using CSS Selector. Am I searching using the correct value?

I am trying to extract data from multiple pages of search results where the HTML in question looks like so:
<ul>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
</ul>
I want to extract the text from the "li" tags, so I have:
text_data = WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.XPATH,'Card___StyledLi4-ulg8ho-7.jmevwM')))
print(text_data.text)
to wait for and target the "li" items. However, I get a "TimeoutException" error.
However, if I try to locate a single "li" item using its XPath under the same conditions, the data is returned, which leads me to wonder whether I am specifying the class correctly.
Can anyone tell me what I'm doing wrong? Please let me know if there is any further information you'd like me to provide.
I believe the XPath for these list items would be //li[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] (or //*[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] if you want all elements with that class rather than just li tags). You can take a look at this cheatsheet and this tutorial for further rules and examples of XPath.
You can also just use CSS Selectors like (By.CSS_SELECTOR, '.Card___StyledLi4-ulg8ho-7.jmevwM') in this case.
You have used the wrong locator type: it should be CSS_SELECTOR. Also put a dot '.' in front of each class name, because you are matching on the class attribute:
text_data = WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,'.Card___StyledLi4-ulg8ho-7.jmevwM')))

How to bring back 1st div child in python using bs4 soup.select within a dynamic table

In the HTML below, I have been unable to use Beautiful Soup's soup.select to obtain only the first child of each div class="wrap-25PNPwRV" (i.e. -11.94M and 2.30M) as a list.
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>‪−11.94M‬</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>‪2.30M‬</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
The above shows just two examples from the dynamic, JavaScript-rendered table I'm attempting to scrape; the page has many more div elements, and the table contains many more divs with class "wrap-25PNPwRV".
I currently have the code below, which scrapes all the contents of each div with class "wrap-25PNPwRV":
data_list = [elem.get_text() for elem in soup.select("div.wrap-25PNPwRV")]
Output:
['-11.94M', '-119.94%', '2.30M', '-80.17%']
However, I would like to use soup.select to yield the desired output :
['-11.94M', '2.30M']
I tried following this guide https://www.crummy.com/software/BeautifulSoup/bs4/doc/ but have been unable to apply it to my code above.
Please note that if this is not possible with soup.select, I am happy to use an alternative, provided it generates the same list format/output.
You can use the :nth-of-type CSS selector:
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
I'd suggest not using the .wrap-25PNPwRV class. It looks auto-generated and will almost certainly change in the future.
Instead, select the <div> element that is immediately followed by a sibling whose class starts with "change". For example:
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
Prints:
['−11.94M', '2.30M']
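To compare the two selectors suggested in this thread side by side, here is a small sketch run against a static copy of the markup from the question. It assumes beautifulsoup4 (with its soupsieve dependency) is installed; on the real page the HTML would first have to be obtained after JavaScript rendering, which this sketch does not cover.

```python
# Assumption: beautifulsoup4 with soupsieve is installed.
from bs4 import BeautifulSoup

# Static copy of the markup from the question (the live table is JavaScript-rendered).
HTML = """
<div class="value-25PNPwRV">
  <div class="wrap-25PNPwRV">
    <div>−11.94M</div>
    <div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div>
  </div>
</div>
<div class="value-25PNPwRV additional-25PNPwRV">
  <div class="wrap-25PNPwRV">
    <div>2.30M</div>
    <div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Option 1: the first div among its siblings inside each wrapper.
by_position = [el.get_text(strip=True)
               for el in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]

# Option 2: any div immediately followed by a sibling whose class starts with "change".
by_sibling = [el.get_text(strip=True)
              for el in soup.select('div:has(+ [class^="change"])')]

print(by_position)
print(by_sibling)
```

Both lists should contain only the two value cells. The second selector does not depend on the generated wrap-25PNPwRV class name, which is the reason given above for preferring it.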

XPath, nested conditions

I have the following HTML code, and I need to have an XPath expression, which finds the table element.
<div>
<div>Dezember</div>
<div>
<div class="dash-table-container">more divs</div>
</div>
</div>
My current XPath expression:
//div[./div[1]/text() = "Dezember"]/preceding::div[./div[2][@class=dash-table-container]
I don't know how to check whether the dash table container is the last one loaded, since I have many of them. So I need to check that it is under the div with the text "Dezember", because the divs for the earlier months load faster.
I want the XPath to select the "dash table container" div.
Thanks in advance
To select the div with the text content of "more divs", you can use
//div/div[@class="dash-table-container" and ../preceding-sibling::div[1]="Dezember"]
and to select its parent div element, use
//div[div/@class="dash-table-container"][preceding-sibling::div[1]="Dezember"]/..
I figured it out.
//div[preceding-sibling::div="Dezember"]/div[@class="dash-table-container"]
worked perfectly for me.
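The expression the asker settled on can be tried against a page with more than one month. This is a sketch only: it assumes lxml is installed, and the "November" block, the section wrapper, and the table texts are invented to show that the other months are skipped.

```python
# Assumption: lxml is available; PAGE is made-up markup modelled on the question.
from lxml import html

PAGE = """
<section>
  <div>
    <div>November</div>
    <div><div class="dash-table-container">November table</div></div>
  </div>
  <div>
    <div>Dezember</div>
    <div><div class="dash-table-container">Dezember table</div></div>
  </div>
</section>
"""

tree = html.fromstring(PAGE)

# Pick the wrapper div that follows a sibling div whose text is exactly "Dezember",
# then take its child with the dash-table-container class.
XPATH = '//div[preceding-sibling::div="Dezember"]/div[@class="dash-table-container"]'
found = tree.xpath(XPATH)
texts = [el.text_content() for el in found]
print(texts)
```

Note that the comparison is against the full string value of the sibling div, so surrounding whitespace in the month label on a real page would prevent a match unless normalize-space() is used.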

Web scraping Button BeautifulSoup Python

I'm trying to scrape the span from a button that has a specific class. This is the relevant HTML from the page:
<button class="sqdOP yWX7d _8A5w5 " type="button">altri <span>17</span></button>
I'd like to get the "17", which changes every time. Thanks.
I've tried this, but it doesn't work:
for item in soup.find_all('button', {'class': 'sqdOP yWX7d _8A5w5 '}):
For complex selections, it's best to use selectors. These work very similarly to CSS selectors.
p selects an element with the type p.
p.example selects an element with type p and class example.
p span selects any span inside a p.
There are also others, but only these are needed for this example.
These can be nested as you like. For example, p.example span.foo selects any span with class foo inside any p with class example.
Now, an element can have multiple classes, and they are separated by spaces. <p class="foo bar">Hello, World!</p> has both foo and bar as class.
I think it is safe to assume the class sqdOP is unique. You can build the selector pretty easily using the above:
button.sqdOP span
Now call select, and BeautifulSoup will return a list of matching elements. If there is only one match, you can safely use [0] to get the first item. So, the final code to select that span:
soup.select('button.sqdOP span')[0]
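Putting the selector together with the button markup from the question gives the sketch below. It assumes beautifulsoup4 is installed and, as the answer does, that sqdOP identifies the button; converting the text to an integer is an extra step added here for illustration.

```python
# Assumption: beautifulsoup4 is installed; HTML is the button from the question.
from bs4 import BeautifulSoup

HTML = '<button class="sqdOP yWX7d _8A5w5 " type="button">altri <span>17</span></button>'

soup = BeautifulSoup(HTML, "html.parser")

# select returns a list of matches; [0] takes the first one.
span = soup.select("button.sqdOP span")[0]
count = int(span.get_text())
print(count)

# select_one returns the first match directly, or None when nothing matches.
same_span = soup.select_one("button.sqdOP span")
```

Using select_one avoids an IndexError when the button is missing, at the cost of having to check for None.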

how to extract text using Beautifulsoup

Can you please show me how to extract the title text (Inna) using BeautifulSoup in this situation:
<div class="wallpapers-box-300x180-2 wallpapers-margin-2">
<div class="wallpapers-box-300x180-2-img"><a title="Inna" href="/photo.jpg" alt="Inna" width="300" height="188" /></a></div>
<div class="wallpapers-box-300x180-2-title"><a title="Inna" href="/wallpapers/inna/">Inna</a></div>
Thanks.
There are many ways to locate the element in this case, and it's difficult to tell which would work best for you, since we don't know the scope of the problem, how unique the element is, or what you know and can rely on.
I think the most practical approach here is the following CSS selector:
for elm in soup.select('div[class^="wallpapers-box"] > a[href*=wallpapers]'):
    print(elm.get_text())
Here we check that the parent div element's class starts with wallpapers-box and find its direct a child element whose href attribute value contains wallpapers.
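For reference, here is a self-contained sketch of that selector. It assumes beautifulsoup4 is installed and uses a simplified copy of the question's markup, with the first anchor left empty and the outer div closed; reading the title attribute is shown as an alternative to get_text().

```python
# Assumption: beautifulsoup4 is installed; HTML is a simplified copy of the question's markup.
from bs4 import BeautifulSoup

HTML = """
<div class="wallpapers-box-300x180-2 wallpapers-margin-2">
  <div class="wallpapers-box-300x180-2-img"><a title="Inna" href="/photo.jpg"></a></div>
  <div class="wallpapers-box-300x180-2-title"><a title="Inna" href="/wallpapers/inna/">Inna</a></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Direct <a> children of a div whose class starts with "wallpapers-box",
# restricted to links whose href contains "wallpapers".
links = soup.select('div[class^="wallpapers-box"] > a[href*=wallpapers]')

titles = [a.get_text() for a in links]    # visible link text
title_attrs = [a["title"] for a in links]  # value of the title attribute
print(titles, title_attrs)
```

Only the second anchor matches, because the image link's href (/photo.jpg) does not contain "wallpapers".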
