i'm trying to webscrape the span from a button that has a determinated class. This is the code of the page on the website.
<button class="sqdOP yWX7d _8A5w5 " type="button">altri <span>17</span></button>
I'd like to find "17" that obviously changes everytime. Thanks.
I've tried with this one but it doesn't work
for item in soup.find_all('button', {'class': 'sqdOP yWX7d _8A5w5 '}):
For complex selections, it's best to use selectors. These work very similar to CSS.
p selects an element with the type p.
p.example selects an element with type p and class example.
p span selects any span inside a p.
There are also others, but only these are needed for this example.
These can be nested as you like. For example, p.example span.foo selects any span with class foo inside any p with class example.
Now, an element can have multiple classes, and they are separated by spaces. <p class="foo bar">Hello, World!</p> has both foo and bar as class.
I think I am safe to assume the class sqdOP is unique. You can build the selector pretty easily using the above:
button.sqdOP span
Now, issue select, and BeautifulSoup will return a list of matching elements. If this is the only one, you can safely use [0] to get the first item. So, the final code to select that span:
soup.select('button.sqdOP span')[0]
Related
In the below html elements, I have been unsuccessful using beautiful soup.select to only obtain the first child after div class="wrap-25PNPwRV"> (i.e. -11.94M and 2.30M) in list format
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
Above is just two examples within the html I'm attempting to scrape within the dynamic javascript coded table which the above source code lies within, but there are many more div attributes on the page, and many more div class "wrap-25PNPwRV" inside the javascript table
I currently have the below code which allows me to scrape all the contents within div class ="wrap-25PNPwRV"
data_list = [elem.get_text() for elem in soup.select("div.wrap-25PNPwRV")]
Output:
['-11.94M', '-119.94%', '2.30M', '-80.17%']
However, I would like to use soup.select to yield the desired output :
['-11.94M', '2.30M']
I tried following this guide https://www.crummy.com/software/BeautifulSoup/bs4/doc/ but have been unsuccessful to implement it to my above code.
Please note, if soup.select is not possible to perform the above, I am happy to use an alternative providing it generates the same list format/output
You can use the :nth-of-type CSS selector:
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
I'd suggest to not use the .wrap-25PNPwRV class. Seems random and almost certainly will change in the future.
Instead, select the <div> element which has other element with class="change..." as sibling. For example
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
Prints:
['−11.94M', '2.30M']
In my page to scrape I have some elements like
<a href='xxxx' class='main-class class2 class3'>
end others elements like
<a href='xxxx' class='main-class class4 class5'>
I want to retrieve all these elements, so I did
elems = driver.find_elements_by_xpath("//a[#class='main-class']")
but it doesn't retrieve anything. I can't search on one class only.
For the moment, it is working only if I indicate the three classes.
Thank you
You should use the following XPath:
elems = driver.find_elements_by_xpath("//a[contains(#class,'main-class')]")
This will give you all the elements containing this class
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class name with spaces, the css selector will understand you are looking for a child element of that name. So the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text inside the div, but not in it's childs. Since there is also a strong element as child of p, I would suggest using a generalist approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and it's descendants.
It's relevant to point out that extracting text from css selectors are a extension of the standard selectors. Scrapy mention this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[#class="home-hero-blurb no-select"]//text()').getall()
I am looking to retrieve the value "312 votes" from the below tag hierarchy:
<div class="rating-rank right">
<span class="rating-votes-div-65211">312 votes</span>
</div>
The problem seems to be that the span tag has a unique identifier for every values in the page. In the above case '65211'. What should i do to retrieve the required value?
I am using soup.select to get the values. But it doesn't seem to work.
for tag in soup.select('div.rating-rank right'):
try:
print(tag.string)
except KeyError:
pass
You try to select a right element that follows a div with class rating-rank. You can select what you want like this:
soup.select("div.rating-rank.right span")
With css selectors you have to read them from right to left. So div.rating-rank.right span means I want a span element which is after a div element having rating-rank, right as classes. From the moment you identified your span elements, you can print their contents like you already do.
i want to select all divs that a first part of their ID is "edit" using scrapy/XPATH.
For example:
<div id="edit3423432">...</div>
<div id="edit0036594">...</div>
For divs which have same id i use this code:
hxs.select('.//div[contains(#id,"testid")]')
But now how can i select all divs that have the first four characters equal to "edit"?
Xpath has a special function called starts-with, that would be pretty ideal here. Here's an example of how to use it:
hxs.select('.//div[starts-with(#id, 'edit')]')
Hope that helps, let me know if you have any questions.