I'm using Selenium to scrape a Web Page and I'm having some problems targeting some attributes.
The page I'm trying to scrape looks like this:
<div>
<span abc> content </span>
<span def> content2 </span>
</div>
My goal would be to retrieve the text within the "span abc" tag, without selecting the other text included in the "span def" tag.
I've tried multiple approaches and looked at a lot of different resources but I wasn't able to find the right approach, since I don't want to select all the spans at the same time and I don't want to search based on the text within the tags.
A simple approach would be indexing, since you do not want to select all the spans at the same time and you don't want to search based on the text within the tags.
If abc is an attribute, use:
//div/span[@abc]
or, with indexing:
(//div/span[@abc])[1]
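A minimal sketch of how that could look from Python, assuming driver is your existing Selenium WebDriver session with the page from the question already loaded:
# assumes `driver` already has the page from the question loaded
span_text = driver.find_element_by_xpath("//div/span[@abc]").text
print(span_text)  # -> content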
If you only want to pull the first span of these two, you can easily do this with an XPath. It would look like this:
span = driver.find_element_by_xpath("/html/body/div/span[1]").text
If you want to pull every span and run commands against each of them, you could do:
count = len(driver.find_elements_by_xpath("/html/body/div/span"))
m = 1
while m <= count:
    span = driver.find_element_by_xpath("/html/body/div/span[" + str(m) + "]")
    print(span.text)
    m = m + 1
You can use an XPath like //span[1] and then read .text to get the text inside the <span> tag (Selenium returns elements, not text() nodes):
span_text = driver.find_element_by_xpath("/html/body/div/span[1]").text
Using the HTML elements below, I have been unsuccessful in using BeautifulSoup's .select to obtain only the first child after <div class="wrap-25PNPwRV"> (i.e. -11.94M and 2.30M) in list format:
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
The above are just two examples from the HTML I'm attempting to scrape; the source code sits inside a table rendered by dynamic JavaScript. There are many more div elements on the page, and many more div class="wrap-25PNPwRV" elements inside that table.
I currently have the code below, which lets me scrape all the contents within div class="wrap-25PNPwRV":
data_list = [elem.get_text() for elem in soup.select("div.wrap-25PNPwRV")]
Output:
['-11.94M', '-119.94%', '2.30M', '-80.17%']
However, I would like to use soup.select to yield the desired output :
['-11.94M', '2.30M']
I tried following this guide https://www.crummy.com/software/BeautifulSoup/bs4/doc/ but have been unable to apply it to my code above.
Please note that if soup.select cannot do the above, I am happy to use an alternative, provided it produces the same list format/output.
You can use the :nth-of-type CSS selector:
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
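For example, run against the markup from the question (a sketch assuming BeautifulSoup 4 with the built-in html.parser):
from bs4 import BeautifulSoup

html = '''
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
'''
soup = BeautifulSoup(html, "html.parser")
# keep only the first <div> of its type inside each wrap div
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
print(data_list)  # -> ['−11.94M', '2.30M']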
I'd suggest not relying on the .wrap-25PNPwRV class. It looks auto-generated and will almost certainly change in the future.
Instead, select the <div> element that has another element with class="change..." as a sibling. For example:
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
Prints:
['−11.94M', '2.30M']
I'm trying to scrape the span from a button that has a particular class. This is the markup from the page on the website:
<button class="sqdOP yWX7d _8A5w5 " type="button">altri <span>17</span></button>
I'd like to find "17", which obviously changes every time. Thanks.
I've tried this, but it doesn't work:
for item in soup.find_all('button', {'class': 'sqdOP yWX7d _8A5w5 '}):
For complex selections, it's best to use CSS selectors. These work very similarly to CSS.
p selects an element with the type p.
p.example selects an element with type p and class example.
p span selects any span inside a p.
There are also others, but only these are needed for this example.
These can be nested as you like. For example, p.example span.foo selects any span with class foo inside any p with class example.
Now, an element can have multiple classes, and they are separated by spaces. <p class="foo bar">Hello, World!</p> has both foo and bar as class.
I think it is safe to assume the class sqdOP is unique. You can build the selector pretty easily using the rules above:
button.sqdOP span
Now, issue select, and BeautifulSoup will return a list of matching elements. If this is the only one, you can safely use [0] to get the first item. So, the final code to select that span:
soup.select('button.sqdOP span')[0]
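Put together, a small sketch using the markup from the question:
from bs4 import BeautifulSoup

html = '<button class="sqdOP yWX7d _8A5w5 " type="button">altri <span>17</span></button>'
soup = BeautifulSoup(html, "html.parser")
# span inside a button with class sqdOP; [0] takes the first (and only) match
print(soup.select('button.sqdOP span')[0].text)  # -> 17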
I have been trying to create an XPath that locates the first three Yes texts within p elements, up to the text Demarcation within an h1 element. The existing one, used in the script below, locates the text within all of the p elements, and I can't figure out how to move forward. Just consider the one I've created already to be a placeholder.
How can I create an XPath able to locate the first three Yes values within p elements and nothing else?
My attempt so far:
from lxml.html import fromstring
htmldoc="""
<li>
<a>Nope</a>
<a>Nope</a>
<p>Yes</p>
<p>Yes</p>
<p>Yes</p>
<h1>Demarcation</h1>
<p>No</p>
<p>No</p>
<h1>Not this</h1>
<p>No</p>
<p>Not this</p>
</li>
"""
root = fromstring(htmldoc)
for item in root.xpath("//li/p"):
    print(item.text)
Try the below to select paragraphs that are preceding siblings of the header "Demarcation":
//li/p[following-sibling::h1[.="Demarcation"]]
It looks like you are trying to depend on the h1 tag containing Demarcation, so start from it:
//h1[contains(., "Demarcation")]/preceding-sibling::p[contains(., "Yes")][position()<4]
The idea is to get the preceding p elements, and I added position()<4 so you only get three; you can remove that if you just need all of the p elements:
//h1[contains(., "Demarcation")]/preceding-sibling::p[contains(., "Yes")]
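A quick sketch of running it with lxml, assuming htmldoc is the string from the question above:
from lxml.html import fromstring

root = fromstring(htmldoc)  # htmldoc as defined in the question
xp = '//h1[contains(., "Demarcation")]/preceding-sibling::p[contains(., "Yes")][position()<4]'
for item in root.xpath(xp):
    print(item.text)  # prints Yes three times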
I am trying to extract the release date of a game from Steam's store page. The html that I'm working with is as follows:
<div class="details_block">
<b>Title:</b> Total War™: ROME II - Emperor Edition<br>
<b>Genre:</b> Strategy<br>
<b>Developer:</b>
Creative Assembly
<br>
<b>Publisher:</b>
SEGA <br>
<b>Release Date:</b> Sep 2, 2013<br>
</div>
Ultimately, my goal is to retrieve a number of values from this "details_block" div. I tried extracting all br tags from this section of code with:
details_block = bsObj.find("div", class_="details_block")
for br in details_block.findAll('br'):
    br.extract()
Then I access each piece of data that I want individually. I am a little stuck on the release date though. I am trying to access it with find_next_sibling() but nothing is being found, presumably because find_next_sibling only grabs elements with tags:
releaseDatePattern = re.compile(r'Release Date:')
print(details_block.find('b', text=releaseDatePattern).find_next_sibling().text.strip())
However, before I had extracted all of the br tags, it WAS finding the value, but it was attaching a br tag to it, which I did not want.
Is there an effective way to grab the release date without assuming that the details in the details_block will stay in this order?
First find all the b tags in the block. Then iterate over each b tag, and you can get the text as b.next_sibling.
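A small sketch of that approach, using a local copy of the markup from the question:
from bs4 import BeautifulSoup

html = '''<div class="details_block">
<b>Title:</b> Total War™: ROME II - Emperor Edition<br>
<b>Genre:</b> Strategy<br>
<b>Developer:</b>
Creative Assembly
<br>
<b>Publisher:</b>
SEGA <br>
<b>Release Date:</b> Sep 2, 2013<br>
</div>'''

details_block = BeautifulSoup(html, "html.parser").find("div", class_="details_block")
for b in details_block.find_all("b"):
    label = b.get_text(strip=True).rstrip(":")
    value = b.next_sibling.strip()  # the text node right after the <b> tag
    if label == "Release Date":
        print(value)  # -> Sep 2, 2013

This way the order of the details inside the block does not matter; each value is read relative to its own label.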
I am looking to retrieve the value "312 votes" from the below tag hierarchy:
<div class="rating-rank right">
<span class="rating-votes-div-65211">312 votes</span>
</div>
The problem seems to be that the span tag has a unique identifier for every value on the page, '65211' in the above case. What should I do to retrieve the required value?
I am using soup.select to get the values, but it doesn't seem to work:
for tag in soup.select('div.rating-rank right'):
    try:
        print(tag.string)
    except KeyError:
        pass
You are trying to select a right element that follows a div with class rating-rank. You can select what you want like this:
soup.select("div.rating-rank.right span")
With CSS selectors you have to read them from right to left, so div.rating-rank.right span means: a span element inside a div element that has rating-rank and right as classes. Once you have identified your span elements, you can print their contents like you already do.
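For example, a small sketch against the markup from the question:
from bs4 import BeautifulSoup

html = '''<div class="rating-rank right">
<span class="rating-votes-div-65211">312 votes</span>
</div>'''
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select("div.rating-rank.right span"):
    print(tag.string)  # -> 312 votes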