I am trying to scrape some information from this website: https://www.gumtree.co.za (this is the link of the property I am taking information from: https://www.gumtree.co.za/a-house-rentals-flat-rentals-offered/tamboerskloof/studio-flatlet-in-tamboerskloof/1005754794350910092234609). More specifically, I am trying to take information from these span classes:
<div class="attribute">
<span class="name">Bathrooms (#):</span>
<span class="value">1</span>
</div>
I first want to check whether the span class contains 'Bathrooms' and then take the value for it. This is what I have right now:
bathrooms=response.xpath("//span[contains(text(),'Bathrooms')]/span[#class='value']text()").extract_first()
However, I do not get anything.
Any suggestions? Thank you!
This is the correct way to reach the sibling: use the following-sibling axis, then take the value span's text.
Bathrooms=response.xpath("//span[contains(text(),'Bathrooms')]/following-sibling::span[@class='value']/text()").extract_first()
For more, you can refer to this: XPath Axes
Hope this helps.
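To sanity-check the sibling lookup outside Scrapy, the same idea can be sketched with only the standard library (the HTML is hard-coded from the question; in Scrapy you would keep using `response.xpath`, since `ElementTree` only supports a small XPath subset):

```python
import xml.etree.ElementTree as ET

html = """
<div class="attribute">
  <span class="name">Bathrooms (#):</span>
  <span class="value">1</span>
</div>
"""

root = ET.fromstring(html)
spans = list(root.iter("span"))
bathrooms = None
for i, span in enumerate(spans):
    # find the label span, then take the text of its following sibling
    if span.get("class") == "name" and "Bathrooms" in (span.text or ""):
        bathrooms = spans[i + 1].text
print(bathrooms)  # -> 1
```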
I'm working on a script that will take the src of a website, and take screenshots of relevant parts of the site. More specifically, I'm interested in taking screenshots of posts from a site, including their respective comments and replies.
Currently, I am able to generate all these screenshots as desired; however, I am encountering an issue when a given post's content exceeds the length of the Selenium browser window. A sample HTML snippet is below:
<div class="detail word-break">
<p id="contentArea">
Sample text content here. As you can see, the text is inside a p tag
<br>
<br>
...
My issue can be boiled down to wanting to treat each of these text elements as a separate WebElement for the purpose of taking Selenium screenshots
<br>
<br>
Using the XPath selector for "./child::*" on the contentArea element only returns a list of br tags, with no text content inside
...
</p>
</div>
Is it possible to take the WebElement for the contentArea, and subdivide it into smaller WebElements that contain the tagless text so they can be screenshotted individually?
I ended up finding a workaround for my issue. Rather than splitting the post into separate elements, I was instead able to parse out the text and separate each paragraph (i.e. subelement) into a list. Then, using the JS executor, I was able to replace the text of the entire post with just that of each paragraph, and take the screenshots that way.
Hopefully anyone facing a similar issue will find this workaround useful!
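The splitting step of that workaround can be sketched without a browser: split the post's innerHTML on runs of `<br>` tags (the HTML string and variable names here are illustrative; in Selenium you would then feed each paragraph back into the post via the JS executor and screenshot it):

```python
import re

# illustrative innerHTML of the contentArea element
inner_html = (
    "First paragraph of the post."
    "<br><br>"
    "Second paragraph of the post."
    "<br><br>"
    "Third paragraph of the post."
)

# split on one or more consecutive <br> tags, dropping empty pieces
paragraphs = [p.strip() for p in re.split(r"(?:<br\s*/?>\s*)+", inner_html) if p.strip()]
print(len(paragraphs))  # -> 3
```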
I'm currently learning Python, and as a project for myself, I'm learning how to use Selenium to interact with websites. I have found the element through its id in the HTML, but I don't know how to reference the heading inside the element. In this case, I just want the string from <h4>.
<div class="estimate-box equal-width" id="estimate-box">
<a href="#worth" id="worthBox" style="text-decoration: none;">
<h5>Worth</h5>
<h4>$5.02</h4>
</a>
</div>
My question is, how do I get python to extract just the text in <h4>? I'm sorry if I formatted this wrong, this is my first post. Thanks in advance!
Use the following XPath:
print(driver.find_element_by_xpath("//a[@id='worthBox']/h4").text)
Or the following CSS selector:
print(driver.find_element_by_css_selector("#worthBox>h4").text)
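Both locators can be checked offline against the snippet from the question; `xml.etree.ElementTree` happens to support the attribute-predicate XPath subset used here (the `style` attribute is dropped only to keep the sample short):

```python
import xml.etree.ElementTree as ET

html = """
<div class="estimate-box equal-width" id="estimate-box">
  <a href="#worth" id="worthBox">
    <h5>Worth</h5>
    <h4>$5.02</h4>
  </a>
</div>
"""

root = ET.fromstring(html)
# same idea as //a[@id='worthBox']/h4 in Selenium
worth = root.find(".//a[@id='worthBox']/h4").text
print(worth)  # -> $5.02
```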
So I've been going through the online book "Automate the Boring Stuff with Python" and I'm learning about BeautifulSoup. My issue is I can't seem to figure out how to choose the appropriate tag based on what I find using the developer's tools in Chrome.
<div data-hveid=.....>
<div class="rc">
<a href="https://www.python.org/".....>
<h3 class="LC20lb">Welcome to Python.org</h3>
# Using select to grab links to search results.
linkElems = soup.select('r .a')
An example of the inspector results.
In the book the goal was to grab all the links that show up on the search results page of a google search. To do so the author uses the line soup.select('r .a'). But when I use the inspector I get to the "a href" tag.
On my own I wanted to also grab the title/heading of a link that shows up on the search results page. The inspector highlights the "h3 class" tag. I tried to select that by telling select to look for tags with the class attribute equal to "LC20lb" but I keep getting an empty list as output.
So my question is, once the inspector has helped us narrow our focus, how do we know which tag is the appropriate one to select? How did the author know to go with '.r a' instead of the "a href" tag? In general, how far "out", i.e. which ancestor, should I choose once the inspector has shown me a particular element?
If you do 'a href' you haven't specified a div class, so it's going to get all instances of a href, which will include links to things like Maps and Drive, etc. In the HTML you cite, you missed the "r" div class:
<div data-hveid=.....>
<div class="rc">
<div class="r">
<a href="https://www.python.org/".....>
<h3 class="LC20lb">Welcome to Python.org</h3>
So soup.select('.r a') is getting all the a tags in the "r" div class (which is the search results), rather than all instances of a href tags.
Hope this answers your question!
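As a quick illustration of what '.r a' matches (a minimal stand-alone sketch using only the standard library; `soup.select('.r a')` does the same thing, more flexibly, in BeautifulSoup):

```python
import xml.etree.ElementTree as ET

# trimmed-down version of the search-result markup from the answer
html = """
<root>
  <div class="rc">
    <div class="r">
      <a href="https://www.python.org/">
        <h3 class="LC20lb">Welcome to Python.org</h3>
      </a>
    </div>
  </div>
</root>
"""

tree = ET.fromstring(html)
# equivalent in spirit to soup.select('.r a'): anchors inside div class="r"
links = [a.get("href") for a in tree.findall(".//div[@class='r']/a")]
print(links)  # -> ['https://www.python.org/']
```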
I am trying to get the xpath for the following code but can't seem to figure it out.
<i class="icon icon-button-follow pointer action-button js-follow-unfollow-button" data-type="follow" data-id="3470861"></i>
You have to understand what XPath is and what it is for. An XPath expression is a locator that you can use to distinguish one element from another.
I can recommend this class for a better understanding of the question: http://practicalsqa.net/xpath-brainteasers-and-exercises/
In your case the XPath may be, for example:
//i[#class="icon icon-button-follow pointer action-button js-follow-unfollow-button"]
or
//i[#data-type="follow"]
You have to use something that is unique to this element.
It depends on why you want that node, right?
//i[#data-id="3470861"]
should find it. But will you always have that id?
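Either locator can be verified against the snippet itself; a minimal stand-alone check (the `<i>` is wrapped in a dummy parent so it parses on its own):

```python
import xml.etree.ElementTree as ET

html = ('<div><i class="icon icon-button-follow pointer action-button '
        'js-follow-unfollow-button" data-type="follow" data-id="3470861"></i></div>')

root = ET.fromstring(html)
# match on the short, stable attribute instead of the long class string
follow = root.find(".//i[@data-type='follow']")
print(follow.get("data-id"))  # -> 3470861
```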
I am learning web scraping on my own and I am trying to scrape reviewers' ratings on Yelp as practice. Typically, I can use CSS selectors or XPath to select the contents I am interested in. However, those methods do not work for selecting reviewers' ratings. For instance, on the following page: https://www.yelp.com/user_details_reviews_self?userid=0S6EI51ej5J7dgYz3-O0lA, the CSS selector for the first rating is '.stars_2'. However, if I use this selector in my RSelenium code as follows:
ratings=remDr$findElements('css selector','.stars_2')
ratings=unlist(lapply(ratings, function(x){x$getElementText()}))
I get NULL. I think the reason is that the rating is actually an image. I paste a small part of the page source here:
<div class="review-content">
<div class="review-content">
<div class="biz-rating biz-rating-very-large clearfix">
<div>
<div class="rating-very-large">
<i class="star-img stars_2" title="2.0 star rating">
<img alt="2.0 star rating" class="offscreen" height="303" src="//s3-media4.fl.yelpcdn.com/assets/srv0/yelp_styleguide/c2252a4cd43e/assets/img/stars/stars_map.png" width="84">
</i>
</div>
</div>
Basically, if I can extract the text from class="star-img stars_2" or title="2.0 star rating" then I am good. Can anyone help me with this?
You might want to try something like this approach:
Using the Yelp API with R, attempting to search business types using geo-coordinates
Though it seems some folks found this outdated, I found some useful code on the Yelp GitHub page:
https://github.com/Yelp/yelp-api/pull/88
https://github.com/Yelp/yelp-api/pull/88/commits/95009afde2b47e8244fda3d435f0476205cc0039
Good luck!
:)
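Another angle on the original question: since the star rating is an image, the element text is empty, but the rating string is sitting in the `title` attribute, so reading that attribute (e.g. with `getElementAttribute("title")` in RSelenium) should work. A small Python sketch of the parsing, with the unclosed `<img>` omitted so the fragment is well-formed:

```python
import xml.etree.ElementTree as ET

html = """
<div class="rating-very-large">
  <i class="star-img stars_2" title="2.0 star rating"></i>
</div>
"""

root = ET.fromstring(html)
# the rating is carried by the title attribute, not the element text
title = root.find(".//i").get("title")
rating = float(title.split()[0])
print(title, rating)  # -> 2.0 star rating 2.0
```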