Selenium find_elements from a tag - Python

I want to scrape some hotel information from Booking.com. The website provides some hotel information, in this particular case how many rooms are still available. The following shows the span tag from the Booking.com website, and I want to extract only the value of data-x-left-count for all listed hotels.
<span class="only_x_left sr_rooms_left_wrap " data-x-left-count="6">
Nur noch 6 Zimmer auf unserer Seite verfügbar!
</span>
I tried to approach it by finding the elements and returning an array of selenium objects.
availabilities_element = browser.find_elements_by_xpath("(//span[contains(.,'nur noch')])[2]")
And then a list comprehension to get the actual text and not the Selenium objects.
availabilities = [x.text for x in availabilities_element]
But I still have some problems getting the data. I expect a list of just the numbers of available rooms and nothing more. Is there a clean, simple solution?

Welcome to SO. Here is a simple approach to get the number of vacant rooms.
# get all the vacant room elements
rooms = driver.find_elements_by_xpath("//span[@class='only_x_left sr_rooms_left_wrap ']")
for room in rooms:
    # read the room count from the data attribute
    print(room.get_attribute('data-x-left-count'))

Assuming that attribute is only associated with rooms left, you can simply use an attribute CSS selector:
rooms_left = [item.get_attribute('data-x-left-count') for item in driver.find_elements_by_css_selector("[data-x-left-count]")]
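If you want plain integers rather than strings, a small post-processing step can follow either approach above. A minimal sketch, with sample data standing in for the live Selenium output:

```python
# Convert the attribute strings returned by get_attribute() into integers.
def counts_to_ints(raw_counts):
    """Turn attribute strings like '6' into ints, skipping missing values."""
    return [int(value) for value in raw_counts if value and value.isdigit()]

# Sample data standing in for the Selenium output above
raw_counts = ["6", "3", None, "12"]
print(counts_to_ints(raw_counts))  # [6, 3, 12]
```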

Related

How to get children of a list with no attributes with BeautifulSoup?

Situation
I'm trying to scrape the nested unordered list of the 3 "Market drivers" from this HTML:
<li>Drivers, Challenges, and Trends
<ul>
<li>Market drivers
<ul>
<li>Improvement in girth gear manufacturing technologies</li>
<li>Expansion and installation of new cement plants</li>
<li>Augmented demand from APAC</li>
</ul>
</li>
<li>Market challenges
<ul>
<li>Increased demand for refurbished girth gear segments</li>
Issue #1:
The "Market drivers" list I'm looking for doesn't have any attributes, like a class name or id, so I need to find it by the text/string within it. All tutorials show how to find elements using classes, ids, etc.
Issue #2:
The children, i.e. the list items, happen to be 3 on this page, but on other similar pages there may be 0, 4, 7 or another number of them. So I'm looking to get all the children irrespective of how many there are (or none). I've found something on getting children using recursive=False, and also some other instruction saying not to use findChildren after BS2.
Issue #3:
I tried using find_all_next, but tutorials don't tell me how to find next elements up to a defined point - it's always about getting ALL next elements. I could potentially use find_all_next if it had some "stop at" or "until you find" property.
The following code shows my attempt (but it doesn't work):
import requests
from bs4 import BeautifulSoup
url = 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-DNA-Microarray-30162580/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
toc = soup.find("div", id="toc")
drivers = toc.find(string="Market drivers").findAll("li", recursive=False).text
print(drivers)
While there is no example of expected output, I would recommend the following approach; Beautiful Soup version 4.7.0+ is required.
How to select?
To select an element by its own text and extract the text of all its children <li>, you can go with CSS selectors and a list comprehension:
[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
or in a for loop:
data = []
for x in toc.select('li:-soup-contains-own("Market drivers") li'):
    data.append(x.get_text(strip=True))
print(data)
Output:
['Improvement in girth gear manufacturing technologies', 'Expansion and installation of new cement plants', 'Augmented demand from APAC']
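The selector can also be exercised offline against the HTML fragment from the question, with no network access. A minimal sketch (requires beautifulsoup4 with a soupsieve version recent enough to support :-soup-contains-own):

```python
from bs4 import BeautifulSoup

# The nested-list fragment from the question
html = """
<li>Drivers, Challenges, and Trends
  <ul>
    <li>Market drivers
      <ul>
        <li>Improvement in girth gear manufacturing technologies</li>
        <li>Expansion and installation of new cement plants</li>
        <li>Augmented demand from APAC</li>
      </ul>
    </li>
  </ul>
</li>
"""

soup = BeautifulSoup(html, "html.parser")
# ':-soup-contains-own' matches on the element's OWN text only, so the outer
# <li> ("Drivers, Challenges, and Trends") is not selected.
drivers = [x.get_text(strip=True)
           for x in soup.select('li:-soup-contains-own("Market drivers") li')]
print(drivers)
```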

Selenium scraping

I am trying to come up with a way to scrape information on houses on Zillow, and I am currently using XPath to look at data such as rent price, principal and mortgage costs, and insurance costs.
I was able to find the information using XPath, but I wanted to make it automatic and put it inside a for loop. Then I realized that, because I was using absolute XPaths, not all the data for each listing has the same XPath; for some it would be off by one list or div. See the code below for what I mean. How do I make it more robust? Is there a way to look up a string like "principal and interest" and select the next value, which would be the numerical value that I am looking for?
works for one listing:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[1]/ul/li[1]/article/div[1]/div[2]/div")
a different listing would contain this:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[2]/ul/li[1]/article/div[1]/div[2]/div")
The xpaths that you are using are specific to the elements of the first listing. To be able to access elements for each listing, you will need to use xpaths in a way that can help you access elements for each listing:
import pandas as pd
from selenium import webdriver
I searched for listings for sale in Manhattan and got the below URL:
url = "https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/"
Asking selenium to open the above link in Chrome
driver = webdriver.Chrome()
driver.get(url)
I hovered my mouse over one of the house listings and clicked "Inspect". This opened the HTML code and highlighted the item I was inspecting. I noticed that the elements having class "list-card-info" contain all the info about the house that we need. So, our strategy will be: for each house, access the element that has class "list-card-info". Using the following code, I saved all such HTML blocks in the house_cards variable:
house_cards = driver.find_elements_by_class_name("list-card-info")
There are 40 elements in house_cards i.e. one for each house (each page has 40 houses listed)
I loop over each of these 40 houses and extract the information I need. Notice that I am now using XPaths which are relative to each "list-card-info" element. I save this info in a pandas DataFrame.
address = []
price = []
bedrooms = []
baths = []
sq_ft = []
for house in house_cards:
    address.append(house.find_element_by_class_name("list-card-addr").text)
    price.append(house.find_element_by_class_name("list-card-price").text)
    bedrooms.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[1]').text)
    baths.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[2]').text)
    sq_ft.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[3]').text)
driver.quit()
# print(address, price,bedrooms,baths, sq_ft)
Manhattan_listings = pd.DataFrame({"address": address,
                                   "bedrooms": bedrooms,
                                   "baths": baths,
                                   "sq_ft": sq_ft,
                                   "price": price})
pandas dataframe output
Now, to extract info from more pages, i.e. page 2, page 3, etc., you can loop over the website's pages, i.e. keep modifying your URL and keep extracting info.
Happy Scraping!
Selecting multiple elements using absolute XPaths is not a good idea. You can look into CSS selectors instead; using those you can match similar elements.
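As for anchoring on a label like "principal and interest" and grabbing the next value, a relative XPath with following-sibling does exactly that. A sketch using lxml on made-up markup (the real Zillow markup will differ); the same XPath works with Selenium's find_element_by_xpath:

```python
from lxml import html

# Hypothetical fragment standing in for a listing's cost breakdown
fragment = html.fromstring("""
<ul>
  <li><span>Principal and interest</span><span>$1,234</span></li>
  <li><span>Insurance</span><span>$89</span></li>
</ul>
""")

# Anchor on the label text, then take the immediately following sibling,
# which holds the numeric value we are after.
value = fragment.xpath(
    "//span[contains(text(),'Principal and interest')]"
    "/following-sibling::span[1]/text()"
)[0]
print(value)  # $1,234
```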

Scraping hyperlinks from a table using Selenium and Python

I have this table with two columns and unknown number of rows. I am trying to use Selenium (with Python) to scrape all the links into a list.
Goal: get all the links (one per row) from the second column into a list.
elements = driver.find_elements_by_xpath('') #for the table
for element in elements:
    print(element.text)
# Output is:
Penn Affiliated:
Delaware Valley Regional Planning Commission Congestion Management Intern
Contracts Intern
Transit, Bike, and Pedestrian Planning
Fabrication Lab Laser Cutter Operator
...
This prints all the rows. Now I am unsure of how to get links from the second column and all the rows.
Here is the HTML for the table:
Thanks a lot!
To get the value of the href attribute from your elements, you can do this:
elements = driver.find_elements_by_xpath("//table[@class='search']//td/a")
for element in elements:
    print(element.get_attribute("href"))
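To restrict the match to the second column only, you can index the td in the XPath. A self-contained sketch using lxml with a made-up table (the real markup may differ); with Selenium, the same XPath plus get_attribute('href') in a list comprehension gives the same list:

```python
from lxml import html

# Hypothetical two-column table standing in for the page's markup
table = html.fromstring("""
<table class="search">
  <tr><td>Penn Affiliated:</td>
      <td><a href="https://example.com/job1">Contracts Intern</a></td></tr>
  <tr><td>Other:</td>
      <td><a href="https://example.com/job2">Lab Operator</a></td></tr>
</table>
""")

# 'td[2]' keeps only anchors in each row's second cell
links = table.xpath("//table[@class='search']//td[2]/a/@href")
print(links)  # ['https://example.com/job1', 'https://example.com/job2']
```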
Well, you didn't give a URL, but basically it should be like this:
import lxml.html
doc = lxml.html.parse('http://www.gpsbasecamp.com/national-parks')
links = doc.xpath('//a[@href]')
for link in links:
    print(link.attrib['href'])

Selenium for Python: Get text() of node that is shared with another element, via XPath

On this page I would like Selenium for Python to grab the text contents of the "Investment Objective", excluding the <h3> header. I want to use XPath.
The nodes look like this:
<div class="carousel-content column fund-objective">
<h3 class="carousel-header">INVESTMENT OBJECTIVE</h3>
The Fund seeks to track the performance of an index composed of 25 of the largest Dutch companies listed on NYSE Euronext Amsterdam.
</div>
To retrieve the text, I'm using:
string = driver.find_element_by_xpath(xpath).text
If I use this XPath for the top node:
xpath = '//div[@class="carousel-content column fund-objective"]'
It will work, but it includes the <h3> header INVESTMENT OBJECTIVE — which I want to exclude.
However, if I try to use /text() to address the actual text content, it seems that Selenium for Python doesn't let me grab it while using .text to get the attribute:
xpath = '//div[@class="carousel-content column fund-objective"]/text()'
Note that there seems to be multiple nodes with this XPath on this particular page, so I'm specifying the correct node like this:
xpath = '(//div[@class="carousel-content column fund-objective"]/text())[2]'
My interpretation of the problem is that .text doesn't allow me to retrieve the text contents of the XPath sub-node text(). My apologies for incorrect terminology.
/text() will locate and return a text node, which is not an element node, and it doesn't have a text property.
One solution will be to locate both elements and remove the unwanted text:
xpath = '//div[@class="carousel-content column fund-objective"]'
element = driver.find_element_by_xpath(xpath)
all_text = element.text
title_text = element.find_element_by_xpath('./*[@class="carousel-header"]').text
objective_text = all_text.replace(title_text, '').strip()
You can try the below code to get the required output:
div = driver.find_element_by_xpath('(//div[@class="carousel-content column fund-objective"])[2]')
driver.execute_script('return arguments[0].lastChild.textContent;', div).strip()
The output is
'The Fund seeks to track the performance of an index composed of 25 of the largest Dutch companies listed on NYSE Euronext Amsterdam.'
To retrieve the text The Fund seeks to track the performance of an index composed of 25 of the largest Dutch companies listed on NYSE Euronext Amsterdam. you can use the following line of code:
string = driver.find_element_by_xpath("//div[@class='carousel-content column fund-objective' and not(@class='carousel-header')]").text

count rank of product listed in search results on website and get rank into excel using python

I am new to Python.
I started a web browser automatically using the Selenium package, opened an e-commerce website (like Amazon), and searched for my products in the search bar.
This I did successfully.
But now, having performed this search, I want to get the rank of my product in the search results using Python (e.g. will it be 5th / 6th / 50th), and then store that rank in a CSV. I brainstormed a lot but I can't come up with the code. Can anyone please help me?
Sample of Response:
Consider that my mobile product's company name is Yu Yureka.
Open Flipkart > type 'mobile' and hit the search button >
You will see the title, picture, and description of a Motorola next to Yu Yureka. So here I want the rank of my products. If there are different products coming at, say, the 6th and 5th positions, I want to retrieve the product name and the rank as 6 and 5, both.
Here is my code
from selenium import webdriver
driver = webdriver.Chrome("C:\\All\\chromedriver_win32\\chromedriver.exe")
driver.get('https://www.flipkart.com/')
driver.set_page_load_timeout(30)
driver.find_element_by_class_name("LM6RPg").send_keys("mobile")
driver.find_element_by_class_name("vh79eN").click()
Now, how do I get the rank of Yu Yureka (both products) in the search results?
It is 5th or 6th there.
Thanks in advance!
You need to pick the elements from the list of elements with class name _2_KrJI and get their text attribute.
element_list = driver.find_elements_by_class_name("_2_KrJI")
print(element_list[4].text)
print(element_list[5].text)
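Once the result titles are collected as above, computing the ranks and writing them to a CSV is plain Python. A sketch with sample titles standing in for the live Selenium output:

```python
import csv

def product_ranks(titles, keyword):
    """Return 1-based (rank, title) pairs for every title containing keyword."""
    keyword = keyword.lower()
    return [(rank, title) for rank, title in enumerate(titles, start=1)
            if keyword in title.lower()]

# Sample data standing in for [el.text for el in element_list]
titles = ["Motorola G5", "Samsung Galaxy", "Yu Yureka Black",
          "Nokia 6", "Yu Yureka Plus"]
ranks = product_ranks(titles, "Yu Yureka")
print(ranks)  # [(3, 'Yu Yureka Black'), (5, 'Yu Yureka Plus')]

# Store the ranks in a CSV file
with open("ranks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "title"])
    writer.writerows(ranks)
```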
