I have a table with two columns and an unknown number of rows. I am trying to use Selenium (with Python) to scrape all the links into a list.
Goal: get all the links (one per row) from the second column into a list.
elements = driver.find_elements_by_xpath('')  # for the table
for element in elements:
    print(element.text)
# Output is:
Penn Affiliated:
Delaware Valley Regional Planning Commission Congestion Management Intern
Contracts Intern
Transit, Bike, and Pedestrian Planning
Fabrication Lab Laser Cutter Operator
...
This prints all the rows. Now I am unsure how to get the links from the second column for all the rows.
Here is the HTML for the table:
Thanks a lot!
To get the value of the href attribute from your elements, you can do this:
elements = driver.find_elements_by_xpath("//table[@class='search']//td/a")
for element in elements:
    print(element.get_attribute("href"))
Well, you didn't give a URL, but basically it should be like this.
import lxml.html

doc = lxml.html.parse('http://www.gpsbasecamp.com/national-parks')
links = doc.xpath('//a[@href]')
for link in links:
    print(link.attrib['href'])
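The same idea can be checked offline with lxml against a small stand-in snippet. The table markup below is hypothetical (the asker's HTML wasn't posted); the class name 'search' is taken from the Selenium answer above.

```python
import lxml.html

# Hypothetical stand-in for the asker's table; real HTML wasn't posted
html = """
<table class="search">
  <tr><td>Penn Affiliated:</td><td><a href="/jobs/1">Contracts Intern</a></td></tr>
  <tr><td>Penn Affiliated:</td><td><a href="/jobs/2">Fabrication Lab Laser Cutter Operator</a></td></tr>
</table>
"""
doc = lxml.html.fromstring(html)
# collect every href in the table's link cells into a list
links = [a.get('href') for a in doc.xpath("//table[@class='search']//td/a")]
print(links)  # ['/jobs/1', '/jobs/2']
```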
I am trying to come up with a way to scrape information on houses on Zillow. I am currently using XPath to look at data such as rent price, principal and mortgage costs, and insurance costs.
I was able to find the information using XPath, but I wanted to make it automatic by putting it inside a for loop. I then realized that not all the data for each listing has the same XPath: for some listings the index of a list or div is off by one. See the code below for what I mean. How do I make this more robust? Is there a way to look for a string like "principal and interest" and select the next value, which would be the numerical value I am looking for?
works for one listing:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[1]/ul/li[1]/article/div[1]/div[2]/div")
a different listing would contain this:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[2]/ul/li[1]/article/div[1]/div[2]/div")
The XPaths you are using are absolute paths specific to the elements of the first listing. To access the same elements in every listing, you will need XPaths that are relative to each listing's container:
import pandas as pd
from selenium import webdriver
I searched for listings for sale in Manhattan and got the URL below:
url = "https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/"
Asking selenium to open the above link in Chrome
driver = webdriver.Chrome()
driver.get(url)
I hovered my mouse over one of the house listings and clicked "Inspect". This opened the HTML code and highlighted the item being inspected. I noticed that the elements with class "list-card-info" contain all the info about a house that we need. So the strategy is: for each house, access the element with class "list-card-info". Using the following code, I saved all such HTML blocks in the house_cards variable:
house_cards = driver.find_elements_by_class_name("list-card-info")
There are 40 elements in house_cards, i.e. one per house (each page lists 40 houses).
Next, I loop over each of these 40 houses and extract the information I need. Notice that I am now using XPaths relative to each "list-card-info" element. I save this info in a pandas DataFrame:
address = []
price = []
bedrooms = []
baths = []
sq_ft = []

for house in house_cards:
    address.append(house.find_element_by_class_name("list-card-addr").text)
    price.append(house.find_element_by_class_name("list-card-price").text)
    bedrooms.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[1]').text)
    baths.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[2]').text)
    sq_ft.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[3]').text)

driver.quit()

# print(address, price, bedrooms, baths, sq_ft)
manhattan_listings = pd.DataFrame({"address": address,
                                   "bedrooms": bedrooms,
                                   "baths": baths,
                                   "sq_ft": sq_ft,
                                   "price": price})
[screenshot: pandas DataFrame output]
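The leading "." in the loop's XPaths is what scopes each lookup to the current card. That idea can be sketched in isolation with lxml on made-up markup (the class names below mirror the Zillow ones used above, but the HTML itself is hypothetical):

```python
import lxml.html

# Hypothetical markup: two listing cards, each with its own address element
html = """
<div id="page">
  <div class="list-card-info"><div class="list-card-addr">1 Main St</div></div>
  <div class="list-card-info"><div class="list-card-addr">2 Main St</div></div>
</div>
"""
tree = lxml.html.fromstring(html)
cards = tree.xpath('//div[@class="list-card-info"]')
# './/' searches only within the current card, not the whole document
addresses = [card.xpath('.//div[@class="list-card-addr"]/text()')[0] for card in cards]
print(addresses)  # ['1 Main St', '2 Main St']
```

Without the leading dot, '//div[@class="list-card-addr"]' inside the loop would search the whole page from every card and always return the first match.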
Now, to extract info from more pages (page 2, page 3, etc.), you can loop over the website's pages, i.e. keep modifying the URL and extracting info.
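A minimal sketch of that URL loop follows. The exact pagination scheme is an assumption here (a common Zillow pattern appends "<n>_p/" for page n); verify it against the site before relying on it:

```python
# Hypothetical pagination sketch; the "<n>_p/" suffix is an assumption
base = "https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/"
page_urls = [base if n == 1 else f"{base}{n}_p/" for n in range(1, 4)]
for url in page_urls:
    # in the real script: driver.get(url), then re-run the extraction loop
    print(url)
```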
Happy Scraping!
Selecting multiple elements with long absolute XPaths is not a good idea. You can look into CSS selectors; with one selector you can match all the similar elements.
I want to scrape some hotel information from Booking.com. The website provides some hotel information, in this particular case how many rooms are still available. The snippet below shows the span tag from the Booking.com website (the German text reads "Only 6 rooms left on our site!"), and I want to extract only the number in data-x-left-count for all listed hotels.
<span class="only_x_left sr_rooms_left_wrap " data-x-left-count="6">
Nur noch 6 Zimmer auf unserer Seite verfügbar!
</span>
I tried to approach it by finding the elements and returning an array of selenium objects.
availabilities_element = browser.find_elements_by_xpath("(//span[contains(.,'nur noch')])[2]")
And then a list comprehension to get the actual text and not the Selenium objects.
availabilities = [x.text for x in availabilities_element]
But I still have some problems getting the data. I expect to get a list of just the numbers of available rooms and nothing more. Is there a clean, simple solution?
Welcome to SO. Here is the simple approach to get the number of vacant rooms.
# get all the vacant-room elements (note the trailing space in the class value)
rooms = driver.find_elements_by_xpath("//span[@class='only_x_left sr_rooms_left_wrap ']")
for room in rooms:
    # read the room count from the data attribute
    print(room.get_attribute('data-x-left-count'))
Assuming that attribute is only associated with rooms left, you can simply use an attribute selector:
rooms_left = [item.get_attribute('data-x-left-count') for item in driver.find_elements_by_css_selector("[data-x-left-count]")]
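The attribute-based selection can be verified offline with lxml against the markup quoted in the question (duplicated here with a second made-up entry):

```python
import lxml.html

# Offline stand-in for the Booking.com markup quoted in the question;
# the second span is invented to show multiple results
html = """
<div>
  <span class="only_x_left sr_rooms_left_wrap " data-x-left-count="6">
    Nur noch 6 Zimmer auf unserer Seite verfügbar!
  </span>
  <span class="only_x_left sr_rooms_left_wrap " data-x-left-count="2">
    Nur noch 2 Zimmer auf unserer Seite verfügbar!
  </span>
</div>
"""
tree = lxml.html.fromstring(html)
# selecting on the attribute itself sidesteps the trailing space in the class value
rooms_left = tree.xpath('//span[@data-x-left-count]/@data-x-left-count')
print(rooms_left)  # ['6', '2']
```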
On this page I would like Selenium for Python to grab the text contents of the "Investment Objective", excluding the <h3> header. I want to use XPath.
The nodes look like this:
<div class="carousel-content column fund-objective">
<h3 class="carousel-header">INVESTMENT OBJECTIVE</h3>
The Fund seeks to track the performance of an index composed of 25 of the largest Dutch companies listed on NYSE Euronext Amsterdam.
</div>
To retrieve the text, I'm using:
string = driver.find_element_by_xpath(xpath).text
If I use this XPath for the top node:
xpath = '//div[@class="carousel-content column fund-objective"]'
it works, but it includes the text of the <h3> header, INVESTMENT OBJECTIVE, which I want to exclude.
However, if I try to use /text() to address the actual text content, it seems that Selenium for Python doesn't let me retrieve it via the .text attribute:
xpath = '//div[@class="carousel-content column fund-objective"]/text()'
Note that there are multiple nodes matching this XPath on this particular page, so I'm specifying the correct node like this:
xpath = '(//div[@class="carousel-content column fund-objective"]/text())[2]'
My interpretation of the problem is that .text doesn't allow me to retrieve the contents of the text() sub-node. My apologies for any incorrect terminology.
/text() locates and returns a text node, which is not an element node and therefore has no .text property.
One solution is to locate both elements and remove the unwanted text:
xpath = '//div[@class="carousel-content column fund-objective"]'
element = driver.find_element_by_xpath(xpath)
all_text = element.text
title_text = element.find_element_by_xpath('./*[@class="carousel-header"]').text
objective_text = all_text.replace(title_text, '').strip()
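The limitation is specific to Selenium's element model; as a point of comparison, lxml returns text nodes from /text() as plain strings, which can be checked offline against the markup quoted in the question:

```python
import lxml.html

# Offline stand-in for the iShares markup quoted in the question
html = """
<div class="carousel-content column fund-objective">
  <h3 class="carousel-header">INVESTMENT OBJECTIVE</h3>
  The Fund seeks to track the performance of an index composed of 25 of the largest Dutch companies listed on NYSE Euronext Amsterdam.
</div>
"""
div = lxml.html.fromstring(html)
# lxml hands back the bare text nodes, skipping the <h3> element entirely
texts = [t.strip() for t in div.xpath('text()') if t.strip()]
print(texts[0])
```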
You can try the code below to get the required output:
div = driver.find_element_by_xpath('(//div[@class="carousel-content column fund-objective"])[2]')
driver.execute_script('return arguments[0].lastChild.textContent;', div).strip()
The output is
'The Fund seeks to track the performance of an index composed of 25 of the largest Dutch companies listed on NYSE Euronext Amsterdam.'
To retrieve the text The Fund seeks to track the performance of an index composed of 25 of the largest Dutch companies listed on NYSE Euronext Amsterdam. you can use the following line of code:
string = driver.find_element_by_xpath("//div[@class='carousel-content column fund-objective' and not(@class='carousel-header')]").text
So I'm webscraping a page (http://canoeracing.org.uk/marathon/results/burton2016.htm) where there are multiline cells in tables:
I'm using the following code to scrape each column (the one below so happens to scrape the names):
import lxml.html
from lxml.cssselect import CSSSelector
# get some html
import requests
r = requests.get('http://canoeracing.org.uk/marathon/results/burton2016.htm')
# build the DOM Tree
tree = lxml.html.fromstring(r.text)
# construct a CSS Selector
sel1 = CSSSelector('body > table > tr > td:nth-child(2)')
# Apply the selector to the DOM tree.
results1 = sel1(tree)
# get the text out of all the results
data1 = [result.text for result in results1]
Unfortunately it's only returning the first name from each cell, not both. I've tried a similar thing with the web-scraping tool Kimono and was able to scrape both, but I want to set up Python code, as Kimono falls down when running over multiple webpages.
The problem is that some of the cells contain multiple text nodes delimited by a <br>. In cases like this, find all the text nodes and join them:
data1 = [", ".join(result.xpath("text()")) for result in results1]
For the provided rows in the screenshot, you would get:
OSCAR HUISSOON, FREJA WEBBER
ELLIE LAWLEY, RHYS TIPPINGS
ALLISON MILES, ALEX MILES
NICOLA RUDGE, DEBORAH CRUMP
You could have also used .text_content() method, but you would lose the delimiter between the text nodes, getting things like OSCAR HUISSOONFREJA WEBBER in the result.
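The difference between joining text() nodes and using .text_content() can be seen on a minimal stand-in cell:

```python
import lxml.html

# Minimal stand-in for a results cell with two names separated by <br>
html = "<table><tr><td>OSCAR HUISSOON<br>FREJA WEBBER</td></tr></table>"
cell = lxml.html.fromstring(html).xpath('//td')[0]
# text() returns each text node separately, so a delimiter can be inserted
print(", ".join(cell.xpath('text()')))  # OSCAR HUISSOON, FREJA WEBBER
# text_content() concatenates them with no delimiter
print(cell.text_content())              # OSCAR HUISSOONFREJA WEBBER
```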
This is my first attempt at using XPath.
I am attempting to pull out information about products in a listing using the XPath interface via Scrapy in Python, specifically the prices, from this URL:
http://store.nike.com/gb/en_gb/pw/mens-shoes/7puZoi3?ipp=120#
As you can see, the prices (going horizontally from left to right) are £90, £120, £100, ...
The following will return a list of all the trainer prices on the page:
item['trainerPrice'] = response.xpath('//span[@class="local nsg-font-family--base"]/text()').extract()
Moreover, the following will return the first "record":
item['trainerPrice'] = response.xpath('string(//span[@class="local nsg-font-family--base"]/text())').extract()
However I am unsure, how to select the second record, i.e. u'\xa3120'
Any hints?
You can just get the second item from the extracted list:
prices = response.xpath('//span[@class="local nsg-font-family--base"]/text()').extract()
print(prices[1])
Though, I don't particularly like your locator. Instead I would take the prices div to rely on:
response.css('div.prices span.local::text')
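The indexing approach can be checked offline with lxml, assuming (hypothetically) markup matching the class names from the question:

```python
import lxml.html

# Hypothetical stand-in for the Nike price spans, using the class name from the question
html = """
<div class="prices">
  <span class="local nsg-font-family--base">£90</span>
  <span class="local nsg-font-family--base">£120</span>
  <span class="local nsg-font-family--base">£100</span>
</div>
"""
tree = lxml.html.fromstring(html)
prices = tree.xpath('//span[@class="local nsg-font-family--base"]/text()')
# the second record is simply index 1 of the extracted list
print(prices[1])  # £120
```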