Selenium scraping - python

I am trying to scrape information about houses on Zillow, and I am currently using XPath to read data such as rent price, principal and mortgage costs, and insurance costs.
I was able to find the information using XPath, but when I tried to automate it in a for loop I realized that the XPath is not the same for every listing: for some listings it is off by one at a list or div index. See the code below for what I mean. How can I make the lookup more specific? Is there a way to search for a string like "principal and interest" and select the next value, which would be the numerical value I am looking for?
This works for one listing:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[1]/ul/li[1]/article/div[1]/div[2]/div")
A different listing would need this:
driver.find_element_by_xpath("/html/body/div[1]/div[6]/div/div[1]/div[1]/div[2]/ul/li[1]/article/div[1]/div[2]/div")

The XPaths you are using are absolute paths that are specific to the elements of the first listing. To reach the same fields in every listing, locate each listing's container first and then use XPaths that are relative to that container:
import pandas as pd
from selenium import webdriver
I searched for listings for sale in Manhattan and got the URL below
url = "https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/"
Asking selenium to open the above link in Chrome
driver = webdriver.Chrome()
driver.get(url)
I hovered my mouse over one of the house listings and clicked "inspect". This opened the HTML code and highlighted the item I was inspecting. I noticed that the elements with class "list-card-info" contain all the house info we need. So the strategy is to access, for each house, the element that has class "list-card-info". Using the following code, I saved all such HTML blocks in the house_cards variable:
house_cards = driver.find_elements_by_class_name("list-card-info")
There are 40 elements in house_cards i.e. one for each house (each page has 40 houses listed)
I loop over each of these 40 houses and extract the information I need. Notice that I am now using XPaths that are relative to the "list-card-info" element (they start with ".//"). I save this info in a pandas DataFrame.
address = []
price = []
bedrooms = []
baths = []
sq_ft = []
for house in house_cards:
    address.append(house.find_element_by_class_name("list-card-addr").text)
    price.append(house.find_element_by_class_name("list-card-price").text)
    bedrooms.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[1]').text)
    baths.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[2]').text)
    sq_ft.append(house.find_element_by_xpath('.//div[@class="list-card-heading"]/ul[@class="list-card-details"]/li[3]').text)
driver.quit()
# print(address, price,bedrooms,baths, sq_ft)
Manhattan_listings = pd.DataFrame({"address": address,
                                   "bedrooms": bedrooms,
                                   "baths": baths,
                                   "sq_ft": sq_ft,
                                   "price": price})
Now, to extract info from more pages (page 2, page 3, etc.), you can loop over the result pages: keep modifying your URL and repeat the extraction for each page.
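The page loop could be sketched roughly as below. It assumes that result page n (for n >= 2) is reached by appending `<n>_p/` to the search URL; that URL scheme is an assumption about the site and is not confirmed anywhere in this answer, so check the address bar after clicking "next" before relying on it. The helper name `page_urls` is made up for illustration.

```python
def page_urls(base_url, last_page):
    """Build search-result URLs for pages 1..last_page.

    Assumption: page n (n >= 2) lives at '<base_url><n>_p/'.
    Verify this against the real site before using it.
    """
    if not base_url.endswith("/"):
        base_url += "/"
    urls = [base_url]  # page 1 is the plain search URL
    for n in range(2, last_page + 1):
        urls.append(f"{base_url}{n}_p/")
    return urls


if __name__ == "__main__":
    base = "https://www.zillow.com/homes/Manhattan,-New-York,-NY_rb/"
    for url in page_urls(base, 3):
        # In the real script: driver.get(url), then repeat the card loop.
        print(url)
```

Inside the loop you would call driver.get(url) and run the same house_cards extraction, appending to the same lists, before calling driver.quit() once at the end.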
Happy Scraping!

Selecting multiple elements using XPath is not a good idea. You can look into CSS selectors; with them you can select groups of similar elements.
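On the part of the question about finding a value by its label: XPath can match an element by its text and then step to the next sibling. The sketch below assumes the number sits in the element immediately after the one holding the label, which is a guess about the page markup and may need adjusting. The helper name `value_after_label` is hypothetical. It uses the generic `find_element(by, value)` call with the plain string "xpath", which is the value of Selenium's `By.XPATH` constant.

```python
def value_after_label(driver, label):
    """Return the text of the element that directly follows a label.

    `driver` is a Selenium WebDriver (or a WebElement to search within).
    Assumptions: the label text is matched case-sensitively, contains no
    single quote (it is pasted into the XPath literal as-is), and the
    value is the label element's next sibling.
    """
    xpath = (
        f"//*[contains(normalize-space(text()), '{label}')]"
        "/following-sibling::*[1]"
    )
    # "xpath" is the string value of selenium's By.XPATH constant.
    return driver.find_element("xpath", xpath).text
```

A call such as value_after_label(driver, "Principal and interest") would then replace the absolute /html/body/... paths, since it no longer depends on how many divs come before the label.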

Related

How to access text element in selenium if it is splitted by body tags

I have a problem accessing some values on a website while scraping it. The text I want to extract sits in an element whose class contains several pieces of text separated by tags (these body tags also contain text that is important to me).
So first I tried to look for the tag with the text I needed ('Category' in this case) and then extract the exact category from the text below this body tag. I cannot use a precise XPath here because the other pages I need to scrape have a different number of rows in this sidebar, so the locations, and therefore the XPaths, differ.
The expected output is 'utility' - the category in the sidebar.
The website and the text I need to extract look like this (see the sidebar on the right containing 'Category'):
The element looks like that:
And the code I tried:
driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
the link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, as the whole text comes under the 'company-sidebar-body' class, where only some text is between b tags and some are not.
So, you can get the text of the class first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re
c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will result in 'Category: Utility' which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
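To get only the value (the question expects 'utility'), a capture group can do the trimming. This is a small sketch on a shortened copy of the sidebar text shown above; the function name `extract_category` is made up, and it assumes the category is everything after "Category:" up to the end of that line.

```python
import re


def extract_category(sidebar_text):
    """Return the value after 'Category:' or None if the label is absent."""
    # [ \t]* skips spaces after the colon without crossing a line break;
    # the group then captures everything up to the end of the line.
    match = re.search(r"Category:[ \t]*([^\r\n]+)", sidebar_text)
    return match.group(1).strip() if match else None


sample = (
    "EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\n"
    "Category: Utility\r\nAsking Deal\r\nEquity: 10%"
)
print(extract_category(sample))          # Utility
print(extract_category(sample).lower())  # utility
```

Capturing to the end of the line rather than using \w+ also keeps categories that consist of more than one word.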
There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
Any particular reason you want to scrape my website?
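The MediaWiki API route mentioned above might look roughly like this. MediaWiki's `action=parse` module with `prop=wikitext` and `format=json` is a standard part of the API, but the location of `api.php` on this particular site is not given in the thread, so the endpoint below is a placeholder assumption. The helper name `mediawiki_parse_url` is made up for illustration.

```python
from urllib.parse import urlencode


def mediawiki_parse_url(api_endpoint, page_title):
    """Build a MediaWiki 'action=parse' URL that returns a page's wikitext as JSON."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "wikitext",
        "format": "json",
    }
    return f"{api_endpoint}?{urlencode(params)}"


# Hypothetical endpoint: the real path of api.php on this site is not
# stated in the thread and has to be looked up.
url = mediawiki_parse_url("https://www.statsforsharks.com/api.php", "MC_Squares")
print(url)
```

The resulting URL can be fetched with urllib.request and decoded with the json module, which avoids driving a browser at all. Given the site owner's question, it is worth asking before scraping at any volume.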

Scraping hyperlinks from a table using Selenium and Python

I have this table with two columns and unknown number of rows. I am trying to use Selenium (with Python) to scrape all the links into a list.
Goal: get all the links (one per row) from the second column into a list.
elements = driver.find_elements_by_xpath('') #for the table
for element in elements:
    print(element.text)
# Output is:
Penn Affiliated:
Delaware Valley Regional Planning Commission Congestion Management Intern
Contracts Intern
Transit, Bike, and Pedestrian Planning
Fabrication Lab Laser Cutter Operator
...
This prints all the rows. Now I am unsure of how to get links from the second column and all the rows.
Here is the HTML for the table:
Thanks a lot!
To get the value of href attribute from your element, you can do this:
elements = driver.find_elements_by_xpath("//table[@class = 'search']//td/a")
for element in elements:
    print(element.get_attribute("href"))
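Since the goal was the links in the second column only, the XPath can be narrowed with a positional step. The sketch below reuses the 'search' table class from the snippet above; because the table HTML is not shown in the question, the `td[2]` position is an assumption. The function name `second_column_links` is hypothetical, and "xpath" is the string value of Selenium's `By.XPATH`.

```python
def second_column_links(driver, table_class="search"):
    """Collect the href of every link found in the second cell of each row."""
    # td[2] keeps only the second <td> of each <tr>; //a then reaches any
    # link nested inside that cell.
    xpath = f"//table[@class='{table_class}']//tr/td[2]//a"
    links = driver.find_elements("xpath", xpath)
    return [link.get_attribute("href") for link in links]
```

The result is a plain list of URL strings, one per link, which is the form the question asked for.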
Well, you didn't give a URL, but basically it should be like this.
import lxml.html
doc = lxml.html.parse('http://www.gpsbasecamp.com/national-parks')
links = doc.xpath('//a[@href]')
for link in links:
    print(link.attrib['href'])

How to click every link and extract content inside - Python Selenium

I want to get the content behind every link with id = "LinkNoticia".
Currently my code opens the first link and extracts its content, but I can't get to the others.
How can I do it?
This is my code (it works for one link):
from selenium import webdriver
driver= webdriver.Chrome("/selenium/webdriver/chromedriver")
driver.get('http://www.emol.com/noticias/economia/todas.aspx')
driver.find_element_by_id("LinkNoticia").click()
title = driver.find_element_by_id("cuDetalle_cuTitular_tituloNoticia")
print(title.text)
First of all, the fact that the page has multiple elements with the same ID is a bug on its own. The whole point of an ID is to be unique for each element on the page. According to the HTML specs:
id = name
This attribute assigns a name to an element. This name must be unique in a document.
A lengthy discussion is here.
Since ID is supposed to be unique, most (all?) implementations of Selenium will only have function to look for one element with given ID (e.g. find_element_by_id). I have never seen a function for finding multiple elements by ID. So you cannot use ID as your locator directly, you need to use one of the existing functions that allows location of multiple elements, and use ID as just some attribute which allows you to select a group of elements. Your choices are:
find_elements_by_xpath
find_elements_by_css_selector
For example, you could change your search like this:
links = driver.find_elements_by_xpath("//a[@id='LinkNoticia']")
That would give you the whole set of links, and you'd need to loop through them to retrieve the actual link (href). Note that if you just click on each link, you navigate away from the page and references in links will no longer be valid. So instead you can do this:
Build list of hrefs from the links:
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))
Navigate to each href to check its title:
for href in hrefs:
    driver.get(href)
    title = driver.find_element_by_id("cuDetalle_cuTitular_tituloNoticia")
    # etc
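Put together, the two steps above could be wrapped in one function. This is only a sketch: the name `collect_titles` is made up, links without an href are skipped as a precaution, and the locators are passed as the plain strings "xpath" and "id", which are the values of Selenium's `By.XPATH` and `By.ID`.

```python
def collect_titles(driver, link_xpath, title_id):
    """Visit every linked page and return the text of its title element."""
    # Read every href first: the link elements become stale as soon as
    # the driver navigates away from the listing page.
    hrefs = [
        link.get_attribute("href")
        for link in driver.find_elements("xpath", link_xpath)
    ]
    titles = []
    for href in hrefs:
        if not href:
            continue  # skip anchors that carry no href
        driver.get(href)
        titles.append(driver.find_element("id", title_id).text)
    return titles
```

With the page from the question this would be called as collect_titles(driver, "//a[@id='LinkNoticia']", "cuDetalle_cuTitular_tituloNoticia").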

Selenium can't find elements by this XPath expression

I'm trying to extract some odds from a page using Selenium ChromeDriver, since the data is dynamic. "Find elements by XPath expression" usually works for me on this kind of website, but this time it can't seem to find the element in question, nor any element that belongs to the section of the page that shows the relevant odds.
I'm probably making a simple error - if anyone has time to check the page out I'd be very grateful! Sample page: Nordic Bet NHL Odds
driver.get("https://www.nordicbet.com/en/odds#?cat=&reg=&sc=50&bgi=36")
time.sleep(5)
dayElems = driver.find_elements_by_xpath("//div[@class='ng-scope']")
print(len(dayElems))
Output:
0
It was a problem I used to face...
It is in another frame whose id is SportsbookIFrame. You need to navigate into the frame:
driver.switch_to_frame("SportsbookIFrame")
dayElems = driver.find_elements_by_xpath("//div[@class='ng-scope']")
len(dayElems)
Output:
26
Iframes are ordinary elements, so you can search for them like any other:
iframes = driver.find_elements_by_xpath("//iframe")
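In newer Selenium releases the same frame switch is written as `driver.switch_to.frame(...)`, and `driver.switch_to.default_content()` returns to the top-level page. A minimal sketch that wraps both follows; the function name `count_in_frame` is made up for illustration, and "xpath" is the string value of `By.XPATH`.

```python
def count_in_frame(driver, frame_id, xpath):
    """Count the elements matching `xpath` inside the given iframe."""
    # Selenium only searches the current browsing context, so step into
    # the frame before looking anything up.
    driver.switch_to.frame(frame_id)
    try:
        return len(driver.find_elements("xpath", xpath))
    finally:
        # Always return to the top-level document, even if the lookup fails.
        driver.switch_to.default_content()
```

For the page in the question the call would be count_in_frame(driver, "SportsbookIFrame", "//div[@class='ng-scope']").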

not able to output specific row from xpath response

This is my first attempt at using xpath
I am attempting to pull out information of products on a list using the xpath interface via Scrapy on Python, specifically the prices from this url :
http://store.nike.com/gb/en_gb/pw/mens-shoes/7puZoi3?ipp=120#
As you can see, the prices (going horizontally from left to right) are £90, £120, £100 ...
The following will return a list of all the trainer prices on the page:
item['trainerPrice']= response.xpath('//span[@class="local nsg-font-family--base"]/text()').extract()
Moreover, the following will return the first "record":
item['trainerPrice']= response.xpath('string(//span[@class="local nsg-font-family--base"]/text())').extract()
However, I am unsure how to select the second record, i.e. u'\xa3120'
Any hints?
You can just get the second item from the extracted list:
prices = response.xpath('//span[@class="local nsg-font-family--base"]/text()').extract()
print(prices[1])
Though, I don't particularly like your locator. Instead I would take the prices div to rely on:
response.css('div.prices span.local::text')
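Once the list is extracted, picking a record and turning it into a number is plain Python. The sketch below uses the three prices quoted in the question; the helper name `nth_price` is made up, and it assumes whole-pound prices (a value such as £89.99 would need `float` instead of `int`). The same record could also be selected in the XPath itself with a positional predicate, for example `(//span[@class="local nsg-font-family--base"])[2]/text()`.

```python
def nth_price(prices, n):
    """Return the n-th (1-based) price as an int, or None if out of range."""
    if not 1 <= n <= len(prices):
        return None
    # Drop the pound sign (u'\xa3') and any thousands separators.
    digits = prices[n - 1].strip().lstrip("\xa3").replace(",", "")
    return int(digits)


extracted = ["\xa390", "\xa3120", "\xa3100"]  # values quoted in the question
print(nth_price(extracted, 2))  # 120
```

Returning None for an out-of-range index avoids an IndexError on pages that list fewer products than expected.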
