Trouble scraping titles from a webpage - python

I've written a script in Python with Selenium to parse some results that populate after filling in an input box and pressing the Go button. My script does this portion well at the moment. However, my main goal is to also parse the title of each container, visible as Toys & Games.
This is my attempt so far (I could not figure out how to loop over all the containers):
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url = "https://www.fbatoolkit.com/"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)
driver.find_element_by_css_selector(".estimator-container .estimator-input").send_keys("25000",Keys.RETURN)
time.sleep(2)
item = driver.find_element_by_css_selector(".estimator-result div").text
print(item)
driver.quit()
The result I get:
4 (30 Days Avg)
Result I would like to have:
Toys & Games
4 (30 Days Avg)
Link to an image showing how they look on that site. The expected fields are marked with a pencil to show the location of the fields I'm trying to parse.

Try the below code to get the required output:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
url = "https://www.fbatoolkit.com/"
driver = webdriver.Chrome()
driver.get(url)
# iterate over each chart container and query it individually
for container in wait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[class='chart-container']"))):
    wait(container, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input.estimator-input"))).send_keys("25000", Keys.RETURN)
    title = wait(container, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".chart text"))).text
    item = wait(container, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".estimator-result div"))).text
    print(title, item)
driver.quit()

Wait for some time before getting the website source code

I am trying to scrape a website to get the heading and summary of the news. When the website is first opened, a redirect appears and I have to wait 8 seconds for the site to load. The problem I am facing is that the page data being stored is that of the redirect instead of the main website.
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
# Specify the path to the ChromeDriver executable
chrome_driver_path = "C:/webdrivers/chromedriver"
# Initialize the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Navigate to website
driver.get("https://economictimes.indiatimes.com/markets/stocks/news")
time.sleep(10)
data2, data4 = [], []
while True:
    # Extract data
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    data = soup.find_all("div", {"class": "example-class"})
    for item in data:
        data2.append(item.find_all('h3'))
        data4.append(item.find_all('p'))
    try:
        # Find the "Load More" button
        load_more_button = driver.find_element_by_css_selector("div.autoload_continue")
        # Click the button
        load_more_button.click()
    except:
        break
# Close the browser
driver.quit()
print(data2)
You could wait for the switch to your final URL:
wait.until(EC.url_to_be('https://economictimes.indiatimes.com/markets/stocks/news'))
Example
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://economictimes.indiatimes.com/markets/stocks/news'
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(EC.url_to_be('https://economictimes.indiatimes.com/markets/stocks/news'))
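Once that wait has passed, driver.page_source belongs to the final page rather than the redirect, so (a small follow-up sketch, reusing the parser from the question's own code) you can build the soup right after it:
from bs4 import BeautifulSoup
# the source now reflects the main website, not the redirect page
soup = BeautifulSoup(driver.page_source, 'html.parser')
headings = [h3.get_text(strip=True) for h3 in soup.find_all('h3')]
print(headings)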
An ideal approach would be to wait for the News heading within the webpage to be visible.
Solution
To wait for the News heading to be visible you need to induce WebDriverWait for the visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1.h1")))
Using XPATH:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[@class='h1' and text()='News']")))
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Alternative
You can also wait for the Page Title of the webpage to contain Stocks in News Today as follows:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 10).until(EC.title_contains("Stocks in News Today"))
References
You can find a couple of relevant detailed discussions in:
Python selenium get page title
How to make selenium wait before getting contents from the actual website which loads after the landing page through IEDriverServer and IE

Selenium is returning empty text for elements that definitely have text

I'm practicing trying to scrape my university's course catalog. I have a few lines in Python that open the URL in Chrome and click the search button to bring up the course catalog. When I go to extract the text using find_elements_by_xpath(), it returns blank. When I use the dev tools in Chrome, there definitely is text there.
from selenium import webdriver
import time
driver = webdriver.Chrome()
url = 'https://courses.osu.edu/psp/csosuct/EMPLOYEE/PUB/c/COMMUNITY_ACCESS.OSR_CAT_SRCH.GBL?'
driver.get(url)
time.sleep(3)
iframe = driver.find_element_by_id('ptifrmtgtframe')
driver.switch_to.frame(iframe)
element = driver.find_element_by_xpath('//*[@id="OSR_CAT_SRCH_WK_BUTTON1"]')
element.click()
course = driver.find_elements_by_xpath('//*[@id="OSR_CAT_SRCH_OSR_CRSE_HEADER$0"]')
print(course)
I'm trying to extract the text from the element 'OSR_CAT_SRCH_OSR_CRSE_HEADER'. I don't understand why it's not returning the text values, especially when I can see that it contains text with dev tools.
You are not using .text; that is the reason you are not getting the text.
course = driver.find_element_by_xpath('//*[@id="OSR_CAT_SRCH_OSR_CRSE_HEADER$0"]').text
Try the above change in the second-to-last line (note find_element, singular: find_elements returns a list, which has no .text attribute).
Below is the full code after the changes:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome()
url = 'https://courses.osu.edu/psp/csosuct/EMPLOYEE/PUB/c/COMMUNITY_ACCESS.OSR_CAT_SRCH.GBL?'
driver.get(url)
time.sleep(3)
iframe = driver.find_element_by_id('ptifrmtgtframe')
driver.switch_to.frame(iframe)
element = driver.find_element_by_xpath('//*[@id="OSR_CAT_SRCH_WK_BUTTON1"]')
element.click()
# wait up to 10 seconds for the course header, then read its text
course = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="OSR_CAT_SRCH_OSR_CRSE_HEADER$0"]'))
).text
print(course)
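As a side note, the hard-coded time.sleep(3) plus the manual frame lookup can be replaced by an explicit wait that switches into the iframe as soon as it is available; a sketch using the same locator as above:
# wait for the iframe and switch into it in a single step
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'ptifrmtgtframe')))
# when finished inside the frame, switch back to the main document
# driver.switch_to.default_content()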

BeautifulSoup scraping from a web page already opened by Selenium

I would like to scrape a web page that Selenium opened from a different webpage.
I entered a search term into a website using Selenium, and this landed me on a new page. My aim is to create the soup out of this new page, but the soup is getting created out of the previous page where I entered my search term. Help please!
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
inputElement.send_keys(Keys.ENTER)
driver.wait.until(staleness_of('txtStock'))
source = driver.page_source
soup = BeautifulSoup(source)
You need to know the exact company name for your search. After send_keys, you tried to check for the staleness of an element; I did not understand how that statement was supposed to work, so I added a WebDriverWait for an element of the new page instead.
The following works for me regarding the Selenium part, up to getting the page source:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries Ltd.')
inputElement.send_keys(Keys.ENTER)
company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
source = driver.page_source
You should add exception handling.
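A minimal sketch of that exception handling, assuming the same wait as above: catch the TimeoutException that WebDriverWait raises if the element never appears, and clean up the driver instead of crashing:
from selenium.common.exceptions import TimeoutException
try:
    company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
    source = driver.page_source
except TimeoutException:
    # the result page never loaded; quit the browser rather than leave it hanging
    driver.quit()
    raise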
@Jens Dibbern has given a working solution. But it is not necessary to give the exact name of the company in the search. What happens is that when you type a non-exact name, a drop-down pops up.
I have observed that unless this drop-down is present, the Enter key does not work. You can check this by going to the site, pasting the name, and pressing the Enter key as fast as possible without waiting. Nothing happens.
You could instead wait for this drop-down to be visible and then send the Enter key. This also works perfectly. Note that this will end up selecting the first item in the drop-down if more than one is present.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
drop_down=driver.find_element_by_css_selector("#listPlacementStock")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#listPlacementStock:not([style*="display: none"])')))
inputElement.send_keys(Keys.ENTER)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="CompanyLink"]')))
source = driver.page_source
soup = BeautifulSoup(source,'html.parser')
print(soup)

My scraper fails to get all the items from a webpage

I've written some code in Python in combination with Selenium to parse different product names from a webpage. There are a few load more buttons visible if the browser is scrolled downward. The webpage displays its full content once it has been scrolled all the way down and there is no load more button left to click. My scraper seems to be doing well, but I'm not getting all the results: there are around 200 products on that page but I'm only getting 90 of them. What change should I make to my scraper to get them all? Thanks in advance.
The webpage I'm dealing with: Page_Link
This is the script I'm trying with:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("put_above_url_here")
wait = WebDriverWait(driver, 10)
page = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,".listing_item")))
for scroll in range(17):
    page.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    try:
        load = driver.find_element_by_css_selector(".lm-btm")
        load.click()
    except Exception:
        pass
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^=item_]"))):
    name = item.find_element_by_css_selector(".pro-name.el2").text
    print(name)
driver.quit()
Try the below code to get the required data:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.purplle.com/search?q=hair%20fall%20shamboo")
wait = WebDriverWait(driver, 10)
header = driver.find_element_by_tag_name("header")
driver.execute_script("arguments[0].style.display='none';", header)
while True:
    try:
        page = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".listing_item")))
        driver.execute_script("arguments[0].scrollIntoView();", page)
        page.send_keys(Keys.END)
        load = wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "LOAD MORE")))
        driver.execute_script("arguments[0].scrollIntoView();", load)
        load.click()
        wait.until(EC.staleness_of(load))
    except:
        break
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^=item_]"))):
    name = item.find_element_by_css_selector(".pro-name.el2").text
    print(name)
driver.quit()
You should only use Selenium as a last resort.
A simple look around in the webpage showed the API it calls to get your data.
It returns a JSON output with all the details:
Link
You can now just loop over the response and store it in a dataframe easily.
Very fast, and fewer errors than Selenium.
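A minimal sketch of that approach, assuming requests and pandas are available; the endpoint and the JSON field names below are hypothetical placeholders, since the real API URL behind the link above has to be copied from the browser's network tab:
import requests
import pandas as pd
# hypothetical endpoint; replace with the real URL from the network tab
api_url = "https://www.purplle.com/api/search"
resp = requests.get(api_url, params={"q": "hair fall shampoo"})
resp.raise_for_status()
items = resp.json().get("products", [])  # "products" is an assumed field name
df = pd.DataFrame(items)  # one row per product
print(df.head())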

click on element with classname selenium

I am trying to scrape the opening hours of bars from a website. There is a list of bars, and if you navigate to one of them, the opening hours are available. I am having an issue clicking on an element when it has a class name.
I have written the code to get the hours from one venue; however, I am unable to navigate to each venue from the first link.
This code works when getting the hours for one venue:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.designmynight.com/london/bars/soho/six-storeys')
hours = driver.find_element_by_xpath('//li[@id="hours"]')
hours.click()
hoursTable = driver.find_elements_by_css_selector("table.opening-times tr")
for row in hoursTable:
    print(row.text)
The issue arises when I try to navigate from this page: I am unable to click into each link. Can anyone see what I am doing wrong?
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.designmynight.com/london/search-results#!?type_of_venue=512b2019d5d190d2978c9ea9&type_of_venue=512b2019d5d190d2978c9ea8&type_of_venue=512b2019d5d190d2978c9ead&type_of_venue=512b2019d5d190d2978c9eaa&type_of_venue=512b2019d5d190d2978c9eab&type=&q=&radius=')
venue = driver.find_element_by_xpath('//a[@class="ng-binding"]')
venue.click()
# this should then lead me to the following link:
driver.get('https://www.designmynight.com/london/bars/soho/six-storeys')
hours = driver.find_element_by_xpath('//li[@id="hours"]')
hours.click()
hoursTable = driver.find_elements_by_css_selector("table.opening-times tr")
for row in hoursTable:
    print(row.text)
All links with ng-binding class names are generated dynamically, so you need to wait until the link appears in the DOM and is also clickable:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.designmynight.com/london/search-results#!?type_of_venue=512b2019d5d190d2978c9ea9&type_of_venue=512b2019d5d190d2978c9ea8&type_of_venue=512b2019d5d190d2978c9ead&type_of_venue=512b2019d5d190d2978c9eaa&type_of_venue=512b2019d5d190d2978c9eab&type=&q=&radius=')
venue = wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//a[@class="ng-binding"]')))
venue.click()
But if you want to follow each link, I'd suggest not clicking those links, but getting the list of hrefs and then opening each one, as below:
xpath = '//a[@class="ng-binding"]'
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
links = [venue.get_attribute('href') for venue in driver.find_elements_by_xpath(xpath)]
for link in links:
    driver.get(link)
    hours = driver.find_element_by_xpath('//li[@id="hours"]')
    hours.click()
    hoursTable = driver.find_elements_by_css_selector("table.opening-times tr")
    for row in hoursTable:
        print(row.text)
I feel the issue is that the driver is trying to locate the element before it is available in the DOM. Try waiting until the element is present, instead of directly trying to find it.
You could replace this line:
venue = driver.find_element_by_xpath('//a[@class="ng-binding"]')
with:
venue = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//a[@class="ng-binding"]')))
Note that you'll need to do a few imports to do this:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Source: http://selenium-python.readthedocs.io/waits.html
