Click on title to scrape data using selenium and scrapy - python

from scrapy_selenium import SeleniumRequest
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
from selenium import webdriver
url='https://www.aeafa.es/asociados.php?provinput='
driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')
driver.get(url)
wait = WebDriverWait(driver, 30)
detail=wait.until(EC.element_to_be_clickable((By.XPATH, "//tbody//td[6]")))
detail.click()
The error occurs in these lines:
detail = wait.until(EC.element_to_be_clickable((By.XPATH, "//tbody//td[6]")))
detail.click()
I want the script to click on the title link on this page and then scrape the information behind it. How can I scrape this data?

You cannot locate an HTML attribute and click on it. Try replacing
detail = wait.until(EC.element_to_be_clickable((By.XPATH, "//td//img//@src"))).click()
with
detail = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@title='info']")))
detail.click()
UPDATE
Since your scrapy-selenium approach doesn't seem to work, try the common Selenium approach first. Then you could adapt it to Scrapy (a sketch follows below the code).
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.aeafa.es/asociados.php?provinput='
driver.get(url)
for mail in driver.find_elements("xpath", "//p/a[starts-with(@href, 'mailto')]"):
    print(mail.get_attribute('textContent'))
Note that you don't need to open each details popup to get the required email text.
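If you then want to fold this back into Scrapy, a minimal sketch of a scrapy-selenium spider could look like the following (it assumes the SELENIUM_* settings and the scrapy_selenium.SeleniumMiddleware downloader middleware are already configured in settings.py):
import scrapy
from scrapy_selenium import SeleniumRequest

class AsociadosSpider(scrapy.Spider):
    name = 'asociados'

    def start_requests(self):
        # SeleniumRequest renders the page in the configured browser first
        yield SeleniumRequest(
            url='https://www.aeafa.es/asociados.php?provinput=',
            callback=self.parse,
            wait_time=10,
        )

    def parse(self, response):
        # the mailto links are present in the rendered listing itself,
        # so no details popup needs to be opened
        for href in response.xpath("//p/a[starts-with(@href, 'mailto')]/@href").getall():
            yield {'email': href.split(':', 1)[1]}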

Related

Wait for some time before getting the website source code

I am trying to scrape a website to get the heading and summary of the news. The problem I am facing is that when the website is first opened, a redirect appears and we have to wait 8 seconds for the main site to load. As a result, the page source being stored is that of the redirect instead of the main website.
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
# Specify the path to the ChromeDriver executable
chrome_driver_path = "C:/webdrivers/chromedriver"
# Initialize the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Navigate to website
driver.get("https://economictimes.indiatimes.com/markets/stocks/news")
time.sleep(10)
data2, data4 = [], []
while True:
    # Extract data
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    data = soup.find_all("div", {"class": "example-class"})
    for item in data:
        data2.append(item.find_all('h3'))
        data4.append(item.find_all('p'))
    try:
        # Find the "Load More" button
        load_more_button = driver.find_element_by_css_selector("div.autoload_continue")
        # Click the button
        load_more_button.click()
    except:
        break
# Close the browser
driver.quit()
print(data2)
You could wait for the switch to your final URL:
wait.until(EC.url_to_be('https://economictimes.indiatimes.com/markets/stocks/news'))
Example
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://economictimes.indiatimes.com/markets/stocks/news'
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(EC.url_to_be('https://economictimes.indiatimes.com/markets/stocks/news'))
An ideal approach would be to wait for the News heading within the webpage to be visible.
Solution
To wait for the News heading to be visible, you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1.h1")))
Using XPATH:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[@class='h1' and text()='News']")))
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Alternative
You can also wait for the Page Title of the webpage to contain Stocks in News Today as follows:
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
WebDriverWait(driver, 10).until(EC.title_contains("Stocks in News Today"))
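As a condensed sketch, the title wait can be combined with the original BeautifulSoup parsing (the h3 extraction is kept generic here, since the example-class from the question may not match the live page):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://economictimes.indiatimes.com/markets/stocks/news')
# wait until the final page (not the redirect) has loaded
WebDriverWait(driver, 10).until(EC.title_contains("Stocks in News Today"))
soup = BeautifulSoup(driver.page_source, 'html.parser')
headings = [h3.get_text(strip=True) for h3 in soup.find_all('h3')]
driver.quit()
print(headings)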
References
You can find a couple of relevant detailed discussions in:
Python selenium get page title
How to make selenium wait before getting contents from the actual website which loads after the landing page through IEDriverServer and IE

Button click using selenium webdriver YouTube example

When I run the code, it opens the browser, but I want to click the reject all button when the cookie banner comes from YouTube. I tried using class and a link text. It did not work.
I will appreciate any help.
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.youtube.com/'
driver = webdriver.Chrome(executable_path="C:\\Users\\donner\\Downloads\\chromedriver_win32\\chromedriver.exe")
driver.get(url)
driver.implicitly_wait(100)
continue_link = driver.find_element(By.LINK_TEXT, "Reject all")
continue_link.click()
driver.implicitly_wait(100)
content = driver.find_element(By.CLASS_NAME, '.style-scope.ytd-button-renderer.style-primary size-default')
content.click()
Try one of the below XPaths with the given code:
//*[normalize-space()='Reject all']
OR
//*[contains(text(),'Reject all')]
The Code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "myXpath")))
element.click()
The above solution works if there is no iframe.
If the banner is inside an iframe, you need to switch to it first:
WebDriverWait(driver,20).until(EC.frame_to_be_available_and_switch_to_it((By.ID,"iframe")))
driver.find_element(By.XPATH, "yourXpath").click()
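Putting both cases together, a minimal sketch for the YouTube banner could look like this (the iframe id is a placeholder; inspect the page to find the real one):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/')
wait = WebDriverWait(driver, 20)
reject = "//*[normalize-space()='Reject all']"
try:
    # banner rendered directly in the page
    wait.until(EC.element_to_be_clickable((By.XPATH, reject))).click()
except TimeoutException:
    # banner inside a consent iframe; "iframe" is a placeholder id
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "iframe")))
    driver.find_element(By.XPATH, reject).click()
    driver.switch_to.default_content()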

Click “Accept Cookies” popup with Selenium in Python

I've been trying to scrape some information off this e-commerce website with selenium. However, when the bot accesses the website, it needs to accept cookies to continue; this does not happen when I access the site manually. When I try to find the corresponding element by XPath, as I see it when I inspect the page manually, I always get this error message:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
My code is mentioned below.
import time
import pandas
#pip install selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup #pip install beautifulsoup4
PATH = "/Users/Ziye/Desktop/Python/chromedriver"
delay = 15
driver = webdriver.Chrome(PATH)
driver.implicitly_wait(10)
driver.get("https://en.zalando.de/women/?q=michael+michael+kors+taschen")
driver.find_element_by_xpath('//*[@id="uc-btn-accept-banner"]').click()
This is the HTML corresponding to the button "That's ok". The XPATH is as above.
<button aria-label="" id="uc-btn-accept-banner" class="uc-btn uc-btn-primary">
That’s OK <span id="uc-optin-timer-display"></span>
</button>
Does anyone know where my mistake lies?
You should add explicit wait for this button:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
driver.get("https://en.zalando.de/women/?q=michael+michael+kors+taschen")
wait = WebDriverWait(driver, 15)
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="uc-btn-accept-banner"]')))
driver.find_element_by_xpath('//*[@id="uc-btn-accept-banner"]').click()
Your locator is correct.
As a CSS selector, you can use .uc-btn-footer-container .uc-btn.uc-btn-primary
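For example, reusing the wait object from the snippet above:
wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '.uc-btn-footer-container .uc-btn.uc-btn-primary'))).click()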

Not Able to See Entire Page Source of Website using Selenium, Python

I am trying to scrape this website
https://script.google.com/a/macros/cprindia.org/s/AKfycbzgfCVNciFRpcpo8P7joP1wTeymj9haAQnNEkNJJ2fQ4FBXEco/exec
I am using selenium and Python. I am not able to view the entire page source. Basically, I have to scrape the table inside it and click the next button, but the code for the next button and the table is not visible in the page source. Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time
browser = webdriver.PhantomJS()
link = 'https://script.google.com/a/macros/cprindia.org/s/AKfycbzgfCVNciFRpcpo8P7joP1wTeymj9haAQnNEkNJJ2fQ4FBXEco/exec'
browser.get(link)
pass1 = browser.find_element_by_xpath("/html/body/div[2]/table[2]/tbody/tr[1]/td/div/div/div[2]/div[2]")
pass1.click()
time.sleep(30)
I am getting this error: NoSuchElementException.
There are two iframes present on the page, so you need to switch to those iframes first and then click on the element.
You can also apply an explicit wait on the element so that the script waits until the element is visible on the page.
You can do it like:
browser = webdriver.PhantomJS()
browser.get(link)
browser.switch_to.frame(browser.find_element_by_id('sandboxFrame'))
browser.switch_to.frame(browser.find_element_by_id('userHtmlFrame'))
WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//div[contains(@class,'charts-custom-button-collapse-left')]//div"))).click()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
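Once you are inside the inner frame, the table is part of the frame's page source, so you can hand it to BeautifulSoup (the table selector below is a guess; adjust it to the actual markup):
from bs4 import BeautifulSoup

soup = BeautifulSoup(browser.page_source, 'html.parser')
for row in soup.select('table tr'):
    print([cell.get_text(strip=True) for cell in row.find_all('td')])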

BeautifulSoup scraping from a web page already opened by Selenium

I would like to scrape a web page which was opened by Selenium from a different webpage.
I entered a search term into a website using Selenium and this landed me in a new page. My aim is to create soup out of this new page. But, the soup is getting created out of the previous page where I entered my search term. Help please!
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
inputElement.send_keys(Keys.ENTER)
driver.wait.until(staleness_of('txtStock'))
source = driver.page_source
soup = BeautifulSoup(source)
You need to know the exact company names for your search. After using send_keys, you tried to check for staleness of an element; I did not understand how that statement was supposed to work, so I added a WebDriverWait for an element of the new page.
The following works for me regarding the selenium part, up to getting the page source:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries Ltd.')
inputElement.send_keys(Keys.ENTER)
company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
source = driver.page_source
You should add exception handling.
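For example, a TimeoutException from the wait can be caught so the script fails cleanly instead of crashing mid-run:
from selenium.common.exceptions import TimeoutException

try:
    company = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'lblCompany')))
except TimeoutException:
    driver.quit()
    raise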
@Jens Dibbern has given a working solution. But it is not necessary that the exact name of the company should be given in the search. What happens is that when you type a non-exact name, a drop-down will pop up.
I have observed that unless this drop-down is present, the enter key does not work. You can check this by going to the site, pasting the name, and pressing the enter key as fast as possible without waiting. Nothing happens.
You could also wait for this drop-down to be visible instead, and then send the enter key. This also works perfectly. Note that this will end up selecting the first item in the drop-down if more than one is present.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
drop_down=driver.find_element_by_css_selector("#listPlacementStock")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#listPlacementStock:not([style*="display: none"])')))
inputElement.send_keys(Keys.ENTER)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="CompanyLink"]')))
source = driver.page_source
soup = BeautifulSoup(source,'html.parser')
print(soup)
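From there you can pull the element the wait just confirmed is present, e.g.:
company_link = soup.find(id='CompanyLink')
if company_link is not None:
    print(company_link.get_text(strip=True))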
