Web scraping with age verification - Python

Hello, I want to scrape data from a site with an age verification pop-up using Python 3.x and BeautifulSoup. I can't get to the underlying text and images without clicking "yes" for "are you over 21". Thanks for any support.
EDIT: Thanks, with some help from a comment I see that I can use cookies, but I'm not sure how to manage/store/call cookies with the requests package.
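For reference, a minimal sketch of persisting cookies with requests.Session; the cookie name and value below are assumptions, since the real ones have to be read from the browser's dev tools after clicking "yes":
import requests

# a Session object stores cookies and sends them on every later request
session = requests.Session()
# hypothetical age-gate cookie -- inspect the site to find the real name/value
session.cookies.set('age_verified', 'true', domain='www.shopharborside.com')
response = session.get('https://www.shopharborside.com/oakland/#/shop/412')
print(response.status_code)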
EDIT 2: So with some help from another user I am using the selenium package, so that it will also work in case it's a graphical overlay (I think?). Having trouble getting it to work with the gecko driver but will keep trying! Thanks for all the advice again, everyone.
EDIT 3: OK, I have made progress and I can get the browser window to open using the gecko driver! Unfortunately it doesn't like that link specification, so I'm posting again. The link to click "yes" on the age verification is buried on that page as something called mlink...
EDIT 4: Made some progress, updated code is below. I managed to find the element in the XML code, now I just need to manage to click the link.
#
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'/Users/jeff/Documents/geckodriver')  # optional argument; if not specified, Selenium will search PATH
url = 'https://www.shopharborside.com/oakland/#/shop/412'
driver.get(url)
#
driver.find_element_by_class_name('hhc_modal-body').click()
# wait 1 second
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource, 'html.parser')
# you can now enjoy the soup
print(soup.prettify())
Edit new: Stuck again. Here is the current code. I seem to have isolated the element "myBtnYes", but I get an error when running the code:
ElementClickInterceptedException: Message: Element is not clickable at point (625,278.5500030517578) because another element obscures it
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'/Users/jeff/Documents/geckodriver')  # optional argument; if not specified, Selenium will search PATH
url = 'https://www.shopharborside.com/oakland/#/shop/412'
driver.get(url)
#
driver.find_element_by_id('myBtnYes').click()
# wait 1 second
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource, 'html.parser')
# you can now enjoy the soup
print(soup.prettify())
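A common workaround for ElementClickInterceptedException is to wait until the button is really clickable and, if something still overlays it, click it through JavaScript. A hedged sketch, reusing the driver above (the id myBtnYes is taken from the code):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# wait for overlays/animations to settle so the button can receive the click
button = wait.until(EC.element_to_be_clickable((By.ID, 'myBtnYes')))
try:
    button.click()
except Exception:
    # a JavaScript click ignores the element that obscures the button
    driver.execute_script("arguments[0].click();", button)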

If your aim is to click through the verification, go with Selenium:
pip install selenium and get geckodriver (Firefox) or chromedriver (Chrome)
#Mossein~King(hi i'm here to help)
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
# this is for headless; this will save you a bunch of research time (trust me)
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(firefox_options=options)
# for graphical mode (you need the gecko driver for Firefox)
# driver = webdriver.Firefox()
url = 'your-url'
driver.get(url)
# get the link to click
# example: if <a class='MosseinKing'>
driver.find_element_by_xpath("//a[@class='MosseinKing']").click()
# wait 1 second in case of transitions
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource, 'html.parser')
# you can now enjoy the soup
print(soup.prettify())

Related

Not able to access website URL using Beautiful Soup and Python while web scraping

Link that I am scraping : https://www.indusind.com/in/en/personal/cards/credit-card.html
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re, sys
from selenium import webdriver
IndusInd_url = "https://www.indusind.com/in/en/personal/cards/credit-card.html"
html = requests.get(IndusInd_url)
soup = BeautifulSoup(html.content, 'lxml')
print(soup)
for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())
Using the above code I am trying to scrape the titles of the cards, but unfortunately I am getting this output:
<html><body><p>This website is secured against online attacks. Your request was blocked due to suspicious behavior<br/>
<br/>
Client IP : 124.123.170.109<br/>
<br/>
Incident Time : 2021-02-24 06:28:10 UTC <br/>
<br/>
Incident ID : YDXx#m6g3nSFLvi5lGg4wgAAAf8<br/>
<br/>
If you feel it was a legitimate request, please contact the website owner for further investigation and remediation with a screenshot of this page.</p></body></html>
Is there any other alternative to follow to scrape the details?
Any help is highly appreciated!
Please check this.
FYI: make sure you have the right driver (Firefox or Chrome or whatever, with the right version).
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import time
url = 'https://www.indusind.com/in/en/personal/cards/credit-card.html'
# open the chrome driver
driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
# pings the specified url
driver.get(url)
# sleep to wait for the page to load
# replace 3 with any int value (in seconds)
time.sleep(3)
# gets the page source
pg = driver.page_source
# parse the page source with BeautifulSoup
soup = BeautifulSoup(pg, 'html.parser')
# get the titles of the card
for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())
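If you would rather avoid a browser entirely, a hedged sketch that sends browser-like headers with plain requests; the header values are illustrative, and the site's bot protection may still block the request:
import requests
from bs4 import BeautifulSoup

headers = {
    # illustrative User-Agent; copy a current one from your own browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
html = requests.get('https://www.indusind.com/in/en/personal/cards/credit-card.html', headers=headers)
soup = BeautifulSoup(html.content, 'lxml')
for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())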
Can be achieved without BeautifulSoup.
I define the locator with XPath with the value:
//div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']
and utilize the method .presence_of_all_elements_located.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
driver.get('https://www.indusind.com/in/en/personal/cards/credit-card.html')
wait = WebDriverWait(driver, 20)
elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']")))
for element in elements:
    print(element.get_attribute('innerHTML'))
driver.quit()

Problem crawling every next page using BeautifulSoup and webdriver

I am trying to crawl all the links of jobs from https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam by using BeautifulSoup and Selenium. The problem is that I can only crawl the links of the 1st page and don't know how to crawl the links from every next page.
This is the code I have tried:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
from bs4 import BeautifulSoup
import array as arr
import pandas as pd
# The first line imports the web driver, and the second imports Chrome Options
#-----------------------------------#
#Chrome Options
all_link = []
chrome_options = Options()
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--incognito')
chrome_options.add_argument('--window-size=1920x1080')
chrome_options.add_argument('--headless')
#-----------------------------------#
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path="C:/webdriver/chromedriver.exe")
#Open url
url = "https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam"
driver.get(url)
time.sleep(2)
#-----------------------------------#
page_source = driver.page_source
page = page_source
soup = BeautifulSoup(page_source,"html.parser")
block_job_list = soup.find_all("div",{"class":"d-flex justify-content-center align-items-center logo-area-wrapper logo-border"})
for i in block_job_list:
    link = i.find("a")
    all_link.append("https://www.vietnamworks.com/" + link.get("href"))
Since your problem is traversing through the pages, this code will help you do that. Insert the scraping code inside the while loop as mentioned.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
from webdriver_manager.chrome import ChromeDriverManager # use pip install webdriver_manager if not installed
option = webdriver.ChromeOptions()
CDM = ChromeDriverManager()
driver = webdriver.Chrome(CDM.install(),options=option)
url = 'https://www.vietnamworks.com/tim-viec-lam/tat-ca-viec-lam'
driver.get(url)
time.sleep(3)
page_num = 1
links = []
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
while True:
    # create the soup element here so that it can get the page source of every page
    # sample scraping of the URLs of the jobs posted
    for i in driver.find_elements_by_class_name('job-title '):
        links.append(i.get_attribute('href'))
    # moves to the next page
    try:
        print(f'On page {str(page_num)}')
        print()
        page_num += 1
        driver.find_element_by_link_text(str(page_num)).click()
        time.sleep(3)
    # triggers only at the end of the pages
    except NoSuchElementException:
        print('End of pages')
        break
driver.quit()
EDIT:
Simplified and modified the pagination method
If you are using BeautifulSoup, then you have to create the page_source and soup variables inside the while loop, because the page source changes after every iteration. In your code you extracted only the first page's source code, and hence you got repetitive outputs equal to the number of pages.
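A minimal sketch of what that looks like; the tag and class name here are assumptions based on the code above:
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException

page_num = 1
links = []
while True:
    # re-parse the page source on every iteration so the soup
    # reflects the page the driver is currently on
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for a in soup.find_all('a', class_='job-title'):
        links.append(a.get('href'))
    try:
        page_num += 1
        driver.find_element_by_link_text(str(page_num)).click()
        time.sleep(3)
    except NoSuchElementException:
        break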
By using ChromeDriverManager from the webdriver-manager package, there is no need to mention the driver's location/executable path. You can just copy-paste this code and run it on any machine that has Chrome installed. If you haven't installed it, use pip install webdriver_manager in cmd before running the code.
Warning: AVOID DISPLAYING your actual username and password of any of your accounts like you have in your GitHub code.

BeautifulSoup scraping from a web page already opened by Selenium

I would like to scrape a web page which was opened by Selenium from a different webpage.
I entered a search term into a website using Selenium and this landed me on a new page. My aim is to create soup out of this new page. But the soup is getting created out of the previous page where I entered my search term. Help please!
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
inputElement.send_keys(Keys.ENTER)
driver.wait.until(staleness_of('txtStock')
source = driver.page_source
soup = BeautifulSoup(source)
You need to know the exact company names for your search. After using send_keys, you tried to check for staleness of an element; I did not understand how that statement should work. I added a WebDriverWait for an element of the new page.
The following works for me regarding the selenium part, up to getting the page source:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries Ltd.')
inputElement.send_keys(Keys.ENTER)
company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
source = driver.page_source
You should add exception handling.
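For example, a minimal sketch of that exception handling; TimeoutException lives in selenium.common.exceptions:
from selenium.common.exceptions import TimeoutException

try:
    company = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'lblCompany')))
except TimeoutException:
    # the results page never appeared, so bail out cleanly
    print('Timed out waiting for the company page')
    driver.quit()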
@Jens Dibbern has given a working solution. But it is not necessary that the exact name of the company is given in the search. What happens is that when you type a non-exact name, a drop-down pops up.
I have observed that until and unless this drop-down is present, the enter key does not work. You can check this by going to the site, pasting the name and, without waiting, pressing the enter key as fast as possible: nothing happens.
You can also wait for this drop-down to be visible instead, and then send the enter key. This also works perfectly. Note that this will end up selecting the first item in the drop-down if more than one is present.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
drop_down=driver.find_element_by_css_selector("#listPlacementStock")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#listPlacementStock:not([style*="display: none"])')))
inputElement.send_keys(Keys.ENTER)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="CompanyLink"]')))
source = driver.page_source
soup = BeautifulSoup(source,'html.parser')
print(soup)

Fetching table data from webpage using selenium in python

I am very new to web scraping. I have the following url:
https://www.bloomberg.com/markets/symbolsearch
So, I use Selenium to enter the Symbol Textbox and press Find Symbols to get the details. This is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.bloomberg.com/markets/symbolsearch/")
element = driver.find_element_by_id("query")
element.send_keys("WMT:US")
driver.find_element_by_name("commit").click()
It returns the table. How can I retrieve that? I am pretty clueless.
Second question:
Can I do this without Selenium, as it is slowing things down? Is there a way to find an API which returns JSON?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import requests
driver = webdriver.Firefox()
driver.get("https://www.bloomberg.com/markets/symbolsearch/")
element = driver.find_element_by_id("query")
element.send_keys("WMT:US")
driver.find_element_by_name("commit").click()
time.sleep(5)
url = driver.current_url
time.sleep(5)
parsed = requests.get(url)
soup = BeautifulSoup(parsed.content,'html.parser')
a = soup.findAll("table", { "class" : "dual_border_data_table" })
print(a)
Here is the complete code by which you can get the table you are looking for. Now do what you need to do after getting the table. Hope it helps.
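As an alternative for the table itself, pandas can parse HTML tables directly from the Selenium session above; a hedged sketch, assuming pandas and lxml are installed:
import pandas as pd

# read_html returns a list of DataFrames, one per <table> in the page
tables = pd.read_html(driver.page_source)
for table in tables:
    print(table)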

Python Scraping JavaScript using Selenium and Beautiful Soup

I'm trying to scrape a JavaScript-enabled page using BS and Selenium.
I have the following code so far. It still somehow doesn't detect the JavaScript content (and returns a null value). In this case I'm trying to scrape the Facebook comments at the bottom. (Inspect element shows the class as postText.)
Thanks for the help!
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup.BeautifulSoup(html_source)
comments = soup("div", {"class":"postText"})
print comments
There are some mistakes in your code that are fixed below. However, the class "postText" must exist elsewhere, since it is not defined in the original source code.
My revised version of your code was tested and is working on multiple websites.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source,'html.parser')
#class "postText" is not defined in the source code
comments = soup.findAll('div',{'class':'postText'})
print(comments)
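If the comments are injected by JavaScript after the initial load, it may also help to wait for them before reading page_source; a hedged sketch, with the postText class taken from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
try:
    # wait up to 10 seconds for at least one element with class postText
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'postText')))
except Exception:
    pass  # proceed anyway; the class may genuinely be absent from this page
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, 'html.parser')
print(soup.find_all('div', {'class': 'postText'}))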
