Selenium in Notebook vs. Script - python

Running into something interesting when trying to set up a Selenium webdriver to scrape fantasy football stats from ESPN. When I execute the following cells in a Jupyter notebook, I can reach the page I'm looking for (the draft recap page of my fantasy league) and successfully log in to my account, accessing the page:
# cell 1
driver = webdriver.Firefox()
driver.get(url)

# cell 2
i = 0
iter_again = True
iframes = driver.find_elements_by_tag_name('iframe')
while i < len(iframes) and iter_again:
    driver.switch_to_frame(iframes[i])
    if len(driver.find_elements_by_class_name("input-wrapper")) > 0:
        username, password = driver.find_elements_by_class_name("input-wrapper")
        iter_again = False
    else:
        sleep(1)
        driver.switch_to_default_content()
        i += 1

# Cell 3
username.find_elements_by_tag_name('input')[0].send_keys(espn_username)
password.find_elements_by_tag_name('input')[0].send_keys(espn_password)

# Cell 4
driver.find_elements_by_tag_name('button')[0].click()

# Cell 5
driver.refresh()
The strange thing, though, is that when I put all of this in a function and return the webdriver object, ESPN won't let me log in. I get an error message saying that ESPN is experiencing technical difficulties at this time and I may not be able to log in (they're right, I can't).
I initially thought this could be some sort of rate limiting, but I can't think of anything that would differ in the HTTP requests between the functional form and the cell-by-cell approach. For what it's worth, I've tested the functional approach both in a Jupyter notebook environment and as a standalone script run from the CLI. Any thoughts? All help/feedback is greatly appreciated!
EDIT - Adding the script that doesn't execute properly
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

def get_active_webdriver(url, espn_username, espn_password, headless=False):
    driver = webdriver.Firefox()
    driver.get(url)

    i = 0
    iter_again = True
    # find the iframe with the login info and log in
    iframes = driver.find_elements_by_tag_name('iframe')
    while i < len(iframes) and iter_again:
        driver.switch_to_frame(iframes[i])
        if len(driver.find_elements_by_class_name("input-wrapper")) > 0:
            username, password = driver.find_elements_by_class_name("input-wrapper")
            iter_again = False
        else:
            sleep(1)
            driver.switch_to_default_content()
            i += 1

    username.find_elements_by_tag_name('input')[0].send_keys(espn_username)
    password.find_elements_by_tag_name('input')[0].send_keys(espn_password)
    driver.find_elements_by_tag_name('button')[0].click()
    driver.refresh()
    return driver

if __name__ == "__main__":
    url = ...            # url here
    espn_username = ...  # username
    espn_password = ...  # password
    driver = get_active_webdriver(url, espn_username, espn_password)
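One difference between the two runs that is easy to overlook is timing: executing the cells one by one leaves several seconds of human pause between loading the page, typing the credentials, clicking the button, and refreshing, while the function fires those steps back to back. Below is a sketch of the same flow with explicit waits in place of those pauses; it reuses the names from the script above, and the chosen conditions (in particular waiting for the login widget to go stale after the click) are assumptions about how the ESPN login iframe behaves, not verified behavior.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_active_webdriver_waited(url, espn_username, espn_password):
    driver = webdriver.Firefox()
    driver.get(url)
    wait = WebDriverWait(driver, 15)

    # make sure at least one iframe exists before iterating over them
    wait.until(EC.presence_of_element_located((By.TAG_NAME, "iframe")))
    username = password = None
    for frame in driver.find_elements_by_tag_name("iframe"):
        driver.switch_to.frame(frame)
        wrappers = driver.find_elements_by_class_name("input-wrapper")
        if len(wrappers) >= 2:
            username, password = wrappers[:2]
            break
        driver.switch_to.default_content()

    username.find_elements_by_tag_name('input')[0].send_keys(espn_username)
    password.find_elements_by_tag_name('input')[0].send_keys(espn_password)
    driver.find_elements_by_tag_name('button')[0].click()

    # assumption: the login widget is torn down once the login succeeds,
    # so waiting for it to go stale gives the request time to complete
    wait.until(EC.staleness_of(username))
    driver.refresh()
    return driver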

Related

Can I run my Selenium code so that it doesn't trigger Cloudflare?

I'm writing code to help my dad get tee times for his golf course. At the moment, it scans through a series of n tabs for a button to book the tee time, and if it can't find one, it refreshes the page. The problem comes when it refreshes: the page is protected by Cloudflare, so my code gets blocked ten times more often than it actually gets to check for the tee time. Is there a better way to run my code so it doesn't get blocked so often?
(At the moment, I'm running it in a headless Selenium Chrome browser, but I'm looking to see if I can run the code in a normal browser instead.)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pyautogui as pag

date = "2021-09-19"
st = "08"
et = "09"
golfers = "4"
tabs_opened = 10

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)

for _ in range(tabs_opened):
    driver.execute_script(f"window.open('about:blank', 'tab{_+1}');")
    driver.switch_to.window(f"tab{_+1}")
    driver.get(f"https://city-of-burnaby-golf.book.teeitup.com/?course=5fc6afcfd62a025a3123401a&date={date}&end={et}&golfers={golfers}&holes=18&start={st}")  # loads the website

driver.switch_to.window(driver.window_handles[0])  # assuming new tab is at index 1
driver.close()

tcgbbc = 0  # tracks number of times cloudflare blocks me
wt = 0      # tracks actual times the website has been checked for button
i = 0
_ = 0
starttest = time.perf_counter()
while i < 100:
    driver.switch_to.window(driver.window_handles[_ % tabs_opened])  # switch to next tab
    try:
        driver.find_element_by_xpath('/html/body/table/tbody/tr/td/div[2]/span/code')  # checks if the cloudflare ray id text is there
        print("whoops got blocked by cloudflare")
        tcgbbc += 1
    except:
        print("no cloudflare yay")  # it wasn't. yay
        elem = driver.find_elements_by_xpath('//*[@id="app-container"]/div/div[2]/div/div[2]/div[2]/div[2]/div[1]/div/button')  # checks golf burnaby website to see if the button is there
        if len(elem) > 0:
            elem[0].click()
            print("page found")
            break
        else:
            if len(driver.find_elements_by_xpath('//*[@id="header"]/div/div[1]')) > 0:  # if the website's title is there (sometimes it checks and can't find the button, but the page wasn't even loaded)
                print("didn't work")
                wt += 1
                i += 1
    driver.refresh()  # refreshes the page. seems to be the most time consuming.
    _ += 1
    print(time.perf_counter() - starttest)
    time.sleep(1)

driver.close()
print(f"Cloudflare blocks: {tcgbbc}, actual checks: {wt}")
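On the parenthetical question about headless vs. a normal browser: whether Chrome runs headless or as a visible window is controlled entirely by the ChromeOptions passed at startup, so switching to a "normal" browser just means not adding the headless flag. A minimal sketch, reusing the options from the snippet above (the flag is a standard Chrome switch):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
# headless variant (what the question describes running now):
# options.add_argument('--headless')
# visible ("normal") variant: simply omit the --headless argument
driver = webdriver.Chrome(options=options)
Whether a visible window alone reduces the Cloudflare blocks is hard to say, since Cloudflare's checks are opaque; this only covers the mechanical part of switching modes.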

How do I make the driver navigate to a new page in Selenium (Python)?

I am trying to write a script to automate job applications on LinkedIn using Selenium and Python.
The steps are simple:
open the LinkedIn page, enter the id and password, and log in
open https://linkedin.com/jobs, enter the search keyword and location, and click Search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to missing POST information from the previous page)
the click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source

stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)

def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"

login()

jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried searching for elements on the page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in advance.
UPDATE
It works if I directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
The difference between these links is that one of them takes only one value for experience level while the other takes two values. This means it's probably not a POST-values issue.
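For reference, %2C in those URLs is just the percent-encoded comma, so f_E=1%2C2 is literally f_E=1,2, i.e. two experience-level values. A quick standard-library check, independent of the script above:
from urllib.parse import quote, unquote

print(unquote("1%2C2"))  # -> 1,2  (two experience levels separated by a comma)
print(quote("1,2"))      # -> 1%2C2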
You are getting and printing the current URL immediately after clicking the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; by itself it does not wait for anything.
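To make the wait actually block, the wait object has to be given a condition through until(). A minimal sketch of waiting for the URL to change to the search results page before reading it (the "/jobs/search" fragment is an assumption based on the expected URL above):
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 20)
# blocks until the browser has navigated to the search results page (or times out)
wait.until(EC.url_contains("/jobs/search"))
print(driver.current_url)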

Unable to automatically click an element in WebDriver (Python)

I am trying to program a Python script that automatically downloads a table from a webpage. The table is not fully loaded when I simply go to the specified URL; I have to click the "Load more" link. I tried to do this with the script below.
import time
import numpy as np
from selenium import webdriver

delay = 2
driver = webdriver.Chrome('chromedriver')
driver.get("url")
time.sleep(delay + np.random.rand())

click_except = 0
while click_except == 0:
    try:
        driver.find_element_by_id("id").click()
        time.sleep(delay + np.random.rand())
    except:
        click_except = 1
        time.sleep(delay + np.random.rand())

web = driver.find_element_by_id("id_table")
str = web.text
It worked before, but now it does not work... the same code! I moved to a different country and I am using different wi-fi. Can this have any effect? Actually, the line with the click command still works when run separately and manually; it just does not work inside the while/try loop. Any idea what is wrong? Or any idea how to program it better?
The delay should give the webpage enough time to load.
I recommend avoiding waits for a fixed time period; it is better to wait for a specific element, and Selenium supports that, see: https://selenium-python.readthedocs.io/waits.html#explicit-waits
You can do something like:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_id(identifier):
    """
    Waits for the web element with the given id.
    :return: found selenium web element
    """
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, identifier))
    )
    return element

driver = webdriver.Chrome('chromedriver')
driver.get('url')
wait_for_id('id').click()
str = wait_for_id('id_table').text

"Cannot contact reCAPTCHA. Check your connection and try again" in Selenium with Python

I want to make a Python program that gets all the links for a certain Google search query, so I loop over the 30 search result pages, and when it gives me a reCAPTCHA I solve it manually.
Here is what my code looks like:
import time
import urllib.parse
from selenium import webdriver

driver = webdriver.Firefox()
number_pages = 30
query = 'hello world'
query = urllib.parse.quote_plus(query)
url = "https://www.google.com/search?q=" + query + "&&start="

with open('result.txt', 'w') as fp:
    for i in range(1, number_pages - 1):
        # loop over the 30 pages
        page_url = url + str((i - 1) * 10)
        print("# " + page_url)
        driver.get(page_url)
        while len(driver.find_elements_by_id('recaptcha')) != 0:
            # ReCaptcha: sleep until the user solves the recaptcha
            print('sleeping...!')
            time.sleep(10)
        els = driver.find_elements_by_tag_name('cite')
But when I try to submit the reCAPTCHA form, it gives me the error:
Cannot contact reCAPTCHA. Check your connection and try again
When I use a normal browser (Google Chrome or Firefox), the error doesn't occur.
I think reCAPTCHA blocks the webdriver.
Can anyone explain what the exact issue is here, and how it can be fixed?
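This doesn't by itself explain the "Cannot contact reCAPTCHA" message (the question already suspects reCAPTCHA is reacting to the webdriver), but as a side note: the manual-solve loop above, which polls with a fixed 10-second sleep, can also be written as an explicit wait that returns as soon as the captcha element disappears. A sketch, assuming the 'recaptcha' id from the snippet and an arbitrary 5-minute ceiling:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# give the human up to 5 minutes to solve the captcha, then continue scraping
WebDriverWait(driver, 300).until(
    EC.invisibility_of_element_located((By.ID, 'recaptcha'))
)
els = driver.find_elements_by_tag_name('cite')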

Python and Selenium: I am automating web scraping across pages. How can I loop using the Next button?

I have already written several lines of code to pull URLs from this website:
http://www.worldhospitaldirectory.com/United%20States/hospitals
The code is below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv

driver = webdriver.Firefox()
driver.get('http://www.worldhospitaldirectory.com/United%20States/hospitals')

url = []
pagenbr = 1
while pagenbr <= 115:
    current = driver.current_url
    driver.get(current)
    lks = driver.find_elements_by_xpath('//*[@href]')
    for ii in lks:
        link = ii.get_attribute('href')
        if '/info' in link:
            url.append(link)
    print('page ' + str(pagenbr) + ' is done.')
    if pagenbr <= 114:
        elm = driver.find_element_by_link_text('Next')
        driver.implicitly_wait(10)
        elm.click()
    time.sleep(2)
    pagenbr += 1

ls = list(set(url))
with open('US_GeneralHospital.csv', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for u in ls:
        wr.writerow([u])
It worked very well to pull each individual link from this website.
But the problem is that I need to change the page count in the loop myself every time.
I want to upgrade this code so it works out how many iterations it needs by itself, rather than by manual input.
Thank you very much.
It is a bad idea to hardcode the number of pages in your script. Try just clicking the "Next" button while it is enabled:
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        # do whatever you need to do on the page
        driver.find_element_by_xpath('//li[not(@class="disabled")]/span[text()="Next"]').click()
    except NoSuchElementException:
        break
This should allow you to keep scraping pages until the last page is reached.
Also note that the lines current = driver.current_url and driver.get(current) make no sense at all, so you might skip them.
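If the page takes a moment to re-render after each click, the same pattern can be combined with an explicit wait on the clicked Next element going stale before scraping again. A sketch under that assumption, keeping the link collection from the question:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = []
while True:
    # collect the /info links on the current page (taken from the question's loop body)
    for ii in driver.find_elements_by_xpath('//*[@href]'):
        link = ii.get_attribute('href')
        if '/info' in link:
            url.append(link)
    try:
        nxt = driver.find_element_by_xpath('//li[not(@class="disabled")]/span[text()="Next"]')
    except NoSuchElementException:
        break  # no enabled Next button left: last page reached
    nxt.click()
    # wait until the old Next element is detached, i.e. the next page has replaced it
    WebDriverWait(driver, 10).until(EC.staleness_of(nxt))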
