I am trying to write a Python script that automatically downloads a table from a webpage. The table is not fully loaded when I simply go to the specified URL; I have to keep clicking a "Load more" link. I tried to do that with the script below.
import time

import numpy as np
from selenium import webdriver

delay = 2
driver = webdriver.Chrome('chromedriver')
driver.get("url")
time.sleep(delay + np.random.rand())

# Keep clicking "Load more" until the click fails
click_except = 0
while click_except == 0:
    try:
        driver.find_element_by_id("id").click()
        time.sleep(delay + np.random.rand())
    except Exception:
        click_except = 1
        time.sleep(delay + np.random.rand())

web = driver.find_element_by_id("id_table")
table_text = web.text  # renamed to avoid shadowing the built-in str
It worked before, but now the same code does not work. I moved to a different country and I am using a different Wi-Fi connection; can this have any effect? The line with the click command still works when run separately and manually, but it fails inside the while/try loop. Any idea what is wrong, or how to program it better?
The delay is supposed to give the webpage enough time to load.
I recommend avoiding fixed sleeps; it is better to wait for a specific element, and Selenium supports that directly, see: https://selenium-python.readthedocs.io/waits.html#explicit-waits
You can do something like:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome('chromedriver')


def wait_for_id(identifier):
    """
    Waits for the web element with the given id.
    :return: the found selenium web element
    """
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, identifier))
    )
    return element


driver.get('url')
wait_for_id('id').click()
table_text = wait_for_id('id_table').text
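Since the element is clicked right after the wait, the element_to_be_clickable expected condition may be a slightly better fit than presence_of_element_located; a small hedged variant of the helper (the function name is mine):

def wait_for_clickable_id(identifier, timeout=10):
    """Wait until the element with the given id is present and clickable."""
    return WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.ID, identifier))
    )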
I am trying to write a script to automate job applications on LinkedIn using Selenium and Python.
The steps are simple:
1. Open the LinkedIn page, enter the id and password, and log in.
2. Open https://linkedin.com/jobs, enter the search keyword and location, and click search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably because some POST information from the previous page is missing).
3. The click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source

stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)


def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()


def get_experience():
    return "1%C22"


login()

jobs_url = 'https://www.linkedin.com/jobs/'
driver.get(jobs_url)

keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()

WebDriverWait(driver, 10)

# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))

print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the page content to a file, and it is indeed from the previous jobs page, not from the search page. I have also tried to find elements from the search page, like the experience filter and the Easy Apply button, but those lookups end in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in advance.
UPDATE
It works if I directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia, but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia.
The difference between these links is that one takes only one value for experience level while the other takes two (1%2C2 is the URL-encoded "1,2"). This suggests it's probably not an issue with POST values.
You are getting and printing the current URL immediately after clicking the search button, before the page has changed in response to the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; nothing actually waits until you call .until() on it with an expected condition.
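As a minimal sketch of the fix (assuming the click itself succeeds; the "/jobs/search" substring is my guess at the target URL), you could wait for the URL to change before reading it, for example with the url_contains expected condition:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
# Block until the browser has actually navigated to the search results URL
WebDriverWait(driver, 10).until(EC.url_contains("/jobs/search"))
print(driver.current_url)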
I am trying to scrape a website to extract tweet links (specifically DW in this case), but I am unable to get any data because the tweets do not load immediately, so the request finishes before they have had time to load. I have tried the requests timeout as well as time.sleep(), but without luck. After trying those two options I switched to Selenium to load the webpage locally and give it time to load, but I can't seem to make it work. I believe this can be done with Selenium. Here is what I have tried so far:
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome('chromedriver')

links = 'https://www.dw.com/en/vaccines-appear-effective-against-india-covid-variant/a-57344037'
driver.get(links)

delay = 30  # seconds
try:
    WebDriverWait(driver, delay).until(
        EC.visibility_of_all_elements_located((By.ID, "twitter-widget-0"))
    )
except TimeoutException:
    pass

tweetSource = driver.page_source
tweetSoup = BeautifulSoup(tweetSource, features='html.parser')
linkTweets = tweetSoup.find_all('a')

for linkTweet in linkTweets:
    try:
        tweetURL = linkTweet.attrs['href']
    except KeyError:  # skip links without an href
        continue
    if "twitter.com" in tweetURL and "status" in tweetURL:
        # getTweetID, date_tag, title_tag and df are defined elsewhere in the script
        tweetID = getTweetID(tweetURL)
        newdata = [tweetID, date_tag, "DW", links, title_tag, "News", ""]
        # Write to dataframe
        df.loc[len(df)] = newdata
        print("working on tweetID: " + str(tweetID))
If anyone could get Selenium to find the tweet, that would be great!
The tweet widget is inside an iframe, so first you need to switch to that iframe:
iframe = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "twitter-widget-0"))
)
driver.switch_to.frame(iframe)
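After switching, element lookups run inside the iframe's document, so the tweet links should become findable. A small hedged follow-up (the tag and attribute checks are assumptions about the widget's markup, not something confirmed in the question):

# Searches now run inside the Twitter widget's document
for a in driver.find_elements_by_tag_name("a"):
    href = a.get_attribute("href") or ""
    if "twitter.com" in href and "status" in href:
        print(href)

# Switch back to the main page when done
driver.switch_to.default_content()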
My task is to open each URL on the following website and retrieve some evaluation data for each essay. I have located the elements successfully, which means I get 10 elements. However, when Selenium starts imitating a human clicking the URLs, it can only open the first of the ten links.
https://esi.clarivate.com/DocumentsAction.action
The code is as follows.
import time
from selenium import webdriver

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get('https://esi.clarivate.com/IndicatorsAction.action?Init=Yes&SrcApp=IC2LS&SID=H3-M1jrs4mSS2O3WTFbtdrUJugtDvogGRIM-18x2dx2B1ubex2Bo9Y5F6ZPQtUZbfUAx3Dx3Dp1StTsneXx2B7vu85UqXoaoQx3Dx3D-03Ff2gF3hTJGBPDScD1wSwx3Dx3D-cLUx2FoETAVeN3rTSMreq46gx3Dx3D')

# add filter -> research fields -> "clinical medicine"
target = driver.find_element_by_id("ext-gen1065")
time.sleep(1)
target.click()
time.sleep(1)

n = driver.window_handles
driver.switch_to.window(n[-1])

links = driver.find_elements_by_class_name("docTitle")
length = len(links)

for i in range(0, length):
    item = links[i]
    item.click()
    time.sleep(1)
    handles = driver.window_handles
    index_handle = driver.current_window_handle
    # Switch to the newly opened window
    for handle in handles:
        if handle != index_handle:
            driver.switch_to.window(handle)
        else:
            continue
    time.sleep(1)
    u1 = driver.find_elements_by_class_name("large-number")[2].text
    u2 = driver.find_elements_by_class_name("large-number")[3].text
    print(u1, u2)
    print("\n")
    driver.close()
    time.sleep(1)
    driver.switch_to.window(index_handle)

driver.quit()
print("————finished————")
This is the error page I get. I tried to find the problem by testing this code:
links = driver.find_elements_by_class_name("docTitle")
length = len(links)
print(length)
print(links[1].text)
# links[0].click()
links[1].click()
The result is that the length and the title are printed correctly, which means it had already found the element but failed to open it (when using links[0].text, it works fine).
Any idea about this?
I am trying to scrape a long list of books spread over 10 web pages. When the loop clicks the next > button for the first time, the website displays a login overlay, so Selenium cannot find the target elements.
I have tried all the solutions I could think of:
1. Using some Chrome options.
2. Using try/except to click the X button on the overlay. But the overlay appears only once (when clicking next > for the first time). The problem is that when I put this try/except block at the end of the while True: loop, the loop becomes infinite, because I use continue in the except branch since I do not want to break out of the loop.
3. Adding popup-blocker extensions to Chrome, but they do not work when I run the code, even though I add the extension using options.add_argument('load-extension=' + ExtensionPath).
This is my code:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('disable-avfoundation-overlays')
options.add_argument('disable-internal-flash')
options.add_argument('no-proxy-server')
options.add_argument("disable-notifications")
options.add_argument("disable-popup")
Extension = (r'C:\Users\DELL\AppData\Local\Google\Chrome\User Data\Profile 1\Extensions\ifnkdbpmgkdbfklnbfidaackdenlmhgh\1.1.9_0')
options.add_argument('load-extension=' + Extension)
options.add_argument('--disable-overlay-scrollbar')
driver = webdriver.Chrome(options=options)
driver.get('https://www.goodreads.com/list/show/32339._50_?page=')
wait = WebDriverWait(driver, 2)

review_dict = {'title': [], 'author': [], 'rating': []}
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('table', class_='tableList js-dataTooltip')

while True:
    table = driver.find_element_by_xpath('//*[@id="all_votes"]/table')
    for product in table.find_elements_by_xpath(".//tr"):
        for td in product.find_elements_by_xpath('.//td[3]/a'):
            title = td.text
            review_dict['title'].append(title)
        for td in product.find_elements_by_xpath('.//td[3]/span[2]'):
            author = td.text
            review_dict['author'].append(author)
        for td in product.find_elements_by_xpath('.//td[3]/div[1]'):
            rating = td.text[0:4]
            review_dict['rating'].append(rating)
    try:
        close = wait.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[3]/div/div/div[1]/button')))
        close.click()
    except NoSuchElementException:
        continue
    try:
        element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'next_page')))
        element.click()
    except TimeoutException:
        break

df = pd.DataFrame.from_dict(review_dict)
df
Any help would be appreciated: whether I can change the while loop to a for loop that clicks the next > button until the end, where I should put the try/except block that closes the overlay, or whether there is a Chrome option that can disable the overlay.
Thanks in advance.
Thank you for sharing your code and the website you are having trouble with. I was able to close the login modal using XPath. I took on this challenge and broke the code up into class objects: one object wraps selenium.webdriver.chrome.webdriver, and the other represents the page you want to scrape (https://www.goodreads.com/list/show/32339). In the following methods I used JavaScript's return arguments[0].scrollIntoView(); to scroll to the last book displayed on the page. After that, I was able to click the next button.
def scroll_to_element(self, xpath: str):
    element = self.chrome_driver.find_element(By.XPATH, xpath)
    self.chrome_driver.execute_script("return arguments[0].scrollIntoView();", element)

def get_book_count(self):
    return self.chrome_driver.find_elements(By.XPATH, "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr").__len__()

def click_next_page(self):
    # Scroll to the last record and click "next page"
    xpath = "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr[{0}]".format(self.get_book_count())
    self.scroll_to_element(xpath)
    self.chrome_driver.find_element(By.XPATH, "//div[@id='all_votes']//div[@class='pagination']//a[@class='next_page']").click()
Once I clicked the "Next" button, I saw the modal appear. I found the XPath for the modal and was able to close it.
def is_displayed(self, xpath: str, timeout: int = 5):
    try:
        web_element = DriverWait(self.chrome_driver, timeout).until(
            DriverConditions.presence_of_element_located(locator=(By.XPATH, xpath))
        )
        return web_element is not None
    except:
        return False

def is_modal_displayed(self):
    return self.is_displayed("//body[@class='modalOpened']")

def close_modal(self):
    self.chrome_driver.find_element(By.XPATH, "//div[@class='modal__content']//div[@class='modal__close']").click()
    if self.is_modal_displayed():
        raise Exception("Modal Failed To Close")
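The methods above refer to DriverWait and DriverConditions, which I take to be aliases for Selenium's WebDriverWait and expected_conditions, and to a chrome_driver attribute on the page class. A minimal sketch of the imports and wrapper these snippets would plug into, under those assumptions (the class name is mine, not from the original code):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as DriverConditions
from selenium.webdriver.support.ui import WebDriverWait as DriverWait


class GoodreadsListPage:
    def __init__(self, chrome_driver: webdriver.Chrome):
        self.chrome_driver = chrome_driver

    # scroll_to_element, get_book_count, click_next_page,
    # is_displayed, is_modal_displayed and close_modal go here

With that in place, the scraping loop can call click_next_page() and then close_modal() whenever is_modal_displayed() returns True.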
I hope this helps you to solve your problem.
I need to scroll through a web page (Twitter, for example) and scrape the new elements that appear as you move down the site. I am trying to do this with Python 3.x, Selenium, and PhantomJS. This is my code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
user = 'ciroylospersas'
# Start web browser
#browser = webdriver.Firefox()
browser = webdriver.PhantomJS()
browser.set_window_size(1024, 768)
browser.get("https://twitter.com/")
# Fill username in login
element = browser.find_element_by_id("signin-email")
element.clear()
element.send_keys('your twitter user')
# Fill password in login
element = browser.find_element_by_id("signin-password")
element.clear()
element.send_keys('your twitter pass')
browser.save_screenshot('screen.png') # save a screenshot to disk
# Summit the login
element.submit()
time.sleep(5)
browser.save_screenshot('screen1.png') # save a screenshot to disk
# Move to the following url
browser.get("https://twitter.com/" + user + "/following")
browser.save_screenshot('screen2.png') # save a screenshot to disk
scroll_script = "var h = document.body.scrollHeight; window.scrollTo(0, h); return h;"
newHeight = browser.execute_script(scroll_script)
print(newHeight)
browser.save_screenshot('screen3.png') # save a screenshot to disk
The problem is that I can't scroll to the bottom: screen2.png and screen3.png are the same. But if I change the webdriver from PhantomJS to Firefox, the same code works fine. Why?
I was able to get this to work in PhantomJS when trying to solve a similar problem:
check_height = browser.execute_script("return document.body.scrollHeight;")
while True:
    # Scroll to the current bottom of the page, wait, and re-measure
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
It scrolls to the current "bottom", waits, checks whether the page loaded more, and bails out if it did not (assuming everything has loaded once the heights match).
In my original code I also checked a "max" value alongside the matching heights, because I was only interested in the first 10 or so "pages"; if there were more, I wanted to stop loading and skip them. A sketch of that variant is below.
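A minimal sketch of that capped variant, reusing the browser object from the question and assuming a limit of 10 scrolls (the max_scrolls name is mine, not from the original code):

import time

max_scrolls = 10  # stop after roughly this many "pages", even if more would load
check_height = browser.execute_script("return document.body.scrollHeight;")
for _ in range(max_scrolls):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = browser.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break  # nothing new was loaded
    check_height = height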
Also, this is the answer I used as an example