Take snapshot using Selenium for page that doesn't stop loading - python

I'm using Selenium to capture screenshots of a web page. It works great on sites like stackoverflow, but I'm trying to use it on a page that never stops loading. Is there a way to grab the screenshot after x seconds, regardless of whether the page has finished loading?
Current code:
import os
from selenium import webdriver

def main():
    driver = webdriver.Chrome()
    with open('test.txt', 'r') as f:
        for url in f.readlines():
            driver.get('http://' + url.strip())
            sn_name = os.path.join('Screenshots', url.strip().replace('/', '-') + '.png')
            print('Attempting to save:', sn_name)
            if not driver.save_screenshot(sn_name):
                raise Exception('Could not save screen shot: ' + sn_name)
    driver.quit()

if __name__ == '__main__':
    main()

I don't think it works like that.
WebDriver implicitly waits for the page to finish loading until the page-load timeout expires.
At that point it should raise a timeout exception.
You should wrap the call in a try/except, catch that exception, and then take the screenshot.
Otherwise, you could use multithreading and have a second thread take the screenshot.
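A minimal sketch of that first approach (the 30-second value passed to set_page_load_timeout, the example URL, and the window.stop() workaround are assumptions, not part of the original question):
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
# Assumption: give each page at most 30 seconds to finish loading.
driver.set_page_load_timeout(30)

try:
    driver.get('http://example.com')  # hypothetical URL that never stops loading
except TimeoutException:
    # The page is still loading; stop it and capture whatever has rendered so far.
    driver.execute_script("window.stop();")

driver.save_screenshot('example.png')
driver.quit()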

Related

How to add a function to monitor link clicking using selenium?

I wrote a short program to automate the process of clicking and saving profiles on LinkedIn.
Brief:
The program reads a txt file containing a large number of LinkedIn URLs.
Using Selenium, it opens them one by one and clicks the "Open in Sales Navigator" button.
A new tab opens, where it needs to click the "Save" button and choose the relevant list to save to.
I have two main problems:
LinkedIn has three versions of the same page. How can I check which version I am on (meaning: if a button can't be found, move on to the next version)? From what I've seen, a plain "if" doesn't work well with Selenium here, because a missing element raises an exception rather than returning False. Any other suggestions?
More important, and the reason I opened this thread: I want to monitor the "failed" links. Say I have a list of 1,000 LinkedIn URLs and I run the program to save them to my account. I want to track the ones it failed to open or failed to save (broken links, unavailable pages, etc.). To do that I used a CSV file, but what I ended up recording were the pages that were already saved on this account, which doesn't solve my problem. How can I make it record all of the failures and not just the ones that were already saved? (I find this hard to do because when a page comes up as "Unavailable", the program jumps to the next one and I couldn't find a way to record it.)
That makes the tool hard to work with: when I feed in 500 or 1,000 URLs, I can't tell which ones were saved and which weren't.
Here's the code:
import selenium.webdriver as webdriver
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
import csv
import random

options = webdriver.ChromeOptions()
options.add_argument('--lang=EN')
options.add_argument("--start-maximized")
prefs = {"profile.default_content_setting_values.notifications": 2}
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(executable_path='assets\chromedriver', chrome_options=options)
driver.get("https://www.linkedin.com/login?fromSignIn=true")

minDelay = input("\n Please provide min delay in seconds : ")
maxDelay = input("\n Please provide max delay in seconds : ")
listNumber = input("\n Please provide list number : ")
outputFile = input('\n save skipped as?: ')

count = 0
closed = 2
with open("links.txt", "r") as links:
    for link in links:
        try:
            driver.get(link.strip())
            sleep(3)
            driver.find_element_by_xpath("//button[@class='save-to-list-dropdown__trigger ph5 artdeco-button artdeco-button--primary artdeco-button--3 artdeco-button--pro artdeco-dropdown__trigger artdeco-dropdown__trigger--placement-bottom ember-view']").click()
            sleep(2)
            count += 1
            if count == 1:
                driver.find_element_by_xpath("//ul[@class='save-to-list-dropdown__content']//ul//li[" + str(listNumber) + "]").click()
            else:
                driver.find_element_by_xpath("//ul[@class='save-to-list-dropdown__content']//ul//li[1]").click()
            sleep(2)
            sleep(random.randint(int(minDelay), int(maxDelay)))
        except:
            if closed == 0:
                driver.close()
                sleep(1)
            fileOutput = open(outputFile + ".csv", mode='a', newline='', encoding='utf-8')
            file_writer = csv.writer(fileOutput, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            file_writer.writerow([link.strip()])
            fileOutput.close()
print("Finished.")
The common approach for hooking in listeners of this kind is to use EventFiringWebDriver. See the example here:
from selenium import webdriver
from selenium.webdriver.support.abstract_event_listener import AbstractEventListener
from selenium.webdriver.support.event_firing_webdriver import EventFiringWebDriver

class EventListener(AbstractEventListener):
    def before_click(self, element, driver):
        if element.tag_name == 'a':
            print('Clicking link:', element.get_attribute('href'))

if __name__ == '__main__':
    driver = EventFiringWebDriver(driver=webdriver.Firefox(), event_listener=EventListener())
    driver.get("https://webelement.click/en/welcome")
    link = driver.find_element_by_xpath('//a[text()="All Posts"]')
    link.click()
    driver.quit()
UPD:
Basically your case does not really need that listener. However, you can use it. Say you have a links file like:
https://google.com
https://invalid.url
https://duckduckgo.com/
https://sadfsdf.sdf
https://stackoverflow.com
Then the way with EventFiringWebDriver would be:
from selenium import webdriver
from selenium.webdriver.support.abstract_event_listener import AbstractEventListener
from selenium.webdriver.support.event_firing_webdriver import EventFiringWebDriver

broken_urls = []

class EventListener(AbstractEventListener):
    def on_exception(self, exception, drv):
        broken_urls.append(drv.current_url)

if __name__ == '__main__':
    driver = EventFiringWebDriver(driver=webdriver.Firefox(), event_listener=EventListener())
    with open("links.txt", "r") as links:
        for link in links:
            try:
                driver.get(link.strip())
            except:
                print('Cannot reach the link', link.strip())
    print("Finished.")
    driver.quit()

    import csv
    with open('broken_urls.csv', 'w', newline='') as broken_urls_csv:
        wr = csv.writer(broken_urls_csv, quoting=csv.QUOTE_ALL)
        wr.writerow(broken_urls)
and without EventFiringWebDriver would be:
broken_urls = []

if __name__ == '__main__':
    from selenium import webdriver

    driver = webdriver.Firefox()
    with open("links.txt", "r") as links:
        for link in links:
            stripped_link = link.strip()
            try:
                driver.get(stripped_link)
            except:
                print('Cannot reach the link', link.strip())
                broken_urls.append(stripped_link)
    print("Finished.")
    driver.quit()

    import csv
    with open('broken_urls.csv', 'w', newline='') as broken_urls_csv:
        wr = csv.writer(broken_urls_csv, quoting=csv.QUOTE_ALL)
        wr.writerow(broken_urls)

TBSelenium: Tor page closes immediately

I'm trying to use Tor with selenium, which works through the use of tbselenium.
However, when loading a URL or clicking a web element, the page immediately closes once the action finishes, instead of remaining open as it would when using Selenium with Chrome.
Any ideas on how to keep the page open?
import tbselenium.common as cm
from tbselenium.tbdriver import TorBrowserDriver
from tbselenium.utils import launch_tbb_tor_with_stem

tbb_dir = "C:\\pathto\\Tor Browser\\"
tor_process = launch_tbb_tor_with_stem(tbb_path=tbb_dir)

for i in range(1):
    with TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM) as driver:
        driver.load_url("http://hln.be", 3, wait_for_page_body=True)
        # driver.get('https://google.be')
        try:
            policypage = driver.find_element_by_xpath("//a[contains(@href,'members/join')]")
            policypage.click()
            usern = driver.find_element_by_xpath("//input[contains(@id,'user_member_username')]")
            usern.send_keys('Tryout')
        except:
            print('different look')
As Furas said, use the standard driver declaration: it is the "with TorBrowserDriver(...)" context manager that closes the browser as soon as the block is exited.
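A minimal sketch of that approach, reusing the paths and calls from the question above (the tor_process.kill() cleanup at the end is an assumption about the process object returned by launch_tbb_tor_with_stem, not something shown in the question):
import tbselenium.common as cm
from tbselenium.tbdriver import TorBrowserDriver
from tbselenium.utils import launch_tbb_tor_with_stem

tbb_dir = "C:\\pathto\\Tor Browser\\"
tor_process = launch_tbb_tor_with_stem(tbb_path=tbb_dir)

# Instantiate the driver directly instead of using "with", so the browser
# only closes when quit() is called explicitly.
driver = TorBrowserDriver(tbb_dir, tor_cfg=cm.USE_STEM)
driver.load_url("http://hln.be", 3, wait_for_page_body=True)

# ... interact with the page here; it stays open ...

driver.quit()       # close the browser when you are done
tor_process.kill()  # assumption: stop the Tor process started above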

Python and Selenium: I am automating web scraping among pages. How can I loop by Next button?

I have already written several lines of code to pull URLs from this website:
http://www.worldhospitaldirectory.com/United%20States/hospitals
The code is below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv

driver = webdriver.Firefox()
driver.get('http://www.worldhospitaldirectory.com/United%20States/hospitals')

url = []
pagenbr = 1

while pagenbr <= 115:
    current = driver.current_url
    driver.get(current)
    lks = driver.find_elements_by_xpath('//*[@href]')
    for ii in lks:
        link = ii.get_attribute('href')
        if '/info' in link:
            url.append(link)
    print('page ' + str(pagenbr) + ' is done.')
    if pagenbr <= 114:
        elm = driver.find_element_by_link_text('Next')
        driver.implicitly_wait(10)
        elm.click()
    time.sleep(2)
    pagenbr += 1

ls = list(set(url))
with open('US_GeneralHospital.csv', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for u in ls:
        wr.writerow([u])
It works very well for pulling the individual links from this website.
The problem is that I have to change the hard-coded page count myself every time.
I want to upgrade the code so that it works out how many iterations it needs instead of my entering the number manually.
Thank you very much.
It is a bad idea to hardcode the number of pages in your script. Instead, just keep clicking the "Next" button while it is enabled:
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        # do whatever you need to do on the page
        driver.find_element_by_xpath('//li[not(@class="disabled")]/span[text()="Next"]').click()
    except NoSuchElementException:
        break
This should let the scraping run until the last page is reached.
Also note that the lines current = driver.current_url and driver.get(current) only reload the page you are already on, so you can drop them.
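Putting the two pieces together, a sketch of the whole loop could look like the following (written for Python 3, which is an assumption; the XPath expressions are taken from the question and the answer above, and the output filename is kept from the question):
import csv
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get('http://www.worldhospitaldirectory.com/United%20States/hospitals')

urls = set()
while True:
    # collect every /info link on the current page
    for el in driver.find_elements_by_xpath('//*[@href]'):
        link = el.get_attribute('href')
        if '/info' in link:
            urls.add(link)
    try:
        # click "Next" while it is still enabled, otherwise stop
        driver.find_element_by_xpath('//li[not(@class="disabled")]/span[text()="Next"]').click()
        time.sleep(2)  # crude wait for the next page to load
    except NoSuchElementException:
        break

with open('US_GeneralHospital.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for u in sorted(urls):
        wr.writerow([u])

driver.quit()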

Scrape with BeautifulSoup from site that uses AJAX pagination using Python

I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search result pages and scrapes each page for all of the URLs. I've got all of the scraping working, but I can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just build the URL in a loop to capture each page of results, but that's not possible here. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
import io
import urllib2
from bs4 import BeautifulSoup

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness: while you could try inspecting the POST request and work out a way to fetch the next page directly, as I suggested in my comment, if an alternative is acceptable then Selenium makes it quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# uncomment if using Firefox web browser
driver = webdriver.Firefox()
# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1

with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists
            btn_next = driver.find_element_by_class_name('next')
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href') + '\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)
Hope this helps.

Python - Selenium - Print Webpage

How do I print a webpage using Selenium, please?
import time
from selenium import webdriver

# Initialise the webdriver
chromeOps = webdriver.ChromeOptions()
chromeOps._binary_location = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
chromeOps._arguments = ["--enable-internal-flash"]
browser = webdriver.Chrome("C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe", port=4445, chrome_options=chromeOps)
time.sleep(3)

# Login to Webpage
browser.get('www.webpage.com')
Note: I am using the version of Google Chrome that is current at the time of writing: Version 32.0.1700.107 m
While it's not directly printing the webpage, it is easy to take a screenshot of the entire current page:
browser.save_screenshot("screenshot.png")
Then the image can be printed using any image printing library. I haven't personally used any such library so I can't necessarily vouch for it, but a quick search turned up win32print which looks promising.
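As a rough, untested sketch of that idea (assuming a Windows machine with the pywin32 and Pillow packages installed, and printing to the default printer), it might look something like this:
import win32print
import win32ui
from PIL import Image, ImageWin

# Assumption: "screenshot.png" is the file produced by browser.save_screenshot() above.
printer_name = win32print.GetDefaultPrinter()

hdc = win32ui.CreateDC()
hdc.CreatePrinterDC(printer_name)

img = Image.open("screenshot.png")

hdc.StartDoc("screenshot.png")
hdc.StartPage()
# Draw the image at its pixel size; real code would scale it to the printable area.
ImageWin.Dib(img).draw(hdc.GetHandleOutput(), (0, 0, img.size[0], img.size[1]))
hdc.EndPage()
hdc.EndDoc()
hdc.DeleteDC()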
The key "trick" is that we can execute JavaScript in the selenium browser window using the "execute_script" method of the selenium webdriver, and if you execute the JavaScript command "window.print();" it will activate the browsers print function.
Now, getting it to work elegantly requires setting a few preferences to print silently, remove print progress reporting, etc. Here is a small but functional example that loads up and prints whatever website you put in the last line (where 'http://www.cnn.com/' is now):
import time
from selenium import webdriver
import os

class printing_browser(object):
    def __init__(self):
        self.profile = webdriver.FirefoxProfile()
        self.profile.set_preference("services.sync.prefs.sync.browser.download.manager.showWhenStarting", False)
        self.profile.set_preference("pdfjs.disabled", True)
        self.profile.set_preference("print.always_print_silent", True)
        self.profile.set_preference("print.show_print_progress", False)
        self.profile.set_preference("browser.download.show_plugins_in_list", False)
        self.driver = webdriver.Firefox(self.profile)
        time.sleep(5)

    def get_page_and_print(self, page):
        self.driver.get(page)
        time.sleep(5)
        self.driver.execute_script("window.print();")

if __name__ == "__main__":
    browser_that_prints = printing_browser()
    browser_that_prints.get_page_and_print('http://www.cnn.com/')
The key command you were probably missing was self.driver.execute_script("window.print();"), but you need some of that setup in __init__ to make it run smoothly, so I thought I'd give a fuller example. The trick on its own was mentioned in a comment above, so some credit should go there too.
