I have built a web scraper. The program enters a searchterm into a search box and grabs the results. Pandas goes through a spreadsheet line by line in a column to retrieve each searchterm.
Sometimes the page doesn't load properly, prompting a refresh.
I need a way for it to repeat the function and try the same searchterm if it fails. Right now, if I return, it just goes on to the next line in the spreadsheet.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

df = pd.read_csv("searchterms.csv", delimiter=",")

def scrape(searchterm):
    # Loads url
    searchbox = driver.find_element_by_name("searchbox")
    searchbox.clear()
    searchbox.send_keys(searchterm)
    print "Searching for %s ..." % searchterm
    no_result = True
    while no_result is True:
        try:
            # Find results, grab them
            no_result = False
        except:
            # Refresh page and do the above again for the current searchterm - How?
            driver.refresh()
    return pd.Series([col1, col2])

df[["Column 1", "Column 2"]] = df["searchterm"].apply(scrape)
# Executes crawl for each line in csv
The try/except construct comes with an else clause. The else block is executed if everything goes OK:
def scrape(searchterm):
    # Loads url
    no_result = True
    while no_result:
        # Find results, grab them
        searchbox = driver.find_element_by_name("searchbox")
        searchbox.clear()
        try:  # assumes that an exception is thrown if there are no results
            searchbox.send_keys(searchterm)
            print "Searching for %s ..." % searchterm
        except:
            # Refresh page and do the above again for the current searchterm
            driver.refresh()
        else:  # executed if no exceptions were thrown
            no_result = False
            # .. some post-processing code here
    return pd.Series([col1, col2])
(There is also a finally block that is executed no matter what, which is useful for cleanup tasks that don't depend on the success or failure of the preceding code.)
Also, note that a bare except catches any exception and is almost never a good idea. I'm not familiar with how Selenium handles errors, but when catching exceptions, you should specify which exception you are expecting to handle. That way, if an unexpected exception occurs, your code will abort and you'll know that something bad happened.
That is also why you should try to keep as few lines as possible within the try block.
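Putting those pieces together, here is a minimal sketch of the retry pattern with a bounded number of attempts; it reuses the driver and pandas setup from the question, and NoSuchElementException, max_attempts and the grab_results helper are my assumptions, not part of the original code:

from selenium.common.exceptions import NoSuchElementException

def scrape(searchterm, max_attempts=3):  # hypothetical cap on retries
    for _ in range(max_attempts):
        searchbox = driver.find_element_by_name("searchbox")
        searchbox.clear()
        searchbox.send_keys(searchterm)
        try:
            # Only the fragile lookup lives inside the try block
            col1, col2 = grab_results()  # hypothetical helper
        except NoSuchElementException:   # the specific failure we expect
            driver.refresh()             # retry the same searchterm
        else:
            return pd.Series([col1, col2])
    return pd.Series([None, None])       # give up after max_attempts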
Hi, I am a high school student who has not used Python much, and I am having trouble writing code to check when a website has been updated. I have looked at different resources and used them to create what I have, but when I run the code it doesn't do what I expect. I expect it to tell me whether a site has been updated or stayed the same since I last checked it. I put some print statements in the code to try to catch the issue, but they only show me that the website has changed, even though it doesn't look like it has.
import time
import hashlib
from urllib.request import urlopen, Request

url = Request('https://www.canada.ca/en/immigration-refugees-citizenship/services/immigrate-canada/express-entry/submit-profile/rounds-invitations.html')
res = urlopen(url).read()
current = hashlib.sha224(res).hexdigest()
print("running")
time.sleep(10)

while True:
    try:
        res = urlopen(url).read()
        current = hashlib.sha224(res).hexdigest()
        print(current)
        print(res)
        time.sleep(30)
        res = urlopen(url).read()
        newHash = hashlib.sha224(res).hexdigest()
        print(newHash)
        print(res)
        if newHash == current:
            print("nothing changed")
            continue
        else:
            print("there was a change")
    except AttributeError as e:
        print("error")
Loop works when import image is not scripted
import os
import pandas
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By

pre = os.path.dirname(os.path.realpath(__file__))
f_name = 'wpcontacts.xlsx'
path = os.path.join(pre, f_name)
f_name = pandas.read_excel(path)
count = 0
image_url = input("url here")

driver = webdriver.Chrome(executable_path='D:/Old Data/Integration Files/new/chromedriver')
driver.get('https://web.whatsapp.com')
sleep(25)

for column in f_name['Contact'].tolist():
    try:
        driver.get('https://web.whatsapp.com/send?phone=' + str(f_name['Contact'][count]) + '&text=' + str(
            f_name['Messages'][0]))
        sent = False
        sleep(7)
        # It tries 3 times to send a message in case any error occurred
        click_btn = driver.find_element(By.XPATH,
            '/html/body/div[1]/div/div/div[4]/div/footer/div[1]/div/span[2]/div/div[2]/div[2]/button/span')
        file_path = 'amazzon.jpg'
        driver.find_element(By.XPATH,
            '//*[@id="main"]/footer/div[1]/div/span[2]/div/div[1]/div[2]/div/div/span').click()
        sendky = driver.find_element(By.XPATH,
            '//*[@id="main"]/footer/div[1]/div/span[2]/div/div[1]/div[2]/div/span/div/div/ul/li[1]/button/span')
        input_box = driver.find_element(By.TAG_NAME, 'input')
        input_box.send_keys(image_url)
        sleep(3)
    except Exception:
        print("Sorry, message could not be sent to " + str(f_name['Contact'][count]))
    else:
        sleep(3)
        driver.find_element(By.XPATH,
            '//*[@id="app"]/div/div/div[2]/div[2]/span/div/span/div/div/div[2]/div/div[2]/div[2]/div/div').click()
        sleep(2)
        print('Message sent to: ' + str(f_name['Contact'][count]))

count = count + 1
The output is:
Message sent to: 919891350373
Process finished with exit code 0
How do I convert this code into a loop so that I can send the text to every number mentioned in the Excel file?
Thanks
Firstly, if what you've written in the question is the code you are using, I am confused how you aren't getting a syntax error due to the tab spacing, e.g. here:
try:
driver.get('https://web.whatsapp.com/send?phone=' + str(f_name['Contact'][count]) + '&text=' + str(
f_name['Messages'][0]))
I am going to assume this is a mixup related to copy-paste.
Next, I'll just mention the following: I highly doubt you need a 25-second sleep for the page to load, and the default timeout for Selenium tests is 30 seconds, so with the other sleeps you've added I'm not sure why it's not simply timing out, unless you've overridden this timeout in some other part of the code that's not in your question.
What is the point of doing driver.get('https://web.whatsapp.com'), then following it with another driver.get()?
All this aside, it would make sense to me that your problem lies with the spacing of your increment count = count + 1; it is not inside your for loop in the code as I see it. So the count is not actually incremented in the loop itself, but rather after the whole loop has executed. If adding a tab before the count increment does not help, I'm quite sure you've made some mistake(s) pasting the code here, so please organize it so that we can see what code is actually being executed.
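As a side note, a common way to avoid manual index bookkeeping entirely is enumerate; a minimal sketch, assuming the same f_name DataFrame as in the question:

# Hypothetical restructuring: let the loop supply the index so it can
# never drift out of sync with the row being processed.
for count, contact in enumerate(f_name['Contact'].tolist()):
    driver.get('https://web.whatsapp.com/send?phone=' + str(contact)
               + '&text=' + str(f_name['Messages'][0]))
    # ... locate the send button and send the message for this contact ...
    print('Message sent to: ' + str(contact))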
Finally, another comment I have: the xpaths you've got scare me. You should almost NEVER use an absolute xpath (like '/html/body/div[1]/div/div/div[4]/div/footer/div[1]/div/span[2]/div/div[2]/div[2]/button/span'). Just about any change to the HTML on the page will cause this to break. I haven't the time to find better selectors for you, but I highly recommend you examine these.
Let me know whether any of the above helps or not!
I have a for loop that iterates over a list of URLs. The URLs are then loaded by the Chrome driver. Some URLs will load a page that is in a 'bad' format and it will fail the first XPath test. If it does, I want it to go back to the next element in the loop. My cleanup code works, but I can't seem to get it to go to the next element in the for loop. I have an except that closes my web browser, but nothing I tried would allow me to then loop back to 'for row in mysql_cats'.
for row in mysql_cats:
    print('Here is the url -', row[1])
    cat_url = (row[1])
    driver = webdriver.Chrome()
    driver.get(cat_url)  # Download the URL passed from mysql
    try:
        CategoryName = driver.find_element_by_xpath('//h1[@class="categoryL3"]|//h1[@class="categoryL4"]').text  # finds either L3 or L4 category
    except:
        driver.close()
        # this does close the webdriver okay if it can't find the xpath,
        # but I can't get code here to make it go to the next row in mysql_cats
I hope that you're also closing the driver at the end of this code when no exceptions occur.
If you want to start from the beginning of the loop when an exception is raised, you may add continue, as suggested in other answers:
try:
    CategoryName = driver.find_element_by_xpath('//h1[@class="categoryL3"]|//h1[@class="categoryL4"]').text  # finds either L3 or L4 category
except NoSuchElementException:
    driver.close()
    continue  # jumps to the beginning of the for loop
Since I do not know your code, the following tip may be useless, but a common way to handle these cases is a try/except/finally clause:
for row in mysql_cats:
    print('Here is the url -', row[1])
    cat_url = (row[1])
    driver = webdriver.Chrome()
    driver.get(cat_url)  # Download the URL passed from mysql
    try:
        pass  # my code, with dangerous stuff
    except NoSuchElementException:
        pass  # handling of 'NoSuchElementException'; no need to 'continue'
    except SomeOtherUglyException:
        pass  # handling of 'SomeOtherUglyException'
    finally:  # Code that is ALWAYS executed, with or without exceptions
        driver.close()
I'm also assuming that you're creating a new driver each time for a reason. If it is not intentional, you may use something like this:
driver = webdriver.Chrome()
for row in mysql_cats:
    print('Here is the url -', row[1])
    cat_url = (row[1])
    driver.get(cat_url)  # Download the URL passed from mysql
    try:
        pass  # my code, with dangerous stuff
    except NoSuchElementException:
        pass  # handling of 'NoSuchElementException'; no need to 'continue'
    except SomeOtherUglyException:
        pass  # handling of 'SomeOtherUglyException'
driver.close()
In this way, you have only one driver that manages all the pages you're trying to open in the for loop.
Have a look at how try/except/finally is really useful when handling connections and drivers.
As a footnote, I'd like you to notice how in the code I always specify which exception I am expecting: catching all exceptions can be dangerous. BTW, probably no one will die if you simply use except:
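For completeness, a minimal sketch of the same cleanup idea with try/finally around the whole loop (my addition, not part of the answer above), so the driver is closed even if something unexpected blows up:

# Hypothetical variant: one driver for all pages, guaranteed cleanup.
driver = webdriver.Chrome()
try:
    for row in mysql_cats:
        driver.get(row[1])
        # ... scrape the page ...
finally:
    driver.quit()  # always runs, with or without exceptions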
I have some experience in Python, but I have never used try and except to catch errors, due to lack of formal training.
I am working on extracting a few articles from wikipedia. For this I have an array of titles, a few of which do not have any article or search result at the end. I would like the page retrieval function just to skip those few names and continue running the script on the rest. Reproducible code follows.
import wikipedia
# This one works.
links = ["CPython"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
#The sequence breaks down if there is no wikipedia page.
links = ["CPython","no page"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
The library it uses has a method like this. Normally changing it would be really bad practice, but since this is just a one-off data extraction, I am willing to change the local copy of the library to get it to work. Edit: I have now included the complete function.
def page(title=None, pageid=None, auto_suggest=True, redirect=True, preload=False):
    '''
    Get a WikipediaPage object for the page with title `title` or the pageid
    `pageid` (mutually exclusive).

    Keyword arguments:
    * title - the title of the page to load
    * pageid - the numeric pageid of the page to load
    * auto_suggest - let Wikipedia find a valid page title for the query
    * redirect - allow redirection without raising RedirectError
    * preload - load content, summary, images, references, and links during initialization
    '''
    if title is not None:
        if auto_suggest:
            results, suggestion = search(title, results=1, suggestion=True)
            try:
                title = suggestion or results[0]
            except IndexError:
                # if there is no suggestion or search results, the page doesn't exist
                raise PageError(title)
        return WikipediaPage(title, redirect=redirect, preload=preload)
    elif pageid is not None:
        return WikipediaPage(pageid=pageid, preload=preload)
    else:
        raise ValueError("Either a title or a pageid must be specified")
What should I do to retrieve only the pages that do not give the error? Maybe there is a way to filter out all items in the list that give this error, or an error of some kind. Returning "NA" or similar would be fine for pages that don't exist. Skipping them without notice would be fine too. Thanks!
The function wikipedia.page will raise a wikipedia.exceptions.PageError if the page doesn't exist. That's the error you want to catch.
import wikipedia

links = ["CPython", "no page"]
test = []
for link in links:
    try:
        # try to load the wikipedia page
        page = wikipedia.page(link, auto_suggest=False)
        test.append(page)
    except wikipedia.exceptions.PageError:
        # if a "PageError" was raised, ignore it and continue to next link
        continue
You have to surround the call to wikipedia.page with a try block, so I'm afraid you can't use a list comprehension directly.
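That said, you can get the comprehension back by wrapping the call in a small helper that swallows the error; a minimal sketch (safe_page is a hypothetical name, not part of the library):

def safe_page(link):
    # Returns the page, or None when the title doesn't exist.
    try:
        return wikipedia.page(link, auto_suggest=False)
    except wikipedia.exceptions.PageError:
        return None

test = [p.content for p in map(safe_page, links) if p is not None]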
Understand that this will be bad practice, but for a one-off quick and dirty script you can just:
Edit: Wait, sorry. I've just noticed the list comprehension. I'm actually not sure this will work without breaking that down:
links = ["CPython", "no page"]
test = []
for link in links:
try:
page = wikipedia.page(link, auto_suggest=False)
test.append(page)
except wikipedia.exceptions.PageError:
pass
test = [testitem.content for testitem in test]
print(test)
pass essentially tells Python to trust you and ignore the error so that it can continue on about its day.
My script below scrapes a website and returns the data from a table. It's not finished but it works. The problem is that it has no error checking. Where should I have error handling in my script?
There are no unit tests; should I write some and schedule my unit tests to run periodically? Or should the error handling be done in my script?
Any advice on the proper way to do this would be great.
#!/usr/bin/env python
''' Gets the Canadian Monthly Residential Bill Calculations table
    from URL and saves the results to a sqlite database.
'''
import urllib2
from BeautifulSoup import BeautifulSoup

class Bills():
    ''' Canadian Monthly Residential Bill Calculations '''

    URL = "http://www.hydro.mb.ca/regulatory_affairs/energy_rates/electricity/utility_rate_comp.shtml"

    def __init__(self):
        ''' Initialization '''
        self.url = self.URL
        self.data = []
        self.get_monthly_residential_bills(self.url)

    def get_monthly_residential_bills(self, url):
        ''' Gets the Monthly Residential Bill Calculations table from URL '''
        doc = urllib2.urlopen(url)
        soup = BeautifulSoup(doc)
        res_table = soup.table.th.findParents()[1]
        results = res_table.findNextSibling()
        header = self.get_column_names(res_table)
        self.get_data(results)
        self.save(header, self.data)

    def get_data(self, results):
        ''' Extracts data from search result. '''
        rows = results.childGenerator()
        data = []
        for row in rows:
            if row == "\n":
                continue
            for td in row.contents:
                if td == "\n":
                    continue
                data.append(td.text)
            self.data.append(tuple(data))
            data = []

    def get_column_names(self, table):
        ''' Gets table title, subtitle and column names '''
        results = table.findAll('tr')
        title = results[0].text
        subtitle = results[1].text
        cols = results[2].childGenerator()
        column_names = []
        for col in cols:
            if col == "\n":
                continue
            column_names.append(col.text)
        return title, subtitle, column_names

    def save(self, header, data):
        pass

if __name__ == '__main__':
    a = Bills()
    for td in a.data:
        print td
See the documentation of all the functions and see what exceptions they throw.
For example, for urllib2.urlopen(), it's written that it raises URLError on errors, which is a subclass of IOError.
So, for the urlopen(), you could do something like:

import sys

try:
    doc = urllib2.urlopen(url)
except IOError:
    print >> sys.stderr, 'Error opening URL'
Similarly, do the same for the others.
You should write unit tests and you should use exception handling. But only catch the exceptions you can handle; you do no one any favors by catching everything and throwing any useful information out.
Unit tests aren't run periodically though; they're run before and after the code changes (although it is feasible for one change's "after" to become another change's "before" if they're close enough).
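For instance, here is a minimal unit test sketch for get_column_names, feeding it a small hand-built table instead of the live page; the HTML fixture, the test names, and the im_func trick to bypass the network call in __init__ are all my own assumptions:

import unittest
from BeautifulSoup import BeautifulSoup

class GetColumnNamesTest(unittest.TestCase):
    ''' Checks the parsing logic against a tiny hand-built table. '''

    def test_get_column_names(self):
        html = ('<table>'
                '<tr><th>My Title</th></tr>'
                '<tr><th>My Subtitle</th></tr>'
                '<tr><td>City</td><td>Rate</td></tr>'
                '</table>')
        table = BeautifulSoup(html).table
        # im_func reaches the plain function, so we avoid running
        # Bills.__init__, which would hit the network.
        title, subtitle, names = Bills.get_column_names.im_func(None, table)
        self.assertEqual(title, 'My Title')
        self.assertEqual(subtitle, 'My Subtitle')
        self.assertEqual(names, ['City', 'Rate'])

if __name__ == '__main__':
    unittest.main()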
A couple of places you need to have them is in importing things like tkinter:

try:
    import Tkinter as tk
except ImportError:
    import tkinter as tk

Also anywhere the user enters something with an intended type. A good way to figure this out is to run it and try really hard to make it crash, e.g. by typing in the wrong type.
The answer to "where should I have error handling in my script?" is basically "any place where something could go wrong", which depends entirely on the logic of your program.
In general, any place where your program relies on an assumption that a particular operation worked as you intended, and there's a possibility that it may not have, you should add code to check whether or not it actually did work, and take appropriate remedial action if it didn't. In some cases, the underlying code might generate an exception on failure and you may be happy to just let the program terminate with an uncaught exception without adding any error-handling code of your own, but (1) this would be, or ought to be, rare if anyone other than you is ever going to use that program; and (2) I'd say this would fall into the "works as intended" category anyway.
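To make that concrete with the script above: get_monthly_residential_bills assumes the page actually contains a table. A minimal sketch of checking that assumption before relying on it (the guard and its error message are my own additions):

def get_monthly_residential_bills(self, url):
    ''' Gets the Monthly Residential Bill Calculations table from URL '''
    doc = urllib2.urlopen(url)  # may raise URLError, a subclass of IOError
    soup = BeautifulSoup(doc)
    if soup.table is None or soup.table.th is None:
        # The layout changed or the wrong page was served; fail loudly
        # instead of letting an AttributeError surface further down.
        raise ValueError("No bill table found at %s" % url)
    res_table = soup.table.th.findParents()[1]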