I'm having an issue trying to click on an a tag returned from an XPath query. The line in question is element = atag.xpath("./a"), and I get an error saying Error: 'list' object has no attribute 'click'.
Any help greatly appreciated.
import time
import os.path
import lxml.html as LH
import re
import sys
from selenium import webdriver
from random import randint
PARAMS = sys.argv
URL = PARAMS[1]
BASEURL = URL[:URL.rfind('/')+1]
try:
    PAGE_NUMBER = 1
    #--------------------------------------------------
    ## Get initial page
    driver = webdriver.Firefox()
    driver.get(PARAMS[1])
    #--------------------------------------------------
    ## Get page count
    # Give page time to load
    time.sleep(2)
    PAGE_RAW = driver.page_source
    PAGE_RAW = LH.fromstring(PAGE_RAW)
    PAGE_COUNT_RAW = PAGE_RAW.xpath("//div[contains(@class, 'menu')]/div/ul/li")
    PAGE_COUNT = len(PAGE_COUNT_RAW) - 2
    #--------------------------------------------------
    ## Get page if it's not page one
    while PAGE_NUMBER <= PAGE_COUNT:
        #--------------------------------------------------
        # Delay page processing for a random number of seconds from 2-5
        time.sleep(randint(2,5))
        #--------------------------------------------------
        ## Create empty file
        FILE_NAME = PARAMS[3] + 'json/' + time.strftime("%Y%m%d%H") + '_' + str(PARAMS[2]) + '_' + str(PAGE_NUMBER) + '.json'
        #--------------------------------------------------
        ## Create JSON file if it doesn't exist
        if os.path.exists(FILE_NAME)==False:
            JSON_FILE = open(FILE_NAME, "a+", encoding="utf-8")
        else:
            JSON_FILE = open(FILE_NAME, "w", encoding="utf-8")
        JSON_FILE.write("{")
        #--------------------------------------------------
        # Click page for next page if not page 1
        if PAGE_NUMBER > 1:
            index = 1
            for atag in PAGE_COUNT_RAW:
                if index == (PAGE_NUMBER + 1):
                    element = atag.xpath("./a")
                    element.click()
                index += 1
        #--------------------------------------------------
        ## Process page
        #TODO
        #--------------------------------------------------
        ## Close webdriver
        driver.quit()
        #--------------------------------------------------
        ## Close JSON file
        JSON_FILE.write("}")
        JSON_FILE.close()
        #--------------------------------------------------
        ## Increment page number
        PAGE_NUMBER += 1
        #--------------------------------------------------
except Exception as e:
    print('Error: ' + str(e.args[0]))
You mixed lxml code with Selenium code. Your element is a list returned by lxml; it is not a WebElement or a list of WebElements, so you can't apply click(), even if you try element[0].click().
I'd suggest you avoid using lxml, as it seems to be redundant in this case. Just parse the page source with Selenium's built-in methods.
If you need to get the list of div elements, you can use:
PAGE_COUNT_RAW = driver.find_elements_by_xpath("//div[contains(@class, 'menu')]/div/ul/li")
To find the child a element:
for div in PAGE_COUNT_RAW:
    element = div.find_element_by_xpath('./a')
Note that if you defined PAGE_COUNT_RAW on the first page, it will not be accessible on the next page, so you can scrape just a list of links and then get each link in a loop. Something like:
links = [link.get_attribute('href') for link in driver.find_elements_by_xpath("//div[contains(@class, 'menu')]/div/ul/li/a")]
for link in links:
    driver.get(link)
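Putting it together, a minimal sketch of the Selenium-only flow with an explicit wait instead of a fixed sleep (this assumes the same 'menu' class from your markup and that the href values are absolute URLs):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get(URL)  # URL comes from sys.argv[1] as in your script

# wait for the pagination links instead of a fixed time.sleep(2)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'menu')]/div/ul/li/a"))
)
links = [a.get_attribute('href')
         for a in driver.find_elements_by_xpath("//div[contains(@class, 'menu')]/div/ul/li/a")]
for link in links:
    driver.get(link)
    # process each page here, e.g. via driver.page_source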
If you need more details, update your question with a specific description, as your problem is not quite clear for now.
I'm trying to go to each next page using the code below.
It collects data from page number 1, but when I try to loop it and go to the next page, it gives me an error.
Web page : https://register.fca.org.uk/s/search?q=capital&type=Companies
This is the code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

url = 'https://register.fca.org.uk/s/search?q=capital&type=Companies'
service = Service('linkto crome driver')
service.start()
driver = webdriver.Remote(service.service_url)
driver.get(url)
time.sleep(12)
for j in range(346):
    divs = driver.find_elements_by_xpath('//div[@class="result-card_main"]')
    for i in range(len(divs)):
        time.sleep(10)
        d = driver.find_elements_by_xpath('//div[@class="result-card_main"]')
        RN = ''
        d[i].click()
        time.sleep(12)
        try:
            RNData = driver.find_elements_by_xpath('//*[@id="profile-header"]/div[1]/div/div/div/div/div/div[1]/div[2]/div/div')
            RN = RNData[0].text.split(':')[1].strip()
            print(RN)
        except Exception as e5:
            pass
        if i == (len(divs) - 1):
            pass
        else:
            driver.execute_script("window.history.go(-1)")
    bt = driver.find_elements_by_xpath('//*[@id="-pagination-next-btn"]')
    bt[0].click()
This is the error:
IndexError: list index out of range
How can I solve this problem?
I guess the problem is the following:
bt = driver.find_element_by_xpath('//*[@id="-pagination-next-btn"]')
returns a single WebElement object, not a list, so you can't apply indexing to it with bt[0].
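For reference, a small sketch of the difference between the two calls (using your locator):

bt = driver.find_element_by_xpath('//*[@id="-pagination-next-btn"]')
bt.click()  # singular form: one WebElement, click it directly

bts = driver.find_elements_by_xpath('//*[@id="-pagination-next-btn"]')
if bts:  # plural form: a (possibly empty) list
    bts[0].click()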
UPD:
After changing from find_element_by_xpath to find_elements_by_xpath, you are still getting IndexError: list index out of range because you were on the inner page and performed a driver back action.
Immediately after that, you try to get the next-page button while the main page has still not loaded. This returns an empty list:
bt = driver.find_elements_by_xpath('//*[@id="-pagination-next-btn"]')
and that's why bt[0] fails on an empty list object.
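One way to avoid that race, as a sketch (assuming the same button id), is to wait explicitly for the button to become clickable instead of relying on fixed sleeps:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# waits up to 15 seconds for the pagination button before giving up
bt = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="-pagination-next-btn"]'))
)
bt.click()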
Your problem is this:
if i == (len(divs) - 1):
    pass
else:
    driver.execute_script("window.history.go(-1)")
After clicking the last link, you are not navigating back to the initial page, which is where the pagination button is. I don't think you need this condition at all, so your code could be:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

url = 'https://register.fca.org.uk/s/search?q=capital&type=Companies'
service = Service('linkto crome driver')
service.start()
driver = webdriver.Remote(service.service_url)
driver.get(url)
time.sleep(12)
for j in range(346):
    divs = driver.find_elements_by_xpath('//div[@class="result-card_main"]')
    for i in range(len(divs)):
        time.sleep(10)
        d = driver.find_elements_by_xpath('//div[@class="result-card_main"]')
        RN = ''
        d[i].click()
        time.sleep(12)
        try:
            RNData = driver.find_elements_by_xpath('//*[@id="profile-header"]/div[1]/div/div/div/div/div/div[1]/div[2]/div/div')
            RN = RNData[0].text.split(':')[1].strip()
            print(RN)
        except Exception as e5:
            pass
        driver.execute_script("window.history.go(-1)")
    bt = driver.find_elements_by_xpath('//*[@id="-pagination-next-btn"]')
    bt[0].click()
I am working on a scraping project and am trying to scrape many different profiles. Not all of the profiles have the same information, so I want to skip that piece of data if the current profile does not have it. Here is my current code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Chrome("MY DIRECTORY")
driver.get("https://directory.bcsp.org/")
count = int(input("Number of Pages to Scrape: "))
body = driver.find_element_by_xpath("//body") #
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
while len(profile_count) < count: # Get links up to "count"
    body.send_keys(Keys.END)
    sleep(1)
    profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
for link in profile_count: # Calling up links
    temp = link.get_attribute('href') # temp for
    driver.execute_script("window.open('');") # open new tab
    driver.switch_to.window(driver.window_handles[1]) # focus new tab
    driver.get(temp)
    ##### SCRAPE CODE #####
    Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div')
    IssuedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[1]/div[2]')
    CertificationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]')
    CertfiedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]')
    RecertificationCycle = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[3]/div[2]')
    Expires = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/div[2]')
    AccreditedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[3]/div[2]/a')
    print(Name.text + " : " + IssuedBy.text + " : " + CertificationNumber.text + " : " + CertfiedSince.text + " : " + RecertificationCycle.text + " : " + Expires.text + " : " + AccreditedBy.text)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
driver.close()
Please let me know how I would be able to skip an element if it is not present on the current profile.
According to the docs, find_element_by_xpath() raises a NoSuchElementException if the element you're looking for couldn't be found.
I suggest handling potential NoSuchElementExceptions accordingly. What proper exception handling looks like depends on what you're trying to achieve: you might want to log an error, assign default values, skip certain follow-up actions...
from selenium.common.exceptions import NoSuchElementException

try:
    Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div')
except NoSuchElementException:
    Name = "Default Name"
You could even wrap multiple find_element_by_xpath() calls in your try block.
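For example, a small helper keeps the scraping code readable when many fields are optional; this is only a sketch, and the helper name and the NAME_XPATH / ACCREDITED_BY_XPATH placeholders are hypothetical stand-ins for the long XPaths in your script:

from selenium.common.exceptions import NoSuchElementException

def text_or_default(driver, xpath, default=""):
    # return the element's text, or a default if it is missing on this profile
    try:
        return driver.find_element_by_xpath(xpath).text
    except NoSuchElementException:
        return default

Name = text_or_default(driver, NAME_XPATH, "Default Name")
AccreditedBy = text_or_default(driver, ACCREDITED_BY_XPATH)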
That will fix the try:.. except:.. part, but you have some other errors too. I fixed them all.
Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Chrome('chromedriver')
driver.get("https://directory.bcsp.org/")
count = int(input("Number of Pages to Scrape: "))
body = driver.find_element_by_xpath("//body") #
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
c = 1
while c <= count:
    for link in profile_count: # Calling up links
        temp = link.get_attribute('href') # temp for
        driver.execute_script("window.open('');") # open new tab
        driver.switch_to.window(driver.window_handles[1]) # focus new tab
        driver.get(temp)
        sleep(1)
        ##### SCRAPE CODE #####
        try:
            Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div')
            IssuedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[1]/div[2]')
            CertificationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]')
            CertfiedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]')
            RecertificationCycle = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[3]/div[2]')
        except:
            c -= 1
        driver.switch_to.window(driver.window_handles[0])
        c += 1
        if c > count:
            break
driver.quit()
I'm trying to make a script where the program takes multiple URLs as input and then opens a tab for each of them. This is what I came up with:
from selenium import webdriver

s = raw_input()
l = s.split()
t = len(l)
for elements in l:
    elements = ["https://" + elements + "" for elements in l]
driver = webdriver.Chrome(r"C:/Users/mynam/Desktop/WB/chromedriver.exe")
driver.get("https://www.google.com")
for e in elements:
    driver.implicitly_wait(3)
    driver.execute_script("window.open(e,'new window')")
    print "Opened in new tab"
I get an error that e is not defined. How do we pass an argument to window.open in Selenium?
You need to open a new window, switch to it, and then load the new page.
from selenium import webdriver
import os

def open_tab_page(page, page_number):
    browser.execute_script("window.open('');")
    browser.switch_to.window(browser.window_handles[page_number])
    browser.get(page)

# initialise driver
chrome_driver = os.path.abspath(os.path.dirname(__file__)) + '/chromedriver'
browser = webdriver.Chrome(chrome_driver)
browser.get("http://stackoverflow.com/")

# list of pages to open
pages_list = ['https://www.google.com', 'https://www.youtube.com/']
page_number = 1
for page in pages_list:
    open_tab_page(page, page_number)
    page_number += 1
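As for the original error: JavaScript passed to execute_script() cannot see Python variables by name, so e is undefined inside the script. A sketch of passing the URL in as an argument instead:

for e in elements:
    # arguments[0] receives the Python value passed after the script string
    driver.execute_script("window.open(arguments[0], '_blank');", e)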
I've been working on this Python script for the past day or two, and everything works fine when I use the Firefox webdriver, but when I switch to a headless browser like PhantomJS it fails on the line setNumber = parseSetNumber(setName[0]) with the error Error: list index out of range, because setName is empty.
The line before it, setName = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/text()"), returns nothing when using the PhantomJS webdriver only; with the Firefox webdriver it returns a value fine.
The error only happens when I switch the webdriver from Firefox to PhantomJS. I use PhantomJS because the script is run on a Linux server.
import time
import os.path
import lxml.html as LH
import re
import sys
from selenium import webdriver
from random import randint
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
PARAMS = sys.argv
URL = PARAMS[1]
BASEURL = URL[:URL.rfind('/')+1]
# Parses the set name for the set number
def parseSetNumber(string):
    string = string.split(' ')
    stringLength = len(string)
    string = string[(stringLength - 1)]
    if string.replace('.','').isdigit():
        return string
    else:
        return ""

# Returns set reference for this site
def parseRefId(string):
    string = string.split('_')
    return str(string[2])

try:
    PAGE_NUMBER = 1
    #--------------------------------------------------
    ## Get initial page
    driver = webdriver.PhantomJS()
    driver.get(PARAMS[1])
    #--------------------------------------------------
    ## Get page count
    # Give page time to load
    time.sleep(2)
    PAGE_RAW = driver.page_source
    PAGE_RAW = LH.fromstring(PAGE_RAW)
    PAGE_COUNT_RAW = PAGE_RAW.xpath("//div[contains(@class, 'pageControlMenu')]/div/ul/li")
    PAGE_COUNT = len(PAGE_COUNT_RAW) - 2
    #--------------------------------------------------
    ## Get page if it's not page one
    while PAGE_NUMBER <= PAGE_COUNT:
        #--------------------------------------------------
        ## Create empty file
        FILE_NAME = PARAMS[3] + 'json/' + time.strftime("%Y%m%d%H") + '_' + str(PARAMS[2]) + '_' + str(PAGE_NUMBER) + '.json'
        #--------------------------------------------------
        ## Create JSON file if it doesn't exist
        if os.path.exists(FILE_NAME)==False:
            JSON_FILE = open(FILE_NAME, "a+", encoding="utf-8")
        else:
            JSON_FILE = open(FILE_NAME, "w", encoding="utf-8")
        JSON_FILE.write("{")
        #--------------------------------------------------
        # Click page for next page if not page 1
        if PAGE_NUMBER > 1:
            index = 0
            for atag in PAGE_COUNT_RAW:
                if index == PAGE_NUMBER:
                    elements = driver.find_elements_by_xpath("//div[contains(@class, 'pageControlMenu')]/div/ul/li")
                    if elements:
                        element = elements[index].find_elements_by_xpath("./a")
                        if element:
                            element[0].click()
                            time.sleep(randint(3,5))
                index += 1
        #--------------------------------------------------
        ## Remove survey box if it pops up and log
        try:
            surveyBox = driver.find_element_by_link_text("No, thanks")
            if surveyBox:
                surveyBox.click()
                print("Store[" + str(PARAMS[2]) + "]: Survey box found on page - " + str(PAGE_NUMBER))
        except:
            print("Store[" + str(PARAMS[2]) + "]: No survey box on page - " + str(PAGE_NUMBER))
        #--------------------------------------------------
        ## Process page
        # If page is greater than 1 then get the page source of the new page.
        if PAGE_NUMBER > 1:
            PAGE_RAW = driver.page_source
            PAGE_RAW = LH.fromstring(PAGE_RAW)
        PAGE_RAW = PAGE_RAW.xpath("//div[contains(@class, 'estore_product_container')]")
        index = 0
        size = len(PAGE_RAW)
        for atag in PAGE_RAW:
            if PAGE_NUMBER > 1 and index == 0:
                WebDriverWait(driver,10).until(EC.presence_of_element_located((By.XPATH, "./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a")))
            setStore = PARAMS[2]
            setName = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/text()")
            setNumber = parseSetNumber(setName[0])
            setPrice = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_price')]/text()")
            setLink = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/@href")
            setRef = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_price')]/@id")
            if setRef:
                setRef = parseRefId(setRef[0])
            if re.search('[0-9\.]+', setPrice[0]) is not None:
                JSON_FILE.write("\"" + str(index) + "\":{\"store\":\"" + str(setStore) + "\",\"name\":\"" + str(setName[0]) + "\",\"number\":\"" + str(setNumber) + "\",\"price\":\"" + re.search('[0-9\.]+', setPrice[0]).group() + "\",\"ref\":\"" + str(setRef) + "\",\"link\":\"" + str(setLink[0]) + "\"}")
                if index+1 < size:
                    JSON_FILE.write(",")
            index += 1
        #--------------------------------------------------
        ## Close JSON file
        JSON_FILE.write("}")
        JSON_FILE.close()
        #--------------------------------------------------
        ## Increment page number
        PAGE_NUMBER += 1
        #--------------------------------------------------
    #--------------------------------------------------
    ## Close webdriver
    driver.quit()
    #--------------------------------------------------
except Exception as e:
    print('Error: ' + str(e.args[0]))

# Remove ghostdriver.log file
GHOSTDRIVER_FILE = str(PARAMS[3]) + 'jobs/ghostdriver.log'
if os.path.exists(GHOSTDRIVER_FILE)==True:
    os.remove(GHOSTDRIVER_FILE)
Update
It looks like these are the only two lines not working with PhantomJS; they both return an empty value.
setName = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/text()")
setLink = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/@href")
OK, it looks like I've solved this issue: I had to call set_window_size on the webdriver when using PhantomJS.
Originally:
driver = webdriver.PhantomJS()
driver.get(PARAMS[1])
Solution:
driver = webdriver.PhantomJS()
driver.set_window_size(1024, 768)
driver.get(PARAMS[1])
Now the PhantomJS webdriver works as expected, in the same way the Firefox webdriver does.
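For what it's worth, the likely cause is that PhantomJS starts with a very small default viewport (reportedly 400x300), so the responsive layout hides the product_name links. A quick sketch to confirm the viewport in your own run:

driver = webdriver.PhantomJS()
print(driver.get_window_size())   # default viewport before resizing
driver.set_window_size(1024, 768)
print(driver.get_window_size())   # should now report 1024 x 768
driver.get(PARAMS[1])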
Using Python, Selenium, Sublime and Firefox: I am scraping the links off this website and would like to save the scraped pages (as HTML files) into a folder. However, I have been working for days trying to get the body of these HTML files to dump into a Dropbox folder. The problem is 1) saving the HTML files and 2) saving them to a Dropbox folder (or any folder).
I have successfully written code that will perform a search, then scrape the links off a series of webpages. The following code works well for that.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import re
import csv
import pickle
import signal
import time

def handler(signum, frame):
    raise Exception('Last Resort!')

signal.signal(signal.SIGALRM,handler)

def isReady(browser):
    return browser.execute_script("return document.readyState")=="complete"

def waitUntilReady(browser):
    if not isReady(browser):
        waitUntilReady(browser)

def waitUntilReadyBreak(browser_b,url,counter):
    try:
        signal.alarm(counter)
        waitUntilReady(browser_b)
        signal.alarm(0)
    except Exception,e:
        print e
        signal.alarm(0)
        browser_b.close()
        browser_b = webdriver.Firefox()
        browser_b.get(url)
        waitUntilReadyBreak(browser_b,url,counter)
    return browser_b

browser = webdriver.Firefox()
thisurl = 'http://www.usprwire.com/cgi-bin/news/search.cgi'
browser.get(thisurl)
waitUntilReady(browser)
numarticles = 0
elem = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.NAME, "query")))
elem = browser.find_element_by_name("query")
elem.send_keys('"test"')
form = browser.find_element_by_xpath("/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[3]/td[2]/table/tbody/tr[3]/td/table/tbody/tr[1]/td/font/input[2]").click()
nextpage = False
all_newproduct_links = []
npages = 200
for page in range(1,npages+1):
    if page == 1:
        elems = browser.find_elements_by_tag_name('a')
        article_url = [elems.get_attribute("href")
                       for elems in browser.find_elements_by_class_name('category_links')]
        print page
        print article_url
        print "END_A_PAGE"
        elem = browser.find_element_by_link_text('[>>]').click()
        waitUntilReady(browser)
    if page >=2 <= 200:
        # click the dots
        print page
        print page
        print "B4 LastLoop"
        elems = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.CLASS_NAME, "category_links")))
        elems = browser.find_elements_by_tag_name('a')
        article_url = [elems.get_attribute("href")
                       for elems in browser.find_elements_by_class_name('category_links')]
        print page
        print article_url
        print "END_C_PAGE"
    # This is the part that will not work :(
    for e in elems:
        numarticles = numarticles+1
        numpages = 0
        numpages = numpages+1000
        article_url = e.get_attribute('href')
        print 'waiting'
        bodyelem.send_keys(Keys.COMMAND + "2")
        browser.get(article_url)
        waitUntilReady(browser)
        fw = open('/Users/My/Dropbox/MainFile/articlesdata/'+str(page)+str(numpages)+str(numarticles)+'.html','w')
        fw.write(browser.page_source.encode('utf-8'))
        fw.close()
        bodyelem2 = browser.find_elements_by_xpath("//body")[0]
        bodyelem2.send_keys(Keys.COMMAND + "1")
The above (for e in elems:) is meant to click on the page and create an html file containing the body of the scraped page. I seem to be missing something fundamental.
Any guidance at all would be most appreciated.
I think you are overcomplicating it.
There is at least one problem in this block:
elems = browser.find_elements_by_tag_name('a')
article_url = [elems.get_attribute("href")
               for elems in browser.find_elements_by_class_name('category_links')]
elems would contain a list of elements found by find_elements_by_tag_name(), but then you use the same elems variable in the list comprehension. As a result, when you iterate over elems later you get an error, since elems now refers to a single element and not a list.
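A minimal fix for that block, as a sketch that keeps your locators, is to use a different loop variable so elems is not overwritten:

elems = browser.find_elements_by_tag_name('a')
article_url = [elm.get_attribute("href")
               for elm in browser.find_elements_by_class_name('category_links')]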
Anyway, here is the approach I would take:
gather all the article urls first
iterate over the urls one by one and save the HTML source using the page url name as a filename. E.g. _Iran_Shipping_Report_Q4_2014_is_now_available_at_Fast_Market_Research_326303.shtml would be the article filename
The code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def isReady(browser):
    return browser.execute_script("return document.readyState") == "complete"

def waitUntilReady(browser):
    if not isReady(browser):
        waitUntilReady(browser)

browser = webdriver.Firefox()
browser.get('http://www.usprwire.com/cgi-bin/news/search.cgi')

# make a search
query = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.NAME, "query")))
query.send_keys('"test"')
submit = browser.find_element_by_xpath("//input[@value='Search']")
submit.click()

# grab article urls
npages = 4
article_urls = []
for page in range(1, npages + 1):
    article_urls += [elm.get_attribute("href") for elm in browser.find_elements_by_class_name('category_links')]
    browser.find_element_by_link_text('[>>]').click()

# iterate over urls and save the HTML source
for url in article_urls:
    browser.get(url)
    waitUntilReady(browser)
    title = browser.current_url.split("/")[-1]
    with open('/Users/My/Dropbox/MainFile/articlesdata/' + title, 'w') as fw:
        fw.write(browser.page_source.encode('utf-8'))
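One extra note: the write above works on Python 2, where the encoded str can go into a text-mode file. If you ever run this on Python 3, a sketch of the same step would open the file in binary mode, since encode() returns bytes:

with open('/Users/My/Dropbox/MainFile/articlesdata/' + title, 'wb') as fw:
    fw.write(browser.page_source.encode('utf-8'))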