How to add a function to monitor link clicking using selenium? - python

I wrote a short program to automate the process of clicking and saving profiles on LinkedIn.
Brief:
The program reads from a txt file containing a large number of LinkedIn URLs.
Using Selenium, it opens them one by one and hits the "Open in Sales Navigator" button.
A new tab opens, and on it the program needs to click the "Save" button and choose the relevant list to save to.
I have two main problems:
LinkedIn has 3 versions of the same page. How can I use a condition to check which page version it is? (Meaning: if you can't find this button, move on to the next version.) From what I've seen, you can't really use an "if" statement with Selenium, because it causes trouble. Any other suggestions?
More important, and the reason I opened this thread: I want to monitor the "failed" links. Say I have a list of 1000 LinkedIn URLs and I run the program to save them to my account. I want to track the ones it didn't save or failed to open (broken links, page unavailable, etc.). To do that, I used a CSV file and had the program record all the pages that were already saved on the account, but that doesn't solve my problem. How can I make it record all of the failures, not just the profiles that were already saved? (I find this hard to implement because when a page appears as "Unavailable", the program jumps to the next one, and I couldn't find a way to make it record that link.)
This makes the program hard to work with, because when I feed it 500 or 1000 URLs, I can't tell which ones were saved and which weren't.
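For the first problem, one way to branch on which page version loaded is to probe for each version's button with find_elements (plural), which returns an empty list instead of raising an exception. A minimal sketch; the second and third XPaths are hypothetical placeholders, since only the first class name appears in the code below:

```python
# Probe known page variants in order; find_elements returns [] when a
# selector matches nothing, so no try/except is needed for the check.
SAVE_BUTTON_XPATHS = [
    "//button[contains(@class, 'save-to-list-dropdown__trigger')]",  # version 1
    "//button[contains(@class, 'some-other-save-class')]",           # version 2 (placeholder)
    "//button[normalize-space()='Save']",                            # version 3 (placeholder)
]

def find_save_button(driver):
    """Return the save button of whichever page version loaded, or None."""
    for xpath in SAVE_BUTTON_XPATHS:
        matches = driver.find_elements_by_xpath(xpath)
        if matches:
            return matches[0]
    return None
```

If find_save_button returns None, the link can be recorded as failed instead of silently skipped.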
Here's the code:
import selenium.webdriver as webdriver
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
import csv
import random

options = webdriver.ChromeOptions()
options.add_argument('--lang=EN')
options.add_argument("--start-maximized")
prefs = {"profile.default_content_setting_values.notifications": 2}
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(executable_path='assets\chromedriver', chrome_options=options)
driver.get("https://www.linkedin.com/login?fromSignIn=true")

minDelay = input("\n Please provide min delay in seconds : ")
maxDelay = input("\n Please provide max delay in seconds : ")
listNumber = input("\n Please provide list number : ")
outputFile = input('\n save skipped as?: ')

count = 0
closed = 2
with open("links.txt", "r") as links:
    for link in links:
        try:
            driver.get(link.strip())
            sleep(3)
            driver.find_element_by_xpath("//button[@class='save-to-list-dropdown__trigger ph5 artdeco-button artdeco-button--primary artdeco-button--3 artdeco-button--pro artdeco-dropdown__trigger artdeco-dropdown__trigger--placement-bottom ember-view']").click()
            sleep(2)
            count += 1
            if count == 1:
                driver.find_element_by_xpath("//ul[@class='save-to-list-dropdown__content']//ul//li[" + str(listNumber) + "]").click()
            else:
                driver.find_element_by_xpath("//ul[@class='save-to-list-dropdown__content']//ul//li[1]").click()
            sleep(2)
            sleep(random.randint(int(minDelay), int(maxDelay)))
        except:
            if closed == 0:
                driver.close()
            sleep(1)
            fileOutput = open(outputFile + ".csv", mode='a', newline='', encoding='utf-8')
            file_writer = csv.writer(fileOutput, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            file_writer.writerow([link.strip()])
            fileOutput.close()
print("Finished.")

The common approach to implementing this sort of listener is to use EventFiringWebDriver. See the example here:
from selenium import webdriver
from selenium.webdriver.support.abstract_event_listener import AbstractEventListener
from selenium.webdriver.support.event_firing_webdriver import EventFiringWebDriver

class EventListener(AbstractEventListener):
    def before_click(self, element, driver):
        if element.tag_name == 'a':
            print('Clicking link:', element.get_attribute('href'))

if __name__ == '__main__':
    driver = EventFiringWebDriver(driver=webdriver.Firefox(), event_listener=EventListener())
    driver.get("https://webelement.click/en/welcome")
    link = driver.find_element_by_xpath('//a[text()="All Posts"]')
    link.click()
    driver.quit()
UPD:
Basically your case does not really need that listener. However, you could still use it. Say you have a links file like:
https://google.com
https://invalid.url
https://duckduckgo.com/
https://sadfsdf.sdf
https://stackoverflow.com
Then the way with EventFiringWebDriver would be:
from selenium import webdriver
from selenium.webdriver.support.abstract_event_listener import AbstractEventListener
from selenium.webdriver.support.event_firing_webdriver import EventFiringWebDriver

broken_urls = []

class EventListener(AbstractEventListener):
    def on_exception(self, exception, drv):
        broken_urls.append(drv.current_url)

if __name__ == '__main__':
    driver = EventFiringWebDriver(driver=webdriver.Firefox(), event_listener=EventListener())
    with open("links.txt", "r") as links:
        for link in links:
            try:
                driver.get(link.strip())
            except:
                print('Cannot reach the link', link.strip())
    print("Finished.")
    driver.quit()

    import csv
    with open('broken_urls.csv', 'w', newline='') as broken_urls_csv:
        wr = csv.writer(broken_urls_csv, quoting=csv.QUOTE_ALL)
        wr.writerow(broken_urls)
and without EventFiringWebDriver would be:
broken_urls = []

if __name__ == '__main__':
    from selenium import webdriver

    driver = webdriver.Firefox()
    with open("links.txt", "r") as links:
        for link in links:
            stripped_link = link.strip()
            try:
                driver.get(stripped_link)
            except:
                print('Cannot reach the link', stripped_link)
                broken_urls.append(stripped_link)
    print("Finished.")
    driver.quit()

    import csv
    with open('broken_urls.csv', 'w', newline='') as broken_urls_csv:
        wr = csv.writer(broken_urls_csv, quoting=csv.QUOTE_ALL)
        wr.writerow(broken_urls)
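To address the original goal of seeing which of the 500-1000 URLs were saved and which were not, an alternative to collecting only broken URLs is to write one status row per URL. A minimal sketch; save_profile is a hypothetical helper standing in for the actual button clicks:

```python
import csv

def log_status(writer, url, status, reason=""):
    # One row per processed URL, so nothing silently disappears from the log.
    writer.writerow([url, status, reason])

# Assumed usage inside the crawl loop (driver and save_profile are not defined here):
#
# with open("run_log.csv", "w", newline="", encoding="utf-8") as f:
#     writer = csv.writer(f)
#     writer.writerow(["url", "status", "reason"])
#     for link in links:
#         url = link.strip()
#         try:
#             driver.get(url)
#             save_profile(driver)  # hypothetical: clicks Save and picks a list
#             log_status(writer, url, "saved")
#         except Exception as exc:
#             log_status(writer, url, "failed", str(exc))
```

After a run, filtering the log for "failed" rows gives exactly the URLs that need a retry.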

Related

How do I make the driver navigate to new page in selenium python

I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
open the LinkedIn page, enter id and password, and log in
open https://linkedin.com/jobs and enter the search keyword and location, then click Search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to some POST information missing from the previous page)
the click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)

def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"

login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried searching for elements from the results page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in Advance
UPDATE
It works if I try to directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
The difference between these links is that one of them takes only one value for experience level while the other takes two values. This means it's probably not a POST-values issue.
You are getting and printing the current URL immediately after clicking the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class.
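To actually wait, pair WebDriverWait with .until() and an expected condition, e.g. EC.url_contains("/jobs/search"). The polling loop it performs can be illustrated with a simplified stand-in (wait_until below is my own sketch, not Selenium's API):

```python
import time

def wait_until(predicate, timeout=10, poll=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` expires,
    mimicking what WebDriverWait(driver, timeout).until(condition) does."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %s seconds" % timeout)

# With a real driver this would look like (assumed usage):
# wait_until(lambda: "/jobs/search" in driver.current_url, timeout=20)
# print(driver.current_url)  # now reflects the search results page
```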

Python and Selenium: I am automating web scraping among pages. How can I loop by Next button?

I have already written several lines of code to pull URLs from this website:
http://www.worldhospitaldirectory.com/United%20States/hospitals
code is below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv

driver = webdriver.Firefox()
driver.get('http://www.worldhospitaldirectory.com/United%20States/hospitals')
url = []
pagenbr = 1

while pagenbr <= 115:
    current = driver.current_url
    driver.get(current)
    lks = driver.find_elements_by_xpath('//*[@href]')
    for ii in lks:
        link = ii.get_attribute('href')
        if '/info' in link:
            url.append(link)
    print('page ' + str(pagenbr) + ' is done.')
    if pagenbr <= 114:
        elm = driver.find_element_by_link_text('Next')
        driver.implicitly_wait(10)
        elm.click()
    time.sleep(2)
    pagenbr += 1

ls = list(set(url))
with open('US_GeneralHospital.csv', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for u in ls:
        wr.writerow([u])
And it worked very well to pull each individual link from this website.
But the problem is that I need to change the hardcoded page count in the loop myself every time.
I want to upgrade this code so it iterates by working out how many times it needs to run, not by manual input.
Thank you very much.
It is a bad idea to hardcode the number of pages in your script. Try just clicking the "Next" button while it is enabled:
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        # do whatever you need to do on page
        driver.find_element_by_xpath('//li[not(@class="disabled")]/span[text()="Next"]').click()
    except NoSuchElementException:
        break
This should allow you to keep scraping pages until the last page is reached.
Also note that the lines current = driver.current_url and driver.get(current) make no sense at all, so you can skip them.

Checking the clickability of an element in selenium using python

I've been trying to write a script which will give me all the links to the episodes present on this page :- http://www.funimation.com/shows/assassination-classroom/videos/episodes
As you can see, the links can be seen in the 'Outer HTML', so I used Selenium and PhantomJS with Python.
Link Example: http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time
However, I can't seem to get my code right. I do have a basic idea of what I want to do. Here's the process:
1.) Copy the Outer HTML of the very first page and save it as a 'source_code.html' file.
2.) Look for links inside this file.
3.) Move to the next page to see the rest of the videos and their links.
4.) Repeat step 2.
This is what my code looks like :
from selenium import webdriver
from selenium import selenium
from bs4 import BeautifulSoup
import time

# ---------------------------------------------------------------------------------------------

driver = webdriver.PhantomJS()
driver.get('http://www.funimation.com/shows/assassination-classroom/videos/episodes')
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
f = open('source_code.html', 'w')
f.write(source_code.encode('utf-8'))
f.close()

print 'Links On First Page Are : \n'
soup = BeautifulSoup('source_code.html')
subtitles = soup.find_all('div', {'class': 'popup-heading'})
official = 'something'
for official in subtitles:
    x = official.findAll('a')
    for a in x:
        print a['href']

sbtn = driver.find_element_by_link_text(">")
print sbtn
print 'Entering The Loop Now'
for driver.find_element_by_link_text(">"):
    sbtn.click()
    time.sleep(3)
    elem = driver.find_element_by_xpath("//*")
    source_code = elem.get_attribute("outerHTML")
    f = open('source_code1.html', 'w')
    f.write(source_code.encode('utf-8'))
    f.close()
Things I already know:
soup = BeautifulSoup('source_code.html') won't work, because I need to open this file via Python and feed its contents into BeautifulSoup. That I can manage.
The official variable isn't really doing anything; it's just helping me start a loop.
for driver.find_element_by_link_text(">"):
Now, this is what I need to fix somehow. I'm not sure how to check whether this element is still clickable. If it is, proceed to the next page, get the links, click it again to go to page 3, and repeat the process.
Any help would be appreciated.
You don't need to use BeautifulSoup here at all. Just grab all the links via Selenium, and proceed to the next page only if the > link is visible. Here is the complete implementation, including gathering the links and the necessary waits. It should work for any page count:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://www.funimation.com/shows/assassination-classroom/videos/episodes")

wait = WebDriverWait(driver, 10)
links = []
while True:
    # wait for the page to load
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.item-title")))

    # wait until the loading circle becomes invisible
    wait.until(EC.invisibility_of_element_located((By.ID, "loadingCircle")))

    links.extend([link.get_attribute("href") for link in driver.find_elements_by_css_selector("a.item-title")])
    print("Parsing page number #" + driver.find_element_by_css_selector("a.jp-current").text)

    # click next
    next_link = driver.find_element_by_css_selector("a.next")
    if not next_link.is_displayed():
        break
    next_link.click()
    time.sleep(1)  # hardcoded delay

print(len(links))
print(links)
For the URL mentioned in the question, it prints:
Parsing page number #1
Parsing page number #2
93
['http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assassination-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/baseball-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/grown-up-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/assembly-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/test-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time1st-period', 
'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/school-trip-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/l-and-r-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/transfer-student-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/ball-game-tournament-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/talent-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 
'http://www.funimation.com/shows/assassination-classroom/videos/official/vision-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/end-of-term-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/schools-out1st-term', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/island-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/action-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/pandemonium-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/assassination-classroom/videos/official/karma-time2nd-period', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/riddle-story-of-devil', 
'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/xx-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/nagisa-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/summer-festival-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/kaede-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/itona-horibe-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/spinning-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 'http://www.funimation.com/shows/assassination-classroom/videos/official/leader-time', 
'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/deadman-wonderland', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/riddle-story-of-devil', 'http://www.funimation.com/shows/soul-eater', 'http://www.funimation.com/shows/soul-eater']
Basically, I use webelement.is_displayed() to check whether it is clickable or not:
isLinkDisplay = driver.find_element_by_link_text(">").is_displayed()
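Selenium's expected_conditions.element_to_be_clickable goes a step further than is_displayed(): it also requires the element to be enabled. A rough pure-Python approximation of that combined check:

```python
def is_clickable(element):
    # Approximation of EC.element_to_be_clickable: the element must be
    # both visible and enabled before a click can be expected to work.
    return element.is_displayed() and element.is_enabled()
```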

Python Selenium Scrape Hidden Data

I'm trying to scrape the following page (just page 1 for the purpose of this question):
https://www.sportstats.ca/display-results.xhtml?raceid=4886
I can use Selenium to grab the source and then parse it, but not all of the data I'm looking for is in the source. Some of it needs to be found by clicking on elements.
For example, for the first person I can get all the visible fields from the source. But if you click the +, there is more data I'd like to scrape. For example, the "Chip Time" (01:15:29.9), and also the City (Oakville) that pops up on the right after clicking the + for a person.
I don't know how to identify the element that needs to be clicked to expand the +, then even after clicking it, I don't know how to find the values I'm looking for.
Any tips would be great.
Here is sample code for your requirement. It is based on Python and Selenium with the Chrome driver executable.
from selenium import webdriver
from lxml.html import tostring, fromstring
import time
import csv

myfile = open('demo_detail.csv', 'wb')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
driver = webdriver.Chrome('./chromedriver.exe')
csv_heading = ["", "", "BIB", "NAME", "CATEGORY", "RANK", "GENDER PLACE", "CAT. PLACE", "GUN TIME", "SPLIT NAME", "SPLIT DISTANCE", "SPLIT TIME", "PACE", "DISTANCE", "RACE TIME", "OVERALL (/814)", "GENDER (/431)", "CATEGORY (/38)", "TIME OF DAY"]
wr.writerow(csv_heading)
count = 0
try:
    url = "https://www.sportstats.ca/display-results.xhtml?raceid=4886"
    driver.get(url)
    table_tr = driver.find_elements_by_xpath("//table[@class='results overview-result']/tbody/tr[@role='row']")
    for tr in table_tr:
        lst = []
        count = count + 1
        table_td = tr.find_elements_by_tag_name("td")
        for td in table_td:
            lst.append(td.text)
        table_td[1].find_element_by_tag_name("div").click()
        time.sleep(5)
        table = driver.find_elements_by_xpath("//div[@class='ui-datatable ui-widget']")
        for demo_tr in driver.find_elements_by_xpath("//tr[@class='ui-expanded-row-content ui-widget-content view-details']/td/div/div/table/tbody/tr"):
            for demo_td in demo_tr.find_elements_by_tag_name("td"):
                lst.append(demo_td.text)
        wr.writerow(lst)
        table_td[1].find_element_by_tag_name("div").click()
        time.sleep(5)
        print count
    time.sleep(5)
    driver.quit()
except Exception as e:
    print e
    driver.quit()

Scraping a dynamically/Javascript generated website with Python/Selenium

I'm trying to scrape this website:
http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210
using Python and Selenium (see code below). The content is dynamically generated, and apparently data which is not visible in the browser is not loaded. I have tried making the browser window larger, and scrolling to the bottom of the page. Enlarging the window gets me all the data I want in the horizontal direction, but there is still plenty of data to scrape in the vertical direction. The scrolling appears not to work at all.
Does anyone have any bright ideas about how to do this?
Thanks!
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re
import csv

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait to load

soup = BeautifulSoup(driver.page_source)
table = soup.find("table", {"id": "DataTable"})

### get data
thead = table.find('tbody')
loopRows = thead.findAll('tr')
rows = []
for row in loopRows:
    rows.append([val.text.encode('ascii', 'ignore') for val in row.findAll(re.compile('td|th'))])

with open("body.csv", 'wb') as test_file:
    file_writer = csv.writer(test_file)
    for row in rows:
        file_writer.writerow(row)
This will get you as far as autosaving the entire csv to disk, but I haven't found a robust way to determine when the download is complete:
import os
import contextlib
import selenium.webdriver as webdriver
import csv
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
download_dir = '/tmp'

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.dir", download_dir)
# 2 means "use the last folder specified for a download"
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

# driver = webdriver.Firefox(firefox_profile=fp)
with contextlib.closing(webdriver.Firefox(firefox_profile=fp)) as driver:
    driver.get(url)
    driver.execute_script("onDownload(2);")
    csvfile = os.path.join(download_dir, 'download.csv')
    # Wait for the download to complete
    time.sleep(10)
    with open(csvfile, 'rb') as f:
        for line in csv.reader(f, delimiter=','):
            print(line)
Explanation:
Point your browser to url.
You'll see there is an Actions menu with an option to Download report data... and a suboption entitled "Comma-delimited ASCII format (*.csv)". If you inspect the HTML for these words you'll find
"Comma-delimited ASCII format (*.csv)","","javascript:onDownload(2);"
So it follows naturally that you might try getting webdriver to execute the JavaScript function call onDownload(2). We can do that with
driver.execute_script("onDownload(2);")
but normally another window will then pop up asking if you want to save the file. To automate the saving to disk, I used the method described in this FAQ. The tricky part is finding the correct MIME type to specify on this line:
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")
The curl method described in the FAQ does not work here since we do not have a url for the csv file. However, this page describes another way to find the MIME type: Use a Firefox browser to open the save dialog. Check the checkbox saying "Do this automatically for files like this". Then inspect the last few lines of ~/.mozilla/firefox/*/mimeTypes.rdf for the most recently added description:
<RDF:Description RDF:about="urn:mimetype:handler:application/x-csv"
                 NC:alwaysAsk="false"
                 NC:saveToDisk="true">
    <NC:externalApplication RDF:resource="urn:mimetype:externalApplication:application/x-csv"/>
</RDF:Description>
This tells us the mime type is "application/x-csv". Bingo, we are in business.
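As noted above, there is no robust built-in signal for when the download finishes. One common heuristic is to poll the file until it exists and its size stops changing. This is an assumption-based sketch, not a Firefox or Selenium API:

```python
import os
import time

def wait_for_download(path, timeout=60, poll=1.0, stable_checks=3):
    """Return True once `path` exists and its size has stayed unchanged for
    `stable_checks` consecutive polls; False if `timeout` expires first."""
    deadline = time.time() + timeout
    last_size, stable = -1, 0
    while time.time() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size == last_size and size > 0:
                stable += 1
                if stable >= stable_checks:
                    return True
            else:
                stable = 0
            last_size = size
        time.sleep(poll)
    return False

# Assumed usage, replacing the hardcoded time.sleep(10):
# wait_for_download(os.path.join(download_dir, 'download.csv'), timeout=60)
```

Note that Firefox writes downloads to a temporary .part file first, so size stability is only a heuristic, not a guarantee.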
You can do the scrolling by
self.driver.find_element_by_css_selector("html body.TVTableBody table#pageTable tbody tr td#cell4 table#MainTable tbody tr td#vScrollTD img[onmousedown='imgClick(this.sbar.visible,this,event);']").click()
It seems like once you can scroll, the scraping should be pretty standard, unless I'm missing something.
