I'm trying to scrape this website:
http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210
using Python and Selenium (see code below). The content is dynamically generated, and apparently data that is not visible in the browser is not loaded. I have tried making the browser window larger and scrolling to the bottom of the page. Enlarging the window gets me all the data I want in the horizontal direction, but there is still plenty of data to scrape in the vertical direction, and the scrolling appears not to work at all.
Does anyone have any bright ideas about how to do this?
Thanks!
from selenium import webdriver
from bs4 import BeautifulSoup
import csv
import re
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"

driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the page to load

soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find("table", {"id": "DataTable"})

### get data
tbody = table.find('tbody')
loopRows = tbody.findAll('tr')
rows = []
for row in loopRows:
    # strip non-ASCII characters from each cell
    rows.append([val.text.encode('ascii', 'ignore').decode('ascii')
                 for val in row.findAll(re.compile('td|th'))])

with open("body.csv", 'w', newline='') as test_file:
    file_writer = csv.writer(test_file)
    for row in rows:
        file_writer.writerow(row)
This will get you as far as automatically saving the entire CSV to disk, but I haven't found a robust way to determine when the download is complete:
import os
import contextlib
import selenium.webdriver as webdriver
import csv
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
download_dir = '/tmp'

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.dir", download_dir)
# 2 means "use the last folder specified for a download"
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

with contextlib.closing(webdriver.Firefox(firefox_profile=fp)) as driver:
    driver.get(url)
    driver.execute_script("onDownload(2);")
    csvfile = os.path.join(download_dir, 'download.csv')
    # Wait for the download to complete
    time.sleep(10)
    with open(csvfile, 'r', newline='') as f:
        for line in csv.reader(f, delimiter=','):
            print(line)
Explanation:
Point your browser to url.
You'll see there is an Actions menu with an option to Download report data... and a suboption entitled "Comma-delimited ASCII format (*.csv)". If you inspect the HTML for these words you'll find
"Comma-delimited ASCII format (*.csv)","","javascript:onDownload(2);"
So it follows naturally that you might try getting webdriver to execute the JavaScript function call onDownload(2). We can do that with
driver.execute_script("onDownload(2);")
but normally another window will then pop up asking if you want to save the file. To automate the saving-to-disk, I used the method described in this FAQ. The tricky part is finding the correct MIME type to specify on this line:
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")
The curl method described in the FAQ does not work here, since we do not have a URL for the CSV file. However, this page describes another way to find the MIME type: use a Firefox browser to open the save dialog, check the box saying "Do this automatically for files like this", then inspect the last few lines of ~/.mozilla/firefox/*/mimeTypes.rdf for the most recently added description:
<RDF:Description RDF:about="urn:mimetype:handler:application/x-csv"
NC:alwaysAsk="false"
NC:saveToDisk="true">
<NC:externalApplication RDF:resource="urn:mimetype:externalApplication:application/x-csv"/>
</RDF:Description>
This tells us the mime type is "application/x-csv". Bingo, we are in business.
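To avoid the blind time.sleep(10), one option is to poll the target file until it exists and its size stops changing. This is only a sketch under my own assumptions: the filename download.csv comes from the code above, the helper name and timings are mine, and the .part check relies on Firefox writing a temporary .part file alongside an in-progress download.
import os
import time

def wait_for_download(path, timeout=60, poll=0.5):
    # Heuristic: the file exists, no Firefox .part file remains,
    # and the size has been stable across two consecutive polls.
    deadline = time.time() + timeout
    last_size = -1
    while time.time() < deadline:
        if os.path.exists(path) and not os.path.exists(path + '.part'):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return
            last_size = size
        time.sleep(poll)
    raise RuntimeError('timed out waiting for %s' % path)

wait_for_download(os.path.join(download_dir, 'download.csv'))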
You can do the scrolling with
driver.find_element_by_css_selector("html body.TVTableBody table#pageTable tbody tr td#cell4 table#MainTable tbody tr td#vScrollTD img[onmousedown='imgClick(this.sbar.visible,this,event);']").click()
It seems like once you can scroll, the scraping should be pretty standard, unless I'm missing something.
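Here is a rough sketch of such a loop, assuming the scroll-arrow selector above works (abbreviated here to its tail end) and reusing the BeautifulSoup parsing from the question. The stop condition (last row unchanged after a batch of clicks) is my own guess at how to detect the bottom of the table.
from bs4 import BeautifulSoup

last_row_text = None
while True:
    for _ in range(20):  # click the down arrow a batch at a time
        driver.find_element_by_css_selector(
            "td#vScrollTD img[onmousedown^='imgClick']").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    rows = soup.find("table", {"id": "DataTable"}).find_all("tr")
    if rows and rows[-1].get_text() == last_row_text:
        break  # nothing new appeared; assume we hit the bottom
    last_row_text = rows[-1].get_text() if rows else None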
I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
open the LinkedIn page, enter the id and password, and log in
open https://linkedin.com/jobs and enter the search keyword and location, then click Search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably because some POST data from the previous page is missing)
the click opens the job-search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source

stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)

def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"

login()
jobs_url = 'https://www.linkedin.com/jobs/'
driver.get(jobs_url)

keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)

# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the page content to a file, and it is indeed from the previous jobs page, not from the search page. I have also tried to find elements on the results page, like the experience filter and the Easy Apply button, but those lookups end in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in Advance
UPDATE
It works if try to directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
The difference between these links is that one takes a single value for the experience level while the other takes two. This means it's probably not an issue with POST values.
You are getting and printing the current URL immediately after clicking on the search button, before the page changed with the response received from the server.
This is why it outputs you with https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; you still have to call its until() method with an expected condition, as in the sketch below.
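A minimal sketch of how to actually block until the navigation happens, pairing WebDriverWait with a condition from expected_conditions (the 20-second timeout is arbitrary):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

old_url = driver.current_url
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
# Blocks until the URL differs from old_url, or raises TimeoutException.
WebDriverWait(driver, 20).until(EC.url_changes(old_url))
print(driver.current_url)  # now reflects the search-results page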
I am relatively new to Python and the Stack Overflow community as well. I am using Selenium to web-scrape https://freightliner.com/dealer-search/ for dealership names and addresses in North/South America, and I have been able to print them as a single string with no problems, but I cannot figure out how to export them to a CSV file. The difference is that my code prints the name and address as a single string delimited by a semicolon, whereas I want to export them to a CSV as separate columns (name, address). The following is what I have tried:
#! python3
# fl_dealers.py - Scrapes freightliner website for north american locations.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time, os, csv
from bs4 import BeautifulSoup

# set Chrome options to automatically download file
options = webdriver.ChromeOptions()
prefs = {'download.default_directory': r'C:\Users\username\Downloads\\'}
options.add_experimental_option('prefs', prefs)
chromedriver = 'C:/Users/username/chromedriver.exe'

# change directory to Downloads folder
os.chdir("C:\\Users\\username\\Downloads")

# create webdriver object and call Chrome options
browser = webdriver.Chrome(executable_path=chromedriver, options=options)

# maximize the browser window
browser.maximize_window()

# set wait time to allow browser to open
browser.implicitly_wait(10)  # seconds

# open freightliner website
browser.get('https://freightliner.com/dealer-search/')
time.sleep(5)

# find all locations in north america
search = browser.find_element_by_xpath('//*[@id="by-location"]/div/div/input')
ActionChains(browser).move_to_element(search).click().key_down(Keys.CONTROL).send_keys('a').key_up(Keys.CONTROL).send_keys("USA").perform()
#search.send_keys('USA')
search_button = browser.find_element_by_xpath('//*[@id="by-location"]/button').click()
time.sleep(10)

# create variable for webpage AFTER searching for results
page_source = browser.page_source

# create bs4 object
soup = BeautifulSoup(page_source, 'html.parser')

# create variables for dealer name and address
names = soup.find_all('h2')[1:]
addresses = soup.find_all(class_='address')

# print the names and addresses
for name, address in zip(names, addresses):
    print(name.get_text(separator=" ").strip(), ";", address.get_text(separator=", ").strip())

with open('fl_dealers.csv', mode='w', newline='') as outputFile:
    dealershipsCSV = csv.writer(outputFile)
    dealershipsCSV.writerow(['name', 'address'])

for name in names:
    dealer_name = name.get_text

for address in addresses:
    dealer_address = address.get_text

dealershipsCSV.writerow([dealer_name, dealer_address])
The code does create a CSV file, but it only creates the column headers and does not export any of the actual names and addresses. I have searched numerous Stack Overflow, GitHub, and YouTube posts related to the issue, but have not been able to find a solution. I have reached the limit of my knowledge thus far. There is a high likelihood that I am missing something very simple. Alas, I am still new to Python.
One thing to note - The reasoning for entering "USA" in the search bar is to override the website's default of using my location to search for nearby dealers. Even though the query is for "USA", it returns all North/South American dealers which is what I want.
Any and all help is greatly appreciated! Thank you.
I guess your main problem is that you should write the names and addresses inside a loop, iterating over the scraped data, and write the actual values rather than the literal header strings.
Also, you should use append mode, not write mode.
Please try this:
from csv import writer

with open('fl_dealers.csv', 'a', newline='') as outputFile:
    writer_object = writer(outputFile)
    for name, address in zip(names, addresses):
        writer_object.writerow([name.get_text(separator=" ").strip(),
                                address.get_text(separator=", ").strip()])
Problem description
I am working on Ubuntu 16.04.
I want to download CSV files from a website. They are presented as links. When I click a link, I want it to open in a new tab, which will download the file. I used the solution provided in https://gist.github.com/lrhache/7686903.
setup
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",download_path)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/csv")
# create a selenium webdriver
browser = webdriver.Firefox(firefox_profile=fp)
# open QL2 website
browser.get('http://live.ql2.com')
code
csvList = browser.find_elements_by_class_name("csv")
for l in csvList:
    if 'error' not in l.text and 'after' not in l.text:
        l.send_keys(Keys.CONTROL + 't')
Each element l is represented as follows:
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="9003fc6a-d8be-472b-bced-94fffdb5fdbe", element="27e1638a-0e37-411d-8d30-896c15711b49")>
Question
Why am I not able to open a new tab? Is there something missing?
The problem seems to be that you are just making a new tab, not opening the link in a new tab.
Try using ActionChains:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# create browser as detailed in OP's setup
key_code = Keys.CONTROL

csvList = browser.find_elements_by_class_name("csv")
for l in csvList:
    if 'error' not in l.text and 'after' not in l.text:
        actions = webdriver.ActionChains(browser)
        # Hold down the key specified in key_code
        actions.key_down(key_code)
        # Click the link
        actions.click(l)
        # Release the key specified in key_code
        actions.key_up(key_code)
        # Perform the actions specified in the ActionChain
        actions.perform()
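Note that a Ctrl+click leaves focus on the original tab. If you also need to interact with the new tab, you can switch to it by comparing window handles; a sketch wrapping the perform() call from above, with the wait timeout chosen arbitrarily:
from selenium.webdriver.support.ui import WebDriverWait

old_handles = set(browser.window_handles)
actions.perform()  # the Ctrl+click from above
# Wait for the new tab to register, then switch to it.
WebDriverWait(browser, 10).until(
    lambda b: len(b.window_handles) > len(old_handles))
new_handle = (set(browser.window_handles) - old_handles).pop()
browser.switch_to.window(new_handle)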
I'm trying to scrape the following page (just page 1 for the purpose of this question):
https://www.sportstats.ca/display-results.xhtml?raceid=4886
I can use Selenium to grab the source and then parse it, but not all of the data I'm looking for is in the source. Some of it needs to be found by clicking on elements.
For example, for the first person I can get all the visible fields from the source. But if you click the +, there is more data I'd like to scrape. For example, the "Chip Time" (01:15:29.9), and also the City (Oakville) that pops up on the right after clicking the + for a person.
I don't know how to identify the element that needs to be clicked to expand the +, then even after clicking it, I don't know how to find the values I'm looking for.
Any tips would be great.
Here is sample code for your requirement. It is based on Python and Selenium with the Chrome driver executable.
from selenium import webdriver
from lxml.html import tostring, fromstring
import time
import csv

myfile = open('demo_detail.csv', 'w', newline='')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
driver = webdriver.Chrome('./chromedriver.exe')
csv_heading = ["", "", "BIB", "NAME", "CATEGORY", "RANK", "GENDER PLACE", "CAT. PLACE", "GUN TIME", "SPLIT NAME", "SPLIT DISTANCE", "SPLIT TIME", "PACE", "DISTANCE", "RACE TIME", "OVERALL (/814)", "GENDER (/431)", "CATEGORY (/38)", "TIME OF DAY"]
wr.writerow(csv_heading)
count = 0
try:
    url = "https://www.sportstats.ca/display-results.xhtml?raceid=4886"
    driver.get(url)
    table_tr = driver.find_elements_by_xpath("//table[@class='results overview-result']/tbody/tr[@role='row']")
    for tr in table_tr:
        lst = []
        count = count + 1
        table_td = tr.find_elements_by_tag_name("td")
        for td in table_td:
            lst.append(td.text)
        # click the + to expand the row with chip time, city, etc.
        table_td[1].find_element_by_tag_name("div").click()
        time.sleep(5)
        table = driver.find_elements_by_xpath("//div[@class='ui-datatable ui-widget']")
        for demo_tr in driver.find_elements_by_xpath("//tr[@class='ui-expanded-row-content ui-widget-content view-details']/td/div/div/table/tbody/tr"):
            for demo_td in demo_tr.find_elements_by_tag_name("td"):
                lst.append(demo_td.text)
        wr.writerow(lst)
        # click again to collapse the row before moving on
        table_td[1].find_element_by_tag_name("div").click()
        time.sleep(5)
        print(count)
    time.sleep(5)
    driver.quit()
except Exception as e:
    print(e)
    driver.quit()
I am working with Python and Selenium. I want to download files from a click event using Selenium. I wrote the following code.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.close()
I want to download both files from the links named "Export Data" on the given URL. How can I achieve this, since it works only with a click event?
Find the link using find_element(s)_by_*, then call its click method.
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.find_element_by_id('exportpt').click()
browser.find_element_by_id('exporthlgt').click()
Added profile manipulation code to prevent download dialog.
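For what it's worth, a rough Chrome equivalent of this Firefox profile (not part of the original answer) uses experimental preferences:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    'download.default_directory': '/tmp',   # where downloads land
    'download.prompt_for_download': False,  # skip the save dialog
})
browser = webdriver.Chrome(options=options)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")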
I'll admit this solution is a little more "hacky" than the Firefox Profile saveToDisk alternative, but it works across both Chrome and Firefox, and doesn't rely on a browser-specific feature which could change at any time. And if nothing else, maybe this will give someone a little different perspective on how to solve future challenges.
Prerequisites: Ensure you have selenium and pyvirtualdisplay installed...
Python 2: sudo pip install selenium pyvirtualdisplay
Python 3: sudo pip3 install selenium pyvirtualdisplay
The Magic
import pyvirtualdisplay
import selenium
import selenium.webdriver
import time
import base64
import json

root_url = 'https://www.google.com'
download_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'

print('Opening virtual display')
display = pyvirtualdisplay.Display(visible=0, size=(1280, 1024,))
display.start()
print('\tDone')

print('Opening web browser')
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome() # Alternately, give Chrome a try
print('\tDone')

print('Retrieving initial web page')
driver.get(root_url)
print('\tDone')

print('Injecting retrieval code into web page')
driver.execute_script("""
    window.file_contents = null;
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'blob';
    xhr.onload = function() {
        var reader = new FileReader();
        reader.onloadend = function() {
            window.file_contents = reader.result;
        };
        reader.readAsDataURL(xhr.response);
    };
    xhr.open('GET', %(download_url)s);
    xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
    'download_url': json.dumps(download_url),
})

print('Looping until file is retrieved')
downloaded_file = None
while downloaded_file is None:
    # Returns the file retrieved base64 encoded (perfect for downloading binary)
    downloaded_file = driver.execute_script('return (window.file_contents !== null ? window.file_contents.split(\',\')[1] : null);')
    print(downloaded_file)
    if not downloaded_file:
        print('\tNot downloaded, waiting...')
        time.sleep(0.5)
print('\tDone')

print('Writing file to disk')
fp = open('google-logo.png', 'wb')
fp.write(base64.b64decode(downloaded_file))
fp.close()
print('\tDone')

driver.close()  # close web browser, or it'll persist after python exits.
display.popen.kill()  # close virtual display, or it'll persist after python exits.
Explanation
We first load a URL on the domain we're targeting a file download from. This allows us to perform an AJAX request on that domain, without running into cross site scripting issues.
Next, we inject some JavaScript into the DOM which fires off an AJAX request. Once the AJAX request returns a response, we take the response and load it into a FileReader object. From there we can extract the base64-encoded content of the file by calling readAsDataURL(). We then take the base64-encoded content and append it to window, a globally accessible variable.
Finally, because the AJAX request is asynchronous, we enter a Python while loop waiting for the content to be appended to the window. Once it's appended, we decode the base64 content retrieved from the window and save it to a file.
This solution should work across all modern browsers supported by Selenium, for both text and binary files, and across all MIME types.
Alternate Approach
While I haven't tested this, Selenium does afford you the ability to wait until an element is present in the DOM. Rather than looping until a globally accessible variable is populated, you could create an element with a particular ID in the DOM and use the binding of that element as the trigger to retrieve the downloaded file.
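A sketch of that idea, with the marker id and the extra JavaScript entirely my own invention: have reader.onloadend append a marker element, then lean on Selenium's explicit wait instead of the Python polling loop.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Add to the injected JS, inside reader.onloadend (hypothetical marker id):
#     var el = document.createElement('div');
#     el.id = 'download-complete-marker';
#     document.body.appendChild(el);
# Then replace the while loop with an explicit wait:
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'download-complete-marker')))
downloaded_file = driver.execute_script(
    "return window.file_contents.split(',')[1];")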
In Chrome, what I do is download the files by clicking on the links, then open the chrome://downloads page and retrieve the list of downloaded files from the shadow DOM like this:
docs = document
.querySelector('downloads-manager')
.shadowRoot.querySelector('#downloads-list')
.getElementsByTagName('downloads-item')
This solution is restricted to Chrome, and the data also contains information like the file path and download date. (Note this code is JavaScript, not Python; a Python wrapper is sketched below.)
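As a sketch, the same query can be driven from Python via execute_script; the downloads-item data fields (e.g. file_path) are internal to Chrome's downloads page and may change between versions, so treat them as assumptions:
driver.get('chrome://downloads')
# Run the shadow-DOM query from above and pull out each item's file path.
paths = driver.execute_script("""
    var items = document.querySelector('downloads-manager')
        .shadowRoot.querySelector('#downloads-list')
        .getElementsByTagName('downloads-item');
    return Array.from(items).map(function(i) { return i.data.file_path; });
""")
print(paths)  # list of paths of downloaded files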
Here is the full working code. You can use Selenium to enter the username, password, and other fields. To get the names of the fields appearing on the webpage, use Inspect Element. An element (username, password, or a button to click) can be located by class or name.
from selenium import webdriver

# Using Chrome to access web
options = webdriver.ChromeOptions()
# Set the download path via Chrome preferences
# (add_argument does not set the download directory)
options.add_experimental_option('prefs', {'download.default_directory': 'C:/Test'})
driver = webdriver.Chrome(options=options)

# Open the website
try:
    driver.get('xxxx')  # Your Website Address
    password_box = driver.find_element_by_name('password')
    password_box.send_keys('xxxx')  # Password
    download_button = driver.find_element_by_class_name('link_w_pass')
    download_button.click()
    driver.quit()
except Exception:
    driver.quit()
    print("Faulty URL")