Very novice-level question: what's the easiest way of opening several websites one by one after reading the site names from an external file?
In the example below, I want to replace the web URL and the screenshot file name with values read from the file in the same way.
Example script:
from selenium import webdriver
driver = webdriver.Ie(...driverpath)  # path to the IE driver
driver.get("https://www.facebook.com")
driver.get_screenshot_as_file("facebook.png")
driver.quit()
Try this. I have used JSON for storing the websites; a simple text file will do as well:
import json
from selenium.webdriver import Chrome

with open('path to json file', encoding='utf-8') as s:
    data = json.loads(s.read())

for site in data['sites']:
    driver = Chrome('path to chrome driver')
    driver.get(data['sites'][site])
    driver.get_screenshot_as_file(site + '.png')
    driver.close()
JSON file:
{
    "sites": {
        "facebook": "https://www.facebook.com/",
        "google": "https://www.google.com/",
        "wikipedia": "https://www.wikipedia.org/"
    }
}
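Since a simple text file will do as well, here is a minimal sketch of the same loop reading one URL per line; the sites.txt file name and the way the screenshot name is derived from the domain are assumptions:
from selenium.webdriver import Chrome

# sites.txt is assumed to contain one URL per line, e.g. https://www.facebook.com/
with open('sites.txt', encoding='utf-8') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    driver = Chrome('path to chrome driver')
    driver.get(url)
    # derive a screenshot name from the domain, e.g. "facebook.png"
    name = url.split('//')[-1].split('/')[0].replace('www.', '').split('.')[0]
    driver.get_screenshot_as_file(name + '.png')
    driver.close()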
I've got a collection of URLs in a CSV file and I want to loop through these links, opening each link in the CSV one at a time. I'm getting several different errors depending on what I try, but I can't get the browser to open the links. The print shows that the links are there.
When I run my code I get the following error:
Traceback (most recent call last):
File "/Users/Main/PycharmProjects/ScrapingBot/classpassgiit.py", line 26, in <module>
open = browser.get(link_loop)
TypeError: Object of type bytes is not JSON serializable
Can someone help me with my code below if I am missing something or doing it wrong?
My code:
import csv
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
import requests

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []
with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(links)
    for link_loop in url_html:
        print(contents)
        open = browser.get(link_loop)
Apparently, you are mixing something up with the names. Without a copy of the .csv file I cannot reproduce the error, so I will assume that you correctly extract the links from the text file.
In the second part of your code, you call requests.get on links (mind the plural), but links is just the last element you defined in the previous section (links = row[0]), whereas link is the object you actually define in the for loop. Below you can find a version of the code that might be a helpful starting point.
Let me add, though, that using requests and selenium together makes little sense in your context: why fetch an HTML page with requests only to loop over its contents and fetch other pages with selenium?
import csv
import requests
from selenium import webdriver  # needed for webdriver.Chrome below

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []
with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(link)  # now this is singular
    # Do what you have to do here with requests, in spite of using selenium
Since you have not shown what your variable contents holds, I will assume that it is a list of URL strings.
As @cap.py mentioned, you are tripping yourself up by using requests and selenium at the same time. When you make a GET web request, the server at the destination sends you a text response. This text can be plain text, like Hello world!, or it can be HTML, but that HTML code has to be interpreted by the computer that sent the request.
That's the point of selenium over requests: requests returns the text gathered from the destination (URL), while selenium asks a browser (e.g. Chrome) to gather the text and, if that text is HTML, to render it into a real, readable web page. Moreover, the browser runs the JavaScript inside the page, so dynamic pages work as well.
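To make the difference concrete, here is a minimal sketch (the URL is only an example): requests hands back the raw response text, while selenium returns the rendered page source.
import requests
from selenium import webdriver

url = 'https://www.example.com'

# requests: just the raw response text; no JavaScript is executed
raw_html = requests.get(url).text

# selenium: a real browser fetches the page, renders it, and runs its JavaScript
browser = webdriver.Chrome(executable_path=r'./chromedriver')
browser.get(url)
rendered_html = browser.page_source
browser.quit()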
In the end, the only thing you need to do to make your code run is this:
import csv
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
import requests

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []
with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

# link should be something like "https://www.classpass.com/studios/forever-body-coaching-london?search-id=49534025882004019"
for link in contents:
    browser.get(link)
    # paste the code you have here
Tip: don't forget that browsers take some time to load pages. Adding a time.sleep(3) after each get will help you a lot.
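If fixed sleeps feel too brittle, the WebDriverWait and expected_conditions imports already in your code can wait for a specific element instead; a minimal sketch, where waiting for the <body> tag is just an example:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for link in contents:
    browser.get(link)
    # wait up to 10 seconds for the page body instead of sleeping a fixed time
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body'))
    )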
I am having trouble downloading a txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (scroll down and you will see Download: txt, xls and xml).
My goal is to create a scraper that goes to the linked page, clicks on the txt link, for example, and saves the downloaded file.
The main problems I am not sure how to solve:
The file doesn't have a real link that I can call to download it; the link is created with JS based on the filters and the file type.
When I use the requests library for Python and call the link with all the headers, it just redirects me to https://www.ceps.cz/en/all-data.
Approaches tried:
Using a scraper such as ParseHub to download the link did not work as intended, though this scraper came closest to what I wanted.
Using the requests library to connect to the link with the headers the XHR request uses for downloading the file, but it just redirects me to https://www.ceps.cz/en/all-data.
If you could propose a solution for this task, thank you in advance. :-)
You can download this data to a directory of your choice with Selenium; you just need to specify the directory to which the data will be saved. In the example below, I'll save the txt data to my desktop:
from selenium import webdriver

download_dir = '/Users/doug/Desktop/'
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory': download_dir}
chrome_options.add_experimental_option('prefs', prefs)

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://www.ceps.cz/en/all-data')

container = driver.find_element_by_class_name('download-graph-data')
button = container.find_element_by_tag_name('li')
button.click()
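If you need a specific format rather than whichever button comes first, a hedged variant of the last three lines; the txt/xls/xml ordering of the list items is an assumption about the page:
container = driver.find_element_by_class_name('download-graph-data')
buttons = container.find_elements_by_tag_name('li')
buttons[0].click()  # assumed order: txt first, then xls, then xml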
You can do it like so:
import requests

txt_format = 'txt'
xls_format = 'xls'  # open in binary mode
xml_format = 'xml'  # open in binary mode

def download(file_type):
    url = f'https://www.ceps.cz/download-data/?format={file_type}'
    response = requests.get(url)
    if file_type == txt_format:
        with open(f'file.{file_type}', 'w') as file:
            file.write(response.text)
    else:
        with open(f'file.{file_type}', 'wb') as file:
            file.write(response.content)

download(txt_format)
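To grab all three formats in one go, reusing the function above:
for fmt in (txt_format, xls_format, xml_format):
    download(fmt)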
I am trying to create a script that automatically saves read-only PDFs via Chrome's printing functionality as new PDFs in the same folder, which removes the read-only restriction. However, while running the script I am not sure where I can specify my own destination folder, and the script saves the files directly to the Downloads folder.
Full props to https://stackoverflow.com/users/1432614/ross-smith-ii for the code below.
Any help will be very much appreciated.
import json
from selenium import webdriver

downloadPath = r'mypath\downloadPdf\\'

appState = {
    "recentDestinations": [
        {
            "id": "Save as PDF",
            "origin": "local"
        }
    ],
    "selectedDestinationId": "Save as PDF",
    "version": 2
}

profile = {'printing.print_preview_sticky_settings.appState': json.dumps(appState)}

chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', profile)
chrome_options.add_argument('--kiosk-printing')

driver = webdriver.Chrome(chrome_options=chrome_options)

pdfPath = r'mypath\protected.pdf'
driver.get(pdfPath)
driver.execute_script('window.print();')
OK, I think I figured out the solution. Just replace the profile line with the code below:
profile = {'printing.print_preview_sticky_settings.appState': json.dumps(appState),
           'savefile.default_directory': downloadPath}
It's still not ideal, as you cannot specify the new file name you want, but it works for now.
If anyone has a better solution, please do post it here. Thanks
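One possible workaround for the file-name limitation, sketched under the assumption that the freshly printed PDF is the newest file in downloadPath; the sleep duration and the target name are guesses:
import glob
import os
import time

time.sleep(5)  # give Chrome time to finish writing the printed PDF
newest = max(glob.glob(os.path.join(downloadPath, '*.pdf')), key=os.path.getctime)
os.rename(newest, os.path.join(downloadPath, 'unprotected.pdf'))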
I would like to download (using Python 3.4) all (.zip) files on the Google Patent Bulk Download Page http://www.google.com/googlebooks/uspto-patents-grants-text.html
(I am aware that this amounts to a large amount of data.) I would like to save all files for one year in a directory [year], so 1976 for all the (weekly) files from 1976, and I would like to save them to the directory my Python script is in.
I've tried using the urllib.request package, but I could only get far enough to retrieve the HTML text, not to "click" on a file to download it.
import urllib.request

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
savename = 'google_patent_urltext'
urllib.request.urlretrieve(url, savename)
Thank you very much for your help.
As I understand it, you are looking for a command that will simulate left-clicking a file link and automatically download the file. If so, you can use Selenium.
Something like:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

profile = FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'D:\\')  # choose the folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'application/octet-stream')

driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015')

filename = driver.find_element_by_xpath('//a[contains(text(), "ipg150106.zip")]')  # use a loop to list all zip files
filename.click()
UPDATED: the 'application/octet-stream' MIME type should be used instead of "application/zip". Now it should work. :)
The HTML you are downloading is the page of links. You need to parse the HTML to find all the download links. You could use a library like Beautiful Soup to do this.
However, the page is very regularly structured, so you could simply use a regular expression to get all the download links:
import re
import urllib.request

html = urllib.request.urlopen(url).read().decode('utf-8')  # decode bytes so re can match
links = re.findall('<a href="(.*)">', html)
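To tie this back to the original goal of saving each weekly .zip into a per-year directory, here is a hedged sketch building on the regex above; the exact href format and the assumption that the year appears in the URL may not match the live page:
import os
import re
import urllib.request

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
html = urllib.request.urlopen(url).read().decode('utf-8')

# keep only links that point at .zip archives
zip_links = [l for l in re.findall('<a href="(.*?)">', html) if l.endswith('.zip')]

for link in zip_links:
    match = re.search(r'(19|20)\d{2}', link)  # guess the year from the URL
    if match is None:
        continue
    year = match.group(0)
    os.makedirs(year, exist_ok=True)
    filename = os.path.join(year, link.rsplit('/', 1)[-1])
    urllib.request.urlretrieve(link, filename)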
I am working with Python and Selenium. I want to download a file via a click event using Selenium. I wrote the following code.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.close()
I want to download both files from the links named "Export Data" at the given URL. How can I achieve this, since it works with a click event only?
Find the link using find_element(s)_by_*, then call the click method.
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.find_element_by_id('exportpt').click()
browser.find_element_by_id('exporthlgt').click()
Added profile manipulation code to prevent download dialog.
I'll admit this solution is a little more "hacky" than the Firefox Profile saveToDisk alternative, but it works across both Chrome and Firefox, and doesn't rely on a browser-specific feature which could change at any time. And if nothing else, maybe this will give someone a little different perspective on how to solve future challenges.
Prerequisites: Ensure you have selenium and pyvirtualdisplay installed...
Python 2: sudo pip install selenium pyvirtualdisplay
Python 3: sudo pip3 install selenium pyvirtualdisplay
The Magic
import pyvirtualdisplay
import selenium
import selenium.webdriver
import time
import base64
import json
root_url = 'https://www.google.com'
download_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
print('Opening virtual display')
display = pyvirtualdisplay.Display(visible=0, size=(1280, 1024,))
display.start()
print('\tDone')
print('Opening web browser')
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome() # Alternately, give Chrome a try
print('\tDone')
print('Retrieving initial web page')
driver.get(root_url)
print('\tDone')
print('Injecting retrieval code into web page')
driver.execute_script("""
    window.file_contents = null;
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'blob';
    xhr.onload = function() {
        var reader = new FileReader();
        reader.onloadend = function() {
            window.file_contents = reader.result;
        };
        reader.readAsDataURL(xhr.response);
    };
    xhr.open('GET', %(download_url)s);
    xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
    'download_url': json.dumps(download_url),
})
print('Looping until file is retrieved')
downloaded_file = None
while downloaded_file is None:
    # Returns the file retrieved base64 encoded (perfect for downloading binary)
    downloaded_file = driver.execute_script(
        "return (window.file_contents !== null ? window.file_contents.split(',')[1] : null);")
    print(downloaded_file)
    if not downloaded_file:
        print('\tNot downloaded, waiting...')
        time.sleep(0.5)
print('\tDone')
print('Writing file to disk')
fp = open('google-logo.png', 'wb')
fp.write(base64.b64decode(downloaded_file))
fp.close()
print('\tDone')
driver.close() # close web browser, or it'll persist after python exits.
display.popen.kill() # close virtual display, or it'll persist after python exits.
Explanation
We first load a URL on the domain we're targeting a file download from. This allows us to perform an AJAX request on that domain without running into cross-site scripting issues.
Next, we inject some JavaScript into the DOM which fires off an AJAX request. Once the AJAX request returns a response, we take the response and load it into a FileReader object. From there we can extract the base64-encoded content of the file by calling readAsDataURL(). We then take the base64-encoded content and append it to window, a globally accessible variable.
Finally, because the AJAX request is asynchronous, we enter a Python while loop waiting for the content to be appended to the window. Once it's appended, we decode the base64 content retrieved from the window and save it to a file.
This solution should work across all modern browsers supported by Selenium, whether the file is text or binary, and across all MIME types.
Alternate Approach
While I haven't tested this, Selenium does afford you the ability to wait until an element is present in the DOM. Rather than looping until a globally accessible variable is populated, you could create an element with a particular ID in the DOM and use the binding of that element as the trigger to retrieve the downloaded file.
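A rough, untested sketch of that alternate approach, reusing the driver from above; the marker element and its id are hypothetical:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# In the injected JavaScript above, signal completion by adding a marker element:
#   reader.onloadend = function() {
#       window.file_contents = reader.result;
#       var marker = document.createElement('div');
#       marker.id = 'download-complete-marker';  // hypothetical id
#       document.body.appendChild(marker);
#   };

# Then, instead of polling in a while loop, block until the marker appears:
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'download-complete-marker'))
)
downloaded_file = driver.execute_script(
    "return window.file_contents.split(',')[1];")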
In Chrome, what I do is download the files by clicking on the links, then open the chrome://downloads page and retrieve the list of downloaded files from the shadow DOM like this:
docs = document
.querySelector('downloads-manager')
.shadowRoot.querySelector('#downloads-list')
.getElementsByTagName('downloads-item')
This solution is restricted to Chrome; the data also contains information such as the file path and the download date. (Note that this code is JavaScript, not Python syntax.)
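Run from Python via execute_script, the same query might look like the sketch below; keep in mind that the shadow-DOM structure of chrome://downloads is Chrome-internal and can change between versions:
driver.get('chrome://downloads')
docs = driver.execute_script("""
    return document
        .querySelector('downloads-manager')
        .shadowRoot.querySelector('#downloads-list')
        .getElementsByTagName('downloads-item');
""")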
Here is the full working code. You can use web scraping to enter the username, password, and other fields. To get the names of the fields appearing on the web page, use inspect element. An element (username, password, or click button) can be located through its class or name.
from selenium import webdriver

# Using Chrome to access the web
options = webdriver.ChromeOptions()
# download.default_directory is a Chrome preference, not a command-line
# argument, so it has to be set through the 'prefs' experimental option
options.add_experimental_option('prefs', {'download.default_directory': 'C:/Test'})  # set the download path
driver = webdriver.Chrome(options=options)

# Open the website
try:
    driver.get('xxxx')  # your website address
    password_box = driver.find_element_by_name('password')
    password_box.send_keys('xxxx')  # password
    download_button = driver.find_element_by_class_name('link_w_pass')
    download_button.click()
    driver.quit()
except:
    driver.quit()
    print("Faulty URL")