Use Selenium to download multiple URLs as PDFs - python

I am trying to have Selenium download the URLs of a webpage as PDFs on Safari. So far, I have been able to open the URL, but I can't get Safari to download it. All the solutions I found so far were either for another browser, or they didn't work. Ideally I would like it to download all links of one page and then move on to the next page.
At first I thought that clicking on each hyperlink and then downloading it was the way to go. But that would require switching windows each time, so then I tried to find a way to download it without having to click on it, but nothing worked.
I am quite new at programming so I am sure that I am missing something.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pdfkit

browser = webdriver.Safari()
browser.get(a_base_url)  # a_base_url holds the login page URL

# locate the login fields and sign in (username and password hold the credentials)
username_field = browser.find_element_by_name("tb_LoginName")
password_field = browser.find_element_by_name("tb_Password")
submit = browser.find_element_by_id("btn_Login")
username_field.send_keys(username)
password_field.send_keys(password)
submit.click()

# wait for the link to appear, click it, then switch back to the original window
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="maincolumn"]/div/table/tbody/tr[2]/td[9]/a[2]'))).click()
browser.switch_to.window(browser.window_handles[0])
url = browser.current_url

I would go for the following approach:
Get the href attribute of the link you want to download via the WebElement.get_attribute() function
Use the urllib or requests library to retrieve the URL from step 1 without using the browser
Most probably you will also need to get the browser cookies via the WebDriver.get_cookies() function and add them to the Cookie header of your download request
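A minimal sketch of that approach, assuming browser is the already logged-in Safari session from the question and that the links of interest point at .pdf files (both assumptions; adjust the locator and file names to the real page):
import requests

# collect the href attributes of the links to download
links = browser.find_elements_by_xpath("//a[contains(@href, '.pdf')]")
urls = [link.get_attribute("href") for link in links]

# copy the browser cookies into a requests session so the downloads are authenticated
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

# fetch each URL outside the browser and save the response body to a file
for i, url in enumerate(urls):
    response = session.get(url)
    with open(f"download_{i}.pdf", "wb") as f:
        f.write(response.content)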


Not able to download the file from website using selenium python

I am trying to download the daily report from the website NSE-India using selenium & python.
Approach to download the daily report
Website loads with no data
After X time, the page is loaded with the report information
Once the page is loaded with report data, "//table[@id='etfTable']" appears
An explicit wait is added in the code to wait till the "//table[@id='etfTable']" loads
Code for explicit wait
element=WebDriverWait(driver,50).until(EC.visibility_of_element_located(By.xpath,"//table[@id='etfTable']"))
Extract the download link using XPath
downloadcsv= driver.find_element_by_xpath("//div[@id='esw-etf']/div[2]/div/div[3]/div/ul/li/a")
Trigger the click to download the file
Full code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options =webdriver.ChromeOptions();
prefs={"download.default_directory":"/Volumes/Project/WebScraper/downloadData"};
options.binary_location=r'/Applications/Google Chrome 2.app/Contents/MacOS/Google Chrome'
chrome_driver_binary =r'/usr/local/Caskroom/chromedriver/94.0.4606.61/chromedriver'
options.add_experimental_option("prefs",prefs)
driver =webdriver.Chrome(chrome_driver_binary,options=options)
try:
    #driver.implicity_wait(10)
    driver.get('https://www.nseindia.com/market-data/exchange-traded-funds-etf')
    element =WebDriverWait(driver,50).until(EC.visibility_of_element_located(By.xpath,"//table[@id='etfTable']"))
    downloadcsv= driver.find_element_by_xpath("//div[@id='esw-etf']/div[2]/div/div[3]/div/ul/li/a")
    print(downloadcsv)
    downloadcsv.click()
    time.sleep(5)
    driver.close()
except:
    print("Invalid URL")
Issue I am facing:
The page keeps on loading when launched via Selenium, but when launched without Selenium the daily report gets loaded.
[Screenshots: normal load vs. loading via Selenium]
Not able to download the daily report.
There are some syntax errors in the program, like the semicolons on a few lines, and the brackets are missing when finding the element using WebDriverWait.
Try it like below and confirm.
You can use JavaScript to click on that element.
driver.get("https://www.nseindia.com/market-data/exchange-traded-funds-etf")
element =WebDriverWait(driver,50).until(EC.visibility_of_element_located((By.XPATH,"//table[@id='etfTable']/tbody/tr[2]")))
downloadcsv= driver.find_element_by_xpath("//img[@title='csv']/parent::a")
print(downloadcsv)
driver.execute_script("arguments[0].click();",downloadcsv)
It's not an issue with your code, it's an issue with the website. I checked it and most of the time it did not allow me to click on the CSV file. Instead of downloading the CSV file, you can scrape the table.
from time import sleep
from bs4 import BeautifulSoup

# when going directly to the page, deleting cookies is very important, otherwise access is denied
# (browser is an already-created webdriver instance)
browser.delete_all_cookies()
browser.get('https://www.nseindia.com/market-data/exchange-traded-funds-etf')
sleep(5)
soup = BeautifulSoup(browser.page_source, 'html.parser')
# scrape the table from the soup
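For example, a minimal sketch of pulling the rows out of that table with BeautifulSoup (the id 'etfTable' comes from the question):
table = soup.find('table', {'id': 'etfTable'})
rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)
print(rows[:5])  # first few rows of the report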

Selenium get() redirects to another url

I'm trying to navigate to the following page and extract the HTML https://www.automobile.it/annunci?b=data&d=DESC, but every time I call the get() method it looks like the website redirects me to another page, always the same one, which is https://www.automobile.it/torrenova?radius=100&b=data&d=DESC.
Here's the simple code I'm running:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=ex_path)
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")
html=driver.page_source
If I do the same thing using the requests module, I don't get redirected:
import requests
html=requests.get("https://www.automobile.it/annunci?b=data&d=DESC")
I don't understand why it's behaving like this. Any ideas?
Use driver.delete_all_cookies()
from selenium import webdriver
driver = webdriver.Chrome(executable_path=ex_path)
driver.delete_all_cookies()
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")
html=driver.page_source
PS: also be warned that page_source will not get you the complete DOM as rendered.
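If you need the fully rendered listing, a minimal sketch is to wait for a known element before reading page_source (the CSS selector here is a placeholder assumption; inspect the page for the real one):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait until at least one listing element is present before grabbing the source
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "article")))
html = driver.page_source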
Well, you can clear the browser cache by using the code below.
I am assuming that you are using Chrome.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome(executable_path=ex_path)
driver.get('chrome://settings/clearBrowserData')
driver.find_element_by_xpath('//settings-ui').send_keys(Keys.ENTER)
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")

Web-scraping with selenium using google translate

I am trying to scrape multiple web pages from across the world. So, I want to translate each website using the Google Translate extension and then scrape the page using Selenium.
I did some research and figured out how to add an extension while running Selenium:
1) download the Google Translate extension
2) create a .crx file
3) add the extension to Selenium
But I have no idea how to automatically execute the extension (by default, it does nothing).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_extension('./translate.crx')
driver = webdriver.Chrome(executable_path = "./chromedriver", chrome_options = option)
driver.get("https://www.naver.com")
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
''' #### Here I want something like####
driver.execute_extension("translate this page")
'''
print(driver.find_element_by_tag_name("body").text)
driver.quit()
Also, I found that the extension doesn't translate the original HTML, so I might have to use a different method for crawling (maybe sending ctrl-a, ctrl-c, ctrl-v instead of using by_tag_name("body")).
Could you give me any pointers on this?
Thanks in advance
Regarding driver.execute_extension: it seems to me that you can open the extension with Selenium (see an example in C#), and then use Selenium to click on the TRANSLATE THIS PAGE link.
Shortcut
Use the Google Translate API.
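A minimal sketch of that shortcut, substituting the deep_translator package for the paid Cloud Translation API (an assumption; any translation client would do): scrape the raw text with Selenium, then translate it outside the browser.
from selenium import webdriver
from deep_translator import GoogleTranslator  # assumption: the deep_translator package is installed

driver = webdriver.Chrome()
driver.get("https://www.naver.com")

# grab the raw (untranslated) page text
body_text = driver.find_element_by_tag_name("body").text
driver.quit()

# translate outside the browser; most services cap request length, so chunk long pages
translated = GoogleTranslator(source="auto", target="en").translate(body_text[:4500])
print(translated)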

Eaten by the net? Crawler survival under this website using python

I am making a web crawler to get information from http://www.caam.org.cn/hyzc, but it showed me HTTP Error 302, and I cannot fix it.
https://imgur.com/a/W0cykim
The picture gives you a rough idea of the special layout of this website: when you browse it, it pops up a window telling you that the website is accelerating because so many people are online, and then redirects you to the site. As a result, when I use my web crawler, all I get is the information on this window and nothing from the website itself. I think this is the website keeper's way of getting rid of web crawlers, so I want to ask for your help to get useful information from this website.
At first, I used Python's requests for my web crawler, and I only got the information on that window; the results are shown here: https://imgur.com/a/GLcpdZn
Then I forbade the website redirect and got HTTP Error 303, shown here:
https://imgur.com/a/6YtaVOt
This is the latest code I used:
import requests
def getpage(url):
    try:
        r = requests.get(url, headers={'User-Agent':'Mozilla/5.0'}, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "try again"
url = "http://www.caam.org.cn/hyzc"
print(getpage(url))
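For reference, the redirect-disabled attempt mentioned above would presumably just pass the allow_redirects flag to requests (an assumption about how it was done):
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10, allow_redirects=False)
print(r.status_code, r.headers.get('Location'))  # shows the 3xx status and where it points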
The expected outcome of this question is to get useful information from the website http://www.caam.org.cn/hyzc. We may need to deal with the pop-up window.
It looks like this website has some kind of protection against crawlers using requests; the page is not entirely loaded when you send a GET request.
You can try to emulate a browser using selenium:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.caam.org.cn/hyzc')
print(driver.page_source)
driver.close()
driver.page_source will contain the page source.
You can learn how to set up the Selenium webdriver here.
I added something to delay the closing of my web crawler and this worked. So I want to share my lines in case you meet a similar problem in the future:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get('http://www.caam.org.cn')
body = driver.find_element_by_tag_name("body")
# wait until the original body goes stale, i.e. the "accelerating" page has been replaced
wait = WebDriverWait(driver, 5, poll_frequency=0.05)
wait.until(EC.staleness_of(body))
print(driver.page_source)
driver.close()

How can I scrape this?

I need to scrape this page (which has a form): http://kllads.kar.nic.in/MLAWise_reports.aspx, with Python preferably (if not Python, then JavaScript). I was looking at libraries like RoboBrowser (which is basically Mechanize + BeautifulSoup) and (maybe) Selenium, but I'm not quite sure how to go about it. From inspecting the element, it seems to be a WebForm that I need to fill in. After filling that in, the webpage generates some data that I need to store. How should I do this?
You can interact with the JavaScript web forms relatively easily in Selenium. You may need to install a webdriver quickly, but besides that, all you need to do is find the form using its XPath and then have Selenium select an option from the drop-down menu using the option's XPath. For the web page provided, that would look something like this:
#import functions from selenium module
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# open chrome browser using webdriver
path_to_chromedriver = '/Users/Michael/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
# open web page using browser
browser.get('http://kllads.kar.nic.in/MLAWise_reports.aspx')
# wait for page to load then find 'Constituency Name' dropdown and select 'Aland (46)''
const_name = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlconstname"]')))
browser.find_element_by_xpath('//*[@id="ddlconstname"]/option[2]').click()
# wait for the page to load then find 'Select Status' dropdown and select 'OnGoing'
sel_status = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlstatus1"]')))
browser.find_element_by_xpath('//*[@id="ddlstatus1"]/option[2]').click()
# wait for browser to load then click 'Generate Report'
gen_report = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="BtnReport"]')))
browser.find_element_by_xpath('//*[@id="BtnReport"]').click()
Between each interaction, you are just giving the browser some time to load before attempting click the next element. Once all the forms are filled out, the page will display the data based on the options selected and you should be able to scrape the table data. I had a few issues when attempting to load data for the first Constituency Name option, but the others seemed to work fine.
You should also be able to loop through all the dropdown options available under each web form to display all the data.
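For instance, a rough sketch of that loop using Selenium's Select helper (the ids ddlconstname and BtnReport come from the code above; skipping the first placeholder option and re-locating the dropdown after each report are assumptions):
from selenium.webdriver.support.ui import Select

# iterate over every option in the 'Constituency Name' dropdown
num_options = len(Select(browser.find_element_by_xpath('//*[@id="ddlconstname"]')).options)
for i in range(1, num_options):  # start at 1 to skip the placeholder entry
    # re-locate the dropdown each pass, since generating a report refreshes the page
    dropdown = Select(browser.find_element_by_xpath('//*[@id="ddlconstname"]'))
    dropdown.select_by_index(i)
    browser.find_element_by_xpath('//*[@id="BtnReport"]').click()
    # ... wait for the table to render, then scrape it before moving on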
Hope that helps!
