Download PDF file from .cfm URL - python

I'm trying to download a PDF file from this address: https://aisweb.decea.mil.br/inc/notam/gerar-boletim/reports/report-notam.cfm
I wrote some code that first fills out some information on this page (correctly): https://aisweb.decea.mil.br/?i=notam and then clicks a button that opens a new tab with the generated PDF. The problem is that when it tries to save the PDF file at the end, it downloads directly from the .cfm address, resulting in an empty PDF template (you can see this by clicking the first link).
How can I download the PDF that is currently being shown to me on the page, instead of requesting the first URL directly?
This is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
from selenium.webdriver.common.print_page_options import PrintOptions
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib
import time
import requests
from urllib.parse import urljoin
aerodromos = "SBNT,SBJP,SBFZ,SBRF" #TEST
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(r'C:\Windows\chromedriver.exe', options=options)  # create the driver once (the original instantiated it twice)
driver.get("https://aisweb.decea.mil.br/?i=notam")
driver.maximize_window()
caixaTexto = driver.find_element(By.XPATH, '//*[@id="icaocode"]')
caixaTexto.send_keys(aerodromos)
botao = driver.find_element(By.XPATH, '//*[#id="a"]/form/div/div[3]/div/input[2]')
botao.click()
botao = driver.find_element(By.XPATH, '//*[#id="select-all"]')
botao.click()
botao = driver.find_element(By.XPATH, '/html/body/div/div/div/div/div[2]/div/div/form/input[3]')
botao.click()
# This re-requests the .cfm URL in a fresh session, which is why it returns the blank template.
response = urllib.request.urlretrieve('https://aisweb.decea.mil.br/inc/notam/gerar-boletim/reports/report-notam.cfm', filename='relatorio1.pdf')
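(For reference, a common workaround for exactly this failure mode is to reuse the Selenium session's cookies in the HTTP request, so the .cfm endpoint sees the same session that generated the report. A sketch, assuming the server ties the report to the session cookies:)
import requests
session = requests.Session()
# Copy the session cookies out of the Selenium driver into requests...
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
# ...so this request is served the generated report, not the blank template.
resp = session.get('https://aisweb.decea.mil.br/inc/notam/gerar-boletim/reports/report-notam.cfm')
with open('relatorio1.pdf', 'wb') as f:
    f.write(resp.content)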

I did it! When I tried changing Chrome's settings to download PDFs instead of opening them, it made no difference, but I ended up finding a solution while searching for another way to do it, in this question: Unable to access the modal elements to download pdf with selenium
I changed the Chrome experimental options profile in my code and it worked! Now it opens the tab, immediately downloads the file, and closes the tab!
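A minimal sketch of the kind of experimental-options profile that produces this behavior (the exact preferences aren't shown above; these are commonly used ones, and the download directory is a placeholder):
from selenium import webdriver
options = webdriver.ChromeOptions()
# "prefs" profile: save PDFs straight to disk instead of opening Chrome's viewer.
options.add_experimental_option("prefs", {
    "download.default_directory": r"C:\Users\me\Downloads",  # placeholder path
    "download.prompt_for_download": False,
    "plugins.always_open_pdf_externally": True,
})
driver = webdriver.Chrome(options=options)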

Related

Specific web page doesn't load (empty page) HTML and CSS with Selenium?

I started working with Selenium. It works for every website I've tried except one (myvisit.com), which doesn't load the page.
It opens Chrome, but the page is empty. I tried setting a number of delays, but it still doesn't load.
When I go to the website in regular Chrome (without Selenium), it loads everything.
Here is my simple code; I'm not sure how to continue from here:
import os
import random
import time
# selenium libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import ChromiumOptions
def delay():
    time.sleep(random.randint(2, 3))
driver = webdriver.Chrome(os.getcwd()+"\\webdriver\\chromedriver.exe")
driver.get("https://myvisit.com")
delay()
delay()
delay()
delay()
I also tried to use ChromiumOptions with flags like --no-sandbox but it didn't help:
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(os.getcwd()+"\\webdriver\\chromedriver.exe",options=options)
Simply add the argument so the site can't determine that it's automation.
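If the flag alone isn't enough, it is often combined with these experimental options (a sketch; whether it helps depends on the site's detection checks):
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
# Commonly paired anti-detection tweaks.
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)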

Saving pages with Selenium

I will try again.
I copied the code below from another site, and the user says it works (they show a screenshot): Original code
I tested the code: no error, but no file is saved.
All the questions use this answer to save a file: A question!
Why is the page not saved, or if it is, where is the file?
Thanks
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Selenium\chromedriver.exe")
driver.get("http://www.example.com")
saveas = ActionChains(driver).key_down(Keys.CONTROL).send_keys('S').key_up(Keys.CONTROL)
saveas.perform()
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Selenium\chromedriver.exe")
driver.get("http://www.example.com")
with open('page.html', 'w+') as f:
    f.write(driver.page_source)
Supposedly, this must work.
If you do the key combination in the browser, you will see this only brings up the 'save page' dialog box. You need to additionally send ALT+S to save the page, in Windows it will be saved in your Downloads folder by default.
saveas = ActionChains(driver).key_down(Keys.CONTROL).send_keys('S').key_up(Keys.CONTROL).send_keys('MyDocumentName').key_down(Keys.ALT).send_keys('S').key_up(Keys.ALT)
EDIT:
ActionChains are unreliable. It would be easier not to interact with the browser GUI.
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Selenium\chromedriver.exe")
driver.get("http://www.example.com")
with open('page.html', 'w') as f:
    f.write(driver.page_source)

how to open existing instance of opera in selenium

I'm trying to use BeautifulSoup on a site that assigns a unique identifier to the session when a user logs in (e.g. d83c231c-a1a0-4b9b-85bb-7cc36aa6b66c) as well as a "serial number": serial:"ABzepITkLfRuQ+4hCMTXfA.
Upon examination via WebDriver, it seems that I receive a newly generated guid and serial number every time I log in to the site. Is there a way to open the same instance of a browser programmatically so that it appears as the same guid to the site each time?
This is what I am using now to open the browser:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome import service
from selenium.webdriver.support.ui import WebDriverWait
import pprint
webdriver_service = service.Service('operadriver')
webdriver_service.start()
options = webdriver.ChromeOptions()
options.binary_location = r"C:\Program Files (x86)\Opera\launcher.exe"
driver = webdriver.Opera(opera_options=options) # success!
driver.get('http://somesite.com')
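No answer is quoted here, but a common approach to making the browser look like the same session across runs is to point it at a persistent profile directory so cookies survive between launches. A sketch, assuming the site's guid is cookie-based (the profile path is hypothetical):
from selenium import webdriver
options = webdriver.ChromeOptions()
options.binary_location = r"C:\Program Files (x86)\Opera\launcher.exe"
# Reuse the same profile directory on every run so cookies persist.
options.add_argument(r"--user-data-dir=C:\selenium\opera-profile")  # hypothetical path
driver = webdriver.Opera(options=options)
driver.get('http://somesite.com')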

How to download a HTML webpage using Selenium with python?

I want to download a webpage using Selenium with Python, using the following code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument('--save-page-as-mhtml')
d = DesiredCapabilities.CHROME
driver = webdriver.Chrome()  # note: chromeOptions is never actually passed to the driver
driver.get("http://www.yahoo.com")
saveas = ActionChains(driver).key_down(Keys.CONTROL)\
.key_down('s').key_up(Keys.CONTROL).key_up('s')
saveas.perform()
print("done")
However, the above code isn't working. I am using Windows 7.
Is there any way by which I can bring up the 'Save As' dialog box?
Thanks
Karan
You can use the code below to download the page HTML:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
with open("/path/to/page_source.html", "w", encoding='utf-8') as f:
    f.write(driver.page_source)
Just replace "/path/to/page_source.html" with the desired file path and name.
Update
If you need to get complete page source (including CSS, JS, ...), you can use following solution:
pip install pyahk # from command line
Python code:
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import ahk
firefox = FirefoxBinary("C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe")
from selenium import webdriver
driver = webdriver.Firefox(firefox_binary=firefox)
driver.get("http://www.yahoo.com")
ahk.start()
ahk.ready()
ahk.execute("Send,^s")
ahk.execute("WinWaitActive, Save As,,2")
ahk.execute("WinActivate, Save As")
ahk.execute("Send, C:\\path\\to\\file.htm")
ahk.execute("Send, {Enter}")

Bypass Referral Denied error in selenium using python

I was making a script to download images from Comic Naver and I'm kind of done with it; however, I can't seem to save the images.
I successfully grabbed the images via urllib and BeautifulSoup. Now it seems they've introduced hotlink blocking, and I can't save the images to my system via urllib or Selenium.
Update: I tried changing the user agent to see if that was causing problems... still the same.
Any fix or solution?
My code right now:
import requests
from bs4 import BeautifulSoup
import re
import urllib
import urllib2
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities  # missing from the original imports
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
"Chrome/15.0.87"
)
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver = webdriver.PhantomJS(desired_capabilities=dcap)
soup = BeautifulSoup(urllib.urlopen(url).read())
scripts = soup.findAll('img', alt='comic content')
for links in scripts:
    Imagelinks = links['src']
    filename = Imagelinks.split('_')[-1]
    print 'Downloading Image : ' + filename
    driver.get(Imagelinks)
    driver.save_screenshot(filename)
driver.close()
Following MAI's reply, I tried what I could with Selenium and got what I wanted. It's solved now. My code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
driver = webdriver.Chrome()
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver.get(url)
elem = driver.find_elements_by_xpath("//div[@class='wt_viewer']//img[@alt='comic content']")
for links in elem:
    print links.get_attribute('src')
driver.quit()
But when I try to take screenshots of these, it shows that the "element is not attached to the page". Now, how am I supposed to solve that? :/
(Note: Apologies, I'm not able to comment, so I have to make this an answer.)
To answer your original question, I've just been able to download an image in cURL from Naver Webtoons (the English site) by adding a Referer: http://www.webtoons.com header like so:
curl -H "Referer: http://www.webtoons.com" [link to image] > img.jpg
I haven't tried, but you'll probably want to use http://comic.naver.com instead. To do this with urllib, create a Request object with the header required:
import shutil
import urllib.request
req = urllib.request.Request(url, headers={"Referer": "http://comic.naver.com"})
with urllib.request.urlopen(req) as response, open("image.jpg", "wb") as outfile:
    shutil.copyfileobj(response, outfile)
shutil.copyfileobj(src, dest) does the actual saving here. So instead of taking screenshots, you can simply get a list of all the images to download, then make a request for each one using the referer header.
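Putting that together, a sketch of the loop described above (the list of links stands in for whatever Selenium or BeautifulSoup collected; the filename scheme copies the asker's code):
import shutil
import urllib.request
image_links = []  # fill with the src URLs printed by the Selenium loop above
for link in image_links:
    # The Referer header is what gets past the hotlink blocking.
    req = urllib.request.Request(link, headers={"Referer": "http://comic.naver.com"})
    filename = link.split('_')[-1]  # same naming scheme as the asker's code
    with urllib.request.urlopen(req) as response, open(filename, "wb") as outfile:
        shutil.copyfileobj(response, outfile)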
Edit: I have a working script on GitHub which only requires urllib and BeautifulSoup.
I took a short look at the website with Chrome dev tools.
I would suggest you download the images directly instead of taking screenshots. Selenium WebDriver actually runs the JavaScript in the PhantomJS headless browser, so you should get the images loaded by JavaScript at the following path.
The path I get by eyeballing the HTML is
html body #wrap #container #content div #comic_view_area div img
The image tags at the last level have IDs like content_image_N, with N counting from 0, so you can also get a specific picture by using img#content_image_0, for example.
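A sketch combining the two answers: collect the src attributes via those IDs, then download each image with the Referer header (the attribute-prefix CSS selector and the requests-based download are assumptions, not code from the original answers):
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue")
# Match content_image_0, content_image_1, ... as described above.
images = driver.find_elements(By.CSS_SELECTOR, "img[id^='content_image_']")
for i, img in enumerate(images):
    src = img.get_attribute("src")
    # The Referer header gets past the hotlink blocking (see the previous answer).
    resp = requests.get(src, headers={"Referer": "http://comic.naver.com"})
    with open("image_{}.jpg".format(i), "wb") as f:
        f.write(resp.content)
driver.quit()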
