I have been attempting to use WeasyPrint and pdfkit to convert a webpage into a PDF. I have successfully saved a PDF containing a portion of the page.
With WeasyPrint I cannot work out how to pick up the correct CSS styles from the page, and with pdfkit I seem to be retrieving the mobile version of the site rather than the full page. I'm using Python 3.6.
from urllib.request import Request, urlopen
import webbrowser
import pdfkit
import weasyprint
#pdfkit.from_url('http://google.com', 'out.pdf')
print("started script")
website = 'https://www.bbcgoodfood.com/recipes/3228/chilli-con-carne'
filename = 'savedPDF.pdf'
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
print(urlopen(req).getcode())
temp = urlopen(req).getcode()
if temp == 200:
    pdfkit.from_url(website, 'out.pdf')
    weasyprint.HTML(website).write_pdf('/Users/me/Documents/weasyprint.pdf')
    weasyprint.HTML(website).write_pdf(filename, stylesheets=[weasyprint.CSS('https://www.bbcgoodfood.com/sites/default/files/advagg_css/css__pDgD1vQBFL4LZ6AO_Uw8wEc3MBEaHOzbhMtPie685P8__Kxa0k0VBbKvV5-TOMN_kW3S7CrkFMM4Zf0LjDvzMFnk__mXPuNFBZ0nocZLk5Qifty02tMfg-gomArSBCcGw1mLo.css')])
I can't see an option in pdfkit to specify what to connect with.
Furthermore, the two PDFs created by WeasyPrint are identical.
After quite a while of messing around with the packages mentioned above, I was still struggling to get a correct-looking output.
I have settled on using webkit2png, which works almost perfectly; the only downside is that a cookie pop-up message appears in some of the saved files.
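For reference, a minimal sketch of calling webkit2png from Python; the flags (-F for a full-size capture, -o for the output basename) are from memory, so check webkit2png --help before relying on them:
import subprocess

# Capture a full-size screenshot of the recipe page.
# -F : full-size capture only
# -o : output file basename
subprocess.run([
    'webkit2png', '-F',
    '-o', 'chilli-con-carne',
    'https://www.bbcgoodfood.com/recipes/3228/chilli-con-carne',
], check=True)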
I'm trying to grab snowfall data from the National Weather Service at this site:
https://www.nohrsc.noaa.gov/snowfall/
The data can be downloaded via the webpage with a 'click' on the file type in the drop down, but I can't seem to figure out how to automate this using python. They have an ftp archive, but it requires a login and I can't access it for some reason.
However, since the files can be downloaded via a "click" on the webpage interface, I imagine there must be a way to grab them using wget or urlopen? But I can't figure out what the exact URL would be in this case in order to use those functions. Does anyone have any ideas on how to download this data straight from the website listed above?
Thanks!
You can inspect the links with the Chrome developer console.
Press F12, then click on the file type to see the request it triggers.
Here is an example URL: https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc
You can download it with Python using the Requests library:
import requests
r = requests.get('https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc')
data = r.content # file content
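If you want to write that response straight to disk, a small sketch (the output filename here is just an example):
# Write the downloaded bytes to a local NetCDF file.
with open('sfav2_CONUS_6h_2021122618_grid184.nc', 'wb') as f:
    f.write(r.content)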
Or you can just save it to a file with urlretrieve:
from urllib.request import urlretrieve
url = 'https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc'
dst = 'data.nc'
urlretrieve(url, dst)
I am currently trying to figure out how I can take a list of links, have Python run through all of them, and save each one as a PDF. (I'm not a Python expert.)
I found a Python package called "pdfkit" which is quite good, but how do I set it up so that it works through my URL list and saves each PDF under a different name?
import pdfkit
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
pdfkit.from_url('http://google.com', 'MyPDF.pdf', configuration=config)
This is my current code. Let's say I have a list of 10 webpages that I want to save as 10 different PDF files; how do I set that up?
Another issue is that I need to log in to the page in order to scrape the information from the links. How would you implement that?
Best Regards,
Answer to the first question:
import pdfkit
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
url_list = [
    ['http://google.com', 'google.com.pdf'],
    ['http://facebook.com', 'facebook.com.pdf'],
    ['http://yahoo.com', 'yahoo.com.pdf'],
]
for k, v in url_list:
    pdfkit.from_url(k, v, configuration=config)
For the second question, you can use the requests module's session feature to log in first and then pass the cookies to pdfkit to download the page. See Create PDF of a https webpage which requires login using pdfkit.
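A rough sketch of that approach; the login URL, form field names, and credentials below are placeholders, not a real site's API:
import pdfkit
import requests

config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")

# Log in with a requests session (form field names are hypothetical).
session = requests.Session()
session.post('https://example.com/login', data={'username': 'me', 'password': 'secret'})

# Hand the session cookies to wkhtmltopdf via pdfkit's 'cookie' option.
options = {'cookie': list(session.cookies.items())}
pdfkit.from_url('https://example.com/protected-page', 'protected.pdf',
                configuration=config, options=options)
Alternatively, the Selenium snippet below logs in with a real browser and then feeds the rendered page source to pdfkit: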
import selenium.webdriver
import pdfkit
import time
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files (x86)\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
driver = selenium.webdriver.Chrome()
driver.get('https://www.linkedin.com/')
time.sleep(1)
driver.find_element_by_id('login-email').send_keys('username')
driver.find_element_by_id('login-password').send_keys('password')
driver.find_element_by_id('login-submit').click()
time.sleep(2)
driver.save_screenshot('output.png') # only visible part
print(driver.page_source)
pdfkit.from_string(driver.page_source, 'file.pdf')
I want to retrieve data from a website named myip.ms. I'm using requests to send data to the form and then I want the response page back. When I run the script it returns the same page (the homepage) in the response, but I want the next page for the query I provide. I'm new to web scraping. Here's the code I'm using:
import requests
from urllib.parse import urlencode, quote_plus
payload = {
    'name': 'educationmaza.com',
    'value': 'educationmaza.com',
}
payload=urlencode(payload)
r=requests.post("http://myip.ms/s.php",data=payload)
infile=open("E://abc.html",'wb')
infile.write(r.content)
infile.close()
I'm no expert, but it appears that when interacting with the webpage, the POST is processed by jQuery, which requests does not handle well.
As such, you would have to use the Selenium module to interact with it.
The following code will execute as desired:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("https://myip.ms/s.php")
driver.find_element_by_id("home_txt").send_keys('educationmaza.com')
driver.find_element_by_id("home_submit").click()
html = driver.page_source
infile=open("stack.html",'w')
infile.write(html)
infile.close()
You will have to install the Selenium package, as well as PhantomJS.
I have tested this code and it works fine. Let me know if you need any further help!
I would like to download (using Python 3.4) all (.zip) files on the Google Patent Bulk Download Page http://www.google.com/googlebooks/uspto-patents-grants-text.html
(I am aware that this amounts to a large amount of data.) I would like to save all files for one year in directories [year], so 1976 for all the (weekly) files in 1976. I would like to save them to the directory that my Python script is in.
I've tried using the urllib.request package, but I could only get as far as retrieving the page's HTML text, not how to "click" on a file to download it.
import urllib.request
url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
savename = 'google_patent_urltext'
urllib.request.urlretrieve(url, savename)
Thank you very much for help.
As I understand it, you are looking for a command that will simulate left-clicking on a file and automatically download it. If so, you can use Selenium.
something like:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
profile = FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'D:\\')  # choose folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'application/octet-stream')
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015')
filename = driver.find_element_by_xpath('//a[contains(text(),"ipg150106.zip")]') #use loop to list all zip files
filename.click()
UPDATED: the 'application/octet-stream' MIME type should be used instead of "application/zip". Now it should work :)
The HTML you are downloading is the page of links. You need to parse the HTML to find all the download links. You could use a library like Beautiful Soup to do this.
However, the page is very regularly structured, so you could also use a regular expression to get all the download links:
import re
import urllib.request

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
html = urllib.request.urlopen(url).read().decode('utf-8')  # decode the bytes so the regex runs on a str
links = re.findall('<a href="(.*)">', html)
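If you prefer Beautiful Soup, as mentioned above, a minimal sketch (assuming you only want the .zip links) could look like this:
import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Keep only the anchors whose href ends in .zip
zip_links = [a['href'] for a in soup.find_all('a', href=True) if a['href'].endswith('.zip')]
print(zip_links)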
I am trying to download a PDF from a webpage using urllib. I used the source link that downloads the file in the browser, but that same link fails to download the file in Python. Instead, what downloads is a redirect to the main page.
import os
import urllib
os.chdir(r'/Users/file')
url = "http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414"
urllib.urlretrieve (url, "downloaded_file")
Please try downloading the file manually from the link provided or from the redirected site; the link on the main page is called 'sectionals'.
Your help is much appreciated.
It is because the given link redirects you to a "raw" PDF file. Examining the response headers (e.g. with Firebug or the browser's network tools), I was able to get the filename sectionals/2014/2607RAND.pdf. As it is relative to the current .aspx file, the required URI is http://www.australianturfclub.com.au/races/sectionals/2014/2607RAND.pdf (in your case, change the url variable to this link).
In python3:
import urllib.request
import shutil
local_filename, headers = urllib.request.urlretrieve('http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414')
shutil.move(local_filename, 'ret.pdf')
shutil is there because Python saves the download to a temp folder (in my case that's on another partition, so os.rename would give an error).
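As an alternative sketch (not part of the original answer), the requests library follows the redirect for you and lets you write the PDF directly; the output filename is just an example:
import requests

url = 'http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414'
r = requests.get(url)  # redirects are followed by default, ending at the raw PDF
with open('ret.pdf', 'wb') as f:
    f.write(r.content)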