I'm trying to grab snowfall data from the National Weather Service at this site:
https://www.nohrsc.noaa.gov/snowfall/
The data can be downloaded via the webpage with a 'click' on the file type in the drop-down, but I can't figure out how to automate this with Python. They have an FTP archive, but it requires a login and I can't access it for some reason.
However, since the files can be downloaded via a "click" on the webpage interface, I imagine there must be a way to grab them using wget or urlopen. I just can't work out what the exact URL would be in order to use those functions. Does anyone have any ideas on how to download this data straight from the website listed above?
Thanks!
You can inspect the links with Chrome DevTools.
Press F12, open the Network tab, then click on the file type; the request URL appears in the panel.
Here is an example URL: https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc
You can download it with Python using the Requests library:
import requests

r = requests.get('https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc')
data = r.content  # raw bytes of the NetCDF file
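If you want to write those bytes straight to disk with Requests as well (a minimal continuation of the snippet above; 'data.nc' is just an example filename):
with open('data.nc', 'wb') as f:
    f.write(r.content)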
Or you can save it to a file directly with urlretrieve:
from urllib.request import urlretrieve
url = 'https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc'
dst = 'data.nc'
urlretrieve(url, dst)
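The URL itself encodes the date, so you can build it for other periods programmatically. A minimal sketch, assuming the path pattern data/YYYYMM/sfav2_CONUS_6h_YYYYMMDDHH_grid184.nc holds for other timestamps at 6-hour intervals (unverified):
from urllib.request import urlretrieve
from datetime import datetime

def snowfall_url(ts: datetime) -> str:
    # Assumed pattern, based on the example link above
    return ('https://www.nohrsc.noaa.gov/snowfall/data/'
            f'{ts:%Y%m}/sfav2_CONUS_6h_{ts:%Y%m%d%H}_grid184.nc')

ts = datetime(2021, 12, 26, 18)
urlretrieve(snowfall_url(ts), f'sfav2_{ts:%Y%m%d%H}.nc')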
I'm scraping a website that streams free movies using requests and BeautifulSoup, and I was able to get the streaming page. But I need to get the video source so I can stream/download the video, and I'm stuck here.
The video source is src="blob:https://example.com/blabla..." and it's not the original source.
After googling blob sources, I found out that the original video source can be found in the network traffic:
(you need to go to the Network tab, find the stream.m3u8 request, and copy the URL from its headers)
How can we get that link with Python code?
For accessing the M3U8 file, you can parse the response using regex.
import requests
import re

url = 'https://www.someexamplewebsite.com'
response = requests.get(url)
match = re.search(r'https.*?\.m3u8', response.text)  # matches http/https
if match:
    print(match.group(0))  # the m3u8 link
Please read the website's policy before proceeding to use the m3u8 URL. Only use the m3u8 file if it is free to use; otherwise it can be a breach of policy.
More on M3U8:
M3U8 files are simple text playlists that list the locations of audio/video streams on the internet.
See: https://www.lifewire.com/m3u8-file-2621956
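Because the playlist is plain text, you can fetch and inspect it directly. A minimal sketch; the playlist URL is the hypothetical one extracted in the regex step above:
import requests

playlist_url = 'https://example.com/stream.m3u8'  # hypothetical URL from the previous step
playlist = requests.get(playlist_url).text

# Lines starting with '#' are metadata; the rest are segment or sub-playlist URIs
segments = [line for line in playlist.splitlines() if line and not line.startswith('#')]
print(segments[:5])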
To play an M3U8 stream in the browser, you can use the HLS.js library (the HLS protocol itself was developed by Apple; HLS.js is a community JavaScript implementation).
See: https://github.com/video-dev/hls.js
I have been attempting to use WeasyPrint and pdfkit to turn a webpage into a PDF. I have successfully saved a PDF with a portion of the page.
In WeasyPrint I cannot work out how to grab the correct CSS style from the page. Using pdfkit I seem to be retrieving the mobile version of the site rather than the full page. I'm using Python 3.6.
from urllib.request import Request, urlopen

import pdfkit
import weasyprint

print("started script")
website = 'https://www.bbcgoodfood.com/recipes/3228/chilli-con-carne'
filename = 'savedPDF.pdf'

req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
status = urlopen(req).getcode()  # request the page once and check the HTTP status
print(status)

if status == 200:
    pdfkit.from_url(website, 'out.pdf')
    weasyprint.HTML(website).write_pdf('/Users/me/Documents/weasyprint.pdf')
    weasyprint.HTML(website).write_pdf(filename, stylesheets=[weasyprint.CSS('https://www.bbcgoodfood.com/sites/default/files/advagg_css/css__pDgD1vQBFL4LZ6AO_Uw8wEc3MBEaHOzbhMtPie685P8__Kxa0k0VBbKvV5-TOMN_kW3S7CrkFMM4Zf0LjDvzMFnk__mXPuNFBZ0nocZLk5Qifty02tMfg-gomArSBCcGw1mLo.css')])
I can't see an option in pdfkit to specify what user agent to connect with.
Furthermore, the two PDFs created by WeasyPrint are identical.
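As an aside, wkhtmltopdf (which pdfkit drives) accepts custom request headers, so a desktop user agent can be forced. A hedged sketch, not tested against this site:
import pdfkit

# 'custom-header' is passed through to wkhtmltopdf's --custom-header option
options = {'custom-header': [('User-Agent', 'Mozilla/5.0')]}
pdfkit.from_url('https://www.bbcgoodfood.com/recipes/3228/chilli-con-carne', 'out.pdf', options=options)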
After quite a while of messing around with the above-mentioned packages I was still struggling to get a correct-looking output.
I have settled on using webkit2png. This works almost perfectly; the only downside is that I get a cookie pop-up message appearing in some of the saved files.
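webkit2png is a command-line tool, so it can be driven from Python via subprocess. A minimal sketch, assuming webkit2png is installed and on the PATH:
import subprocess

website = 'https://www.bbcgoodfood.com/recipes/3228/chilli-con-carne'
subprocess.run(['webkit2png', website], check=True)  # writes PNG snapshot(s) to the current directory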
Currently I have a CSV file with a large number of links (900+) to files I want to download. However, in order to download the files I need to log into the website, which is done by navigating to a specific page on the site and logging in from there.
I can set up a login session via Selenium and use repeated driver.get commands to initiate the downloads, but in my experience this tends not to work reliably.
wget is an option for retrieving the files by iterating over the links in the file, but it doesn't get around the issue that the website requires a login.
So, in short, my question is: what is the most efficient way to iterate over a series of download links in a CSV file and download all the files, while maintaining a login session?
EDIT: Currently testing with requests
import requests

s = requests.Session()
print(s.cookies.get_dict())  # empty before any request
s.get("URL of Landing page to generate cookies")
print(s.cookies.get_dict())  # session cookies set by the landing page
s.get("Login page URL")      # note: an actual login usually needs a POST with credentials
Use the urllib.request module and its HTTPBasicAuthHandler() class. So you could:
import urllib.request as ur
mgr = ur.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, 'url', 'username', 'password') # where url is each url
auth = ur.HTTPBasicAuthHandler(mgr)
opener = ur.build_opener(auth)
rsp = opener.open('url/at/some/path').read()
However you want to iterate through the CSV to build a list of URLs and make queries is up to you.
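For a form-based login (rather than HTTP basic auth), a requests.Session keeps the cookies across downloads. A minimal sketch; the login URL, form field names, and CSV layout are all assumptions to adapt:
import csv
import requests

LOGIN_URL = 'https://example.com/login'                 # hypothetical login endpoint
CREDENTIALS = {'username': 'user', 'password': 'pass'}  # hypothetical form field names

with requests.Session() as s:
    s.post(LOGIN_URL, data=CREDENTIALS)  # session cookies persist for later requests
    with open('links.csv', newline='') as f:
        for row in csv.reader(f):
            url = row[0]                 # assuming one link per row
            r = s.get(url)
            with open(url.rsplit('/', 1)[-1], 'wb') as out:
                out.write(r.content)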
I would like to download (using Python 3.4) all (.zip) files on the Google Patent Bulk Download Page http://www.google.com/googlebooks/uspto-patents-grants-text.html
(I am aware that this amounts to a large amount of data.) I would like to save all files for one year in directories [year], so 1976 for all the (weekly) files in 1976. I would like to save them to the directory that my Python script is in.
I've tried using the urllib.request package, but I only got far enough to retrieve the HTML text; I couldn't work out how to "click" on a file to download it.
import urllib.request

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
savename = 'google_patent_urltext'
urllib.request.urlretrieve(url, savename)  # this only saves the HTML page itself
Thank you very much for your help.
As I understand it, you are looking for a way to simulate left-clicking a file so that it downloads automatically. If so, you can use Selenium.
Something like:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

profile = FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'D:\\')  # choose the folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'application/octet-stream')

driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015')
filename = driver.find_element_by_xpath('//a[contains(text(),"ipg150106.zip")]')  # use a loop to fetch all zip files
filename.click()
UPDATE: the 'application/octet-stream' MIME type should be used instead of 'application/zip'. Now it should work.
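To click every zip link on the page instead of one hard-coded file name, a hedged sketch continuing from the driver above:
for link in driver.find_elements_by_partial_link_text('.zip'):
    link.click()  # each click triggers a download into browser.download.dir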
The HTML you are downloading is the page of links. You need to parse the HTML to find all the download links. You could use a library like Beautiful Soup to do this.
However, the page is very regularly structured so you could use a regular expression to get all the download links:
import re
import urllib.request

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
html = urllib.request.urlopen(url).read().decode('utf-8')  # decode the bytes so the str regex matches
links = re.findall('<a href="(.*?)">', html)
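To sort the weekly zip files into per-year directories, one hedged approach is to pull a four-digit year out of each file name. This assumes the href values are absolute URLs and that each name contains a year; the naming scheme varies across years, so verify before relying on it. Continuing from the snippet above:
import os

for link in links:
    if not link.endswith('.zip'):
        continue
    name = link.rsplit('/', 1)[-1]
    m = re.search(r'(19|20)\d{2}', name)  # guess the year from the file name
    year = m.group(0) if m else 'unknown'
    os.makedirs(year, exist_ok=True)
    urllib.request.urlretrieve(link, os.path.join(year, name))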
I am trying to download a pdf from a webpage using urllib. I used the source link that downloads the file in the browser but that same link fails to download the file in Python. Instead what downloads is a redirect to the main page.
import os
import urllib  # Python 2; in Python 3 this would be urllib.request

os.chdir(r'/Users/file')
url = "http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414"
urllib.urlretrieve(url, "downloaded_file")
Please try downloading the file manually from the link provided or from the redirected site; the link on the main page is called 'sectionals'.
Your help is much appreciated.
It is because the given link redirects you to a "raw" PDF file. Examining the response headers (for example with Firebug or the browser's network tools), you can find the file name sectionals/2014/2607RAND.pdf; as it is relative to the current .aspx page, the required URL (in your case, the url variable) becomes http://www.australianturfclub.com.au/races/sectionals/2014/2607RAND.pdf
In Python 3:
import urllib.request
import shutil

# urlretrieve follows the redirect and saves the PDF to a temporary file
local_filename, headers = urllib.request.urlretrieve('http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414')
shutil.move(local_filename, 'ret.pdf')
shutil is used because urlretrieve saves to a temp folder (in my case that folder is on another partition, so os.rename would give an error).
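Equivalently with Requests, which follows the redirect and exposes the final URL (a minimal sketch):
import requests

r = requests.get('http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414')
print(r.url)  # the final URL after redirects, i.e. the raw PDF per the explanation above
with open('ret.pdf', 'wb') as f:
    f.write(r.content)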