I'm scraping a website that streams free movies using requests and BeautifulSoup, and I was able to get the streaming page. But I need to get the video source so I can stream/download the video, and I'm stuck here.
The video source is "src = blob:https://example.com/blabla....etc", and it's not the original source.
After googling blob sources, I found out that the original video source shows up in the network traffic itself
(you need to go to Network > find the stream.m3u8 > copy its URL).
How can I get that link with Python code?
To get the M3U8 link, you can search the page response with a regex:
import requests
import re

url = 'https://www.someexamplewebsite.com'
response = requests.get(url)

# search the page source for the first http(s) link ending in .m3u8
video_link = re.search(r'https?.*?\.m3u8', response.text)
if video_link:
    print(video_link.group(0))  # the m3u8 link
else:
    print('No m3u8 link found')
Please read the website's policy before using the m3u8 URL. Only use the m3u8 file if it's free to use; otherwise it can be a breach of policy.
More on M3U8 ->
M3U8 files are simple text files that contain information about audio/video located on the internet.
See: https://www.lifewire.com/m3u8-file-2621956
For playing an M3U8 file, you can use the HLS.js library (HLS itself is a streaming protocol developed by Apple; HLS.js is a community JavaScript implementation of it).
See: https://github.com/video-dev/hls.js
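If you want to save the stream from Python rather than play it in the browser, one option is to hand the m3u8 URL to ffmpeg. A minimal sketch, assuming ffmpeg is installed and the stream is not DRM-protected (the URL is a placeholder):

import subprocess

m3u8_url = 'https://example.com/path/stream.m3u8'  # placeholder: the link found by the regex above

# ffmpeg downloads every segment listed in the playlist and stitches them
# into one file; '-c copy' copies the streams without re-encoding
subprocess.run(['ffmpeg', '-i', m3u8_url, '-c', 'copy', 'video.mp4'], check=True)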
I'm trying to grab snowfall data from the National Weather Service at this site:
https://www.nohrsc.noaa.gov/snowfall/
The data can be downloaded via the webpage with a 'click' on the file type in the drop-down, but I can't seem to figure out how to automate this using Python. They have an FTP archive, but it requires a login and I can't access it for some reason.
However, since the files can be downloaded via a 'click' on the webpage interface, I imagine there must be a way to grab them using wget or urlopen. But I can't seem to figure out what the exact URL would be in order to use those functions. Does anyone have any ideas on how to download this data straight from the website listed above?
Thanks!
You can inspect the links with the Chrome DevTools console.
Press F12, open the Network tab, then click on the file type:
Here is the URL: https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc
You can download it with Python using the Requests library:
import requests
r = requests.get('https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc')
data = r.content  # file content (bytes)
Or you can just save it to a file with urlretrieve:
from urllib.request import urlretrieve
url = 'https://www.nohrsc.noaa.gov/snowfall/data/202112/sfav2_CONUS_6h_2021122618_grid184.nc'
dst = 'data.nc'
urlretrieve(url, dst)
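If you need more dates, the path appears to encode a year/month folder and a timestamp. A hedged sketch that builds the URL for a given date, assuming this naming pattern (taken from the single example above) holds across the archive:

from datetime import datetime
import requests

def download_6h_grid(ts):
    # Assumed pattern: data/YYYYMM/sfav2_CONUS_6h_YYYYMMDDHH_grid184.nc
    name = 'sfav2_CONUS_6h_{:%Y%m%d%H}_grid184.nc'.format(ts)
    url = 'https://www.nohrsc.noaa.gov/snowfall/data/{:%Y%m}/{}'.format(ts, name)
    r = requests.get(url)
    r.raise_for_status()  # fail loudly if the pattern guess is wrong
    with open(name, 'wb') as f:
        f.write(r.content)

download_6h_grid(datetime(2021, 12, 26, 18))  # the file from the example above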
So I'm trying to add file size info next to the hyperlinks to my files. The files are on a server, and on the website I have multiple hyperlinks to files to download.
I've found a Python solution, but as I understand it, it only gets the size of a single specified URL:
import requests  # the requests module

url = "https://speed.hetzner.de/100MB.bin"  # just a dummy file URL

info = requests.head(url)  # fetch only the header information
print(info.headers)  # print the details; Content-Length is the size in bytes
How can I change this so that all hyperlinks automatically get a file size?
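A minimal sketch of one approach, assuming the links live on a single page (the page URL below is a placeholder) and that the server reports Content-Length: fetch the page, collect every href with BeautifulSoup, and send a HEAD request per link.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = 'https://example.com/downloads'  # placeholder: the page with the hyperlinks

soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
for a in soup.find_all('a', href=True):
    file_url = urljoin(page_url, a['href'])  # resolve relative links
    head = requests.head(file_url, allow_redirects=True)
    size = head.headers.get('Content-Length')  # bytes, if the server reports it
    print(file_url, size)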
Currently I have a CSV file with a large number of links (900+) to download files from. What I wish to do is download all the files from this CSV file; however, in order to download the files I need to log into the website, which is done by navigating to a specific page on the site and logging in from there.
I can set up a login session via Selenium and use repeated driver.get commands to initiate the downloads, but in my experience this tends not to work.
Wget is an option for retrieving the files by iterating over the links in the file, but it doesn't get around the issue that the website requires a login.
So in short, my question is: what is the most efficient way to iterate over a series of download links located in a CSV file, download all the files from those links, and maintain a login session while doing so?
EDIT: Currently testing with requests:
import requests

s = requests.Session()
print(s.cookies.get_dict())  # empty before any request
s.get("URL of Landing page to generate cookies")
print(s.cookies.get_dict())  # cookies set by the landing page
s.get("Login page URL")
Use the urllib.request module and its HTTPBasicAuthHandler() class. So you could:
import urllib.request as ur
mgr = ur.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, 'url', 'username', 'password') # where url is each url
auth = ur.HTTPBasicAuthHandler(mgr)
opener = ur.build_opener(auth)
rsp = opener.open('url/at/some/path').read()
However you want to iterate through the CSV to build a list of URLs and make queries is up to you.
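For completeness, a hedged sketch of that loop, assuming the CSV has one URL per row in its first column and that the site really uses HTTP Basic auth (if it uses a login form instead, posting the credentials with a requests.Session is the usual route):

import csv
import os
import urllib.request as ur

mgr = ur.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, 'https://example.com/', 'username', 'password')  # placeholder site root
opener = ur.build_opener(ur.HTTPBasicAuthHandler(mgr))

with open('links.csv', newline='') as f:  # assumed layout: one URL per row, first column
    for row in csv.reader(f):
        file_url = row[0]
        filename = os.path.basename(file_url) or 'download'
        with opener.open(file_url) as rsp, open(filename, 'wb') as out:
            out.write(rsp.read())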
I am trying to download a PDF from a webpage using urllib. I used the source link that downloads the file in the browser, but that same link fails to download the file in Python; instead, what downloads is a redirect to the main page.
# Python 2
import os
import urllib

os.chdir(r'/Users/file')
url = "http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414"
urllib.urlretrieve(url, "downloaded_file")
Please try downloading the file manually from the link provided or from the redirected site; the link on the main page is called 'sectionals'.
Your help is much appreciated.
It is because the given link redirects you to a "raw" PDF file. Examining the response headers via Firebug, I was able to get the filename sectionals/2014/2607RAND.pdf. As it is relative to the current .aspx file, the required URI (in your case, the url variable) should be changed to http://www.australianturfclub.com.au/races/sectionals/2014/2607RAND.pdf
In Python 3:
import urllib.request
import shutil
local_filename, headers = urllib.request.urlretrieve('http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414')
shutil.move(local_filename, 'ret.pdf')
The shutil is there because Python saves to a temp folder (in my case that's another partition, so os.rename would give me an error).
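Note that urlretrieve also accepts the destination filename as its second argument, which avoids the temp file and the shutil.move step entirely:

import urllib.request
urllib.request.urlretrieve(
    'http://www.australianturfclub.com.au/races/SectionalsMeeting.aspx?meetingId=2414',
    'ret.pdf')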
First post here, any help would be greatly appreciated :).
I'm trying to scrape from a website with an embedded PDF viewer. As far as I can tell, there is no way to directly download the PDF file.
The browser displays the PDF as multiple PNG image files; the problem is that the PNG files aren't directly accessible either. They are rendered from the original PDF and then displayed.
The URL, with the header stripped out, is in the code block below. The original URL to the PDF viewer and the link used to render the PDF are included in the code as well (I'm using the second URL).
My strategy here is to pull the viewstate and eventvalidation fields using urllib, then use wget to download all files from the site. This method does work without POST data (page 1). I am getting the rest of the parameters from Fiddler (a sniffing tool).
But when I use POST data to specify the page, I get 405 errors like the ones below when trying to download the image files. However, it downloads the actual HTML page without a problem, just none of the PNG files that go along with it. Here is an example of the wget errors:
HTTP request sent, awaiting response... 405 Method Not Allowed
2014-03-27 17:09:38 ERROR 405: Method Not Allowed.
Since I can't access the image file link directly, I thought grabbing the entire page with wget would be my best bet. If anyone knows a better alternative, please let me know :). The POST data seems to work at least partially, since the downloaded HTML file is set to the page I specified in the parameters.
According to Fiddler, the site automatically does a GET request for the image file. I'm not quite sure how to emulate this, however.
Any help is appreciated, thanks for your time!
# Python 2
import os
import re
import urllib
import urllib2
from bs4 import BeautifulSoup

imglink = 'http://201.150.36.178/consultaexpedientes/render/2132495e-863c-4b96-8135-ea7357ff41511.png'
origurl = 'http://201.150.36.178/consultaexpedientes/sistemas/boletines/wfBoletinVisor.aspx?tomo=1&numero=9760&fecha=14/03/2014%2012:40:00'
url = 'http://201.150.36.178/consultaexpedientes/usercontrol/Default.aspx?name=e%3a%5cBoletinesPdf%5c2014%5c3%5cBol_9760.pdf%7c0'

# pull the hidden ASP.NET state fields from the viewer page
f = urllib2.urlopen(url)
html = f.read()
soup = BeautifulSoup(html, 'html.parser')
eventargs = soup.findAll(attrs={'type': 'hidden'})
reValue = re.compile(r'value="(.*)"', re.DOTALL)
viewstate = re.findall(reValue, str(eventargs[0]))[0]
validation = re.findall(reValue, str(eventargs[1]))[0]

params = urllib.urlencode({'__VIEWSTATE': viewstate,
                           '__EVENTVALIDATION': validation,
                           'PDFViewer1$PageNumberTextBox': 6,
                           'PDFViewer1_BookmarkPanelScrollX': 0,
                           'PDFViewer1_BookmarkPanelScrollY': 0,
                           'PDFViewer1_ImagePanelScrollX': 0,
                           'PDFViewer1_ImagePanelScrollY': 0,
                           'PDFViewer1$HiddenPageNumber': 6,
                           'PDFViewer1$HiddenAplicaMarcaAgua': 0,
                           'PDFViewer1$HiddenBrowserWidth': 1920,
                           'PDFViewer1$HiddenBrowserHeight': 670,
                           'PDFViewer1$HiddenPageNav': ''})

command = '/usr/bin/wget -E -H -k -K -p --post-data="%s" %s' % (params, url)
print command
os.system(command)
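To emulate that GET from Python instead of shelling out to wget, one option is to do the POST yourself, pull the img tags out of the returned HTML, and then GET each image with the same session so the ASP.NET cookies are reused. A hedged sketch using requests and Python 3; how exactly the viewer embeds the PNGs is an assumption:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = ('http://201.150.36.178/consultaexpedientes/usercontrol/'
       'Default.aspx?name=e%3a%5cBoletinesPdf%5c2014%5c3%5cBol_9760.pdf%7c0')

s = requests.Session()

# first GET: pick up the session cookie and the hidden ASP.NET form fields
soup = BeautifulSoup(s.get(url).text, 'html.parser')
params = {i['name']: i.get('value', '')
          for i in soup.find_all('input', type='hidden') if i.has_attr('name')}
params['PDFViewer1$PageNumberTextBox'] = '6'  # page to render, as above
params['PDFViewer1$HiddenPageNumber'] = '6'

# the POST selects the page; the returned HTML references the rendered PNGs
resp = s.post(url, data=params)
soup = BeautifulSoup(resp.text, 'html.parser')
for img in soup.find_all('img', src=True):  # assumption: the PNGs appear as <img> tags
    png_url = urljoin(url, img['src'])
    png = s.get(png_url)  # a plain GET with the same session, like the browser does
    with open(png_url.rsplit('/', 1)[-1], 'wb') as f:
        f.write(png.content)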