I would like to make a script (in any language, but preferably Python or Perl) that downloads a specific type of file being streamed by a web page. However, I do not know the file's location, so I will have to find it by listing all the files the page streams and selecting the one I want based on file type.
A similar example: say I want to download a video off YouTube, but there is no pattern or way to find the URL except by finding the files being streamed to my computer.
The part I cannot figure out is how to find all the files being streamed by the page; the rest I can do myself. The file name is not mentioned anywhere in the source of the HTML page.
Example of the problem...
This works fine:
import urllib
urllib.urlretrieve("http://example.com/anything.mp3", "a.mp3")
However this does not:
import urllib
urllib.urlretrieve("http://example.com/page-where-the-mp3-file-is-being-streamed.html", "a.mp3")
If someone can help me figure out how to download all the files from a page, or how to find the files being streamed, I would really appreciate it. All I need to know is which language/library/method can accomplish this. Thanks
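The media URL often appears somewhere in the page's raw HTML or in an XHR response even when it isn't in an obvious tag, so a first step is to fetch the page and scan the text for URLs with the wanted extension. Below is a minimal sketch of that idea using only the standard library; the URL is a placeholder, and if the file URL is injected by JavaScript it will only show up in the browser's Network tab, not in the static HTML.

```python
import re
import urllib.request

def find_media_urls(html: str, ext: str = ".mp3"):
    """Scan raw page text for absolute URLs ending in `ext`."""
    pattern = r'https?://[^\s\'"<>]+' + re.escape(ext)
    return sorted(set(re.findall(pattern, html)))

def scan_page(url: str, ext: str = ".mp3"):
    """Fetch `url` and return every media URL of the given type found in it."""
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return find_media_urls(html, ext)
```

If `scan_page("http://example.com/page.html")` comes back empty, the stream is most likely requested by script at runtime, and the Network tab of the web inspector is the place to look for it.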
Thank you, all you wonderful people out there, for reading this post and for your help.
For the URL below, I have been trying to understand how to download the Excel files that are fetched after clicking the "Download Data" hyperlink. On inspecting this element, I get something like "::before"; I am not sure what this is.
https://www.moneycontrol.com/mutual-funds/find-fund/returns?&amc=AXMF&EXCLUDE_FIXED_MATURITY_PLANS=Y
I have downloaded files in somewhat similar cases in the past, where such buttons contained a URL pointing directly to the file. I then used the requests library to get a bytes response and save the file locally.
However, in this case, I am not able to find the URL to send the request to.
Cheers,
Aakash
The issue I am having isn't that there is a link to a PDF on the web that I am trying to scrape and download onto my PC (it doesn't end in .pdf). I have a download link that I want to activate, which would then download a PDF onto my computer. It looks like this:
https://***.com/files/4122109/download?download_frd=1&verifier=xxx
When I click the link, it verifies that I am the user I claim to be, and then lets me download the file with the ID contained in the query above. The content-type for this file is "application/pdf", so I know it downloads a PDF file for me. I just need a library that "clicks" or "activates" the download for me.
Also, I am trying to do this for all the URLs I am pulling from a course on Canvas via a GET request. I am not trying to use Selenium here because I am getting these URLs from an API. Any advice on this approach would be highly appreciated.
If I go to this website:
https://covid.cdc.gov/covid-data-tracker/#ed-visits
and click the "download" button (on the right), a .csv file is downloaded.
I can't find the address of that CSV file so that I could fetch it automatically with pd.read_csv(). I had a snoop around the web inspector, but I don't really know what I'm doing, and nothing jumped out at me as the obvious answer. I've also looked around various other sites to try to find an API that gives access to this data, but there doesn't appear to be such a thing.
Can anyone help me with that?
Thanks so much!
You might want to open your web inspector, go to the "Network" tab, and reload the page. You will see that no CSV file is ever actually loaded.
The export button doesn't link to any file either. Rather, it has a JavaScript binding that exports the data already present in your client (the browser) to your filesystem as a CSV. In other words: there isn't an address for that file; it is created in your browser.
Even better, you can read the JSON directly. Just find the correct request in the Network tab; I think it might be this one: https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=ed_trend_data
So instead you could read the JSON directly:
pd.read_json('https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=ed_trend_data')
and then filter for the data that you need.
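If pandas chokes on the payload's nesting, or you want to inspect it first, the same endpoint can be read with the standard library and the row list pulled out by key. The key name used below is a guess based on the `id=ed_trend_data` query parameter; inspect the real payload in the Network tab to confirm it.

```python
import json
import urllib.request

URL = ("https://covid.cdc.gov/covid-data-tracker/"
       "COVIDData/getAjaxData?id=ed_trend_data")

def rows_from_payload(raw: str, key: str = "ed_trend_data"):
    """Parse the endpoint's JSON text and return the list stored under `key`
    (an empty list if the key is absent)."""
    payload = json.loads(raw)
    return payload.get(key, [])

def fetch_rows(url: str = URL):
    """Download the JSON and return the row dicts, ready for pd.DataFrame()."""
    with urllib.request.urlopen(url) as resp:
        return rows_from_payload(resp.read().decode("utf-8"))
```

`pd.DataFrame(fetch_rows())` then gives the same table the page renders, and normal DataFrame filtering applies.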
How can I save a webpage, including its content, so that it is viewable offline, using urllib in Python? Currently I am using the following code:
import urllib.request
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
urllib.request.urlretrieve("http://www.yahoo.com", "C:\\Users\\karanjuneja\\Downloads\\kj\\yahoo.mhtml")
This works and stores an MHTML version of the webpage in the folder, but when you open the file you will only find the source code, not the page as it appears online. Do we need to make changes to the code?
Also, is there an alternate way of saving the webpage in MHTML format with all the content as it appears online, and not just the source? Any suggestions?
Thanks Karan
I guess this site might help you~
Create an MHTML archive
How can I download all the PDFs (or files with a specific extension, like .tif or .pdf) from a webpage that requires login? I don't want to log in every time for every PDF, so I can't use the link-generation-and-push-to-browser scheme.
The solution was simple; I am just posting it for others who may have the same question:
mydriver.get("https://username:password@www.somewebsite.com/somelink")
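Once the driver is logged in like this, the links with the wanted extension can be collected from the page source and fetched within the same session. A small, hypothetical helper for the link-extraction step (the base URL below is a placeholder matching the example above):

```python
import re
from urllib.parse import urljoin

def links_with_ext(html: str, base_url: str, ext: str = ".pdf"):
    """Return absolute URLs of all href targets in `html` ending in `ext`.
    A regex over href attributes is a rough sketch; an HTML parser is
    more robust for messy markup."""
    hrefs = re.findall(r'href=[\'"]([^\'"]+)[\'"]', html, flags=re.IGNORECASE)
    return [urljoin(base_url, h) for h in hrefs if h.lower().endswith(ext)]
```

Feeding it `mydriver.page_source` gives the full list of PDF URLs to download in a loop, with no repeated logins.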