Python getting files from sourceforge - python

Hello, I am making a program and I need it to download an exe from SourceForge. I have the download link, which leads to the "wait 5 seconds" page. How can I download the file from that page and save it to the current working directory?

The HTML file you get contains a refresh link:
<meta http-equiv="refresh" content="5; url=https://downloads.sourceforge.net/project/...">
You can search the HTML document for that element, extract the URL from its content attribute, and then download the file from that URL.
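As a rough sketch of the extraction step, the standard library's html.parser can pull the URL out of the meta element. The sample page, the example.org URL, and the names MetaRefreshParser and extract_refresh_url are made up for illustration. The code only parses a string and does not download anything; whether fetching the target is permitted is a separate question.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Collects the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh_url is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=https://..."
        content = attributes.get("content") or ""
        _, _, target = content.partition(";")
        key, _, value = target.strip().partition("=")
        if key.strip().lower() == "url":
            self.refresh_url = value.strip().strip("'\"")


def extract_refresh_url(html_text):
    """Return the meta refresh target URL, or None if the page has none."""
    parser = MetaRefreshParser()
    parser.feed(html_text)
    return parser.refresh_url


# Hypothetical page content, standing in for the real "wait 5 seconds" page.
sample_html = (
    '<html><head><meta http-equiv="refresh" '
    'content="5; url=https://downloads.example.org/project/demo/setup.exe">'
    "</head><body>Your download will start shortly.</body></html>"
)
print(extract_refresh_url(sample_html))
```

extract_refresh_url returns None when the page has no refresh element, so the caller can tell a missing link apart from a parsed one.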
However, remember to respect the site's robots.txt file, i.e. leave a delay of at least one second between requests and do not try to download disallowed paths.
Edit: Actually, the downloads subdomain has its own robots.txt that prohibits all automated downloads, so you should not do this. Instead, you could open the link in the user's web browser.
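A minimal sketch of that check, using urllib.robotparser from the standard library. The robots.txt lines and the example.org URL below are invented for illustration and are not SourceForge's actual rules; is_allowed and hand_off_to_browser are hypothetical helper names.

```python
import urllib.robotparser
import webbrowser


def is_allowed(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def hand_off_to_browser(url):
    """Let the user's default browser handle the download instead."""
    return webbrowser.open(url)


# Assumed example rules: the first set forbids everything, the second only /private/.
blocking_rules = ["User-agent: *", "Disallow: /"]
permissive_rules = ["User-agent: *", "Disallow: /private/"]
download_url = "https://downloads.example.org/project/demo/setup.exe"

print(is_allowed(blocking_rules, download_url))
print(is_allowed(permissive_rules, download_url))
```

In a real program you would load the rules from the site's own robots.txt and call hand_off_to_browser(download_url) whenever is_allowed returns False.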

Related

Not able to scrape the response from clicking on a Button using Python Scrapy

Thank you to all the wonderful people out there for reading this post and for your help.
For the URL below, I have been trying to work out how to get the Excel files that are downloaded after clicking the "Download Data" hyperlink. On inspecting this element, I get something like "::before". I'm not sure what this is.
https://www.moneycontrol.com/mutual-funds/find-fund/returns?&amc=AXMF&EXCLUDE_FIXED_MATURITY_PLANS=Y
I have downloaded files in somewhat similar cases in the past, where such buttons contained a URL pointing to the file. I then used the requests library to get a bytes response and saved the file locally.
However, in this case, I am not able to find the URL to send the request to.
Cheers,
Aakash

List all media and document files loaded with webpage with requests python

I'm looking for a way to list all the files a page loads, using the requests module.
Something like Chrome's Inspector Network tab, where you can see all kinds of files that have been loaded by the webpage.
The problem is that the file I want to fetch (a .pdf file in this case) does not have a specific tag. I guess the webpage loads it with JavaScript and AJAX, because even after the page has loaded completely I couldn't find a tag that links to the .pdf file or anything like that. So every time I have to go to the Network tab, reload the page, and find the file in the list of loaded resources.
Is there any way to catch all the loaded files and list them using the requests module?
When a browser loads an HTML file, it then interprets the contents of that file. It may discover that there is a <script> tag referencing an external JavaScript URL. The browser will then issue a GET request to retrieve that file. When that file is received, the browser interprets it by executing the code within. That code might contain AJAX calls that in turn fetch more files. Or the HTML file may reference an external CSS file with a <link> tag or an image file with an <img> tag. These files will also be loaded by the browser and can be seen when you run the browser's inspector.
In contrast, when you do a GET request with the requests module for a particular URL, only that one page is fetched. There is no logic to interpret the contents of the returned page and fetch the images, style sheets, JavaScript files, etc. that are referenced within it.
You can, however, use Python to automate a browser using a tool such as Selenium WebDriver, which can be used to fully download a page.
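To make the difference concrete, here is a small sketch that lists only the resources referenced statically in the HTML, using html.parser from the standard library. The sample page, the example.org base URL, and the names ResourceLister and list_static_resources are assumptions for illustration. Anything requested by JavaScript at run time, like the PDF in the inline script below, does not show up, which is why a real browser is needed for that case.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceLister(HTMLParser):
    """Collects URLs referenced by <script src>, <img src> and <link href>."""

    # Maps a tag name to the attribute that holds its resource URL.
    URL_ATTRIBUTES = {"script": "src", "img": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attribute = self.URL_ATTRIBUTES.get(tag)
        if attribute is None:
            return
        value = dict(attrs).get(attribute)
        if value:
            # Resolve relative references against the page URL.
            self.resources.append(urljoin(self.base_url, value))


def list_static_resources(html_text, base_url):
    lister = ResourceLister(base_url)
    lister.feed(html_text)
    return lister.resources


# Hypothetical page: three static references plus one file fetched by script.
sample_html = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="app.js"></script>
</head><body>
<img src="images/logo.png">
<script>fetch("/files/report.pdf")</script>
</body></html>
"""
for url in list_static_resources(sample_html, "https://www.example.org/docs/"):
    print(url)
```

With requests you would pass response.text and response.url to list_static_resources, but the result would still miss everything the page's scripts load later.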

urllib.urlretrieve() cannot download the file

Hi, there's a button on a web page that downloads a file when you click it.
Say the corresponding URL looks like this:
http://www.mydata.com/data/filedownload.aspx?e=MyArgu1&k=kfhk22wykq
If I put this URL in the browser's address bar, it downloads the file properly as well.
Now I do this in Python:
urllib.urlretrieve(url, "myData.csv")
The CSV file is empty. Any suggestions, please?
This may not be possible with every website. If the link contains a token, Python is unlikely to be able to use it as-is, because the token is tied to your browser session.
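If you want to experiment anyway, one thing to try is sending the same headers your browser sends and failing loudly when the body comes back empty. This is only a sketch under assumptions: that the site accepts a copied Cookie header at all (it may not), and that you are on Python 3, where urlretrieve lives in urllib.request. The names build_download_request and download, and the cookie value, are hypothetical.

```python
import urllib.request


def build_download_request(url, cookie_header=None, user_agent="Mozilla/5.0"):
    """Build a request that carries browser-like headers."""
    headers = {"User-Agent": user_agent}
    if cookie_header:
        # Assumption: a cookie copied from the browser session is enough.
        headers["Cookie"] = cookie_header
    return urllib.request.Request(url, headers=headers)


def download(url, destination, cookie_header=None):
    """Save the response body to a file, refusing to write an empty one."""
    request = build_download_request(url, cookie_header)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    if not body:
        raise ValueError("server returned an empty body for " + url)
    with open(destination, "wb") as output:
        output.write(body)
    return len(body)


# Only builds the request here; nothing is sent over the network.
example = build_download_request(
    "http://www.mydata.com/data/filedownload.aspx?e=MyArgu1&k=kfhk22wykq",
    cookie_header="session=placeholder",
)
print(example.get_header("Cookie"))
```

Calling download(url, "myData.csv", cookie_header=...) would then either write a non-empty file or raise, instead of silently leaving an empty CSV behind.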

download file link produces 404 page

I am trying to create a link that allows a user to download a zip file that's been generated earlier in the python script. The script then writes an HTML link to a web page. The user should be able to click the link and download their zip file.
import os, sys
# zipFileName is set earlier in the script, when the zip file is generated
downloadZip = "http://<server>/folder/structure/here/" + zipFileName + ".zip"
print("""<h3><a href="{}" download>Download zip file</a></h3>""".format(downloadZip))
The result is a link that when clicked opens a 404 page. I've noticed that on that page, it displays
Physical Path C:\inetpub\wwwroot\inputted\path\here\file.zip
I am testing this on the same server the processing is occurring on. I wouldn't think that should make a difference, but here I am. The end result should be a zip file downloaded to the user's pc.
Not sure if this will be helpful, but I have noticed that some server package apps disallow the execution or download of certain file types. I had a similar thing happen years ago.
To test whether this is the case, create a new folder in your web directory and add an index.html page with some random text (so you can tell that you have the correct page). Check that you can access this page.
Next, create a .zip file, put it in the same folder as the index.html file you just created, and add a download link to it on the index.html page.
Now revisit the page and try to download the file you created. If it works, the problem is elsewhere. If it doesn't, whatever server package application you are using has probably configured the web server (Apache, for example) to block .zip files by default. Hope this helps, buddy :)
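If you'd rather script that test setup, something along these lines should do it. It is a sketch, not a fix: the folder name ziptest, the file names, and the function create_download_test_folder are made up, and a temporary directory stands in for your real web root.

```python
import os
import tempfile
import zipfile


def create_download_test_folder(web_root, folder_name="ziptest"):
    """Create a folder holding a small zip file and an index.html linking to it."""
    folder = os.path.join(web_root, folder_name)
    os.makedirs(folder, exist_ok=True)

    # A tiny but valid archive to download.
    zip_path = os.path.join(folder, "sample.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("hello.txt", "test archive\n")

    # A recognisable page with a relative link to the archive.
    index_path = os.path.join(folder, "index.html")
    with open(index_path, "w", encoding="utf-8") as page:
        page.write(
            "<html><body><p>zip download test page</p>"
            '<a href="sample.zip" download>Download zip file</a>'
            "</body></html>"
        )
    return index_path, zip_path


# A temporary directory stands in for the real web root in this demo.
with tempfile.TemporaryDirectory() as fake_web_root:
    index_path, zip_path = create_download_test_folder(fake_web_root)
    print(os.path.basename(index_path), os.path.getsize(zip_path) > 0)
```

Point web_root at the actual web directory, open the ziptest page in a browser, and click the link to see whether the server serves the .zip file.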

python mechanize blank download or how to do it in casperjs

I am downloading information for a research project from a site that uses AJAX to load URLs and does not allow serial downloading. I dump the URLs from CasperJS into a file, read that file, and use browser.retrieve(url, dump_filename) to download the information with mechanize. I mostly get blank files, but occasionally a download contains the content. Is there a way to modify the headers so that I always get data? A CasperJS download alternative is also welcome. I have tried CasperJS download(), but it saves a blank file as well. I think it has something to do with the headers. File downloads always work in a browser.
I prefer Selenium over mechanize when it comes to more "sophisticated" websites that use AJAX, JavaScript, etc.
You said downloading works when you're using your browser. Well, Selenium does the same thing - it drives a real browser on your desktop, such as Firefox, to carry out its tasks.
