In Python, urllib.urlretrieve downloads a file which says "Go away"

I'm trying to download APK files from links such as https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041. When you enter the link in your browser, it brings up a dialog to open or save the file.
I would like to save the file using a Python script. I've tried the following:
import urllib

download_link = 'https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041'
download_file = '/tmp/apkmirror_test/youtube.apk'

if __name__ == "__main__":
    urllib.urlretrieve(url=download_link, filename=download_file)
but the resulting youtube.apk contains only the words "Go away".
Since I am able to download the file by pasting the link in my browser's address bar, there must be some difference between that and urllib.urlretrieve that makes this not work. Can someone explain this difference and how to eliminate it?

You should not programmatically access that download page as it is disallowed in the robots.txt:
https://www.apkmirror.com/robots.txt
That being said, the difference is in your request headers: urllib sets the User-Agent to something like "Python-urllib/x.y" by default, and that is the most likely way the server detects and blocks your script.
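For illustration only (remember, the robots.txt disallows automated access to that page), here is a minimal sketch of how a script can send a browser-like User-Agent with urllib2 instead of the default:

import urllib2

# Illustrative sketch only: apkmirror.com's robots.txt disallows
# automated access to this download page, so don't actually do this there.
req = urllib2.Request(
    download_link,
    headers={'User-Agent': 'Mozilla/5.0'})  # browser-like UA instead of "Python-urllib/x.y"
with open(download_file, 'wb') as f:
    f.write(urllib2.urlopen(req).read())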

Related

Python 'corrupts' downloaded pdf file

I am writing a program to automatically download a Wikipedia page as pdf, using their embedded tool.
I was able to fix an earlier problem where I couldn't retrieve the data behind a submit button. The new problem is that I can download the file (I used both open() and urllib.request.urlretrieve()), but I can't open it manually afterwards.
It seems the file is corrupted during the download (I suspect an encoding failure). This prevents the PDF from opening ("unsupported data type or corrupted file").
This is the code I use:
import requests
import urllib.request

base_url = 'https://de.wikipedia.org/wiki/'

def createURL(base):
    title = 'Rektifikation (Verfahrenstechnik)'
    name = title.replace(" ", "_")
    url = (base + name).replace('(', '%28').replace(')', '%29')
    print(url)
    getPDF(url, title)

def getPDF(url, title):
    r = requests.get(url, allow_redirects=True)
    open('{}.pdf'.format(title), 'wb').write(r.content)
    urllib.request.urlretrieve(
        'https://de.wikipedia.org/w/index.php?title=Spezial:ElectronPdf'
        '&page=Rektifikation+%28Verfahrenstechnik%29&action=show-download-screen',
        '{}_vl.pdf'.format(title))

createURL(base_url)
I hardcoded most of the values for debugging purposes. Feel free to suggest code improvements, but that isn't my main concern here.
My question now is: what can I do to stop the file from getting corrupted (encode it the right way)?
This is the link (click the button) I'm trying to download from.
Note: It's an instant download link with a redirect.
Thanks for your help, if you need any more information just ask me.
EDIT: Opening the PDF in Word shows that the data (text, paragraphs, etc.) is there, so the PDF contains the data and the download itself seems to be successful.
The sizes of the downloaded files also differ; maybe someone can take a look at that too:
open: 58KB
urllib: 18KB
manually: 239KB
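For reference, a quick sanity check (illustrative only) of what the server actually returns for that URL:

import requests

url = ('https://de.wikipedia.org/w/index.php?title=Spezial:ElectronPdf'
       '&page=Rektifikation+%28Verfahrenstechnik%29&action=show-download-screen')
r = requests.get(url, allow_redirects=True)
# If this prints text/html rather than application/pdf, the bytes being
# saved are a download screen, not the PDF itself.
print(r.status_code, r.headers.get('Content-Type'), len(r.content))
print(r.content[:5])  # a real PDF starts with b'%PDF-'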

urllib2 download HTML file

Using urllib2 in Python 2.7.4, I can readily download an Excel file:
output_file = 'excel.xls'
url = 'http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
This results in the expected file that I can use as I wish.
However, trying to download just an HTML file gives me an empty file:
output_file = 'webpage.html'
url = 'http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
I had the same results using urllib. There must be something simple I'm missing or don't understand. How do I download an HTML file from a URL? Why doesn't my code work?
If you want to download files or simply save a webpage, you can use urlretrieve (from the urllib library) instead of read and write.
import urllib
urllib.urlretrieve("http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html","doc.html")
#urllib.urlretrieve("url","save as..")
If you need to set a timeout, put this at the start of your file:
import socket
socket.setdefaulttimeout(25)  # seconds
I'm also on Python 2.7.4 (on OS X 10.9), and this code works fine for me, so I think some other problem is preventing it from working. Can you open "http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls" in your browser?
This may not directly answer the question, but if you're working with HTTP and have sufficient privileges to install Python packages, I'd really recommend doing this with requests. There's a related answer here: https://stackoverflow.com/a/13137873/45698
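For example, a minimal sketch of the same HTML download using requests (assuming the package is installed):

import requests

url = 'http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html'
resp = requests.get(url)
resp.raise_for_status()  # surface HTTP errors instead of silently writing an empty file
with open('webpage.html', 'wb') as f:
    f.write(resp.content)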

How do you save a Google Sheets file as CSV from Python 3 (or 2)?

I am looking for a simple way to save a CSV file originating from a published Google Sheets document. Since it's published, it's accessible through a direct link (modified on purpose in the example below).
All my browsers will prompt me to save the csv file as soon as I launch the link.
Neither:
DOC_URL = 'https://docs.google.com/spreadsheet/ccc?key=0AoOWveO-dNo5dFNrWThhYmdYW9UT1lQQkE&output=csv'
f = urllib.request.urlopen(DOC_URL)
cont = f.read(SIZE)
f.close()
cont = str(cont, 'utf-8')
print(cont)
, nor:
req = urllib.request.Request(DOC_URL)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1284.0 Safari/537.13')
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
prints anything but HTML content.
(I tried the 2nd version after reading this other post: Download google docs public spreadsheet to csv with python.)
Any idea what I am doing wrong? I am logged out of my Google account, if that's worth anything, but this works from any browser I've tried. As far as I understand, the Google Docs API has not yet been ported to Python 3, and given the "toy" magnitude of my little project for personal use, it wouldn't even make much sense to use it from the get-go if I can circumvent it.
In the 2nd attempt, I kept the 'User-Agent' header, thinking that requests which look like they come from scripts (because no identifying info is present) might be ignored, but it didn't make a difference.
While the requests library is the gold standard for HTTP requests from Python, this style of download is (while not deprecated yet) not likely to last, specifically referring to the use of links, managing cookies & redirects, etc. One of the reasons for not preferring links is that it's less secure and generally such access should require authorization. Instead, the currently accepted way of exporting Google Sheets as CSV is by using the Google Drive API.
So why the Drive API? Isn't this supposed to be something for the Sheets API instead? Well, the Sheets API is for spreadsheet-oriented functionality, i.e., data formatting, column resize, creating charts, cell validation, etc., while the Drive API is for file-oriented functionality, i.e., import/export, copy, rename, etc.
Below is a complete cmd-line solution. (If you don't do Python, you can use it as pseudocode and pick any language supported by the Google APIs Client Libraries.) For the code snippet, assume the most recent Sheet is named inventory (older files with that name are ignored) and that DRIVE is the API service endpoint:
import os

FILENAME = 'inventory'
SRC_MIMETYPE = 'application/vnd.google-apps.spreadsheet'
DST_MIMETYPE = 'text/csv'

# query for latest file named FILENAME
files = DRIVE.files().list(
    q='name="%s" and mimeType="%s"' % (FILENAME, SRC_MIMETYPE),
    orderBy='modifiedTime desc,name').execute().get('files', [])

# if found, export Sheets file as CSV
if files:
    fn = '%s.csv' % os.path.splitext(files[0]['name'].replace(' ', '_'))[0]
    print('Exporting "%s" as "%s"... ' % (files[0]['name'], fn), end='')
    data = DRIVE.files().export(
        fileId=files[0]['id'], mimeType=DST_MIMETYPE).execute()
    # if non-empty file
    if data:
        with open(fn, 'wb') as f:
            f.write(data)
        print('DONE')
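In case it helps, here is a sketch of one way DRIVE might be constructed with the google-api-python-client package (the creds variable is assumed to hold OAuth2 credentials you've already obtained):

from googleapiclient.discovery import build

# Assumes google-api-python-client is installed and `creds` holds valid
# OAuth2 credentials obtained elsewhere (e.g. via google-auth).
DRIVE = build('drive', 'v3', credentials=creds)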
If your Sheet is large, you may have to export it in chunks -- see this page on how to do that. If you're generally new to Google APIs, I have a (somewhat dated but) user-friendly intro video for you. (There are two videos after that which may be useful too.)
Google responds to the initial request with a series of cookie-setting 302 redirects. If you don't store and resubmit the cookies between requests, it redirects you to the login page.
So the problem is not the User-Agent header; it's that urllib.request.urlopen doesn't store cookies by default, even though it does follow the HTTP 302 redirects.
The following code works just fine on a public spreadsheet available at the location specified by DOC_URL:
>>> from http.cookiejar import CookieJar
>>> from urllib.request import build_opener, HTTPCookieProcessor
>>> opener = build_opener(HTTPCookieProcessor(CookieJar()))
>>> resp = opener.open(DOC_URL)
>>> # should really parse resp.getheader('content-type') for encoding.
>>> csv_content = resp.read().decode('utf-8')
Having shown you how to do it in vanilla Python, I'll now say that the Right Way™ to go about this is to use the most-excellent requests library. It is extremely well documented and makes these sorts of tasks incredibly pleasant to complete.
For instance, to get the same csv_content as above using the requests library is as simple as:
>>> import requests
>>> csv_content = requests.get(DOC_URL).text
That single line expresses your intent more clearly. It's easier to write and easier to read. Do yourself - and anyone else who shares your codebase - a favor and just use requests.

Python and Plone help

I'm using the Plone CMS and am having trouble with a Python script. I get a NameError: "global name 'open' is not defined". When I put the code in a separate Python script it works fine, and I know the information is being passed to the script because I can print the query. Code is below:
# Import a standard function, and get the HTML request and response objects.
from Products.PythonScripts.standard import html_quote

request = container.REQUEST
RESPONSE = request.RESPONSE

# Insert data that was passed from the form
query = request.query
#print query

f = open("blast_query.txt", "w")
for i in query:
    f.write(i)
return printed
I also have a second question: can I tell Python to open a file in a certain directory? For example, if the script is in one location (e.g. the home folder) but I want it to open a file at home/some_directory/some_directory, can that be done?
Python Scripts in Plone are restricted and have no access to the filesystem, so the open call is not available. You'll have to use an External Method or a full Python module to get full filesystem access.
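For example, a minimal sketch of what this could look like as an External Method (the module name, path, and function below are hypothetical; External Methods live on the filesystem, typically in the Extensions/ directory, and run unrestricted):

# Hypothetical Extensions/blast.py -- wire it up by adding an
# "External Method" object in the ZMI that points at this function.
import os

def save_query(self, query):
    # Unrestricted filesystem access is fine here, unlike in a
    # through-the-web Python Script; any directory works.
    path = os.path.join('/home/user/some_directory', 'blast_query.txt')
    f = open(path, 'w')
    for i in query:
        f.write(i)
    f.close()
    return 'query saved'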

Python urllib2 file upload problems

I'm currently trying to initiate a file upload with urllib2 and the urllib2_file library. Here's my code:
import sys
import urllib2_file
import urllib2
URL='http://aquate.us/upload.php'
d = [('uploaded', open(sys.argv[1:]))]
req = urllib2.Request(URL, d)
u = urllib2.urlopen(req)
print u.read()
I've placed this .py file in my My Documents directory and placed a shortcut to it in my Send To folder (the shortcut URL is ).
When I right click a file, choose Send To, and select Aquate (my python), it opens a command prompt for a split second and then closes it. Nothing gets uploaded.
I knew there was probably an error going on so I typed the code into CL python, line by line.
When I ran the u=urllib2.urlopen(req) line, I didn't get an error;
instead, the cursor simply started blinking on a new line beneath that line. I waited a couple of minutes to see if something would happen but it just stayed like that. To get it to stop, I had to press ctrl+break.
What's up with this script?
Thanks in advance!
[Edit]
Forgot to mention -- when I ran the script without the request data (the file) it ran like a charm. Is it a problem with urllib2_file?
[edit 2]:
import MultipartPostHandler, urllib2, cookielib, sys
import win32clipboard as w
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),MultipartPostHandler.MultipartPostHandler)
params = {"uploaded" : open("c:/cfoot.js") }
a=opener.open("http://www.aquate.us/upload.php", params)
text = a.read()
w.OpenClipboard()
w.EmptyClipboard()
w.SetClipboardText(text)
w.CloseClipboard()
That code works like a charm if you run it through the command line.
If you're using Python 2.5 or newer, urllib2_file is both unnecessary and unsupported, so check which version you're using (and perhaps upgrade).
If you're using Python 2.3 or 2.4 (the only versions supported by urllib2_file), try running the sample code and see if you have the same problem. If so, there is likely something wrong either with your Python or urllib2_file installation.
EDIT:
Also, you don't seem to be using either of urllib2_file's two supported formats for POST data. Try using one of the following two lines instead:
d = ['uploaded', open(sys.argv[1])]
## --OR-- ##
d = {'uploaded': open(sys.argv[1])}
First, there's a third way to run Python programs.
From cmd.exe, type python myprogram.py. You get a nice log. You don't have to type stuff one line at a time.
Second, check the urllib2 documentation. You'll need to look at urllib as well.
A Request requires a URL and a urlencoded buffer of data:
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
You need to encode your data.
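For instance, a minimal sketch of encoding ordinary form fields (note this produces a urlencoded form body, not a multipart upload, so it won't send file contents by itself):

import urllib
import urllib2

# Illustrative only: urlencode a mapping into the
# application/x-www-form-urlencoded format that Request expects.
data = urllib.urlencode({'uploaded': 'example value'})
req = urllib2.Request('http://aquate.us/upload.php', data)
response = urllib2.urlopen(req)
print response.read()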
If you're still on Python 2.5, what worked for me was to download the code here:
http://peerit.blogspot.com/2007/07/multipartposthandler-doesnt-work-for.html
and save it as MultipartPostHandler.py
then use:
import urllib2, MultipartPostHandler
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
or if you need cookies:
import urllib2, MultipartPostHandler, cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
