How do I turn off WRITEFUNCTION and WRITEDATA?
Using pycurl, I have a class called curlUtil. In it I have pageAsString(self, URL), which returns a string. To do this I setopt WRITEFUNCTION.
Now in downloadFile(self, URL, fn, overwrite=0) I do an open and self.c.setopt(pycurl.WRITEFUNCTION, 0), which causes problems: an int is not a valid argument.
I then assumed WRITEDATA would overwrite the value, or that there would be a NOWRITEFUNCTION option. NOWRITEFUNCTION didn't exist, so I just used WRITEDATA and Python crashed.
I wrote a quick function called reboot() which closes curl, opens it again, and calls reset to put it in the default state. I call it in both pageAsString and downloadFile and there is no problem at all. But I don't want to reinitialize curl; there might be some special options I set.
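Roughly, reboot() looks like this (a sketch of the workaround, not the exact code):
import pycurl

def reboot(self):
    # Throw the old handle away and start from a clean, default-state one.
    self.c.close()
    self.c = pycurl.Curl()
    self.c.reset()  # redundant on a fresh handle, but mirrors what I described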
Using WRITEFUNCTION, instead of turning it off, would save you a lot of trouble. You might want to rewrite your pageAsString by utilizing WRITEFUNCTION.
As an example:
import pycurl
from cStringIO import StringIO

c = pycurl.Curl()
buffer = StringIO()
c.setopt(pycurl.WRITEFUNCTION, buffer.write)
c.setopt(pycurl.URL, "http://example.com")
c.perform()
...
buffer.getvalue()  # will return the data fetched.
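The same idea covers downloadFile: instead of unsetting anything, point WRITEFUNCTION at the target file's write method for that transfer (a sketch, reusing the self.c handle from your question):
f = open(fn, 'wb')
self.c.setopt(pycurl.WRITEFUNCTION, f.write)  # bytes now go straight to the file
self.c.setopt(pycurl.URL, URL)
self.c.perform()
f.close()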
I am retrieving data files from an FTP server in a loop with the following code:
import io
import gzip
import urllib.request

response = urllib.request.urlopen(url)
data = response.read()
response.close()
compressed_file = io.BytesIO(data)
gin = gzip.GzipFile(fileobj=compressed_file)
Retrieving and processing the first few works fine, but after a few requests I get the following error:
530 Maximum number of connections exceeded.
I tried closing the connection (see the code above) and using a sleep() timer, but neither of these worked. What am I doing wrong here?
Trying to make urllib do FTP properly makes my brain hurt. By default, it creates a new connection for each file, apparently without really ensuring that the connections close.
I think ftplib is more appropriate.
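For example, a minimal sketch (the host, paths and filenames are placeholders) that keeps a single FTP connection open for all the downloads:
import ftplib
import io

# One connection, opened once and reused for every file, instead of
# urllib opening (and not always closing) a fresh one per request.
with ftplib.FTP('ftp.example.com') as ftpconn:
    ftpconn.login()
    for name in ('file1.gz', 'file2.gz'):
        buf = io.BytesIO()
        ftpconn.retrbinary('RETR pub/data/' + name, buf.write)
        buf.seek(0)
        # ... decompress / process buf here ...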
Since I happen to be working on the same data you are (were)... Here is a very specific answer, decompressing the .gz files and passing them into ish_parser (https://github.com/haydenth/ish_parser).
I think it is also clear enough to serve as a general answer.
import ftplib
import io
import gzip
import ish_parser # from: https://github.com/haydenth/ish_parser
ftp_host = "ftp.ncdc.noaa.gov"
parser = ish_parser.ish_parser()
# identifies what data to get
USAF_ID = '722950'
WBAN_ID = '23174'
YEARS = range(1975, 1980)
with ftplib.FTP(host=ftp_host) as ftpconn:
    ftpconn.login()

    for year in YEARS:
        ftp_file = "pub/data/noaa/{YEAR}/{USAF}-{WBAN}-{YEAR}.gz".format(USAF=USAF_ID, WBAN=WBAN_ID, YEAR=year)
        print(ftp_file)

        # read the whole file and save it to a BytesIO (stream)
        response = io.BytesIO()
        try:
            ftpconn.retrbinary('RETR ' + ftp_file, response.write)
        except ftplib.error_perm as err:
            if str(err).startswith('550 '):
                print('ERROR:', err)
            else:
                raise

        # decompress and parse each line
        response.seek(0)  # jump back to the beginning of the stream
        with gzip.open(response, mode='rb') as gzstream:
            for line in gzstream:
                parser.loads(line.decode('latin-1'))
This does read the whole file into memory, which could probably be avoided using some clever wrappers and/or yield or something... but works fine for a year's worth of hourly weather observations.
Probably a pretty nasty workaround, but this worked for me. I made a script (here called test.py) which does the request (see the code above). The code below is used in the loop I mentioned and calls test.py:
from subprocess import call
with open('log.txt', 'a') as f:
    call(['python', 'test.py', args[0], args[1]], stdout=f)
Given a standard urllib.request object, retrieved like so:
req = urllib.urlopen('http://example.com')
If I read its contents via req.read(), afterwards the request object will be empty.
Unlike normal file-like objects, however, the request object does not have a seek method, for what I am sure are excellent reasons.
However, in my case I have a function, and I want it to make certain determinations about a request and then return that request "unharmed" so that it can be read again.
I understand that one option is to re-request it. But I'd like to be able to avoid making multiple HTTP requests for the same url & content.
The only other alternative I can think of is to have the function return a tuple of the extracted content and the request object, with the understanding that anything that calls this function will have to get the content in this way.
Is that my only option?
Delegate the caching to a StringIO object (code not tested, just to give the idea):
import urllib
from io import StringIO

class CachedRequest(object):
    def __init__(self, url):
        self._request = urllib.urlopen(url)
        self._content = None

    def __getattr__(self, attr):
        # if attr is not defined in CachedRequest, then get it from
        # the request object.
        return getattr(self._request, attr)

    def read(self):
        if self._content is None:
            content = self._request.read()
            self._content = StringIO()
            self._content.write(content)
            self._content.seek(0)
            return content
        else:
            return self._content.read()

    def seek(self, i):
        self._content.seek(i)
If the code actually expects a real Request object (i.e. calls isinstance to check the type), then subclass Request and you don't even have to implement __getattr__.
Note that it is possible that a function checks for the exact class (in which case there is nothing you can do) or, if it's written in C, calls the method using C API calls (in which case the overridden method won't be called).
Make a subclass of urllib2.Request that uses a cStringIO.StringIO to hold whatever gets read. Then you can implement seek and so forth. Actually you could just use a string, but that'd be more work.
I want to write a downloader with Python and I use PycURL as my library, but I have a problem.
I can't get the size of the file which I want to download. Here is part of my code:
import pycurl
url = 'http://www.google.com'
c = pycurl.Curl()
c.setopt(c.URL, url)
print c.getinfo(c.CONTENT_LENGTH_DOWNLOAD)
c.perform()
When I test this code in the Python shell, it's OK, but when I write it as a function and run it, it gives me -1 instead of the size.
What is the problem?
(code's been edited)
This answer adds the missing c.setopt(c.NOBODY, 1) and is otherwise the same as the one given some months ago:
import pycurl
c = pycurl.Curl()
c.setopt(c.URL, 'http://www.alfe.de')
c.setopt(c.NOBODY, 1)
c.perform()
c.getinfo(c.CONTENT_LENGTH_DOWNLOAD)
Calling c.setopt(c.NOBODY, 1) before calling c.perform() avoids downloading the contents of the file ("No Body", but all headers).
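If you need this more than once, it can be wrapped in a small helper (just a sketch; the function name is made up). Note that CONTENT_LENGTH_DOWNLOAD still comes back as -1.0 when the server does not send a Content-Length header:
import pycurl

def remote_file_size(url):
    # Fetch only the headers, then ask libcurl for the advertised length.
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.NOBODY, 1)
    c.perform()
    size = c.getinfo(c.CONTENT_LENGTH_DOWNLOAD)
    c.close()
    return size  # -1.0 if the server did not report a length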
From the pycurl documentation on the Curl object:
The getinfo method should not be called unless perform has been called
and finished.
You're calling getinfo before you've called perform.
Here is a simplified version of your example; does this work?
import pycurl
url = 'http://www.google.com'
c = pycurl.Curl()
c.setopt(c.URL, url)
c.perform()
print c.getinfo(c.CONTENT_LENGTH_DOWNLOAD)
You should see the HTML content followed by the size.
Try adding debugging to see what actually happens. After you create the curl object, do this:
def curl_debug(debug_type, msg):
    print("debug: %s %s" % (repr(debug_type), repr(msg)))
c.setopt(pycurl.VERBOSE, 1)
c.setopt(pycurl.DEBUGFUNCTION, curl_debug)
I have the following code for managing file download through django.
def serve_file(request, id):
    file = models.X.objects.get(id=id).file  # FileField
    file.open('rb')
    wrapper = FileWrapper(file)
    mt = mimetypes.guess_type(file.name)[0]
    response = HttpResponse(wrapper, content_type=mt)
    import unicodedata, os.path
    filename = unicodedata.normalize('NFKD', os.path.basename(file.name)).encode("utf8", 'ignore')
    filename = filename.replace(' ', '-')  # stop the browser from ignoring anything after a space
    response['Content-Length'] = file.size
    response['Content-Disposition'] = 'attachment; filename={0}'.format(filename)
    #print response
    return response
Unfortunately, my browser gets an empty file when downloading.
The printed response seems correct:
Content-Length: 3906
Content-Type: text/plain
Content-Disposition: attachment; filename=toto.txt
blah blah ....
I have similar code running OK. I don't see what the problem could be. Any ideas?
PS: I have tested the solution proposed here and get the same behavior
Update:
Replacing wrapper = FileWrapper(file) with wrapper = file.read() seems to fix the problem.
Update: If I comment out the print response, I get a similar issue: the file is empty. The only difference is that FF detects a 20-byte size (the file is bigger than this).
A file object is an iterator: it can be read only once before being exhausted. Then you have to make a new one, or use a method to go back to the beginning of the object (e.g. seek()).
read() returns a string, which can be read multiple times without any problem, which is why it solves your issue.
So just make sure that if you use a file-like object, you don't read it twice in a row. E.g. don't print it and then return it.
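A quick illustration of the read-once behaviour (any file will do; 'example.txt' is just a placeholder):
f = open('example.txt', 'rb')
first = f.read()    # the full content
second = f.read()   # empty - the position is now at the end of the file
f.seek(0)           # rewind to the beginning
third = f.read()    # the full content again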
From the Django documentation:
FieldFile.open(mode='rb')
Behaves like the standard Python open() method and opens the file associated with this instance in the mode specified by mode.
If it works like Python's open, then it should return a file object, and it should be used like this:
f = file.open('rb')
wrapper = FileWrapper(f)
I have seen the recipes for uploading files via multipart/form-data and pycurl. Both methods seem to require a file on disk. Can these recipes be modified to supply binary data from memory instead of from disk? I guess I could just use an XML-RPC server instead. I wanted to get around having to encode and decode the binary data and send it raw... Do pycurl and multipart/form-data work with raw data?
This (small) library will take a file descriptor, and will do the HTTP POST operation: https://github.com/seisen/urllib2_file/
You can pass it a StringIO object (containing your in-memory data) as the file descriptor.
Find something that can work with a file handle. Then simply pass a StringIO object instead of a real file descriptor.
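If you stay with pycurl, it can also take the upload body from memory rather than a file on disk; a rough sketch (the URL is a placeholder) for a plain upload rather than multipart/form-data:
import io
import pycurl

data = b'raw binary payload'
buf = io.BytesIO(data)

c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/upload')
c.setopt(pycurl.UPLOAD, 1)               # libcurl upload (PUT) mode
c.setopt(pycurl.READFUNCTION, buf.read)  # body is read from memory, not a file on disk
c.setopt(pycurl.INFILESIZE, len(data))
c.perform()
c.close()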
I met a similar issue today. After trying both pycurl and multipart/form-data, I decided to read the Python httplib/urllib2 source code to find out, and I got one comparably good solution:
set the Content-Length header (of the file) before doing the POST
pass an opened file when doing the POST
Here is the code:
import urllib2, os
image_path = "png\\01.png"
url = 'http://xx.oo.com/webserviceapi/postfile/'
length = os.path.getsize(image_path)
png_data = open(image_path, "rb")
request = urllib2.Request(url, data=png_data)
request.add_header('Cache-Control', 'no-cache')
request.add_header('Content-Length', '%d' % length)
request.add_header('Content-Type', 'image/png')
res = urllib2.urlopen(request).read().strip()
return res
see my blog post: http://www.2maomao.com/blog/python-http-post-a-binary-file-using-urllib2/
The following Python code works reliably on 2.6.x. The input data is of type str.
Note that the server that receives the data has to loop to read all the data, as large data sizes will be chunked. A Java code snippet for reading the chunked data is also attached.
def post(self, url, data):
    self.curl = pycurl.Curl()
    self.response_headers = StringIO.StringIO()
    self.response_body = io.BytesIO()
    self.curl.setopt(pycurl.WRITEFUNCTION, self.response_body.write)
    self.curl.setopt(pycurl.HEADERFUNCTION, self.response_headers.write)
    self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
    self.curl.setopt(pycurl.MAXREDIRS, 5)
    self.curl.setopt(pycurl.TIMEOUT, 60)
    self.curl.setopt(pycurl.ENCODING, "deflate, gzip")
    self.curl.setopt(pycurl.URL, url)
    self.curl.setopt(pycurl.VERBOSE, 1)
    self.curl.setopt(pycurl.POST, 1)
    self.curl.setopt(pycurl.POSTFIELDS, data)
    self.curl.setopt(pycurl.HTTPHEADER, ["Content-Type: application/octet-stream"])
    self.curl.setopt(pycurl.POSTFIELDSIZE, len(data))
    self.curl.perform()
    return url, self.curl.getinfo(pycurl.RESPONSE_CODE), self.response_headers.getvalue(), self.response_body.getvalue()
Java code for the servlet engine:
int postSize = Integer.parseInt(req.getHeader("Content-Length"));
results = new byte[postSize];
int read = 0;
while (read < postSize) {
    // read into the buffer at the current offset so earlier chunks are not overwritten
    int n = req.getInputStream().read(results, read, postSize - read);
    if (n < 0) break;
    read += n;
}
I found a solution that works with the CherryPy file upload example: urllib2-binary-upload.py
import io # Part of core Python
import requests # Install via: 'pip install requests'
# Get the data in bytes. I got it via:
# with open("smile.gif", "rb") as fp: data = fp.read()
data = b"GIF89a\x12\x00\x12\x00\x80\x01\x00\x00\x00\x00\xff\x00\x00!\xf9\x04\x01\n\x00\x01\x00,\x00\x00\x00\x00\x12\x00\x12\x00\x00\x024\x84\x8f\x10\xcba\x8b\xd8ko6\xa8\xa0\xb3Wo\xde9X\x18*\x15x\x99\xd9'\xad\x1b\xe5r(9\xd2\x9d\xe9\xdd\xde\xea\xe6<,\xa3\xd8r\xb5[\xaf\x05\x1b~\x8e\x81\x02\x00;"
# Hookbin is similar to requestbin - just a neat little service
# which helps you to understand which queries are sent
url = 'https://hookb.in/je2rEl733Yt9dlMMdodB'
in_memory_file = io.BytesIO(data)
response = requests.post(url, files=(("smile.gif", in_memory_file),))
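If you also want to set the form field name, filename and content type explicitly, requests accepts a dict of tuples (the field name "file" here is just an example; use whatever the server expects):
# field name -> (filename, file-like object, content type)
response = requests.post(url, files={"file": ("smile.gif", in_memory_file, "image/gif")})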