getting the file size with pycurl - python

I want to write a downloader in Python using PycURL as my library, but I've run into a problem.
I can't get the size of the file which I want to download. Here is part of my code:
import pycurl
url = 'http://www.google.com'
c = pycurl.Curl()
c.setopt(c.URL, url)
print c.getinfo(c.CONTENT_LENGTH_DOWNLOAD)
c.perform()
When I test this code in the Python shell it works, but when I write it as a function and run it, it gives me -1 instead of the size.
What is the problem?
(code's been edited)

This answer adds the missing c.setopt(c.NOBODY, 1) and is otherwise the same as the one given some months ago:
import pycurl
c = pycurl.Curl()
c.setopt(c.URL, 'http://www.alfe.de')
c.setopt(c.NOBODY, 1)
c.perform()
c.getinfo(c.CONTENT_LENGTH_DOWNLOAD)
Calling c.setopt(c.NOBODY, 1) before calling c.perform() avoids downloading the contents of the file ("No Body", but all headers).
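For what it's worth, here is a minimal sketch of the same idea wrapped in a function, assuming Python 3 with pycurl installed. The names remote_size and normalize_length are made up for illustration and are not part of pycurl. libcurl reports an unknown length as -1, so the helper turns that into None. Setting FOLLOWLOCATION is my own assumption (so a redirect is followed before the length is read); drop it if you don't want that.

```python
def normalize_length(raw):
    """Map libcurl's 'unknown length' marker (-1) to None; otherwise return an int."""
    return None if raw < 0 else int(raw)


def remote_size(url):
    """Ask the server for headers only and return the advertised size in bytes."""
    import pycurl  # imported here so the helper above is usable without pycurl

    c = pycurl.Curl()
    try:
        c.setopt(c.URL, url)
        c.setopt(c.NOBODY, 1)          # headers only, no body
        c.setopt(c.FOLLOWLOCATION, 1)  # assumption: follow redirects first
        c.perform()                    # getinfo is only meaningful after this
        return normalize_length(c.getinfo(c.CONTENT_LENGTH_DOWNLOAD))
    finally:
        c.close()
```

A server that does not send a Content-Length header will still give you None here, so the caller has to be prepared for that case.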

From the pycurl documentation on the Curl object:
The getinfo method should not be called unless perform has been called
and finished.
You're calling getinfo before you've called perform.
Here is a simplified version of your example. Does this work?
import pycurl
url = 'http://www.google.com'
c = pycurl.Curl()
c.setopt(c.URL, url)
c.perform()
print c.getinfo(c.CONTENT_LENGTH_DOWNLOAD)
You should see the HTML content followed by the size.

Try adding debug output to see what actually happens. After you create the Curl object, add this:
def curl_debug(debug_type, msg):
    print("debug: %s %s" % (repr(debug_type), repr(msg)))
c.setopt(pycurl.VERBOSE, 1)
c.setopt(pycurl.DEBUGFUNCTION, curl_debug)
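If the raw repr output is hard to read, a slightly friendlier callback could look like the sketch below. It assumes Python 3 (where pycurl hands the message over as bytes). The numeric keys follow libcurl's curl_infotype enum; the names INFO_LABELS and make_debug_collector are hypothetical helpers, not pycurl API.

```python
# Numeric values of libcurl's curl_infotype enum, as passed to DEBUGFUNCTION.
INFO_LABELS = {
    0: "text",
    1: "header-in",
    2: "header-out",
    3: "data-in",
    4: "data-out",
    5: "ssl-data-in",
    6: "ssl-data-out",
}


def make_debug_collector():
    """Return (callback, lines); the callback appends readable lines to the list."""
    lines = []

    def callback(debug_type, msg):
        # pycurl passes bytes on Python 3; decode defensively for display.
        if isinstance(msg, bytes):
            msg = msg.decode("latin-1")
        label = INFO_LABELS.get(debug_type, "type-%r" % debug_type)
        lines.append("%s: %s" % (label, msg.rstrip("\r\n")))

    return callback, lines
```

You would register it the same way as above, with c.setopt(pycurl.VERBOSE, 1) and c.setopt(pycurl.DEBUGFUNCTION, callback), and then inspect the "header-in" lines after perform() to see whether a Content-Length header was sent at all.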

Related

Python: Download only the HEAD tag of web page

I need to scrape many web pages (thousands per hour) as fast as possible, and I only need the metadata from the head section. I don't think using range headers is going to be reliable, as head sections can vary greatly in size.
I came across a Java implementation that opens a URLConnection and reads from the input stream, stopping once it finds the closing </head> tag.
See: Is it possible to download only the HEAD tag of a page?
Is this possible in Python?
I've been testing with pycurl and the WRITEFUNCTION callback:
from time import time
import pycurl
import sys
c = pycurl.Curl()
class Body:
    body = ""
    def read(self, buf):
        self.body += str(buf)
b = Body()
c.setopt(c.URL, "https://www.oracle.com/")
c.setopt(c.WRITEFUNCTION, b.read)
t1 = time()
try:
    c.perform()
except pycurl.error:
    pass
print(time() - t1)
print(b.body)
But using Wireshark, and with the debugger set to stop in the read function, I'm still seeing all the transactions happen on the first pass.
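One way to stop early with a WRITEFUNCTION callback is to return a value that differs from the number of bytes received, which makes libcurl abort the transfer (perform() then raises pycurl.error, so it needs a try/except around it). The sketch below assumes Python 3, where the callback receives bytes; HeadCollector is a hypothetical name. Note that libcurl delivers data in chunks of its own choosing, so a small page may arrive in a single callback and nothing is saved in that case.

```python
class HeadCollector:
    """Accumulate response bytes until the closing </head> tag has been seen."""

    END_TAG = b"</head>"

    def __init__(self):
        self.body = b""
        self.done = False

    def write(self, buf):
        # Re-scan a few bytes before the new chunk in case the tag was split.
        start = max(0, len(self.body) - len(self.END_TAG) + 1)
        self.body += buf
        pos = self.body.lower().find(self.END_TAG, start)
        if pos == -1:
            return None  # None tells pycurl the whole chunk was consumed
        self.body = self.body[:pos + len(self.END_TAG)]
        self.done = True
        return -1  # any value other than len(buf) makes libcurl abort
```

It would be hooked up with c.setopt(c.WRITEFUNCTION, collector.write); after the (expected) pycurl.error, collector.body holds everything up to and including the closing tag.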

pylance report GeneralTypeIssues on members while typing pycurl example

I am a Python beginner learning pycurl from its example in VS Code.
import pycurl
from io import BytesIO
buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.INTERFACE, 'lo')
c.setopt(c.URL, "http://127.0.0.1")
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()
body = buffer.getvalue()
print(body.decode('UTF-8'))
But Pylance reports GeneralTypeIssues on Curl members, as shown below:
GeneralTypeIssues
Cannot access member "INTERFACE" for type "Curl"
Member "INTERFACE" is unknown Pylance (reportGeneralTypeIssues)
I don't know whether I should report this issue to Pylance or to pycurl, so I came to Stack Overflow for help.
Many thanks in advance.
I think the problem is in these lines:
c.setopt(c.INTERFACE, 'lo')
c.setopt(c.URL, "http://127.0.0.1")
c.setopt(c.WRITEDATA, buffer)
Think again about whether you can use c.INTERFACE.
c is a Curl object, while INTERFACE is a constant defined on the pycurl module.
You can try changing c.INTERFACE to pycurl.INTERFACE (and likewise for the other options).

upload binary data or file from memory using python http post

I have seen the recipes for uploading files via multipart/form-data and pycurl. Both methods seem to require a file on disk. Can these recipes be modified to supply binary data from memory instead of from disk? I guess I could just use an XML-RPC server instead. I wanted to get around having to encode and decode the binary data, and just send it raw. Do pycurl and multipart/form-data work with raw data?
This (small) library will take a file descriptor, and will do the HTTP POST operation: https://github.com/seisen/urllib2_file/
You can pass it a StringIO object (containing your in-memory data) as the file descriptor.
Find something that can work with a file handle. Then simply pass a StringIO object instead of a real file descriptor.
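To make the "file handle" idea concrete, here is a rough sketch that builds a multipart/form-data body entirely in memory, using only the standard library. It assumes Python 3; encode_multipart and the field and file names are made up for the example. The random boundary is assumed not to occur inside the data, which is very unlikely but not checked.

```python
import io
import uuid


def encode_multipart(field_name, filename, fileobj,
                     content_type="application/octet-stream"):
    """Build a multipart/form-data body from any object with a read() method.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        '--%s\r\n'
        'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
        'Content-Type: %s\r\n'
        '\r\n' % (boundary, field_name, filename, content_type)
    ).encode("utf-8")
    tail = ("\r\n--%s--\r\n" % boundary).encode("utf-8")
    body = head + fileobj.read() + tail
    return body, "multipart/form-data; boundary=%s" % boundary


# In-memory data stands in for a file on disk.
body, header = encode_multipart("file", "blob.bin", io.BytesIO(b"\x00\x01raw"))
```

The binary payload goes into the body untouched, so no base64 or similar encoding is needed. The resulting bytes and Content-Type value can then be handed to whatever HTTP client you use.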
I ran into a similar issue today. After trying both pycurl and multipart/form-data, I decided to read the Python httplib/urllib2 source code to find out how it works, and I did find a reasonably good solution:
set the Content-Length header (of the file) before doing the POST
pass an opened file when doing the POST
Here is the code:
import urllib2, os

# post_png is a wrapper name added so that the final "return" is valid
def post_png(url, image_path):
    # Content-Length must be set by hand because data is a file object
    length = os.path.getsize(image_path)
    with open(image_path, "rb") as png_data:
        request = urllib2.Request(url, data=png_data)
        request.add_header('Cache-Control', 'no-cache')
        request.add_header('Content-Length', '%d' % length)
        request.add_header('Content-Type', 'image/png')
        res = urllib2.urlopen(request).read().strip()
    return res

print post_png('http://xx.oo.com/webserviceapi/postfile/', "png\\01.png")
see my blog post: http://www.2maomao.com/blog/python-http-post-a-binary-file-using-urllib2/
The following Python code works reliably on 2.6.x. The input data is of type str.
Note that the server that receives the data has to loop to read all of it, as large
payloads arrive in chunks. A Java code snippet that reads the data this way is also attached.
import io
import StringIO
import pycurl

# method of a client class (the class definition is not shown)
def post(self, url, data):
    self.curl = pycurl.Curl()
    self.response_headers = StringIO.StringIO()
    self.response_body = io.BytesIO()
    self.curl.setopt(pycurl.WRITEFUNCTION, self.response_body.write)
    self.curl.setopt(pycurl.HEADERFUNCTION, self.response_headers.write)
    self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
    self.curl.setopt(pycurl.MAXREDIRS, 5)
    self.curl.setopt(pycurl.TIMEOUT, 60)
    self.curl.setopt(pycurl.ENCODING, "deflate, gzip")
    self.curl.setopt(pycurl.URL, url)
    self.curl.setopt(pycurl.VERBOSE, 1)
    self.curl.setopt(pycurl.POST, 1)
    self.curl.setopt(pycurl.POSTFIELDS, data)
    # raw binary payload
    self.curl.setopt(pycurl.HTTPHEADER, ["Content-Type: application/octet-stream"])
    self.curl.setopt(pycurl.POSTFIELDSIZE, len(data))
    self.curl.perform()
    return (url, self.curl.getinfo(pycurl.RESPONSE_CODE),
            self.response_headers.getvalue(), self.response_body.getvalue())
Java code for the servlet engine:
int postSize = Integer.parseInt(req.getHeader("Content-Length"));
results = new byte[postSize];
int read = 0;
while (read < postSize) {
    // read into the unfilled part of the buffer instead of overwriting the start
    int n = req.getInputStream().read(results, read, postSize - read);
    if (n < 0) break;
    read += n;
}
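The same "keep reading until you have Content-Length bytes" loop, sketched in Python 3 for comparison. read_exact and TrickleStream are hypothetical names; TrickleStream only exists to imitate a stream that hands out short reads.

```python
def read_exact(stream, size):
    """Read exactly `size` bytes, looping because read() may return fewer."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # end of stream before the declared length
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class TrickleStream:
    """Test double that returns at most 3 bytes per read() call."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, n):
        chunk = self._data[self._pos:self._pos + min(n, 3)]
        self._pos += len(chunk)
        return chunk
```

A single read() call on such a stream would silently return a truncated upload, which is why the loop is needed on the receiving side.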
Found a solution that works with the CherryPy file upload example: urllib2-binary-upload.py
import io # Part of core Python
import requests # Install via: 'pip install requests'
# Get the data in bytes. I got it via:
# with open("smile.gif", "rb") as fp: data = fp.read()
data = b"GIF89a\x12\x00\x12\x00\x80\x01\x00\x00\x00\x00\xff\x00\x00!\xf9\x04\x01\n\x00\x01\x00,\x00\x00\x00\x00\x12\x00\x12\x00\x00\x024\x84\x8f\x10\xcba\x8b\xd8ko6\xa8\xa0\xb3Wo\xde9X\x18*\x15x\x99\xd9'\xad\x1b\xe5r(9\xd2\x9d\xe9\xdd\xde\xea\xe6<,\xa3\xd8r\xb5[\xaf\x05\x1b~\x8e\x81\x02\x00;"
# Hookbin is similar to requestbin - just a neat little service
# which helps you to understand which queries are sent
url = 'https://hookb.in/je2rEl733Yt9dlMMdodB'
in_memory_file = io.BytesIO(data)
response = requests.post(url, files=(("smile.gif", in_memory_file),))

Python code like curl

In curl I do this:
curl -u email:password http://api.foursquare.com/v1/venue.json?vid=2393749
How can I do the same thing in Python?
Here's the equivalent in pycurl:
import pycurl
from StringIO import StringIO
response_buffer = StringIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, "http://api.foursquare.com/v1/venue.json?vid=2393749")
curl.setopt(curl.USERPWD, '%s:%s' % ('youruser', 'yourpassword'))
curl.setopt(curl.WRITEFUNCTION, response_buffer.write)
curl.perform()
curl.close()
response_value = response_buffer.getvalue()
"The problem could be that the Python libraries, per HTTP-Standard, first send an unauthenticated request, and then only if it's answered with a 401 retry, are the correct credentials sent. If the Foursquare servers don't do "totally standard authentication" then the libraries won't work.
Try using headers to do authentication:"
Taken from Python urllib2 Basic Auth Problem
import urllib2
import base64
req = urllib2.Request('http://api.foursquare.com/v1/venue.json?vid=%s' % self.venue_id)
req.add_header('Authorization', 'Basic ' + base64.b64encode('email:password'))
res = urllib2.urlopen(req)
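For Python 3, where urllib2 became urllib.request and b64encode works on bytes, the header approach might look like the sketch below. basic_auth_header and build_request are hypothetical helper names; the request is only constructed here, not sent.

```python
import base64
import urllib.request


def basic_auth_header(user, password):
    """Return the value for an HTTP 'Authorization' header (Basic scheme)."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_request(url, user, password):
    """Attach credentials up front instead of waiting for a 401 challenge."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(user, password))
    return req
```

Passing the result of build_request(...) to urllib.request.urlopen would then send the credentials with the very first request.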
I'm more comfortable running the command-line curl through subprocess. This avoids all of the potential version-matching headaches of Python, pycurl, and libcurl. The observation that pycurl hasn't been touched in 2 years, and is only listed as supported through Python 2.5, made me wary.
-- John
import subprocess
def curl(*args):
    curl_path = '/usr/bin/curl'
    curl_list = [curl_path]
    for arg in args:
        curl_list.append(arg)
    curl_result = subprocess.Popen(
        curl_list,
        stderr=subprocess.PIPE,
        stdout=subprocess.PIPE).communicate()[0]
    return curl_result
answer = curl('-u', 'email:password', 'http://api.foursquare.com/v1/venue.json?vid=2393749')
If you use human_curl, you can write something like this:
import human_curl as hurl
r = hurl.get('http://api.foursquare.com/v1/venue.json?vid=2393749', auth=('email','password'))
The JSON data is in r.content.
Use pycurl
http://pycurl.sourceforge.net/
There is a discussion on SO for tutorials
What good tutorials exist for learning pycURL?
A typical example:
import sys
import pycurl
class ContentCallback:
    def __init__(self):
        self.contents = ''
    def content_callback(self, buf):
        self.contents = self.contents + buf
t = ContentCallback()
curlObj = pycurl.Curl()
curlObj.setopt(curlObj.URL, 'http://www.google.com')
curlObj.setopt(curlObj.WRITEFUNCTION, t.content_callback)
curlObj.perform()
curlObj.close()
print t.contents

Pycurl WRITEDATA WRITEFUNCTION collision/crash

How do I turn off WRITEFUNCTION and WRITEDATA?
Using pycurl, I have a class called curlUtil. In it I have pageAsString(self, URL), which returns a string.
To do this I set WRITEFUNCTION with setopt. Now in downloadFile(self, URL, fn, overwrite=0) I open a file and call self.c.setopt(pycurl.WRITEFUNCTION, 0), which causes problems: an int is not a valid argument.
I then assumed WRITEDATA would override the value, or that there would be a NOWRITEFUNCTION command. NOWRITEFUNCTION didn't exist, so I just used WRITEDATA, and Python crashed.
I wrote a quick function called reboot() which closes curl, opens it again, and calls reset to put it in the default state. I call it in both pageAsString and downloadFile and there is no problem at all. But I don't want to reinitialize curl; there might be some special options I have set.
How do I turn off WRITEFUNCTION and WRITEDATA?
Using WRITEFUNCTION, instead of turning it off, would save you a lot of trouble. You might want to rewrite your pageAsString so that it uses WRITEFUNCTION.
As an example:
import pycurl
from cStringIO import StringIO
c = pycurl.Curl()
buffer = StringIO()
c.setopt(pycurl.WRITEFUNCTION, buffer.write)
c.setopt(pycurl.URL, "http://example.com")
c.perform()
...
buffer.getvalue() # will return the data fetched.
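Building on that, one possible way to avoid ever "turning off" the callback is to register a single write callback once and only swap where it writes to. This is just a sketch under that assumption, for Python 3; SwitchableSink is a made-up name, and the pycurl call is shown as a comment so the snippet runs without a network.

```python
import io


class SwitchableSink:
    """Stable write callback whose destination can be swapped between transfers."""

    def __init__(self):
        self.target = None

    def write(self, data):
        # Returning None tells pycurl that all bytes were consumed.
        self.target.write(data)


sink = SwitchableSink()

# With pycurl this would be registered once (not executed here):
#     c.setopt(pycurl.WRITEFUNCTION, sink.write)

# "pageAsString" style: collect into memory.
sink.target = io.BytesIO()
sink.write(b"<html>page</html>")
page = sink.target.getvalue()

# "downloadFile" style: point the same callback at another file-like object.
sink.target = io.BytesIO()  # stands in for open(fn, "wb")
sink.write(b"file bytes")
saved = sink.target.getvalue()
```

Since the Curl object keeps the same callback the whole time, any other options set on it stay untouched between the two kinds of transfer.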
