Download several parts of one file concurrently with Python? - python

I know how to use urllib to download a file. However, it's much faster, if the server allows it, to download several parts of the same file simultaneously and then merge them.
How do you do that in Python? If it can't be done easily with the standard library, is there any library that would let you do it?

Although I agree with Gregory's suggestion of using an existing library, it's worth noting that you can do this by using the Range HTTP header. If the server accepts byte-range requests, you can start several threads to download multiple parts of the file in parallel. This snippet, for example, will only download bytes 0..65535 of the specified file:
import urllib2
url = 'http://example.com/test.zip'
req = urllib2.Request(url, headers={'Range':'bytes=0-65535'})
data = urllib2.urlopen(req).read()
You can determine the remote resource size and see whether the server supports ranged requests by sending a HEAD request:
import urllib2

class HeadRequest(urllib2.Request):
    # Override the method so urllib2 sends HEAD instead of GET.
    def get_method(self):
        return "HEAD"

url = 'http://sstatic.net/stackoverflow/img/sprites.png'
req = HeadRequest(url)
response = urllib2.urlopen(req)
response.close()
print response.headers
The above prints:
Cache-Control: max-age=604800
Content-Length: 16542
Content-Type: image/png
Last-Modified: Thu, 10 Mar 2011 06:13:43 GMT
Accept-Ranges: bytes
ETag: "c434b24eeadecb1:0"
Date: Mon, 14 Mar 2011 16:08:02 GMT
Connection: close
From that we can see that the file size is 16542 bytes ('Content-Length') and the server supports byte ranges ('Accept-Ranges: bytes').
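Putting the two pieces together, here is a minimal Python 3 sketch (using urllib.request rather than urllib2) of splitting a file of known size into byte ranges and fetching them in threads. The function names split_ranges, fetch_range and download_in_parts are made up for this illustration, and the sketch assumes the size came from a HEAD request like the one above and that the server answers ranged requests with 206 Partial Content:

```python
import threading
import urllib.request


def split_ranges(size, parts):
    """Return inclusive (start, end) byte offsets that cover 0..size-1."""
    if size <= 0:
        return []
    parts = max(1, min(parts, size))
    chunk = size // parts
    ranges = []
    for i in range(parts):
        start = i * chunk
        # The last part absorbs the remainder of the division.
        end = size - 1 if i == parts - 1 else start + chunk - 1
        ranges.append((start, end))
    return ranges


def fetch_range(url, start, end, results, index):
    """Download bytes start..end (inclusive) into results[index]."""
    headers = {"Range": "bytes=%d-%d" % (start, end)}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        # 206 Partial Content means the server honoured the Range header.
        if resp.status != 206:
            raise RuntimeError("server ignored the Range header")
        results[index] = resp.read()


def download_in_parts(url, size, parts=4):
    """Fetch all ranges concurrently and join them in order."""
    ranges = split_ranges(size, parts)
    results = [None] * len(ranges)
    threads = [
        threading.Thread(target=fetch_range, args=(url, s, e, results, i))
        for i, (s, e) in enumerate(ranges)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # A part left as None means its thread failed.
    if any(part is None for part in results):
        raise RuntimeError("at least one part failed to download")
    return b"".join(results)
```

For the 16542-byte file above, split_ranges(16542, 4) yields four contiguous ranges ending at byte 16541. Whether this is actually faster depends on the server and the network, so it is worth measuring before relying on it.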

PycURL can do it. PycURL is a Python interface to libcurl. It can be used to fetch objects identified by a URL from a Python program, similar to the urllib Python module. PycURL is mature, very fast, and supports a lot of features.

Apache sending Transfer-Encoding: chunked when deflate module is enabled

I have a simple web.py code like below, deployed with mod_wsgi in apache.
import web
urls = (
    '/', 'index'
)
class index:
    def GET(self):
        content = 'hello'
        web.header('Content-length', len(content))
        return content
app = web.application(urls, globals())
application = app.wsgifunc()
This website runs well, except for one minor issue. When mod_deflate is turned on, the response is chunked, even though the response body is very small.
Response Header
HTTP/1.1 200 OK
Date: Wed, 20 May 2015 20:14:12 GMT
Server: Apache/2.4.7 (Ubuntu)
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
When mod_deflate is turned off, the Content-Length header is back.
HTTP/1.1 200 OK
Date: Wed, 20 May 2015 20:30:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Content-Length: 5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=utf-8
I've searched around, and someone said that reducing DeflateBufferSize will help, but this response's size is only 5 bytes, far below its default value of 8096, so I don't think it affects this issue.
Someone else said Apache sends a chunked response because it doesn't know the response's size before it begins sending the response to the client, but in my code I do set Content-Length.
I've also tried Flask and Apache/2.2.15 (CentOS), with the same result.
How do I set Content-Length when the deflate module is enabled? I would prefer not to gzip the content in Python.
The response Content-Length has to reflect the final length of the data sent after the compression has been done, not the original length. Thus mod_deflate has to remove the original Content-Length header and use chunked transfer encoding. The only way it could otherwise know the content length to be able to send the Content-Length before sending the compressed data, would be to buffer up the complete compressed response in memory or into a file and then calculate the length. Buffering all the compressed content isn't practical and in part defeats the point of compressing the data as the response is streamed.
If you don't want mod_deflate enabled for the whole site, then only enable it for certain URL prefixes by scoping it within a Location block.
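To see why the original Content-Length cannot be reused, here is a small Python 3 illustration. It uses the standard gzip module as a stand-in for mod_deflate (the exact bytes Apache produces may differ), and the variable names are only for this example:

```python
import gzip

# The 5-byte body from the question.
content = b"hello"

# gzip adds a header and a trailer, so the compressed body has a
# different length than the original.
compressed = gzip.compress(content)

print("original length:  ", len(content))
print("compressed length:", len(compressed))
```

The compressed length is what a Content-Length header would have to carry, and it is only known once compression has finished.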

How to extract JSON data from a response containing a header and body?

This is my first question posed to Stack Overflow, because typically I can find the solutions to my problem here, but for this particular situation, I cannot. I am writing a Python plugin for my compiler that outputs REST calls in various languages for interaction with an API. I am authenticating with the socket and ssl modules by sending a username and password in the request body in JSON form. Upon successful authentication, the API returns a response in the following format with important response data in the body:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Date: Tue, 05 Feb 2013 03:36:18 GMT
Vary: Accept-Charset, Accept-Encoding, Accept-Language, Accept
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST,OPTIONS,GET
Access-Control-Allow-Headers: Content-Type
Server: Restlet-Framework/2.0m5
Content-Type: text/plain;charset=ISO-8859-1
Content-Length: 94
{"authentication-token":"<token>","authentication-secret":"<secret>"}
This is probably a very elementary question for Pythonistas, given its powerful tools for String manipulation. But alas, I am a new programmer who started with Java. I would like to know what would be the best way to parse this entire response to obtain the "<token>" and "<secret>"? Should I use a search for a "{" and dump the substring into a json object? My intuition is telling me to try and use the re module, but I cannot seem to figure out how it would be used in this situation, since the pattern of the token and secret are obviously not predictable. Because I have opted to authenticate with a low-level module set, this response is one big String obtained by constructing the header and appending JSON data to it in the body, then executing the request and obtaining the response with the following code:
#Socket configuration and connection execution
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
conn = ssl.wrap_socket(sock, ca_certs = pem_file)
conn.connect((host, port))
conn.send(req)
response = conn.recv()
print(response)
The print statement outputs the first code sample. Any help or insight would be greatly appreciated!
HTTP headers are separated from the body by a \r\n\r\n sequence. Do something like:
import json
...
# Split only at the first blank line, in case the body contains one too.
(headers, js) = response.split("\r\n\r\n", 1)
data = json.loads(js)
token = data["authentication-token"]
secret = data["authentication-secret"]
You'll probably want to check the response status, etc., and various libraries (e.g. requests) can do all of this much more easily for you.
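As a slightly fuller sketch in Python 3, the following also pulls out the status code and headers before decoding the JSON body. The function names parse_raw_response and extract_credentials are hypothetical, and the code assumes that raw already holds the complete response as bytes (a single recv() call is not guaranteed to return all of it) and that the body is not chunked or compressed:

```python
import json


def parse_raw_response(raw):
    """Split a raw HTTP response into (status, headers, body)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line looks like: HTTP/1.1 200 OK
    status = int(lines[0].split(None, 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive; a repeated header
        # (such as Server in the question) keeps its last value here.
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


def extract_credentials(raw):
    """Return (token, secret) from a successful authentication response."""
    status, headers, body = parse_raw_response(raw)
    if status != 200:
        raise ValueError("unexpected status %d" % status)
    data = json.loads(body.decode("iso-8859-1"))
    return data["authentication-token"], data["authentication-secret"]
```

Using partition instead of searching for "{" avoids guessing where the JSON starts, and no regular expression is needed.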

How To Capture Output of Curl from Python script

I want to find information about a webpage using curl, but from Python. So far I have this:
os.system("curl --head www.google.com")
If I run that, it prints out:
HTTP/1.1 200 OK
Date: Sun, 15 Apr 2012 00:50:13 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=3e39ad65c9fa03f3:FF=0:TM=1334451013:LM=1334451013:S=IyFnmKZh0Ck4xfJ4; expires=Tue, 15-Apr-2014 00:50:13 GMT; path=/; domain=.google.com
Set-Cookie: NID=58=Giz8e5-6p4cDNmx9j9QLwCbqhRksc907LDDO6WYeeV-hRbugTLTLvyjswf6Vk1xd6FPAGi8VOPaJVXm14TBm-0Seu1_331zS6gPHfFp4u4rRkXtSR9Un0hg-smEqByZO; expires=Mon, 15-Oct-2012 00:50:13 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
What I want to do is match the 200 in it using a regex (I don't need help with that), but I can't find a way to get all the text above into a string. How do I do that?
I tried info = os.system("curl --head www.google.com"), but info was just 0.
For some reason I need to use curl (no pycurl, httplib2, ...). Maybe this can help somebody:
import os
result = os.popen("curl http://google.es").read()
print result
Try this, using subprocess.Popen():
import subprocess
proc = subprocess.Popen(["curl", "--head", "www.google.com"], stdout=subprocess.PIPE)
(out, err) = proc.communicate()
print out
As stated in the documentation:
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several other, older modules and functions, such as:
os.system
os.spawn*
os.popen*
popen2.*
commands.*
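On Python 3.7 or later, the same idea can be written with subprocess.run. This is a minimal sketch; the helper name run_and_capture is invented for this example, and the demo command uses the Python interpreter itself to print a fake status line so that it runs without curl or network access:

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd (a list of arguments) and return its standard output as text."""
    # check=True raises CalledProcessError on a non-zero exit status.
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout


# Stand-in command for the demo; it only prints a status line.
out = run_and_capture([sys.executable, "-c", "print('HTTP/1.1 200 OK')"])

# The status code is the second whitespace-separated field.
status_code = out.split()[1]
print(status_code)
```

For the real case, the argument list would be ["curl", "--head", "www.google.com"], assuming curl is installed and on the PATH.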
import os
cmd = 'curl https://randomuser.me/api/'
os.system(cmd)
Result
{"results":[{"gender":"male","name":{"title":"mr","first":"çetin","last":"nebioğlu"},"location":{"street":"5919 abanoz sk","city":"adana","state":"kayseri","postcode":53537},"email":"çetin.nebioğlu#example.com","login":{"username":"heavyleopard188","password":"forgot","salt":"91TJOXWX","md5":"2b1124732ed2716af7d87ff3b140d178","sha1":"cb13fddef0e2ce14fa08a1731b66f5a603e32abe","sha256":"cbc252db886cc20e13f1fe000af1762be9f05e4f6372c289f993b89f1013a68c"},"dob":"1977-05-10 18:26:56","registered":"2009-09-08 15:57:32","phone":"(518)-816-4122","cell":"(605)-165-1900","id":{"name":"","value":null},"picture":{"large":"https://randomuser.me/api/portraits/men/38.jpg","medium":"https://randomuser.me/api/portraits/med/men/38.jpg","thumbnail":"https://randomuser.me/api/portraits/thumb/men/38.jpg"},"nat":"TR"}],"info":{"seed":"0b38b702ef718e83","results":1,"page":1,"version":"1.1"}}
You could use an HTTP library or http client library in Python instead of calling a curl command. In fact, there is a curl library that you can install (as long as you have a compiler on your OS).
Other choices are httplib2 (recommended), which is a fairly complete HTTP protocol client that also supports caching, or just plain httplib, or a library named Requests.
If you really, really want to just run the curl command and capture its output, then you can do this with Popen in the builtin subprocess module documented here: http://docs.python.org/library/subprocess.html
Well, there is an easier to read, but messier way to do it. Here it is:
import os
outfile='' #put your file path there
os.system("curl --head www.google.com>>{x}".format(x=str(outfile))) #Appends the command output to a log file (and creates it if it doesn't exist).
readOut=open("{z}".format(z=str(outfile)),"r") #Opens file in reading mode.
for line in readOut:
    print line #Prints lines in file
readOut.close() #Closes file
os.system("del {c}".format(c=str(outfile))) #This is optional, as it just deletes the log file after use (del is the Windows command).
This should work properly for your needs. :)
Try this:
import httplib
conn = httplib.HTTPConnection("www.python.org")
conn.request("GET", "/index.html")
r1 = conn.getresponse()
print r1.status, r1.reason

Upload images from a web page

I want to implement a feature similar to this http://www.tineye.com/parse?url=yahoo.com - allow the user to upload images from any web page.
The main problem for me is that it takes too much time for web pages with a large number of images.
I'm doing this in Django (using curl or urllib) according to the following scheme:
Grab the HTML of the page (takes about 1 sec for big pages):
file = urllib.urlopen(requested_url)
html_string = file.read()
Parse it with an HTML parser (BeautifulSoup), looking for img tags and writing the src of every image to a list (takes about 1 sec too for big pages).
Check the sizes of all images in my list and, if they are big enough, return them in a JSON response (takes very long, about 15 sec, when there are about 80 images on a web page). Here's the code of the function:
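def get_image_size(uri):
    file = urllib.urlopen(uri)
    try:
        p = ImageFile.Parser()
        # Read only the first 1 KB instead of the whole image.
        data = file.read(1024)
        if not data:
            return None
        p.feed(data)
        if p.image:
            return p.image.size
        # not an image
        return None
    finally:
        # Close the connection on every return path.
        file.close()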
As you can see, I'm not loading the full image to get its size, only 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).
So how can I make it work faster?
Is there maybe a way to avoid making a request for every single image?
Any help will be highly appreciated.
Thanks!
I can think of a few optimisations:
parse as you are reading the file from the stream
use a SAX parser (which works well with the point above)
use HEAD to get the size of the images
use a queue to hold your images, then use a few threads to connect and get the file sizes
Example of a HEAD request:
$ telnet m.onet.pl 80
Trying 213.180.150.45...
Connected to m.onet.pl.
Escape character is '^]'.
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
host: m.onet.pl
HTTP/1.0 200 OK
Server: nginx/0.8.53
Date: Sat, 09 Apr 2011 18:32:44 GMT
Content-Type: image/jpeg
Content-Length: 37545
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
Expires: Sat, 16 Apr 2011 18:32:44 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
Age: 6575
X-Cache: HIT from emka1.m10r2.onet
Via: 1.1 emka1.m10r2.onet:80 (squid)
Connection: close
Connection closed by foreign host.
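The HEAD and thread suggestions above could be combined roughly as follows in Python 3. This is only a sketch: head_content_length and sizes_in_parallel are names invented here, a thread pool stands in for the hand-built queue, and note that Content-Length gives the file size in bytes, not the pixel dimensions that get_image_size returns:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def head_content_length(url, timeout=10):
    """Return Content-Length from a HEAD request, or None if absent."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        value = resp.headers.get("Content-Length")
    return int(value) if value is not None else None


def sizes_in_parallel(urls, fetch=head_content_length, workers=8):
    """Map each URL in the list to its size, or to None if the request failed."""
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # One broken image should not abort the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

The fetch parameter exists so that a different per-URL function (for example, one that reads the first kilobyte and feeds it to an image parser) can be run with the same pool.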
You can use the headers attribute of the file like object returned by urllib2.urlopen (I don't know about urllib).
Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.
|milo|laurie|¥ cat test.py
import urllib2
uri = "http://download.thinkbroadband.com/1GB.zip"
def get_file_size(uri):
    file = urllib2.urlopen(uri)
    content_header, = [header for header in file.headers.headers if header.startswith("Content-Length")]
    _, str_length = content_header.split(':')
    length = int(str_length.strip())
    return length
if __name__ == "__main__":
    get_file_size(uri)
|milo|laurie|¥ time python2 test.py
python2 test.py 0.06s user 0.01s system 35% cpu 0.196 total

Python: check URL type

I wrote a crawler in Python. The fetched URLs can be of different types: HTML pages, images, big archives, or other files. I need to determine the type quickly so that I can avoid reading big files such as large archives and continue crawling. What is the best way to determine the URL type at the start of page loading?
I understand that I can do it by URL name (ends with .rar, .jpg, etc.), but I think that is not a complete solution. Do I need to check the headers or something like that? I also need some page size prediction to prevent large downloads. In other words, I want to set a limit on the downloaded page size so that memory is not used up quickly.
If you use an HTTP HEAD request on the resource, you will get relevant metadata on the resource without the resource data itself. Specifically, the Content-Length and Content-Type headers will be of interest.
E.g.
HEAD /stackoverflow/img/favicon.ico HTTP/1.1
host: sstatic.net
HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Length: 1150
Content-Type: image/x-icon
Last-Modified: Mon, 02 Aug 2010 06:04:04 GMT
Accept-Ranges: bytes
ETag: "2187d82832cb1:0"
X-Powered-By: ASP.NET
Date: Sun, 12 Sep 2010 13:38:36 GMT
You can do this in python using httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]
This tells you it's an image (image/* mime-type) of 1150 bytes. Enough information for you to decide if you want to fetch the full resource.
Additionally, the accept-ranges header tells you that the server accepts HTTP partial content requests, which allows you to retrieve data in batches.
You will get the same header information if you do a GET directly, but this will also start sending the resource data in the body of the response, something you want to avoid.
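For the crawler, the decision could be wrapped in a small helper like the Python 3 sketch below. The name should_fetch, the 1,000,000-byte default limit and the allowed type list are all assumptions made for this illustration; the headers argument is a list of (name, value) pairs like the one getheaders() returns above:

```python
def should_fetch(headers, max_bytes=1000000, allowed_types=("text/html",)):
    """Decide from HEAD response headers whether a full GET is worthwhile."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers}

    # Drop parameters such as "; charset=utf-8" before comparing.
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type not in allowed_types:
        return False

    length = lowered.get("content-length")
    if length is None or not length.strip().isdigit():
        # Size unknown: allow the fetch, but the caller should still
        # cap how many bytes it reads.
        return True
    return int(length) <= max_bytes
```

Since servers can omit or misreport these headers, it is still worth limiting the actual read, for example by calling read() with a maximum byte count.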
If you want to learn more about HTTP headers and their meaning, you can use an online tool such as 'Fetch'.
