python check url type - python

I wrote a crawler in Python. The fetched URLs point to different kinds of resources: HTML pages, images, large archives, or other files. I need a fast way to tell these cases apart so the crawler can skip large files such as big archives and continue crawling. What is the best way to determine a URL's type at the start of page loading?
I understand that I can do it by URL name (ends with .rar, .jpg, etc.), but I think that is not a complete solution. Do I need to check a header or something like that? I also need some way to predict the page size so I can avoid large downloads. In other words, I want to set a limit on the downloaded page size to keep memory from filling up quickly.

If you use an HTTP HEAD request on the resource, you will get relevant metadata on the resource without the resource data itself. Specifically, the Content-Length and Content-Type headers will be of interest.
E.g.
HEAD /stackoverflow/img/favicon.ico HTTP/1.1
host: sstatic.net
HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Length: 1150
Content-Type: image/x-icon
Last-Modified: Mon, 02 Aug 2010 06:04:04 GMT
Accept-Ranges: bytes
ETag: "2187d82832cb1:0"
X-Powered-By: ASP.NET
Date: Sun, 12 Sep 2010 13:38:36 GMT
You can do this in python using httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]
This tells you it's an image (image/* mime-type) of 1150 bytes. Enough information for you to decide if you want to fetch the full resource.
Additionally, this header tells you the server accepts HTTP partial content requests (Accept-Ranges header), which allows you to retrieve data in batches.
You will get the same header information if you do a GET directly, but this will also start sending the resource data in the body of the response, something you want to avoid.
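As a rough Python 3 counterpart to the httplib session above, the sketch below sends a HEAD request with urllib.request and then decides from the Content-Type and Content-Length headers whether a crawler should fetch the body. The names head_headers and should_fetch, the 2 MB default limit and the tuple of wanted types are illustrative assumptions, not part of the original answer.

```python
import urllib.request


def head_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return dict(resp.headers.items())


def should_fetch(headers, max_bytes=2_000_000, wanted=("text/html",)):
    """Decide from headers alone whether the body is worth downloading.

    max_bytes and wanted are hypothetical defaults chosen for this sketch.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    # Drop parameters such as "; charset=utf-8" before comparing.
    ctype = lowered.get("content-type", "").split(";")[0].strip().lower()
    if ctype not in wanted:
        return False
    length = lowered.get("content-length")
    if length is None:
        return True  # size unknown: the caller should still cap the read
    try:
        return int(length) <= max_bytes
    except ValueError:
        return True  # malformed header: treat the size as unknown
```

A crawler could call should_fetch(head_headers(url)) before issuing the GET. Servers are not obliged to send Content-Length (chunked responses usually omit it), so a missing value is treated here as "unknown" rather than as a reason to skip the page.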
If you want to learn more about HTTP headers and their meaning, you can use an online tool such as 'Fetch'
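For the size-limit part of the question, headers alone may not be enough, because Content-Length can be missing or inaccurate. One possible approach is to read the body in chunks and give up once a limit is exceeded. The helper name read_capped and its defaults are assumptions made for this sketch; it works on any object with a read(n) method, such as the response returned by urllib.request.urlopen.

```python
import io


def read_capped(stream, limit, chunk_size=8192):
    """Read at most `limit` bytes from `stream`.

    Returns the data, or None if the stream holds more than `limit` bytes.
    """
    chunks = []
    total = 0
    while True:
        # Ask for one byte more than the limit so an oversized body is detected.
        chunk = stream.read(min(chunk_size, limit + 1 - total))
        if not chunk:
            break
        chunks.append(chunk)
        total += len(chunk)
        if total > limit:
            return None  # too large: stop reading and discard
    return b"".join(chunks)


# Stand-in for a network response; a real crawler would pass the urlopen result.
small = read_capped(io.BytesIO(b"x" * 100), limit=100)
large = read_capped(io.BytesIO(b"x" * 101), limit=100)
```

Memory use stays bounded by roughly the limit plus one chunk, whatever the server claims in its headers.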

Related

By downloading picture's file-header, how to obtain enough information?

I need to download some pictures from a picture server every day. This server adds thousands of pictures daily, and many of them are large. Since the server does not offer thumbnails or any picture description, I have to download each picture completely to know whether it is the one I need. However, my network bandwidth is very low, so it costs considerable time to download every picture. Moreover, the server enforces strict traffic control, so I may only be able to download fewer than 100 pictures a day if the pictures are all large.
I searched some related articles and found that a picture's file header contains a lot of useful information, so this is my plan:
1. Use Python code to download only the file header of every picture. If I only download the file header, the network traffic will be very small, so I can download the file headers of all pictures on the server.
2. Analyse every picture's file header and obtain enough information. From my searching, I know the picture's format (png/jpg/gif), size (XXX,XXX bytes) and resolution (XXXX×YYY, such as 1920x1080) can be obtained from the file header, which is less than 1000 bytes. Maybe it is possible to get more information from the file header, so if you know more, please help me.
3. Export the result to an Excel file.
Could you tell me effective Python code to achieve the above three demands?
Added on July 22:
This is some information I got from HTTP header
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 22 Jul 2018 15:13:19 GMT
Content-Type: image/jpeg
Content-Length: 376386
Cache-Control: public,max-age=518400
Expires: Sat, 28 Jul 2018 15:13:19 GMT
Last-Modified: Sun, 22 Jul 2018 15:13:19 GMT
Vary: Origin
ETag: "5be42"
Connection: Keep-alive
Now I can get Content-Type and Content-Length from the HTTP header, but that is not enough for me.
I searched and found someone who said they could read the image's resolution (XXXX×YYY, such as 1920x1080) from the first 100 bytes of the picture file's data. (100 here is only the maximum; somebody even said he could get the resolution from the first 30 bytes.) I think it is true, because many pictures whose download did not finish can still display the resolution and the top of the picture.
Moreover, maybe there is a way to generate a thumbnail without downloading the complete picture? I'm not sure whether it is possible, but if it can be done, it will be very useful.
You can use PIL library and use getdata.
I don't think that is possible or are there image headers at all? When I do something like
curl -I https://upload.wikimedia.org/wikipedia/de/b/bb/Png-logo.png
to get HTTP headers, I don't see an image size or anything like that:
HTTP/1.1 200 OK
Date: Sat, 21 Jul 2018 17:35:26 GMT
Content-Type: image/png
Content-Length: 811068
Connection: keep-alive
X-Object-Meta-Sha1Base36: tup6ux1u98mkbw32ta64fna0hqw6y09
Last-Modified: Thu, 03 Oct 2013 23:18:32 GMT
Etag: 1f427f6758058528cc0d474a14ee6dc1
X-Timestamp: 1380842311.64879
X-Trans-Id: txdbd33b3337fb497694bd8-005b536ebb
X-Varnish: 185288243, 96001562 108570149, 528370630
Via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
Accept-Ranges: bytes
Age: 34
X-Cache: cp1062 pass, cp3038 hit/2, cp3039 miss
X-Cache-Status: hit-local
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
X-Analytics: https=1;nocookies=1
X-Client-IP: 87.152.115.72
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache, X-Varnish
Timing-Allow-Origin: *
Content-Security-Policy-Report-Only: default-src 'none'; style-src 'unsafe-inline' data:; font-src data:; img-src data: https://upload.wikimedia.org/favicon.ico; media-src data:; sandbox; report-uri https://commons.wikimedia.org/w/api.php?reportonly=1&source=image&action=cspreport&format=json&
X-Content-Security-Policy-Report-Only: default-src 'none'; style-src 'unsafe-inline' data:; font-src data:; img-src data: https://upload.wikimedia.org/favicon.ico; media-src data:; sandbox; report-uri https://commons.wikimedia.org/w/api.php?reportonly=1&source=image&action=cspreport&format=json&
X-Webkit-CSP-Report-Only: default-src 'none'; style-src 'unsafe-inline' data:; font-src data:; img-src data: https://upload.wikimedia.org/favicon.ico; media-src data:; sandbox; report-uri https://commons.wikimedia.org/w/api.php?reportonly=1&source=image&action=cspreport&format=json&
Even if there is such a thing, the limit of 100 would probably affect the image headers as well.
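The claim in the question that the resolution sits in the first few dozen bytes of the file data does hold for at least PNG and GIF, whose layouts put width and height at fixed offsets. The sketch below parses those two formats from a byte prefix using only the standard library; the function name image_size_from_prefix is an assumption for illustration. JPEG is deliberately left out: its dimensions live in a frame segment that can follow metadata of arbitrary length, so a fixed-size prefix is not guaranteed to contain them.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_size_from_prefix(data):
    """Return (format, width, height) from the first bytes of a file,
    or None if the data is too short or not a recognised PNG/GIF."""
    # PNG: 8-byte signature, then the IHDR chunk (4-byte length, b"IHDR",
    # big-endian 32-bit width and height), so 24 bytes are enough.
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        width, height = struct.unpack(">II", data[16:24])
        return ("png", width, height)
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        width, height = struct.unpack("<HH", data[6:10])
        return ("gif", width, height)
    return None


# Synthetic headers built by hand, not real downloads.
png_prefix = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 1920, 1080)
gif_prefix = b"GIF89a" + struct.pack("<HH", 320, 200)
print(image_size_from_prefix(png_prefix))
print(image_size_from_prefix(gif_prefix))
```

When the server answers with Accept-Ranges: bytes, as in the header dump above, the prefix could be requested with a Range header such as bytes=0-1023 instead of downloading the whole file. For JPEG, feeding successive chunks to Pillow's ImageFile.Parser until its image attribute is set is one option.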

Apache sending Transfer-Encoding: chunked when deflate module is enabled

I have a simple web.py code like below, deployed with mod_wsgi in apache.
import web

urls = (
    '/', 'index'
)

class index:
    def GET(self):
        content = 'hello'
        web.header('Content-length', len(content))
        return content

app = web.application(urls, globals())
application = app.wsgifunc()
This website runs well, except for one minor issue. When mod_deflate is turned on, the response is chunked, even though it has a very small response body.
Response Header
HTTP/1.1 200 OK
Date: Wed, 20 May 2015 20:14:12 GMT
Server: Apache/2.4.7 (Ubuntu)
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
When mod_deflate is turned off, the Content-Length header is back.
HTTP/1.1 200 OK
Date: Wed, 20 May 2015 20:30:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Content-Length: 5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=utf-8
I've searched around, and someone said reducing DeflateBufferSize will help, but this response's size is only 5, far below its default value of 8096, so I don't think it is related to this issue.
Someone else said Apache sends a chunked response because it doesn't know the response's size before it begins sending the response to the client, but in my code I do set Content-Length.
I've also tried Flask and Apache/2.2.15 (CentOS), with the same result.
How do I set Content-Length when the deflate module is enabled? I would prefer not to gzip the content in Python.
The response Content-Length has to reflect the final length of the data sent after the compression has been done, not the original length. Thus mod_deflate has to remove the original Content-Length header and use chunked transfer encoding. The only way it could otherwise know the content length to be able to send the Content-Length before sending the compressed data, would be to buffer up the complete compressed response in memory or into a file and then calculate the length. Buffering all the compressed content isn't practical and in part defeats the point of compressing the data as the response is streamed.
If you don't want mod_deflate enabled for the whole site, then only enable it for certain URL prefixes by scoping it within a Location block.
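The point about Content-Length having to describe the compressed bytes can be seen with a tiny experiment. This sketch uses Python's gzip module purely as a stand-in for what mod_deflate does inside Apache (an assumption for illustration; it is not how Apache is configured or invoked), and compresses the same 5-byte body as the web.py example.

```python
import gzip

body = b"hello"                    # the 5-byte body from the example app
compressed = gzip.compress(body)   # what a gzip Content-Encoding would send

# The two lengths differ, so the application's original Content-Length
# cannot be reused once the body has been compressed.
print(len(body), len(compressed))
```

For such a tiny body the gzip framing makes the output larger than the input, so a Content-Length of 5 would be wrong for the compressed response.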

Decoding response while opening a URL

I am using the following code to open a URL and retrieve its response:
def get_issue_report(query):
    request = urllib2.Request(query)
    response = urllib2.urlopen(request)
    response_headers = response.info()
    print response.read()
The response I get is as follows :
<?xml version='1.0' encoding='UTF-8'?><entry xmlns='http://www.w3.org/2005/Atom' xmlns:gd='http://schemas.google.com/g/2005' xmlns:issues='http://schemas.google.com/projecthosting/issues/2009' gd:etag='W/"DUUFQH47eCl7ImA9WxBbFEg."'><id>http://code.google.com/feeds/issues/p/chromium/issues/full/2</id><published>2008-08-30T16:00:21.000Z</published><updated>2010-03-13T05:13:31.000Z</updated><title>Testing if chromium id works</title><content type='html'><b>What steps will reproduce the problem?</b>
<b>1.</b>
<b>2.</b>
<b>3.</b>
<b>What is the expected output? What do you see instead?</b>
<b>Please use labels and text to provide additional information.</b>
</content><link rel='replies' type='application/atom+xml' href='http://code.google.com/feeds/issues/p/chromium/issues/2/comments/full'/><link rel='alternate' type='text/html' href='http://code.google.com/p/chromium/issues/detail?id=2'/><link rel='self' type='application/atom+xml' href='https://code.google.com/feeds/issues/p/chromium/issues/full/2'/><author><name>rah...#google.com</name><uri>/u/#VBJVRVdXDhZCVgJ%2FF3tbUV5SAw%3D%3D/</uri></author><issues:closedDate>2008-08-30T20:48:43.000Z</issues:closedDate><issues:id>2</issues:id><issues:label>Type-Bug</issues:label><issues:label>Priority-Medium</issues:label><issues:owner><issues:uri>/u/kuchhal#chromium.org/</issues:uri><issues:username>kuchhal#chromium.org</issues:username></issues:owner><issues:stars>4</issues:stars><issues:state>closed</issues:state><issues:status>Invalid</issues:status></entry>
I would like to get rid of the characters like &lt;, &gt; etc. I tried using
response.read().decode('utf-8')
but this doesn't help much.
Just in case, the response.info() prints the following :
Content-Type: application/atom+xml; charset=UTF-8; type=entry
Expires: Fri, 01 Jul 2011 11:15:17 GMT
Date: Fri, 01 Jul 2011 11:15:17 GMT
Cache-Control: private, max-age=0, must-revalidate, no-transform
Vary: Accept, X-GData-Authorization, GData-Version
GData-Version: 1.0
ETag: W/"DUUFQH47eCl7ImA9WxBbFEg."
Last-Modified: Sat, 13 Mar 2010 05:13:31 GMT
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
Here's the URL : https://code.google.com/feeds/issues/p/chromium/issues/full/2
Sentinel has explained how you can decode entity references like &lt; but there's a bit more to the problem than that.
The example you give suggests that you are reading an Atom feed. If you want to do this reliably in Python, then I recommend using Mark Pilgrim's Universal Feed Parser.
Here's how one would read the feed in your example:
>>> import feedparser
>>> d = feedparser.parse('http://code.google.com/feeds/issues/p/chromium/issues/full/2')
>>> len(d.entries)
1
>>> print d.entries[0].title
Testing if chromium id works
>>> print d.entries[0].description
<b>What steps will reproduce the problem?</b>
<b>1.</b>
<b>2.</b>
<b>3.</b>
<b>What is the expected output? What do you see instead?</b>
<b>Please use labels and text to provide additional information.</b>
Using feedparser is likely to be much more reliable and convenient than trying to do your own XML parsing, entity decoding, date parsing, HTML sanitization, and so on.
from HTMLParser import HTMLParser
import urllib2
query="http://code.google.com/feeds/issues/p/chromium/issues/full/2"
def get_issue_report(query):
    request = urllib2.Request(query)
    response = urllib2.urlopen(request)
    response_headers = response.info()
    return response.read()
s = get_issue_report(query)
p = HTMLParser()
print p.unescape(s)
p.close()
Use
xml.sax.saxutils.unescape()
http://docs.python.org/library/xml.sax.utils.html#module-xml.sax.saxutils
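The snippets above target Python 2 (urllib2, HTMLParser().unescape). As a hedged Python 3 sketch of the same ideas: html.unescape and xml.sax.saxutils.unescape decode entity references in a string, and an XML parser such as xml.etree.ElementTree decodes them on its own when it returns element text. The small Atom document below is a shortened stand-in for the real response, written by hand for illustration.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

escaped = "&lt;b&gt;What steps will reproduce the problem?&lt;/b&gt;"

# Both helpers turn &lt; and &gt; back into < and >.
via_html = html.unescape(escaped)
via_sax = unescape(escaped)

# Abbreviated, hand-written sample in the shape of the feed entry.
ATOM = "{http://www.w3.org/2005/Atom}"
doc = (
    "<entry xmlns='http://www.w3.org/2005/Atom'>"
    "<title>Testing if chromium id works</title>"
    "<content type='html'>&lt;b&gt;1.&lt;/b&gt;</content>"
    "</entry>"
)
entry = ET.fromstring(doc)
title = entry.find(ATOM + "title").text
content = entry.find(ATOM + "content").text  # entities already decoded

print(via_html)
print(title)
print(content)
```

Parsing the XML first and reading element text avoids unescaping the whole document as a string, which would also turn the escaped HTML inside content into markup that looks like part of the feed.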

Upload images from a web page

I want to implement a feature similar to this http://www.tineye.com/parse?url=yahoo.com - allowing a user to upload images from any web page.
The main problem for me is that it takes too much time for web pages with a large number of images.
I'm doing this in Django (using curl or urllib) according to the following scheme:
Grab the HTML of the page (takes about 1 sec for big pages):
file = urllib.urlopen(requested_url)
html_string = file.read()
Parse it with an HTML parser (BeautifulSoup), looking for img tags and writing the src of every image to a list (takes about 1 sec too for big pages).
Check the sizes of all images in my list and, if they are big enough, return them in a JSON response (this takes very long, about 15 sec, when there are about 80 images on a web page). Here's the code of the function:
def get_image_size(uri):
    file = urllib.urlopen(uri)
    p = ImageFile.Parser()
    data = file.read(1024)
    if not data:
        return None
    p.feed(data)
    if p.image:
        return p.image.size
    file.close()
    #not an image
    return None
As you can see, I'm not loading the full image to get its size, only 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).
So how can I make it work faster?
Is there perhaps a way to avoid making a request for every single image?
Any help will be highly appreciated.
Thanks!
I can think of a few optimisations:
parse the file as you are reading it from the stream
use a SAX parser (which works well with the point above)
use HEAD to get the size of the images
put your images in a queue, then use a few threads to connect and get the file sizes
example of HEAD request:
$ telnet m.onet.pl 80
Trying 213.180.150.45...
Connected to m.onet.pl.
Escape character is '^]'.
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
host: m.onet.pl
HTTP/1.0 200 OK
Server: nginx/0.8.53
Date: Sat, 09 Apr 2011 18:32:44 GMT
Content-Type: image/jpeg
Content-Length: 37545
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
Expires: Sat, 16 Apr 2011 18:32:44 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
Age: 6575
X-Cache: HIT from emka1.m10r2.onet
Via: 1.1 emka1.m10r2.onet:80 (squid)
Connection: close
Connection closed by foreign host.
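The last two optimisations (HEAD requests plus a few worker threads) could be combined along the following lines in Python 3. This is a sketch under assumptions: the names head_content_length and sizes_in_parallel and the default of 8 workers are made up for illustration, and a thread pool from concurrent.futures stands in for the hand-rolled queue mentioned above. Note that Content-Length is the file size in bytes, not the pixel dimensions.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def head_content_length(url, timeout=10):
    """Return the Content-Length reported for `url`, or None if absent."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        value = resp.headers.get("Content-Length")
    return int(value) if value is not None else None


def sizes_in_parallel(urls, fetch=head_content_length, workers=8):
    """Map each URL to its size, querying several URLs at the same time.

    `fetch` is injectable so the function can be tried without a network.
    A URL whose request fails is mapped to None instead of aborting the batch.
    """
    def safe(url):
        try:
            return fetch(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(safe, urls)))
```

Because the time is dominated by waiting on the network, 80 images checked by 8 workers should take roughly the time of 10 sequential requests rather than 80, though the actual gain depends on the servers involved.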
You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).
Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.
|milo|laurie|¥ cat test.py
import urllib2

uri = "http://download.thinkbroadband.com/1GB.zip"

def get_file_size(uri):
    file = urllib2.urlopen(uri)
    content_header, = [header for header in file.headers.headers if header.startswith("Content-Length")]
    _, str_length = content_header.split(':')
    length = int(str_length.strip())
    return length

if __name__ == "__main__":
    get_file_size(uri)
|milo|laurie|¥ time python2 test.py
python2 test.py 0.06s user 0.01s system 35% cpu 0.196 total

Download several parts of one file concurrently with Python?

I know how to use urllib to download a file. However, it's much faster, if the server allows it, to download several parts of the same file simultaneously and then merge them.
How do you do that in Python? If you can't do it easily with the standard lib, is there any library that would let you do it?
Although I agree with Gregory's suggestion of using an existing library, it's worth noting that you can do this by using the Range HTTP header. If the server accepts byte-range requests, you can start several threads to download multiple parts of the file in parallel. This snippet, for example, will only download bytes 0..65535 of the specified file:
import urllib2
url = 'http://example.com/test.zip'
req = urllib2.Request(url, headers={'Range':'bytes=0-65535'})
data = urllib2.urlopen(req).read()
You can determine the remote resource size and see whether the server supports ranged requests by sending a HEAD request:
import urllib2

class HeadRequest(urllib2.Request):
    # Override the HTTP method so urlopen sends HEAD instead of GET
    def get_method(self):
        return "HEAD"

url = 'http://sstatic.net/stackoverflow/img/sprites.png'
req = HeadRequest(url)
response = urllib2.urlopen(req)
response.close()
print response.headers
The above prints:
Cache-Control: max-age=604800
Content-Length: 16542
Content-Type: image/png
Last-Modified: Thu, 10 Mar 2011 06:13:43 GMT
Accept-Ranges: bytes
ETag: "c434b24eeadecb1:0"
Date: Mon, 14 Mar 2011 16:08:02 GMT
Connection: close
From that we can see that the file size is 16542 bytes ('Content-Length') and the server supports byte ranges ('Accept-Ranges: bytes').
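Putting the two pieces together, a parallel download could look roughly like the Python 3 sketch below: split the Content-Length into byte ranges, fetch each range in its own thread with a Range header, and join the parts in order. The names split_ranges, fetch_range and download_in_parts, and the default of 4 parts, are assumptions made for this illustration; it also holds the whole file in memory, which a real downloader for large files would avoid.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, parts):
    """Split `size` bytes into up to `parts` inclusive (start, end) ranges."""
    if size <= 0:
        return []
    parts = max(1, min(parts, size))
    base, extra = divmod(size, parts)
    ranges, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def fetch_range(url, start, end, timeout=30):
    """Download bytes start..end (inclusive) of `url`."""
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if resp.status != 206:  # 206 Partial Content means the range was honoured
            raise RuntimeError("server did not honour the Range header")
        return resp.read()


def download_in_parts(url, size, parts=4, fetch=fetch_range):
    """Fetch all ranges concurrently and return the reassembled bytes."""
    ranges = split_ranges(size, parts)
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        # pool.map keeps the input order, so the chunks join correctly.
        chunks = list(pool.map(lambda r: fetch(url, r[0], r[1]), ranges))
    return b"".join(chunks)
```

The size argument would come from the Content-Length of a HEAD response such as the one above (16542 in that example), and the approach only makes sense when the server advertises Accept-Ranges: bytes.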
PycURL can do it. PycURL is a Python interface to libcurl. It can be used to fetch objects identified by a URL from a Python program, similar to the urllib Python module. PycURL is mature, very fast, and supports a lot of features.
