I want to implement a feature similar to this http://www.tineye.com/parse?url=yahoo.com - allow users to upload images from any web page.
The main problem for me is that it takes too much time for web pages with a large number of images.
I'm doing this in Django (using curl or urllib) following this scheme:
Grab html of the page (takes about 1 sec for big pages):
file = urllib.urlopen(requested_url)
html_string = file.read()
Parse it with an HTML parser (BeautifulSoup), looking for img tags and writing all image src values to a list. (Also takes about 1 sec for big pages.)
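For illustration, a minimal sketch of that step, continuing from the html_string above (I'm assuming the BeautifulSoup 3 API here, to match the Python 2 code in this question):

import urlparse
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

soup = BeautifulSoup(html_string)
# Resolve relative src values against the page URL.
img_urls = [urlparse.urljoin(requested_url, img['src'])
            for img in soup.findAll('img', src=True)]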
Check the sizes of all the images in my list and, if they are big enough, return them in a JSON response (takes very long, about 15 sec, when there are about 80 images on a web page). Here's the code of the function:
def get_image_size(uri):
    # Read only the first 1 KB and let PIL's incremental parser
    # try to pull the dimensions out of the image header.
    file = urllib.urlopen(uri)
    try:
        data = file.read(1024)
        if not data:
            return None
        p = ImageFile.Parser()
        p.feed(data)
        if p.image:
            return p.image.size
    finally:
        file.close()
    # not an image (or the header didn't fit in the first 1 KB)
    return None
As you can see, I'm not loading the full image to get its size, only 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).
So how can I make it work faster?
Maybe there is a way to avoid making a request for every single image?
Any help will be highly appreciated.
Thanks!
I can think of a few optimisations:
parse as you are reading the file from the stream
use a SAX parser (which works great with the point above)
use HEAD requests to get the sizes of the images
put your images in a queue, then use a few threads to connect and get the file sizes (a sketch of this follows the HEAD example below)
Example of a HEAD request:
$ telnet m.onet.pl 80
Trying 213.180.150.45...
Connected to m.onet.pl.
Escape character is '^]'.
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
host: m.onet.pl
HTTP/1.0 200 OK
Server: nginx/0.8.53
Date: Sat, 09 Apr 2011 18:32:44 GMT
Content-Type: image/jpeg
Content-Length: 37545
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
Expires: Sat, 16 Apr 2011 18:32:44 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
Age: 6575
X-Cache: HIT from emka1.m10r2.onet
Via: 1.1 emka1.m10r2.onet:80 (squid)
Connection: close
Connection closed by foreign host.
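A rough sketch of the queue-plus-worker-threads idea in Python 2 (the helper names and the thread count are mine, and query strings are ignored for brevity):

import httplib
import threading
import urlparse
import Queue

def head_content_length(url):
    # Issue a HEAD request and read Content-Length from the response headers.
    parts = urlparse.urlsplit(url)
    conn = httplib.HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path or "/")
    length = conn.getresponse().getheader("content-length")
    conn.close()
    return int(length) if length else None

def worker(queue, sizes):
    while True:
        try:
            url = queue.get_nowait()
        except Queue.Empty:
            return
        sizes[url] = head_content_length(url)

def get_sizes(urls, num_threads=10):
    queue = Queue.Queue()
    for url in urls:
        queue.put(url)
    sizes = {}
    threads = [threading.Thread(target=worker, args=(queue, sizes))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sizes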
You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).
Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.
|milo|laurie|¥ cat test.py
import urllib2
uri = "http://download.thinkbroadband.com/1GB.zip"
def get_file_size(uri):
    file = urllib2.urlopen(uri)
    content_header, = [header for header in file.headers.headers
                       if header.startswith("Content-Length")]
    _, str_length = content_header.split(':')
    length = int(str_length.strip())
    return length

if __name__ == "__main__":
    get_file_size(uri)
|milo|laurie|¥ time python2 test.py
python2 test.py 0.06s user 0.01s system 35% cpu 0.196 total
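As a side note (an alternative, not a correction): inside get_file_size, the same value can be read without scanning the raw header lines, since the headers object behaves like an rfc822.Message:

length = int(file.headers.getheader("Content-Length"))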
Related
I am using the SimpleHTTPServer class in my code to respond to client requests (it is actually a Mininet Python script for a networking project). The client sends a request to the server 10.0.0.1 every 5 seconds:
server.cmd('python -m SimpleHTTPServer 80 &')

def tcp_thread(client_id):
    for i in range(180):
        client_id.cmd('wget -O - 10.0.0.1')
        time.sleep(5)
When tracing with Wireshark, I noticed the server sends a junk page of 390 bytes, something like this:
Hypertext Transfer Protocol
HTTP/1.0 200 OK\r\n
[Expert Info (Chat/Sequence): HTTP/1.0 200 OK\r\n]
Request Version: HTTP/1.0
Status Code: 200
Response Phrase: OK
Server: SimpleHTTP/0.6 Python/2.7.6\r\n
Date: Fri, 08 Jul 2016 16:16:47 GMT\r\n
Content-type: text/html; charset=UTF-8\r\n
Content-Length: 390\r\n
\r\n
[HTTP response 1/1]
[Time since request: 0.000905000 seconds]
[Request in frame: 75]
File Data: 390 bytes
The page contents look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><html>\n
<title>Directory listing for /</title>\n
<body>\n
<h2>Directory listing for /</h2>\n
<hr>\n
<ul>\n
<li>experiment.py\n
<li>experiment1.mn\n
<li>experiment1.py\n
<li>README\n
<li>rules.txt\n
</ul>\n
<hr>\n
</body>\n
</html>\n
My question is: how can I change the page contents so that the page sent is larger than 390 bytes? I tried searching for ways to customize the page, but none of the results address this clearly.
Thank you.
SimpleHTTPServer serves directory listings, files, and index.html, as explained in the documentation: https://docs.python.org/2.7/library/simplehttpserver.html
You can either create an index.html file in the same directory, or you can implement the HTTP response yourself by switching to BaseHTTPRequestHandler and overriding do_GET.
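A minimal sketch of the second approach (Python 2; the handler name, the 10 000-byte body, and the bind address are just examples for this Mininet setup):

import BaseHTTPServer

class LargeResponseHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # Build a body of whatever size you need (here: 10 000 bytes).
        body = "x" * 10000
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Bind to the server address used in the question (port 80 needs root).
    server = BaseHTTPServer.HTTPServer(("10.0.0.1", 80), LargeResponseHandler)
    server.serve_forever()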
This is my first question posed to Stack Overflow, because typically I can find solutions to my problems here, but for this particular situation I cannot. I am writing a Python plugin for my compiler that outputs REST calls in various languages for interacting with an API. I am authenticating with the socket and ssl modules by sending a username and password in the request body in JSON form. Upon successful authentication, the API returns a response in the following format, with the important response data in the body:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Date: Tue, 05 Feb 2013 03:36:18 GMT
Vary: Accept-Charset, Accept-Encoding, Accept-Language, Accept
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST,OPTIONS,GET
Access-Control-Allow-Headers: Content-Type
Server: Restlet-Framework/2.0m5
Content-Type: text/plain;charset=ISO-8859-1
Content-Length: 94
{"authentication-token":"<token>","authentication-secret":"<secret>"}
This is probably a very elementary question for Pythonistas, given Python's powerful tools for string manipulation. But alas, I am a new programmer who started with Java. I would like to know the best way to parse this entire response to obtain the "<token>" and "<secret>". Should I search for a "{" and load the substring as JSON? My intuition is telling me to try the re module, but I cannot seem to figure out how it would be used in this situation, since the patterns of the token and secret are obviously not predictable. Because I have opted to authenticate with a low-level module set, this response is one big string, obtained by constructing the header, appending the JSON data to it as the body, then executing the request and reading the response with the following code:
#Socket configuration and connection execution
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
conn = ssl.wrap_socket(sock, ca_certs = pem_file)
conn.connect((host, port))
conn.send(req)
response = conn.recv()
print(response)
The print statement outputs the first code sample. Any help or insight would be greatly appreciated!
HTTP headers are split from the rest of the body by a \r\n\r\n sequence. Do something like:
import json
...
headers, js = response.split("\r\n\r\n", 1)  # split only on the first blank line
data = json.loads(js)
token = data["authentication-token"]
secret = data["authentication-secret"]
You'll probably want to check the response status, etc., and various libraries (e.g. requests) can make all of this a whole lot easier for you.
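For illustration, a hedged sketch of the requests route (the endpoint URL and the credential field names here are placeholders, not the real API):

import json
import requests

r = requests.post("https://example-host/authenticate",
                  data=json.dumps({"username": "user", "password": "pass"}),
                  headers={"Content-Type": "application/json"},
                  verify=pem_file)  # pem_file as in the question's ssl.wrap_socket call
data = r.json()
token = data["authentication-token"]
secret = data["authentication-secret"]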
I am trying to perform a simple action:
POST to a URL
Return HTTP 303 (SeeOther)
GET from new URL
From what I can tell, this is a pretty standard practice:
http://en.wikipedia.org/wiki/Post/Redirect/Get
Also, it would seem that SeeOther is designed to work this way:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.4
I'm using web.py as my server-side controller, but I suspect that it's not the issue. If I GET, SeeOther works flawlessly as expected. If I POST to the same URL, the browser fails to redirect or load anything at all.
Thinking it was a browser issue, I tried both IE9 and Google Chrome (v23 ish). Both have the same issue.
Thinking web.py might be serving the page incorrectly, or generating a bad URL, I used telnet to examine the headers. I found this:
HTTP GET (this works in the browser):
GET /Users/1 HTTP/1.1
HOST: domain.com
HTTP/1.1 303 See Other
Date: Mon, 24 Dec 2012 18:07:55 GMT
Server: Apache/2
Cache-control: no-cache
Location: http://domain.com/Users
Content-Length: 0
Content-Type: text/html
HTTP POST (this does not work in the browser):
POST /Users/1 HTTP/1.1
HOST: domain.com
HTTP/1.1 303 See Other
Date: Mon, 24 Dec 2012 18:12:35 GMT
Server: Apache/2
Cache-control: no-cache
Location: http://domain.com/Users
Content-Length: 0
Content-Type: text/html
Another thing that could be throwing a wrench in the works:
I'm using mod_rewrite so that the user-visible domain.com/Users/1 is actually domain.com/control.py/Users/1
There may be more information/troubleshooting that I have, but I'm drawing a blank right now.
The Question:
Why does this work with a GET request, but not a POST request? Am I missing a response header somewhere?
EDIT:
Using IE9 Developer Tools and Chrome's Inspector, it looks like the 303 isn't coming back to the browser after a POST. However, I can see the 303 come in when I do a GET request.
However, after looking more closely at Chrome's Inspector, I saw the ability to log every request (don't clear w/ each page call). This allowed me to see that for some reason, my POST request looks like it's failing. Again - GET works just fine.
It's entirely possible that this isn't your issue, but since you don't have your code posted I'll take a shot (just in case).
Since you're using web.py, do you have the POST method defined on your object?
i.e.
urls = (
    '/page', 'page'
)

class page:
    def POST(self):
        # Do something
        pass

    def GET(self):
        # Do something else
        pass
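If the POST handler is in place and the redirect still doesn't fire, it can also help to issue the 303 explicitly with web.py's redirect helper. A sketch, assuming a /Users listing route as in the question:

import web

class Users:
    def POST(self):
        # ... handle the form submission ...
        # web.seeother issues a "303 See Other" redirect
        raise web.seeother('/Users')

    def GET(self):
        return "user list"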
I know how to use urllib to download a file. However, it's much faster, if the server allows it, to download several parts of the same file simultaneously and then merge them.
How do you do that in Python? If you can't do it easily with the standard library, is there any library that would let you do it?
Although I agree with Gregory's suggestion of using an existing library, it's worth noting that you can do this by using the Range HTTP header. If the server accepts byte-range requests, you can start several threads to download multiple parts of the file in parallel. This snippet, for example, will only download bytes 0..65535 of the specified file:
import urllib2
url = 'http://example.com/test.zip'
req = urllib2.Request(url, headers={'Range':'bytes=0-65535'})
data = urllib2.urlopen(req).read()
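To extend that into a parallel download, here is a rough sketch (the function names are mine; it naively uses one thread per chunk and assumes total_size was obtained separately, e.g. from the HEAD request shown next):

import threading
import urllib2

def fetch_range(url, start, end, results, index):
    # Ask the server for just this byte range.
    req = urllib2.Request(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
    results[index] = urllib2.urlopen(req).read()

def download_in_parts(url, total_size, part_size=65536):
    ranges = [(start, min(start + part_size - 1, total_size - 1))
              for start in range(0, total_size, part_size)]
    results = [None] * len(ranges)
    threads = []
    for index, (start, end) in enumerate(ranges):
        t = threading.Thread(target=fetch_range, args=(url, start, end, results, index))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    # Merge the chunks back together in order.
    return ''.join(results)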
You can determine the remote resource size and see whether the server supports ranged requests by sending a HEAD request:
import urllib2

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

url = 'http://sstatic.net/stackoverflow/img/sprites.png'
req = HeadRequest(url)
response = urllib2.urlopen(req)
response.close()
print response.headers
The above prints:
Cache-Control: max-age=604800
Content-Length: 16542
Content-Type: image/png
Last-Modified: Thu, 10 Mar 2011 06:13:43 GMT
Accept-Ranges: bytes
ETag: "c434b24eeadecb1:0"
Date: Mon, 14 Mar 2011 16:08:02 GMT
Connection: close
From that we can see that the file size is 16542 bytes ('Content-Length') and the server supports byte ranges ('Accept-Ranges: bytes').
PycURL can do it. PycURL is a Python interface to libcurl. It can be used to fetch objects identified by a URL from a Python program, similar to the urllib Python module. PycURL is mature, very fast, and supports a lot of features.
I wrote a crawler in Python. The fetched URLs are of different types: a URL can point to an HTML page, an image, a big archive, or some other file. So I need to determine the type quickly, to avoid reading big files such as large archives, and keep crawling. What is the best way to determine the URL type at the start of page loading?
I understand that I can guess from the URL name (ends with .rar, .jpg, etc.), but I don't think that's a complete solution. Do I need to check a header or something like that? I also need some prediction of the page size to avoid large downloads; in other words, I want to set a limit on the downloaded page size to keep memory from filling up quickly.
If you use an HTTP HEAD request on the resource, you will get the relevant metadata about the resource without the resource data itself. Specifically, the Content-Length and Content-Type headers will be of interest.
E.g.
HEAD /stackoverflow/img/favicon.ico HTTP/1.1
host: sstatic.net
HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Length: 1150
Content-Type: image/x-icon
Last-Modified: Mon, 02 Aug 2010 06:04:04 GMT
Accept-Ranges: bytes
ETag: "2187d82832cb1:0"
X-Powered-By: ASP.NET
Date: Sun, 12 Sep 2010 13:38:36 GMT
You can do this in Python using httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]
This tells you it's an image (image/* mime-type) of 1150 bytes. Enough information for you to decide if you want to fetch the full resource.
Additionally, this header tells you the server accepts HTTP partial-content requests (Accept-Ranges header), which allows you to retrieve the data in batches.
You will get the same header information if you do a GET directly, but this will also start sending the resource data in the body of the response, something you want to avoid.
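If the server doesn't send a Content-Length (or you don't trust it), one fallback is to cap how much of a GET response you are willing to read before giving up. A sketch, with the cap and function name chosen arbitrarily:

import httplib

MAX_BYTES = 1024 * 1024  # example cap: 1 MB

def fetch_limited(host, path, limit=MAX_BYTES):
    conn = httplib.HTTPConnection(host)
    conn.request("GET", path)
    res = conn.getresponse()
    data = res.read(limit + 1)  # read at most limit + 1 bytes
    conn.close()
    if len(data) > limit:
        return None  # too big, skip this resource
    return data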
If you want to learn more about HTTP headers and their meaning, you can use an online tool such as 'Fetch'.