I need to download some pictures from a picture server every day. The server adds thousands of pictures daily, and many of them are large. Since the server supports neither thumbnails nor picture descriptions, I have to download each picture completely to know whether it is the one I need. However, my network bandwidth is very low, so downloading every picture takes considerable time. Moreover, the server enforces strict traffic limits, so I may be able to download fewer than 100 pictures a day if they are all large.
I searched for related articles and found that a picture's file header contains a lot of useful information, so this is my plan:
Use Python code to download the file header of every picture. If I download only the file headers, the traffic will be very small, so I can fetch the headers of all pictures on the server.
Analyse each picture's file header and extract as much information as possible. From my searching, I know that the picture's format (png/jpg/gif), size (XXX,XXX bytes) and resolution (XXXX×YYY, such as 1920x1080) can be obtained from the file header, which is less than 1000 bytes. It may be possible to get more information from the file header, so if you know more, please help me.
Export the result to an Excel file.
Could you tell me effective Python code to achieve the above three goals?
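The export step of the plan above can be sketched with the standard library. This is a minimal sketch, assuming a CSV file (which Excel opens directly) is an acceptable substitute for a native .xlsx file; the column names in FIELDS and the helper name write_report are hypothetical, and the sample row uses made-up values.

```python
import csv
import io

# Hypothetical column names; adjust to whatever metadata is collected.
FIELDS = ["url", "format", "size_bytes", "width", "height"]


def write_report(rows, out):
    """Write one row per picture to a file-like object in CSV form."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Example with made-up values, written to memory instead of disk.
buffer = io.StringIO()
write_report(
    [{"url": "http://example.com/a.jpg", "format": "jpeg",
      "size_bytes": 376386, "width": 1920, "height": 1080}],
    buffer,
)
report_text = buffer.getvalue()
```

To write a real file, pass open("report.csv", "w", newline="") instead of the in-memory buffer.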
Added on July 22:
This is some information I got from the HTTP header:
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 22 Jul 2018 15:13:19 GMT
Content-Type: image/jpeg
Content-Length: 376386
Cache-Control: public,max-age=518400
Expires: Sat, 28 Jul 2018 15:13:19 GMT
Last-Modified: Sun, 22 Jul 2018 15:13:19 GMT
Vary: Origin
ETag: "5be42"
Connection: Keep-alive
Now I can get Content-Type and Content-Length from the HTTP header, but that is not enough for me.
I searched and found people saying they could read an image's resolution (XXXX×YYY, such as 1920x1080) from the first 100 bytes of the picture file data. (100 is only an upper bound; somebody even said the resolution can be read from the first 30 bytes.) I think this is true, because many pictures whose download did not finish still display the resolution and the top of the picture.
Moreover, is there a way to generate a thumbnail without downloading the complete picture? I'm not sure whether it is possible, but if it can be done, it will be very useful.
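The idea of reading the resolution from the leading bytes of the file can be sketched as follows. This is a minimal sketch, not a complete parser: image_info is a hypothetical helper name, it only handles PNG, GIF and JPEG, and it ignores rare JPEG markers that carry no length field. For PNG and GIF the dimensions sit at fixed offsets within the first 24 bytes; for JPEG the frame header can sit well past the first 100 bytes when metadata segments such as EXIF come first.

```python
import struct


def image_info(data):
    """Return (format, width, height) from the leading bytes of an image,
    or None if the format is not recognized or too few bytes were given."""
    # PNG: 8-byte signature, then the IHDR chunk, whose big-endian
    # width and height sit at byte offsets 16 and 20.
    if data[:8] == b"\x89PNG\r\n\x1a\n" and len(data) >= 24:
        width, height = struct.unpack(">II", data[16:24])
        return ("png", width, height)
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if data[:6] in (b"GIF87a", b"GIF89a") and len(data) >= 10:
        width, height = struct.unpack("<HH", data[6:10])
        return ("gif", width, height)
    # JPEG: walk the marker segments until a start-of-frame (SOF) marker.
    if data[:2] == b"\xff\xd8":
        pos = 2
        while pos + 9 <= len(data):
            if data[pos] != 0xFF:
                return None  # not at a marker boundary; give up
            marker = data[pos + 1]
            if marker == 0xFF:  # fill byte before the real marker
                pos += 1
                continue
            # SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
            if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
                # Layout: marker(2) length(2) precision(1) height(2) width(2)
                height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
                return ("jpeg", width, height)
            (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
            pos += 2 + length  # skip this segment
    return None
```

If image_info returns None for a JPEG prefix, the likely cause is that the SOF segment lies beyond the bytes fetched so far, and a larger prefix is needed.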
You can use the PIL library and its getdata method.
I don't think that is possible; are there image headers at all? When I do something like
curl -I https://upload.wikimedia.org/wikipedia/de/b/bb/Png-logo.png
to get HTTP headers, I don't see an image size or anything like that:
HTTP/1.1 200 OK
Date: Sat, 21 Jul 2018 17:35:26 GMT
Content-Type: image/png
Content-Length: 811068
Connection: keep-alive
X-Object-Meta-Sha1Base36: tup6ux1u98mkbw32ta64fna0hqw6y09
Last-Modified: Thu, 03 Oct 2013 23:18:32 GMT
Etag: 1f427f6758058528cc0d474a14ee6dc1
X-Timestamp: 1380842311.64879
X-Trans-Id: txdbd33b3337fb497694bd8-005b536ebb
X-Varnish: 185288243, 96001562 108570149, 528370630
Via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
Accept-Ranges: bytes
Age: 34
X-Cache: cp1062 pass, cp3038 hit/2, cp3039 miss
X-Cache-Status: hit-local
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
X-Analytics: https=1;nocookies=1
X-Client-IP: 87.152.115.72
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache, X-Varnish
Timing-Allow-Origin: *
Content-Security-Policy-Report-Only: default-src 'none'; style-src 'unsafe-inline' data:; font-src data:; img-src data: https://upload.wikimedia.org/favicon.ico; media-src data:; sandbox; report-uri https://commons.wikimedia.org/w/api.php?reportonly=1&source=image&action=cspreport&format=json&
X-Content-Security-Policy-Report-Only: default-src 'none'; style-src 'unsafe-inline' data:; font-src data:; img-src data: https://upload.wikimedia.org/favicon.ico; media-src data:; sandbox; report-uri https://commons.wikimedia.org/w/api.php?reportonly=1&source=image&action=cspreport&format=json&
X-Webkit-CSP-Report-Only: default-src 'none'; style-src 'unsafe-inline' data:; font-src data:; img-src data: https://upload.wikimedia.org/favicon.ico; media-src data:; sandbox; report-uri https://commons.wikimedia.org/w/api.php?reportonly=1&source=image&action=cspreport&format=json&
Even if there is such a thing, the limit of 100 would probably affect the image headers as well.
Related
I have an issue with CORS.
I am trying to access an mp3 I have on one server on a web page served by a different machine.
The server is set up with https://gist.github.com/fxsjy/5465353
When I access the path directly it all works.
<httpProtocol>
<customHeaders>
<add name="Access-Control-Allow-Origin" value="*" />
<add name="Access-Control-Allow-Methods" value="POST, GET, OPTIONS" />
<add name="Access-Control-Allow-Headers" value="Content-type, Content-Length,Date,Last-Modified,Server" />
</customHeaders>
</httpProtocol>
I have the above in my web.config and the following in my HTML.
<audio id="myAudio"
controls="controls"
src="http://{A IP}:{A PORT}/{A File Path}/{A file}.mp3"
type="audio/mpeg">
Your user agent does not support the HTML5 Audio element.
</audio>
The page response headers are:
HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Vary: Accept-Encoding
Server: Microsoft-IIS/8.0
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-SourceFiles: =?UTF-8?B?QzpcVXNlcnNcc2Vhbi5oYW5zZm9yZFxEcm9wYm94XE11c2ljbzJcTXVzaWNvMlxNdXNpY28yXEhvbWVcVGVzdA==?=
X-Powered-By: ASP.NET
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST, GET, OPTIONS
Access-Control-Allow-Headers: Content-type, Content-Length,Date,Last-Modified,Server
Date: Tue, 23 Jun 2015 00:20:11 GMT
Content-Length: 1233
When I put the mp3 file locally it works; when I access it remotely I get
MediaElementAudioSource outputs zeroes due to CORS access restrictions for http://{A IP}:{A PORT}/{A File Path}/{A file}.mp3
as an error in the Chrome console. Firefox loads without errors but does nothing.
Both requests to the mp3 are returned as 200.
The response headers are:
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.9
Date: Tue, 23 Jun 2015 00:20:12 GMT
Content-type: audio/mpeg
Content-Length: 2882414
Last-Modified: Mon, 07 Feb 2011 10:31:23 GMT
Can anyone see what I'm doing wrong :(?
So don't ask me why, but adding crossorigin="anonymous" to my audio tag fixes the issue...
<audio id="myAudio"
controls="controls"
src="TUNE.mp3"
type="audio/mpeg"
crossorigin="anonymous">
See https://bugzilla.mozilla.org/show_bug.cgi?id=937718
Also I have to add to the server response.
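The answer above does not say which header was added to the server response. As a guess only, here is a minimal sketch of a static file server that sends Access-Control-Allow-Origin on every response. It assumes Python 3's http.server rather than the Python 2.7 SimpleHTTP server shown in the question; CORSRequestHandler, make_server and the port number are hypothetical names and values.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds a CORS header to every response."""

    def end_headers(self):
        # Assumption: allowing any origin is acceptable for this server.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def make_server(port=8000):
    """Build (but do not start) a server; the port is an arbitrary choice."""
    return HTTPServer(("", port), CORSRequestHandler)
```

Calling make_server().serve_forever() would serve the current directory with the extra header.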
I have a simple web.py code like below, deployed with mod_wsgi in apache.
import web

urls = (
    '/', 'index'
)

class index:
    def GET(self):
        content = 'hello'
        web.header('Content-length', len(content))
        return content

app = web.application(urls, globals())
application = app.wsgifunc()
This website runs well, except for one minor issue. When mod_deflate is turned on, the response is chunked, even though it has a very small response body.
Response Header
HTTP/1.1 200 OK
Date: Wed, 20 May 2015 20:14:12 GMT
Server: Apache/2.4.7 (Ubuntu)
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
When mod_deflate is turned off, the Content-Length header is back.
HTTP/1.1 200 OK
Date: Wed, 20 May 2015 20:30:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Content-Length: 5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=utf-8
I've searched around and someone said reducing DeflateBufferSize will help, but this response's size is only 5, far below its default value of 8096, so I don't think it is related to this issue.
Someone else said Apache sends a chunked response because it doesn't know the response's size before it begins sending the response to the client, but in my code I do set Content-Length.
I've also tried Flask and Apache/2.2.15 (CentOS), with the same result.
How do I set Content-Length when the deflate module is enabled? I would rather not gzip the content in Python.
The response Content-Length has to reflect the final length of the data sent after the compression has been done, not the original length. Thus mod_deflate has to remove the original Content-Length header and use chunked transfer encoding. The only way it could otherwise know the content length to be able to send the Content-Length before sending the compressed data, would be to buffer up the complete compressed response in memory or into a file and then calculate the length. Buffering all the compressed content isn't practical and in part defeats the point of compressing the data as the response is streamed.
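The length mismatch described above can be seen with a small sketch, using Python's gzip module as a stand-in for what mod_deflate does to the body (the variable names are only illustrative):

```python
import gzip

content = b"hello"
compressed = gzip.compress(content)

original_length = len(content)       # what the application set as Content-Length
compressed_length = len(compressed)  # what would actually be sent to the client
```

For a body this small the compressed form is longer than the original, because the gzip header and trailer alone take 18 bytes. Either way, the value the application set no longer matches the bytes on the wire.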
If you don't want mod_deflate enabled for the whole site, then only enable it for certain URL prefixes by scoping it within a Location block.
I want to download this GIF image to disk:
http://www.portaportese.it/telefono/es_2014043024395.gif
With all the code I have found for downloading pictures, I end up with an error in the saved picture such as:
GIF image was truncated or incomplete.
In short, the picture is not being saved correctly.
Can anybody provide a correct solution that will download this picture to disk?
Every piece of code returns an empty image. I tried this:
import urllib2
picture_page = "http://www.portaportese.it/telefono/es_2014043024395.gif"
opener1 = urllib2.build_opener()
page1 = opener1.open(picture_page)
my_picture = page1.read()
filename = "my_image.gif"
fout = open(filename, "wb")
fout.write(my_picture)
fout.close()
The problem does not lie with your Python code; the image that you are trying to download does not exist. If you use curl to place a request at that URL, you can see that no image is stored there: the response has Content-Type text/plain and no Content-Length.
~ ❯❯❯ curl -I http://www.portaportese.it/telefono/es_2014043024395.gif
HTTP/1.1 200 OK
Date: Mon, 07 Jul 2014 19:35:05 GMT
Server: Apache/2.2.3 (Red Hat)
Connection: close
Content-Type: text/plain; charset=UTF-8
Compare that with this request to a known image source:
~ ❯❯❯ curl -I http://baconmockup.com/300/200
HTTP/1.1 200 OK
Date: Mon, 07 Jul 2014 19:35:42 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.2
Access-Control-Allow-Origin: *
Content-Length: 20564
Content-Disposition: inline; filename=brisket-300-200.jpg
Pragma: public
Cache-Control: public
Expires: Mon, 21 Jul 2014 19:35:43 GMT
Last-Modified: Mon, 20 Aug 2012 19:20:21 GMT
Vary: User-Agent
Content-Type: image/jpeg
If you change the URL in your code to a good image source, then it will work perfectly well.
import urllib2
picture_page = "http://baconmockup.com/300/200"
opener1 = urllib2.build_opener()
page1 = opener1.open(picture_page)
my_picture = page1.read()
filename = "my_image.jpg"
fout = open(filename, "wb")
fout.write(my_picture)
fout.close()
I just ran this, and was given a picture of some tasty brisket.
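The header comparison above can also be done in code before anything is written to disk. This is a minimal sketch, assuming Python 3's urllib.request instead of the urllib2 used in the question; is_image_content_type and save_if_image are hypothetical helper names.

```python
import urllib.request


def is_image_content_type(value):
    """True if a Content-Type header value names an image media type."""
    if not value:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = value.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def save_if_image(url, filename):
    """Download url to filename only if the server reports an image."""
    with urllib.request.urlopen(url) as response:
        if not is_image_content_type(response.headers.get("Content-Type")):
            return False
        with open(filename, "wb") as fout:
            fout.write(response.read())
    return True
```

With the two responses shown above, the first URL (text/plain) would be skipped and the second (image/jpeg) would be saved.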
I am trying to post a huge .ova file through the HTTP POST method in Python.
**ResponseHeaders**
Pragma: no-cache
Date: Thu, 18 Jul 2013 11:17:13 GMT
Content-Encoding: gzip
Vary: Accept-Encoding
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked
Content-Language: en-US
Content-Type: application/json;charset=UTF-8
Cache-Control: no-cache, no-store, max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 GMT
**RequestHeaders**
Content-Type: application/json
Accept: application/json
xyzAPIVersion: 1.0
X-Requested-With: XMLHttpRequest
How can I send such a huge file (500 MB) with an HTTP POST through a REST API?
You could use the requests library:
import requests  # $ pip install requests

with open("file.ova", "rb") as file:
    requests.post(url, data=file)
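Passing the open file object lets the body be streamed from disk instead of being read into memory first. The same idea can be sketched by hand with a generator that reads fixed-size blocks; iter_chunks and the chunk size are hypothetical choices. As far as I know, requests also accepts such a generator as data and then sends the body with chunked transfer encoding, but check the requests documentation before relying on that.

```python
import io


def iter_chunks(fileobj, chunk_size=1024 * 1024):
    """Yield successive blocks of at most chunk_size bytes from fileobj."""
    while True:
        block = fileobj.read(chunk_size)
        if not block:
            break
        yield block


# Small in-memory demonstration instead of a 500 MB file.
chunks = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
```

At no point does more than one block of the file have to be held in memory.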
I wrote a crawler in Python. The fetched URLs are of different types: a URL may point to HTML, to an image, to a big archive or to some other file. I need to determine the type quickly so that I can avoid reading big files such as large archives and continue crawling. What is the best way to determine the URL type at the start of page loading?
I understand that I can do it by URL name (ending with .rar, .jpg, etc.), but I don't think that is a complete solution. Do I need to check a header or something like that? I also need some way to predict the page size to prevent large downloads; in other words, to set a limit on the downloaded page size so that memory is not eaten up quickly.
If you use an HTTP HEAD request on the resource, you will get relevant metadata on the resource without the resource data itself. Specifically, the content-length and content-type headers will be of interest.
E.g.
HEAD /stackoverflow/img/favicon.ico HTTP/1.1
host: sstatic.net
HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Length: 1150
Content-Type: image/x-icon
Last-Modified: Mon, 02 Aug 2010 06:04:04 GMT
Accept-Ranges: bytes
ETag: "2187d82832cb1:0"
X-Powered-By: ASP.NET
Date: Sun, 12 Sep 2010 13:38:36 GMT
You can do this in Python using httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]
This tells you it's an image (image/* mime-type) of 1150 bytes. Enough information for you to decide if you want to fetch the full resource.
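For a crawler, that decision can be turned into a small helper. This is a minimal sketch, assuming Python 3 (where httplib is named http.client) and assuming the crawler only wants HTML pages below a size limit; head, should_fetch and max_bytes are hypothetical names.

```python
import http.client


def head(host, path):
    """Send a HEAD request and return the headers with lower-cased names."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        return {name.lower(): value for name, value in res.getheaders()}
    finally:
        conn.close()


def should_fetch(headers, max_bytes):
    """Decide from HEAD metadata whether the resource is an HTML page
    that is small enough to download."""
    content_type = headers.get("content-type", "")
    if not content_type.lower().startswith("text/html"):
        return False
    length = headers.get("content-length", "")
    if not length.isdigit():
        # No usable size was reported (chunked responses, for example);
        # the caller should still cap how many bytes it reads.
        return True
    return int(length) <= max_bytes
```

With the favicon response above, should_fetch returns False because the content type is image/x-icon.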
Additionally, this header tells you the server accepts HTTP partial content requests (accept-ranges header), which allows you to retrieve data in batches.
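A partial content request is an ordinary GET with a Range header. Here is a minimal sketch, assuming Python 3's urllib.request; build_range_request and fetch_prefix are hypothetical helper names, and the default of 100 bytes is an arbitrary choice.

```python
import urllib.request


def build_range_request(url, num_bytes):
    """Build a GET request that asks only for the first num_bytes bytes."""
    request = urllib.request.Request(url)
    # Byte ranges are inclusive, so bytes=0-99 means the first 100 bytes.
    request.add_header("Range", "bytes=0-%d" % (num_bytes - 1))
    return request


def fetch_prefix(url, num_bytes=100):
    """Return up to num_bytes leading bytes of the resource."""
    request = build_range_request(url, num_bytes)
    with urllib.request.urlopen(request) as response:
        # Status 206 means the range was honored. A server that ignores
        # the header answers 200 with the full body, so cap the read.
        return response.read(num_bytes)
```

A server that does not advertise Accept-Ranges may simply ignore the header, which is why fetch_prefix limits the read itself.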
You will get the same header information if you do a GET directly, but this will also start sending the resource data in the body of the response, something you want to avoid.
If you want to learn more about HTTP headers and their meaning, you can use an online tool such as 'Fetch'