I inherited a Python 3 project where we are trying to parse a 70 MB file with Python 3.5.6. I am using cgi.FieldStorage.
The file (named paketti.ipk) I'm trying to send:
kissakissakissa
kissakissakissa
kissakissakissa
Headers:
X-FILE: /tmp/nginx/0000000001
Host: localhost:8082
Connection: close
Content-Length: 21
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------264635460442698726183359332565
Origin: http://172.16.8.12
Referer: http://172.16.8.12/
DNT: 1
Sec-GPC: 1
Temporary file /tmp/nginx/0000000001:
-----------------------------264635460442698726183359332565
Content-Disposition: form-data; name="file"; filename="paketti.ipk"
Content-Type: application/octet-stream
kissakissakissa
kissakissakissa
kissakissakissa
-----------------------------264635460442698726183359332565--
Code:
class S(BaseHTTPRequestHandler):
    def do_POST(self):
        temp_filename = self.headers['X-FILE']
        temp_file_pointer = open(temp_filename, "rb")
        form = cgi.FieldStorage(
            fp=temp_file_pointer,
            headers=self.headers,
            environ={'REQUEST_METHOD': 'POST',
                     'CONTENT_TYPE': self.headers['Content-Type'],
                     'CONTENT_LENGTH': self.headers['Content-Length']},
        )
        actual_filename = form['file'].filename
        logging.info("ACTUAL FILENAME={}".format(actual_filename))
        open("/tmp/nginx/{}".format(actual_filename), "wb").write(form['file'].file.read())
        logging.info("FORM={}".format(form))
Now the strangest thing. The logs show:
INFO:root:ACTUAL FILENAME=paketti.ipk
INFO:root:FORM=FieldStorage(None, None, [FieldStorage('file', 'paketti.ipk', b'')])
Look at the /tmp/nginx directory:
root@am335x-evm:/tmp# ls -la /tmp/nginx/*
-rw------- 1 www www 286 May 18 20:48 /tmp/nginx/0000000001
-rw-r--r-- 1 root root 0 May 18 20:48 /tmp/nginx/paketti.ipk
So it is partially working, since the filename is parsed. But why does it not parse the file contents? What am I missing?
Is this even doable in Python, or should I just write a C utility? The file is 70 MB, and if I read it into memory, the OOM killer kills the python3 process (rightfully so, I'd say). But still, where do the file contents go?
Instead of the cgi module you need a multipart parser that can stream the data instead of reading all of it into RAM. AFAIK there is nothing suitable in the standard library, but this module could be of use: https://github.com/defnull/multipart
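If you go that route, uploading with that library might look like the sketch below. I'm going from memory on its interface (MultipartParser taking a stream and the boundary, and parts exposing filename and save_as), so treat the exact names as assumptions and check its documentation:

from multipart import MultipartParser

# The boundary comes from the Content-Type header, without the leading "--"
boundary = "---------------------------264635460442698726183359332565"
with open("/tmp/nginx/0000000001", "rb") as fp:
    for part in MultipartParser(fp, boundary):
        if part.filename:
            # save_as streams the part to disk instead of loading it into RAM
            part.save_as("/tmp/nginx/{}".format(part.filename))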
Alternatively, a DIY approach along these lines should work:
boundary = b"-----whatever"
# Begin and end lines (as per your example, I didn't check the RFCs)
begin = b"\r\n%b\r\n" % boundary
end = b"\r\n%b--\r\n" % boundary
# Prefer with blocks to open files so that they are also closed properly
with open(temp_filename, "rb") as f:
buf = bytearray()
# Search for the boundary
while begin not in buf:
block = f.read(4096)
if not block: raise ValueError("EOF without boundary begin")
buf = buf[-1024:] + block # Keep up to 5 KiB buffered
# Delete buffer contents until the end of the boundary
del buf[:buf.find(begin) + len(begin)]
# Copy data to another file (or do what you need to do with it)
with open("output.dat", "wb") as f2:
while end not in buf:
f2.write(buf[:-1024])
del buf[:-1024]
buf += f.read(4096)
if not buf: raise ValueError("EOF without boundary end")
f2.write(buf[:buf.find(end)])
This assumes the boundary lines are at most 1024 bytes long; you could use the actual boundary length instead for a precise implementation.
There were more issues at play than I first thought.
First, /tmp was mounted on tmpfs with a maximum size of 120 MB.
Secondly, my nginx.conf was problematic. I needed to comment out lines like these to clean it up:
#client_body_in_file_only on
#proxy_set_header X-FILE $request_body_file;
#proxy_set_body $request_body_file;
Then I needed to add these:
proxy_redirect off;           # Maybe not that important
proxy_request_buffering off;  # Very important
After this the code
form = cgi.FieldStorage(fp=self.rfile, headers=self.headers, environ={'REQUEST_METHOD': 'POST', 'CONTENT_TYPE': self.headers['Content-Type']})
started to "work". I'm monitoring /tmp usage and it uses first 70MB and then full 120 MB. The uploaded file is truncated to 50 MB.
So, when I am reading and writing parsed cgi.FieldStorage even in a loop of 4096 characters, the system reads it automatically FULLY to somewhere in /tmp once and then tries to write the final file and encounters "No space left on device" error.
To fix this I keep the nginx.conf additions and just read the self.rfile manually myself in a loop, totally reading ['Content-Length'] (anything other makes it go bonkers). This is able to save it cleanly with one pass; there is no more than single time 70MB usage of /tmp .
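For reference, the manual loop amounts to something like this minimal sketch (inside do_POST; the target path is made up, and this writes the raw request body with the multipart framing included, so the boundary stripping discussed above still applies):

length = int(self.headers['Content-Length'])
remaining = length
with open("/tmp/nginx/upload.bin", "wb") as out:  # hypothetical target path
    while remaining > 0:
        chunk = self.rfile.read(min(4096, remaining))
        if not chunk:
            break  # The client closed the connection early
        out.write(chunk)
        remaining -= len(chunk)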
I am currently working on a Python program that filters on some keywords in the "text" field of JSON data. The pipeline for my system is the following: .gz file --> open with gzip in mode 'rb' --> transform the b'' bytes into a str --> json.loads(str).
def gzworker(fullpath, condition):
    """Worker opens one .gz file"""
    print('Opening {}'.format(fullpath))
    buffer = []
    with gzip.open(fullpath, 'rb') as infile:
        for _line in infile:
            result = filter(json.loads(str(_line).split('|', 1)[1][:-5]), condition)
            if result:
                buffer.append(result)
    print('Closing {}'.format(fullpath))
    return buffer
with the filter function taking a decoded JSON object as its argument.
After running this code multiple times I realised that the reason it is not working is that some commas seem to disappear. Does anybody know if it is possible that some information is lost in the process?
Result of what I get using the previous method (invalid JSON); I get the same result if I use decode:
{"created_at":"Thu Apr 17 04:45:03 +0000 2014","id":456654551114735616,"id_str":"456654551114735616","text":"#cam_clay1 come visit us soon plz \\ud83d\\ude18","source":"\\u003ca href=\\"http:\\/\\/twitter.com\\/download\\/iphone\\" rel=\\"nofollow\\"\\u003eTwitter for iPhone\\u003c\\/a\\u003e","truncated":false,"in_reply_to_status_id":456654343781892098,"in_reply_to_status_id_str":"456654343781892098","in_reply_to_user_id":427007607,"in_reply_to_user_id_str":"427007607","in_reply_to_screen_name":"cam_clay1","user":{"id":335107310,"id_str":"335107310","name":"Roger Krick","screen_name":"roger_krick","location":"Atlanta GA","url":null,"description":"I pushed Regina George in front of the bus.","protected":false,"followers_count":772,"friends_count":235,"listed_count":3,"created_at":"Thu Jul 14 04:49:29 +0000 2011","favourites_count":7192,"utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"verified":false,"statuses_count":9518,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\\/\\/pbs.twimg.com\\/profile_background_images\\/378800000021719152\\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_background_images\\/378800000021719152\\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_tile":true,"profile_image_url":"http:\\/\\/pbs.twimg.com\\/profile_images\\/453031044393222144\\/7vIvMWvk_normal.jpeg","profile_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_images\\/453031044393222144\\/7vIvMWvk_normal.jpeg","profile_banner_url":"https:\\/\\/pbs.twimg.com\\/profile_banners\\/335107310\\/1352964715","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.75781394,-84.38479358]},"coordinates":{"type":"Point","coordinates":[-84.38479358,33.75781394]},"place":{"id":"8173485c72e78ca5","url":"https:\\/\\/api.twitter.com\\/1.1\\/geo\\/id\\/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[-84.5464728,33.647845],[-84.5464728,33.8868859],[-84.289385,33.8868859],[-84.289385,33.647845]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"cam_clay1","name":"Cameron Clay","id":427007607,"id_str":"427007607","indices":[0,10]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
What I should be getting (valid JSON):
{"created_at":"Thu Apr 17 04:45:03 +0000 2014","id":456654551114735616,"id_str":"456654551114735616","text":"#cam_clay1 come visit us soon plz \ud83d\ude18","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":456654343781892098,"in_reply_to_status_id_str":"456654343781892098","in_reply_to_user_id":427007607,"in_reply_to_user_id_str":"427007607","in_reply_to_screen_name":"cam_clay1","user":{"id":335107310,"id_str":"335107310","name":"Roger Krick","screen_name":"roger_krick","location":"Atlanta GA","url":null,"description":"I pushed Regina George in front of the bus.","protected":false,"followers_count":772,"friends_count":235,"listed_count":3,"created_at":"Thu Jul 14 04:49:29 +0000 2011","favourites_count":7192,"utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"verified":false,"statuses_count":9518,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/335107310\/1352964715","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.75781394,-84.38479358]},"coordinates":{"type":"Point","coordinates":[-84.38479358,33.75781394]},"place":{"id":"8173485c72e78ca5","url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[-84.5464728,33.647845],[-84.5464728,33.8868859],[-84.289385,33.8868859],[-84.289385,33.647845]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"cam_clay1","name":"Cameron Clay","id":427007607,"id_str":"427007607","indices":[0,10]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
You are decoding your bytes wrong:
str(_line)
That converts the bytes object to its repr() representation, which is useful for debugging but not for handling the data:
>>> 'Føo'.encode('utf8')
b'F\xc3\xb8o'
>>> str('Føo'.encode('utf8'))
"b'F\\xc3\\xb8o'"
Note the b' prefix, the ' suffix, and the escape sequences!
Decode bytes objects:
_line.decode('utf8')
I'm assuming that since this is JSON data, it is using the UTF-8 encoding (the JSON standard states that that is the default, the only other permitted options being UTF-16 and UTF-32).
Better yet, use an io.TextIOWrapper() object to handle the decoding for you.
Next, you appear to have reversed your condition and data: filter() takes the condition first and the data sequence second.
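For reference, a quick illustration of that argument order (on Python 3, wrap the lazy filter object in list() to materialize it):

>>> words = ['spam', 'ham', 'eggs']
>>> list(filter(lambda w: 'a' in w, words))
['spam', 'ham']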
Corrected code:
import gzip
import io
import json

def gzworker(fullpath, condition):
    """Worker opens one .gz file"""
    print('Opening {}'.format(fullpath))
    buffer = []
    with gzip.open(fullpath, 'rb') as infile:
        decoded = io.TextIOWrapper(infile, encoding='utf8')
        for line in decoded:
            json_data = line.split('|', 1)[1][:-4]
            result = filter(condition, json.loads(json_data))
            if result:
                buffer.append(result)
    print('Closing {}'.format(fullpath))
    return buffer
I adjusted your slicing operation, assuming you previously sliced off the ' character introduced by the str() call.
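One caveat in case this runs on Python 3: there filter() returns a lazy iterator, which is always truthy, so the "if result:" test only behaves as intended on Python 2. On Python 3, materialize the result first:

result = list(filter(condition, json.loads(json_data)))
if result:  # an empty list is falsy, so the check works as intended
    buffer.append(result)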
I use the requests module in Python 2.7 to POST a big chunk of data to a service I can't change. Since the data is mostly text, it is large but would compress quite well. The server would accept gzip or deflate encoding, but I do not know how to instruct requests to do a POST and encode the data automatically.
Is there a minimal example that shows how this is possible?
# Works if backend supports gzip
additional_headers['content-encoding'] = 'gzip'
request_body = zlib.compress(json.dumps(post_data))
r = requests.post('http://post.example.url', data=request_body, headers=additional_headers)
I've tested the solution proposed by Robᵩ with some modifications and it works.
Pseudocode (sorry, I've extrapolated this from my code, so I had to cut out some parts and haven't tested it, but you get the idea):
additional_headers['content-encoding'] = 'gzip'
s = StringIO.StringIO()
g = gzip.GzipFile(fileobj=s, mode='w')
g.write(json_body)
g.close()
gzipped_body = s.getvalue()
request_body = gzipped_body
r = requests.post(endpoint_url, data=request_body, headers=additional_headers)
For Python 3:
import gzip
from io import BytesIO

import requests

def zip_payload(payload: str) -> bytes:
    btsio = BytesIO()
    g = gzip.GzipFile(fileobj=btsio, mode='w')
    g.write(bytes(payload, 'utf8'))
    g.close()
    return btsio.getvalue()

headers = {
    'Content-Encoding': 'gzip'
}
zipped_payload = zip_payload(payload)
requests.post(url, zipped_payload, headers=headers)
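A quick sanity check of the helper above, round-tripping a payload through gzip.decompress:

assert gzip.decompress(zip_payload('{"key": "value"}')) == b'{"key": "value"}'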
I needed my posts to be chunked, since I had several very large files being uploaded in parallel. Here is a solution I came up with.
import requests
import zlib

def chunked_read_and_compress(file_to_send, zlib_obj, chunk_size):
    """Generator that reads a file in chunks and compresses them"""
    compression_incomplete = True
    with open(file_to_send, 'rb') as f:
        # The zlib might not give us any data back, so we have nothing to
        # yield; just run another loop until we get data to yield.
        while compression_incomplete:
            plain_data = f.read(chunk_size)
            if plain_data:
                compressed_data = zlib_obj.compress(plain_data)
            else:
                compressed_data = zlib_obj.flush()
                compression_incomplete = False
            if compressed_data:
                yield compressed_data

def post_file_gzipped(url, file_to_send, chunk_size=5*1024*1024, compress_level=6, headers={}, requests_kwargs={}):
    """Post a file to a url, content-encoded gzipped and chunked (for large files)"""
    headers_to_send = {'Content-Encoding': 'gzip'}
    headers_to_send.update(headers)
    # wbits=31 selects the gzip container format (16 + MAX_WBITS)
    zlib_obj = zlib.compressobj(compress_level, zlib.DEFLATED, 31)
    return requests.post(url, data=chunked_read_and_compress(file_to_send, zlib_obj, chunk_size),
                         headers=headers_to_send, **requests_kwargs)

resp = post_file_gzipped('http://httpbin.org/post', 'somefile')
resp.raise_for_status()
I can't get this to work, but you might be able to insert the gzip data into a prepared request:
# UNPROVEN
import gzip
import StringIO

import requests

r = requests.Request('POST', 'http://httpbin.org/post', data={"hello": "goodbye"})
p = r.prepare()
s = StringIO.StringIO()
g = gzip.GzipFile(fileobj=s, mode='w')
g.write(p.body)
g.close()
p.body = s.getvalue()
p.headers['content-encoding'] = 'gzip'
p.headers['content-length'] = str(len(p.body))  # Not sure about this
r = requests.Session().send(p)
The accepted answer is probably wrong due to incorrect or missing headers:
additional_headers['content-encoding'] = 'gzip'
request_body = zlib.compress(json.dumps(post_data))
Using the zlib module's compressobj function, which provides the wbits argument to specify the container format, should work.
The default value is MAX_WBITS = 15, which selects the zlib header format; this is correct for Content-Encoding: deflate.
For the one-shot compress function this argument is not available, and unfortunately the documentation does not mention which header (if any) is used.
For Content-Encoding: gzip, wbits should be 16 plus a value between 9 and 15, so 16 + zlib.MAX_WBITS is a good choice.
I checked how urllib3 decodes the response in these two cases: it implements a trial-and-error mechanism for deflate (it tries the raw and zlib header formats). That could explain why some people had problems with the accepted answer's solution that others didn't have.
tl;dr
gzip
additional_headers['Content-Encoding'] = 'gzip'
compress = zlib.compressobj(wbits=16+zlib.MAX_WBITS)
body = compress.compress(data) + compress.flush()
deflate
additional_headers['Content-Encoding'] = 'deflate'
compress = zlib.compressobj()
body = compress.compress(data) + compress.flush()
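Putting the gzip variant together with requests, a minimal sketch (the URL is a placeholder):

import json
import zlib

import requests

data = json.dumps({'key': 'value'}).encode('utf8')
compress = zlib.compressobj(wbits=16 + zlib.MAX_WBITS)  # gzip container format
body = compress.compress(data) + compress.flush()
r = requests.post('http://post.example.url', data=body,
                  headers={'Content-Encoding': 'gzip',
                           'Content-Type': 'application/json'})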
I am using the code below to save an HTML file with a timestamp in its name:
import contextlib
import datetime
import urllib2
import lxml.html
import os
import os.path

timestamp = ''
filename = ''
for dirs, subdirs, files in os.walk("/home/test/Desktop/"):
    for f in files:
        if "_timestampedfile.html" in f.lower():
            timestamp = f.split('_')[0]
            filename = f
            break
if timestamp is '':
    timestamp = datetime.datetime.now()
with contextlib.closing(urllib2.urlopen(urllib2.Request(
        "http://www.google.com",
        headers={"If-Modified-Since": timestamp}))) as u:
    if u.getcode() != 304:
        myfile = "/home/test/Desktop/" + str(datetime.datetime.now()) + "_timestampedfile.html"
        file(myfile, "w").write(urllib2.urlopen("http://www.google.com").read())
        if os.path.isfile("/home/test/Desktop/" + filename):
            os.remove("/home/test/Desktop/" + filename)
        html = lxml.html.parse(myfile)
    else:
        html = lxml.html.parse("/home/test/Desktop/" + timestamp + "_timestampedfile.html")
links = html.xpath("//a/@href")
print u.getcode()
Every time I run this code I get status code 200 despite the If-Modified-Since header. Where is my mistake? My goal is to save and reuse an HTML file; if it has been modified since it was last accessed, the HTML file should be overwritten.
The problem is that If-Modified-Since is supposed to be a formatted date string:
If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
but you're passing in a datetime object.
Try something like this:
timestamp = time.time()
...
time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(timestamp))
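For example, here is a sketch of building such a request from a saved file's mtime (os.path.getmtime returns a Unix timestamp suitable for time.gmtime); the path is a placeholder:

import os.path
import time
import urllib2

mtime = os.path.getmtime("/home/test/Desktop/saved.html")  # hypothetical file
req = urllib2.Request(
    "http://www.stackoverflow.com/",
    headers={"If-Modified-Since":
             time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(mtime))})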
The second reason your code isn't working as you expect:
http://www.google.com/ does not seem to honor If-Modified-Since. That's allowed per the RFC, and they may have various reasons for choosing that behavior.
c) If the variant has not been modified since a valid If-Modified-Since date, the server SHOULD return a 304 (Not Modified) response.
If you try http://www.stackoverflow.com/, for example, you'll see a 304. (I just tried it.)
I'm generating some PDFs using ReportLab in Django. I followed and experimented with the answer given to this question, and realised that the double quotes therein don't make sense:
response['Content-Disposition'] = 'inline; filename=constant_"%s_%s".pdf'\
% ('foo','bar')
gives filename constant_-foo_bar-.pdf
response['Content-Disposition'] = 'inline; filename=constant_%s_%s.pdf' \
% ('foo','bar')
gives filename constant_foo_bar.pdf
Why is this? Is it just to do with slug-esque sanitisation for filesystems?
It seems from the research in this question that it's actually the browser doing the encoding/escaping. I used a command-line HEAD request to confirm that Django itself does not escape these headers. First, I set up a minimal test view:
# views.py
def index(request):
    response = render(request, 'template.html')
    response['Content-Disposition'] = 'inline; filename=constant"a_b".html'
    return response
then ran:
carl#chaffinch:~$ HEAD http://localhost:8003
200 OK
Date: Thu, 16 Aug 2012 19:28:54 GMT
Server: WSGIServer/0.1 Python/2.7.3
Vary: Cookie
Content-Type: text/html; charset=utf-8
Client-Date: Thu, 16 Aug 2012 19:28:54 GMT
Client-Peer: 127.0.0.1:8003
Client-Response-Num: 1
Content-Disposition: inline; filename=constant"a_b".html
Check out the header: filename=constant"a_b".html. The quotes are still there!
Python does not convert double quotes to hyphens in filenames:
>>> with open('constant_"%s_%s".pdf' % ('foo', 'bar'), 'w'): pass
$ ls
...
constant_"foo_bar".pdf
...
Probably it's Django that won't allow you to use such strange names.
Anyway, I'd recommend using only the following characters in filenames, to avoid portability issues:
letters [a-z][A-Z]
digits [0-9]
hyphen (-), underscore (_), plus (+)
Note: I've excluded whitespace from the list, because a lot of scripts don't use proper quoting and break with such filenames.
If you restrict yourself to this set of characters, you probably won't ever have problems with pathnames. Obviously other people or other programs may still not follow this "guideline", so you shouldn't assume the convention is shared by paths you obtain from users or other external sources.
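A minimal sanitizer along those lines (a sketch, not a standard function; dots are kept so extensions survive):

import re

def safe_filename(name):
    # Replace anything outside letters, digits, dot, underscore, plus, hyphen
    return re.sub(r'[^A-Za-z0-9._+-]', '_', name)

print(safe_filename('constant_"foo_bar".pdf'))  # constant__foo_bar_.pdf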
Your usage is slightly incorrect. You would want the quotes around the entire filename in order to account for spaces, etc.
change:
response['Content-Disposition'] = 'inline; filename=constant_"%s_%s".pdf'\
% ('foo','bar')
to:
response['Content-Disposition'] = 'inline; filename="constant_%s_%s.pdf"'\
% ('foo','bar')