File object in memory using Python

I'm not sure how to word this exactly, but I have a script that downloads an SSL certificate from a web server to check its expiration date.
To do this, I need to download the CA certificates. Currently I write them to a temporary file in the /tmp directory and read them back later, but I am sure there must be a way to do this without writing to disk.
Here's the portion that downloads the certificates:
import urllib

CA_FILE = '/tmp/ca_certs.txt'

root_cert = urllib.urlopen('https://www.cacert.org/certs/root.txt')
class3_cert = urllib.urlopen('https://www.cacert.org/certs/class3.txt')
temp_file = open(CA_FILE, 'w')
temp_file.write(root_cert.read())
temp_file.write(class3_cert.read())
temp_file.close()
EDIT
Here's the portion that uses the file to get the certificate
import socket
import ssl

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ssl_sock = ssl.wrap_socket(s, ca_certs=CA_FILE, cert_reqs=ssl.CERT_REQUIRED)
ssl_sock.connect(('mail.google.com', 443))
date = ssl_sock.getpeercert()['notAfter']

The response from urllib is a file-like object; just use it wherever you are currently using the actual files. This assumes, of course, that the code consuming the file objects doesn't need to write to them.

Wow, don't do this. You're hitting cacert's site every time? That's INCREDIBLY rude and needlessly eats their resources. It's also ridiculously bad security practice. You're supposed to get the root certificate once and validate that it's the correct root cert and not some forgery, otherwise you can't rely on the validity of certificates signed by it.
Cache their root cert, or better yet, install it with the rest of the root certificates on your system like you're supposed to.

In the following line:
temp_file.write(root_cert.read())
you are actually reading the certificate into memory, and writing it out again. That line is equivalent to:
filedata = root_cert.read()
temp_file.write(filedata)
Now filedata is a variable containing the bytes of the root certificate, which you can use in any way you like (including not writing it to temp_file and doing something else with it instead).
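For example, a minimal sketch along those lines, keeping both downloaded certificates in memory (same Python 2 urllib calls as in the question; note that ssl.wrap_socket's ca_certs argument expects a file path, so this mainly helps if the consuming code can accept the raw bytes):
import urllib

root_cert = urllib.urlopen('https://www.cacert.org/certs/root.txt')
class3_cert = urllib.urlopen('https://www.cacert.org/certs/class3.txt')
ca_data = root_cert.read() + class3_cert.read()  # both certificates, held in memory only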


Writing a decompressed file to disk, fetched from a web server

I can get a file that has its content-encoding set to gzip.
Does that mean that the server is storing it compressed, and is the same true for files stored as zip or 7z archives?
And if so (where durl is a zip file):
>>> durl = 'https://db.tt/Kq0byWzW'
>>> dresp = requests.get(durl, allow_redirects=True, stream=True)
>>> dresp.headers['content-encoding']
'gzip'
>>> r = requests.get(durl, stream=True)
>>> data = r.raw.read(decode_content=True)
but data comes out empty, while I want to extract the zip file to disk on the fly!
So first of all, durl is not a zip file, it is a Dropbox landing page. What you are looking at is HTML which is being sent using gzip encoding. If you were to decode the data from the raw socket using gzip, you would simply get the HTML. So the use of raw is really just hiding that you accidentally got a different file than the one you thought.
Based on https://plus.google.com/u/0/100262946444188999467/posts/VsxftxQnRam where you ask
Does anyone have any idea about writing a compressed file directly to disk in decompressed state?
I take it you are really trying to fetch a zip and decompress it directly to a directory without first storing it. To do this you need to use https://docs.python.org/2/library/zipfile.html
Though at this point the problem becomes that the response from requests isn't actually seekable, which zipfile requires in order to work (one of the first things it will do is seek to the end of the file to determine how long it is).
To get around this you need to wrap the response in a file like object. Personally I would recommend using tempfile.SpooledTemporaryFile with a max size set. This way your code would switch to writing things to disk if the file was bigger than you expected.
import requests
import tempfile
import zipfile

KB = 1<<10
MB = 1<<20

url = '...'   # Set url to the download link.
path = '...'  # Set path to the directory to extract into.

resp = requests.get(url, stream=True)
with tempfile.SpooledTemporaryFile(max_size=500*MB) as tmp:
    for chunk in resp.iter_content(4*KB):
        tmp.write(chunk)
    archive = zipfile.ZipFile(tmp)
    archive.extractall(path)
Same code using io.BytesIO:
import io

resp = requests.get(url, stream=True)
tmp = io.BytesIO()
for chunk in resp.iter_content(4*KB):
    tmp.write(chunk)
archive = zipfile.ZipFile(tmp)
archive.extractall(path)
You need the content from the requests response in order to write it.
Confirmed working:
import requests
durl = 'https://db.tt/Kq0byWzW'
dresp = requests.get(durl, allow_redirects=True, stream=True)
dresp.headers['content-encoding']
file = open('test.html', 'w')
file.write(dresp.text)
You have to differentiate between content-encoding (not to be confused with transfer-encoding) and content-type.
The gist of it is that content-type is the media-type (the real file-type) of the resource you are trying to get. And content-encoding is any kind of modification applied to it before sending it to the client.
So let's assume you'd like to get a resource named "foo.txt". It will probably have a content-type of text/plain. In addition to that, the data can be modified when sent over the wire. This is the content-encoding. So, with the above example, you can have a content-type of text/plain and a content-encoding of gzip. This means that before the server sends the file out onto the wire, it will compress it using gzip on the fly. So the only bytes which traverse the net are compressed, not the raw bytes of the original file (foo.txt).
It is the job of the client to process these headers accordingly.
Now, I am not 100% sure if requests or the underlying Python libs do this, but chances are they do. If not, Python ships with a default gzip library, so you could do it on your own without a problem.
With the above in mind, to respond to your question: no, having a "content-encoding" of gzip does not mean that the remote resource is a zip file. The field containing that information is content-type (based on your question this probably has a value of application/zip or application/x-7z-compressed, depending on the actual compression algorithm used).
If you cannot determine the real file type based on the content-type field (e.g. if it is application/octet-stream), you could just save the file to disk and open it up with a hex editor. In the case of a 7z file you should see the byte sequence 37 7a bc af 27 1c somewhere, most likely at the beginning of the file or at EOF-112 bytes. In the case of a gzip file, it should be 1f 8b at the beginning of the file.
Given that you have gzip in the content-encoding field: if you get a 7z file, you can be certain that requests has parsed content-encoding and properly decoded it for you. If you get a gzip file, it could mean two things: either requests has not decoded anything, or the file is indeed a gzip file, as it could be a gzip file sent with the gzip encoding. That would mean it is doubly compressed. This would not make much sense, but, depending on the server, it could still happen.
You could simply try to run gunzip on the console and see what you get.
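To make the distinction concrete, here is a rough sketch (the URL is a placeholder, and requests may already have undone any gzip content-encoding by the time you see the body) that inspects both headers and the leading magic bytes:
import requests

resp = requests.get('http://example.com/some-download', stream=True)
print(resp.headers.get('content-type'))      # the real media type, e.g. application/zip
print(resp.headers.get('content-encoding'))  # transport compression, e.g. gzip

head = next(resp.iter_content(8))            # peek at the first few body bytes
if head.startswith('PK\x03\x04'):
    print('looks like a zip archive')
elif head.startswith('7z\xbc\xaf\x27\x1c'):
    print('looks like a 7z archive')
elif head.startswith('\x1f\x8b'):
    print('looks like raw gzip data')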

Is this Python code safe against injections?

I have a server/client socket pair in Python. The server receives specific commands, then prepares the response and send it to the client.
In this question, my concern is just about possible injections in the code: whether it could be possible to ask the server to do something weird with the 2nd parameter, i.e. whether the control on the command contents is sufficient to avoid undesired behaviour.
EDIT:
According to the advice received, I:
added the parameter shell=True when calling check_output on Windows. This should not be dangerous since the command is a plain 'dir'.
self.client, address = self.sock.accept()
...
cmd = bytes.decode(self.client.recv(4096))
ls: executes a system command but only reads the content of a directory.
if cmd == 'ls':
    if self.linux:
        output = subprocess.check_output(['ls', '-l'])
    else:
        output = subprocess.check_output('dir', shell=True)
    self.client.send(output)
cd: just calls os.chdir.
elif cmd.startswith('cd '):
    path = cmd.split(' ')[1].strip()
    if not os.path.isdir(path):
        self.client.send(b'is not path')
    else:
        os.chdir(path)
        self.client.send(os.getcwd().encode())
get: send the content of a file to the client.
elif cmd.startswith('get '):
    file = cmd.split(' ')[1].strip()
    if not os.path.isfile(file):
        self.client.send(b'ERR: is not a file')
    else:
        try:
            with open(file) as f: contents = f.read()
        except IOError as er:
            res = "ERR: " + er.strerror
            self.client.send(res.encode())
            continue
        ... (send the file contents)
Apart from implementation details, I cannot see any possibility of direct injection of arbitrary code, because you do not use received parameters in the only commands you run (ls -l and dir).
But you may still have some security problems:
You locate commands through the PATH instead of using absolute locations. If somebody could change the PATH environment variable, what could happen? I advise you to use os.listdir('.') directly, which is portable and has fewer risks (see the sketch after this list).
You seem to have no control over which files are allowed. If I remember correctly, reading CON: or other special files on older Windows versions gave weird results. And you should never give any access to sensitive files, configuration, ...
You could also limit the length of requested file names to keep users from trying to break the server with abnormally long names.
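As a sketch of that first point, the 'ls' branch could avoid subprocess and the PATH entirely; something along these lines (hypothetical, reusing self.client from the code above):
if cmd == 'ls':
    # No external command, no PATH lookup; works the same on Linux and Windows.
    listing = '\n'.join(sorted(os.listdir('.')))
    self.client.send(listing.encode())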
Typical issues in a client-server scenario are:
Tricking the server into running a command that is determined by the client. In the most obvious form this happens if the server allows the client to run commands (yes, stupid). However, this can also happen if the client can supply only command parameters but shell=True is used. E.g. using subprocess.check_output('dir %s' % dir, shell=True) with a client-supplied dir variable would be a security issue, dir could have a value like c:\ && deltree c:\windows (a second command has been added thanks to the flexibility of the shell's command line interpreter). A relatively rare variation of this attack is the client being able to influence environment variables like PATH to trick the server into running a different command than intended.
Using unexpected functionality of built-in programming language functions. For example, fopen() in PHP won't just open files but fetch URLs as well. This allows passing URLs to functionality expecting file names and playing all kinds of tricks with the server software. Fortunately, Python is a sane language - open() works on files and nothing else. Still, database commands for example can be problematic if the SQL query is generated dynamically using client-supplied information (SQL Injection).
Reading data outside the allowed area. Typical scenario is a server that is supposed to allow only reading files from a particular directory, yet by passing in ../../../etc/passwd as parameter you can read any file. Another typical scenario is a server that allows reading only files with a particular file extension (e.g. .png) but passing in something like passwords.txt\0harmless.png still allows reading files of other types.
Out of these issues only the last one seems present in your code. In fact, your server doesn't check at all which directories and files the client should be allowed to read; this is a potential issue, as a client might be able to read confidential files (a minimal sketch of a check follows).
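A minimal sketch of such a check, assuming you decide on a single base directory the client may read from (BASE_DIR and its value are made up here):
import os

BASE_DIR = os.path.realpath('/srv/allowed')  # the only tree clients may read

def is_allowed(requested):
    # Resolve '..' components and symlinks, then make sure the result
    # still lives inside BASE_DIR.
    real = os.path.realpath(os.path.join(BASE_DIR, requested))
    return real == BASE_DIR or real.startswith(BASE_DIR + os.sep)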

Python: Upload a huge number of files via FTP

I'm developing a Python script that monitors a directory (using libinotify) for new files; for each new file it does some processing and then copies it to a storage server. We were using an NFS mount but had some performance issues, and now we are testing with FTP. It looks like FTP uses far fewer resources than NFS (the load is always under 2; with NFS it was above 5).
The problem we are having now is the number of connections that stay open in the TIME_WAIT state. The storage server has peaks of about 15k connections in TIME_WAIT.
I was wondering if there is some way to reuse previous connections for new transfers.
Does anyone know if there is some way of doing that?
Thanks
Here's a new answer, based on the comments to the previous one.
We'll use a single TCP socket, and send each file by alternating sending name and contents, as netstrings, for each file, all in one big stream.
I'm assuming Python 2.6, that the filesystems on both sides use the same encoding, and that you don't need lots of concurrent clients (but you might occasionally need, say, two—e.g., the real one, and a tester). And I'm again assuming you've got a module filegenerator whose generate() method registers with inotify, queues up notifications, and yields them one by one.
client.py:
import contextlib
import socket
import filegenerator
sock = socket.socket()
with contextlib.closing(sock):
    sock.connect((HOST, 12345))
    for filename in filegenerator.generate():
        with open(filename, 'rb') as f:
            contents = f.read()
        buf = '{0}:{1},{2}:{3},'.format(len(filename), filename,
                                        len(contents), contents)
        sock.sendall(buf)
server.py:
import contextlib
import socket
import threading
def pairs(iterable):
    return zip(*[iter(iterable)]*2)

def netstrings(conn):
    buf = ''
    while True:
        newbuf = conn.recv(1536*1024)
        if not newbuf:
            return
        buf += newbuf
        while True:
            colon = buf.find(':')
            if colon == -1:
                break
            length = int(buf[:colon])
            if len(buf) >= colon + length + 2:
                if buf[colon+length+1] != ',':
                    raise ValueError('Not a netstring')
                yield buf[colon+1:colon+length+1]
                buf = buf[colon+length+2:]
            else:
                # not enough data yet; wait for the next recv()
                break

def client(conn):
    with contextlib.closing(conn):
        for filename, contents in pairs(netstrings(conn)):
            with open(filename, 'wb') as f:
                f.write(contents)

sock = socket.socket()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
with contextlib.closing(sock):
    sock.bind(('0.0.0.0', 12345))
    sock.listen(1)
    while True:
        conn, addr = sock.accept()
        t = threading.Thread(target=client, args=[conn])
        t.daemon = True
        t.start()
If you need more than about 200 clients on Windows, 100 on Linux and BSD (including Mac), or a dozen on less capable platforms, you probably want to go with an event loop design instead of a threaded design, using epoll on Linux, kqueue on BSD, and IO completion ports on Windows. This can be painful, but fortunately there are frameworks that wrap everything up for you. Two popular (and very different) choices are Twisted and gevent.
One nice thing about gevent in particular is that you can write threaded code today, and with a handful of simple changes turn it into event-based code like magic.
On the other hand, if you're eventually going to want event-based code, it's probably better to learn and use a framework from the start, so you don't have to deal with all the fiddly bits of accepting and looping around recv until you get a full message and shutting down cleanly and so on, and just write the parts you care about. After all, more than half the code above is basically boilerplate for stuff that every server shares, so if you don't have to write it, why bother?
In a comment, you said:
Also the files are binary, so it's possible that I'll have problems if client encodings are different from the server's.
Notice that I opened each file in binary mode ('rb' and 'wb'), and intentionally chose a protocol (netstrings) that can handle binary strings without trying to interpret them as characters or treat embedded NUL characters as EOF or anything like that. And, while I'm using str.format, in Python 2.x that won't do any implicit encoding unless you feed it unicode strings or give it locale-based format types, neither of which I'm doing. (Note that in 3.x, you'd need to use bytes instead of str, which would change a bit of the code.)
In other words, the client and server encodings don't enter into it; you're doing a binary transfer exactly the same as FTP's I mode.
But what if you wanted the opposite, to transfer text and reencode automatically for the target system? There are three easy ways to do that:
Send the client's encoding (either once at the top, or once per file), and on the server, decode from the client and reencode to the local file.
Do everything in text/unicode mode, even the socket. This is silly, and in 2.x it's hard to do as well.
Define a wire encoding—say, UTF-8. The client is responsible for decoding files and encoding to UTF-8 for sending; the server is responsible for decoding UTF-8 on receive and encoding files.
Going with the third option, assuming that the files are going to be in your default filesystem encoding, the changed client code is:
with io.open(filename, 'r', encoding=sys.getfilesystemencoding()) as f:
    contents = f.read().encode('utf-8')
And on the server:
with io.open(filename, 'w', encoding=sys.getfilesystemencoding()) as f:
    f.write(contents.decode('utf-8'))
The io.open function also, by default, uses universal newlines, so the client will translate anything into Unix-style newlines, and the server will translate to its own native newline type.
Note that FTP's T mode actually doesn't do any re-encoding; it only does newline conversion (and a more limited version of it).
Yes, you can reuse connections with ftplib. All you have to do is not close them and keep using them.
For example, assuming you've got a module filegenerator whose generate() method registers with inotify, queues up notifications, and yields them one by one:
import ftplib
import os
import filegenerator
ftp = ftplib.FTP('ftp.example.com')
ftp.login()
ftp.cwd('/path/to/store/stuff')
os.chdir('/path/to/read/from/')
for filename in filegenerator.generate():
    with open(filename, 'rb') as f:
        ftp.storbinary('STOR {}'.format(filename), f)
ftp.close()
I'm a bit confused by this:
The problem we are having now is the amount of connections that keeps open in TIME_WAIT state.
It sounds like your problem is not that you create a new connection for each file, but that you never close the old ones. In which case the solution is easy: just close them.
Either that, or you're trying to do them all in parallel, but don't realize that's what you're doing.
If you want some parallelism, but not unboundedly so, you can easily, e.g., create a pool of 4 threads, each with an open ftplib connection, each reading from a queue, with an inotify thread that just pushes onto that queue.
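A rough sketch of that pool, reusing the hypothetical filegenerator module from the other answer (the server address and paths are placeholders):
import ftplib
import threading
import Queue  # 'queue' on Python 3

import filegenerator  # hypothetical inotify wrapper, as in the other answer

NUM_WORKERS = 4
jobs = Queue.Queue()

def worker():
    # Each worker holds one FTP connection open and reuses it for every file.
    ftp = ftplib.FTP('ftp.example.com')
    ftp.login()
    ftp.cwd('/path/to/store/stuff')
    while True:
        filename = jobs.get()
        with open(filename, 'rb') as f:
            ftp.storbinary('STOR {}'.format(filename), f)
        jobs.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

# The inotify side just feeds the queue.
for filename in filegenerator.generate():
    jobs.put(filename)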

WinZip cannot open an archive created by Python shutil.make_archive on Windows; on Ubuntu, Archive Manager opens it fine

I am trying to return a zip file in a Django HTTP response; the code goes something like...
archive = shutil.make_archive('testfolder', 'zip', MEDIA_ROOT, 'testfolder')
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
response['Content-Length'] = getsize(archive)
response['Content-Disposition'] = "attachment; filename=test %s.zip" % datetime.now()
return response
Now when this code is executed on Ubuntu the resulting downloaded file opens without any issue, but when it's executed on Windows the created file does not open in WinZip (it gives the error 'Unsupported Zip Format').
Is there something very obvious I am missing here? Isn't Python code supposed to be portable?
EDIT:
Thanks to J.F. Sebastian for his comment...
There was no problem in creating the archive; the problem was reading it back into the response. So the solution is to change the second line of my code from,
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
to,
response = HttpResponse(FileWrapper(open(archive, 'rb')),  # notice extra 'rb'
                        content_type=mimetypes.guess_type(archive)[0])
Check out my answer to this question for more details...
The code you have written should work correctly. I've just run the following line from your snippet to generate a zip file and was able to extract it on both Linux and Windows.
archive = shutil.make_archive('testfolder', 'zip', MEDIA_ROOT, 'testfolder')
There is something funny and specific going on. I recommend you check the following:
Generate the zip file outside of Django with a script that just has that one-liner, then try to extract it on a Windows machine. This will help you rule out anything related to Django, the web server, or the browser.
If that works, then look at exactly what is in the folder you compressed. Do the files have any funny characters in their names, are there strange file types, or super long filenames?
Run an MD5 checksum on the zip file on Windows and Linux just to make absolutely sure that the two files are byte-for-byte identical, to rule out any file corruption that might have occurred (a quick sketch is shown below).
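For that last check, hashlib gives the same result on both platforms; a quick sketch (the archive name assumes the make_archive call from the question):
import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Hash the file in chunks so a large archive doesn't have to fit in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(md5sum('testfolder.zip'))  # run on both machines and compare the output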
Thanks to J.F. Sebastian for his comment...
I'll still write the solution here in detail...
There was no problem in creating the archive; the problem was reading it back into the response. So the solution is to change the second line of my code from,
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
to,
response = HttpResponse(FileWrapper(open(archive, 'rb')),  # notice extra 'rb'
                        content_type=mimetypes.guess_type(archive)[0])
because apparently, hidden somewhere in the Python 2.3 documentation on open():
The most commonly-used values of mode are 'r' for reading, 'w' for
writing (truncating the file if it already exists), and 'a' for
appending (which on some Unix systems means that all writes append to
the end of the file regardless of the current seek position). If mode
is omitted, it defaults to 'r'. The default is to use text mode, which
may convert '\n' characters to a platform-specific representation on
writing and back on reading. Thus, when opening a binary file, you
should append 'b' to the mode value to open the file in binary mode,
which will improve portability. (Appending 'b' is useful even on
systems that don’t treat binary and text files differently, where it
serves as documentation.) See below for more possible values of mode.
So, in simple terms: when reading binary files, using open(file, 'rb') increases the portability of your code (it certainly did in this case).
Now it extracts without trouble on Windows...

Download file using partial download (HTTP)

Is there a way to download a huge and still-growing file over HTTP using the partial-download feature?
It seems that this code downloads the file from scratch every time it is executed:
import urllib
urllib.urlretrieve ("http://www.example.com/huge-growing-file", "huge-growing-file")
I'd like:
To fetch just the newly-written data
Download from scratch only if the source file becomes smaller (for example, if it has been rotated).
It is possible to do a partial download using the Range header; the following will request a selected range of bytes:
import urllib2

req = urllib2.Request('http://www.python.org/')
req.headers['Range'] = 'bytes=%s-%s' % (start, end)
f = urllib2.urlopen(req)
For example:
>>> req = urllib2.Request('http://www.python.org/')
>>> req.headers['Range'] = 'bytes=%s-%s' % (100, 150)
>>> f = urllib2.urlopen(req)
>>> f.read()
'l1-transitional.dtd">\n\n\n<html xmlns="http://www.w3.'
Using this header you can resume partial downloads. In your case all you have to do is keep track of the already-downloaded size and request a new range.
Keep in mind that the server needs to accept this header for this to work.
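A minimal sketch of that bookkeeping, building on the urllib2 example above (the URL and local file name are placeholders; a real implementation would also check for a 206 Partial Content status and handle the rotated-file case):
import os
import urllib2

url = 'http://www.example.com/huge-growing-file'
local = 'huge-growing-file'

have = os.path.getsize(local) if os.path.exists(local) else 0
req = urllib2.Request(url)
req.headers['Range'] = 'bytes=%s-' % have  # everything we don't have yet

resp = urllib2.urlopen(req)
with open(local, 'ab') as f:  # append the newly written tail
    f.write(resp.read())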
This is quite easy to do using TCP sockets and raw HTTP. The relevant request header is "Range".
An example request might look like:
import socket

mysock = socket.create_connection(("www.example.com", 80))
mysock.sendall(
    "GET /huge-growing-file HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Range: bytes=XXXX-\r\n"
    "Connection: close\r\n\r\n")
Where XXXX represents the number of bytes you've already retrieved. Then you can read the response headers and any content from the server. If the server returns a header like:
Content-Length: 0
You know you've got the entire file.
If you want to be particularly nice as an HTTP client you can look into "Connection: keep-alive". Perhaps there is a Python library that does everything I have described (perhaps even urllib2 does it!) but I'm not familiar with one.
If I understand your question correctly, the file is not changing during download, but is updated regularly. If that is the case, rsync is the answer.
If the file is being updated continually including during download, you'll need to modify rsync or a bittorrent program. They split files into separate chunks and download or update the chunks independently. When you get to the end of the file from the first iteration, repeat to get the appended chunk; continue as necessary. With less efficiency, one could just repeatedly rsync.
