Sanitizing arbitrary user input

Sanitizing arbitrary user input - python

I'm working on a web service that accepts STL files, does some simple processing on them (count facets, calculate total volume, etc) and returns some stats to users. There's no database or persistence planned (although that might be added at some point in the future.) Users can either upload files or point to a URL.
What should I be thinking about in order to sanitize use input and secure the Tornado server?
I'm using the templating system which auto-escapes html.
I can also impliment logic that checks that input "looks like" valid STL format as I parse it: binary STL is just floats; I also know what the format for ascii STL looks like.
I've done a bit of initial research including:
When is it Best to Sanitize User Input?
What is the best way to sanitize user inputs?
Many others.
Am I missing anything obvious?

Without you giving the code, all we can do is to speculate:
Think about the boundary conditions while converting between representations (binary -> float, ascii -> binary)
Think about the consequent stats calculated which will be rendered in a web browser? Is it possible to print some UTF-8 code, or is it possible to insert javascript?

Related

integrating a web server into a python script

I have written a program to generate sequences that pass certain filters (the exact sequences etc don't matter). Each sequence is generated by making a random string of 40 characters made up of C, G, T or A. When each string is generated, it is put through a set of filters, and if it passes the filters it is saved to a list.
I am trying to make one of those filters include an online tool, BPROM, which doesn't appear to have a python library implementation. This means I will need to get my python script to send the sequence string described above to the online tool, and save the output as a python variable.
My question is, if I have a url to the tool (http://www.softberry.com/berry.phtml?topic=bprom&group=programs&subgroup=gfindb), how can I interface my script that generates the sequences, with the online tool - is there a way to send data to the web tool and save the tool's output as a variable? I've been looking into requests but i'm not sure it is the right way to approach this (as a massive python/coding noob).
Thanks for reading, I'm a bit brain dead so I hope this made sense :P

Of course, you can use requests or urllib
Here is demo code:
with urllib.request.urlopen('http://www.softberry.com/berry.phtml?topic=bprom&group=programs&subgroup=gfindb') as response:
html = response.read()

create pdf from python

I'm looking to generate PDF's from a Python application.
They start relatively simple but some may become more complex (Essentially letter like documents but will include watermarks for example later)
I've worked in raw postscript before and providing I can generate the correct headers etc and file at the end of it I want to avoid use of complex libs that may not do entirely what I want. Some seem to have got bitrot and no longer supported (pypdf and pypdf2) Especially when I know PDF/Postscript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated postscript) fine by just writing the appropriate text headers to file and my postscript code. But Inspecting PDF's there is a lil binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?

borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
but really you should probably just use a library
Thanks to #LukasGraf for providing this link http://www.gnupdf.org/Introduction_to_PDF that shows how to create a simple hello world pdf from scratch

As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.

I recommend you to use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.

How to reliable tell the uploaded file type (text or binary)?

I have an application where users should be able to upload a wide variety of files, but I need to know for each file, if I can safely display its textual representation as plain text.
Using python-magic like
m = Magic(mime=True).from_buffer(cgi.FieldStorage.file.read())
gives me the correct MIME type.
But sometimes, the MIME type for scripts is application/*, so simply looking for m.startswith('text/') is not enough.
Another site suggested using
m = Magic().from_buffer(cgi.FieldStorage.file.read())
and checking for 'text' in m.
Would the second approach be reliable enough for a collection of arbitrary file uploads or could someone give me another idea?
Thanks a lot.

What is your goal? Do you want the real mime type? Is that important for security reasons? Or is it "nice to have"?
The problem is that the same file can have different mime types. When a script file has a proper #! header, python-magic can determine the script type and tell you. If the header is missing, text/plain might be the best you can get.
This means there is no general "will always work" magic solution (despite the name of the module). You will have to sit down and think what information you can get, what it means and how you want to treat it.
The secure solution would be to create a list of mime types that you accept and check them with:
allowed_mime_types = [ ... ]
if m in allowed_mime_types:
That means only perfect matches are accepted. It also means that your server will reject valid files which don't have the correct mime type for some reason (missing header, magic failed to recognize the file, you forgot to mention the mime type in your list).
Or to put it another way: Why do you check the mime type of the file if you don't really care?
[EDIT] When you say
I need to know for each file, if I can safely display its textual representation as plain text.
then this isn't as easy as it sounds. First of all, "text" files have no encoding stored in them, so you will need to know the encoding that the user used when they created the file. This isn't a trivial task. There are heuristics to do so but things get hairy when encodings like ISO 8859-1 and 8859-15 are used (the latter has the Euro symbol).
To fix this, you will need to force your users to either save the text files in a specific encoding (UTF-8 is currently the best choice) or you need to supply a form into which users will have to paste the text.
When using a form, the user can see whether the text is encoded correctly (they see it on the screen), they can fix any problems and you can make sure that the browser sends you the text encoded with UTF-8.
If you can't do that, your only choice is to check for any bytes below 0x20 in the input with the exception of \r, \n and \t. That is a pretty good check for "is this a text document".
But when users use umlauts (like when you write an application that is being used world wide), this approach will eventually fail unless you can enforce a specific encoding on the user's side (which you probably can't since you don't trust the user).
[EDIT2] Since you need this to check actual source code: If you want to make sure the source code is "safe", then parse it. Most languages allow to parse the code without actually executing it. That would give you some real information (because the parsers know what to look for) and you wouldn't need to make wild guesses :-)

After playing around a bit, I discovered that I can propably use the Magic(mime_encoding=True) results!
I ran a simple script on my Dropbox folder and grouped the results both by encoding and by extension to check for irregularities.
But it does seem pretty usable by looking for 'binary' in encoding.
I think I will hang on to that, but thank you all.

Link encryption with django and python

I'm having a download application and I want to encrypt the links for file downloads, so that the user doesn't know the id of the file. Furthermore I'd like to include date/time in the link, and check when serving the file if the link is still valid.
There's a similar question asked here, but I'm running into problems with the character encodings, since I'd like to have urls like /file/encrypted_string/ pointing to the views for downloading, so best would be if the encrypted result only contains letters and numbers. I prefer not using a hash, because I do not want to store a mapping hash <> file somewhere. I do not know if there's a good encryption out there that fulfills my needs...

Sounds like it would be easy, especially if you don't mind using the same encryption key forever. Just delimit a string (/ or : works as well as anything) for the file name, the date/time, and anything else you want to include, then encrypt and b64 it! Remember to use urlsafe_b64encode, not the regular b64encode, which will produce broken urls. It'll be a long string, but so what?
I've done this a few times, using a slight variation: Add a few random characters as the last piece of the key and include that at the beginning or end of the string - more secure than always reusing the same key, without the headaches of a database mapping. As long as your key is complex enough the exposed bits won't be enough to let crackers generate requests at will.
Of course, if the file doesn't exist, don't let them see the decoded result...

By far the easiest way to handle this is to generate a random string for each file, and store a mapping between the key strings and the actual file name or file id. No complex encryption required.
Edit:
You will need to store the date anyway to implement expiring the links. So, you can store the expiration date, a long with the key, and periodically cull out expired links from the table.

If your problem is just one of encryption and decryption of short strings, Python's Crypto module makes it a breeze.

You can encode any character into the url, with django, you may use it's urlencode filter.
However, generating a random string and saving the mapping is more secure.

Caching system for dynamically created files?

I have a web server that is dynamically creating various reports in several formats (pdf and doc files). The files require a fair amount of CPU to generate, and it is fairly common to have situations where two people are creating the same report with the same input.
Inputs:
raw data input as a string (equations, numbers, and
lists of words), arbitrary length, almost 99% will be less than about 200 words
the version of the report creation tool
When a user attempts to generate a report, I would like to check to see if a file already exists with the given input, and if so return a link to the file. If the file doesn't already exist, then I would like to generate it as needed.
What solutions are already out there? I've cached simple http requests before, but the keys were extremely simple (usually database id's)
If I have to do this myself, what is the best way. The input can be several hundred words, and I was wondering how I should go about transforming the strings into keys sent to the cache.
//entire input, uses too much memory, one to one mapping
cache['one two three four five six seven eight nine ten eleven...']
//short keys
cache['one two'] => 5 results, then I must narrow these down even more
Is this something that should be done in a database, or is it better done within the web app code (python in my case)
Thanks you everyone.

This is what Apache is for.
Create a directory that will have the reports.
Configure Apache to serve files from that directory.
If the report exists, redirect to a URL that Apache will serve.
Otherwise, the report doesn't exist, so create it. Then redirect to a URL that Apache will serve.
There's no "hashing". You have a key ("a string (equations, numbers, and lists of words), arbitrary length, almost 99% will be less than about 200 words") and a value, which is a file. Don't waste time on a hash. You just have a long key.
You can compress this key somewhat by making a "slug" out of it: remove punctuation, replace spaces with _, that kind of thing.
You should create an internal surrogate key which is a simple integer.
You're simply translating a long key to a "report" which either exists as a file or will be created as a file.

The usual thing is to use a reverse proxy like Squid or Varnish

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.