Our Google App Engine app stores a fair amount of personally identifying information (email, SSN, etc.) to identify users. I'm looking for advice on how to secure that data.
My current strategy
Store the sensitive data in two forms:
Hashed - using SHA-2 and a salt
Encrypted - using public/private key RSA
When we need to do look ups:
Do look-ups on the hashed data (hash the PII in a query, compare it to the hashed PII in the datastore).
If we ever need to re-hash the data or otherwise deal with it in a raw form:
Decrypt the encrypted version with our private key. Never store it in raw form, just process it then re-hash & re-encrypt it.
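Roughly, the store/lookup flow I have in mind looks like this (a minimal sketch only; the Python `cryptography` package and all names here are illustrative, not our actual code):

    import hashlib
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding

    SALT = b"loaded-from-the-deploy-only-salt-file"  # never checked in to source control

    def hash_pii(value):
        # Deterministic, so equality lookups against the datastore still work.
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

    def encrypt_pii(value, public_key):
        # Recoverable copy; only the offline private key can decrypt it.
        return public_key.encrypt(
            value.encode("utf-8"),
            padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                         algorithm=hashes.SHA256(), label=None),
        )

    # Lookup: hash the incoming SSN and query on the hashed property, e.g.
    # UserRecord.query(UserRecord.ssn_hash == hash_pii("123-45-6789"))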
My concerns
Keeping our hash salt secret
If an attacker gets ahold of the data in the datastore, as well as our hash salt, I'm worried they could brute force the sensitive data. Some of it (like SSN, a 9-digit number) does not have a big key space, so even with a modern hash algorithm I believe it could be done if the attacker knew the salt.
My current idea is to keep the salt out of source control and in its own file. That file gets loaded onto GAE during deployment, and the app reads the file when it needs to hash incoming data.
In between deployments the salt file lives on a USB key protected by an angry bear (or a safe deposit box).
With the salt only living in two places:
The USB key
Deployed to Google App Engine
and with code download permanently disabled, I can't think of a way for someone to get hold of the salt without stealing that USB key. Am I missing something?
Keeping our private RSA key secret
Less worried about this. It will be rare that we'll need to decrypt the encrypted version (only if we change the hash algorithm or data format).
The private key never has to touch the GAE server, we can pull down the encrypted data, decrypt it locally, process it, and re-upload the encrypted / hashed versions.
We can keep our RSA private key on a USB stick guarded by a bear AND a tiger, and only bring it out when we need it.
I realize this question isn't exactly App Engine specific, but I think GAE makes the situation somewhat unique.
If I had total control, I'd do things like lock down deployment access and access to the datastore viewer with two-factor authentication, but those options aren't available at the moment (having a GAE-specific password is good, but I'd like RSA tokens involved).
I'm also neither a GAE expert nor a security expert, so if there's a hole I'm missing or something I'm not thinking of specific to the platform, I would love to hear it.
When deciding on a security architecture, the first thing in your mind should always be threat models. Who are your potential attackers, what are their capabilities, and how can you defend against them? Without a clear idea of your threat model, you've got no way to assess whether or not your proposed security measures are sufficient, or even if they're necessary.
From your text, I'm guessing you're seeking to protect against some subset of the following:
An attacker who compromises your datastore data, but not your application code.
An attacker who obtains access to credentials to access the admin console of your app and can deploy new code.
For the former, encrypting or hashing your datastore data is likely sufficient (but see the caveats later in this answer). Protecting against the latter is tougher, but as long as your admin users can't execute arbitrary code without deploying a new app version, storing your keys in a module that's not checked in to source control, as you suggest, ought to work just fine, since even with admin access, they can't recover the keys, nor can they deploy a new version that reveals the keys to them. Make sure to disable downloading of source, obviously.
You rightly note some concerns about hashing of data with a limited amount of entropy - and you're right to be concerned. To some degree, salts can help with this by preventing precomputation attacks, and key stretching, such as that employed in PBKDF2, scrypt, and bcrypt, can make your attacker's life harder by increasing the amount of work they have to do. However, with something like SSN, your keyspace is simply so small that no amount of key stretching is going to help - if you hash the data, and the attacker gets the hash, they will be able to determine the original SSN.
In such situations, your only viable approach is to encrypt the data with a secret key. Now your attacker is forced to brute-force the key in order to get the data, a challenge that is orders of magnitude harder.
In short, my recommendation would be to encrypt your data using a standard (private key) cipher, with the key stored in a module not in source control. Using hashing instead will only weaken your data, while using public key cryptography doesn't provide appreciable security against any plausible threat model that you don't already have by using a standard cipher.
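As a concrete illustration of that recommendation (a minimal sketch, assuming the Python `cryptography` package; `secret_keys` is a hypothetical module kept out of source control):

    from cryptography.fernet import Fernet
    from secret_keys import DATA_KEY   # generated once with Fernet.generate_key()

    fernet = Fernet(DATA_KEY)

    token = fernet.encrypt(b"123-45-6789")   # store this in the datastore
    ssn = fernet.decrypt(token)              # recoverable only with DATA_KEY

Fernet is just one convenient authenticated secret-key cipher; any standard symmetric cipher used correctly serves the same purpose here.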
Of course, the number one way to protect your users' data is to not store it in the first place, if you can. :)
You can increase your hashing security by using HMAC with a secret key, plus a unique salt per entry (I know people will disagree with me on this, but my research leads me to believe it helps avoid certain attacks). You can also hash with bcrypt or scrypt, which makes reversing the hash an extremely time-consuming process (though you'll also have to factor in the time it takes your app to compute the hash).
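As a rough sketch of what I mean (the key would live outside source control; the names here are illustrative):

    import hashlib
    import hmac
    import os

    HMAC_KEY = b"loaded-from-a-protected-key-file"

    def hash_entry(value, salt=None):
        salt = salt or os.urandom(16)   # unique salt, stored alongside the digest
        digest = hmac.new(HMAC_KEY, salt + value.encode("utf-8"), hashlib.sha256)
        return salt, digest.hexdigest()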
By disabling code downloads and keeping your secret key protected, I can't imagine how someone could get hold of it. Just make sure your code is kept under similar safeguards, or remove the secret key from your code during development and only pull it out to deploy. I assume you will keep your secret key in your code (I've heard many people say to keep it in memory to be ultra secure, but given the nature of App Engine and instances, this isn't feasible).
Update:
Be sure to enable 2-factor authentication for all Google accounts that have admin rights to your app. Google offers this, so I'm not sure whether the restriction you mention on enabling it was imposed by an outside force or not.
Interesting approach to encrypting data in a datastore. After going through this, one question that comes to my mind is: how do you query data on your hashes? Are you comparing two hashes, or doing more fine-grained hashing? And how do you accomplish operations like greater-than or less-than comparisons after hashing and encrypting the data in your table?
By fine-grained hashing I mean: do you hash consecutive chunks of a data stream to get an accumulated hash, i.e. hash(abcd) = hash(ab) + hash(bc) + etc.? That type of hashing would tell you how similar the underlying data is rather than whether it matches exactly.
Related
Problem:
Customer would like to make sure that the script I've developed in Python, running under CentOS7, can sufficiently obscure the credentials required to access a protected web service (that only supports Basic Auth) such that someone with access to the CentOS login cannot determine those credentials.
Discussion:
I have a Python script that needs to run as a specific CentOS user, say "joe".
That script needs to access a protected Web Service.
In order to attempt to make the credentials external to the code I have put them into a configuration file.
I have hidden the file (name starts with a period ".") and base64-encoded the credentials for obscurity, but the requirement is to allow only the Python script to "see" and open the file, rather than anyone with access to the CentOS account.
Even though the file is hidden, "joe" can still do an ls -a and see the file, and then cat the contents.
As a possible solution, is there a way to set the file permissions in CentOS such that they will only allow that Python script to see and open the file, but still have the script run under the context of a CentOS account?
Naive solution
For this use case I would probably create, with a script (sh or zsh or whatever, though I guess you use the default shell here), a temporary user such as iamtemporal-and-wontstayafterifinish. Then create the config file so it is readable ONLY by that specific user (and with no permissions for anyone else). Read here for how: https://www.thegeekdiary.com/understanding-basic-file-permissions-and-ownership-in-linux/
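In Python, that lockdown could look something like this (a sketch only; it must run with enough privileges to chown, and the user name and path are illustrative):

    import os
    import pwd
    import stat

    CONFIG = "/home/iamtemporal/.webservice.conf"
    uid = pwd.getpwnam("iamtemporal").pw_uid

    os.chown(CONFIG, uid, -1)                      # owned by the dedicated user
    os.chmod(CONFIG, stat.S_IRUSR | stat.S_IWUSR)  # 0600: owner read/write only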
Getting harder
If the problem still arises when someone has root rights (for whatever reason), then simply forget everything above and start planning a vacation, because this will take a lot longer than anyone would think.
This is no longer a simple Python problem; it needs different business logic. The best you could do is implement (at least the credential-handling part) in a low-level language so you can handle memory in a customized way, ask for the credentials only at runtime, and never store them...
Or maybe you could limit the scope of this user's access to the protected web service, as you say.
Bonus
Even though it wasn't explicitly asked, I would discourage you from storing credentials using simple base64...
For this purpose, a simple solution (without needing the whole armada of cryptography) could be at least the following; a sketch follows the list:
encrypt the password with an asymmetric cryptographic algorithm (probably RSA with a huge key)
inject the decryption key as an env var while you have an open SSH session to the remote terminal
ideally, use this key only while you decrypt and send the credential; afterwards make sure you delete the references to the variables
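Something along these lines (a rough sketch, assuming the Python `cryptography` package; variable and file names are illustrative):

    import base64
    import os
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    def load_password(path="creds.enc"):
        # The private key is injected as an env var during the SSH session,
        # never written to disk on the box.
        pem = os.environ["CRED_PRIVATE_KEY"].encode()
        private_key = serialization.load_pem_private_key(pem, password=None)
        with open(path, "rb") as f:
            ciphertext = base64.b64decode(f.read())
        plaintext = private_key.decrypt(
            ciphertext,
            padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                         algorithm=hashes.SHA256(), label=None),
        )
        return plaintext.decode("utf-8")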
Side note: this is still filled with 'flaws'. If security is really a concern, I would consider changing technology or using some sort of library that handles this stuff more securely. I would probably start here: Securely Erasing Password in Memory (Python)
Not to mention memory dumps can be read 'easily' (if you know what you are looking for...): https://cyberarms.wordpress.com/2011/11/04/memory-forensics-how-to-pull-passwords-from-a-memory-dump/
So yeah, having a key server that sends you the private key to decrypt with is not enough, if you read those last two web entries...
I'm searching for the best solution for storing sensitive data in a database.
I know that this is a common problem and I have done my homework (at least I think so), but I wanted to ask here before making a decision.
Assumptions:
Encrypted data needs to be decrypted. We are talking about SMTP credentials like username, password, host, port, etc.
I was thinking about 2 concepts:
Encrypt the data with the help of the passlib.totp library. To make the data a bit safer I will keep the key in a separate file. Then, from what I can see, I can use this library to decrypt the data back to plain text using my key.
The other concept was to encrypt and decrypt the data during the query with the help of Postgres:
insert into demo(pw) values ( encrypt( 'data', 'key', 'aes') );
And:
convert_from(decrypt(pw, 'key', 'aes'), 'utf-8')
Here the key will be stored also in separate file.
So my questions are:
What is the better approach to encrypting/decrypting data: in code or in the database?
Are there any better (stronger) libraries to use than passlib.totp? I have no experience with that library. (I'm aware that encryption/decryption is not the most secure way of storing a password; passwords are supposed to be hashed, but I need these in plain text to use the users' SMTP gateway.)
2) The other concept was to encrypt and decrypt the data during the query with the help of Postgres: insert into demo(pw) values ( encrypt( 'data', 'key', 'aes') ); and convert_from(decrypt(pw, 'key', 'aes'), 'utf-8'). Here the key will also be stored in a separate file.
I wouldn't recommend that, because it's way too easy for the keys to get exposed in pg_stat_activity, the logs, etc. PostgreSQL doesn't have log masking features that would protect against that.
I strongly advise you to use app-side crypto. Use a crypto offload device if security is crucial, so key extraction isn't possible for most attackers. Or require the key to be unlocked by the admin entering a passphrase at app start, so the key is never stored unencrypted on disk - then the attacker has to steal it from memory. But even an unencrypted key file somewhere non-obvious is better than in-db crypto, IMO, since it at least separates the key from the data.
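The passphrase-at-startup variant could be sketched like this (assuming the Python `cryptography` package; the salt file location and iteration count are illustrative):

    import base64
    import getpass
    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

    with open("/etc/myapp/kdf_salt", "rb") as f:   # random salt written at install time
        salt = f.read()

    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=600000)
    passphrase = getpass.getpass("Key passphrase: ").encode("utf-8")
    fernet = Fernet(base64.urlsafe_b64encode(kdf.derive(passphrase)))

    # From here on, encrypt/decrypt the SMTP credentials app-side:
    # token = fernet.encrypt(b"smtp-password"); fernet.decrypt(token)

This way the key only ever exists in memory, derived from the passphrase the admin enters at start.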
Ultimately, your application needs to be able to recover the plaintext passwords using some sort of key. If your system is compromised, you have to assume that the malicious user will simply find your key (whether in the database or on disk) and perform the exact same decryption that your application performs.
When your system needs to store passwords in a recoverable form, i.e. to authenticate with external systems, at best you can only obfuscate that information. This is what people mean when they refer to "security through obscurity", and it's a bad idea.
When you give the appearance of security, without actually securing something, then that can make things even more dangerous. You, or other people administering the system, may overlook other important security measures because they believe that there is already a layer of protection for the sensitive information. It can create a social situation where sensitive information may be more likely to leak, than it would if it was assumed that the only way to safeguard the info is to safeguard the systems that hold it. It can also cause people to believe that if data is stolen from the database, "maybe it's okay because they can't decrypt it". This is how you end up responsible for leaking credentials to the wide world, because you must assume that if your application can get the plaintext data, so can an attacker that has compromised your application.
There may be a very very small advantage to encrypting a multipart key on multiple systems (i.e. part on the file system, part in the database), so that somebody who gains access to one system doesn't necessarily gain access to the other. It's reasonable to say that this might, in fact, delay an attack, or deter a lazy attacker. Generally speaking however, the place where your application lives has access to both these things anyway so if your application is compromised then you must assume that the data is compromised. (Edit: You mentioned in another comment you have users who a) shouldn't know the passwords stored in the DB, but b) will have access to the DB directly. Believe me, there is a non-zero chance of one of those users getting all of those passwords if you do this. By going down this path, you're putting your faith in a faulty layer of protection.)
tl;dr reversible encryption when storing sensitive data rarely has practical, real security value. If you're trying to tick a compliance check box, and you have no power to overrule someone up the chain who needs you to tick that box, then by all means implement something that "encrypts" the data. But if you're actually trying to secure your system, look elsewhere: here be dragons.
You may check out the Vault project, a tool for managing secrets:
General Secret Storage
At a bare minimum, Vault can be used for the storage of any secrets.
For example, Vault would be a fantastic way to store sensitive
environment variables, database credentials, API keys, etc.
Compare this with the current way to store these which might be
plaintext in files, configuration management, a database, etc. It
would be much safer to query these using vault read or the API. This
protects the plaintext version of these secrets as well as records
access in the Vault audit log.
Employee Credential Storage
While this overlaps with "General Secret Storage", Vault is a good
mechanism for storing credentials that employees share to access web
services. The audit log mechanism lets you know what secrets an
employee accessed and when an employee leaves, it is easier to roll
keys and understand which keys have and haven't been rolled.
The Vault server stores data in encrypted form. Data can be retrieved via the command line or the REST API. The server must be in an unsealed state to return decrypted data; unsealing requires a specific number of shards of the master key. Once the server is restarted, you need to unseal it again.
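Reading a secret back from an unsealed server is a single HTTP call. The sketch below assumes a KV version 2 secrets engine mounted at secret/ and a token in the VAULT_TOKEN env var; the path and address are illustrative:

    import os
    import requests

    resp = requests.get(
        "http://127.0.0.1:8200/v1/secret/data/smtp",
        headers={"X-Vault-Token": os.environ["VAULT_TOKEN"]},
    )
    resp.raise_for_status()
    smtp = resp.json()["data"]["data"]   # e.g. {"username": "...", "password": "..."}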
I have a Django app that uses some secret keys (for example for OAuth2 / JWT authentication). I'm wondering where the right place to store these keys is.
Here are the methods I found so far:
Hardcoding: not an option, I don't want my secrets in source control.
Hardcoding + obfuscating: same as #1 - attackers can just run my code to get the secret.
Storing in environment variables: my app.yaml is also source-controlled.
Storing in DB: Not sure about that. DB is not reliable enough in terms of availability and security.
Storing in a non-source-controlled file: my favorite method so far. The problem is that I need some backup for the files, and manual backup doesn't sound right.
Am I missing something? Is there a best practice for storing secret keys for Django apps or App Engine apps?
You can hardly hide the secret keys from an attacker that can access your server, since the server needs to know the keys. But you can make it hard for an attacker with low privileges.
Obfuscation is generally not considered good practice.
Your option 5 seems reasonable. Storing the keys in a non-source-controlled file lets you keep the keys in a single, well-defined place. You can set appropriate permissions on that file so that an attacker would need high privileges to open it. Also make sure that high privileges are required to edit the rest of the project; otherwise, the attacker could modify a random file in the project to get at the keys.
I myself use your option 5 in my projects.
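In practice it can be as simple as this (a sketch; the path and key names are illustrative, and the file should be readable only by the app user):

    import json

    with open("/etc/myapp/secrets.json") as f:   # e.g. chmod 600, owned by the app user
        _secrets = json.load(f)

    SECRET_KEY = _secrets["django_secret_key"]
    OAUTH_CLIENT_SECRET = _secrets["oauth_client_secret"]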
A solution I've seen is to store an encrypted copy of the secret configuration in your repository using gpg. Depending on the structure of your team you could encrypt it symmetrically and share the password to decrypt it or encrypt it with the public keys of core members / maintainers.
That way your secrets are backed up the same way your code is without making them as visible.
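A sketch of how the app (or a deploy step) might decrypt such a file, assuming the gpg binary is available and the relevant key or passphrase is reachable via gpg-agent; file names are illustrative:

    import json
    import subprocess

    def load_secrets(path="secrets.json.gpg"):
        out = subprocess.run(
            ["gpg", "--quiet", "--batch", "--decrypt", path],
            check=True, capture_output=True,
        )
        return json.loads(out.stdout)

    # Encrypting once before committing:
    #   gpg --symmetric --cipher-algo AES256 secrets.json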
We're using django to make a json webservice front-end for mysql. We have apache and django running on an EC2 instance and MySQL running on an RDS instance. We've started benchmarking performance using apache bench and got some really poor performance numbers. We also noticed that while running the tests, our apache/django instance goes to 100% cpu usage at very low load and the MySQL instance never gets above 2% cpu usage.
We're trying to make sense of this and isolate the problem, so we did several ab tests:
A request for a static html page from apache -- ~2000 requests/second.
A request that executes a small python function in django, and no db interaction -- ~1000 requests/second.
A request that executes one of our django webservice functions that calls authenticate and then does a very simple query to fetch one record from a table -- 11 requests/second
Same as 3, but commented the call to authenticate -- 95 requests/second.
Why is authenticate so slow? Is it writing data to the db, finding a billion digits of pi, what?
We would like to keep the call to authenticate in these functions, because we don't want to leave them open to anyone that can guess the url, etc. Has anyone here noticed that authenticate is slow, and can anyone suggest a way to remedy it?
Thank you very much!
I am no expert in authentication and security but the following are some ideas as to why this might be happening and possibly how you can increase the performance somewhat.
Since passwords are stored in the db, plaintext passwords are not stored; their hash is stored instead to make the storage secure. This way you can still validate a user logging in by comparing the hash computed from the typed password with the one stored in the db. This increases security so that if a malicious party gets a copy of the db, the only way to recover the plaintext passwords is either by using rainbow tables or by a brute-force attack.
This is where things get interesting. According to Moore's Law, computers are becoming exponentially faster, so computing hash functions becomes much cheaper in terms of time, especially with quick hash functions like MD5 or SHA1. This poses a problem because, with all of the computing power available today combined with fast hash functions, attackers can brute-force hashed passwords relatively easily. To combat this, two things can be done. One is to loop the hash function multiple times (the output of the hash is fed back into the hash). This however is not very effective, because it only increases the cost of the hash function by a constant factor. That's why the second approach is preferred, which is to make the actual hash function more complex and computationally expensive. With a more complex function, it takes more time to compute the hash. Even if it takes a second to compute, that's not a big deal for end users, but it is a big deal for a brute-force attack, because millions of hashes have to be computed. That's why, starting with Django 1.4, a fairly computationally expensive function called PBKDF2 is used.
To get back to your question: it's because of this function that, when you enable authentication, your benchmark numbers drop drastically and your CPU usage goes up.
Here are some ways you can increase the performance.
Starting with Django 1.4, you can change the default password hasher (docs); a sketch of that settings change follows this list. If you don't need much security, you can change the default to SHA1 or MD5. This should increase performance, but keep in mind that the security will be much weaker. My personal opinion is that security is important and worth the extra time, but if it is not warranted in your application, it's something you might want to consider.
Use sessions (see the second sketch after this list). The expensive hash function is only computed on the initial login. Once the user logs in, a session is created and a cookie with the session id is sent to the user. On subsequent requests, the user sends the cookie back and, if the session has not expired yet, the user is automatically authenticated (don't worry about security, since the session data is signed...). The point is that verifying a session is A LOT less computationally expensive than computing the expensive hash function. I guess that in your ab tests you did not send a session cookie. Try some tests that also send a session cookie and see how it performs. If sending cookies is not really an option since you are making a JSON API, then you can modify the session back-end to accept the session data via a session GET parameter instead of a cookie. I'm not sure, however, what the security ramifications of doing that are.
Switch to nginx. I am not an expert in deployment, but in my experience nginx is much faster and friendlier to Django than Apache. One advantage which I think might be of particular interest to you is nginx's ability to have multiple worker processes and to use proxy_pass to hand off requests to Django process(es). If you have multiple worker processes, you can point each worker to a separate Django process via proxy_pass, which effectively adds multiprocessing to Django. Another alternative: if you use something like the gevent WSGI server, you can make a pool in the Django process, which might also increase performance. I'm not sure whether any of these will increase your performance drastically since your CPU load is already at 100%, but it might be something to look into.
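For point 1, the relevant setting is PASSWORD_HASHERS in settings.py. A rough sketch (Django 1.4+), with the usual caveat that putting a fast hasher first trades security for speed:

    # settings.py
    PASSWORD_HASHERS = (
        "django.contrib.auth.hashers.SHA1PasswordHasher",    # fast but weak
        "django.contrib.auth.hashers.PBKDF2PasswordHasher",  # still accepted for existing hashes
    )

For point 2, the idea inside a view could be sketched like this (names are illustrative; note that request.user.is_authenticated is a method, not a property, in Django versions of that era):

    import json
    from django.contrib.auth import authenticate, login
    from django.http import HttpResponse

    def record_view(request, pk):
        if not request.user.is_authenticated():
            user = authenticate(username=request.POST.get("username"),
                                password=request.POST.get("password"))
            if user is None:
                return HttpResponse(status=401)
            login(request, user)   # the expensive hash is computed only here
        # Requests that send the session cookie back skip the hash entirely.
        return HttpResponse(json.dumps({"id": pk}), content_type="application/json")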
I was wondering, when dealing with a web service API that returns XML, whether it's better (faster) to call the external service each time and parse the XML (using ElementTree) for display on your site, or to save the records into the database (after parsing once, or however many times you need to each day) and make database calls instead for that same information.
First off -- measure. Don't just assume that one is better or worse than the other.
Second, if you really don't want to measure, I'd guess the database is a bit faster (assuming the database is relatively local compared to the web service). Network latency usually is more than parse time unless we're talking a really complex database or really complex XML.
Everyone is being very polite in answering this question: "it depends"... "you should test"... and so forth.
True, the question does not go into great detail about the application and network topologies involved, but if the question is even being asked, then it's likely a) the DB is "local" to the application (on the same subnet, or the same machine, or in memory), and b) the web service is not. After all, the OP uses the phrases "external service" and "display on your own site." The phrase "parsing it once or however many times you need to each day" also suggests a set of data that doesn't exactly change every second.
The classic SOA myth is that the network is always available; going a step further, I'd say it's a myth that the network is always available with low latency. Unless your own internal systems are crap, sending an HTTP query across the Internet will always be slower than a query to a local DB or DB cluster. There are any number of reasons for this: number of hops to the remote server, outage or degradation issues that you can't control on the remote end, and the internal processing time for the remote web service application to analyze your request, hit its own persistence backend (aka DB), and return a result.
Fire up your app. Measure latency and response times to your DB. Now do the same to a remote web service. Unless your DB is also across the Internet, you'll notice a huge difference.
It's not at all hard for a competent technologist to scale a DB, or for you to completely remove the DB from caching using memcached and other paradigms; the latency between servers sitting near each other in the datacentre is monumentally less than between machines over the Internet (and more secure, to boot). Even if achieving this scale requires some thought, it's under your control, unlike a remote web service whose scaling and latency are totally opaque to you. I, for one, would not be too happy with the idea that the availability and responsiveness of my site are based on someone else entirely.
Finally, what happens if the remote web service is unavailable? Imagine a world where every request to your site involves a request over the Internet to some other site. What happens if that other site is unavailable? Do your users watch a spinning cursor of death for several hours? Do they enjoy an Error 500 while your site borks on this unexpected external dependency?
If you find yourself adopting an architecture whose fundamental features depend on a remote Internet call for every request, think very carefully about your application before deciding if you can live with the consequences.
Consuming the web services is more efficient because there are a lot more things you can do to scale your web services and web server (via caching, etc.). By consuming the middle layer, you also have the option of changing the returned data format (e.g. you can decide to use JSON rather than XML). Scaling a database is much harder (involving replication, etc.), so in general, reduce hits on the DB if you can.
There is not enough information to be able to say for sure in the general case. Why don't you do some tests and find out? Since it sounds like you are using python you will probably want to use the timeit module.
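A tiny example of what that measurement could look like (the module and function names here are hypothetical):

    import timeit

    svc = timeit.timeit("fetch_from_webservice()",
                        setup="from myapp import fetch_from_webservice", number=50)
    db = timeit.timeit("fetch_from_db()",
                       setup="from myapp import fetch_from_db", number=50)
    print(svc / 50, db / 50)   # average seconds per call for each path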
Some things that could affect the result:
Performance of the web service you are using
Reliability of the web service you are using
Distance between servers
Amount of data being returned
I would guess that if it is cacheable, that a cached version of the data will be faster, but that does not necessarily mean using a local RDBMS, it might mean something like memcached or an in memory cache in your application.
It depends - who is calling the web service? Is the web service called every time the user hits the page? If that's the case I'd recommend introducing a caching layer of some sort - many web service APIs throttle the number of hits you can make per hour.
Whether you choose to parse the cached XML on the fly or call the data from a database probably won't matter (unless we are talking enterprise scaling here). Personally, I'd much rather make a simple SQL call than write a DOM Parser (which is much more prone to exceptional scenarios).
It depends from case to case, you'll have to measure (or at least make an educated guess).
You'll have to consider several things.
Web service
it might hit database itself
it can be cached
it will introduce network latency and might be unreliable
or it could be in local network and faster than accessing even local disk
DB
might be slow since it needs to access disk (although databases have internal caches, those are usually not targeted)
should be reliable
Technology itself doesn't mean much in terms of speed - in one case the database parses SQL, in the other an XML parser parses XML, and the database is usually accessed via a socket as well, so you have both parsing and network in either case.
Caching data in your application if applicable is probably a good idea.
As a few people have said, it depends, and you should test it.
Often external services are slow, and caching them locally (in a database in memory, e.g., with memcached) is faster. But perhaps not.
Fortunately, it's cheap and easy to test.
Test, definitely. As a rule of thumb, XML is good for communicating between apps, but once you have the data inside your app, everything should go into a database table. This may not apply in all cases, but 95% of the time it has for me. Anytime I tried to store data any other way (e.g. XML in a content management system) I ended up wishing I had just used good old sprocs and SQL Server.
It sounds like you essentially want to cache results, and are wondering if it's worth it. But if so, I would NOT use a database (I assume you are thinking of a relational DB): RDBMSs are not good for caching, even though many people use them that way. You don't need persistence or ACID.
If the choice were between Oracle/MySQL and the external web service, I would start with just using the service.
Instead, consider real caching systems; local or not (memcache, simple in-memory caches etc).
Or if you must use a DB, use a key/value store; BDB works well. Store the response message in its serialized form (XML), try to fetch it from the cache, and if it's not there, fetch from the service and parse. Or if there's a convenient and more compact serialization, store and fetch that.
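A sketch of the cache-the-serialized-response idea with memcached (assuming the pymemcache client; the URL, key, and TTL are illustrative):

    import requests
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))

    def fetch_feed(url="http://example.com/feed.xml", ttl=300):
        xml = cache.get(url)
        if xml is None:                        # miss: hit the external service
            xml = requests.get(url, timeout=5).content
            cache.set(url, xml, expire=ttl)    # keep the raw XML for `ttl` seconds
        return xml                             # parse with ElementTree as before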