We're using Django to make a JSON web service front end for MySQL. We have Apache and Django running on an EC2 instance and MySQL running on an RDS instance. We've started benchmarking performance using ApacheBench (ab) and got some really poor numbers. We also noticed that while running the tests, our Apache/Django instance goes to 100% CPU usage at very low load, while the MySQL instance never gets above 2% CPU usage.
We're trying to make sense of this and isolate the problem, so we did several ab tests:
A request for a static HTML page from Apache -- ~2000 requests/second.
A request that executes a small Python function in Django, with no DB interaction -- ~1000 requests/second.
A request that executes one of our Django web service functions that calls authenticate and then does a very simple query to fetch one record from a table -- 11 requests/second.
Same as test 3, but with the call to authenticate commented out -- 95 requests/second.
Why is authenticate so slow? Is it writing data to the db, finding a billion digits of pi, what?
We would like to keep the call to authenticate in these functions, because we don't want to leave them open to anyone who can guess the URL, etc. Has anyone here noticed that authenticate is slow, and can anyone suggest a way to remedy it?
Thank you very much!
I am no expert in authentication and security, but here are some ideas as to why this might be happening and how you might increase performance somewhat.
Since passwords are stored in the db, to make their storage secure, plaintext passwords are not stored; their hashes are stored instead. This way you can still validate a user logging in by comparing the hash computed from the typed password to the one stored in the db. This increases security so that if a malicious party gets a copy of the db, the only way to recover the plaintext passwords is by either using rainbow tables or doing a brute-force attack.
This is where things get interesting. According to Moore's Law, computers are becoming exponentially faster, hence computing hash functions becomes much cheaper in terms of time, especially quick hash functions like MD5 or SHA-1. This poses a problem because, with all the computing power available today combined with fast hash functions, hackers can brute-force hashed passwords relatively easily. To combat this, two things can be done. One is to loop the hash function multiple times (the output of the hash is fed back into the hash). This however is not very effective because it only increases the cost of the hash function by a constant factor. That's why the second approach is preferred, which is to make the actual hash function more complex and computationally expensive. With a more complex function, it takes more time for the hash to be computed. Even if it takes a second to compute, that is not a big deal for end users, but it is a big deal for a brute-force attack, where millions of hashes have to be computed. That's why, starting with Django 1.4, the default is a fairly computationally expensive function called PBKDF2.
To get back to your question: it is because of this function that, when you enable authentication, your benchmark numbers drop drastically and your CPU usage goes up.
Here are some ways you can increase the performance.
Starting with Django 1.4, you can change the default password hashing function (docs). If you don't need much security, you can change the default to SHA-1 or MD5. This should increase performance, however keep in mind that security will be much weaker. My personal opinion is that the security is important and worth the extra time, but if it is not warranted in your application, it's something you might want to consider.
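For example, a minimal settings.py sketch (assuming Django 1.4+; the first hasher in the list is used for new passwords, the rest are still accepted when verifying existing ones):

```python
# settings.py -- sketch only; putting SHA1 first trades security for speed.
PASSWORD_HASHERS = (
    'django.contrib.auth.hashers.SHA1PasswordHasher',    # fast, much weaker
    'django.contrib.auth.hashers.PBKDF2PasswordHasher',  # Django 1.4 default
    'django.contrib.auth.hashers.PBKDF2SHA1PasswordHasher',
    'django.contrib.auth.hashers.MD5PasswordHasher',
)
```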
Use sessions. The expensive hash function is only computed on the initial login. Once the user logs in, a session is created for that user and a cookie with the session id is sent to the user. On subsequent requests, the user sends that cookie, and if the session has not expired yet, the user is automatically authenticated (don't worry about security, since session data is signed...). The point is that verifying a session is A LOT less computationally expensive than computing the expensive hash function. I'm guessing that in your ab tests you did not send a session cookie. Try some tests that also send a session cookie and see how it performs. If sending cookies is not really an option since you are making a JSON API, then you can modify the session backend to accept the session data via a GET parameter instead of a cookie. I'm not sure, however, what the security ramifications of doing that are.
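As a rough illustration (a sketch, not your actual code; the view names are made up), you pay the PBKDF2 cost once in a login view and let the session machinery handle the rest:

```python
# views.py -- minimal sketch; 'api_login' and 'fetch_record' are made-up names.
import json

from django.contrib.auth import authenticate, login
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse, HttpResponseForbidden

def api_login(request):
    # The expensive PBKDF2 hash is computed only here, once per session.
    user = authenticate(username=request.POST.get('username'),
                        password=request.POST.get('password'))
    if user is None or not user.is_active:
        return HttpResponseForbidden('bad credentials')
    login(request, user)  # creates the session and sets the session cookie
    return HttpResponse(json.dumps({'ok': True}), content_type='application/json')

@login_required  # checks only the (cheap) session, no password hashing
def fetch_record(request, record_id):
    return HttpResponse(json.dumps({'id': record_id}), content_type='application/json')
```

For the ab runs you can then pass the cookie along, e.g. with ab's -C sessionid=<value> option, so you benchmark the session path rather than the full authenticate path.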
Switch to nginx. I am not an expert in deployment, but in my experience nginx is much faster and more friendly to Django than Apache. One advantage which I think might be of particular interest to you is nginx's ability to have multiple worker processes and to use proxy_pass to hand off requests to Django process(es). If you have multiple worker processes, you can point each worker to a separate Django process via proxy_pass, which effectively adds multiprocessing to Django. Another alternative: if you use something like a gevent WSGI server, you can run a pool inside the Django process, which might also increase performance. I'm not sure either of these will increase your performance drastically, since your CPU load is already at 100%, but it might be something to look into.
Would it be possible to implement a rate-limiting feature in my Tornado app? For example, limiting the number of HTTP requests from a specific client if they are identified as sending too many requests per second (which flags them as bots).
I think I could do it manually by storing the requests in a database and analyzing the requests per IP address, but I was just checking if there is already an existing solution for this feature.
I tried checking Tornado's GitHub page; I have the same questions as this post, but no explicit answer was provided. I checked Tornado's wiki links as well, but I think rate limiting is not handled yet.
Instead of storing them in the DB, it would be better to keep them in a dictionary stored in memory for easy access.
Also, can you share details on whether the API sits behind a load balancer and which web server is used?
The enterprise-grade solution to your problem is Ambassador.
You can use Ambassador's components, such as Envoy Proxy and Edge Stack, and set them up to do what you need.
Additionally, to store the data you can use any popular cache database that stores key:value pairs, for example Redis.
If you're doing this for a very small project, you can use some npm/pip packages.
Read the docs: https://www.getambassador.io/products/edge-stack/api-gateway/
You should probably do this before your requests reach Tornado.
But if it's an application level feature (limiting requests depending on level of subscription), then you can do it in Tornado in lots of ways, depending on how complex you want the rate limiting to be.
Probably the simplest way is to have a dict on your tornado.web.Application that uses the IP as the key and the timestamp of the last request as the value, and check every request against it in prepare: if not enough time has passed since the last request, raise a tornado.web.HTTPError(429) (ideally with a Retry-After header). If you do this, you will still need to clean up the dict now and then to remove entries that have not made a request recently, or it will grow (you could do it in finish on every request).
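A minimal sketch of that approach (the one-second interval and handler names are arbitrary):

```python
# Per-IP rate limiting via a plain dict on the Application; sketch only.
import time

import tornado.ioloop
import tornado.web

MIN_INTERVAL = 1.0  # seconds required between requests from the same IP


class ThrottledHandler(tornado.web.RequestHandler):
    def prepare(self):
        last_seen = self.application.last_seen   # dict: ip -> last request time
        ip = self.request.remote_ip
        now = time.time()
        if now - last_seen.get(ip, 0.0) < MIN_INTERVAL:
            # Alternatively: raise tornado.web.HTTPError(429)
            self.set_status(429, "Too Many Requests")
            self.set_header("Retry-After", "1")
            self.finish("rate limit exceeded")    # finishing in prepare() stops processing
            return
        last_seen[ip] = now
        # Entries for idle clients should be pruned now and then, e.g. in on_finish().


class MainHandler(ThrottledHandler):
    def get(self):
        self.write("hello")


def make_app():
    app = tornado.web.Application([(r"/", MainHandler)])
    app.last_seen = {}
    return app


if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()
```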
If you have another fast/in-memory store attached (memcached, Redis, SQLite), you should use that, but you definitely should not use an RDBMS, as all those writes will not be great for its performance.
This question is more about architecture and libraries than about implementation.
I am currently working on a project which requires a local long-term cache storage (updated once a day) on the client, kept in sync with a remote db at the server. For the client side, SQLite has been chosen as a lightweight approach, and PostgreSQL as the feature-rich db at the server. Native replication mechanisms of Postgres are not an option because I need to keep the client really lightweight and free of reliance on external components like db servers.
The implementation language would be Python. Now I'm looking at ORMs like SQLAlchemy, but haven't worked with any before.
Does SQLAlchemy have any tools to keep sqlite and postgres dbs in sync?
If not, are there any other Python libraries which have such tools?
Any ideas about what the architecture should look like if the task must be solved "by hand"?
Added:
It's like telemetry, because the client will have an internet connection only for approximately 20 minutes a day.
So the main question is about the architecture of such a system.
It doesn't usually fall within the tasks of an ORM to sync data between databases, so you will likely have to implement it yourself. I don't know of any solution that will handle syncing for you given your choice of databases.
There are a couple important design choices to consider:
how do you figure out what data changed (i.e. inserted, updated or deleted)?
what is the most efficient way to package the change log?
will you have to deal with conflicts? And how will you do that?
The most efficient way to figure out what changed is to have the database tell you directly. Bottled Water can offer some inspiration in this regard: the idea is to tap into the event log Postgres would use for replication. You will need something like Kafka to keep track of what each of your clients already knows. This will allow you to optimize your server for writes, as you won't have clients querying to figure out what changed since they were last online.
The same can be achieved on the SQLite end with event callbacks; you'll just have to trade some storage space on the client to retain the changes to be sent to the server. If that sounds like too much infrastructure for your needs, it's something you can easily implement with SQL and polling as well, but I would still think of it as an event log and consider how it's implemented a detail, possibly allowing for a more efficient implementation later on.
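A minimal sketch of the event-log idea on the SQLite side, using plain triggers via the stdlib sqlite3 module rather than library-specific callbacks (the table and column names are made up):

```python
# Sketch: record changes to a hypothetical 'items' table into a change_log
# table, so the client can ship the log to the server when it gets online.
import sqlite3

conn = sqlite3.connect("client.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER);

CREATE TABLE IF NOT EXISTS change_log (
    seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonically increasing position
    op      TEXT NOT NULL,                      -- 'I', 'U' or 'D'
    item_id INTEGER NOT NULL,
    ts      TEXT DEFAULT (datetime('now'))
);

CREATE TRIGGER IF NOT EXISTS items_ins AFTER INSERT ON items
BEGIN INSERT INTO change_log(op, item_id) VALUES ('I', NEW.id); END;

CREATE TRIGGER IF NOT EXISTS items_upd AFTER UPDATE ON items
BEGIN INSERT INTO change_log(op, item_id) VALUES ('U', NEW.id); END;

CREATE TRIGGER IF NOT EXISTS items_del AFTER DELETE ON items
BEGIN INSERT INTO change_log(op, item_id) VALUES ('D', OLD.id); END;
""")

def changes_since(last_seq):
    """Return the change-log entries the server has not seen yet."""
    cur = conn.execute(
        "SELECT seq, op, item_id, ts FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,))
    return cur.fetchall()
```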
The best way to structure and package your change log will depend on your application's requirements, available bandwidth, etc. You could use standard formats such as JSON, and compress and encrypt if needed.
It will be much simpler to design your application to avoid conflicts, possibly by flowing data in a single direction, or by partitioning your data so that it always flows in a single direction for a specific partition.
One final thought is that with such an architecture you would be getting incremental updates, some of which might be missed for unplanned reasons (system failure, bugs, dropped messages, etc.). You could have some built-in heuristic to check that your data matches, like at least checking the number of records on each side, with some way to recover from such a fault; at a minimum, a way to manually re-fetch the data from the authoritative source, i.e. if the server is authoritative, the client should be able to discard its data and re-fetch it. You might need such a mechanism anyway for cases when the client is reinstalled, etc.
Our Google App Engine app stores a fair amount of personally identifying information (email, SSN, etc.) to identify users. I'm looking for advice as to how to secure that data.
My current strategy
Store the sensitive data in two forms:
Hashed - using SHA-2 and a salt
Encrypted - using public/private key RSA
When we need to do look ups:
Do look-ups on the hashed data (hash the PII in a query, compare it to the hashed PII in the datastore).
If we ever need to re-hash the data or otherwise deal with it in a raw form:
Decrypt the encrypted version with our private key. Never store it in raw form, just process it then re-hash & re-encrypt it.
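For concreteness, here's a rough sketch of the hash-for-lookup part of this strategy (stdlib hashlib only; the salt file name, function names, and the GQL comment are made up for illustration):

```python
# Sketch of the hash-for-lookup idea; the salt file and names are placeholders.
import hashlib

def load_salt(path="salt.bin"):
    # The salt file is deployed alongside the app but kept out of source control.
    with open(path, "rb") as f:
        return f.read()

SALT = load_salt()

def hash_pii(value):
    # The same hash is computed at write time and at query time, so equality
    # look-ups work against the stored hash without ever storing raw PII.
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

# e.g. datastore query (GQL-ish pseudocode):
#   SELECT * FROM Person WHERE ssn_hash = :1   with hash_pii(ssn)
```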
My concerns
Keeping our hash salt secret
If an attacker gets ahold of the data in the datastore, as well as our hash salt, I'm worried they could brute force the sensitive data. Some of it (like SSN, a 9-digit number) does not have a big key space, so even with a modern hash algorithm I believe it could be done if the attacker knew the salt.
My current idea is to keep the salt out of source control and in its own file. That file gets loaded onto GAE during deployment, and the app reads the file when it needs to hash incoming data.
In between deployments the salt file lives on a USB key protected by an angry bear (or a safe deposit box).
With the salt only living in two places
The USB key
Deployed to google apps
and with code download permanently disabled, I can't think of a way for someone to get ahold of the salt without stealing that USB key. Am I missing something?
Keeping our private RSA key secret
Less worried about this. It will be rare that we'll need to decrypt the encrypted version (only if we change the hash algorithm or data format).
The private key never has to touch the GAE server, we can pull down the encrypted data, decrypt it locally, process it, and re-upload the encrypted / hashed versions.
We can keep our RSA private key on a USB stick guarded by a bear AND a tiger, and only bring it out when we need it.
I realize this question isn't exactly google apps specific, but I think GAE makes the situation somewhat unique.
If I had total control, I'd do things like lock down deployment access and access to the datastore viewer with two-factor authentication, but those options aren't available at the moment (Having a GAE specific password is good, but I like having RSA tokens involved).
I'm also neither a GAE expert nor a security expert, so if there's a hole I'm missing or something I'm not thinking of specific to the platform, I would love to hear it.
When deciding on a security architecture, the first thing in your mind should always be threat models. Who are your potential attackers, what are their capabilities, and how can you defend against them? Without a clear idea of your threat model, you've got no way to assess whether or not your proposed security measures are sufficient, or even if they're necessary.
From your text, I'm guessing you're seeking to protect against some subset of the following:
An attacker who compromises your datastore data, but not your application code.
An attacker who obtains access to credentials to access the admin console of your app and can deploy new code.
For the former, encrypting or hashing your datastore data is likely sufficient (but see the caveats later in this answer). Protecting against the latter is tougher, but as long as your admin users can't execute arbitrary code without deploying a new app version, storing your keys in a module that's not checked in to source control, as you suggest, ought to work just fine, since even with admin access, they can't recover the keys, nor can they deploy a new version that reveals the keys to them. Make sure to disable downloading of source, obviously.
You rightly note some concerns about hashing of data with a limited amount of entropy - and you're right to be concerned. To some degree, salts can help with this by preventing precomputation attacks, and key stretching, such as that employed in PBKDF2, scrypt, and bcrypt, can make your attacker's life harder by increasing the amount of work they have to do. However, with something like SSN, your keyspace is simply so small that no amount of key stretching is going to help - if you hash the data, and the attacker gets the hash, they will be able to determine the original SSN.
In such situations, your only viable approach is to encrypt the data with a secret key. Now your attacker is forced to brute-force the key in order to get the data, a challenge that is orders of magnitude harder.
In short, my recommendation would be to encrypt your data using a standard (private key) cipher, with the key stored in a module not in source control. Using hashing instead will only weaken your data, while using public key cryptography doesn't provide appreciable security against any plausible threat model that you don't already have by using a standard cipher.
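As an illustration of the "standard cipher, key outside source control" idea, here is a minimal sketch using the third-party cryptography package's Fernet (an assumption on my part; any standard AES library would do). In practice the key is generated once and kept in a module that isn't checked in; it is generated inline here only so the sketch runs:

```python
# Minimal sketch of symmetric encryption with the 'cryptography' package.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # store this once, e.g. in a module outside source control
f = Fernet(key)

token = f.encrypt(b"123-45-6789")    # ciphertext is what goes in the datastore
print(f.decrypt(token))              # b'123-45-6789' -- only recoverable with the key
```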
Of course, the number one way to protect your users' data is to not store it in the first place, if you can. :)
You can increase your hashing algorithm's security by using HMAC, a secret key, and a unique salt per entry (I know people will disagree with me on this, but from my research I believe it helps avoid certain attacks). You can also use bcrypt or scrypt for hashing, which will make reversing the hash an extremely time-consuming process (but you'll also have to factor in the time it takes your app to compute the hash).
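A stdlib-only sketch of that HMAC-with-per-entry-salt approach (SECRET_KEY here is a placeholder for a key kept out of source control):

```python
# Sketch of HMAC hashing with a secret key and a unique salt per entry.
import hashlib
import hmac
import os

SECRET_KEY = b"replace-me-and-keep-out-of-source-control"

def hash_entry(value, salt=None):
    if salt is None:
        salt = os.urandom(16)                      # unique salt per entry
    digest = hmac.new(SECRET_KEY, salt + value.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return salt, digest                            # store both with the record

def verify_entry(value, salt, stored_digest):
    return hmac.compare_digest(hash_entry(value, salt)[1], stored_digest)
```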
By disabling code downloads and keeping your secret key protected, I can't imagine how someone could get hold of it. Just make sure your code is kept protected under similar safeguards, or that you remove the secret key from your code during development and only pull it out to deploy. I assume you will keep your secret key in your code (I've heard many people say to keep it in memory to be ultra secure, but given the nature of App Engine and instances, this isn't feasible).
Update:
Be sure to enable two-factor authentication for all Google accounts that have admin rights to your app. Google offers this, so I'm not sure whether your restriction against enabling it was imposed by an outside force or not.
Interesting approach to encrypting data in a datastore. After going through this, one question that comes to my mind is: how do you query data on your hashes? Are you comparing two hashes, or doing more fine-grained hashing? Also, how do you accomplish operations like greater-than or less-than comparisons after hashing and encrypting the data in your table?
By fine-grained hashing I mean: do you hash consecutive bytes of a data stream to get an accumulated hash, i.e. hash(abcd) = hash(a,b) + hash(b,c) + etc.? This type of hashing would tell you how similar the underlying data is rather than whether it matches exactly.
I was wondering when dealing with a web service API that returns XML, whether it's better (faster) to just call the external service each time and parse the XML (using ElementTree) for display on your site or to save the records into the database (after parsing it once or however many times you need to each day) and make database calls instead for that same information.
First off -- measure. Don't just assume that one is better or worse than the other.
Second, if you really don't want to measure, I'd guess the database is a bit faster (assuming the database is relatively local compared to the web service). Network latency usually is more than parse time unless we're talking a really complex database or really complex XML.
Everyone is being very polite in answering this question: "it depends"... "you should test"... and so forth.
True, the question does not go into great detail about the application and network topologies involved, but if the question is even being asked, then it's likely that a) the DB is "local" to the application (on the same subnet, or the same machine, or in memory), and b) the web service is not. After all, the OP uses the phrases "external service" and "display on your own site." The phrase "parsing it once or however many times you need to each day" also suggests a set of data that doesn't exactly change every second.
The classic SOA myth is that the network is always available; going a step further, I'd say it's a myth that the network is always available with low latency. Unless your own internal systems are crap, sending an HTTP query across the Internet will always be slower than a query to a local DB or DB cluster. There are any number of reasons for this: number of hops to the remote server, outage or degradation issues that you can't control on the remote end, and the internal processing time for the remote web service application to analyze your request, hit its own persistence backend (aka DB), and return a result.
Fire up your app. Measure some latencies and response times against your DB. Now do the same against the remote web service. Unless your DB is also across the Internet, you'll notice a huge difference.
It's not at all hard for a competent technologist to scale a DB, or to take the DB out of the picture entirely by caching with memcached and other approaches; the latency between servers sitting near each other in the datacentre is monumentally less than between machines over the Internet (and more secure, to boot). Even if achieving this scale requires some thought, it's under your control, unlike a remote web service whose scaling and latency are totally opaque to you. I, for one, would not be too happy with the idea that the availability and responsiveness of my site depend on someone else entirely.
Finally, what happens if the remote web service is unavailable? Imagine a world where every request to your site involves a request over the Internet to some other site. What happens if that other site is unavailable? Do your users watch a spinning cursor of death for several hours? Do they enjoy an Error 500 while your site borks on this unexpected external dependency?
If you find yourself adopting an architecture whose fundamental features depend on a remote Internet call for every request, think very carefully about your application before deciding if you can live with the consequences.
Consuming the web service is more efficient because there are a lot more things you can do to scale your web services and web server (via caching, etc.). By consuming the middle layer, you also have the option to change the returned data format (e.g. you can decide to use JSON rather than XML). Scaling a database is much harder (involving replication, etc.), so in general, reduce hits on the DB if you can.
There is not enough information to be able to say for sure in the general case. Why don't you do some tests and find out? Since it sounds like you are using Python, you will probably want to use the timeit module.
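For instance, a rough sketch of such a measurement (the URL, database file and query are placeholders, and 50 iterations is arbitrary):

```python
# Compare fetching + parsing the remote XML vs. a local DB query.
import sqlite3
import timeit
import urllib.request
import xml.etree.ElementTree as ET

def via_webservice():
    raw = urllib.request.urlopen("http://example.com/api/records").read()
    return ET.fromstring(raw)

conn = sqlite3.connect("local_cache.db")

def via_database():
    return conn.execute("SELECT * FROM records").fetchall()

print("web service:", timeit.timeit(via_webservice, number=50))
print("database:   ", timeit.timeit(via_database, number=50))
```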
Some things that could affect the result:
Performance of the web service you are using
Reliability of the web service you are using
Distance between servers
Amount of data being returned
I would guess that if the data is cacheable, a cached version will be faster, but that does not necessarily mean using a local RDBMS; it might mean something like memcached or an in-memory cache in your application.
It depends - who is calling the web service? Is the web service called every time a user hits the page? If that's the case, I'd recommend introducing a caching layer of some sort; many web service APIs throttle the number of hits you can make per hour.
Whether you choose to parse the cached XML on the fly or pull the data from a database probably won't matter (unless we are talking enterprise scaling here). Personally, I'd much rather make a simple SQL call than write a DOM parser (which is much more prone to exceptional scenarios).
It varies from case to case; you'll have to measure (or at least make an educated guess).
You'll have to consider several things.
Web service
it might hit a database itself
it can be cached
it will introduce network latency and might be unreliable
or it could be on the local network and faster than even accessing the local disk
DB
might be slow since it needs to access disk (databases do have internal caches, but those are usually not specifically targeted)
should be reliable
The technology itself doesn't mean much in terms of speed: in one case the database parses SQL, in the other an XML parser parses XML, and the database is usually accessed via a socket as well, so you have both parsing and network in either case.
Caching data in your application if applicable is probably a good idea.
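A minimal sketch of the "cache it in your application" idea (the fetch function and the 5-minute TTL are placeholders):

```python
# Tiny in-process cache with a time-to-live, e.g. for web service responses.
import time

_cache = {}  # key -> (expires_at, value)

def cached(key, fetch, ttl=300):
    """Return the cached value for key, calling fetch() only when stale."""
    now = time.time()
    entry = _cache.get(key)
    if entry is not None and entry[0] > now:
        return entry[1]
    value = fetch()              # e.g. call the web service and parse the XML
    _cache[key] = (now + ttl, value)
    return value
```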
As a few people have said, it depends, and you should test it.
Often external services are slow, and caching them locally (in a database in memory, e.g., with memcached) is faster. But perhaps not.
Fortunately, it's cheap and easy to test.
Definitely test. As a rule of thumb, XML is good for communicating between apps, but once you have the data inside your app, everything should go into a database table. This may not apply in all cases, but 95% of the time it has for me. Any time I tried to store data another way (e.g. XML in a content management system), I ended up wishing I had just used good old sprocs and SQL Server.
It sounds like you essentially want to cache results and are wondering if it's worth it. If so, I would NOT use a database (I assume you are thinking of a relational DB): RDBMSs are not good for caching, even though many people use them that way. You don't need persistence or ACID.
If the choice were between Oracle/MySQL and the external web service, I would start with just using the service.
Instead, consider real caching systems, local or not (memcached, simple in-memory caches, etc.).
Or, if you must use a DB, use a key/value store; BDB works well. Store the response message in its serialized form (XML), try to fetch it from the cache, and if it's not there, fetch it from the service and parse. Or, if there's a convenient and more compact serialization, store and fetch that.
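A sketch of that key/value approach using the stdlib's dbm-backed shelve module (the names are placeholders; a real BDB or memcached store would work the same way):

```python
# Keep the raw XML response in a key/value store keyed by URL;
# only hit the service on a miss, and parse after reading back.
import shelve
import urllib.request
import xml.etree.ElementTree as ET

def get_parsed(url, store_path="ws_cache"):
    with shelve.open(store_path) as store:
        raw = store.get(url)
        if raw is None:
            raw = urllib.request.urlopen(url).read()
            store[url] = raw                 # store the serialized form as-is
    return ET.fromstring(raw)
```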
I'm looking at sessions in Django, and by default they are stored in the database. What are the benefits of filesystem and cache sessions and when should I use them?
The filesystem backend is only worth looking at if you're not going to use a database for any other part of your system. If you are using a database then the filesystem backend has nothing to recommend it.
The memcache backend is much quicker than the database backend, but you run the risk of a session being purged and some of your session data being lost.
If you're running a really, really high-traffic website and code carefully so you can cope with losing a session, then use memcache. If you're not using a database, use the file system cache, but the default database backend is the best, safest and simplest option in almost all cases.
I'm no Django expert, so this answer is about session stores generally. Downvote if I'm wrong.
Performance and Scalability
Choice of session store has an effect on performance and scalability. This should only be a big problem if you have a very popular application.
Both database and filesystem session stores are (usually) backed by disks so you can have a lot of sessions cheaply (because disks are cheap), but requests will often have to wait for the data to be read (because disks are slow). Memcached sessions use RAM, so will cost more to support the same number of concurrent sessions (because RAM is expensive), but may be faster (because RAM is fast).
Filesystem sessions are tied to the box where your application is running, so you can't load balance between multiple application servers if your site gets huge. Database and memcached sessions let you have multiple application servers talking to a shared session store.
Simplicity
Choice of session store will also impact how easy it is to deploy your site. Changing away from the default will cost some complexity. Memcached and RDBMSs both have their own complexities, but your application is probably going to be using an RDBMS anyway.
Unless you have a very popular application, simplicity should be the larger concern.
Bonus
Another approach is to store session data in cookies (all of it, not just an ID). This has the advantage that the session store automatically scales with the number of users, but it has disadvantages too. You (or your framework) need to be careful to stop users forging session data. You also need to keep each session small because the whole thing will be sent with every request.
As of Django 1.1 you can use the cached_db session backend.
This stores the session in the cache (only use with memcached), and writes it back to the DB. If it has fallen out of the cache, it will be read from the DB.
Although this is slower than just using memcached for storing the session, it adds persistence to the session.
For more information, see: Django Docs: Using Cached Sessions
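A minimal settings sketch (the cache location is a placeholder, and the exact memcached backend class depends on your Django version):

```python
# settings.py -- sketch for the cached_db session backend.
SESSION_ENGINE = 'django.contrib.sessions.backends.cached_db'

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}
```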
One thing that has to be considered when choosing a session backend is "how often is session data modified?" Even sites with moderate traffic will suffer if session data is modified on each request, making many database trips to store and retrieve data.
In my previous work we used memcache as the session backend exclusively, and it worked really well. Our administrative team put really great effort into making two dedicated memcached instances stable as a rock, and after a bit of twiddling with the initial setup, we did not have any interruptions in session backend operations.
If the database has a DBA who isn't you, you may not be allowed to use a database-backed session (it being a front-end-only matter). Until Django supports easily merging data from several databases, so that you can have frontend-specific things like sessions and user messages (the messages in django.contrib.auth are also stored in the db) in a separate db, you need to keep this in mind.