How to add rate limiting to a Tornado Python app

Would it be possible to implement a rate-limiting feature in my Tornado app? For example, limiting the number of HTTP requests from a specific client if they are identified as sending too many requests per second (which flags them as bots).
I think I could do it manually by storing the requests in a database and analyzing the requests per IP address, but I was just checking whether an existing solution for this already exists.
I tried checking Tornado's GitHub page; I have the same question as this post, but no explicit answer was provided there. I checked Tornado's wiki links as well, but I don't think rate limiting is handled yet.

Instead of storing them in the DB, it would be better to keep them in an in-memory dictionary for easy access.
Also, can you share whether the API sits behind a load balancer and which web server is used?

The enterprise-grade solution to your problem is Ambassador.
You can use Ambassador's products, such as the Envoy proxy and Edge Stack, and set them up to handle the rate limiting for you.
Additionally, to store the data you can use any popular caching database that stores key:value pairs, for example Redis.
If you are doing this for a very small project, you can use some npm/pip packages instead.
Read the docs: https://www.getambassador.io/products/edge-stack/api-gateway/
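If you go the Redis route, a common pattern is one counter per client IP with a one-second expiry. Here is a minimal sketch; the key scheme, limit, and window are illustrative assumptions, not something from the Ambassador docs:

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

REQUESTS_PER_SECOND = 10  # illustrative limit


def allow_request(client_ip):
    """Return True while the client is still under the per-second limit."""
    key = "ratelimit:%s" % client_ip      # hypothetical key scheme
    count = r.incr(key)                   # atomic increment
    if count == 1:
        r.expire(key, 1)                  # start a one-second window
    return count <= REQUESTS_PER_SECOND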

You should probably do this before your requests reach Tornado.
But if it's an application level feature (limiting requests depending on level of subscription), then you can do it in Tornado in lots of ways, depending on how complex you want the rate limiting to be.
Probably the simplest way is to have a dict on your tornado.web.Application that uses the IP as the key and the timestamp of the last request as the value, and check every request against it in prepare: if not enough time has passed since the last request, raise a tornado.web.HTTPError(429) (ideally with a Retry-After header). If you do this, you will still need to clean up the dict now and then to remove entries that have not made a request recently, or it will keep growing (you could do that in finish on every request).
If you have another fast/in-memory store attached (memcache, redis, sqlite), you should use that, but you definitely should not use an RDBMS, as all those writes will not be great for its performance.
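A minimal sketch of that dict-based approach, assuming a one-second minimum interval per IP and a 60-second cleanup threshold (both numbers are illustrative):

import time

import tornado.ioloop
import tornado.web

MIN_INTERVAL = 1.0   # minimum seconds between requests per IP (illustrative)
STALE_AFTER = 60.0   # forget IPs that have been idle this long (illustrative)


class RateLimitedHandler(tornado.web.RequestHandler):
    def prepare(self):
        last_request = self.application.last_request   # dict kept on the Application
        ip = self.request.remote_ip
        now = time.time()
        if now - last_request.get(ip, 0.0) < MIN_INTERVAL:
            # To also send a Retry-After header you would override write_error.
            raise tornado.web.HTTPError(429)
        last_request[ip] = now

    def on_finish(self):
        # Occasional cleanup so the dict does not grow without bound.
        now = time.time()
        last_request = self.application.last_request
        for ip in [k for k, ts in last_request.items() if now - ts > STALE_AFTER]:
            del last_request[ip]

    def get(self):
        self.write("ok")


def make_app():
    app = tornado.web.Application([(r"/", RateLimitedHandler)])
    app.last_request = {}   # ip -> timestamp of the last accepted request
    return app


if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()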

Related

How to reliably check if a domain has been registered or is available?

Objective
I need a reliable way to check in Python if a domain of any TLD has been registered or is available. Reliability and support for any TLD are the key points I'm struggling with.
What I tried?
WHOIS is the obvious way to do the check and an existing Python library like the popular python-whois was my first try. The problem is that it doesn't seem to be able to retrieve information for some of the TLDs, e.g. .run, while it works mostly fine for older ones, e.g. .com.
So if python-whois is not reliable, maybe just a wrapper for the Linux's whois would be better. I tried whois library and unfortunately it supports only a limited set of TLDs, apparently to make sure it can always parse the results.
As I don't really need to parse the results, I ripped the code out of the whois library and tried to do the query by calling Linux's whois myself:
import subprocess

# Call the system whois binary and capture its combined stdout/stderr output.
p = subprocess.Popen(['whois', 'example.com'],
                     stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
r = p.communicate()[0]
print(r.decode())
That works much better. Except it's not that reliable either. I tried one particular domain and got "Your connection limit exceeded. Please slow down and try again later." Well, it's not me who is exceeding the limit. Being behind a single IP in a huge office means that somebody else might hit the limit before I make a query.
Another thought was not to use WHOIS and instead do a DNS lookup. However, I need to deal with domains that are registered or in the protected phase after expiry and don't have DNS records so this is apparently not possible.
Last idea was to do the queries via an API of some 3rd party service. The problem is trust in those services as they might snatch an available domain that I check.
Similar questions
There are already similar questions:
a stable way to check domain availability with pywhois
Testing domain-name availability with pythonwhois
...but they either deal only with a limited set of TLDs or are not that bothered by reliability.
If you do not have specific access (like being a registrar), and if you do not target a specific TLD (some TLDs do have a specific public service called domain availability), the only approach that makes sense is to query whois servers.
You then have at least the following two problems:
use the appropriate whois server based on the given domain name
taking into account that whois servers are rate-limited, so if you bulk query them without care you will first hit delays and then even risk having your IP blacklisted for some time.
For the second point the usual methods apply (handling delays on your side, using multiple endpoints, etc.)
For the first point, in another reply of mine here: https://unix.stackexchange.com/a/407030/211833 you can find some explanation of what you observe depending on the wrapper around whois you use, along with some countermeasures. See also my other reply here: https://webmasters.stackexchange.com/a/111639/75842, specifically point 2.
Note that depending on your specific requirements, and if you are able to change at least part of them, you may have other solutions. For example, for gTLDs, if you can tolerate a 24-hour delay, you may use the published zone files of registries to find registered domain names (only those that are published, so not all of them).
Also, while you are right in a generic sense that using a third party has its weaknesses, if you find a trustworthy registrar that both has access to many registries and provides you with an API, you could use it for your needs.
In short, I do not believe you can achieve this task with all cases (100% reliability, 100% TLDs, etc.). You will need some compromises but they depend on your initial needs.
Also very important: do not shell out to run a whois command; this will create many security and performance problems. Use an appropriate library from your programming language to do whois queries, or just open a TCP socket on port 43, send your query on one line terminated by CR+LF, and read back a blob of text; that is basically all that RFC 3912 defines.
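A minimal sketch of that raw-socket approach; whois.verisign-grs.com is assumed here as the server for .com, and picking the right server per TLD (the first problem above) is left out:

import socket


def whois_query(domain, server, port=43, timeout=10):
    # RFC 3912: send the query on one line terminated by CR+LF, then read
    # the response until the server closes the connection.
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(domain.encode() + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")


# Example (the server choice is an assumption; adjust per TLD):
print(whois_query("example.com", "whois.verisign-grs.com"))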

Etags used in RESTful APIs are still susceptible to race conditions

Maybe I'm overlooking something simple and obvious here, but here goes:
So one of the features of the Etag header in an HTTP request/response is to enforce concurrency control, namely so that multiple clients cannot overwrite each other's edits of a resource (normally when doing a PUT request). I think that part is fairly well known.
The bit I'm not so sure about is how the backend/API implementation can actually implement this without having a race condition; for example:
Setup:
RESTful API sits on top of a standard relational database, using an ORM for all interactions (SQLAlchemy on top of Postgres, for example).
Etag is based on 'last updated time' of the resource
Web framework (Flask) sits behind a multi-threaded/multi-process web server (nginx + gunicorn), so it can process multiple requests concurrently.
The problem:
Client 1 and Client 2 both request a resource (GET request); both now have the same Etag.
Both Client 1 and Client 2 send a PUT request to update the resource at the same time. The API receives the requests, uses the ORM to fetch the required information from the database, then compares the request Etag with the 'last updated time' from the database... they match, so each is a valid request. Each request continues on and commits its update to the database.
Each commit is a synchronous/blocking transaction, so one request will get in before the other and thus one will overwrite the other's changes.
Doesn't this break the purpose of the Etag?
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example. Am I missing something?
P.S. Tagged as Python due to the frameworks used, but this should be a language/framework-agnostic problem.
This is really a question about how to use ORMs to do updates, not about ETags.
Imagine 2 processes transferring money into a bank account at the same time -- they both read the old balance, add some, then write the new balance. One of the transfers is lost.
When you're writing with a relational DB, the solution to these problems is to put the read + write in the same transaction, and then use SELECT FOR UPDATE to read the data and/or ensure you have an appropriate isolation level set.
The various ORM implementations all support transactions, so getting the read, check and write into the same transaction will be easy. If you set the SERIALIZABLE isolation level, then that will be enough to fix race conditions, but you may have to deal with deadlocks.
ORMs also generally support SELECT FOR UPDATE in some way. This will let you write safe code with the default READ COMMITTED isolation level. If you google SELECT FOR UPDATE and your ORM, it will probably tell you how to do it.
In both cases (serializable isolation level or select for update), the database will fix the problem by getting a lock on the row for the entity when you read it. If another request comes in and tries to read the entity before your transaction commits, it will be forced to wait.
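For instance, with SQLAlchemy (mentioned in the question) the read, check, and write can sit inside one transaction holding a row lock. This is only a sketch: Resource, its columns, and the timestamp-based Etag comparison are assumptions for illustration.

from sqlalchemy.orm import Session


def update_resource(engine, resource_id, if_match_etag, new_data):
    """Return True if the update was applied, False if the client's Etag was stale."""
    with Session(engine) as session, session.begin():
        resource = (
            session.query(Resource)      # "Resource" is a hypothetical mapped class
            .filter_by(id=resource_id)
            .with_for_update()           # emits SELECT ... FOR UPDATE: the row is locked
            .one()
        )
        if str(resource.updated_at.timestamp()) != if_match_etag:
            return False                 # caller responds with 412 Precondition Failed
        resource.data = new_data         # hypothetical column
        return True                      # commit on block exit releases the lock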
Etag can be implemented in many ways, not just last updated time. If you choose to implement the Etag purely based on last updated time, then why not just use the Last-Modified header?
If you were to encode more information into the Etag about the underlying resource, you wouldn't be susceptible to the race condition that you've outlined above.
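As an illustration of that point, an Etag can be derived from the serialized representation itself rather than from a timestamp. A minimal sketch; the JSON canonicalization is an assumption:

import hashlib
import json


def make_etag(resource_dict):
    # Hash a canonical representation, so any change to the content
    # (not just the timestamp) produces a different Etag.
    body = json.dumps(resource_dict, sort_keys=True).encode()
    return '"%s"' % hashlib.sha256(body).hexdigest()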
The only fool proof solution I can think of is to also make the database perform the check, in the update query for example. Am I missing something?
That's your answer.
Another option would be to add a version to each of your resources which is incremented on each successful update. When updating a resource, specify both the ID and the version in the WHERE clause, and additionally set version = version + 1. If the resource has been updated since it was last read, the update will fail because no matching record is found. This eliminates the need for locking.
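A sketch of that conditional update with a DB-API cursor using pyformat parameters (e.g. psycopg2); the table and column names are illustrative:

def update_with_version(cursor, resource_id, expected_version, new_data):
    """Return True if the row was updated, False if the version was stale."""
    cursor.execute(
        "UPDATE resources "
        "SET data = %(data)s, version = version + 1 "
        "WHERE id = %(id)s AND version = %(version)s",
        {"data": new_data, "id": resource_id, "version": expected_version},
    )
    # rowcount == 0 means another request already bumped the version.
    return cursor.rowcount == 1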
You are right that you can still get race conditions if the 'check last etag' and 'make the change' aren't in one atomic operation.
In essence, if your server itself has a race condition, sending etags to the client won't help with that.
You already mentioned a good way to achieve this atomicity:
The only fool-proof solution I can think of is to also make the database perform the check, in the update query for example.
You could do something else, like using a mutex lock. Or using an architecture where two threads cannot deal with the same data.
But the database check seems good to me. What you describe about ORM checks might be an addition for better error messages, but is not by itself sufficient as you found.

Why are OpenStack Swift requests blocked by eventlet.green.httplib?

OpenStack Swift uses eventlet.green.httplib for its BufferedHTTPConnection objects.
When I run a performance benchmark of it for write operations, I can observe that write throughput drops even when only one replica node is overloaded.
As far as I know, the write quorum is 2 out of 3 replicas, so overloading only one replica should not affect the throughput.
When I dug deeper, what I observed was that subsequent requests are blocked until the responses to the previous requests have arrived. This is mainly because of BufferedHTTPConnection, which stops issuing new requests until the previous response is read.
Why does OpenStack Swift use such a method?
Is this the usual behaviour of eventlet.green.httplib.HTTPConnection?
This does not make sense from a write-quorum point of view, because it is like waiting for all of the responses rather than a quorum.
Any ideas, or any workaround to stop this behaviour while using the same library?
It's not a problem with the library but a limitation of the OpenStack Swift configuration: the "workers" setting in all of the account/container/object server configs was set to 1.
Regarding the library: when new connections are made using eventlet.green.httplib.HTTPConnection, it does not block.
But if requests are reusing the same connection, subsequent requests are blocked until the previous response is fully read.
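To illustrate that last point, an httplib-style connection (plain or eventlet-green) handles one request/response at a time: the previous response must be fully read before the next request can go out on the same connection. A rough sketch, with the host and paths made up:

from eventlet.green.httplib import HTTPConnection

conn = HTTPConnection("replica-node", 6000)   # hypothetical storage node

conn.request("PUT", "/obj1", body=b"payload-1")
resp1 = conn.getresponse()
resp1.read()          # the connection cannot be reused until this completes

conn.request("PUT", "/obj2", body=b"payload-2")   # only possible after the read above
resp2 = conn.getresponse()
resp2.read()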

Is the preferred method of storing "session" data in Tornado through secure cookies?

For example, I log in to my server. I want to store things like a username. Would the best way of going about it be:
self.set_secure_cookie('username', "foobar70")
Just my opinion: secure cookies usually do a good job of storing data securely and work OK if you need to store small chunks of data, but passing lots of data back and forth in bigger cookies is annoying :) So the answer depends on your data volume.
I usually use this implementation of sessions in Tornado, based on Redis: https://gist.github.com/1735032
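For the small-data case, the basic secure-cookie round trip looks roughly like this (the cookie_secret value and the handler layout are illustrative):

import tornado.web


class BaseHandler(tornado.web.RequestHandler):
    def get_current_user(self):
        # Returns None if the cookie is missing or its signature is invalid.
        username = self.get_secure_cookie("username")
        return username.decode() if username else None


class LoginHandler(BaseHandler):
    def post(self):
        # The value is signed with the application's cookie_secret.
        self.set_secure_cookie("username", self.get_argument("username"))
        self.redirect("/")


application = tornado.web.Application(
    [(r"/login", LoginHandler)],
    cookie_secret="replace-with-a-long-random-secret",   # illustrative value
)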

Is it more efficient to parse external XML or to hit the database?

I was wondering, when dealing with a web service API that returns XML, whether it's better (faster) to call the external service each time and parse the XML (using ElementTree) for display on your site, or to save the records into the database (after parsing the XML once, or however many times you need to each day) and make database calls instead for that same information.
First off -- measure. Don't just assume that one is better or worse than the other.
Second, if you really don't want to measure, I'd guess the database is a bit faster (assuming the database is relatively local compared to the web service). Network latency usually is more than parse time unless we're talking a really complex database or really complex XML.
Everyone is being very polite in answering this question: "it depends"... "you should test"... and so forth.
True, the question does not go into great detail about the application and network topologies involved, but if the question is even being asked, then it's likely that a) the DB is "local" to the application (on the same subnet, or the same machine, or in memory), and b) the web service is not. After all, the OP uses the phrases "external service" and "display on your site." The phrase "parsing it once or however many times you need to each day" also suggests a set of data that doesn't exactly change every second.
The classic SOA myth is that the network is always available; going a step further, I'd say it's a myth that the network is always available with low latency. Unless your own internal systems are crap, sending an HTTP query across the Internet will always be slower than a query to a local DB or DB cluster. There are any number of reasons for this: number of hops to the remote server, outage or degradation issues that you can't control on the remote end, and the internal processing time for the remote web service application to analyze your request, hit its own persistence backend (aka DB), and return a result.
Fire up your app. Measure the latency and response times of some queries to your DB. Now do the same against the remote web service. Unless your DB is also across the Internet, you'll notice a huge difference.
It's not at all hard for a competent technologist to scale a DB, or for you to take the DB out of the request path entirely by caching with memcached and similar approaches; the latency between servers sitting near each other in the datacentre is monumentally less than between machines over the Internet (and more secure, to boot). Even if achieving this scale requires some thought, it's under your control, unlike a remote web service whose scaling and latency are totally opaque to you. I, for one, would not be too happy with the idea that the availability and responsiveness of my site depend on someone else entirely.
Finally, what happens if the remote web service is unavailable? Imagine a world where every request to your site involves a request over the Internet to some other site. What happens if that other site is unavailable? Do your users watch a spinning cursor of death for several hours? Do they enjoy an Error 500 while your site borks on this unexpected external dependency?
If you find yourself adopting an architecture whose fundamental features depend on a remote Internet call for every request, think very carefully about your application before deciding if you can live with the consequences.
Consuming the web service is more efficient because there are a lot more things you can do to scale your web services and web server (via caching, etc.). By consuming the middle layer, you also have the option to change the returned data format (e.g. you can decide to use JSON rather than XML). Scaling a database is much harder (involving replication, etc.), so in general, reduce hits on the DB if you can.
There is not enough information to be able to say for sure in the general case. Why don't you do some tests and find out? Since it sounds like you are using Python, you will probably want to use the timeit module (a sketch follows after the list below).
Some things that could affect the result:
Performance of the web service you are using
Reliability of the web service you are using
Distance between servers
Amount of data being returned
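A minimal timeit sketch for that comparison; fetch_and_parse_xml and query_database are hypothetical stand-ins for your own web-service call and DB query:

import timeit


def fetch_and_parse_xml():
    ...   # hypothetical: call the external service and parse with ElementTree


def query_database():
    ...   # hypothetical: run the equivalent query against your database


runs = 50
xml_time = timeit.timeit(fetch_and_parse_xml, number=runs) / runs
db_time = timeit.timeit(query_database, number=runs) / runs
print("web service + parse: %.4fs   database: %.4fs" % (xml_time, db_time))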
I would guess that if the data is cacheable, a cached version of it will be faster, but that does not necessarily mean using a local RDBMS; it might mean something like memcached or an in-memory cache in your application.
It depends - who is calling the web service? Is the web service called every time a user hits the page? If that's the case, I'd recommend introducing a caching layer of some sort - many web service APIs throttle the number of hits you can make per hour.
Whether you choose to parse the cached XML on the fly or call the data from a database probably won't matter (unless we are talking enterprise scaling here). Personally, I'd much rather make a simple SQL call than write a DOM Parser (which is much more prone to exceptional scenarios).
It varies from case to case; you'll have to measure (or at least make an educated guess).
You'll have to consider several things.
Web service
it might hit database itself
it can be cached
it will introduce network latency and might be unreliable
or it could be in local network and faster than accessing even local disk
DB
might be slow since it needs to access disk (although databases have internal caches, but those are usually not targeted)
should be reliable
The technology itself doesn't mean much in terms of speed: in one case the database parses SQL, in the other an XML parser parses XML, and the database is usually accessed via a socket as well, so you have both parsing and network in either case.
Caching data in your application if applicable is probably a good idea.
As a few people have said, it depends, and you should test it.
Often external services are slow, and caching them locally (in a database in memory, e.g., with memcached) is faster. But perhaps not.
Fortunately, it's cheap and easy to test.
Test, definitely. As a rule of thumb, XML is good for communicating between apps, but once you have the data inside your app, everything should go into a database table. This may not apply in all cases, but 95% of the time it has for me. Anytime I tried to store data any other way (e.g. XML in a content management system) I ended up wishing I had just used good old sprocs and SQL Server.
It sounds like you essentially want to cache results, and are wondering if it's worth it. If so, I would NOT use a database (I assume you are thinking of a relational DB): RDBMSs are not good for caching, even though many people use them that way. You don't need persistence or ACID.
If the choice were between Oracle/MySQL and the external web service, I would start with just using the service.
Instead, consider real caching systems, local or not (memcache, simple in-memory caches, etc.).
Or if you must use a DB, use a key/value store; BDB works well. Store the response message in its serialized form (XML): try to fetch it from the cache first, and if it's not there, fetch it from the service and parse it. Or if there's a convenient, more compact serialization, store and fetch that.
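A sketch of that cache-aside pattern with a simple in-process cache storing the serialized XML; the TTL and the use of urllib are illustrative assumptions:

import time
import urllib.request
import xml.etree.ElementTree as ET

_cache = {}          # url -> (fetch_timestamp, raw_xml_bytes)
CACHE_TTL = 3600     # one hour; illustrative


def get_parsed(url):
    """Return a parsed Element, hitting the remote service only on cache misses."""
    now = time.time()
    cached = _cache.get(url)
    if cached and now - cached[0] < CACHE_TTL:
        raw = cached[1]
    else:
        with urllib.request.urlopen(url) as resp:   # remote web-service call
            raw = resp.read()
        _cache[url] = (now, raw)                    # store the serialized form
    return ET.fromstring(raw)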
