How do you call a service which limits access to another service - python

We have a lot of small services (Python based) which all access the same external resource (a REST API). The external resource is a paid service with a given number of tickets per day. Every request, no matter how small or big, decreases the available tickets by one. Even if the request fails because of bad parameters or a timeout, a ticket is still consumed. This is usually not a big problem and can be dealt with at the per-service level. The real problem is the limit on parallel requests: we are allowed a certain number of parallel requests, and once we reach that limit the next requests fail and again consume our tickets.
So for me, a solution where each service handles this error itself and retries after a certain amount of time is not an option. That would be too costly in terms of tickets and far too inefficient.
My solution would be a special internal service that all other services call. It would act as a kind of proxy or middleman: it receives the requests, puts them in a queue, and processes them in a way that never exceeds the parallel-request limit.
Before I start to code, I would like to know whether there is a proper name for such a service and whether there are already solutions out there, because I can imagine others have run into the same problem. I also think that someone (probably not me) could build such a service completely independently of the actual external API.
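Roughly, I imagine something like the following minimal sketch, assuming asyncio and aiohttp (the upstream URL, port, and parallel limit are placeholders):

```python
# A minimal queueing middleman: an internal HTTP endpoint that forwards
# requests to the external API, never with more than MAX_PARALLEL in flight.
# aiohttp is assumed; the upstream URL, port, and limit are placeholders.
import asyncio
from aiohttp import ClientSession, web

MAX_PARALLEL = 5                       # the external API's parallel-request limit
semaphore = asyncio.Semaphore(MAX_PARALLEL)

async def proxy(request: web.Request) -> web.Response:
    payload = await request.json()
    async with semaphore:              # queue here until a slot is free
        async with ClientSession() as session:
            async with session.post("https://external.example/api",
                                     json=payload) as upstream:
                body = await upstream.text()
                return web.Response(status=upstream.status, text=body)

app = web.Application()
app.add_routes([web.post("/proxy", proxy)])

if __name__ == "__main__":
    web.run_app(app, port=8080)
```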
Thank you very much, and please, Stack Overflow gods, be kind to me.

Related

How do websites detect bots?

I am learning Python and I am currently scraping Reddit. Somehow Reddit has figured out that I am a bot (which my software actually is), but how do they know that? And how can we trick them into thinking that we are normal users?
I found a practical solution for that, but I am asking for a bit more in-depth theoretical understanding.
There's a large array of techniques that internet service providers use to detect and combat bots and scrapers. At the core of all of them is building heuristics and statistical models that can identify non-human-like behavior. Things such as:
Total number of requests from a certain IP per specific time frame: for example, anything more than 50 requests per second, 500 per minute, or 5000 per day may look suspicious or even malicious. Counting requests per IP per unit of time is a very common, and arguably effective, technique (a toy version of such a counter is sketched after this list).
Regularity of the incoming request rate: for example, a sustained flow of exactly 10 requests per second looks like a robot programmed to make a request, wait a fixed interval, make the next request, and so on.
HTTP headers. Browsers send predictable User-Agent headers with each request that help the server identify their vendor, version, and other information. In combination with other headers, a server might be able to figure out that requests are coming from an unknown or otherwise suspicious source.
A stateful combination of authentication tokens, cookies, encryption keys, and other ephemeral pieces of information that require subsequent requests to be formed and submitted in a particular manner. For example, the server may send down a certain key (via cookies, headers, the response body, etc.) and expect your browser to include or otherwise use that key in the subsequent requests it makes to the server. If too many requests fail to satisfy that condition, it's a telltale sign they might be coming from a bot.
Mouse and keyboard tracking techniques: if the server knows that a certain API can only be called when the user clicks a certain button, they can write front-end code to ensure that the proper mouse-activity is detected (i.e. the user did actually click on the button) before the API request is made.
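To make the first point concrete, a toy version of such a per-IP counter might look like this (the thresholds and data structure are purely illustrative):

```python
# Toy per-IP rate counter: flag an IP as suspicious once it exceeds a
# threshold of requests inside a sliding window. Thresholds are made up.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 500

_requests_by_ip = defaultdict(deque)   # ip -> timestamps of recent requests

def record_and_check(ip):
    """Record one request from `ip` and return True if it looks suspicious."""
    now = time.time()
    window = _requests_by_ip[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```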
And many, many more techniques. Imagine you are the person trying to detect and block bot activity. What approaches would you take to ensure that requests are coming from human users? How would you define human behavior as opposed to bot behavior, and what metrics could you use to discern the two?
There's a question of practicality as well: some approaches are more costly and difficult to implement. Then the question will be: to what extent (how reliably) would you need to detect and block bot activity? Are you combatting bots trying to hack into user accounts? Or do you simply need to prevent them (perhaps in a best-effort manner) from scraping some data from otherwise publicly visible web pages? What would you do in case of false-negative and false-positive detections? These questions inform the complexity and ingenuity of the approach you might take to identify and block bot activity.

How best for an API running on a web server to check its public availability / status

I have an AWS server that handles end-user registration. It runs an EC2 Linux instance that serves our API via Apache and Python, and it is connected to its data on a separate Amazon RDS instance running MySQL.
To administer the system remotely, I set states in a MySQL table to control the availability of the registration API to the public, and also the level of logging for our Python API, which may reference up to 5 concurrent admin preferences (i.e. not a single "log level").
Because our API provides almost two dozen different functions, we need to check the system's availability before any individual function is accessed. That means an SQL SELECT against that table (which has only one record) for every user transaction, each of which might involve a half-dozen API calls. We need to check whether the availability status has changed, so the user doesn't start an API call only to have the database become unavailable in the middle of the process. The same goes for the logging preferences.
The API calls return the server's availability and estimated downtime to the calling program (NOT a web browser interface), which handles that situation gracefully.
Is this a commonly accepted approach for handling this? Should I care that I'm over-polling the status table? And should I set up MySQL so that my constant checking of the status table is more efficient (e.g. cached) when Python fetches its data?
I should note that we might have thousands of simultaneous users making API requests, not tens of thousands, or millions.
Your strategy seems off-track, here.
Polling a status table should not be a major hot spot. A small table, with proper indexes, queried outside a transaction, is a lightweight operation. With an appropriately-provisioned server, such a query should be done entirely in memory, requiring no disk access.
But that doesn't mean it's a fully viable strategy.
We need to check to see if the availability status has changed, so the user doesn't start an API call and have the database become unavailable in the middle of the process.
This will prove impossible. You need time travel capability for this strategy to succeed.
Consider this: the database becoming unavailable in the middle of a process wouldn't be detected by your approach. Only the lack of availability at the beginning would be detected. And that's easy enough to detect, anyway -- you will realize that as soon as you try to do something.
Set appropriate timeouts. The MySQL client library should have support for a connect timeout, as well as a timeout that will cause your application to see an error if a query runs longer than is acceptable, or if a network disruption causes the connection to be lost mid-query. I don't know whether this exists or what it's called in Python, but in the C client library this is MYSQL_OPT_READ_TIMEOUT, and it is very handy for preventing a hang when, for whatever reason, you get no response from the database within an acceptable period of time.
Use database transactions, so that a failure to process a request results in no net change to the database. A MySQL transaction is implicitly rolled back if the connection between the application and the database is lost.
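For example, a sketch of both points -- timeouts and a transaction -- using PyMySQL (an assumption, since the question doesn't say which client library is in use; the host, credentials, and table are placeholders):

```python
# Sketch: client-side timeouts plus a transaction so a failed request leaves
# no net change. PyMySQL is an assumption; hosts, credentials, and the table
# are placeholders.
import pymysql

conn = pymysql.connect(
    host="db.example.internal",
    user="api",
    password="secret",
    database="registration",
    connect_timeout=5,       # fail fast if the server is unreachable
    read_timeout=10,         # surface an error if a query gets no response
    write_timeout=10,
)

try:
    with conn.cursor() as cursor:
        cursor.execute("INSERT INTO registrations (email) VALUES (%s)",
                       ("user@example.com",))
    conn.commit()            # persist only if every step succeeded
except pymysql.MySQLError:
    conn.rollback()          # a lost connection is rolled back server-side anyway
    raise
```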
Implementing error handling and recovery -- written into your code -- is likely a more viable approach than trying to prevent your code from running when the service is unavailable, because there is no check interval small enough to fully avoid the database becoming unavailable "in the middle" of a request.
In any event, polling a database table with each request seems like the wrong approach, not to mention that an outage of the server holding the status table makes your service fail unnecessarily, even though the service itself might have been healthy but unable to prove it.
On the other hand, I don't know your architecture, but assuming your front end involves something like an Amazon Application Load Balancer or HAProxy, the health checks against the API service endpoint can actually perform the test. If you configure your check interval to, say, 10 seconds, and a request to the check endpoint (say GET /health-check) actually verifies end-to-end availability of the necessary components (e.g. database access), then the API service can effectively take itself offline when a problem occurs. It remains offline until the check starts returning success again.
The advantage here is that the workload involved in health checking is consistent -- it happens every 10 seconds, increasing with the number of nodes providing the service, but not with actual request traffic, because you don't have to perform a check for each request. It does mean there is a window of a few seconds between the actual loss of availability and its detection, but the requests that slip through in the meantime would have failed anyway.
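For illustration, a minimal check endpoint of this kind might look like the sketch below (Flask, PyMySQL, and the SELECT 1 probe are assumptions on my part, not anything prescribed by your setup):

```python
# Sketch: a /health-check endpoint that verifies end-to-end database access,
# suitable as a load-balancer health-check target. Flask and PyMySQL are
# assumptions; connection details are placeholders.
from flask import Flask, jsonify
import pymysql

app = Flask(__name__)

@app.route("/health-check")
def health_check():
    try:
        conn = pymysql.connect(host="db.example.internal", user="api",
                               password="secret", database="registration",
                               connect_timeout=2, read_timeout=2)
        try:
            with conn.cursor() as cursor:
                cursor.execute("SELECT 1")      # cheap end-to-end probe
        finally:
            conn.close()
        return jsonify(status="ok"), 200
    except pymysql.MySQLError:
        # Any non-2xx response lets the load balancer take this node offline.
        return jsonify(status="unavailable"), 503

if __name__ == "__main__":
    app.run(port=8080)
```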
HAProxy -- and presumably other tools like Varnish or Nginx -- can help you handle graceful failures in other ways as well, by timing out failed requests at a layer in front of the API endpoint so that the caller gets a response even though the service itself didn't respond. An example from one of my environments is a shopping page where an external API call is made by the application when a site visitor browses items by category. If this request runs longer than it should, the proxy can interrupt it and return a preconfigured static response -- say, in JSON or XML, that the requesting application will understand -- so that the hard failure becomes a softer one. In this case, for example, the canned response can be an empty JSON array of "items found."
It isn't entirely clear to me whether these APIs are yours, or external APIs that you are aggregating. If the latter, then HAProxy is a good solution here too, but facing the other direction -- the back end faces outward and your service contacts its front end. You access the external service through the proxy, and the proxy checks the remote service and will immediately return an error to your application if the target API is unhealthy. I use this solution to access an external trouble-ticketing system from one of my apps. An additional advantage here is that the proxy logs let me collect usage, performance, and reliability data about all of the many requests passed to that external service, regardless of which of dozens of internal systems access it, with far better visibility than I could achieve if I tried to collect that data from all of the internal application servers that access the external service.

Facebook's Graph API's request limit on a locally run program? How to get specific data in real time without reaching it?

I've been writing a program in Python which needs to know the number of likes of a specific Facebook page in real time. The program itself works, but it's based on a loop that constantly requests the number of likes and stores it in a variable, and I'm afraid that this way it will soon hit the API's request limit.
I read that the Graph API's request limit per user for an application is 200 requests per hour. Is a locally run program like this one considered an application with one user, or how is it classified?
Also, I read that some users say the API can handle 600 requests per 600 seconds without returning an error; does this still apply? (Could I, for example, delay the loop by one second and still be able to make all the requests?) If not, is there a way to get that info in real time in a local program? (I saw that Graph can send you updates with a POST to a specified URL, but is there a way to receive those updates without owning a URL? Maybe a way to renew the token or something?) I need this program to run for almost a whole day, so not being rejected by the API is quite important.
Sorry if this sounds silly; it's the first time I'm using the Graph API (or a web-based API in general).
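To be concrete, my loop is essentially the sketch below, with the one-second delay I am asking about added (the endpoint, field name, and token are placeholders, and the limits mentioned in the comments are not confirmed):

```python
# Simplified version of the loop, throttled to at most one request per second.
# The endpoint, field name, and token are placeholders; the real rate limits
# should be taken from Facebook's documentation, not from this sketch.
import time
import requests

ACCESS_TOKEN = "YOUR_TOKEN"
PAGE_ID = "YOUR_PAGE_ID"
URL = "https://graph.facebook.com/{}".format(PAGE_ID)

while True:
    resp = requests.get(URL,
                        params={"fields": "fan_count",
                                "access_token": ACCESS_TOKEN},
                        timeout=10)
    if resp.status_code == 200:
        likes = resp.json().get("fan_count")
        print("current likes:", likes)
        time.sleep(1)        # pace the loop: no more than one call per second
    else:
        time.sleep(60)       # back off if the API starts rejecting requests
```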

What happens to the supernumerary requests when a Django app gets flooded?

I have a web app using Django. The app has a maximum capacity of N RPS, while the client sends M RPS, where M>N. In other words, the app receives more requests than it can handle, and the number of unprocessed requests will grow linearly over time (after t sec, the number of requests that are waiting to be processed is (M-N) * t)
I would like to know what will happen to these requests. Will they accumulate in memory until the memory is full? Will they get "canceled" after some conditions are met?
It's hard to answer your question directly without details about your configuration. Moreover, under extremely high load it's really hard to determine exactly what will happen. But one thing is certain: you can't be sure that all those requests will be handled correctly.
If you are able to measure how many requests per second your application can handle and you want it to stay reliable beyond N requests, then it's probably a good start to think about some kind of load balancer, which will spread your requests over multiple server machines.
To answer your question, I can think of a few possibilities where a request can't be handled correctly:
The client cancelled its request (maybe a browser, which can have a maximum execution-time limit).
The request's execution time exceeded the timeout limit set in the web server configuration (because of a lack of resources, too many I/O operations, ...).
Another service timed out (for example a blocked PostgreSQL query, or a Memcached server that failed to respond).
The server machine is overloaded and a TCP connection can't be established.
The web server of your choice can only handle the number of requests / queue length specified in its configuration and rejects anything over that limit (in Apache, for example, the ListenBacklog directive: http://httpd.apache.org/docs/2.2/mod/mpm_common.html#listenbacklog). A sketch of the equivalent knobs in a Python application server follows this list.
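As an aside, if the app is served by something like Gunicorn (an assumption -- the question doesn't say which application server fronts Django), the equivalent knobs look like this sketch of a gunicorn.conf.py:

```python
# Sketch of a gunicorn.conf.py (Gunicorn is an assumption; the question doesn't
# say which application server fronts Django). Connections beyond `backlog`
# are refused by the kernel, and workers exceeding `timeout` are killed, so
# their requests are dropped rather than queued forever.
bind = "0.0.0.0:8000"
workers = 4        # roughly how many requests are processed in parallel
backlog = 2048     # pending-connection queue length (Gunicorn's default)
timeout = 30       # seconds before a stuck worker is killed and restarted
```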
Try reading about the C10K problem; it could be useful for thinking about this more deeply.

Will a website recover from a Denial of Service attack on its own?

I'm in a rather peculiar situation right now. To make a long story short, I'm part of a (real-life) volunteer organization of about 2000 members. Our current website was built and maintained by a member who is no longer part of the organization (he quit). Unfortunately, he was the only one who actually had access to the server, and he hasn't been cooperative in handing over the reins to someone else after he left. As a result, I and a small team of people have been working on creating a new website for ourselves from scratch. The data on the original website would be awesome to have for the new one, so without direct access to the database we have been screen-scraping what we need.
Which brings me to my current conundrum. The screen-scraping script I was using was really slow, so I had the brilliant (not) idea of parallelizing it. I assumed the bottleneck was my slow internet connection, so I foolishly decided to run 250 threads at once. After I tried that, the web server mysteriously went down and hasn't come back up since (it's been about 30 minutes now).
I'm not any kind of hacker or security expert, but I'm pretty sure I just accidentally caused a Denial of Service attack on the server. Which brings me to my question - assuming the owner of the website does nothing to help us, will the server come back to life of its own accord? (It's a Django site hosted on Linode, if that matters.) How do websites typically recover from DoS attacks? Have I potentially misdiagnosed what's going on, and could there be an alternative explanation? Or is the website lost forever?
Edit: All 250 of the requests were simple http requests going to pages within the Django admin panel if that changes anything.
More than likely the system is not truly down for good, unless the guy got angry or the hosting provider disabled it because of the traffic load. There are a number of things to consider, but 250 connections isn't that much load, even for a shared hosting account, unless you were just flooding the server with requests.
Depending on what technology is used, there are a number of things that "could" have happened.
You could have simply hit throttling limits on the web-server side for queuing, etc., which might require the application to be restarted. This could happen automatically after a period of time or need intervention from the hosting provider.
You could have overloaded the application and made it use so much memory that it was forcefully shut down. Some hosting providers will do this, but typically only for small windows of time, and will allow the application to start back up. (Give it an hour or so.)
You could have pushed it over the monthly bandwidth limit; in that case, it could be down until the next billing cycle...
Without knowing the hosting provider or environment, these are just guesses.
I would strongly recommend turning off your scraper, though!
You should stop your screen-scraping software if you have not already.
Depending on what part of the system is down (the database, the server, the network, or all of them), there is a chance it will recover by itself once the load comes back down.
If your application cannot sustain 250 simultaneous connections, you will want to investigate why. The culprit is usually database load (no indexes, un-optimized queries).
Linode could also have restrictions in place to limit how much bandwidth can be used within a certain period of time. You should probably contact them (or whoever is in charge).
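If the scraping is ever resumed (ideally with the owner's permission), a throttled version with a small, bounded level of concurrency would put far less strain on the server. A sketch, assuming the requests library; the URLs, delay, and worker count are placeholders:

```python
# Sketch: scrape with a small, bounded thread pool and a polite delay instead
# of 250 unbounded threads. The URLs, worker count, and delay are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URLS = ["https://old-site.example/admin/page/{}".format(i) for i in range(100)]

def fetch(url):
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    time.sleep(1)                 # pause between requests on each worker
    return resp.text

with ThreadPoolExecutor(max_workers=2) as pool:   # at most 2 requests in flight
    pages = list(pool.map(fetch, URLS))
```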
