How do websites detect bots? - python

I am learning Python and I am currently scraping Reddit. Somehow Reddit has figured out that I am a bot (which my software actually is), but how do they know that? And how can we trick them into thinking that we are normal users?
I found a practical solution for that, but I am asking for a bit more in-depth theoretical understanding.

There's a large array of techniques that online services use to detect and combat bots and scrapers. At the core of all of them is building heuristics and statistical models that can identify non-human-like behavior. Things such as:
Total number of requests from a certain IP per specific time frame: for example, anything more than 50 requests per second, 500 per minute, or 5000 per day may seem suspicious or even malicious. Counting the number of requests per IP per unit of time is a very common, and arguably effective, technique (a minimal sketch of this idea follows after this list).
Regularity of the incoming request rate: for example, a sustained flow of 10 requests per second may look like a robot programmed to make a request, wait a little, make the next request, and so on.
HTTP headers. Browsers send predictable User-Agent headers with each request that help the server identify the browser's vendor, version, and other information. In combination with other headers, a server might be able to figure out that requests are coming from an unknown or otherwise exploitative source.
A stateful combination of authentication tokens, cookies, encryption keys, and other ephemeral pieces of information that require subsequent requests to be formed and submitted in a special manner. For example, the server may send down a certain key (via cookies, headers, the response body, etc.) and expect your browser to include or otherwise use that key in the subsequent requests it makes to the server. If too many requests fail to satisfy that condition, it's a telltale sign they might be coming from a bot.
Mouse and keyboard tracking techniques: if the server knows that a certain API can only be called when the user clicks a certain button, it can write front-end code to ensure that the proper mouse activity is detected (i.e. the user did actually click the button) before the API request is made.
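As a hedged illustration of the first two points (request counting and request-rate regularity per IP), here is a minimal server-side sketch using a sliding one-minute window. The threshold, window size, and function name are arbitrary examples, not any particular site's implementation:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # illustrative sliding window
MAX_REQUESTS = 500    # illustrative per-IP budget for that window

# ip -> timestamps of recent requests
recent_requests = defaultdict(deque)

def looks_like_a_bot(ip: str) -> bool:
    """Return True if this IP exceeds the request budget for the sliding window."""
    now = time.time()
    timestamps = recent_requests[ip]
    timestamps.append(now)

    # Drop timestamps that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()

    # Too many requests in the window is the simple "count per IP" heuristic;
    # a suspiciously even spacing between the timestamps would feed the
    # "regularity" heuristic described above.
    return len(timestamps) > MAX_REQUESTS
```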
And many, many more techniques. Imagine you are the person trying to detect and block bot activity. What approaches would you take to ensure that requests are coming from human users? How would you define human behavior as opposed to bot behavior, and what metrics could you use to discern the two?
There's a question of practicality as well: some approaches are more costly and difficult to implement than others. The question then becomes: to what extent (how reliably) do you need to detect and block bot activity? Are you combating bots trying to hack into user accounts? Or do you simply need to prevent them (perhaps on a best-effort basis) from scraping data from otherwise publicly visible web pages? What would you do about false-negative and false-positive detections? These questions inform the complexity and ingenuity of the approach you might take to identify and block bot activity.

Related

What do you call a service which limits access to another service?

We have a lot of small services (Python based) which all access the same external resource (a REST API). The external resource is a paid service with a given number of tickets per day. Every request, no matter how small or big, decreases the available tickets by one. Even if the request fails due to some error in the request parameters or a timeout, the tickets still get reduced. This is usually not a big problem and can be dealt with at a per-service level. The problem starts with the limit on maximum parallel requests. We are allowed a certain number of parallel requests, and if we reach that limit the next requests fail and again decrease our available tickets.
So for me, a solution where each service handles this error and retries after a certain amount of time is not an option. It would be too costly in terms of tickets and also way too inefficient.
My solution now would be to have a special internal service which all the other services call. It would act as a kind of proxy or middleman that receives the requests, puts them in a queue, and processes them in a way that never exceeds the parallel request limit (a rough sketch of this idea is shown below).
Before I start to code, I would like to know if there is a proper name for such a service and if there are already some solutions out there, because I can imagine someone else could also have these problems. I also think that someone (probably not me) could create such a service completely independently of the actual external API.
Thank you very much and please stackoverflow gods be kind with me.
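A rough sketch of the middleman idea described in the question, assuming a threaded design: a fixed pool of worker threads drains a shared queue, so the number of workers itself caps the number of parallel requests. The URL, limit, and function names are placeholders, not the real paid service:

```python
import queue
import threading

import requests

MAX_PARALLEL = 5                              # assumed parallel-request limit of the paid API
API_URL = "https://api.example.com/endpoint"  # placeholder, not the real service

jobs = queue.Queue()  # (params, result_box) tuples submitted by the other services

def worker() -> None:
    """Drain the job queue; the number of workers bounds the in-flight requests."""
    while True:
        params, result_box = jobs.get()
        try:
            resp = requests.get(API_URL, params=params, timeout=30)
            result_box.put(resp.json())
        except requests.RequestException as exc:
            result_box.put(exc)
        finally:
            jobs.task_done()

# Exactly MAX_PARALLEL workers -> never more than MAX_PARALLEL requests in flight.
for _ in range(MAX_PARALLEL):
    threading.Thread(target=worker, daemon=True).start()

def call_external_api(params: dict):
    """What the other services would call instead of hitting the API directly."""
    result_box = queue.Queue(maxsize=1)
    jobs.put((params, result_box))
    return result_box.get()  # blocks until the middleman has processed the job
```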

How does a server detect whether some sort of automated HTTP request is being made

I was making some test HTTP requests using Python's requests library. When searching for Walmart's Canadian site (www.walmart.ca), I got this:
How do servers like Walmart's detect that my request is being made programmatically? I understand browsers send all sorts of metadata to the server. I was hoping to get a few specific examples of how this is commonly done. I've found a similar question, albeit related to Selenium WebDriver, here, where it's claimed that some vendors provide this as a service, but I was hoping to get something a bit more specific.
Appreciate any insights, thanks.
As mentioned in the comments, a real browser sends many different values - headers, cookies, data. It reads from the server not only HTML but also images, CSS, JS, and fonts. A browser can also run JavaScript, which can collect other information about the browser (version, extensions, data in local storage, etc.) and about the user (e.g. how you move the mouse). And a real human loads/visits pages with random delays and in a rather random order. All these elements can be used to detect a script. Servers may use very complex systems, even machine learning (artificial intelligence), and use data gathered over a few minutes or hours to compare against your behavior.
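To make the header point concrete, here is a small sketch using the public httpbin.org echo service to show what a bare requests call sends compared with a more browser-like header set. The URL and header values are purely illustrative, and matching headers alone does nothing about cookies, JavaScript, or the behavioral signals described above:

```python
import requests

# A plain requests.get() announces itself: the default User-Agent is
# something like "python-requests/2.x", which is trivial for a server to flag.
resp = requests.get("https://httpbin.org/headers")
print(resp.json())  # shows the headers the server actually received

# Browser-like headers make the request look less obviously scripted.
browser_like = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
resp = requests.get("https://httpbin.org/headers", headers=browser_like)
print(resp.json())
```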

How do I sanitise a user-provided URL before a GET request?

I am writing a Telegram bot that notifies me when a web page changes. I provide a URL, which is periodically fetched via requests.get and hashed. If the hash is different from the previous one, I'm notified.
I am willing to open it up to the rest of the community, but then I need to guard against malicious usage and abuse. So far I have guarded against users providing links to gigantic files, but, given my minimal knowledge of the subject, I suspect there is more to it than that.
My questions are then:
What kinds of attacks am I exposing myself to?
How do I defend?
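Not a complete answer to either question, but a hedged sketch of the kinds of guards implied above: an http(s)-only scheme check, a basic refusal of private/loopback addresses (to limit server-side request forgery), a timeout, and a hard size cap while streaming. The constants and function name are made up, and redirects or DNS rebinding can still defeat the naive address check:

```python
import hashlib
import ipaddress
import socket
from urllib.parse import urlparse

import requests

MAX_BYTES = 1_000_000   # illustrative cap on downloaded content
TIMEOUT = 10            # seconds

def fetch_hash(url: str) -> str:
    """Fetch a user-provided URL defensively and return a SHA-256 hash of the body."""
    parsed = urlparse(url)

    # Only allow plain http/https; reject file://, ftp://, and friends.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("only http(s) URLs are allowed")

    # Resolve the hostname and refuse private/loopback addresses (basic SSRF guard).
    addr = socket.gethostbyname(parsed.hostname)
    if not ipaddress.ip_address(addr).is_global:
        raise ValueError("URL resolves to a private or local address")

    # Stream the response with a timeout and a hard size cap.
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=TIMEOUT) as resp:
        resp.raise_for_status()
        total = 0
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            total += len(chunk)
            if total > MAX_BYTES:
                raise ValueError("response too large")
            digest.update(chunk)
    return digest.hexdigest()
```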

Facebook Graph API's request limit for a locally run program? How to get specific data in real time without reaching it?

I've been writing a program in Python which needs the number of likes of a specific Facebook page in real time. The program itself works, but it's based on a loop that constantly requests the number of likes and updates it in a variable, and I'm afraid that this way it will soon reach the API's request limit.
I read that the Graph API's request limit per user for an application is 200 requests per hour. Is a locally run program like this one considered an application with one user, or how is it counted?
Also, I read that some users say the API can handle 600 requests per 600 seconds without returning an error; does this still apply? (Could I, for example, delay the loop by one second and still be able to make all the requests?) If not, is there a solution to get that info in real time from a local program? (I saw that Graph can send you updates with a POST to a specified URL, but is there a way to receive those updates without owning a URL? Maybe a way to renew the token or something?) I need to have this program running for almost a whole day, so not being rejected by the API is quite important.
Sorry if it sounds silly or anything; this is the first time I'm using the Graph API (and a web-based API in general).
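A minimal sketch of the "delay the loop" idea from the question: pace the polling so it stays under a fixed hourly budget. The page ID, access token, API version, and the 200-per-hour figure are placeholders/assumptions taken from the question, not verified limits:

```python
import time

import requests

PAGE_ID = "<page-id>"            # placeholder
ACCESS_TOKEN = "<access-token>"  # placeholder
REQUESTS_PER_HOUR = 200          # assumed per-user budget from the question
INTERVAL = 3600 / REQUESTS_PER_HOUR  # ~18 seconds between calls

while True:
    resp = requests.get(
        f"https://graph.facebook.com/v19.0/{PAGE_ID}",
        params={"fields": "fan_count", "access_token": ACCESS_TOKEN},
        timeout=10,
    )
    if resp.ok:
        print("current likes:", resp.json().get("fan_count"))
    else:
        # Back off instead of hammering the API once it starts rejecting requests.
        print("request failed with status", resp.status_code)
    time.sleep(INTERVAL)
```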

How to return different responses to multiple identical requests based on whether a response has already been sent?

Let's say my Python server has three different responses available, and one user sends three HTTP requests at the same time.
How can I make sure that each request gets one unique response out of my three different responses?
I'm using Python and MySQL.
The problem is that even though I store the already-responded status in MySQL, it's a bit too late by the time the next request comes in.
For starters, if MySQL isn't keeping up with your performance requirements (and understandably so - this doesn't sound like a use case it's well suited to), consider something like an in-memory cache or, for more flexibility, Redis: it's built for exactly this kind of thing and will likely respond much, much quicker. As an added bonus, it's even simpler to work with than SQL.
Second, consider hashing some user and request details and storing that hash with the response so you can identify it later. Upon receiving a request, store an entry with a 'pending' status, and only handle requests whose entry is 'pending' - never ones that are missing entirely.
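A minimal sketch of the Redis suggestion, assuming the redis-py client and a key naming scheme made up for this example (responses:<user_id>). The point is that LPOP is atomic on the Redis server, so concurrent requests cannot receive the same response:

```python
import redis

r = redis.Redis()  # assumes a local Redis instance and the redis-py client

# One-time setup: preload the three available responses for this user.
r.delete("responses:user42")
r.rpush("responses:user42", "response-A", "response-B", "response-C")

def handle_request(user_id: str) -> str:
    """Hand out one of the remaining responses for this user, exactly once each."""
    # LPOP is atomic, so three simultaneous requests each pop a different
    # element; there is no read-then-write race like the "check MySQL,
    # then respond" approach has.
    value = r.lpop(f"responses:{user_id}")
    if value is None:
        return "no responses left"
    return value.decode()
```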
