Python Web Scraping HTTP 400 - python

I'm running a web scrape with Python (using the Scrapy framework). The scrape works for about an hour, and then every request comes back with an HTTP 400 error code.
Is this likely to be an IP-based rate limiter or a scrape-detection tool? Any advice on how I might investigate the root cause further?

I think the problem is the request rate. Try setting a download_delay. If you can then request more pages before the 400 errors start, keep adjusting download_delay until you can fetch the full site. Some websites state a suitable delay in their robots.txt file.
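As a rough sketch of that idea, the standard library can read a Crawl-delay value out of robots.txt, and the result can be fed into Scrapy's DOWNLOAD_DELAY setting. The robots.txt content below is made up for illustration, and the helper name crawl_delay_from_robots is hypothetical; a real crawl would fetch the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def crawl_delay_from_robots(robots_txt, user_agent="*", default=1.0):
    """Return the Crawl-delay for user_agent, or default if none is declared."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


delay = crawl_delay_from_robots(SAMPLE_ROBOTS_TXT)

# Scrapy reads these names from settings.py or from a spider's custom_settings.
custom_settings = {
    "DOWNLOAD_DELAY": delay,       # seconds between requests to the same site
    "AUTOTHROTTLE_ENABLED": True,  # let Scrapy adapt the delay to server latency
}
```

If the site declares no Crawl-delay, the default is only a starting guess and may need to be raised until the 400 responses stop.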

It could be a rate limiter.
However, a 400 error generally means that the client request was malformed and was therefore rejected by the server.
You should start by investigating this. When your requests start failing, exit your program and immediately start it again. If it works again straight away, you are probably not being rate-limited (an IP-based limit would still be in effect), and something is going wrong with how your requests are formed later in the run.
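One way to look for such a difference is to keep a snapshot of an early, successful request and compare it with one that fails. The sketch below assumes you have copied the method and headers of both requests into plain dicts (in Scrapy these could come from request.method and request.headers). The sample values, including the growing cookie, are invented to show the output and are not a diagnosis of this particular site.

```python
def diff_requests(early, late):
    """Return {field: (early_value, late_value)} for each field that differs."""
    fields = sorted(set(early) | set(late))
    return {
        f: (early.get(f), late.get(f))
        for f in fields
        if early.get(f) != late.get(f)
    }


# Hypothetical snapshots: one request from the first minutes, one after failures began.
early_request = {"method": "GET", "User-Agent": "mybot/1.0", "Cookie": "session=abc"}
late_request = {
    "method": "GET",
    "User-Agent": "mybot/1.0",
    "Cookie": "session=abc; " + "t=1; " * 2000,
}

changed = diff_requests(early_request, late_request)
for field, (before, after) in changed.items():
    # Print lengths rather than values, since oversized headers are easy to miss.
    print(field, "length", len(str(before)), "->", len(str(after)))
```

Any field that shows up in the diff is a candidate for what the server starts rejecting.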

Related

How to send get requests to nclt website using Python requests

I have a technical support issue. I am trying to send a GET request using Python; the code is given below:
import requests
res=requests.get('https://nclt.gov.in/')
This request hangs for a long time on my droplet server, whereas on my local system it works fine and I get a response within a second. I don't know what is going on with this site.
I have tested other websites from the droplet and I get a response from all of them except this one, and I have no idea why.
I have also tried the following:
setting the User-Agent in the headers
using cookies
but I still get no response. I have been trying for the last 24 hours and cannot find the exact reason.
Is there an issue with the droplet, and do I have to configure anything? I don't think there is any validation on 'http://nclt.gov.in', because I am sending just a GET request and it works fine on my local machine.

How can I bypass the 429-error from www.instagram.com?

I'm asking for help today because I have a problem with Selenium.
My goal is to build a fully automated bot that creates an account from parsed details (mail, password, birth date...). So far I have almost finished the bot (I just need to access Gmail and get the confirmation code).
My problem is that, after trying a lot of things, I get: Failed to load resource: the server responded with a status of 429 ()
So I guess Instagram is blocking me.
How can I bypass this?
The answer is in the description of the HTTP error code. You are being blocked because you made too many requests in a short time.
Reduce the rate at which your bot makes requests and see if that helps. As far as I know there's no way to "bypass" this check by the server.
Check if the response header has a Retry-After value to tell you when you can try again.
A status code of 429 means that you have sent too many requests to Instagram's servers in a short time, which is why Instagram has blocked your IP.
This is done mainly to protect against DDoS attacks.
The best thing is to try again after some time (there might be a Retry-After header in the response).
Also, increase the interval between requests and cap the number of requests made within a given period (say, 1 hour).
Honoring the Retry-After header is best practice. However, there is no such response header in this scenario.
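A minimal sketch of how a client might decide how long to wait after a 429: use Retry-After when the server sends it (either a number of seconds or an HTTP date), and fall back to exponential backoff when it is missing, as in this case. The function name seconds_to_wait and the base and cap values are assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def seconds_to_wait(retry_after, attempt, now=None, base=2.0, cap=300.0):
    """Seconds to sleep before retrying a 429 response.

    retry_after is the raw Retry-After header value, or None if absent.
    attempt is the zero-based retry count, used for the backoff fallback.
    """
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            # Form 1: delay in seconds, e.g. "120".
            return float(value)
        try:
            # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # No usable header: exponential backoff, capped so waits stay bounded.
    return min(cap, base * (2 ** attempt))
```

For example, seconds_to_wait(None, 3) gives 16.0 with the defaults above, and the result can be passed straight to time.sleep before the next attempt.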

Requests Library producing Error Code 444 with Amazon EC2

I'm making a bunch of web calls to scrape data using Joblib and Requests. They work fine on my personal computer, but when I moved the code to an Amazon EC2 t3a.micro instance, I ran into request error 444: requests.exceptions.HTTPError: 444 Client Error: Unknown for url. I believe this is because I am making too many requests in a limited time frame, but I am not sure what exactly the problem is or how to fix it.
According to this httpstatus page the error comes from:
A non-standard status code used to instruct nginx to close the
connection without sending a response to the client, most commonly
used to deny malicious or malformed requests.
So if you are 100% sure that you send the same payload from EC2 and that it is correct, then it could indeed be the number of requests. How you limit the number of requests usually depends on the code you wrote. I have seen people call sleep(5) between requests to avoid flooding the server. I'm not sure whether that would work in your specific case, though.
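A slightly more careful variant of the sleep(5) idea is to sleep only for whatever is left of the interval since the previous request. The Throttle class below is a hypothetical helper, not part of Requests or Joblib, and the clock and sleep parameters exist only so it can be exercised without real waiting.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds between consecutive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch: call throttle.wait() immediately before each request.
throttle = Throttle(min_interval=5)
```

Note that with Joblib's process-based workers each worker would get its own Throttle, so the combined request rate would be roughly the per-worker rate multiplied by the number of workers.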

PYTHON : How to make my request.get() last for a few second?

To test my API, I need to send a request to my viewer URL, which has a tracking service that tells my API how much time I've spent on the page (classic).
I have this small function in my tests:
import requests
def does_it_track(response, **kwargs):
    # some unrelated actions
    r = requests.get('my_viewer_url')
This request works fine, but it lasts less than a second, which doesn't let me test my statistics generator or my tracker's precision.
I've tried:
This SO question: how to make python request.get wait a few seconds? It didn't help.
The sleep method (but I got has no attribute 'sleep').
Repeating the request, but that obviously creates several stats and I only need one longer one.
Does anyone know a not-too-complicated way to make my request wait on my page?
I'm on Python 2.7.
Thank you!
"how many time you've spent on the page" has nothing to do with the HTTP request/response cycle, but with your browser.
From the server's point of view, the server gets a request, returns a response and the job is over, period - and from the client's point of view once the server returned a response the HTTP transaction is over too. There's not even a notion of "page" here, only HTTP request and response.
Your "tracker" is (obviously) using JavaScript to send data from the browser itself (most likely by sending a request every X seconds indicating the page is still displayed in the browser). IOW, the only way to test this is to use a headless browser that will execute JavaScript.
Try VCR; it might help you solve your issue.
It lets you record your request so you can see what's happening.

Troubleshooting 404 received by python script

I have a python script that pings 12 pages on someExampleSite.com every 3 minutes. It's been working for a couple months but today I started receiving 404 errors for 6 of the pages every time it runs.
So I tried going to those urls on the pc that the script is running on and they load fine in Chrome and Safari. I've also tried changing the user agent string the script is using and that also didn't change anything. Also I tried removing the ['If-Modified-Since'] header which also didn't change anything.
Why would the server be sending my script a 404 for these 6 pages but on that same computer I can load them in Chrome and Safari just fine? (I made sure to do a hard refresh in Chrome and Safari and they still loaded)
I'm using urllib2 to make the request.
There could be multiple reasons for this, such as the server rejecting your request because of missing headers, or throttling.
You could record the request headers Chrome sends (using HTTP Headers), then replay the request with the Python requests library, adding all of your browser's headers. Then try changing or removing headers one at a time to see what exactly makes the difference.
So I figured out what the problem was.
The website is returning an erroneous response code for these 6 pages. Even though it returns a 404, it also returns the web page. Chrome and Safari seem to ignore the response code and display the page anyway, whereas my script aborts on the 404.
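For a server that behaves this way, the script can keep the body instead of aborting, because the HTTPError raised for a 404 is itself a readable response object. The sketch below uses Python 3's urllib.request (urllib2's HTTPError in Python 2 can be read in the same way). The function name fetch_even_on_error and the fake opener are hypothetical and only there to demonstrate the behaviour without a network call.

```python
import io
import urllib.error
import urllib.request


def fetch_even_on_error(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes), keeping the body of 4xx/5xx replies."""
    try:
        with opener(url) as response:
            return response.getcode(), response.read()
    except urllib.error.HTTPError as err:
        # HTTPError is also a file-like response, so the page is still readable.
        return err.code, err.read()


def _fake_404_opener(url):
    """Stand-in for urlopen that mimics a server sending a page with a 404."""
    body = io.BytesIO(b"<html>the real page</html>")
    raise urllib.error.HTTPError(url, 404, "Not Found", {}, body)


status, body = fetch_even_on_error(
    "http://someExampleSite.com/page", opener=_fake_404_opener
)
print(status, len(body))
```

The caller can then decide whether a 404 with a non-empty body should be treated as a usable page for this particular site.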
