How to send GET requests to the NCLT website using Python requests

I have an issue I need technical support with. I am trying to send a GET request using Python; the code is given below:
import requests
res = requests.get('https://nclt.gov.in/')
This request gets stuck for a long time on my droplet server, whereas on my local system it works fine and I get a response within a second. I am not able to send the GET request from the droplet at all, and I don't know what is going on with this site.
I have tested with different websites and I get responses from all of them except this one, and I have no idea why.
I have also tried the following:
setting a User-Agent in the headers
using cookies
but I am still not getting a response. I have been trying for the last 24 hours and have not been able to find the exact reason behind this.
Is there an issue with the droplet, and do I have to configure anything? I don't think there is any validation on 'http://nclt.gov.in', because I am sending a plain GET request and it works fine on my local machine without any problem.
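For what it's worth, adding an explicit timeout at least makes the hang fail fast instead of blocking indefinitely, which helps confirm whether the connection is being silently dropped. A minimal sketch of the request as described; the User-Agent value is an illustrative placeholder:

import requests

# Same GET request, with a browser-like User-Agent and an explicit
# (connect, read) timeout so a hang raises an exception instead of blocking.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}  # placeholder value
try:
    res = requests.get('https://nclt.gov.in/', headers=headers, timeout=(5, 15))
    print(res.status_code)
except requests.exceptions.Timeout:
    print('Connect/read timed out - the server may be dropping this traffic')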

Related

Selenium Python get data from HTTP request

I am running automation with Selenium and Python on the Opera web driver. When I enter the specific page that I need, a request is sent to the server; it is authenticated with anti-content protection, which blocks me from replaying it. So the only solution is to get the returned JSON after the browser sends the request. I checked selenium-wire, but I think it doesn't fit my needs. Is there another way to do this? Any suggestions?
You can try Titanium Web Proxy. It is a proxy server that can be installed via a NuGet package and used with Selenium.
// Inside the proxy's response handler, the body can be read as a string:
string body = await e.GetResponseBodyAsString();
References:
https://github.com/justcoding121/Titanium-Web-Proxy/issues/176
https://www.automatetheplanet.com/webdriver-capture-modify-http-traffic/#tab-con-9
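For readers who can stay in Python end to end, the same capture-the-response idea is what selenium-wire provides (the asker felt it didn't fit their case, but it covers exactly this pattern). A minimal sketch using selenium-wire's documented request/response attributes; the page URL and endpoint filter are placeholder assumptions:

from seleniumwire import webdriver  # pip install selenium-wire

# Drive the page, then pull the JSON body out of the captured traffic.
driver = webdriver.Chrome()
driver.get('https://example.com/page')  # placeholder page URL

for request in driver.requests:
    # Placeholder filter: match the endpoint whose JSON response is needed.
    if request.response and '/api/endpoint' in request.url:
        body = request.response.body  # raw bytes (may be compressed)
        print(body[:200])

driver.quit()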
Hello. There are some pages that are built to make automated requests impossible. That detection runs in JavaScript, and there are companies that provide it and close off access to bots.
So I am sorry that I cannot solve your problem; I tried to do the same as you, and there was no way.

Requests Library producing Error Code 444 with Amazon EC2

I'm making a bunch of web calls to scrape data, using Joblib and Requests, which work fine on my personal computer. But when I moved the code to an Amazon EC2 t3a.micro instance, I ran into error 444: requests.exceptions.HTTPError: 444 Client Error: Unknown for url. I believe this is because I am making too many requests in a limited time frame, but I'm not sure how exactly to fix this problem or what the problem is.
According to this httpstatus page, the error comes from:
A non-standard status code used to instruct nginx to close the
connection without sending a response to the client, most commonly
used to deny malicious or malformed requests.
So if you are 100% sure you use the same payload on EC2 and it is also correct, then it could indeed be the number of requests. The way you limit the request rate usually depends on the code you wrote. I have seen people use sleep(5) to avoid flooding the server with requests. Not sure if it would work in your specific case, though.
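As a concrete illustration, a minimal throttling sketch; the URL list and the 5-second delay are placeholders, not values tuned for any particular server:

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholders

for url in urls:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # ... process resp.text ...
    time.sleep(5)  # pause between calls so the server isn't flooded

One caveat: if Joblib runs the calls in parallel, a per-call sleep only throttles each worker, so the combined rate across workers can still trip the server's limits.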

Why is an Apache server responding differently to my NodeJS vs Python requests?

I've been hitting a REST API for a website for months with the Python requests library without issue. It's on a private network that I have no control over. It also requires client certificate authentication, so the call looks something like:
r = requests.get('https://example.com/some/endpoint?param=something', cert=(key_path, cert_path), verify=ca_path)
Upon inspection of the request, I was able to see that the server initially replies with a 302 redirect code to another URI on the server in order to set an authentication cookie, and then redirects back to the resource I was requesting, which is then fulfilled successfully. Hakuna matata.
We've been standing up a new NodeJS server component and attempting to connect it to the same endpoint. I've tried with a few different libraries, and they all get identical results. Here's an example of one of my attempts using the popular got library with the tough-cookie CookieJar:
let opts = {
    key: fs.readFileSync(keyPath),
    cert: fs.readFileSync(certPath),
    ca: [fs.readFileSync(caPath)],
    cookieJar: new CookieJar()
};
// Using the Promise API here because I'm running on an older version of NodeJS which
// unfortunately doesn't have async/await :(
got('https://example.com/some/endpoint?param=something', opts).then((res) => {
    // ...
});
Strangely, the server responds with a 200 code, and an HTML page that basically says (and I'm paraphrasing of course):
"You seem to have lost your session info, dude. Maybe your browser restarted after a plugin installation or something."
It has embedded links to the site's login page. Unfortunately I can't copy and paste the output because everything's on a closed private network.
I can't for the life of me figure out why this server (which the response headers indicate is an Apache server) would respond differently to Node's HTTP request vs Python's. I've even changed the HTTP request headers on both the requests and the got calls to be absolutely identical, to no avail. It always works for Python, and never works for Node.
This makes no sense to me. Is there anyone familiar with Python's requests library and Node's HTTP module who can identify the subtle differences in the way these might be connecting to the server that might be causing this issue? The server obviously seems to be able to distinguish between the two requests, even though the headers are identical. Anyone familiar with Apache have any ideas or things to look for which could shed light on the issue?
In case it's relevant, we're using Python v3.6 and Node v8.x (Can't remember exactly which minor version... and unfortunately I can't access the machine at the moment, will update later)
Any suggestions? What other things can I try to get the requests to complete successfully in Node like they do in Python?
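One way to see exactly what the working Python side does, so it can be compared hop by hop with the Node client, is to dump the redirect chain and the cookies requests carries across it. A debugging sketch, reusing the key_path/cert_path/ca_path variables from the question (the URL is the same placeholder):

import requests

r = requests.get('https://example.com/some/endpoint?param=something',
                 cert=(key_path, cert_path), verify=ca_path)

# r.history holds the intermediate redirect responses, in order.
for hop in r.history + [r]:
    print(hop.status_code, hop.url)
    print('  Set-Cookie:', hop.headers.get('Set-Cookie'))

# Headers actually sent on the final (post-redirect) request.
print('Final request Cookie header:', r.request.headers.get('Cookie'))

If the 302 hop sets the auth cookie here but the Node side never replays it, the difference likely lies in how the two clients carry cookies (or the client certificate) across redirects, rather than in the initial request headers.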

Python Web Scraping HTTP 400

I'm doing a web scrape with Python (using the Scrapy framework). The scrape works successfully until about an hour into the process, and then every request comes back with an HTTP 400 error code.
Is this likely to be an IP-based rate limiter or a scrape-detection tool? Any advice on how I might investigate the root cause further?
I think the problem is the request rate. Try adding some download_delay; if you can request more pages before the 400 errors start, then you can tune download_delay and get the full web content. Some websites give information about an acceptable delay in their robots.txt file. A sketch of the relevant settings follows.
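A minimal sketch of Scrapy's throttling settings, assuming they go in the project's settings.py; the numbers are illustrative starting points, not tuned values:

# settings.py - slow the crawl so the server sees a gentler request rate
DOWNLOAD_DELAY = 2                 # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # vary the delay (0.5x-1.5x) between requests
AUTOTHROTTLE_ENABLED = True        # back off automatically based on server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
ROBOTSTXT_OBEY = True              # respect the site's robots.txt rules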
It could be a rate limiter.
However, a 400 error generally means that the client request was malformed and therefore rejected by the server.
You should start investigating this first. When your requests start failing, exit your program and immediately start it again. If it starts working, you know that you aren't being rate-limited and that there is in fact something wrong with how your requests are formed later on.

Troubleshooting 404 received by Python script

I have a Python script that pings 12 pages on someExampleSite.com every 3 minutes. It's been working for a couple of months, but today I started receiving 404 errors for 6 of the pages every time it runs.
So I tried going to those URLs on the PC the script is running on, and they load fine in Chrome and Safari. I've also tried changing the user-agent string the script uses, which didn't change anything. I also tried removing the 'If-Modified-Since' header, which didn't change anything either.
Why would the server send my script a 404 for these 6 pages when I can load them just fine in Chrome and Safari on the same computer? (I made sure to do a hard refresh in Chrome and Safari, and they still loaded.)
I'm using urllib2 to make the request.
There could be multiple reasons for this, such as the server rejecting your request based on missing headers, or throttling.
You could try recording your request headers in Chrome using the HTTP Headers extension, then replaying them with the Python requests library, adding all your browser headers to your request. Then you could try changing or removing headers one at a time to see what exactly is happening; a sketch follows.
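A minimal sketch of that replay-and-prune approach; the header values and URL are illustrative placeholders to be replaced with what the browser actually recorded:

import requests

# Placeholder headers - paste the real values recorded from the browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (...)',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
}

resp = requests.get('http://someexamplesite.com/page1', headers=headers)  # placeholder URL
print(resp.status_code)
# Re-run after removing or changing one header at a time to isolate the culprit.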
So I figured out what the problem was.
The website returns an erroneous response code for these 6 pages: even though it returns a 404, it also returns the web page. Chrome and Safari seem to ignore the response code and display the page anyway, while my script aborts on the 404.
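Since the script uses urllib2, one way to mimic the browsers and still read the page despite the bogus status is to catch the HTTPError, which is itself file-like. A minimal sketch (the URL is a placeholder):

import urllib2

url = 'http://someexamplesite.com/page1'  # placeholder
try:
    body = urllib2.urlopen(url).read()
except urllib2.HTTPError as e:
    if e.code == 404:
        body = e.read()  # the server sent the page body along with the 404
    else:
        raise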
