I'm making a bunch of web calls to scrape data using Joblib and Requests. They work fine on my personal computer, but when I moved the code to an Amazon EC2 t3a.micro instance I started getting a 444 error: requests.exceptions.HTTPError: 444 Client Error: Unknown for url. I believe this is because I am making too many requests in a limited time frame, but I'm not sure what exactly the problem is or how to fix it.
According to this httpstatus page, the error means:
A non-standard status code used to instruct nginx to close the
connection without sending a response to the client, most commonly
used to deny malicious or malformed requests.
So if you are 100% sure you use the same payload on EC2 and it is also correct, then it could indeed be the number of requests. How you limit the request rate usually depends on the code you wrote. I have seen people use time.sleep(5) between calls to avoid flooding the server with requests. Not sure if it would work in your specific case though.
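For illustration, a minimal throttled fetch might look like the sketch below (the function name, the 5-second delay, and the timeout are just example values, not something from your code):

import time
import requests

def fetch(url, delay=5.0):
    # fetch one URL, then pause so the server is not flooded;
    # tune the delay to whatever the target server tolerates
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    time.sleep(delay)
    return resp.text

# results = [fetch(u) for u in urls]  # sequential, instead of firing them all at once

If you keep Joblib, lowering n_jobs to 1 or 2 has a similar rate-limiting effect.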
I am trying to send a GET request using Python; the code is given below
import requests
res = requests.get('https://nclt.gov.in/')
but this request gets stuck for a long time on my droplet server, while on my local system it works fine and I get a response within a second. I don't know what is going on with this site.
I tested with different websites and I get a response from every one of them except this one, and I have no idea why.
I have also tried the following:
setting the User-Agent in the headers
using cookies
but I am still not getting a response. I have been trying for the last 24 hours and have not been able to find the exact reason behind this.
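Roughly what I tried (the User-Agent string here is just an example, not my exact one; I added a timeout so the call does not hang forever):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
}
res = requests.get('https://nclt.gov.in/', headers=headers, timeout=30)
print(res.status_code)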
Is there any issue with the droplet, and do I have to configure anything? I don't think there is any validation on 'http://nclt.gov.in', because I am sending just a GET request and it works fine on my local machine without any problem.
I apologize if this is a real entry-level programmer question..
But if I am posting data with the requests package, is the data secure? Or, while the HTTP message is 'in the air' between my PC and httpbin, could someone intercept or replicate what I am doing? Basically corrupt my data and create havoc for what I am trying to do...
import time
import requests

# build the payload: a timestamp plus a sample meter reading
stuff = {}
stamp = time.time()
data = 120.2
stuff['Date'] = stamp
stuff['meter_reading'] = data
print("sending this dict", stuff)

# POST the form-encoded payload and print the echoed response
r = requests.post('https://httpbin.org/post', data=stuff)
print("Status code: ", r.status_code)
print("Printing Entire Post Request")
print(r.text)
With the script above, security-wise, would it matter if I am posting to a server that is running HTTP or HTTPS? The code above is similar to my real-world example (which I run as a scheduled task on a Raspberry Pi), where I am posting data with a timestamp to an HTTP (not HTTPS) server (a Flask app on the PythonAnywhere cloud site), which then saves the data to SQL. This data can then be rendered through a typical JavaScript front end...
Thanks for any advice; I am still learning how to make the data transfer from the Pi to the cloud server 'secure'. Client-side web browsing security for viewing the data that has already been transferred is maybe a totally different question/topic..
This is mainly a question about protocols. The HTTP protocol is less secure, as someone can 'listen' to what you are sending over it. That's why you should always use the newer HTTPS protocol, which runs over a TLS (encrypted) connection. You can read more about it e.g. here.
Requests verifies SSL certificates for HTTPS requests, just like a web browser. By default, SSL verification is enabled, and Requests will throw a SSLError if it’s unable to verify the certificate.
https://requests.readthedocs.io/en/master/user/advanced/#ssl-cert-verification
If you're transmitting data that you do not want others to be able to see, use https. For this use case I can't imagine it would matter too much.
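To make that concrete: on the client side the only change is the https:// scheme, and certificate verification then happens automatically. A minimal sketch of the snippet above with explicit error handling (the payload values are made up):

import requests

payload = {'Date': 1700000000.0, 'meter_reading': 120.2}
try:
    # over HTTPS the connection is encrypted and the server
    # certificate is checked by default (verify=True is the default)
    r = requests.post('https://httpbin.org/post', data=payload, timeout=10)
    r.raise_for_status()
except requests.exceptions.SSLError as e:
    # raised when the server certificate cannot be verified
    print('certificate problem:', e)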
I've developed a fairly simple web service using Flask (Python 2.7, current Flask and dependencies), where clients POST a hunk of JSON to the server and get a response.
This is working 100% of the time when no authentication is enabled; straight up POST to my service works great.
Adding HTTP Digest authentication to the endpoint results in the client producing a 'Broken Pipe' error after the 401 - Authentication Required response is sent back... but only if the JSON hunk is more than about 22k.
If the JSON hunk being transmitted in the POST is under ~22k, the client gets its 401 response and cheerfully replies with authentication data as expected.
I'm not sure exactly what the size cut-off is... the largest I've tested successfully with is 21766 bytes, and the smallest that's failed is 43846 bytes. You'll note that 32k is right in that range, and 32k might be a nice default size for a buffer... and this smells like a buffer size problem.
The problem has been observed using a Python client (built with the 'requests' module) and a C++ client (using one of Qt's HTTP client classes). The problem is also observed both when running the Flask app "stand-alone" (that is, via app.run()) and when running behind Apache via mod_wsgi. No SSL is enabled in either case.
It goes as follows:
your client POSTs JSON data without authentication
server receives the request (not necessarily in one long chunk; it might come in parts)
server evaluates the request, finds it is not providing credentials, and decides to stop processing it and reply 401.
With a short POST the server consumes the whole body and has no time to break the request off in the middle. As the POST size grows, the chance that it interrupts the unauthorized request mid-upload gets higher.
Your client has two options:
Either start sending credentials right away.
Or try/catch the broken pipe and react to it by forming a proper Digest-based request.
The first feeling is that something is broken, but it is actually a reasonable approach: imagine someone could send a huge POST request and consume resources on your server while not being authorized to do so. The server's reaction seems reasonable in this context.
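For the client side, a rough sketch with the requests module (the URL and credentials are placeholders): requests' HTTPDigestAuth performs the 401 challenge round trip itself, resending the body with the Digest header once it has the server's nonce, which effectively covers both options.

import requests
from requests.auth import HTTPDigestAuth

payload = {'some': 'large JSON document'}   # placeholder body
auth = HTTPDigestAuth('user', 'secret')     # example credentials
try:
    r = requests.post('http://example.com/endpoint', json=payload, auth=auth)
    r.raise_for_status()
except requests.exceptions.ConnectionError:
    # a broken pipe from the server surfaces here; retrying with the
    # same auth object is one pragmatic reaction
    pass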
I ran a bunch of HTTP GET requests to the iTunes API endpoint below to query for a list of artist ids (200 at a time).
My script worked well for the past few months, but lately every single request coming from my server is served with a 403 status. I re-ran the same queries on my local machine and they worked fine. Then I reallocated a different IP to my server and the requests were sporadically served (most still return as 403s). Even requests for a single artist id were served 403s.
import requests

artist_ids = "<id1>,<id2>,..."
itunes_search_url = "https://itunes.apple.com/lookup?id={0}".format(artist_ids)
r = requests.get(itunes_search_url)
print(r.status_code)
# => 403
Does anyone know if Apple has started enforcing stricter rules around the number of requests that can be made to their Search API from a single IP? I wonder if they have different rules for EC2 instance IPs than for regular IPs?!
Closing this now; it looks like Apple recommends caching the Search API results, which is the only way I got this to work.
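For anyone landing here later, a minimal sketch of that caching approach using the third-party requests-cache package (the cache name, the one-day expiry, and the artist id are just examples):

import requests
import requests_cache  # pip install requests-cache

# cache responses on disk for a day so repeated lookups
# never hit Apple's servers again
requests_cache.install_cache('itunes', expire_after=86400)

r = requests.get('https://itunes.apple.com/lookup?id=909253')
print(r.status_code, getattr(r, 'from_cache', False))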
Originally, I tried to post an AJAX request from my client side to a third-party URL, but the browser blocks that for security reasons (the same-origin policy). I thought about sending an AJAX request to my server side instead, having the server send a GET request to the third party, get the response, and send it back to the client side. How can I do that with Flask?
Install the requests module (much nicer than using urllib2) and then define a route which makes the necessary request - something like:
import requests
from flask import Flask
app = Flask(__name__)
@app.route('/some-url')
def get_data():
    return requests.get('http://example.com').content
Depending on your set up though, it'd be better to configure your webserver to reverse proxy to the target site under a certain URL.
Flask alone does not have this capability, but it is a simple matter to write a request handler that makes a request to another server using an HTTP client library and then return that response.
# third-party HTTP client library
import requests
# assume that "app" below is your flask app, and that
# "Response" is imported from flask.
@app.route("/proxy-example")
def proxy_example():
    r = requests.get("http://example.com/other-endpoint")
    return Response(
        r.text,
        status=r.status_code,
        content_type=r.headers['content-type'],
    )
However, this will not achieve exactly the same result as you might expect from a client-side request. Since your server cannot "see" any cookies that the client browser has stored for the target site, your proxied request will be effectively anonymous and so, depending on the target site, may fail or give you a different response than you'd get requesting that resource in the browser.
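To make that limitation concrete, here is a rough sketch of a slightly fuller proxy (the upstream URL and route are placeholders):

import requests
from flask import Flask, Response, request

app = Flask(__name__)

UPSTREAM = 'http://example.com'  # placeholder target site

@app.route('/proxy/<path:path>')
def proxy(path):
    # Forward the path and query string. The browser's cookies for
    # UPSTREAM never reach this server (it only sees cookies for our
    # own domain), so the upstream request is effectively anonymous.
    upstream = requests.get(
        '{0}/{1}'.format(UPSTREAM, path),
        params=request.args,
        timeout=10,
    )
    return Response(
        upstream.content,
        status=upstream.status_code,
        content_type=upstream.headers.get('content-type'),
    )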
If you have a relationship with the third-party URL (that is, if you control it or are able to work with the people who do) they can give access for cross-domain requests in the browser using CORS (which is only supported in modern browsers) or JSON-P (an older workaround that predates CORS).
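For concreteness, if the third party happened to also run Flask, granting your page access via CORS could look roughly like this (the route, payload, and allowed origin are examples):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/api/data')
def data():
    resp = jsonify({'value': 42})
    # allow scripts served from this (example) origin to read the response
    resp.headers['Access-Control-Allow-Origin'] = 'https://your-app.example.com'
    return resp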
The third-party provider could also give you access to the data you want at an endpoint that is designed to accept requests from other servers and that provides a mechanism for you to authenticate your app. The most popular protocol for this is OAuth.
As the other answers have stated, using the requests module for Python would be the best way to tackle this from the coding perspective. However, as the comments mentioned (and the reason I came to this question), this can give an error saying the request was denied. This error is likely caused by SELinux.
To check if this is the issue first make sure SELinux is enabled with this command:
sestatus
If 'Current Mode' is 'enforcing' then SELinux is enabled.
Next get the current bool values that apply directly to apache with this command:
getsebool -a | grep httpd
Look for the setting 'httpd_can_network_connect'; this determines whether Apache is allowed to make TCP requests out to the network. If it is on, then all Apache TCP requests will be allowed. To turn it on, run the following as root:
setsebool -P httpd_can_network_connect 1
If you only need database access (I had this problem before, which is why I suspected SELinux here), then it would probably be better to only turn on 'httpd_can_network_connect_db'.
The purpose of this policy is that if a hacker were to hijack your Apache server, they would not be able to get out through the server to the rest of your internal network.
This probably would've been better as a comment but I don't have enough rep..
Sources:
https://tag1consulting.com/blog/stop-disabling-selinux
https://wiki.centos.org/TipsAndTricks/SelinuxBooleans