Scrapy: no route to host and persistent support enabled - python

If I'm running a crawler with persistent support enabled and I temporarily lose my internet connection, will the crawler retry the URLs that get a "no route to host" error during the temporary outage?

Yes.
Scrapy uses an HTTP/1.1 client that has persistent-connection support by default, and under the hood (thanks to Twisted) this uses a pool of persistent connections with automatic retry when a connection is lost.
Besides that, when Scrapy gets a connection error for a request (timeout, DNS error, no route, etc.), the RetryMiddleware takes care of retrying the request. See http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.retry
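For reference, the retry behaviour can be tuned through Scrapy's standard settings; a minimal sketch for settings.py (the values here are illustrative, not recommendations):

    # Tune RetryMiddleware via standard Scrapy settings (settings.py).
    RETRY_ENABLED = True    # default: True
    RETRY_TIMES = 5         # extra attempts per request; default: 2
    RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # response codes worth retrying
    # Connection-level failures (DNS errors, "no route to host", timeouts)
    # are retried regardless of RETRY_HTTP_CODES, up to RETRY_TIMES attempts.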

Related

find out why my web server becomes unavailable

I have a Python web server (Django).
It talks to other services such as Elasticsearch.
I notice that when Elasticsearch goes down, the web server soon (after a few minutes) stops responding to client requests.
I use https://github.com/elastic/elasticsearch-py, which implements a timeout, so I don't think the calls block indefinitely.
My hunch is that requests pile up during the timeout period and the server becomes unavailable, but that's just a guess.
What's the reason for the server not being able to handle requests in such a scenario, and how do I fix it?
I have an nginx -> uwsgi -> Django setup on Unix (Amazon ECS), if that makes a difference.
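If the hunch about piled-up requests is right, one common mitigation is to fail fast: cap how long each Elasticsearch call can hold a worker. A hedged sketch with elasticsearch-py (illustrative values; the exact keyword arguments vary between client versions, and the index name is hypothetical):

    from elasticsearch import Elasticsearch

    # Short client-level timeout and no transport retries, so a dead ES node
    # releases the uWSGI worker quickly instead of holding it for minutes.
    es = Elasticsearch(["http://localhost:9200"], timeout=2, max_retries=0)

    # Per-request override: fail fast on this one call.
    results = es.search(index="my-index",  # hypothetical index name
                        body={"query": {"match_all": {}}},
                        request_timeout=2)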

How make a persistent connection on your web server using sockets?

I'm writing a web server on sockets. I want to support persistent connections.
When I type a request to the server (on localhost) into the browser's address bar, I see "Connection: keep-alive" in the headers, but the browser displays the data sent only after the connection is closed. I even do a "flush" on the connection (in Python you can wrap a socket in a file object and call "flush" on it). I guess I don't quite understand how sockets should behave in a persistent connection.
Please help me figure this out, with Python code examples if possible. Sorry for my bad English.
I guess I don't quite understand how sockets should behave in a persistent connection
This seems to be the case. A persistent HTTP connection just means that the server may keep the TCP connection open after sending the HTTP response in order to process another HTTP request, and that the client may send another request on the same TCP connection if the server has kept it open. Both server and client might decide not to send/receive another request and to close the connection whenever it is idle (i.e. there is no outstanding HTTP response).
Persistent HTTP connections in no way change the semantics of HTTP from a request-response protocol to "anything sockets can do". This means the way you want to use persistence is wrong. Note also that with keep-alive the browser can no longer rely on the connection closing to detect the end of a response, so the server must send a Content-Length header (or use chunked transfer encoding); without it, the browser keeps waiting for more data, which matches the symptom you describe.
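A minimal sketch (not production code) of a keep-alive server on raw sockets; the key line is the Content-Length header, which lets the browser render the response immediately while the connection stays open:

    import socket

    body = b"Hello, persistent world!"
    response = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/plain\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n"
        b"Connection: keep-alive\r\n"
        b"\r\n" + body
    )

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("localhost", 8080))
    srv.listen(1)

    while True:
        conn, _ = srv.accept()
        while True:
            # Naive framing: assumes one recv() returns one full request.
            request = conn.recv(65536)
            if not request:  # client closed the connection
                break
            conn.sendall(response)  # browser renders now; connection stays open
        conn.close()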

When do we need socket and when do we need request?

Why do we need sockets in Python when we have the requests library?
If we can use a socket to connect to another server, what is the requests library for?
Requests is a higher-level API for handling HTTP requests (and it uses sockets internally). There are dozens of other network protocols not covered by it. Of course, you could handle HTTP by using sockets directly, but unless you have an extremely good reason to do so, you'd just be reinventing the wheel.
Requests is a Python HTTP library, whereas sockets are used for sending or receiving data on a computer network. HTTP is an application-layer protocol that specifies how requests and replies between client and server should be made. In socket programming, you make a connection by specifying a destination IP/port and send your data to the remote host.
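To make the difference concrete, a small sketch fetching the same page at both levels of abstraction (example.com used as a stand-in host):

    import socket
    import requests

    # High level: requests handles the connection, headers, and parsing.
    r = requests.get("http://example.com/")
    print(r.status_code, len(r.text))

    # Low level: the same request written by hand over a TCP socket.
    s = socket.create_connection(("example.com", 80))
    s.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    raw = b""
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        raw += chunk
    s.close()
    print(raw.split(b"\r\n", 1)[0])  # status line, e.g. b'HTTP/1.1 200 OK'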

Serving Python (Flask) REST API over HTTP2

I have a Python REST service and I want to serve it using HTTP2. My current server setup is nginx -> Gunicorn. In other words, nginx (port 443, with port 80 redirecting to 443) is running as a reverse proxy and forwards requests to Gunicorn (port 8000, no SSL). nginx is running in HTTP2 mode and I can verify that by using Chrome and inspecting the 'protocol' column after sending a simple GET to the server. However, Gunicorn reports that the requests it receives are HTTP1.0. Also, I couldn't find it in this list:
https://github.com/http2/http2-spec/wiki/Implementations
So, my questions are:
Is it possible to serve a Python (Flask) application with HTTP2? If yes, which servers support it?
In my case (one reverse proxy server and one serving the actual API), which server has to support HTTP2?
The reason I want to use HTTP2 is that in some cases I need to perform thousands of requests all together, and I was interested to see if the multiplexed-requests feature of HTTP2 could speed things up. With HTTP1.0 and Python Requests as the client, each request takes ~80ms, which is unacceptable. The other solution would be to just bulk/batch my REST resources and send multiple with a single request. Yes, this idea sounds just fine, but I am really interested to see if HTTP2 could speed things up.
Finally, I should mention that for the client side I use Python Requests with the Hyper http2 adapter.
Is it possible to serve a Python (Flask) application with HTTP/2?
Yes, by the information you provide, you are doing it just fine.
In my case (one reverse proxy server and one serving the actual API), which server has to support HTTP2?
Now I'm going to tread on thin ice and give opinions.
The way HTTP/2 has been deployed so far is by having an edge server that talks HTTP/2 (like ShimmerCat or nginx). That server terminates TLS and HTTP/2, and from there on uses HTTP/1, HTTP/1.1 or FastCGI to talk to the inner application.
Can, at least theoretically, an edge server talk HTTP/2 to a web application? Yes, but HTTP/2 is complex, and for inner applications it doesn't pay off very well.
That's because most web application frameworks are built for handling requests for content, and that's done well enough with HTTP/1 or FastCGI. Although there are exceptions, web applications have little use for the subtleties of HTTP/2: multiplexing, prioritization, all the myriad of security precautions, and so on.
The resulting separation of concerns is in my opinion a good thing.
Your 80 ms response time may have little to do with the HTTP protocol you are using, but if those 80 ms are mostly spent waiting for input/output, then of course running things in parallel is a good thing.
Gunicorn will use a thread or a process to handle each request (unless you have gone the extra mile to configure the greenlets backend), so consider whether letting Gunicorn spawn thousands of tasks is viable in your case.
If the content of your requests allow it, maybe you can create temporary files and serve them with an HTTP/2 edge server.
It is now possible to serve HTTP/2 directly from a Python app, for example using Twisted. You asked specifically about a Flask app though, in which case I'd (with bias) recommend Quart which is the Flask API reimplemented on top of asyncio (with HTTP/2 support).
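A minimal sketch of that suggestion (the route is hypothetical; assumes Quart plus an HTTP/2-capable ASGI server such as Hypercorn, and TLS, since browsers negotiate HTTP/2 via ALPN):

    from quart import Quart

    app = Quart(__name__)

    @app.route("/items/<int:item_id>")  # hypothetical resource
    async def get_item(item_id):
        return f"item {item_id}"

    # Run with, e.g.:
    #   hypercorn --certfile cert.pem --keyfile key.pem app:app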
Your actual issue,
With HTTP1.0 and Python Requests as the client, each request takes ~80ms
suggests to me that the problem you may be experiencing is that each request opens a new connection. This could be alleviated via the use of a connection pool without requiring HTTP/2.
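For illustration, a sketch of that connection-pool fix with Requests itself (the endpoint is hypothetical): a Session keeps a pool of persistent connections, so the TCP/TLS handshake is paid once rather than per request:

    import requests

    session = requests.Session()  # reuses connections across requests

    urls = [f"https://api.example.com/items/{i}" for i in range(1000)]  # hypothetical endpoint
    for url in urls:
        resp = session.get(url, timeout=5)
        resp.raise_for_status()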

Redirecting HTTP to HTTPS on AWS-ELB-Tornado

I notice a lot of questions on this topic, but I haven't found any specifically for Tornado.
I have an Amazon EC2 instance running behind a load balancer. The HTTPS cert terminates at the ELB, and both ports 80 and 443 are directed to the same Tornado server port.
How do I force redirect HTTP to HTTPS traffic?
You have to send an HTTP 301 or 302 with the new location; see RequestHandler.redirect() for details.
You can consider adding CloudFront (CDN) in front of your ELB; it has many benefits, e.g. lower latency for your application and handling of traffic spikes.
If you go for this, CloudFront has an HTTP-to-HTTPS redirect feature.
Enable xheaders and use the X-Forwarded-Proto header. That will tell you if the original request came in through http or https.
Also, see SO question Retrieve browser headers in Python.
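Putting those two answers together, a minimal sketch: enable xheaders so Tornado trusts X-Forwarded-Proto from the ELB, then redirect any request that arrived over plain HTTP (the port and handler are illustrative):

    import tornado.httpserver
    import tornado.ioloop
    import tornado.web

    class MainHandler(tornado.web.RequestHandler):
        def prepare(self):
            # With xheaders=True, request.protocol reflects X-Forwarded-Proto.
            if self.request.protocol == "http":
                self.redirect("https://" + self.request.host + self.request.uri,
                              permanent=True)  # HTTP 301

        def get(self):
            self.write("served over https")

    app = tornado.web.Application([(r"/.*", MainHandler)])
    server = tornado.httpserver.HTTPServer(app, xheaders=True)
    server.listen(8888)  # the ELB forwards both 80 and 443 here
    tornado.ioloop.IOLoop.current().start()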
