How to use cache-control with python in GAE?

I'm deciding whether to enable a cache header and want to know what difference it will make.
This is the current code. Should I enable caching, and what will it do?
#seconds_valid = 8600
#self.response.headers['Cache-Control'] = "public, max-age=%d" % seconds_valid
self.response.headers['Cache-Control'] = 'no-cache'
Can I test what difference it makes if I change the code to this?
seconds_valid = 8600
self.response.headers['Cache-Control'] = "public, max-age=%d" % seconds_valid
Am I doing it the right way? What exactly is going to happen when I enable the cache? Will I still be able to update the page?
Thank you

There is also AppEngine's caching reverse proxy / edge cache which may pick up your Cache-Control header if given a max-age and set to public like in your example. The edge cache is "best effort", meaning it is not 100% certain it will cache your response.
More information can be found here and here.

Setting Cache-Control makes no difference to your application code. The header is read by web browsers, so the caching happens on the client side, not on the server. Correct values for Cache-Control can reduce server load and save bandwidth because user agents will try to cache content locally, but that has nothing to do with App Engine.
If you are looking for server-side caching to improve response time and reduce database reads, have a look at memcache. To use memcache well you may also need to read up on cache-invalidation strategies.
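To make the two settings from the question easier to compare, here is a minimal sketch. The helper name cache_control_header is hypothetical, and a plain dict stands in for self.response.headers so the snippet runs outside App Engine.

```python
def cache_control_header(max_age_seconds=None):
    """Return a Cache-Control value.

    None  -> "no-cache": caches must revalidate before reusing the response.
    N     -> "public, max-age=N": browsers and shared caches may reuse the
             response for up to N seconds without asking the server.
    """
    if max_age_seconds is None:
        return "no-cache"
    return "public, max-age=%d" % max_age_seconds


# A plain dict stands in for self.response.headers (assumption for this sketch).
headers = {}
headers["Cache-Control"] = cache_control_header(8600)
print(headers["Cache-Control"])
```

With a max-age set, you can still update the page on the server, but a client or cache that already holds a copy may keep serving it until that copy expires, so changes can show up with a delay of up to max-age seconds.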

Related

how to add rate limiting on tornado python app

Would it be possible to implement a rate limiting feature in my Tornado app? For example, limiting the number of HTTP requests from a specific client if it is found to send too many requests per second (which red-flags it as a bot).
I think I could do it manually by storing the requests in a database and analyzing the requests per IP address, but I wanted to check whether there is already an existing solution for this feature.
I checked Tornado's GitHub page. I have the same question as this post, but no explicit answer was provided. I checked Tornado's wiki links as well, and it seems rate limiting is not handled yet.

Instead of storing them in the DB, it would be better to keep them in an in-memory dictionary, which is easier to work with.
Also, can you share whether the API sits behind a load balancer and which web server is used?

An enterprise-grade solution to your problem is Ambassador.
You can use Ambassador's products, such as Envoy Proxy and Edge Stack, and set them up to do the rate limiting for you.
Additionally, to store the data you can use any popular caching database, or one that stores key:value pairs, for example Redis.
If you are doing this for a very small project, you can use an existing npm/pip package.
Read the docs: https://www.getambassador.io/products/edge-stack/api-gateway/

You should probably do this before your requests reach Tornado.
But if it's an application level feature (limiting requests depending on level of subscription), then you can do it in Tornado in lots of ways, depending on how complex you want the rate limiting to be.
Probably the simplest way is to keep a dict on your tornado.web.Application that uses the IP as the key and the timestamp of the last request as the value, and check every request against it in prepare: if not enough time has passed since the last request, raise a tornado.web.HTTPError(429) (ideally with a Retry-After header). If you do this, you will still need to clean up the dict now and then to remove entries for clients that have not made a request recently, or it will keep growing (you could do it in on_finish on every request).
If you have another fast/in-memory storage attached (memcache, redis, sqlite), you should use that, but you definitely should not use an RDBMS as all those writes will not be great for its performance.
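The dict-based approach described in this answer could look roughly like the sketch below. It is framework-independent so it can be run on its own; the class name LastRequestLimiter and the parameters min_interval and max_idle are hypothetical, and this variant only records the timestamp of accepted requests.

```python
import time


class LastRequestLimiter:
    """Track the last accepted request per client and enforce a minimum interval."""

    def __init__(self, min_interval=1.0, max_idle=60.0):
        self.min_interval = min_interval  # seconds required between requests
        self.max_idle = max_idle          # entries idle longer than this are dropped
        self.last_seen = {}               # ip -> timestamp of last accepted request

    def allow(self, ip, now=None):
        """Return True if the request may proceed, False if it should get a 429."""
        now = time.monotonic() if now is None else now
        previous = self.last_seen.get(ip)
        if previous is not None and now - previous < self.min_interval:
            return False
        self.last_seen[ip] = now
        return True

    def cleanup(self, now=None):
        """Remove clients that have not made a request recently."""
        now = time.monotonic() if now is None else now
        stale = [ip for ip, ts in self.last_seen.items() if now - ts > self.max_idle]
        for ip in stale:
            del self.last_seen[ip]


limiter = LastRequestLimiter(min_interval=1.0, max_idle=60.0)
print(limiter.allow("203.0.113.7", now=0.0))  # first request from this client
print(limiter.allow("203.0.113.7", now=0.5))  # too soon after the first one
print(limiter.allow("203.0.113.7", now=1.5))  # enough time has passed
```

In a Tornado handler, prepare could call limiter.allow(self.request.remote_ip) and raise tornado.web.HTTPError(429) when it returns False, with cleanup called from on_finish. Note that this state lives in one process only, so it does not cover several Tornado processes behind a load balancer.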

Python bottle server-side caching

Let's say we have an application based on Bottle like this:
from bottle import route, run, request, template, response
import time
def long_processing_task(i):
    time.sleep(0.5)      # here some more
    return int(i) + 2    # complicated processing in reality
@route('/')
def index():
    i = request.params.get('id', '', type=str)
    a = long_processing_task(i)
    response.set_header("Cache-Control", "public, max-age=3600")  # does not seem to work
    return template('Hello {{a}}', a=a)  # here in reality it's: template('index.html', a=a, b=b, ...) based on an external template file
run(port=80)
Obviously, loading http://localhost/?id=1, http://localhost/?id=2, http://localhost/?id=3, etc.
takes at least 500 ms per page the first time.
How can I make subsequent loads of these pages faster?
More precisely, is there a way to have both:
client-side caching: if user A has visited http://localhost/?id=1 once, then if user A visits this page a second time, it will be faster
server-side caching: if user A has visited http://localhost/?id=1 once, then if user B visits this page later (for the first time for user B!), it will be faster too.
In other words: if 500 ms is spent to generate http://localhost/?id=1 for one user, it will be cached for all future users requesting the same page. (Is there a name for that?)
Notes:
In my code response.set_header("Cache-Control", "public, max-age=3600") does not seem to work.
This tutorial mentions template caching:
Templates are cached in memory after compilation. Modifications made to the template files will have no affect until you clear the template cache. Call bottle.TEMPLATES.clear() to do so. Caching is disabled in debug mode.
but I think it's not related to caching of the final page ready to send to the client.
I already read Python Bottle and Cache-Control but this is related to static files.

Server Side
You want to avoid calling your long running task repeatedly. A naive solution that would work at small scale is to memoize long_processing_task:
import time
from functools import lru_cache
@lru_cache(maxsize=1024)
def long_processing_task(i):
    time.sleep(0.5)      # here some more
    return int(i) + 2    # complicated processing in reality
More complex solutions (that scale better) involve setting up a reverse proxy (cache) in front of your web server.
Client Side
You'll want to use response headers to control how clients cache your responses. (See Cache-Control and Expires headers.) This is a broad topic, with many nuanced alternatives that are out of scope in an SO answer - for example, there are tradeoffs involved in asking clients to cache (they won't get updated results until their local cache expires).
An alternative to caching is to use conditional requests: use an ETag or Last-Modified header to return an HTTP 304 when the client has already received the latest version of the response.
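The conditional-request idea can be sketched without any framework. The helpers make_etag and conditional_response below are hypothetical names, and the comparison is simplified: a real If-None-Match header may carry several values, weak validators, or "*".

```python
import hashlib


def make_etag(body):
    """Build a quoted ETag from the response body (bytes)."""
    return '"%s"' % hashlib.sha256(body).hexdigest()[:16]


def conditional_response(body, if_none_match=None):
    """Return (status, headers, body), honoring the client's If-None-Match value."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None and if_none_match.strip() == etag:
        # The client already holds this exact version: send no body.
        return 304, headers, b""
    return 200, headers, body


# First visit: full response plus an ETag the client can remember.
status, headers, _ = conditional_response(b"Hello 3")
print(status, headers["ETag"])

# Second visit: the client sends the ETag back and gets an empty 304.
status, _, payload = conditional_response(b"Hello 3", if_none_match=headers["ETag"])
print(status, payload)
```

This saves bandwidth but not server time, because the body still has to be produced to compute the ETag. It therefore pairs naturally with the memoized long_processing_task above.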
Here's a helpful overview of the various header-based caching strategies.

Does python with wsgi (uwsgi) under nginx have some small default cache?

On my small website I need to make some data widely available, to avoid querying the database on every request. For example, this could be the list of current users shown at the bottom of every page, or the time of the last ranking update.
The site runs on Python (Flask) behind nginx + uwsgi (this docker image).
I wonder whether I get some small cache or shared memory for keeping such information "out of the box", or whether I need to explicitly set up a dedicated cache. Or perhaps something like this is provided by nginx?
Alternatively, I can still use the database, since I think it has its own cache anyway.
Sorry if the question seems naive. I come from the Java world (where things are a bit different, as we serve all requests with one fat instance of the Java application) and have some difficulty grasping what wsgi/uwsgi provides. Thanks in advance!

Firstly, nginx has a cache:
https://www.nginx.com/blog/nginx-caching-guide/
But for Flask caching you also have options:
https://pythonhosted.org/Flask-Cache/
http://flask.pocoo.org/docs/1.0/patterns/caching/

Did you have a look at caching section from Flask docs?
It literally says:
Flask itself does not provide caching for you, but Werkzeug, one of the libraries it is based on, has some very basic cache support
You create a cache object once and keep it around, similar to how Flask objects are created. If you are using the development server you can create a SimpleCache object, that one is a simple cache that keeps the item stored in the memory of the Python interpreter:
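# Note: werkzeug.contrib was removed in Werkzeug 1.0; the cachelib package provides an equivalent SimpleCache
from werkzeug.contrib.cache import SimpleCache
cache = SimpleCache()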
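To illustrate what such an in-memory cache does, here is a minimal standard-library sketch. TinyCache is a hypothetical stand-in, not the Werkzeug class, and the now parameter exists only to make the example deterministic.

```python
import time


class TinyCache:
    """Minimal in-process cache with per-item timeouts (illustrative only)."""

    def __init__(self, default_timeout=300):
        self.default_timeout = default_timeout
        self._items = {}  # key -> (expires_at, value)

    def set(self, key, value, timeout=None, now=None):
        now = time.monotonic() if now is None else now
        ttl = self.default_timeout if timeout is None else timeout
        self._items[key] = (now + ttl, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value


cache = TinyCache(default_timeout=60)
cache.set("current_users", ["alice", "bob"], now=0)
print(cache.get("current_users", now=30))  # still fresh
print(cache.get("current_users", now=61))  # expired, so this is a miss
```

Keep in mind that a cache like this lives inside one Python process. Under uwsgi with several worker processes, each worker holds its own copy, so values are not shared between them.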
-- UPDATE --
Or you could solve it on the frontend by storing the data in the web browser's local storage.
If there's nothing in local storage you call the DB; otherwise you use the information from local storage rather than making a DB call.
Hope it helps.

Avoid running GeoIP on every page

This is the module I am working with: http://wiki.nginx.org/HttpGeoipModule
From what I can see, since it is configured in the nginx config and passed to uwsgi, there seems to be no choice but to run the GeoIP lookup on every page and then only collect and use the variable when needed.
From a performance point of view I would rather have it so I request the geoip ONLY when needed, cache it in a cookie or session and then not request it again to speed up the site.
Is anyone able to tell me if this is possible?

From a performance point of view I would rather have it so I request the geoip ONLY when needed, cache it in a cookie or session and then not request it again to speed up the site.
Is anyone able to tell me if this is possible?
Yes, it's possible. But from a performance point of view you should not worry: the GeoIP databases are loaded into memory (when the configuration is read) and nginx does the lookups very fast.
Anyway, if you want, you can use something like:
set $country $cookie_country;
# nginx compares strings with a single "="
if ($country = '') {
    set $country $geoip_country_code;
    add_header Set-Cookie country=$geoip_country_code;
}
uwsgi_param GEOIP_COUNTRY $country;

No, you can't make nginx perform the GeoIP lookup only on demand. Once you define a geoip_country or geoip_city directive, nginx will request data from the GeoIP database whether the answer is used later or not. But you can fetch GeoIP data without nginx at all, i.e. directly from your application. Take a look at the Python GeoIP library: http://dev.maxmind.com/geoip/downloadable#Python-5
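If the lookup moves into the application, caching the result per user could look like the sketch below. The function country_for is a hypothetical helper, lookup stands for whatever GeoIP library call you end up using, and a plain dict stands in for the framework's session object.

```python
def country_for(session, ip, lookup):
    """Return the visitor's country code, calling `lookup` at most once per session."""
    if "country" not in session:
        session["country"] = lookup(ip)
    return session["country"]


# Stand-in for a real GeoIP library call; it records how often it is invoked.
calls = []


def fake_lookup(ip):
    calls.append(ip)
    return "NL"


session = {}  # stands in for the per-user session (assumption for this sketch)
print(country_for(session, "198.51.100.4", fake_lookup))
print(country_for(session, "198.51.100.4", fake_lookup))
print(len(calls))  # the lookup ran only on the first call
```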

Django statelessness?

I'm just wondering if Django was designed to be a fully stateless framework?
It seems to encourage statelessness and external storage mechanisms (databases and caches), but I'm wondering if it is possible to store some things in the server's memory while my app is in development and runs via manage.py runserver.

Sure it's possible. But if you are writing a web application you probably won't want to do that because of threading issues.

That depends on what you mean by "store things in the server's memory." It also depends on the type of data. If you can, you're better off storing "global data" in a database or in the file system somewhere. Unless it is needed every request it doesn't really make sense to store it in the Django instance itself. You'll need to implement some form of locking to prevent race conditions, but you'd need to worry about race conditions if you stored everything on the server object anyway.
Of course, if you're talking about user-by-user data, Django does support sessions. Or, and this is another perfectly good option if you're willing to make the user save the data, cookies.

The best way to maintain state in a Django app on a per-user basis is request.session (see Django sessions), which is a dictionary you can use to remember things about the current user.
For application-wide state you should use a persistent datastore (database or key/value store).
Example view for sessions:
def my_view(request):
    # start from 0 so the first visit is counted as 1
    pages_viewed = request.session.get('pages_viewed', 0) + 1
    request.session['pages_viewed'] = pages_viewed
    ...
If you want to maintain variables on a per-app-instance basis, you can just define module-level variables like so:
# number of times my_view has been served by this server
# instance since the last restart
served_since_restart = 0
def my_view(request):
    global served_since_restart  # needed to rebind the module-level name
    served_since_restart += 1
    ...
If you wanted to maintain some server state across ALL app servers (like total number of pages viewed EVER) you should probably use a persistent key/value store like redis, memcachedb, or riak. There is a decent comparison of all these options here: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
You can do it with Redis (via redis-py) like so, assuming your Redis server is at "127.0.0.1" (localhost) on port 6379 (the default):
import redis
def my_view(request):
    r = redis.Redis(host='127.0.0.1', port=6379)
    # INCR is atomic and treats a missing key as 0, so no separate get/set is needed
    served = r.incr('pages_served_all_time')
    ...

There is the LocMemCache cache backend, which stores data in-process. You can use it with sessions, but with great care: this cache is not cross-process, so you will have to deploy with a single process, because otherwise there is no guarantee that subsequent requests will be handled by the same process. Global variables may also work (use threadlocals if they shouldn't be shared between all threads of the process; the warning about cross-process communication also applies here).
By the way, what's wrong with external storage? External storage provides easy cross-process data sharing and other features (like memory-limiting algorithms for caches or persistence with databases).
