Nginx: Speeding up Image Upload?

My Python application sits behind an Nginx instance. When I upload an image, which is one of the purposes of my app, I notice that nginx first saves the image to the filesystem (I checked with 'watch ls -l /tmp') and only then hands it over to the app. Can I configure Nginx to handle image POSTs in memory? My intent is to avoid touching the slow filesystem (the server runs on an embedded device).

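Yes. Uploaded request bodies are buffered in memory up to client_body_buffer_size and spill to a temporary file beyond that, so raise it to cover your largest image; for responses coming back from the app, set proxy_max_temp_file_size to zero, or some other reasonably small value. Another option (which might be a better choice) is to point the temp path (client_body_temp_path for uploads, proxy_temp_path for responses) at faster storage so that nginx can do a slightly better job of insulating the application from buggy or malicious hosts.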

Related

What's the optimal way to store image data temporarily in a containerized website?

I'm currently working on a website where I want the user to upload one or more images; my Flask backend will make some changes to these pictures and then return them to the front end.
Where is the best place to save these images temporarily, especially if more than one user is on my website at the same time (I'm planning on containerizing the website)? Is it safe for me to save the images in the website's folder, or do I need, for example, a database for that?

You should use a database, or external object storage like Amazon S3.
I say this for a couple of reasons:
Accidents do happen. Say the client does an HTTP POST, gets a URL back, and does an HTTP GET to retrieve the result. But in the meantime, the container restarts (because the system crashed; your cloud instance got terminated; you restarted the container to upgrade its image; the application failed); the container-temporary filesystem will get lost.
A worker can run in a separate container. It's very reasonable to structure this application as a front-end Web server, that pushes messages into a job queue, and then a back-end worker picks up messages out of that queue to process the images. The main server and the worker will have separate container-local filesystems.
You might want to scale up parts of this. You can easily run multiple containers from the same image; they'll each have separate container-local filesystems, and you won't directly control which replica a request goes to, so every container needs access to the same underlying storage.
...and it might not be on the same host. In particular, cluster technologies like Kubernetes or Docker Swarm make it reasonably straightforward to run container-based applications spread across multiple systems; sharing files between hosts isn't straightforward, even in these environments. (Most of the Kubernetes Volume types that are easy to get aren't usable across multiple hosts, unless you set up a separate NFS server.)
That set of constraints would imply trying to avoid even named volumes as much as you can. It makes sense to use volumes for the underlying storage for your database, and it can make sense to use Docker bind mounts to inject configuration files or get log files out, but ideally your container doesn't really use its local filesystem at all and doesn't care how many copies of itself are running.
(Do not rely on Docker's behavior of populating a named volume on first use. There are three big problems with it: it is on first use only, so if you update the underlying image, the volume won't get updated; it only works with Docker named volumes and not other options like bind-mounts; and it only works in Docker proper and not in Kubernetes.)
Other decisions are possible given other sets of constraints. If you're absolutely sure you will never ever want to run this application spread across multiple nodes, Docker volumes or bind mounts might make sense. I'd still avoid the container-temporary filesystem.
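As a minimal sketch of the shared-storage idea above: each upload is written under a freshly generated key to a store that every replica can reach, and only the key goes back to the client. The names ImageStore, InMemoryStore and save_upload are hypothetical, and the in-memory class is only a stand-in; in a real deployment the store would wrap a database or an object store such as S3.

```python
import uuid
from typing import Dict, Protocol


class ImageStore(Protocol):
    """Hypothetical interface for storage shared by all containers."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Illustration-only stand-in; a real store would live outside the container."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def save_upload(store: ImageStore, data: bytes) -> str:
    """Store the image under a random key so concurrent users never collide."""
    key = uuid.uuid4().hex
    store.put(key, data)
    return key
```

Because the web server and any worker only exchange keys, neither depends on its own container-local filesystem.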

Does python with wsgi (uwsgi) under nginx have some small default cache?

In my small website I feel the need to make some data widely available, to avoid hitting the database on every request. For example, this could be the list of current users shown at the bottom of every page, or the time of the last ranking update.
The site is written in Python (Flask) and runs on nginx + uwsgi (this docker image).
I wonder: do I get some small cache or shared memory for keeping such information "out of the box", or do I need to explicitly set up a dedicated cache? Or is something like this perhaps provided by nginx?
Alternatively, I could still use the database, since I think it has its own cache anyway.
Sorry if the question seems naive or silly. I come from the Java world (where things are a bit different, as we serve all requests with one fat instance of the Java application) and have some difficulty grasping what powers wsgi/uwsgi provide. Thanks in advance!

Firstly, nginx has a cache:
https://www.nginx.com/blog/nginx-caching-guide/
But for Flask caching you also have options:
https://pythonhosted.org/Flask-Cache/
http://flask.pocoo.org/docs/1.0/patterns/caching/

Did you have a look at the caching section of the Flask docs?
It literally says:
Flask itself does not provide caching for you, but Werkzeug, one of the libraries it is based on, has some very basic cache support
You create a cache object once and keep it around, similar to how Flask objects are created. If you are using the development server you can create a SimpleCache object, that one is a simple cache that keeps the item stored in the memory of the Python interpreter:
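# Note: werkzeug.contrib was removed in Werkzeug 1.0; on newer versions
# the same class is available as `from cachelib import SimpleCache`.
from werkzeug.contrib.cache import SimpleCache
cache = SimpleCache()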
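For the use case in the question (a user list or a last-update time shown on every page), the same idea can be sketched with the standard library alone. TTLCache is a hypothetical name, not a Werkzeug or Flask class, and the injectable clock is only there to make the expiry logic easy to check. Keep in mind that under uWSGI with several worker processes each worker holds its own copy of such an in-process cache.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-process cache whose entries expire after a timeout (sketch)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, timeout: float) -> None:
        # Remember the value together with the moment it stops being valid.
        self._items[key] = (self._clock() + timeout, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller falls back to the database.
            del self._items[key]
            return None
        return value
```

A request handler would call get() first and query the database, followed by set(), only when it returns None.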
-- UPDATE --
Or you could solve this on the frontend by storing the data in the web browser's local storage.
If there's nothing in local storage you call the DB; otherwise you use the information from local storage rather than making a DB call.
Hope it helps.

how to set synchronous transmission under reverse proxy using nginx?

I'm using reverse proxy with Nginx.
When I POST a file to Nginx, it seems to store the whole file locally and forward it to the backend server only after receiving the whole file.
Is there a way to make Nginx receive and forward data synchronously?

This has already been answered in the negative in this SO question: nginx files upload streaming with proxy_pass
The answer there came from one of the people maintaining the nginx code base, so you can forget it for now.
If it's really important not to transport files twice, you may try the nginx upload module, provided you have control over your upstream server: http://wiki.nginx.org/HttpUploadModule

You mean streaming. Yeah, you probably want to play with proxy_buffering, proxy_store and/or proxy_temp_file_write_size:
http://wiki.nginx.org/HttpProxyModule#proxy_store
http://wiki.nginx.org/HttpProxyModule#proxy_buffering
http://wiki.nginx.org/HttpProxyModule#proxy_temp_file_write_size
Side note: since nginx is single-threaded, you really want to use that feature (otherwise one upload may block the entire server for quite a long time).

Does local GAE read and write to a local datastore file on the hard drive while it's running?

I have just noticed that while an instance of my GAE application is running, nothing happens to the datastore file when I add or remove entries from Python code or in the admin console. I can even delete the file and still have all the data safe and sound in the admin area and accessible from code. But when I restart my application, all the data obviously goes away and I have a blank datastore. So, the question: does GAE read all data from the file only when it starts, then work with it in memory, and save the data after I stop the application? Does it make any requests to the datastore file while the application is running? If it doesn't save anything to the file while it's running, could data be lost if the application stops unexpectedly? Please clarify how this works if you know.

How the datastore reads and writes its underlying files varies - the standard datastore is read on startup, and written progressively, journal-style, as the app modifies data. The SQLite backend uses a SQLite database.
You shouldn't have to care, though - neither backend is designed for robustness in the face of failure, as they're development backends. You shouldn't be modifying or deleting the underlying files, either.

By default the dev_appserver will store its data in a temporary location (which is why it disappears and you can't see anything changing).
If you don't want your data to disappear on restart, set --datastore_path when running your dev server, like:
dev_appserver.py --datastore_path /path/to/app/myapp.db /path/to/app
As Nick said, the dev server is not built to be bulletproof; it's designed to help you quickly develop your app. The production setup is very different and will not do anything unexpected when you are dealing with exceptional circumstances.

fastcgi, cherrypy, and python

So I'm trying to do more web development in python, and I've picked cherrypy, hosted by lighttpd w/ fastcgi. But my question is a very basic one: why do I need to restart lighttpd (or apache) every time I change my application code, or the code for an underlying library?
I realize this question extends from a basic mis(i.e. poor)understanding of the fastcgi model, so I'm open to any schooling here, but I'm used to just changing a PHP file and it showing up, versus having to bounce the web server.
Any elucidation/useful mockery appreciated.

This is because of performance. For development, autoreloading is helpful. But for production, you don't want to autoreload. This is actually a decently sized bottleneck in, say, PHP. Every time you access a PHP webpage, the server has to parse and load each page from scratch. With Python, the script is already loaded and running after the first access.
As has been pointed out, CherryPy has an autoreload setting. I'd recommend using the CherryPy built-in server for development and using lighttpd for production. That will likely save you some time. The tutorial shows you how to do this.
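To make the autoreload idea concrete, here is a rough sketch of the modification-time polling that development reloaders are commonly built on. It is not CherryPy's actual implementation; snapshot and changed_files are hypothetical helper names, and the list of watched paths is assumed to be supplied by the caller.

```python
import os
from typing import Dict, Iterable, List


def snapshot(paths: Iterable[str]) -> Dict[str, float]:
    """Record the current modification time of every watched file."""
    return {p: os.stat(p).st_mtime for p in paths}


def changed_files(before: Dict[str, float], paths: Iterable[str]) -> List[str]:
    """Return the watched files whose mtime differs from the snapshot."""
    changed = []
    for p in paths:
        try:
            mtime = os.stat(p).st_mtime
        except OSError:
            # A file that vanished also counts as a change.
            changed.append(p)
            continue
        if before.get(p) != mtime:
            changed.append(p)
    return changed
```

A development server would call changed_files() every second or so and restart the process when the list is non-empty, which is the per-request cost a production setup avoids by loading the code once.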

From a system-software writer's point of view: this all depends on how the metadata about the server process is organized within your daemon (lighttpd or fcgi). Some programs are designed for one-time-only initialization. Mostly, this allows a much simpler and better-performing internal programming model.
It is often very hard to program a server process to reload config data in an easy way. You might have to introduce locks and external event objects (signals in UNIX). When you can synchronize the data structures by design, i.e. by initializing only once, why complicate things by making the data model modifiable multiple times?
