How large data can memcached handle efficiently?

How large data can memcached handle efficiently? - python

How large values can I store and retrieve from memcached without degrading its performance?
I am using memcached with python-memcached in a django based web application.

Read this one:
https://groups.google.com/forum/?fromgroups=#!topic/memcached/IaMLUeOGxWk
You should not "store" anything in memcached.

Memcached is more or less only limited by available (free) memory in the number of servers you run it on. The more memory, the more data fits, and since it uses fairly efficient in-memory indexes, you won't really see performance degrade in any significant way with more objects.
Remember though, it's a cache and there is no guarantee that you'll be able to retrieve what you put in. More memory will make memcached try to keep more data in memory, but there is no guarantee that it won't just throw data away even if memory is available if it somehow finds that a better idea.

Related

Passing variables between two python processes

I am intended to make a program structure like below
PS1 is a python program persistently running. PC1, PC2, PC3 are client python programs. PS1 has a variable hashtable, whenever PC1, PC2... asks for the hashtable the PS1 will pass it to them.
The intention is to keep the table in memory since it is a huge variable (takes 10G memory) and it is expensive to calculate it every time. It is not feasible to store it in the hard disk (using pickle or json) and read it every time when it is needed. The read just takes too long.
So I was wondering if there is a way to keep a python variable persistently in the memory, so it can be used very fast whenever it is needed.

You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set, that is expensive to build
different processes need to use the dataset
performance questions do not allow to simply read the full set from permanent storage
IMHO, we are exactly facing what databases were created for. For common use cases, having many processes all using their own copy of a 10G object is a memory waste, and the common way is that one single process have the data, and the others send requests for the data. You did not describe your problem enough, so I cannot say if the best solution will be:
a SQL database like PostgreSQL or MariaDB - as they can cache, if you have enough memory, all will be held automatically in memory
a NOSQL database (MongoDB, etc.) if your only (or main) need is single key access - very nice when dealing with lot of data requiring fast but simple access
a dedicated server using a dedicate query languages if your needs are very specific and none of the above solutions meet them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be fastest provided:
all clients make read-only accesses - it can be extended to r/w accesses but could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do you will lose all the cache optimizations that real databases implement
the size of the database and the number of client process and the external load of the whole system never increase to a level where you fall in the swapping problem above
TL/DR: My advice is to experiment what are the performances with a good quality database and optionaly a dedicated chache. Those solution allow almost out of the box load balancing on different machines. Only if that does not work carefully analyze the memory requirements and be sure to document the limits in number of client processes and database size for future maintenance and use shared memory - read-only data being an hint that shared memory can be a nice solution

In short, to accomplish what you are asking about, you need to create a byte array as a RawArray from the multiprocessing.sharedctypes module that is large enough for your entire hashtable in the PS1 server, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which can then inherit access to the RawArray. You can create your own class of object that provides the hashtable interface through which the individual variables in the table are accessed that can be separately passed to each of the PC# processes that reads from the shared RawArray.

Django Cache: Caching thousands of queries for a long time, on basic server resources

I am building a website having in my mind that hundreds (I wish thousands!) of 'get' queries -per day- will be cached for a couple of months in the filesystem.
Reading the cache documentation, however, I observe that the default values lean towards a small and fast cache cycle.
An old post describes that a strategy like the one I imagine, wrecked havoc in their servers.
Of course, the current django code seems to have evolved since 2012. However the cache defaults still remain the same...
I wonder whether I am on the right track or not.
My familiarity with caching is restricted in enjoying the W3 Total Cache results after saving thousands of files in the relevant directories without understanding anything but its basic settings.
How would an experienced developer approach "stage 1" of this task:
Without the budget -yet- to support solutions based on Redis (for example) (Not a valid argument)
How would you cache a normally augmenting number of queries -capable to form a bulk- for a long period of time, running on rather basic server resources?

Django's cache backend *should be implementation agnostic. For example, if you want to start with filesystem cache or redis cache or memcache it shouldn't really matter to django.
I can think of a couple issues with your approach:
how fast is your dataset growing? If you have pretty stable sized dataset, it shoudn't matter if the cache entries are long-lived.
how will you invalidate your queries? if the queries are being cached for months, it suggests that data does not change; cache invalidation is a big thing to consider, clients shouldn't see stale data.
are you using Filesystem cache? if data is being cached per server, are requests being consitantly assigned to the same servers? If not then multiple servers can have duplicate caches, this is one of the benefits of using a centralized cache (redis/memcache)
you should be able to calculate a pretty good estimate based on your current dataset size, how much data you'd like to cache, and the rate of growth of your data on how large of cache you'd need. I feel like a shared cache will go very far, and can be ran on "basic server" resources.
For stage 1, i would:
choose a shared cache, either redis or memcached, this should be a lot less painful when you start to scale to multi server setups
estimate how much data you will need to cache, and what sort of data size growth you predict, to make sure your cache is of an appropriate size.
I feel like cache invalidation is usually not a set policy on how long the data should persist in the cache, it is governed by when your data changes, which should force invalidate the cache so that clients don't see stale data

Difference between memcache and python dictionary

In my current project, I am using Memcache to store key-value pairs, but since the communication happens over the socket between my process and the Memcache causing the huge latencies. We went with memcache because we had a requirement of storing large amount of key-value pairs. But now I want to store the dictionary as a global datastructure in my process. Is it a good thing? Because the dictionary will be stored in processes address space. Suggestions please....

The usual reason to use memcached is that you would like to distribute the cache among multiple machines, with the goal of both having data available on all the machines, while also utilizing the storage of all the machines. If those requirements don't apply to you, and you only need the cached data on a single machine, then memcached doesn't offer you all that much. In that case, moving the dictionary into your local process might be a good idea.

I wrote a thorough answer to this on the memcached "about" page. I drew pictures and everything.
In summary: If you have more than one process, the dictionary won't help you. If you have more than one process/computer, you're going to be burning tons of memory that could be reused in great ways that save you lots of money and get you more bigger stuff.

If you data is not so big, you may just dump your python dictionary to files with cPickle.dump or marshal.dump and, reload it from file with cPickle.load or marshal.load, and if you need to worry about diskspace, you may use bz2 or gzip compress / decompress during file read / rewrite.

reducing I/O on application and database

Is there a way to reduce the I/O's associated with either mysql or a python script? I am thinking of using EC2 and the costs seem okay except I can't really predict my I/O usage and I am worried it might blindside me with costs.
I basically develop a python script to parse data and upload it into mysql. Once its in mysql, I do some fairly heavy analytic on it(creating new columns, tables..basically alot of math and financial based analysis on a large dataset). So is there any design best practices to avoid heavy I/O's? I think memcached stores a everything in memory and accesses it from there, is there a way to get mysql or other scripts to do the same?
I am running the scripts fine right now on another host with 2 gigs of ram, but the ec2 instance I was looking at had about 8 gigs so I was wondering if I could use the extra memory to save me some money.

By IO I assume you mean disk IO... and assuming you can fit everything into memory comfortably. You could:
Disable swap on your box†
Use mysql MEMORY tables while you are processing, (or perhaps consider using an Sqlite3 in memory store if you are only using the database for the convenience of SQL queries)
Also: unless you are using EBS I didn't think Amazon charged for IO on your instance. EBS is much slower than your instance storage so only use it when you need the persistance, ie. not while you are crunching data.
†probably bad idea

You didn't really specify whether it was writes or reads. My guess is that you can do it all in a mysql instance in a ramdisc (tmpfs under Linux).
Operations such as ALTER TABLE and copying big data around end up creating a lot of IO requests because they move a lot of data. This is not the same as if you've just got a lot of random (or more predictable queries).
If it's a batch operation, maybe you can do it entirely in a tmpfs instance.
It is possible to run more than one mysql instance on the machine, it's pretty easy to start up an instance on a tmpfs - just use mysql_install_db with datadir in a tmpfs, then run mysqld with appropriate params. Stick that in some shell scripts and you'll get it to start up. As it's in a ramfs, it won't need to use much memory for its buffers - just set them fairly small.

How to measure Django cache performance?

I have a rather small (ca. 4.5k pageviews a day) website running on Django, with PostgreSQL 8.3 as the db.
I am using the database as both the cache and the sesssion backend. I've heard a lot of good things about using Memcached for this purpose, and I would definitely like to give it a try. However, I would like to know exactly what would be the benefits of such a change: I imagine that my site may be just not big enough for the better cache backend to make a difference. The point is: it wouldn't be me who would be installing and configuring memcached, and I don't want to waste somebody's time for nothing or very little.
How can I measure the overhead introduced by using the db as the cache backend? I've looked at django-debug-toolbar, but if I understand correctly it isn't something you'd like to put on a production site (you have to set DEBUG=True for it to work). Unfortunately, I cannot quite reproduce the production setting on my laptop (I have a different OS, CPU and a lot more RAM).
Has anyone benchmarked different Django cache/session backends? Does anybody know what would be the performance difference if I was doing, for example, one session-write on every request?

At my previous work we tried to measure caching impact on site we was developing. On the same machine we load-tested the set of 10 pages that are most commonly used as start pages (object listings), plus some object detail pages taken randomly from the pool of ~200000. The difference was like 150 requests/second to 30000 requests/second and the database queries dropped to 1-2 per page.
What was cached:
sessions
lists of objects retrieved for each individual page in object listing
secondary objects and common content (found on each page)
lists of object categories and other categorising properties
object counters (calculated offline by cron job)
individual objects
In general, we used only low-level granular caching, not the high-level cache framework. It required very careful design (cache had to be properly invalidated upon each database state change, like adding or modifying any object).

The DiskCache project publishes Django cache benchmarks comparing local memory, Memcached, Redis, file based, and diskcache.DjangoCache. An added benefit of DiskCache is that no separate process is necessary (unlike Memcached and Redis). Instead cache keys and small values are memory-mapped into the Django process memory. Retrieving values from the cache is generally faster than Memcached on localhost. A number of settings control how much data is kept in memory; the rest being paged out to disk.

Short answer : If you have enougth ram, memcached will be always faster. You can't really benchhmark memcached vs. database cache, just keep in mind that the big bottleneck with servers is disk access, specially write access.
Anyway, disk cache is better if you have many objects to cache and long time expiration. But for this situation, if you want gig performances, it is better to generate your pages statically with a python script and deliver them with ligthtpd or nginx.
For memcached, you could adjust the amount of ram dedicated to the server.

Just try it out. Use firebug or a similar tool and run memcache with a bit of RAM allocation (e.g. 64mb) on the test server.
Mark your average loading results seen in firebug without memcache, then turn caching on and mark new results. That's done as easy as it said.
The results usually shocks people, because the perfomance raises up very nicely.

Use django-debug-toolbar to see how much time has been saved on SQL query

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.