Difference between memcache and python dictionary

In my current project I am using Memcached to store key-value pairs, but the communication over the socket between my process and Memcached is causing high latencies. We went with memcached because we needed to store a large number of key-value pairs. Now I want to store the dictionary as a global data structure in my process instead. Is that a good idea? The dictionary would live in the process's address space. Suggestions please....

The usual reason to use memcached is that you would like to distribute the cache among multiple machines, with the goal of both having data available on all the machines, while also utilizing the storage of all the machines. If those requirements don't apply to you, and you only need the cached data on a single machine, then memcached doesn't offer you all that much. In that case, moving the dictionary into your local process might be a good idea.
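To make that concrete, here is a minimal sketch of an in-process cache: just a module-level dictionary with a get-or-compute helper (the names are illustrative, not from any library):

```python
_cache = {}  # lives in this single process's address space

def get_or_compute(key, compute):
    # Return the cached value, computing and storing it on a miss.
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]

print(get_or_compute("answer", lambda: 6 * 7))  # computed once
print(get_or_compute("answer", lambda: 6 * 7))  # served straight from the dict
```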

I wrote a thorough answer to this on the memcached "about" page. I drew pictures and everything.
In summary: If you have more than one process, the dictionary won't help you. And if you have more than one process/computer, you're going to be burning tons of memory that could be put to better use - saving you money and letting you cache far more data.

If your data is not too big, you can simply dump your Python dictionary to a file with cPickle.dump or marshal.dump and reload it from the file with cPickle.load or marshal.load. If you need to worry about disk space, you can use bz2 or gzip to compress/decompress during the file write/read.
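For example, a minimal sketch of that approach, combining pickle with gzip compression (the file name and data are illustrative):

```python
import gzip

try:
    import cPickle as pickle  # Python 2, as mentioned above
except ImportError:
    import pickle             # Python 3

cache = {"user:1": "alice", "user:2": "bob"}  # illustrative data

# Dump the dictionary, compressed, to disk.
with gzip.open("cache.pkl.gz", "wb") as f:
    pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)

# Reload it later.
with gzip.open("cache.pkl.gz", "rb") as f:
    cache = pickle.load(f)
```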

Related

Passing variables between two python processes

I intend to build a program structured like the one below.
PS1 is a Python program that runs persistently. PC1, PC2, PC3 are client Python programs. PS1 holds a hashtable variable; whenever PC1, PC2, ... ask for the hashtable, PS1 passes it to them.
The intention is to keep the table in memory, since it is a huge variable (it takes 10G of memory) and is expensive to calculate every time. It is not feasible to store it on the hard disk (using pickle or json) and read it every time it is needed. The read just takes too long.
So I was wondering if there is a way to keep a Python variable persistently in memory, so it can be used very quickly whenever it is needed.
You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set, that is expensive to build
different processes need to use the dataset
performance constraints do not allow simply reading the full set from permanent storage
IMHO, we are facing exactly what databases were created for. For common use cases, having many processes each using their own copy of a 10G object is a waste of memory, and the common way is for one single process to hold the data while the others send requests for it. You did not describe your problem in enough detail, so I cannot say whether the best solution will be:
a SQL database like PostgreSQL or MariaDB - as they can cache, if you have enough memory, all will be held automatically in memory
a NOSQL database (MongoDB, etc.) if your only (or main) need is single key access - very nice when dealing with lot of data requiring fast but simple access
a dedicated server using a dedicated query language, if your needs are very specific and none of the above solutions meet them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be fastest provided:
all clients make read-only accesses - it can be extended to r/w accesses but could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do you will lose all the cache optimizations that real databases implement
the size of the database, the number of client processes, and the external load on the whole system never increase to a level where you run into the swapping problem above
TL/DR: My advice is to first experiment and measure performance with a good-quality database and, optionally, a dedicated cache. Those solutions allow almost out-of-the-box load balancing across different machines. Only if that does not work should you carefully analyze the memory requirements, document the limits on the number of client processes and on database size for future maintenance, and use shared memory - read-only data being a hint that shared memory can be a nice solution.
In short, to accomplish what you are asking about, you need to create a byte array as a RawArray from the multiprocessing.sharedctypes module that is large enough for your entire hashtable in the PS1 server, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which then inherit access to the RawArray. You can create your own class that provides the hashtable interface, reading from the shared RawArray, and pass an instance of it to each of the PC# processes - see the sketch below.
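A minimal sketch of that layout, using pickle for simplicity (all names here are illustrative; a real zero-copy hashtable interface would index into the shared bytes directly instead of deserializing a copy):

```python
import ctypes
import pickle
from multiprocessing import Process
from multiprocessing.sharedctypes import RawArray

def client(shared, size):
    # Each "PC" client rebuilds the table from the shared bytes.
    # (pickle.loads copies on read; a zero-copy interface would
    # read fields straight out of the buffer instead.)
    table = pickle.loads(shared[:size])
    print("client sees:", table["answer"])

if __name__ == "__main__":
    # Stand-in for the expensive 10G table; tiny here for illustration.
    table = {"answer": 42, "greeting": "hello"}
    payload = pickle.dumps(table, protocol=pickle.HIGHEST_PROTOCOL)

    # "PS1" allocates the shared memory once and copies the data in.
    shared = RawArray(ctypes.c_char, len(payload))
    shared[:] = payload

    # PS1 launches the clients, which inherit access to the RawArray.
    procs = [Process(target=client, args=(shared, len(payload)))
             for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```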

How do I maintain a cache in python

I have a requirement like this:
The first time I run the process, I need to set a=1.
For every subsequent run of the same process, I need to set a=2.
Is it possible to maintain a cache that tells the process it has been run before?
I don't want another physical file to be created in my directory structure.
I searched the internet, but always found caches that live within the process.
Thanks in advance
The ways to preserve data between totally separate executions of a process are:
Saving a file.
Handing the data off to another process such as a Memcached or Redis instance, or a database, which will keep the data in memory and/or write it to disk somewhere.
Recording the data in some other, more unusual way such as changing the environment of the running operating system, printing out the data or otherwise displaying it so that the human operator can keep track of it, or something like that.
When you use the word 'cache' and state that you do not wish to write the data to disk, the first thing that comes to mind is memcached or some other in-memory cache. But any file-based solution will certainly be less complex than setting up and maintaining an in-memory key-value store.
Which solution you choose depends in part on what 'second time' means. Second time ever? Second time ever on a given computer? Second time since reboot? Since manual reset? Different methods of recording data are suited to different storage requirements.
If your application is really just a=1 versus a=2, using a file is as good as anything; otherwise consult http://docs.python.org/2/library/persistence.html for other persistence methods.
Data cached inside the process dies along with the process. You'll have to cache this info elsewhere since you want it to persist longer than the process lives. A file seems reasonable.
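A minimal sketch of that file-based approach, using a marker file to record that the process has run before (the path is an assumption):

```python
import os

MARKER = os.path.expanduser("~/.myapp_has_run")  # hypothetical location

if not os.path.exists(MARKER):
    a = 1                      # first ever run for this user
    open(MARKER, "w").close()  # leave the marker behind for next time
else:
    a = 2                      # every subsequent run

print(a)
```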

How much data can memcached handle efficiently?

How large values can I store and retrieve from memcached without degrading its performance?
I am using memcached with python-memcached in a django based web application.
Read this one:
https://groups.google.com/forum/?fromgroups=#!topic/memcached/IaMLUeOGxWk
You should not "store" anything in memcached.
Memcached is more or less limited only by the available (free) memory across the servers you run it on. The more memory, the more data fits, and since it uses fairly efficient in-memory indexes, you won't really see performance degrade in any significant way as the number of objects grows.
Remember though, it's a cache, and there is no guarantee that you'll be able to retrieve what you put in. More memory will make memcached keep more data around, but there is no guarantee that it won't throw data away, even when memory is available, if it somehow decides that is a better idea.
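For reference, a minimal sketch of using python-memcached (the key names and TTLs are illustrative; note that memcached's default item size limit is 1 MB per value):

```python
import memcache  # pip install python-memcached

mc = memcache.Client(["127.0.0.1:11211"])

# Store a rendered page fragment with a 5-minute expiry.
mc.set("page:home", "<html>...</html>", time=300)

html = mc.get("page:home")
if html is None:
    # Cache miss: memcached may have evicted the entry at any time,
    # so always be prepared to regenerate the value.
    html = "<html>...</html>"  # regenerate however your app does it
    mc.set("page:home", html, time=300)
```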

Alternatives to mincemeatpy and octopy

Briefly, octopy and mincemeatpy are light-weight Python implementations of map-reduce, where clients can join the cluster in an ad-hoc manner without requiring any installation (except, of course, Python). Here are the project details: OCTOPY and Mincemeatpy.
The problem with these is that they need to hold all the data in memory (including intermediate key-value pairs). So even for moderately sized data, they throw out-of-memory exceptions.
The key reasons I'm using them are:
Python.
No cluster installation required.
I just prototype, and I can directly port the algorithm once I'm ready.
So my question is: is there any package that handles the same kind of work, but not purely in memory (i.e., one that can handle moderately sized data)?
Try PyMapReduce. It runs on your own machine, but across several processes - so you don't need to set up a master/node architecture - and it has plenty of runners, for example DiskBasedRunner, which seems to store map data in temp files and reduce them afterwards - roughly the idea sketched below.
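This is not PyMapReduce's API - just a hedged sketch of the disk-based idea it describes: map workers spill intermediate pairs to temp files instead of holding them in RAM, and the reduce phase streams them back in:

```python
import json
import tempfile
from multiprocessing import Pool

def map_words(chunk):
    # Spill intermediate (word, 1) pairs to a temp file; return its path.
    f = tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False)
    for word in chunk.split():
        f.write(json.dumps([word, 1]) + "\n")
    f.close()
    return f.name

def reduce_files(paths):
    counts = {}
    for path in paths:
        with open(path) as f:
            for line in f:  # stream; never hold all pairs in memory
                word, n = json.loads(line)
                counts[word] = counts.get(word, 0) + n
    return counts

if __name__ == "__main__":
    chunks = ["a b a", "b c", "a c c"]
    with Pool() as pool:
        paths = pool.map(map_words, chunks)
    print(reduce_files(paths))  # {'a': 3, 'b': 2, 'c': 3}
```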

How to measure Django cache performance?

I have a rather small (ca. 4.5k pageviews a day) website running on Django, with PostgreSQL 8.3 as the db.
I am using the database as both the cache and the session backend. I've heard a lot of good things about using Memcached for this purpose, and I would definitely like to give it a try. However, I would like to know exactly what the benefits of such a change would be: I imagine that my site may just not be big enough for the better cache backend to make a difference. The point is: it wouldn't be me who would be installing and configuring memcached, and I don't want to waste somebody's time for nothing or very little.
How can I measure the overhead introduced by using the db as the cache backend? I've looked at django-debug-toolbar, but if I understand correctly it isn't something you'd like to put on a production site (you have to set DEBUG=True for it to work). Unfortunately, I cannot quite reproduce the production setting on my laptop (I have a different OS, CPU and a lot more RAM).
Has anyone benchmarked different Django cache/session backends? Does anybody know what would be the performance difference if I was doing, for example, one session-write on every request?
At my previous job we tried to measure the impact of caching on a site we were developing. On the same machine we load-tested a set of 10 pages most commonly used as start pages (object listings), plus some object detail pages taken randomly from a pool of ~200000. The difference was roughly 150 requests/second versus 30000 requests/second, and database queries dropped to 1-2 per page.
What was cached:
sessions
lists of objects retrieved for each individual page in object listing
secondary objects and common content (found on each page)
lists of object categories and other categorising properties
object counters (calculated offline by cron job)
individual objects
In general, we used only low-level granular caching, not the high-level cache framework. It required very careful design (the cache had to be properly invalidated upon each database state change, such as adding or modifying any object) - see the sketch below.
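A hedged sketch of that kind of low-level granular caching with Django's cache API (the model and key names are illustrative, not from the original site):

```python
from django.core.cache import cache
from myapp.models import Article  # hypothetical model

LIST_KEY = "articles:front_page"

def get_front_page_articles():
    # Cache an individual object list, as described above.
    articles = cache.get(LIST_KEY)
    if articles is None:
        articles = list(Article.objects.order_by("-created")[:10])
        cache.set(LIST_KEY, articles, 300)  # short TTL as a backstop
    return articles

def save_article(article):
    article.save()
    cache.delete(LIST_KEY)  # invalidate on every database state change
```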
The DiskCache project publishes Django cache benchmarks comparing local memory, Memcached, Redis, file based, and diskcache.DjangoCache. An added benefit of DiskCache is that no separate process is necessary (unlike Memcached and Redis). Instead cache keys and small values are memory-mapped into the Django process memory. Retrieving values from the cache is generally faster than Memcached on localhost. A number of settings control how much data is kept in memory; the rest being paged out to disk.
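If you want to try it, configuring diskcache.DjangoCache is roughly a settings change, per the DiskCache documentation (the directory path is an assumption):

```python
# settings.py
CACHES = {
    "default": {
        "BACKEND": "diskcache.DjangoCache",
        "LOCATION": "/var/tmp/django_cache",  # assumed cache directory
    }
}
```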
Short answer: if you have enough RAM, memcached will always be faster. You can't really benchmark memcached vs. a database cache; just keep in mind that the big bottleneck with servers is disk access, especially write access.
Anyway, a disk cache is better if you have many objects to cache and long expiration times. But in that situation, if you want top performance, it is better to generate your pages statically with a Python script and deliver them with lighttpd or nginx.
For memcached, you can adjust the amount of RAM dedicated to the server.
Just try it out. Use Firebug or a similar tool, and run memcached with a modest RAM allocation (e.g. 64 MB) on the test server.
Note your average load times seen in Firebug without memcached, then turn caching on and note the new results. It's as easy as it sounds.
The results usually shock people, because performance improves very noticeably.
Use django-debug-toolbar to see how much time has been saved on SQL queries.
