Call methods for objects retrived from Memcached - python

Having some object that I need shared between two python processes on the same machine , I use python memcached to store objects in a process and use them in the other process. However even if they can be retrieved in the process that wants to use them, calling their methods returns nothing (not the expected result).
Should someone the necessary input, what happens in such a described scenario
- is the object passed entirely and both process share it's state ? , or only a copy is passed and changes made by one process are not visible in the other
- what other techniques / libraries could be employed in order to achieve sharing objects between python processes.
Thanks

Memcached stores string values and not rich objects as you describe.
What you describe sounds like a publisher-subscriber application - one process produces data and the other process acts on it.
Memcached is not the right tool for pub/sub. You need a list that you can atomically add to and pop from. While you can store a string representation of a list in Memcached, you cannot add and remove elements from this list atomically amongst processes.
Have a look at Redis. It has rich data structures that are ideal for pub/sub and mature Python clients to use.
For a simple implementation you could have your one process add (LPUSH) data to a list. The other process could pop (RPOP) data off this list and run a method on it. This would give you a good IPC queue that scales well.
If you need more sophistication then look at the Redis pub/sub docs - it has a messaging system that eliminates the need to poll a queue for new values.

Related

Fastest way to share a very large dict between multiple processes without copying

TL;DR: How to share a large (200MB) read only dict between multiple processes in a performant way, that is accessed VERY heavily without each process having a full copy in memory.
EDIT: It looks like if I just pass the dictionary as the argument for the multiprocessing.Pool/Process, it won't actually create a copy unless a worker modifies the dictionary. I just assumed it would copy. This behavior seems to be Unix only where fork is available and even then not always. But if so, it should solve my problem until this is converted to an ETL job.
What I'm trying to do:
I have a task to improve a script that replicates data from one store to another. Normalizing and transforming the data on the way. This task works on the scale of around 100 million documents coming from the source document store that get rolled up and pushed to another destination document store.
Each document has an ID and there is another document store is that essentially a key value store of those ID's mapped to some additional information needed for this task. This store is a lot smaller and doing queries against it while document from the main store come through, is not really an option without heavy caching and that heavy cache ends up being a copy of the whole thing very quickly. I just create the whole dictionary dictionary from that entire store at beginning before starting anything and use that. That dictionary is around ~200MB in size. Note that this dictionary is only ever read from.
For this I have setup multiprocessing and have around 30 concurrent processes. I've divided the work for each process such that each hit a different indices and can do the whole thing in around 4 hours.
I have noticed that I am extremely CPU bound when doing the following 2 things:
Using a thread pool/threads (what i'm currently doing) so each thread can access the dict without issue. The GIL is killing me and I have one process maxing out at 100% all the time with other CPU's sitting idle. Switching to PyPy helped a lot, but i'm still not happy with this approach.
Creating a Multiprocessing.Manager().dict() for the large dict and having the child processes access through that. The server process that this approach creates is constantly at 100% cpu. I don't know why, as I only ever read from this dictionary so I doubt it's a locking thing. I don't know how the Manager works internally but i'm guessing that the child processes are connecting via Pipes/Sockets for each fetch and the overhead of this is massive. It also suggests that using Reddis/Memcache will have the same problem if true. Maybe it can be configured better?
I am Memory bound when doing these things:
Using a SharedMemory view. You can't seem to do this for dicts like I need to. I can serialize the dict to get into the shared view, but for it to be usable on the Child process you need serialize the data to an actual usable dict which creates the copy in the process.
I strongly suspect that unless I've missed something I'm just going to have to "download more ram" or rewrite from Python into something without a GIL (or use ETL like it should be done in...).
In the case of ram, what is the most efficient way to store a dict like this to make it sting less? It's currently a standard dict mapped to a tuple of the extra information consisting of 3 long/float.
doc_to_docinfo = {
"ID1": (5.2, 3.0, 455),
}
Are there any more efficient hashmap implementations for this use case than what i'm doing?
You seem to have a similar problem that I have. It is possible to use my source here to create a partitioning of those dictionary-keys per thread. My suggestion: Split the document IDs into partitions of length 3 or 4, keep the partition table in sync for all processes/threads and then just move the parts of your documents to each process/thread and as an entrypoint the process does a dictionary lookup and finds out which process can handle the part of that dictionary. If you are clever with balancing the partitions, you could also have an equal amount of documents per thread managed.

Share state between threads in bottle

In my Bottle app running on pythonanywhere, I want objects to be persisted between requests.
If I write something like this:
X = {'count': 0}
#route('/count')
def count():
X['count'] += 1
tpl = SimpleTemplate('Hello {{count}}!')
return tpl.render(count=X['count'])
The count increments, meaning that X persists between requests.
I am currently running this on pythonanywhere, which is a managed service where I have no control over the web server (nginx I presume?) threading, load balancing (if any) etc...
My question is, is this coincidence because it's only using one thread while on minimal load from me doing my tests?
More generally, at which point will this stop working? E.g. I have more than one thread/socket/instance/load-balanced server etc...?
Beyond that, what is my best options to make something like this work (sticking to Bottle) even if I have to move to a barebones server.
Here's what Bottle docs have to say about their request object:
A thread-safe instance of LocalRequest. If accessed from within a request callback, this instance always refers to the current request (even on a multi-threaded server).
But I don't fully understand what that means, or where global variables like the one I used stand with regards to multi-threading.
TL;DR: You'll probably want to use an external database to store your state.
If your application is tiny, and you're planning to always have exactly one server process running, then your current approach can work; "all" you need to do is acquire a lock around every (!) access to the shared state (the dict X in your sample code). (I put "all" in scare quotes there because it's likely to become more complicated than it sounds at first.)
But, since you're asking about multithreading, I'll assume that your application is more than a toy, meaning that you plan to receive substantial traffic and/or want to handle multiple requests concurrently. In this case, you'll want multiple processes, which means that your approach--storing state in memory--cannot work. Memory is not shared across processes. The (general) way to share state across processes is to store the state externally, e.g. in a database.
Are you familiar with Redis? That'd be on my short list of candidates.
I go the answers by contacting PythonAnywhere support, who had this to say:
When you run a website on a free PythonAnywhere account, just
one process handles all of your requests -- so a global variable like
the one you use there will be fine. But as soon as you want to scale
up, and get (say) a hacker account, then you'll have multiple processes
(not, not threads) -- and of course each one will have its own global
variables, so things will go wrong.
So that part deals with the PythonAnywhere specifics on why it works, and when it would stop working on there.
The answer to the second part, about how to share variables between multiple Bottle processes, I also got from their support (most helpful!) once they understood that a database would not work well in this situation.
Different processes cannot of course share variables, and the most viable solution would be to:
write your own kind of caching server to handle keeping stuff in memory [...] You'd have one process that ran all of the time, and web API requests would access it somehow (an internal REST API?). It could maintain stuff in memory [...]
Ps: I didn't expect other replies to tell me to store state in a database, I figured that the fact I'm asking this means I have a good reason not to use a database, apologies for time wasted!

Passing variables between two python processes

I am intended to make a program structure like below
PS1 is a python program persistently running. PC1, PC2, PC3 are client python programs. PS1 has a variable hashtable, whenever PC1, PC2... asks for the hashtable the PS1 will pass it to them.
The intention is to keep the table in memory since it is a huge variable (takes 10G memory) and it is expensive to calculate it every time. It is not feasible to store it in the hard disk (using pickle or json) and read it every time when it is needed. The read just takes too long.
So I was wondering if there is a way to keep a python variable persistently in the memory, so it can be used very fast whenever it is needed.
You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set, that is expensive to build
different processes need to use the dataset
performance questions do not allow to simply read the full set from permanent storage
IMHO, we are exactly facing what databases were created for. For common use cases, having many processes all using their own copy of a 10G object is a memory waste, and the common way is that one single process have the data, and the others send requests for the data. You did not describe your problem enough, so I cannot say if the best solution will be:
a SQL database like PostgreSQL or MariaDB - as they can cache, if you have enough memory, all will be held automatically in memory
a NOSQL database (MongoDB, etc.) if your only (or main) need is single key access - very nice when dealing with lot of data requiring fast but simple access
a dedicated server using a dedicate query languages if your needs are very specific and none of the above solutions meet them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be fastest provided:
all clients make read-only accesses - it can be extended to r/w accesses but could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do you will lose all the cache optimizations that real databases implement
the size of the database and the number of client process and the external load of the whole system never increase to a level where you fall in the swapping problem above
TL/DR: My advice is to experiment what are the performances with a good quality database and optionaly a dedicated chache. Those solution allow almost out of the box load balancing on different machines. Only if that does not work carefully analyze the memory requirements and be sure to document the limits in number of client processes and database size for future maintenance and use shared memory - read-only data being an hint that shared memory can be a nice solution
In short, to accomplish what you are asking about, you need to create a byte array as a RawArray from the multiprocessing.sharedctypes module that is large enough for your entire hashtable in the PS1 server, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which can then inherit access to the RawArray. You can create your own class of object that provides the hashtable interface through which the individual variables in the table are accessed that can be separately passed to each of the PC# processes that reads from the shared RawArray.

Keep (specific) celery result objects in NoSQL backend for independent usage

This is a rather specific question to advanced users of celery. Let me explain the use case I have:
Usecase
I have to run ~1k-100k tasks that will run a simulation (movie) and return the data of the simulation as a rather large list of smaller objects (frames), say 10k-100k per frame and 1k frames. So the total amount of data produced will be very large, but assume that I have a database that can handle this. Speed is not a key factor here. Later I need to compute features from each frame which can be done completely independent.
The frames look like a dict that point to some numpy arrays and other simple data like strings and numbers and have a unique identifier UUID.
Important is that the final objects of interest are arbitrary joins and splits of these generated lists. As a metaphor consider the result movies be chop and recombined into new movies. These final lists (movies) are then basically a list of references to the frames using their UUIDs.
Now, I consider using celery to get these first movies and since these will end up in the backend DB anyway I might just keep these results indefinitely, at least the ones I specify to keep.
My question
Can I configure a backend, preferably a NonSQL DB, in a way to keep the results and access these later independent from Celery using the objects UUID. And if so does that make sense because of overhead and performance, etc.
Another possibility would be to not return anything and let the worker store the result in a DB. Is that preferred? It seems unnecessary to have a second channel of communication to another DB when Celery can do this already.
I am also interested in comments on using Celery in general for highly independent tasks that run long (>1h) and return large result objects. A fail is not problematic and can just be restarted. The resulting movies are stochastic! so functional approaches can be problematic. Even storing the random seed might not garantuee reproducible results! although I do not have side-effects. I just might have lots of workers available that are widely distributed. Imagine lots of desktop machines in a closed environment where every worker helps even if it is slow. Network speed and security is not an issue here. I know that this is not the original use case, but it seemed very easy to use it for these cases. The best analogy I found are projects like Folding#Home.
Can I configure a backend, preferably a NonSQL DB, in a way to keep the results and access these later independent from Celery using the objects UUID.
Yes, you can configure celery to store its results in a NoSQL database such as redis for access by UUID later. The two settings that will control the behavior of interest for you are result_expires and result_backend.
result_backend will specify which NoSQL database you want to store your results in (e.g., elasticsearch or redis) while result_expires will specify how long after a task completes that the task's result will be available for access.
After the task completes, you can access the results in python like this:
from celery.result import AsyncResult
result = task_name.delay()
print result.id
uuid = result.id
checked_result = AsyncResult(uuid)
# and you can access the result output here however you'd like
And if so does that make sense because of overhead and performance, etc.
I think this strategy makes perfect sense. I have typically used this a number of times when generating long-running reports for web users. The initial post will return the UUID from the celery task. The web client can poll the app sever via javascript using the UUID to see if the task is ready/complete. Once the report is ready, the page can redirect the user to the route that will allow the user to download or view the report by passing in the UUID.

Set / get objects with memcached

In a Django Python app, I launch jobs with Celery (a task manager). When each job is launched, they return an object (lets call it an instance of class X) that lets you check on the job and retrieve the return value or errors thrown.
Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at the same time, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects because I cannot use a global variable (a dictionary that allows me to look up each X objects from a key); this is because Celery uses different processes, not just different threads, so each would modify its own copy of the global table, causing mayhem.
Subsequently, I received the great advice to use memcached to share the memory across the tasks. I got it working and was able to set and get integer and string values between processes.
The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for classes. This is my best guess: Perhaps under the hood memcached serializes objects to the shared memory; class X (understandably) cannot be serialized because it points at live data (the status of the job), and so the serial version may be out of date (i.e. it may point to the wrong place) when it is loaded again.
Attempts to use a SQLite database were similarly fruitless; not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath), so that they update as the jobs finish or fail.
My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought may be bad style (and potentially insecure).
How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?
Thanks a lot for your help, I'm quite lost.
I think django-celery does this work for you. Did you had a look at the tables made by django-celery? I.e. djcelery_taskstate holds all data for a given task like state, worker_id and so on. For periodic tasks there is a table called djcelery_periodictask.
In a Django view you can access the TaskMeta object:
from djcelery.models import TaskMeta
task = TaskMeta.objects.get(task_id=task_id)
print task.status

Categories