Instance attributes do not persist using multiprocessing - python

I'm having an issue with instances not retaining changes to attributes, or even keeping new attributes that are created. I think I've narrowed it down to the fact that my script takes advantage of multiprocessing, and I'm thinking that changes occurring to instances in separate process threads are not 'remembered' when the script returns to the main thread.
Basically, I have several sets of data which I need to process in parallel. The data is stored as an attribute, and is altered via several methods in the class. At the conclusion of processing, I'm hoping to return to the main thread and concatenate the data from each of the object instances. However, as described above, when I try to access the instance attribute with the data after the parallel processing bit is done, there's nothing there. It's as if any changes enacted during the multiprocessing bit are 'forgotten'.
Is there an obvious solution to fix this? Or do I need to rebuild my code to instead return the processed data rather than just altering/storing it as an instance attribute? I guess an alternative solution would be to serialize the data, and then re-read it in when necessary, rather than just keeping it in memory.
Something maybe worth noting here is that I am using the pathos module rather than python's multiprocessingmodule. I was getting some errors pertaining to pickling, similar to here: Python multiprocessing PicklingError: Can't pickle <type 'function'>. My code is broken across several modules and as mentioned, the data processing methods are contained within a class.
Sorry for the wall of text.
EDIT
Here's my code:
import importlib
import pandas as pd
from pathos.helpers import mp
from provider import Provider
# list of data providers ... length is arbitrary
operating_providers = ['dataprovider1', 'dataprovider2', 'dataprovider3']
# create provider objects for each operating provider
provider_obj_list = []
for name in operating_providers:
loc = 'providers.%s' % name
module = importlib.import_module(loc)
provider_obj = Provider(module)
provider_obj_list.append(provider_obj)
processes = []
for instance in provider_obj_list:
process = mp.Process(target = instance.data_processing_func)
process.daemon = True
process.start()
processes.append(process)
for process in processes:
process.join()
# now that data_processing_func is complete for each set of data,
# stack all the data
stack = pd.concat((instance.data for instance in provider_obj_list))
I have a number of modules (their names listed in operating_providers) that contain attributes specific to their data source. These modules are iteratively imported and passed to new instances of the Provider class, which I created in a separate module (provider). I append each Provider instance to a list (provider_obj_list), and then iteratively create separate processes which call the instance method instance.data_processing_func. This function does some data processing (with each instance accessing completely different data files), and creates new instance attributes along the way, which I need to access when the parallel processing is complete.
I tried using multithreading instead, rather than multiprocessing -- in this case, my instance attributes persisted, which is what I want. However, I am not sure why this happens -- I'll have to study the differences between threading vs. multiprocessing.
Thanks for any help!

Here's some sample code showing how to do what I outlined in comment. I can't test it because I don't have provider or pathos installed, but it should give you a good idea of what I suggested.
import importlib
from pathos.helpers import mp
from provider import Provider
def process_data(loc):
module = importlib.import_module(loc)
provider_obj = Provider(module)
provider_obj.data_processing_func()
if __name__ == '__main__':
# list of data providers ... length is arbitrary
operating_providers = ['dataprovider1', 'dataprovider2', 'dataprovider3']
# create list of provider locations for each operating provider
provider_loc_list = []
for name in operating_providers:
loc = 'providers.%s' % name
provider_loc_list.append(loc)
processes = []
for loc in provider_loc_list:
process = mp.Process(target=process_data, args=(loc,))
process.daemon = True
process.start()
processes.append(process)
for process in processes:
process.join()

Related

Python: Pass Readonly Shelve to subprocesses

As discussed here: Python: Multiprocessing on Windows -> Shared Readonly Memory I have a heavily parallelized task.
Multiple workers do some stuff and in the end need to access some keys of a dictionary which contains several millions of key:value combinations. The keys which will be accessed, are only known within the worker after some further action also involving some file-processing (the example below is just for demonstration purposes, hence simplified in that way).
Before, my solution was to keep this big dictionary in memory, pass it once into shared memory and access it by the single workers. But it consumes a lot of RAM... So I wanted to use shelve (because the values of that dictionary are again dicts or lists).
So a simplified example of what I tried was:
def shelveWorker(tupArgs):
id, DB = tupArgs
return DB[id]
if __name__ == '__main__':
DB = shelve.open('file.db', flag='r', protocol=2)
joblist = []
for id in range(10000):
joblist.append((str(id), DB))
p = multiprocessing.Pool()
for returnValue in p.imap_unordered(shelveWorker, joblist):
# do something with returnValue
pass
p.close()
p.join()
Unfortunately I get:
"TypeError: can't pickle DB objects"
But IMHO it does not make any sense to open the shelve itself (DB = shelve.open('file.db', flag='r', protocol=2)) within each worker on its own because of slower runtime (I have several thousand workers).
How to go about it?
Thanks!

Taking advantage of fork system call to avoid read/writing or serializing altogether?

I am using mac book and therefore, multiprocessing will use fork system call instead of spawning a new process. Also, I am using Python (with multiprocessing or Dask).
I have a very big pandas dataframe. I need to have many parallel subprocesses work with a portion of this one big dataframe. Let's say I have 100 partitions of this table that needs to be worked on in parallel. I want to avoid having to need to make 100 copies of this big dataframe as that will overwhelm memory. So the current approach I am taking is to partition it, save each partition to disk, and have each process read them in to process the portion each of them are responsible for. But this read/write is very expensive for me, and I would like to avoid it.
But if I make one global variable of this dataframe, then due to COW behavior, each process will be able to read from this dataframe without making an actual physical copy of it (as long as it does not modify it). Now the question I have is, if I make this one global dataframe and name it:
global my_global_df
my_global_df = one_big_df
and then in one of the subprocess I do:
a_portion_of_global_df_readonly = my_global_df.iloc[0:10]
a_portion_of_global_df_copied = a_portion_of_global_df_readonly.reset_index(drop=True)
# reset index will make a copy of the a_portion_of_global_df_readonly
do something with a_portion_of_global_df_copied
If I do the above, will I have created a copy of the entire my_global_df or just a copy of the a_portion_of_global_df_readonly, and thereby, in extension, avoided making copies of 100 one_big_df?
One additional, more general question is, why do people have to deal with Pickle serialization and/or read/write to disk to transfer the data across multiple processes when (assuming people are using UNIX) setting the data as global variable will effectively make it available at all child processes so easily? Is there danger in using COW as a means to make any data available to subprocesses in general?
[Reproducible code from the thread below]
from multiprocessing import Process, Pool
import contextlib
import pandas as pd
def my_function(elem):
return id(elem)
num_proc = 4
num_iter = 10
df = pd.DataFrame(np.asarray([1]))
print(id(df))
with contextlib.closing(Pool(processes=num_proc)) as p:
procs = [p.apply_async(my_function, args=(df, )) for elem in range(num_iter)]
results = [proc.get() for proc in procs]
p.close()
p.join()
print(results)
Summarizing the comments, on a forking system such as Mac or Linux, a child process has a copy-on-write (COW) view of the parent address space, including any DataFrames that it may hold. It is safe to use and modify the dataframe in child processes without changing the data in the parent or other sibling child processses.
That means that it is unnecessary to serialize the dataframe to pass it to the child. All you need is the reference to the dataframe. For a Process, you can just pass the reference directly
p = multiprocessing.Process(target=worker_fctn, args=(my_dataframe,))
p.start()
p.join()
If you use a Queue or another tool such as a Pool then the data will likely be serialized. You can use a global variable known to the worker but not actually passed to the worker to get around that problem.
What remains is the return data. It is in the child only and still needs to be serialized to be returned to the parent.

Does multiprocessing copy the object in this scenario?

import multiprocessing
import numpy as np
import multiprocessing as mp
import ctypes
class Test():
def __init__(self):
shared_array_base = multiprocessing.Array(ctypes.c_double, 100, lock=False)
self.a = shared_array = np.ctypeslib.as_array(shared_array_base)
def my_fun(self,i):
self.a[i] = 1
if __name__ == "__main__":
num_cores = multiprocessing.cpu_count()
t = Test()
def my_fun_wrapper(i):
t.my_fun(i)
with mp.Pool(num_cores) as p:
p.map(my_fun_wrapper, np.arange(100))
print(t.a)
In the code above, I'm trying to write a code to modify an array, using multiprocessing. The function my_fun(), executed in each process, should modify the value for the array a[:] at index i which is passed to my_fun() as a parameter. With regards to the code above, I would like to know what is being copied.
1) Is anything in the code being copied by each process? I think the object might be but ideally nothing is.
2) Is there a way to get around using a wrapper function my_fun() for the object?
Almost everything in your code is getting copied, except the shared memory you allocated with multiprocessing.Array. multiprocessing is full of unintuitive, implicit copies.
When you spawn a new process in multiprocessing, the new process needs its own version of just about everything in the original process. This is handled differently depending on platform and settings, but we can tell you're using "fork" mode, because your code wouldn't work in "spawn" or "forkserver" mode - you'd get an error about the workers not being able to find my_fun_wrapper. (Windows only supports "spawn", so we can tell you're not on Windows.)
In "fork" mode, this initial copy is made by using the fork system call to ask the OS to essentially copy the whole entire process and everything inside. The memory allocated by multiprocessing.Array is sort of "external" and isn't copied, but most other things are. (There's also copy-on-write optimization, but copy-on-write still behaves as if everything was copied, and the optimization doesn't work very well in Python due to refcount updates.)
When you dispatch tasks to worker processes, multiprocessing needs to make even more copies. Any arguments, and the callable for the task itself, are objects in the master process, and objects inherently exist in only one process. The workers can't access any of that. They need their own versions. multiprocessing handles this second round of copies by pickling the callable and arguments, sending the serialized bytes over interprocess communication, and unpickling the pickles in the worker.
When the master pickles my_fun_wrapper, the pickle just says "look for the my_fun_wrapper function in the __main__ module", and the workers look up their version of my_fun_wrapper to unpickle it. my_fun_wrapper looks for a global t, and in the workers, that t was produced by the fork, and the fork produced a t with an array backed by the shared memory you allocated with your original multiprocessing.Array call.
On the other hand, if you try to pass t.my_fun to p.map, then multiprocessing has to pickle and unpickle a method object. The resulting pickle doesn't say "look up the t global variable and get its my_fun method". The pickle says to build a new Test instance and get its my_fun method. The pickle doesn't have any instructions in it about using the shared memory you allocated, and the resulting Test instance and its array are independent of the original array you wanted to modify.
I know of no good way to avoid needing some sort of wrapper function.

Sharing many queues among processes in Python

I am aware of multiprocessing.Manager() and how it can be used to create shared objects, in particular queues which can be shared between workers. There is this question, this question, this question and even one of my own questions.
However, I need to define a great many queues, each of which is linking a specific pair of processes. Say that each pair of processes and its linking queue is identified by the variable key.
I want to use a dictionary to access my queues when I need to put and get data. I cannot make this work. I've tried a number of things. With multiprocessing imported as mp:
Defining a dict like for key in all_keys: DICT[key] = mp.Queue in a config file which is imported by the multiprocessing module (call it multi.py) does not return errors, but the queue DICT[key] is not shared between the processes, each one seems to have their own copy of the queue and thus no communication happens.
If I try to define the DICT at the beginning of the main multiprocessing function that defines the processes and starts them, like
DICT = mp.Manager().dict()
for key in all_keys:
DICT[key] = mp.Queue()
I get the error
RuntimeError: Queue objects should only be shared between processes through
inheritance
Changing to
DICT = mp.Manager().dict()
for key in all_keys:
DICT[key] = mp.Manager().Queue()
only makes everything worse. Trying similar definitions at the head of multi.py rather than inside the main function returns similar errors.
There must be a way to share many queues between processes without explicitly naming each one in the code. Any ideas?
Edit
Here is a basic schema of the program:
1- load the first module, which defines some variables, imports multi, launches multi.main(), and loads another module which starts a cascade of module loads and code execution. Meanwhile...
2- multi.main looks like this:
def main():
manager = mp.Manager()
pool = mp.Pool()
DICT2 = manager.dict()
for key in all_keys:
DICT2[key] = manager.Queue()
proc_1 = pool.apply_async(targ1,(DICT1[key],) ) #DICT1 is defined in the config file
proc_2 = pool.apply_async(targ2,(DICT2[key], otherargs,)
Rather than use pool and manager, I was also launching processes with the following:
mp.Process(target=targ1, args=(DICT[key],))
3 - The function targ1 takes input data that is coming in (sorted by key) from the main process. It is meant to pass the result to DICT[key] so targ2 can do its work. This is the part that is not working. There are an arbitrary number of targ1s, targ2s, etc. and therefore an arbitrary number of queues.
4 - The results of some of these processes will be sent to a bunch of different arrays / pandas dataframes which are also indexed by key, and which I would like to be accessible from arbitrary processes, even ones launched in a different module. I have yet to write this part and it might be a different question. (I mention it here because the answer to 3 above might also solve 4 nicely.)
It sounds like your issues started when you tried to share a multiprocessing.Queue() by passing it as an argument. You can get around this by creating a managed queue instead:
import multiprocessing
manager = multiprocessing.Manager()
passable_queue = manager.Queue()
When you use a manager to create it, you are storing and passing around a proxy to the queue, rather than the queue itself, so even when the object you pass to your worker processes is a copied, it will still point at the same underlying data structure: your queue. It's very similar (in concept) to pointers in C/C++. If you create your queues this way, you will be able to pass them when you launch a worker process.
Since you can pass queues around now, you no longer need your dictionary to be managed. Keep a normal dictionary in main that will store all the mappings, and only give your worker processes the queues they need, so they won't need access to any mappings.
I've written an example of this here. It looks like you are passing objects between your workers, so that's what's done here. Imagine we have two stages of processing, and the data both starts and ends in the control of main. Look at how we can create the queues that connect the workers like a pipeline, but by giving them only they queues they need, there's no need for them to know about any mappings:
import multiprocessing as mp
def stage1(q_in, q_out):
q_out.put(q_in.get()+"Stage 1 did some work.\n")
return
def stage2(q_in, q_out):
q_out.put(q_in.get()+"Stage 2 did some work.\n")
return
def main():
pool = mp.Pool()
manager = mp.Manager()
# create managed queues
q_main_to_s1 = manager.Queue()
q_s1_to_s2 = manager.Queue()
q_s2_to_main = manager.Queue()
# launch workers, passing them the queues they need
results_s1 = pool.apply_async(stage1, (q_main_to_s1, q_s1_to_s2))
results_s2 = pool.apply_async(stage2, (q_s1_to_s2, q_s2_to_main))
# Send a message into the pipeline
q_main_to_s1.put("Main started the job.\n")
# Wait for work to complete
print(q_s2_to_main.get()+"Main finished the job.")
pool.close()
pool.join()
return
if __name__ == "__main__":
main()
The code produces this output:
Main started the job.
Stage 1 did some work.
Stage 2 did some work.
Main finished the job.
I didn't include an example of storing the queues or AsyncResults objects in dictionaries, because I still don't quite understand how your program is supposed to work. But now that you can pass your queues freely, you can build your dictionary to store the queue/process mappings as needed.
In fact, if you really do build a pipeline between multiple workers, you don't even need to keep a reference to the "inter-worker" queues in main. Create the queues, pass them to your workers, then only retain references to queues that main will use. I would definitely recommend trying to let old queues be garbage collected as quickly as possible if you really do have "an arbitrary number" of queues.

How to share variables across scripts in python?

The following does not work
one.py
import shared
shared.value = 'Hello'
raw_input('A cheap way to keep process alive..')
two.py
import shared
print shared.value
run on two command lines as:
>>python one.py
>>python two.py
(the second one gets an attribute error, rightly so).
Is there a way to accomplish this, that is, share a variable between two scripts?
Hope it's OK to jot down my notes about this issue here.
First of all, I appreciate the example in the OP a lot, because that is where I started as well - although it made me think shared is some built-in Python module, until I found a complete example at [Tutor] Global Variables between Modules ??.
However, when I looked for "sharing variables between scripts" (or processes) - besides the case when a Python script needs to use variables defined in other Python source files (but not necessarily running processes) - I mostly stumbled upon two other use cases:
A script forks itself into multiple child processes, which then run in parallel (possibly on multiple processors) on the same PC
A script spawns multiple other child processes, which then run in parallel (possibly on multiple processors) on the same PC
As such, most hits regarding "shared variables" and "interprocess communication" (IPC) discuss cases like these two; however, in both of these cases one can observe a "parent", to which the "children" usually have a reference.
What I am interested in, however, is running multiple invocations of the same script, ran independently, and sharing data between those (as in Python: how to share an object instance across multiple invocations of a script), in a singleton/single instance mode. That kind of problem is not really addressed by the above two cases - instead, it essentially reduces to the example in OP (sharing variables across two scripts).
Now, when dealing with this problem in Perl, there is IPC::Shareable; which "allows you to tie a variable to shared memory", using "an integer number or 4 character string[1] that serves as a common identifier for data across process space". Thus, there are no temporary files, nor networking setups - which I find great for my use case; so I was looking for the same in Python.
However, as accepted answer by #Drewfer notes: "You're not going to be able to do what you want without storing the information somewhere external to the two instances of the interpreter"; or in other words: either you have to use a networking/socket setup - or you have to use temporary files (ergo, no shared RAM for "totally separate python sessions").
Now, even with these considerations, it is kinda difficult to find working examples (except for pickle) - also in the docs for mmap and multiprocessing. I have managed to find some other examples - which also describe some pitfalls that the docs do not mention:
Usage of mmap: working code in two different scripts at Sharing Python data between processes using mmap | schmichael's blog
Demonstrates how both scripts change the shared value
Note that here a temporary file is created as storage for saved data - mmap is just a special interface for accessing this temporary file
Usage of multiprocessing: working code at:
Python multiprocessing RemoteManager under a multiprocessing.Process - working example of SyncManager (via manager.start()) with shared Queue; server(s) writes, clients read (shared data)
Comparison of the multiprocessing module and pyro? - working example of BaseManager (via server.serve_forever()) with shared custom class; server writes, client reads and writes
How to synchronize a python dict with multiprocessing - this answer has a great explanation of multiprocessing pitfalls, and is a working example of SyncManager (via manager.start()) with shared dict; server does nothing, client reads and writes
Thanks to these examples, I came up with an example, which essentially does the same as the mmap example, with approaches from the "synchronize a python dict" example - using BaseManager (via manager.start() through file path address) with shared list; both server and client read and write (pasted below). Note that:
multiprocessing managers can be started either via manager.start() or server.serve_forever()
serve_forever() locks - start() doesn't
There is auto-logging facility in multiprocessing: it seems to work fine with start()ed processes - but seems to ignore the ones that serve_forever()
The address specification in multiprocessing can be IP (socket) or temporary file (possibly a pipe?) path; in multiprocessing docs:
Most examples use multiprocessing.Manager() - this is just a function (not class instantiation) which returns a SyncManager, which is a special subclass of BaseManager; and uses start() - but not for IPC between independently ran scripts; here a file path is used
Few other examples serve_forever() approach for IPC between independently ran scripts; here IP/socket address is used
If an address is not specified, then an temp file path is used automatically (see 16.6.2.12. Logging for an example of how to see this)
In addition to all the pitfalls in the "synchronize a python dict" post, there are additional ones in case of a list. That post notes:
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
The workaround to dict['key'] getting and setting, is the use of the dict public methods get and update. The problem is that there are no such public methods as alternative for list[index]; thus, for a shared list, in addition we have to register __getitem__ and __setitem__ methods (which are private for list) as exposed, which means we also have to re-register all the public methods for list as well :/
Well, I think those were the most critical things; these are the two scripts - they can just be ran in separate terminals (server first); note developed on Linux with Python 2.7:
a.py (server):
import multiprocessing
import multiprocessing.managers
import logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)
class MyListManager(multiprocessing.managers.BaseManager):
pass
syncarr = []
def get_arr():
return syncarr
def main():
# print dir([]) # cannot do `exposed = dir([])`!! manually:
MyListManager.register("syncarr", get_arr, exposed=['__getitem__', '__setitem__', '__str__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort'])
manager = MyListManager(address=('/tmp/mypipe'), authkey='')
manager.start()
# we don't use the same name as `syncarr` here (although we could);
# just to see that `syncarr_tmp` is actually <AutoProxy[syncarr] object>
# so we also have to expose `__str__` method in order to print its list values!
syncarr_tmp = manager.syncarr()
print("syncarr (master):", syncarr, "syncarr_tmp:", syncarr_tmp)
print("syncarr initial:", syncarr_tmp.__str__())
syncarr_tmp.append(140)
syncarr_tmp.append("hello")
print("syncarr set:", str(syncarr_tmp))
raw_input('Now run b.py and press ENTER')
print
print 'Changing [0]'
syncarr_tmp.__setitem__(0, 250)
print 'Changing [1]'
syncarr_tmp.__setitem__(1, "foo")
new_i = raw_input('Enter a new int value for [0]: ')
syncarr_tmp.__setitem__(0, int(new_i))
raw_input("Press any key (NOT Ctrl-C!) to kill server (but kill client first)".center(50, "-"))
manager.shutdown()
if __name__ == '__main__':
main()
b.py (client)
import time
import multiprocessing
import multiprocessing.managers
import logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)
class MyListManager(multiprocessing.managers.BaseManager):
pass
MyListManager.register("syncarr")
def main():
manager = MyListManager(address=('/tmp/mypipe'), authkey='')
manager.connect()
syncarr = manager.syncarr()
print "arr = %s" % (dir(syncarr))
# note here we need not bother with __str__
# syncarr can be printed as a list without a problem:
print "List at start:", syncarr
print "Changing from client"
syncarr.append(30)
print "List now:", syncarr
o0 = None
o1 = None
while 1:
new_0 = syncarr.__getitem__(0) # syncarr[0]
new_1 = syncarr.__getitem__(1) # syncarr[1]
if o0 != new_0 or o1 != new_1:
print 'o0: %s => %s' % (str(o0), str(new_0))
print 'o1: %s => %s' % (str(o1), str(new_1))
print "List is:", syncarr
print 'Press Ctrl-C to exit'
o0 = new_0
o1 = new_1
time.sleep(1)
if __name__ == '__main__':
main()
As a final remark, on Linux /tmp/mypipe is created - but is 0 bytes, and has attributes srwxr-xr-x (for a socket); I guess this makes me happy, as I neither have to worry about network ports, nor about temporary files as such :)
Other related questions:
Python: Possible to share in-memory data between 2 separate processes (very good explanation)
Efficient Python to Python IPC
Python: Sending a variable to another script
You're not going to be able to do what you want without storing the information somewhere external to the two instances of the interpreter.
If it's just simple variables you want, you can easily dump a python dict to a file with the pickle module in script one and then re-load it in script two.
Example:
one.py
import pickle
shared = {"Foo":"Bar", "Parrot":"Dead"}
fp = open("shared.pkl","w")
pickle.dump(shared, fp)
two.py
import pickle
fp = open("shared.pkl")
shared = pickle.load(fp)
print shared["Foo"]
sudo apt-get install memcached python-memcache
one.py
import memcache
shared = memcache.Client(['127.0.0.1:11211'], debug=0)
shared.set('Value', 'Hello')
two.py
import memcache
shared = memcache.Client(['127.0.0.1:11211'], debug=0)
print shared.get('Value')
What you're trying to do here (store a shared state in a Python module over separate python interpreters) won't work.
A value in a module can be updated by one module and then read by another module, but this must be within the same Python interpreter. What you seem to be doing here is actually a sort of interprocess communication; this could be accomplished via socket communication between the two processes, but it is significantly less trivial than what you are expecting to have work here.
you can use the relative simple mmap file.
you can use the shared.py to store the common constants. The following code will work across different python interpreters \ scripts \processes
shared.py:
MMAP_SIZE = 16*1024
MMAP_NAME = 'Global\\SHARED_MMAP_NAME'
* The "Global" is windows syntax for global names
one.py:
from shared import MMAP_SIZE,MMAP_NAME
def write_to_mmap():
map_file = mmap.mmap(-1,MMAP_SIZE,tagname=MMAP_NAME,access=mmap.ACCESS_WRITE)
map_file.seek(0)
map_file.write('hello\n')
ret = map_file.flush() != 0
if sys.platform.startswith('win'):
assert(ret != 0)
else:
assert(ret == 0)
two.py:
from shared import MMAP_SIZE,MMAP_NAME
def read_from_mmap():
map_file = mmap.mmap(-1,MMAP_SIZE,tagname=MMAP_NAME,access=mmap.ACCESS_READ)
map_file.seek(0)
data = map_file.readline().rstrip('\n')
map_file.close()
print data
*This code was written for windows, linux might need little adjustments
more info at - https://docs.python.org/2/library/mmap.html
Share a dynamic variable by Redis:
script_one.py
from redis import Redis
from time import sleep
cli = Redis('localhost')
shared_var = 1
while True:
cli.set('share_place', shared_var)
shared_var += 1
sleep(1)
Run script_one in a terminal (a process):
$ python script_one.py
script_two.py
from redis import Redis
from time import sleep
cli = Redis('localhost')
while True:
print(int(cli.get('share_place')))
sleep(1)
Run script_two in another terminal (another process):
$ python script_two.py
Out:
1
2
3
4
5
...
Dependencies:
$ pip install redis
$ apt-get install redis-server
I'd advise that you use the multiprocessing module. You can't run two scripts from the commandline, but you can have two separate processes easily speak to each other.
From the doc's examples:
from multiprocessing import Process, Queue
def f(q):
q.put([42, None, 'hello'])
if __name__ == '__main__':
q = Queue()
p = Process(target=f, args=(q,))
p.start()
print q.get() # prints "[42, None, 'hello']"
p.join()
You need to store the variable in some sort of persistent file. There are several modules to do this, depending on your exact need.
The pickle and cPickle module can save and load most python objects to file.
The shelve module can store python objects in a dictionary-like structure (using pickle behind the scenes).
The dbm/bsddb/dbhash/gdm modules can store string variables in a dictionary-like structure.
The sqlite3 module can store data in a lightweight SQL database.
The biggest problem with most of these are that they are not synchronised across different processes - if one process reads a value while another is writing to the datastore then you may get incorrect data or data corruption. To get round this you will need to write your own file locking mechanism or use a full-blown database.
If you wanna read and modify shared data between 2 scripts which run separately, a good solution would be to take advantage of python multiprocessing module and use a Pipe() or a Queue() (see differences here). This way you get to sync scripts and avoid problems regarding concurrency and global variables (like what happens if both scripts wanna modify a variable at the same time).
The best part about using pipes/queues is that you can pass python objects through them.
Also there are methods to avoid waiting for data if there hasn't been passed yet (queue.empty() and pipeConn.poll()).
See an example using Queue() below:
# main.py
from multiprocessing import Process, Queue
from stage1 import Stage1
from stage2 import Stage2
s1= Stage1()
s2= Stage2()
# S1 to S2 communication
queueS1 = Queue() # s1.stage1() writes to queueS1
# S2 to S1 communication
queueS2 = Queue() # s2.stage2() writes to queueS2
# start s2 as another process
s2 = Process(target=s2.stage2, args=(queueS1, queueS2))
s2.daemon = True
s2.start() # Launch the stage2 process
s1.stage1(queueS1, queueS2) # start sending stuff from s1 to s2
s2.join() # wait till s2 daemon finishes
# stage1.py
import time
import random
class Stage1:
def stage1(self, queueS1, queueS2):
print("stage1")
lala = []
lis = [1, 2, 3, 4, 5]
for i in range(len(lis)):
# to avoid unnecessary waiting
if not queueS2.empty():
msg = queueS2.get() # get msg from s2
print("! ! ! stage1 RECEIVED from s2:", msg)
lala = [6, 7, 8] # now that a msg was received, further msgs will be different
time.sleep(1) # work
random.shuffle(lis)
queueS1.put(lis + lala)
queueS1.put('s1 is DONE')
# stage2.py
import time
class Stage2:
def stage2(self, queueS1, queueS2):
print("stage2")
while True:
msg = queueS1.get() # wait till there is a msg from s1
print("- - - stage2 RECEIVED from s1:", msg)
if msg == 's1 is DONE ':
break # ends loop
time.sleep(1) # work
queueS2.put("update lists")
EDIT: just found that you can use queue.get(False) to avoid blockage when receiving data. This way there's no need to check first if the queue is empty. This is no possible if you use pipes.
Use text files or environnement variables. Since the two run separatly, you can't really do what you are trying to do.
In your example, the first script runs to completion, and then the second script runs. That means you need some sort of persistent state. Other answers have suggested using text files or Python's pickle module. Personally I am lazy, and I wouldn't use a text file when I could use pickle; why should I write a parser to parse my own text file format?
Instead of pickle you could also use the json module to store it as JSON. This might be preferable if you want to share the data to non-Python programs, as JSON is a simple and common standard. If your Python doesn't have json, get simplejson.
If your needs go beyond pickle or json -- say you actually want to have two Python programs executing at the same time and updating the persistent state variables in real time -- I suggest you use the SQLite database. Use an ORM to abstract the database away, and it's super easy. For SQLite and Python, I recommend Autumn ORM.
This method seems straight forward for me:
class SharedClass:
def __init__(self):
self.data = {}
def set_data(self, name, value):
self.data[name] = value
def get_data(self, name):
try:
return self.data[name]
except:
return "none"
def reset_data(self):
self.data = {}
sharedClass = SharedClass()
PS : you can set the data with a parameter name and a value for it, and to access the value you can use the get_data method, below is the example:
to set the data
example 1:
sharedClass.set_data("name","Jon Snow")
example 2:
sharedClass.set_data("email","jon#got.com")\
to get the data
sharedClass.get_data("email")\
to reset the entire state simply use
sharedClass.reset_data()
Its kind of accessing data from a json object (dict in this case)
Hope this helps....
You could use the basic from and import functions in python to import the variable into two.py. For example:
from filename import variable
That should import the variable from the file.
(Of course you should replace filename with one.py, and replace variable with the variable you want to share to two.py.)
You can also solve this problem by making the variable as global
python first.py
class Temp:
def __init__(self):
self.first = None
global var1
var1 = Temp()
var1.first = 1
print(var1.first)
python second.py
import first as One
print(One.var1.first)

Categories