I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, which has to wait for the result and collect it.
It can be simpler with Pool. It doesn't need a queue managed by hand, it should return results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't hog the CPU for other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW:
Other modules: concurrent.futures, joblib, ray.
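For instance, a minimal sketch of the same approach with concurrent.futures, reusing the asyncJSONs() function and folder variable from the snippet above (untested, same caveats apply):
from concurrent.futures import ProcessPoolExecutor
import os

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    with ProcessPoolExecutor(max_workers=5) as ex:
        # map() keeps the results in the same order as all_jsons
        materials = list(ex.map(asyncJSONs, all_jsons))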
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to a target file in ndjson/jsonlines format. That's a format where, instead of objects being part of a JSON array [{},{}...], you have separate objects on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flush ensures that as long as your data is less than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
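A rough sketch of that workflow in Python, where extract_properties() is a hypothetical stand-in for whatever parsing you do per file:
import json

def worker(filenames, out_file):
    for name in filenames:
        data = extract_properties(name)          # hypothetical per-file parsing
        with open(out_file, 'a') as fp:
            print(json.dumps(data), file=fp, flush=True)

def load_results(out_file):
    # read the jsonlines file back later, one object per line
    with open(out_file) as fp:
        return [json.loads(line) for line in fp if line.strip()]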
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function stuff the results in a queue, while your main code waits for stuff to magically appear in the queue.
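A minimal sketch of that queue pattern, adapted to the question (the real parsing with process_dict is omitted; each worker puts (index, result) on the queue and the main process collects them):
import multiprocessing as mp

def worker(file, index, queue):
    properties = {'name': file.split('.')[0]}   # stand-in for the real parsing
    queue.put((index, properties))

if __name__ == '__main__':
    files = ['a.json', 'b.json']
    queue = mp.Queue()
    procs = []
    for i, f in enumerate(files):
        p = mp.Process(target=worker, args=(f, i, queue))
        p.start()
        procs.append(p)
    materials = [None] * len(files)
    for _ in files:
        index, result = queue.get()   # blocks until some worker sends a result
        materials[index] = result
    for p in procs:
        p.join()
    print(materials)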
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
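If it does turn out to be I/O-bound, a hedged thread-based sketch (threads share memory, so the original write-into-materials-by-index approach works unchanged):
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=16) as ex:
    # reuses asyncJSONs(file, index) and jsons from the question
    list(ex.map(asyncJSONs, jsons, range(len(jsons))))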
I am trying to create workers for a task that involves reading a lot of files and analyzing them.
I want something like this:
list_of_unique_keys_from_csv_file = [] # About 200mb array (10m rows)
# a list of unique keys for comparing inside worker processes to a set of flat files
I need more threads as it is going very slow, doing the comparison with one process (10 minutes per file).
I have another set of flat-files that I compare the CSV file to, to see if unique keys exist. This seems like a map reduce type of problem.
main.py:
def worker_process(directory_glob_of_flat_files, list_of_unique_keys_from_csv_file):
    # Do some parallel comparisons "if not in" type stuff.
    # generate an array of
    # lines of text like: "this item_x was not detected in CSV list (from current_flatfile)"
    if current_item not in list_of_unique_keys_from_csv_file:
        all_lines_this_worker_generated.append(sometext + current_item)
    return all_lines_this_worker_generated

def main():
    all_results = []
    pool = Pool(processes=6)
    partitioned_flat_files = []  # divide files from glob by 6
    results = pool.starmap(worker_process, partitioned_flat_files, {{{{i wanna pass in my read-only parameter}}}})
    pool.close()
    pool.join()
    all_results.extend(results)
    resulting_file.write(all_results)
I am using both a linux and a windows environment, so perhaps I need something cross-platform compatible (the whole fork() discussion).
Main Question: Do I need some sort of Pipe or Queue? I can't seem to find good examples of how to pass around a big read-only string array, with a copy for each worker process.
You can just split your read-only parameters and then pass them in. The multiprocessing module is cross-platform compatible, so don't worry about it.
Actually, every process, even a sub-process, has its own resources; that means no matter how you pass the parameters to it, it will keep a copy of the original instead of sharing it. In this simple case, when you pass the parameters from the main process into sub-processes, Pool automatically makes a copy of your variables. Because sub-processes only have copies of the originals, modifications cannot be shared. That doesn't matter in this case, as your variables are read-only.
But be careful with your code: you need to wrap the parameters into an iterable collection, for example:
from multiprocessing import Pool

def add(a, b):
    return a + b

if __name__ == '__main__':
    pool = Pool()
    results = pool.starmap(add, [(1, 2), (3, 4)])
    print(results)
    # [3, 7]
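Applied to your case, one hedged way to wrap things is to zip each chunk of flat files with the same read-only key list (compare_chunk and the toy data below are just placeholders for your real code):
from itertools import repeat
from multiprocessing import Pool

def compare_chunk(flat_file_chunk, keys):
    # report items in this chunk that are missing from the key list
    return [item for item in flat_file_chunk if item not in keys]

if __name__ == '__main__':
    keys = {'a', 'b', 'c'}               # stand-in for the 10M-row key list
    chunks = [['a', 'x'], ['b', 'y']]    # stand-in for the partitioned flat files
    with Pool(processes=6) as pool:
        # each worker receives its own pickled copy of keys
        results = pool.starmap(compare_chunk, zip(chunks, repeat(keys)))
    print(results)   # [['x'], ['y']]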
I am trying to use multiprocessing in Python 3.6. I have a for loop that runs a method with different arguments. Currently, it is running one at a time, which is taking quite a bit of time, so I am trying to use multiprocessing. Here is what I have:
def test(self):
    for key, value in dict.items():
        pool = Pool(processes=(cpu_count() - 1))
        pool.apply_async(self.thread_process, args=(key, value))
        pool.close()
        pool.join()

def thread_process(self, key, value):
    # self.__init__()
    print("For", key)
I think my code is using 3 processes to run one method, but I would like to run 1 method per process; I don't know how this is done. I am using 4 cores, btw.
You're making a pool at every iteration of the for loop. Make a pool beforehand, apply the processes you'd like to run in multiprocessing, and then join them:
from multiprocessing import Pool, cpu_count
import time

def t():
    # Make a dummy dictionary
    d = {k: k**2 for k in range(10)}
    pool = Pool(processes=(cpu_count() - 1))
    for key, value in d.items():
        pool.apply_async(thread_process, args=(key, value))
    pool.close()
    pool.join()

def thread_process(key, value):
    time.sleep(0.1)  # Simulate a process taking some time to complete
    print("For", key, value)

if __name__ == '__main__':
    t()
You're not populating your multiprocessing.Pool with data - you're re-initializing the pool on each loop. In your case you can use Pool.map() to do all the heavy work for you:
from multiprocessing import Pool, cpu_count

def thread_process(args):
    print(args)

def test():
    pool = Pool(processes=(cpu_count() - 1))
    pool.map(thread_process, your_dict.items())
    pool.close()

if __name__ == "__main__":  # important guard for cross-platform use
    test()
Also, given all those self arguments I reckon you're snatching this off of a class instance, and if so - don't, unless you know what you're doing. Since multiprocessing in Python essentially works as, well, multi-processing (unlike multi-threading), you don't get to share memory, which means your data is pickled when exchanged between processes, which in turn means anything that cannot be pickled (like instance methods) doesn't get called. You can read more on that problem in this answer.
I think my code is using 3 processes to run one method, but I would like to run 1 method per process; I don't know how this is done. I am using 4 cores, btw.
No, you are in fact using the correct syntax here to utilize 3 cores to run an arbitrary function independently on each. You cannot magically utilize 3 cores to work together on one task without explicitly making that part of the algorithm itself / coding it yourself, often using threads (which do not work the same in Python as they do outside of the language).
You are, however, re-initializing the pool on every loop; you'll need to do something like this instead to perform this properly:
cpus_to_run_on = cpu_count() - 1
pool = Pool(processes=cpus_to_run_on)
# don't call a dictionary a dict, you will not be able to use dict() any
# more after that point, that's like calling a variable len or abs, you
# can't use those functions now
pool.map(your_function, your_function_args)
pool.close()
Take a look at the Python multiprocessing docs for more specific information if you'd like a better understanding of how it works. Under Python, you cannot use threading to do multiprocessing with the default CPython interpreter. This is because of something called the global interpreter lock, which stops concurrent resource access from within Python itself. The GIL doesn't exist in some other implementations of the language, and is not something other languages like C and C++ have to deal with (so there you can actually use threads in parallel to work together on a task, unlike CPython).
Python gets around this issue by simply making multiple interpreter instances when using the multiprocessing module, and any message passing between instances is done via copying data between processes (i.e. the same memory is typically not touched by both interpreter instances). This does not, however, happen in the misleadingly named threading module, which often actually slows processes down because of a process called context switching. Threading today has limited usefulness, but provides an easier way around non-GIL-locked operations like socket and file reads/writes than async Python.
Beyond all this, though, there is a bigger problem with your multiprocessing: you're writing to standard output. You aren't going to get the gains you want. Think about it. Each of your processes "prints" data, but it's all being displayed in one terminal/output screen. So even if your processes are "printing", they aren't really doing that independently, and the information has to be coalesced back into another process where the text interface lies (i.e. your console). So these processes write whatever they were going to print to some sort of buffer, which then has to be copied (as we learned from how multiprocessing works) to another process, which will then take that buffered data and output it.
Typically, dummy programs use printing as a means of showing that there is no order between the execution of these processes and that they can finish at different times; they aren't meant to demonstrate the performance benefits of multi-core processing.
I have experimented a bit this week with multiprocessing. The fastest way that I discovered to do multiprocessing in python3 is using imap_unordered, at least in my scenario. Here is a script you can experiment with using your scenario to figure out what works best for you:
import multiprocessing

NUMBER_OF_PROCESSES = multiprocessing.cpu_count()
MP_FUNCTION = 'imap_unordered'  # 'imap_unordered' or 'starmap' or 'apply_async'

def process_chunk(a_chunk):
    print(f"processing mp chunk {a_chunk}")
    return a_chunk

map_jobs = [1, 2, 3, 4]
result_sum = 0

if MP_FUNCTION == 'imap_unordered':
    pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
    for i in pool.imap_unordered(process_chunk, map_jobs):
        result_sum += i
elif MP_FUNCTION == 'starmap':
    pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
    try:
        map_jobs = [(i, ) for i in map_jobs]
        result_sum = pool.starmap(process_chunk, map_jobs)
        result_sum = sum(result_sum)
    finally:
        pool.close()
        pool.join()
elif MP_FUNCTION == 'apply_async':
    with multiprocessing.Pool(processes=NUMBER_OF_PROCESSES) as pool:
        result_sum = [pool.apply_async(process_chunk, [i, ]).get() for i in map_jobs]
        result_sum = sum(result_sum)

print(f"result_sum is {result_sum}")
I found that starmap was not too far behind in performance; in my scenario it used more CPU and ended up being a bit slower. Hope this boilerplate helps.
I am trying to parallelize some calculations with the use of the multiprocessing module.
How can I be sure that every process spawned by multiprocessing.Pool.map_async is running in a different (previously created) folder?
The problem is that each process calls some third-party library that writes temp files to disk, and if you run many of those in the same folder, you mess up one with the other.
Additionally, I can't create a new folder for every function call made by map_async; rather, I would like to create as few folders as possible (i.e., one per process).
The code would be similar to this:
import multiprocessing,os,shutil

processes=16

#starting pool
pool=multiprocessing.Pool(processes)

#The asked dark-magic here?

devshm='/dev/shm/'

#Creating as many folders as necessary
for p in range(16):
    os.mkdir(devshm+str(p)+'/')
    shutil.copy(some_files,p)

def example_function(i):
    print os.getcwd()
    return i*i

result=pool.map_async(example_function,range(1000))
So that at any time, every call of example_function is executed on a different folder.
I know that a solution might be to use subprocess to spawn the different processes, but I would like to stick to multiprocessing (I would need to pickle some objects, write them to disk, read them back, and unpickle them for every spawned subprocess, rather than passing the object itself through the function call using functools.partial).
PS.
This question is somewhat similar, but that solution doesn't guarantee that every function call takes place in a different folder, which is indeed my goal.
Since you don't specify in your question, I'm assuming you don't need the contents of the directory after your function has finished executing.
The absolute easiest method is to create and destroy the temp directories in your function that uses them. This way the rest of your code doesn't care about environment/directories of the worker processes and Pool fits nicely. I would also use python's built-in functionality for creating temporary directories:
import multiprocessing, os, shutil, tempfile

processes=16

def example_function(i):
    with tempfile.TemporaryDirectory() as path:
        os.chdir(path)
        print(os.getcwd())
        return i*i

if __name__ == '__main__':
    #starting pool
    pool=multiprocessing.Pool(processes)

    result=pool.map(example_function,range(1000))
NOTE: tempfile.TemporaryDirectory was introduced in python 3.2. If you are using an older version of python, you can copy the wrapper class into your code.
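For example, a hedged stand-in built from mkdtemp/rmtree instead of copying the whole class:
import contextlib
import shutil
import tempfile

@contextlib.contextmanager
def temporary_directory():
    # minimal replacement for tempfile.TemporaryDirectory on old Pythons
    path = tempfile.mkdtemp()
    try:
        yield path
    finally:
        shutil.rmtree(path)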
If you really need to set up the directories beforehand...
Trying to make this work with Pool is a little hacky. You could pass the name of the directory to use along with the data, but you could only pass an initial amount equal to the number of directories. Then, you would need to use something like imap_unordered to see when a result is done (and its directory is available for reuse).
A much better approach, in my opinion, is not to use Pool at all, but create individual Process objects and assign each one to a directory. This is generally better if you need to control some part of the Process's environment, where Pool is generally better when your problem is data-driven and doesn't care about the processes or their environment.
There different ways to pass data to/from the Process objects, but the simplest is a queue:
import multiprocessing,os,shutil

processes=16

in_queue = multiprocessing.Queue()
out_queue = multiprocessing.Queue()

def example_function(path, qin, qout):
    os.chdir(path)
    for i in iter(qin.get, 'stop'):
        print(os.getcwd())
        qout.put(i*i)

devshm='/dev/shm/'

# create processes & folders
procs = []
for i in range(processes):
    path = devshm+str(i)+'/'
    os.mkdir(path)
    #shutil.copy(some_files,path)

    procs.append(multiprocessing.Process(target=example_function, args=(path,in_queue, out_queue)))
    procs[-1].start()

# send input
for i in range(1000):
    in_queue.put(i)

# send stop signals
for i in range(processes):
    in_queue.put('stop')

# collect output
results = []
for i in range(1000):
    results.append(out_queue.get())
The following does not work
one.py
import shared
shared.value = 'Hello'
raw_input('A cheap way to keep process alive..')
two.py
import shared
print shared.value
run on two command lines as:
>>python one.py
>>python two.py
(the second one gets an attribute error, rightly so).
Is there a way to accomplish this, that is, share a variable between two scripts?
Hope it's OK to jot down my notes about this issue here.
First of all, I appreciate the example in the OP a lot, because that is where I started as well - although it made me think shared is some built-in Python module, until I found a complete example at [Tutor] Global Variables between Modules ??.
However, when I looked for "sharing variables between scripts" (or processes) - besides the case when a Python script needs to use variables defined in other Python source files (but not necessarily running processes) - I mostly stumbled upon two other use cases:
A script forks itself into multiple child processes, which then run in parallel (possibly on multiple processors) on the same PC
A script spawns multiple other child processes, which then run in parallel (possibly on multiple processors) on the same PC
As such, most hits regarding "shared variables" and "interprocess communication" (IPC) discuss cases like these two; however, in both of these cases one can observe a "parent", to which the "children" usually have a reference.
What I am interested in, however, is running multiple invocations of the same script, run independently, and sharing data between them (as in Python: how to share an object instance across multiple invocations of a script), in a singleton/single-instance mode. That kind of problem is not really addressed by the above two cases - instead, it essentially reduces to the example in the OP (sharing variables across two scripts).
Now, when dealing with this problem in Perl, there is IPC::Shareable; which "allows you to tie a variable to shared memory", using "an integer number or 4 character string[1] that serves as a common identifier for data across process space". Thus, there are no temporary files, nor networking setups - which I find great for my use case; so I was looking for the same in Python.
However, as the accepted answer by @Drewfer notes: "You're not going to be able to do what you want without storing the information somewhere external to the two instances of the interpreter"; or in other words: either you have to use a networking/socket setup - or you have to use temporary files (ergo, no shared RAM for "totally separate python sessions").
Now, even with these considerations, it is kinda difficult to find working examples (except for pickle) - also in the docs for mmap and multiprocessing. I have managed to find some other examples - which also describe some pitfalls that the docs do not mention:
Usage of mmap: working code in two different scripts at Sharing Python data between processes using mmap | schmichael's blog
Demonstrates how both scripts change the shared value
Note that here a temporary file is created as storage for saved data - mmap is just a special interface for accessing this temporary file
Usage of multiprocessing: working code at:
Python multiprocessing RemoteManager under a multiprocessing.Process - working example of SyncManager (via manager.start()) with shared Queue; server(s) writes, clients read (shared data)
Comparison of the multiprocessing module and pyro? - working example of BaseManager (via server.serve_forever()) with shared custom class; server writes, client reads and writes
How to synchronize a python dict with multiprocessing - this answer has a great explanation of multiprocessing pitfalls, and is a working example of SyncManager (via manager.start()) with shared dict; server does nothing, client reads and writes
Thanks to these examples, I came up with an example, which essentially does the same as the mmap example, with approaches from the "synchronize a python dict" example - using BaseManager (via manager.start() through file path address) with shared list; both server and client read and write (pasted below). Note that:
multiprocessing managers can be started either via manager.start() or server.serve_forever()
serve_forever() blocks - start() doesn't
There is an auto-logging facility in multiprocessing: it seems to work fine with start()ed processes, but seems to ignore the ones that use serve_forever()
The address specification in multiprocessing can be IP (socket) or temporary file (possibly a pipe?) path; in multiprocessing docs:
Most examples use multiprocessing.Manager() - this is just a function (not class instantiation) which returns a SyncManager, which is a special subclass of BaseManager; and uses start() - but not for IPC between independently ran scripts; here a file path is used
Few other examples serve_forever() approach for IPC between independently ran scripts; here IP/socket address is used
If an address is not specified, then a temp file path is used automatically (see 16.6.2.12. Logging for an example of how to see this)
In addition to all the pitfalls in the "synchronize a python dict" post, there are additional ones in case of a list. That post notes:
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
The workaround to dict['key'] getting and setting, is the use of the dict public methods get and update. The problem is that there are no such public methods as alternative for list[index]; thus, for a shared list, in addition we have to register __getitem__ and __setitem__ methods (which are private for list) as exposed, which means we also have to re-register all the public methods for list as well :/
Well, I think those were the most critical things; these are the two scripts - they can just be ran in separate terminals (server first); note developed on Linux with Python 2.7:
a.py (server):
import multiprocessing
import multiprocessing.managers

import logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)


class MyListManager(multiprocessing.managers.BaseManager):
    pass

syncarr = []
def get_arr():
    return syncarr

def main():
    # print dir([]) # cannot do `exposed = dir([])`!! manually:
    MyListManager.register("syncarr", get_arr, exposed=['__getitem__', '__setitem__', '__str__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort'])

    manager = MyListManager(address=('/tmp/mypipe'), authkey='')
    manager.start()

    # we don't use the same name as `syncarr` here (although we could);
    # just to see that `syncarr_tmp` is actually <AutoProxy[syncarr] object>
    # so we also have to expose `__str__` method in order to print its list values!
    syncarr_tmp = manager.syncarr()
    print("syncarr (master):", syncarr, "syncarr_tmp:", syncarr_tmp)
    print("syncarr initial:", syncarr_tmp.__str__())

    syncarr_tmp.append(140)
    syncarr_tmp.append("hello")
    print("syncarr set:", str(syncarr_tmp))

    raw_input('Now run b.py and press ENTER')

    print
    print 'Changing [0]'
    syncarr_tmp.__setitem__(0, 250)

    print 'Changing [1]'
    syncarr_tmp.__setitem__(1, "foo")

    new_i = raw_input('Enter a new int value for [0]: ')
    syncarr_tmp.__setitem__(0, int(new_i))

    raw_input("Press any key (NOT Ctrl-C!) to kill server (but kill client first)".center(50, "-"))
    manager.shutdown()

if __name__ == '__main__':
    main()
b.py (client)
import time

import multiprocessing
import multiprocessing.managers

import logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.INFO)


class MyListManager(multiprocessing.managers.BaseManager):
    pass

MyListManager.register("syncarr")

def main():
    manager = MyListManager(address=('/tmp/mypipe'), authkey='')
    manager.connect()
    syncarr = manager.syncarr()

    print "arr = %s" % (dir(syncarr))

    # note here we need not bother with __str__
    # syncarr can be printed as a list without a problem:
    print "List at start:", syncarr
    print "Changing from client"
    syncarr.append(30)
    print "List now:", syncarr

    o0 = None
    o1 = None

    while 1:
        new_0 = syncarr.__getitem__(0)  # syncarr[0]
        new_1 = syncarr.__getitem__(1)  # syncarr[1]

        if o0 != new_0 or o1 != new_1:
            print 'o0: %s => %s' % (str(o0), str(new_0))
            print 'o1: %s => %s' % (str(o1), str(new_1))
            print "List is:", syncarr

            print 'Press Ctrl-C to exit'
            o0 = new_0
            o1 = new_1

        time.sleep(1)

if __name__ == '__main__':
    main()
As a final remark, on Linux /tmp/mypipe is created - but is 0 bytes, and has attributes srwxr-xr-x (for a socket); I guess this makes me happy, as I neither have to worry about network ports, nor about temporary files as such :)
Other related questions:
Python: Possible to share in-memory data between 2 separate processes (very good explanation)
Efficient Python to Python IPC
Python: Sending a variable to another script
You're not going to be able to do what you want without storing the information somewhere external to the two instances of the interpreter.
If it's just simple variables you want, you can easily dump a python dict to a file with the pickle module in script one and then re-load it in script two.
Example:
one.py
import pickle
shared = {"Foo":"Bar", "Parrot":"Dead"}
fp = open("shared.pkl","w")
pickle.dump(shared, fp)
two.py
import pickle
fp = open("shared.pkl")
shared = pickle.load(fp)
print shared["Foo"]
For something more dynamic, you can run a memcached server and share values through it:
sudo apt-get install memcached python-memcache
one.py
import memcache
shared = memcache.Client(['127.0.0.1:11211'], debug=0)
shared.set('Value', 'Hello')
two.py
import memcache
shared = memcache.Client(['127.0.0.1:11211'], debug=0)
print shared.get('Value')
What you're trying to do here (store a shared state in a Python module over separate python interpreters) won't work.
A value in a module can be updated by one module and then read by another module, but this must be within the same Python interpreter. What you seem to be doing here is actually a sort of interprocess communication; this could be accomplished via socket communication between the two processes, but it is significantly less trivial than what you are expecting to have work here.
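If you do go the socket route, a minimal sketch (Python 3 syntax) using the stdlib multiprocessing.connection module; the port and authkey below are arbitrary:
listener.py (run first, waits for the other script):
from multiprocessing.connection import Listener

with Listener(('localhost', 6000), authkey=b'secret') as listener:
    with listener.accept() as conn:
        conn.send({'value': 'Hello'})
client.py (run in a second terminal):
from multiprocessing.connection import Client

with Client(('localhost', 6000), authkey=b'secret') as conn:
    print(conn.recv())    # prints {'value': 'Hello'}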
You can use the relatively simple mmap module.
You can use shared.py to store the common constants. The following code will work across different Python interpreters / scripts / processes.
shared.py:
MMAP_SIZE = 16*1024
MMAP_NAME = 'Global\\SHARED_MMAP_NAME'
* The "Global" is windows syntax for global names
one.py:
import sys
import mmap
from shared import MMAP_SIZE,MMAP_NAME

def write_to_mmap():
    map_file = mmap.mmap(-1,MMAP_SIZE,tagname=MMAP_NAME,access=mmap.ACCESS_WRITE)
    map_file.seek(0)
    map_file.write('hello\n')
    ret = map_file.flush() != 0
    if sys.platform.startswith('win'):
        assert(ret != 0)
    else:
        assert(ret == 0)
two.py:
import mmap
from shared import MMAP_SIZE,MMAP_NAME

def read_from_mmap():
    map_file = mmap.mmap(-1,MMAP_SIZE,tagname=MMAP_NAME,access=mmap.ACCESS_READ)
    map_file.seek(0)
    data = map_file.readline().rstrip('\n')
    map_file.close()
    print data
* This code was written for Windows; Linux might need small adjustments
more info at - https://docs.python.org/2/library/mmap.html
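For reference, a hedged Linux-flavoured variant: tagname is Windows-only, so on Linux you would back the mapping with a real file (the /dev/shm path below is just an example):
import mmap
import os
from shared import MMAP_SIZE

path = '/dev/shm/shared_mmap'                 # example backing file
fd = os.open(path, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, MMAP_SIZE)                   # make sure the file is big enough
buf = mmap.mmap(fd, MMAP_SIZE, access=mmap.ACCESS_WRITE)
buf.seek(0)
buf.write(b'hello\n')
buf.flush()
buf.close()
os.close(fd)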
Share a dynamic variable by Redis:
script_one.py
from redis import Redis
from time import sleep
cli = Redis('localhost')
shared_var = 1
while True:
    cli.set('share_place', shared_var)
    shared_var += 1
    sleep(1)
Run script_one in a terminal (a process):
$ python script_one.py
script_two.py
from redis import Redis
from time import sleep
cli = Redis('localhost')
while True:
    print(int(cli.get('share_place')))
    sleep(1)
Run script_two in another terminal (another process):
$ python script_two.py
Out:
1
2
3
4
5
...
Dependencies:
$ pip install redis
$ apt-get install redis-server
I'd advise that you use the multiprocessing module. You can't run two scripts from the commandline, but you can have two separate processes easily speak to each other.
From the doc's examples:
from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print q.get()    # prints "[42, None, 'hello']"
    p.join()
You need to store the variable in some sort of persistent file. There are several modules to do this, depending on your exact need.
The pickle and cPickle module can save and load most python objects to file.
The shelve module can store python objects in a dictionary-like structure (using pickle behind the scenes).
The dbm/bsddb/dbhash/gdbm modules can store string variables in a dictionary-like structure.
The sqlite3 module can store data in a lightweight SQL database.
The biggest problem with most of these are that they are not synchronised across different processes - if one process reads a value while another is writing to the datastore then you may get incorrect data or data corruption. To get round this you will need to write your own file locking mechanism or use a full-blown database.
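As a tiny illustration of the shelve option (Python 3 syntax; the locking caveat above still applies if both scripts run at once):
import shelve

# one.py - writer
with shelve.open('shared_state') as db:
    db['value'] = 'Hello'

# two.py - reader, run afterwards
with shelve.open('shared_state') as db:
    print(db.get('value'))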
If you want to read and modify shared data between 2 scripts which run separately, a good solution would be to take advantage of the Python multiprocessing module and use a Pipe() or a Queue() (see differences here). This way, you get to sync the scripts and avoid problems regarding concurrency and global variables (like what happens if both scripts want to modify a variable at the same time).
The best part about using pipes/queues is that you can pass python objects through them.
Also, there are methods to avoid waiting for data if it hasn't been passed yet (queue.empty() and pipeConn.poll()).
See an example using Queue() below:
# main.py
from multiprocessing import Process, Queue
from stage1 import Stage1
from stage2 import Stage2
s1= Stage1()
s2= Stage2()
# S1 to S2 communication
queueS1 = Queue() # s1.stage1() writes to queueS1
# S2 to S1 communication
queueS2 = Queue() # s2.stage2() writes to queueS2
# start s2 as another process
s2 = Process(target=s2.stage2, args=(queueS1, queueS2))
s2.daemon = True
s2.start() # Launch the stage2 process
s1.stage1(queueS1, queueS2) # start sending stuff from s1 to s2
s2.join() # wait till s2 daemon finishes
# stage1.py
import time
import random
class Stage1:

    def stage1(self, queueS1, queueS2):
        print("stage1")
        lala = []
        lis = [1, 2, 3, 4, 5]
        for i in range(len(lis)):
            # to avoid unnecessary waiting
            if not queueS2.empty():
                msg = queueS2.get()    # get msg from s2
                print("! ! ! stage1 RECEIVED from s2:", msg)
                lala = [6, 7, 8]       # now that a msg was received, further msgs will be different
            time.sleep(1)              # work
            random.shuffle(lis)
            queueS1.put(lis + lala)
        queueS1.put('s1 is DONE')
# stage2.py
import time
class Stage2:

    def stage2(self, queueS1, queueS2):
        print("stage2")
        while True:
            msg = queueS1.get()    # wait till there is a msg from s1
            print("- - - stage2 RECEIVED from s1:", msg)
            if msg == 's1 is DONE':
                break              # ends loop
            time.sleep(1)          # work
            queueS2.put("update lists")
EDIT: I just found that you can use queue.get(False) to avoid blocking when receiving data. This way there's no need to check first if the queue is empty. This is not possible if you use pipes.
Use text files or environment variables. Since the two run separately, you can't really do what you are trying to do.
In your example, the first script runs to completion, and then the second script runs. That means you need some sort of persistent state. Other answers have suggested using text files or Python's pickle module. Personally I am lazy, and I wouldn't use a text file when I could use pickle; why should I write a parser to parse my own text file format?
Instead of pickle you could also use the json module to store it as JSON. This might be preferable if you want to share the data to non-Python programs, as JSON is a simple and common standard. If your Python doesn't have json, get simplejson.
If your needs go beyond pickle or json -- say you actually want to have two Python programs executing at the same time and updating the persistent state variables in real time -- I suggest you use the SQLite database. Use an ORM to abstract the database away, and it's super easy. For SQLite and Python, I recommend Autumn ORM.
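For the SQLite route, even without an ORM the stdlib sqlite3 module is enough for a simple shared key/value table; a hedged sketch:
import sqlite3

conn = sqlite3.connect('shared_state.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

# writer script
conn.execute("INSERT OR REPLACE INTO kv VALUES ('value', 'Hello')")
conn.commit()

# reader script (run separately with its own connection)
row = conn.execute("SELECT value FROM kv WHERE key = 'value'").fetchone()
print(row[0] if row else None)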
This method seems straightforward to me:
class SharedClass:
    def __init__(self):
        self.data = {}

    def set_data(self, name, value):
        self.data[name] = value

    def get_data(self, name):
        try:
            return self.data[name]
        except:
            return "none"

    def reset_data(self):
        self.data = {}

sharedClass = SharedClass()
PS: you can set the data with a parameter name and a value for it, and to access the value you can use the get_data method; below is an example:
to set the data
example 1:
sharedClass.set_data("name","Jon Snow")
example 2:
sharedClass.set_data("email","jon#got.com")\
to get the data
sharedClass.get_data("email")\
to reset the entire state simply use
sharedClass.reset_data()
It's kind of like accessing data from a JSON object (a dict in this case).
Hope this helps....
You could use the basic from ... import syntax in Python to import the variable into two.py. For example:
from filename import variable
That should import the variable from the file.
(Of course you should replace filename with one.py, and replace variable with the variable you want to share to two.py.)
You can also solve this problem by making the variable global.
first.py:
class Temp:
    def __init__(self):
        self.first = None

global var1
var1 = Temp()
var1.first = 1
print(var1.first)
second.py:
import first as One
print(One.var1.first)