Memory leak while retrieving data from a proxy class - python

I am processing data from a series of files with multiple processes.
To do that, I built a class to distribute the data.
I started 4 processes that access the same class instance and retrieve data from it.
The problem is that if I use the class method retrieve() to fetch the data, memory usage keeps going up. If I don't, memory stays stable, even though the data keeps being refreshed by getData(). How can I keep memory usage stable while retrieving data? Or is there another way to achieve the same goal?
import pandas as pd
from multiprocessing import Process, RLock
from multiprocessing.managers import BaseManager

class myclass():
    def __init__(self, path):
        self.path = path
        self.lock = RLock()
        self.getIter()

    def getIter(self):
        self.iter = pd.read_csv(self.path, chunksize=1000)

    def getData(self):
        with self.lock:
            try:
                self.data = next(self.iter)
            except StopIteration:
                # Restart from the beginning of the file once exhausted
                self.getIter()
                self.data = next(self.iter)

    def retrieve(self):
        return self.data

def worker(c):
    while True:
        c.getData()
        # Uncommenting the following line, memory usage goes up
        data = c.retrieve()

# Generate a testing file
with open('tmp.csv', 'w') as f:
    for i in range(1000000):
        f.write('%f\n' % (i * 1.))

BaseManager.register('myclass', myclass)
bm = BaseManager()
bm.start()
c = bm.myclass('tmp.csv')

for i in range(4):
    p = Process(target=worker, args=(c,))
    p.start()

I wasn't able to find the cause or solve it directly, but after changing the type of the returned variable from a pandas.DataFrame to a str (a JSON string), the problem went away.
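For illustration, here is a minimal, self-contained sketch of that workaround (the names Holder and worker are hypothetical, not from the question): the proxy method hands back a JSON string, and the worker rebuilds the DataFrame locally only if it needs one.

import io
import pandas as pd
from multiprocessing import Process
from multiprocessing.managers import BaseManager

class Holder:
    # Stand-in for the class above: hold one chunk and hand it out as a
    # JSON string, so only a plain str crosses the proxy boundary.
    def __init__(self):
        self.data = pd.DataFrame({'x': range(5)})
    def retrieve(self):
        return self.data.to_json()

def worker(c):
    for _ in range(3):
        payload = c.retrieve()                    # a str travels through the proxy
        df = pd.read_json(io.StringIO(payload))   # rebuild locally if needed
        print(len(df))

if __name__ == '__main__':
    BaseManager.register('Holder', Holder)
    bm = BaseManager()
    bm.start()
    c = bm.Holder()
    p = Process(target=worker, args=(c,))
    p.start()
    p.join()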

Related

Multiprocessing advice and stop processes

I am trying to implement a system where:
Actors generate data
Replay is a class that manages the data generated by the Actors (in theory it does much more than in the code below, but I kept it simple for posting here)
Learner uses the data from the Replay class (and sometimes updates some of Replay's data)
To implement this, I append the data generated by the Actors to a multiprocessing.Queue and use a separate process to push it into my Replay. I used a multiprocessing.BaseManager to share the Replay.
This is my implementation (the code is working):
import time
import random
from collections import deque

import torch.multiprocessing as mp
from multiprocessing.managers import BaseManager

T = 20
B = 5
REPLAY_MINIMUM_SIZE = 10
REPLAY_MAXIMUM_SIZE = 100

class Actor:
    def __init__(self, global_buffer, rank):
        self.rank = rank
        self.local_buffer = []
        self.global_buffer = global_buffer

    def run(self, num_steps):
        for step in range(num_steps):
            data = f'{self.rank}_{step}'
            self.local_buffer.append(data)
            if len(self.local_buffer) >= B:
                self.global_buffer.put(self.local_buffer)
                self.local_buffer = []

class Learner:
    def __init__(self, replay):
        self.replay = replay

    def run(self, num_steps):
        while self.replay.size() <= REPLAY_MINIMUM_SIZE:
            time.sleep(0.1)
        for step in range(num_steps):
            batch = self.replay.sample(B)
            print(batch)

class Replay:
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, experiences):
        self.memory.extend(experiences)

    def sample(self, n):
        return random.sample(self.memory, n)

    def size(self):
        return len(self.memory)

def send_data_to_replay(global_buffer, replay):
    while True:
        if not global_buffer.empty():
            batch = global_buffer.get()
            replay.push(batch)

if __name__ == '__main__':
    num_actors = 2

    global_buffer = mp.Queue()

    BaseManager.register("ReplayMemory", Replay)
    Manager = BaseManager()
    Manager.start()
    replay = Manager.ReplayMemory(REPLAY_MAXIMUM_SIZE)

    learner = Learner(replay)
    learner_process = mp.Process(target=learner.run, args=(T,))
    learner_process.start()

    actor_processes = []
    for rank in range(num_actors):
        p = mp.Process(target=Actor(global_buffer, rank).run, args=(T,))
        p.start()
        actor_processes.append(p)

    replay_process = mp.Process(target=send_data_to_replay, args=(global_buffer, replay))
    replay_process.start()

    learner_process.join()
    [actor_process.join() for actor_process in actor_processes]
    replay_process.join()
I followed several tutorials and read websites related to multiprocessing, but I am very new to distributed computing. I am not sure if what I am doing is right.
I wanted to know if there is any bad practice in my code or anything that does not follow good practices. Moreover, when I launch the program, the different processes do not terminate, and I am not sure why or how to handle it.
Any feedback would be appreciated!
I find that when working with multiprocessing it is best to have a Queue for each running process. When you are ready to close the application, you can send an exit message (or poison pill) to each queue and close each process cleanly.
When you launch a child process, pass the parent queue and the child queue to the new process through inheritance.
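As a rough sketch of that pattern (hypothetical names, not the poster's code): give each worker its own queue and send a sentinel when it is time to shut down.

import multiprocessing as mp

SENTINEL = None  # the "poison pill"

def worker(inbox):
    # Consume items until the sentinel arrives, then exit cleanly.
    while True:
        item = inbox.get()
        if item is SENTINEL:
            break
        print('got', item)

if __name__ == '__main__':
    queues = [mp.Queue() for _ in range(2)]                        # one queue per process
    procs = [mp.Process(target=worker, args=(q,)) for q in queues]
    for p in procs:
        p.start()
    for i, q in enumerate(queues):
        q.put(f'work-{i}')
    for q in queues:
        q.put(SENTINEL)       # ask each worker to stop
    for p in procs:
        p.join()              # every process terminates cleanly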

Python sharing a deque between multiprocessing processes

I've been looking at the following questions for the past hour without any luck:
Python sharing a dictionary between parallel processes
multiprocessing: sharing a large read-only object between processes?
multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes
I've written a very basic test file to illustrate what I'm trying to do:
from collections import deque
from multiprocessing import Process
import numpy as np

class TestClass:
    def __init__(self):
        self.mem = deque(maxlen=4)
        self.process = Process(target=self.run)

    def run(self):
        while True:
            self.mem.append(np.array([0, 1, 2, 3, 4]))

def print_values(x):
    while True:
        print(x)

test = TestClass()
process = Process(target=print_values(test.mem))
test.process.start()
process.start()
Currently this outputs the following:
deque([], maxlen=4)
How can I access the mem values from the main code, or from the process that runs "print_values"?
Unfortunately multiprocessing.Manager() doesn't support deque, but it does work with list, dict, Queue, Value and Array. A list is fairly close, so I've used it in the example below.
from multiprocessing import Process, Manager, Lock
import numpy as np

class TestClass:
    def __init__(self):
        self.maxlen = 4
        self.manager = Manager()
        self.mem = self.manager.list()
        self.lock = self.manager.Lock()
        self.process = Process(target=self.run, args=(self.mem, self.lock))

    def run(self, mem, lock):
        while True:
            array = np.random.randint(0, high=10, size=5)
            with lock:
                if len(mem) >= self.maxlen:
                    mem.pop(0)
                mem.append(array)

def print_values(mem, lock):
    while True:
        with lock:
            print mem

test = TestClass()
print_process = Process(target=print_values, args=(test.mem, test.lock))
test.process.start()
print_process.start()

test.process.join()
print_process.join()
You have to be a little careful using manager objects. You can use them much like the objects they reference, but you can't do something like... mem = mem[-4:] to truncate the values, because that rebinds the local name instead of modifying the shared object.
As for coding style, I might move the Manager objects outside the class or move the print_values function inside it, but for an example this works. If you move things around, just note that you can't use self.mem directly in the run method. You need to pass it in when you start the process, or the fork that Python does in the background will create a new instance and it won't be shared.
Hopefully this works for your situation; if not, we can try to adapt it a bit.
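To make that caveat concrete, a tiny hypothetical example: rebinding the name only replaces the local reference, while in-place mutations go through the proxy to the shared list.

from multiprocessing import Manager

if __name__ == '__main__':
    manager = Manager()
    mem = manager.list(range(10))
    # Rebinding would NOT change the shared list, only the local name:
    #     mem = mem[-4:]
    # Mutating the proxy in place does update the shared object:
    while len(mem) > 4:
        mem.pop(0)
    print(list(mem))  # -> [6, 7, 8, 9]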
So by combining the code provided by @bivouac0 and the comment @Marijn Pieters posted, I came up with the following solution:
from multiprocessing import Process, Manager, Queue

class testClass:
    def __init__(self, maxlen=4):
        self.mem = Queue(maxsize=maxlen)
        self.process = Process(target=self.run)

    def run(self):
        i = 0
        while True:
            self.mem.empty()
            while not self.mem.full():
                self.mem.put(i)
                i += 1

def print_values(queue):
    while True:
        values = queue.get()
        print(values)

if __name__ == "__main__":
    test = testClass()
    print_process = Process(target=print_values, args=(test.mem,))
    test.process.start()
    print_process.start()

    test.process.join()
    print_process.join()

python, multithreading, safe to use pandas "to_csv" on common file?

I've got some code that works pretty nicely. It's a while-loop that goes through a list of dates, finds files on my HDD that correspond to those dates, does some calculations with those files, and then outputs to a "results.csv" file using the command:
my_df.to_csv("results.csv",mode = 'a')
I'm wondering if it's safe to create a new thread for each date, and call the stuff in the while loop on several dates at a time?
MY CODE:
import datetime, time, os
import sys
import threading
import helperPY  # a python file containing the logic I need

class myThread(threading.Thread):
    def __init__(self, threadID, name, counter, sn, m_date):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter
        self.sn = sn
        self.m_date = m_date

    def run(self):
        print "Starting " + self.name
        m_runThis(self.sn, self.m_date)
        print "Exiting " + self.name

def m_runThis(sn, m_date):
    helperPY.helpFn(sn, m_date)  # this is where the "my_df.to_csv()" is called

sn = 'XXXXXX'
today = datetime.datetime(2016, 9, 22)
yesterday = datetime.datetime(2016, 6, 13)
threadList = []
i_threadlist = 0
while today > yesterday:
    threadList.append(myThread(i_threadlist, str(today), i_threadlist, sn, today))
    threadList[i_threadlist].start()
    i_threadlist = i_threadlist + 1
    today = today - datetime.timedelta(1)
Writing the file in multiple threads is not safe. But you can create a lock to protect that one operation while letting the rest run in parallel. Your to_csv isn't shown, but you could create the lock
csv_output_lock = threading.Lock()
and pass it to helperPY.helpFn. When you get to the operation, do
with csv_output_lock:
    my_df.to_csv("results.csv", mode='a')
You get parallelism for the other operations (subject to the GIL, of course), but the file access is protected.
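A rough sketch of how the pieces could fit together; helperPY.helpFn's real signature and body aren't shown in the question, so the extra lock parameter and the stand-in DataFrame here are assumptions.

import threading
import pandas as pd

csv_output_lock = threading.Lock()

def helpFn(sn, m_date, csv_output_lock):
    # Stand-in for the real calculations done in helperPY.
    my_df = pd.DataFrame({'sn': [sn], 'date': [m_date]})
    # Only the shared-file append is serialized; the work above runs in parallel.
    with csv_output_lock:
        my_df.to_csv("results.csv", mode='a', header=False)

threads = [threading.Thread(target=helpFn, args=('XXXXXX', d, csv_output_lock))
           for d in ('2016-09-22', '2016-09-21')]
for t in threads:
    t.start()
for t in threads:
    t.join()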

Shared state in multiprocessing Processes

Please consider this code:
import time
from multiprocessing import Process
class Host(object):
def __init__(self):
self.id = None
def callback(self):
print "self.id = %s" % self.id
def bind(self, event_source):
event_source.callback = self.callback
class Event(object):
def __init__(self):
self.callback = None
def trigger(self):
self.callback()
h = Host()
h.id = "A"
e = Event()
h.bind(e)
e.trigger()
def delayed_trigger(f, delay):
time.sleep(delay)
f()
p = Process(target = delayed_trigger, args = (e.trigger, 3,))
p.start()
h.id = "B"
e.trigger()
This gives the following output:
self.id = A
self.id = B
self.id = A
However, I expected it to give
self.id = A
self.id = B
self.id = B
...because h.id had already been changed to "B" by the time the trigger method was called.
It seems that a copy of the host instance is created the moment the separate Process is started, so changes to the original host do not affect that copy.
In my project (more elaborate, of course), the host instance's fields are altered from time to time, and it is important that events triggered by code running in a separate process have access to those changes.
multiprocessing runs things in separate processes. It would be almost inconceivable for things not to be copied as they are sent, since sharing anything between processes requires shared memory or communication.
In fact, if you peruse the module, you can see the amount of effort it takes to actually share anything between the processes after they diverge, either through explicit communication or through explicitly shared objects (which are a very limited subset of the language and have to be managed by a Manager).
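For example, a hedged sketch (not the original poster's code) of making a value visible across the process boundary with an explicitly shared object, here a Manager dict standing in for the host's id field:

import time
from multiprocessing import Process, Manager

def delayed_trigger(shared, delay):
    time.sleep(delay)
    print('self.id = %s' % shared['id'])  # reads whatever the value is now

if __name__ == '__main__':
    manager = Manager()
    shared = manager.dict()
    shared['id'] = 'A'
    p = Process(target=delayed_trigger, args=(shared, 3))
    p.start()
    shared['id'] = 'B'  # the child sees this change too
    p.join()            # prints: self.id = B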

How to properly set up multiprocessing proxy objects for objects that already exist

I'm trying to share an existing object across multiple processes using the proxy methods described here. My multiprocessing idiom is the worker/queue setup, modeled after the 4th example here.
The code needs to do some calculations on data that are stored in rather large files on disk. I have a class that encapsulates all the I/O interactions, and once it has read a file from disk, it saves the data in memory for the next time a task needs to use the same data (which happens often).
I thought I had everything working from reading the examples linked to above. Here is a mock up of the code that just uses numpy random arrays to model the disk I/O:
import numpy
from multiprocessing import Process, Queue, current_process, Lock
from multiprocessing.managers import BaseManager

nfiles = 200
njobs = 1000

class BigFiles:
    def __init__(self, nfiles):
        # Start out with nothing read in.
        self.data = [ None for i in range(nfiles) ]
        # Use a lock to make sure only one process is reading from disk at a time.
        self.lock = Lock()

    def access(self, i):
        # Get the data for a particular file
        # In my real application, this function reads in files from disk.
        # Here I mock it up with random numpy arrays.
        if self.data[i] is None:
            with self.lock:
                self.data[i] = numpy.random.rand(1024,1024)
        return self.data[i]

    def summary(self):
        return 'BigFiles: %d, %d Storing %d of %d files in memory'%(
            id(self), id(self.data),
            (len(self.data) - self.data.count(None)),
            len(self.data) )

# I'm using a worker/queue setup for the multiprocessing:
def worker(input, output):
    proc = current_process().name
    for job in iter(input.get, 'STOP'):
        (big_files, i, ifile) = job
        data = big_files.access(ifile)
        # Do some calculations on the data
        answer = numpy.var(data)
        msg = '%s, job %d'%(proc, i)
        msg += '\n    Answer for file %d = %f'%(ifile, answer)
        msg += '\n    ' + big_files.summary()
        output.put(msg)

# A class that returns an existing file when called.
# This is my attempted workaround for the fact that Manager.register needs a callable.
class ObjectGetter:
    def __init__(self, obj):
        self.obj = obj
    def __call__(self):
        return self.obj

def main():
    # Prior to the place where I want to do the multiprocessing,
    # I already have a BigFiles object, which might have some data already read in.
    # (Here I start it out empty.)
    big_files = BigFiles(nfiles)
    print 'Initial big_files.summary = ', big_files.summary()

    # My attempt at making a proxy class to pass big_files to the workers
    class BigFileManager(BaseManager):
        pass
    getter = ObjectGetter(big_files)
    BigFileManager.register('big_files', callable = getter)
    manager = BigFileManager()
    manager.start()

    # Set up the jobs:
    task_queue = Queue()
    for i in range(njobs):
        ifile = numpy.random.randint(0, nfiles)
        big_files_proxy = manager.big_files()
        task_queue.put( (big_files_proxy, i, ifile) )

    # Set up the workers
    nproc = 12
    done_queue = Queue()
    process_list = []
    for j in range(nproc):
        p = Process(target=worker, args=(task_queue, done_queue))
        p.start()
        process_list.append(p)
        task_queue.put('STOP')

    # Log the results
    for i in range(njobs):
        msg = done_queue.get()
        print msg
    print 'Finished all jobs'
    print 'big_files.summary = ', big_files.summary()

    # Shut down the workers
    for j in range(nproc):
        process_list[j].join()
    task_queue.close()
    done_queue.close()

main()
This works in the sense that it calculates everything correctly, and it is caching the data that is read along the way. The only problem I'm having is that at the end, the big_files object doesn't have any of the files loaded. The final msg returned is:
Process-2, job 999
    Answer for file 198 = 0.083406
    BigFiles: 4303246400, 4314056248 Storing 198 of 200 files in memory
But then after it's all done, we have:
Finished all jobs
big_files.summary = BigFiles: 4303246400, 4314056248 Storing 0 of 200 files in memory
So my question is: What happened to all the stored data? It's claiming to be using the same self.data according to the id(self.data). But it's empty now.
I want the end state of big_files to have all the saved data that it accumulated along the way, since I actually have to repeat this entire process many times, so I don't want to have to redo all the (slow) I/O each time.
I'm assuming it must have something to do with my ObjectGetter class. The examples for using BaseManager only show how to make a new object that will be shared, not how to share an existing one. So am I doing something wrong with the way I get the existing big_files object? Can anyone suggest a better way to do this step?
Thanks much!
