I have a folder containing 497 pandas DataFrames stored as .parquet files; the folder's total size is 7.6 GB.
I'm trying to develop a simple trading system, so I created 2 different classes. The main one is Portfolio; this class creates an Asset object for every single dataframe in the data folder.
import os
import pandas as pd
from dask.delayed import delayed

class Asset:
    def __init__(self, file):
        self.data_path = 'path\\to\\data\\folder\\'
        self.data = pd.read_parquet(self.data_path + file, engine='auto')

class Portfolio:
    def __init__(self):
        self.data_path = 'path\\to\\data\\folder\\'
        self.files_list = [file for file in os.listdir(self.data_path) if file.endswith('.parquet')]
        self.assets_list = []
        self.results = None
        self.shared_data = '???'

    def assets_loading(self):
        for file in self.files_list:
            tmp = Asset(file)
            self.assets_list.append(tmp)

    def dask_delayed(self):
        for asset in self.assets_list:
            backtest = delayed(self.model)(asset)

    def dask_compute(self):
        self.results = delayed(dask_delayed)
        self.results.compute()

    def model(self, asset):
        # do shet

if __name__ == '__main__':
    portfolio = Portfolio()
    portfolio.dask_compute()
I'm doing something wrong, because it looks like the results are not processed. If I try to check portfolio.results the console prints:
Out[5]: Delayed('NoneType-7512ffcc-3b10-445f-928a-f01c01bae29c')
So here are my questions:
Can you explain what's wrong?
When I run the function assets_loading() I'm basically loading the entire data folder into memory for faster processing, but it saturates my RAM (16GB available). I didn't think that a 7.6GB folder could saturate 16GB of RAM, which is why I want to use Dask. Is there any solution compatible with my script's workflow?
There is another problem, and probably the bigger one. With Dask I'm trying to parallelize the model function over multiple assets at the same time, but I need shared memory (self.shared_data in the script) to store some variable values that live inside each Dask process back in the Portfolio object (for example, a single asset's yearly performance). Can you explain how I can share data between Dask delayed processes and how to store this data in a Portfolio variable?
Thanks a lot.
There are a few things wrong with the line self.results = delayed(dask_delayed):
Here you are creating a delayed function, not a delayed result; you need to call the delayed function
dask_delayed is not defined here, you probably mean self.dask_delayed
the method dask_delayed does not return anything
you call .compute() (which doesn't exist for a delayed function, only a delayed result), but don't store the output - computing doesn't happen in-place, as you seem to assume.
You probably wanted
self.results = delayed(self.dask_delayed)().compute()
Now you need to fix dask_delayed() so that it returns something. It should not be calling more delayed functions, since it is itself already going to be delayed.
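Two hedged sketches of what this could look like on the Portfolio class from the question; the first follows the advice above literally, while the second is a common Dask pattern (one delayed call per asset, computed together) rather than something stated in this answer:

import dask

# Option 1: only the outer call is delayed; dask_delayed does plain work
# and returns the per-asset results.
def dask_delayed(self):
    return [self.model(asset) for asset in self.assets_list]

def dask_compute(self):
    self.results = delayed(self.dask_delayed)().compute()

# Option 2: one delayed call per asset, all computed at once so the
# per-asset model runs can be scheduled in parallel.
def dask_delayed(self):
    return [delayed(self.model)(asset) for asset in self.assets_list]

def dask_compute(self):
    self.results = dask.compute(*self.dask_delayed())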
Finally, regarding filling up memory with pd.read_parquet: it does not surprise me that the in-memory version of the data is bigger, since compression/encoding is one of the aims of the parquet format. You could try using dask.dataframe.read_parquet, which is lazy/on-demand.
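On the memory side, a small sketch of what a lazier Asset could look like (same path layout as the question; nothing beyond metadata is read from disk until you call .compute() or an aggregation on the Dask DataFrame):

import dask.dataframe as dd

class Asset:
    def __init__(self, file):
        self.data_path = 'path\\to\\data\\folder\\'
        # Lazy: only metadata is read here; partitions are loaded on demand.
        self.data = dd.read_parquet(self.data_path + file)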
I have a directory of Pickled lists which I would like to load sequentially, use as part of an operation, and then discard. The files are around 0.75 - 2GB each when pickled and I can load a number in memory at any one time, although nowhere near all of them. Each pickled file represents one day of data.
Currently, the unpickling process consumes a substantial proportion of the runtime of the program. My proposed solution is to load the first file and, whilst the operation is running on this file, asynchronously load the next file in the list.
I have thought of two ways I could do this: 1) Threading and 2) Asyncio. I have tried both of these but neither has seemed to work. Below is my (attempted) implementation of a Threading-based solution.
import os
import threading
import pickle

class DataSource:
    def __init__(self, folder):
        self.folder = folder
        self.next_file = None

    def get(self):
        if self.next_file is None:
            self.load_file()
        data = self.next_file
        io_thread = threading.Thread(target=self.load_file, daemon=True)
        io_thread.start()
        return data

    def get_next_file(self):
        for filename in sorted(os.listdir(self.folder)):
            yield self.folder + filename

    def load_file(self):
        self.next_file = pickle.load(open(next(self.get_next_file()), "rb"))
The main program will call DataSource().get() to retrieve each file. The first time it is loaded, load_file() will load the file into next_file where it will be stored. Then, the thread io_thread should load each successive file into next_file to be returned via get() as needed.
The thread that is launched does appear to do some work (it consumes a vast amount of RAM, ~60GB) however it does not appear to update next_file.
Could someone suggest why this doesn't work? And, additionally, if there is a better way to achieve this result?
Thanks
DataSource().get() seems to be your first problem: that means you always create a new instance of the DataSource class and only ever get to load the first file, because you never call the same DataSource instance again to proceed to the next file. Maybe you mean to do something along the lines of:
datasource = DataSource()
while datasource.not_done():
    datasource.get()
It would be useful to share the full code, and preferably on repl.it or somewhere where it can be executed.
Also, if you want better performance, it might be worthwhile to look into the multiprocessing module, as Python blocks some operations with the global interpreter lock (GIL) so that only one thread runs at a time, even when you have multiple CPU cores. That might not be a problem in your case, though, as reading from disk is probably the bottleneck; I'd guess Python releases the lock while executing the underlying native code that reads from the filesystem.
I'm also curious about how you could use asyncio for pickles... I guess you'd read the pickled bytes from the file first, then unpickle them when the read is done, while doing other processing during the loading. That seems like it could work nicely.
Finally, I'd add debug prints to see what's going on.
Update: The next problem seems to be that you are using the get_next_file generator wrongly. You create a new generator each time with self.get_next_file(), so you only ever load the first file. You should create the generator once and then call next() on it. Maybe this example helps to understand; it is also on repl.it:
def get_next_file():
    for filename in ['a', 'b', 'c']:
        yield filename

for n in get_next_file():
    print(n)

print("---")
print(next(get_next_file()))
print(next(get_next_file()))
print(next(get_next_file()))
print("---")

gen = get_next_file()
print(gen)
print(next(gen))
print(next(gen))
print(next(gen))
Output:
a
b
c
---
a
a
a
---
<generator object get_next_file at 0x7ff4757f6cf0>
a
b
c
https://repl.it/#ToniAlatalo/PythonYieldNext#main.py
Again, debug prints would help you see what's going on, what file you are loading when etc.
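Putting both fixes together, here is a minimal sketch (an assumption about how the pieces fit, not code from the question) of a DataSource that creates the generator once, is instantiated once, and prefetches the next file in a background thread while the caller works on the current data:

import os
import pickle
import threading

class DataSource:
    def __init__(self, folder):
        self.folder = folder
        self.files = iter(sorted(os.listdir(folder)))   # created once, advanced with next()
        self.next_file = None
        self.io_thread = None

    def load_file(self):
        try:
            filename = next(self.files)
        except StopIteration:
            self.next_file = None       # no files left
            return
        with open(os.path.join(self.folder, filename), "rb") as f:
            self.next_file = pickle.load(f)

    def get(self):
        if self.io_thread is not None:
            self.io_thread.join()       # wait for the prefetch of this file to finish
        elif self.next_file is None:
            self.load_file()            # very first call: load synchronously
        data = self.next_file
        self.next_file = None
        # start prefetching the next file while the caller processes `data`
        self.io_thread = threading.Thread(target=self.load_file, daemon=True)
        self.io_thread.start()
        return data

A single instance would then be reused: create datasource = DataSource("folder/") once and call get() repeatedly until it returns None.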
This is my first question here and I hope that I'm not opening up a question very similar to an already existing one. If that's the case, please excuse me!
So, the situation I'm having a bit of trouble with, is the following:
I would like to run independent Python scripts in parallel that can access the same Python objects, in my case a pandas DataFrame. My idea is that one script is basically constantly running and subscribes to a stream of data (here, data pushed via a websocket), which it then appends to a shared DataFrame. The second script should be able to be started independently of the first one and still access the DataFrame, which is constantly updated by the first script. In the second script I want to execute different kinds of analysis in predefined time intervals or do other relatively time-intensive operations on the live data.
I've already tried to run all operations from within one script, but I kept having disconnects from the websocket. In the future there are also multiple scripts supposed to access the shared data in real time.
Instead of saving a csv file or pickle after every update in script 1, I would rather have a solution where both scripts basically share the same memory. I also only need one of the scripts to write to and append to the DataFrame; the other only needs to read from it. The multiprocessing module seems to have some interesting features that might help, but I couldn't really make any sense of it so far. I also looked at global variables, but that doesn't seem to be the right thing to use in this case.
What I imagine is something like this (I know, that the code is completely wrong and this is just for illustrative purposes):
The first script should keep assigning new data from the datastream to the dataframe and share this object.
from share_data import share

shared_df = pd.DataFrame()
for data from datastream:
    shared_df.append(data)
    share(shared_df)
The second script should then be able to do the following:
from share_data import get
df = get(shared_df)
Is this at all possible or do you have any ideas on how the accomplish this in a simple way? Or do you have any suggestions which packages might be good to use?
Best regards,
Ole
You already have quite the right sense of what you can do with your data. The best solution depends on your actual needs, so I will try to cover the possibilities with a working example.
What you want
If I understand your need completely, you want to
continuously update a DataFrame (from a websocket)
while doing some computations on the same DataFrame
keeping the DataFrame up to date on the computation workers,
one computation is CPU intensive
another is not.
What you need
As you said, you will need a way to run different threads or processes in order to keep the computation running.
How about Threads
The easiest way to achieve what you want would be to use the threading library.
Since threads can share memory, and you only have one worker actually updating the DataFrame, it is quite easy to propose a way to keep the data up to date:
import time
from dataclasses import dataclass
import pandas
from threading import Thread


@dataclass
class DataFrameHolder:
    """This dataclass holds a reference to the current DF in memory.
    This is necessary if you do operations without in-place modification of
    the DataFrame, since you will need to replace the whole object.
    """
    dataframe: pandas.DataFrame = pandas.DataFrame(columns=['A', 'B'])

    def update(self, data):
        self.dataframe = self.dataframe.append(data, ignore_index=True)


class StreamLoader:
    """This class is our worker communicating with the websocket"""

    def __init__(self, df_holder: DataFrameHolder) -> None:
        super().__init__()
        self.df_holder = df_holder

    def update_df(self):
        # read from websocket and update your DF.
        data = {
            'A': [1, 2, 3],
            'B': [4, 5, 6],
        }
        self.df_holder.update(data)

    def run(self):
        # limit loop for the showcase
        for _ in range(5):
            self.update_df()
            print("[1] Updated DF %s" % str(self.df_holder.dataframe))
            time.sleep(3)


class LightComputation:
    """This class is a random computation worker"""

    def __init__(self, df_holder: DataFrameHolder) -> None:
        super().__init__()
        self.df_holder = df_holder

    def compute(self):
        print("[2] Current DF %s" % str(self.df_holder.dataframe))

    def run(self):
        # limit loop for the showcase
        for _ in range(5):
            self.compute()
            time.sleep(5)


def main():
    # We create a DataFrameHolder to keep our DataFrame available.
    df_holder = DataFrameHolder()

    # We create and start our update worker
    stream = StreamLoader(df_holder)
    stream_process = Thread(target=stream.run)
    stream_process.start()

    # We create and start our computation worker
    compute = LightComputation(df_holder)
    compute_process = Thread(target=compute.run)
    compute_process.start()

    # We join our Threads, i.e. we wait for them to finish before continuing
    stream_process.join()
    compute_process.join()


if __name__ == "__main__":
    main()
Note that we use a class to hold a reference to the current DataFrame because some operations, like append, are not necessarily in-place; thus, if we passed the reference to the DataFrame object directly, the modification would be lost in the worker.
Here the DataFrameHolder object keeps the same location in memory, so the worker can still access the updated DataFrame.
Processes may be more powerful
Now if you need more computation power, processes may be more useful since they enable you to isolate your worker on a different core.
To start a Process instead of a Thread in python, you can use the multiprocessing library.
The API of both objects is the same, and you will only have to change the constructor as follows:
from threading import Thread
# I create a thread
compute_process = Thread(target=compute.run)
from multiprocessing import Process
# I create a process that I can use the same way
compute_process = Process(target=compute.run)
Now if you make that change (using a Process) in the above script, you will see that the DataFrame is not updated correctly.
For this you will need more work, since the two processes don't share memory, and there are multiple ways of communicating between them (https://en.wikipedia.org/wiki/Inter-process_communication).
The answer referenced by @SimonCrane is quite interesting on the matter and showcases the use of shared memory between two processes using a multiprocessing manager.
Here is a version with the worker using a separate process with a shared memory:
import logging
import multiprocessing
import time
from dataclasses import dataclass
from multiprocessing import Process
from multiprocessing.managers import BaseManager
from threading import Thread

import pandas


@dataclass
class DataFrameHolder:
    """This dataclass holds a reference to the current DF in memory.
    This is necessary if you do operations without in-place modification of
    the DataFrame, since you will need to replace the whole object.
    """
    dataframe: pandas.DataFrame = pandas.DataFrame(columns=['A', 'B'])

    def update(self, data):
        self.dataframe = self.dataframe.append(data, ignore_index=True)

    def retrieve(self):
        return self.dataframe


class DataFrameManager(BaseManager):
    """This manager handles shared DataFrameHolder objects.
    See https://docs.python.org/3/library/multiprocessing.html#examples
    """
    # You can also use a socket file '/tmp/shared_df'
    MANAGER_ADDRESS = ('localhost', 6000)
    MANAGER_AUTH = b"auth"

    def __init__(self) -> None:
        super().__init__(self.MANAGER_ADDRESS, self.MANAGER_AUTH)
        self.dataframe: pandas.DataFrame = pandas.DataFrame(columns=['A', 'B'])

    @classmethod
    def register_dataframe(cls):
        BaseManager.register("DataFrameHolder", DataFrameHolder)


class DFWorker:
    """Abstract class initializing a worker depending on a DataFrameHolder."""

    def __init__(self, df_holder: DataFrameHolder) -> None:
        super().__init__()
        self.df_holder = df_holder


class StreamLoader(DFWorker):
    """This class is our worker communicating with the websocket"""

    def update_df(self):
        # read from websocket and update your DF.
        data = {
            'A': [1, 2, 3],
            'B': [4, 5, 6],
        }
        self.df_holder.update(data)

    def run(self):
        # limit loop for the showcase
        for _ in range(4):
            self.update_df()
            print("[1] Updated DF\n%s" % str(self.df_holder.retrieve()))
            time.sleep(3)


class LightComputation(DFWorker):
    """This class is a random computation worker"""

    def compute(self):
        print("[2] Current DF\n%s" % str(self.df_holder.retrieve()))

    def run(self):
        # limit loop for the showcase
        for _ in range(4):
            self.compute()
            time.sleep(5)


def main():
    logger = multiprocessing.log_to_stderr()
    logger.setLevel(logging.INFO)

    # Register our DataFrameHolder type in the DataFrameManager.
    DataFrameManager.register_dataframe()
    manager = DataFrameManager()
    manager.start()

    # We create a managed DataFrameHolder to keep our DataFrame available.
    df_holder = manager.DataFrameHolder()

    # We create and start our update worker
    stream = StreamLoader(df_holder)
    stream_process = Thread(target=stream.run)
    stream_process.start()

    # We create and start our computation worker
    compute = LightComputation(df_holder)
    compute_process = Process(target=compute.run)
    compute_process.start()

    # The managed dataframe is updated in every Thread/Process
    time.sleep(5)
    print("[0] Main process DF\n%s" % df_holder.retrieve())

    # We join our Threads, i.e. we wait for them to finish before continuing
    stream_process.join()
    compute_process.join()


if __name__ == "__main__":
    main()
As you can see, the differences between threading and processing are quite small.
With a few tweaks, you can start from there to connect to the same manager if you want to handle your CPU-intensive processing from a separate script, as sketched below.
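For example, here is a minimal sketch of that idea (not part of the code above; the writer.py/reader.py names and the get_holder registration are assumptions, following the "remote manager" pattern from the multiprocessing docs). The writer serves a single holder instance and keeps updating it, and an independently started reader connects to the same address to read it:

# writer.py -- serves one shared DataFrameHolder and keeps appending to it
from multiprocessing.managers import BaseManager
from threading import Thread
import pandas

class DataFrameHolder:
    def __init__(self):
        self.dataframe = pandas.DataFrame(columns=['A', 'B'])
    def update(self, data):
        self.dataframe = self.dataframe.append(data, ignore_index=True)
    def retrieve(self):
        return self.dataframe

holder = DataFrameHolder()

class HolderManager(BaseManager):
    pass

# Every client asking for "get_holder" receives a proxy to this single instance.
HolderManager.register("get_holder", callable=lambda: holder)

manager = HolderManager(('localhost', 6000), authkey=b"auth")
server = manager.get_server()
Thread(target=server.serve_forever, daemon=True).start()

# ... websocket loop goes here, calling holder.update(new_data) on each message ...


# reader.py -- started independently, connects to the running writer
from multiprocessing.managers import BaseManager

class HolderManager(BaseManager):
    pass

HolderManager.register("get_holder")

manager = HolderManager(('localhost', 6000), authkey=b"auth")
manager.connect()
df = manager.get_holder().retrieve()   # a copy of the writer's current DataFrame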
My mxnet script is likely limited by the I/O of loading data onto the GPU, and I am trying to speed this up by prefetching. The trouble is I can't figure out how to prefetch with a custom data iterator.
My first hypothesis/hope was that it would be enough to set the values of self.preprocess_threads and self.prefetch_buffer, as I had seen here for iterators such as mxnet.io.ImageRecordUInt8Iter. However, when I did this I saw no performance change relative to the script before I had set these variables, so clearly setting these did not work.
Then I noticed the existence of the class mx.io.PrefetchingIter, in addition to the base class mx.io.DataIter for which I had implemented a child class. I found this documentation, but I have not been able to find any examples, and I am a little confused about what needs to happen where and when. For example, I see that in addition to next() it has an iter_next() method, which simply says "move to the next batch". What does this mean exactly? What does it mean to "move" to the next batch without producing it? I found the source code for this class, and based on a brief reading, it seems to take multiple iterators and create one thread per iterator. This likely would not work for my current design, as I really want multiple threads used to prefetch from the same iterator.
Here is what I am trying to do via a custom data iterator
I maintain a global multiprocessing.Queue onto which data is put as it becomes available
I produce that data by running (via multiprocessing) a command line script that executes a c++ binary which produces a numpy file
I open the numpy file and load its contents into memory, process them, and put the processed bits on the global multiprocessing.Queue
My custom iterator pulls off this queue and also kicks off more jobs to produce more data when the queue is empty.
Here is my code:
def launchJobForDate(date_str):
    ### this is a function that gets called via multiprocessing
    ### to produce new data by calling a c++ binary
    ### whenever data queue is empty so that we need to produce more data
    try:
        f = "testdata/data%s.npy" % date_str
        if not os.path.isfile(f):
            cmd = CMD % (date_str, JSON_FILE, date_str, date_str, date_str)
            while True:
                try:
                    output = subprocess.check_output(cmd, shell=True)
                    break
                except:
                    pass
        while True:
            try:
                d = np.load(f)
                break
            except:
                pass
        data_queue.put((d, date_str))
    except Exception as ex:
        print("launchJobForDate: ERROR ", ex)
class ProduceDataIter(mx.io.DataIter):
    @staticmethod
    def processData(d, time_steps, num_inputs):
        try:
            ...processes data...
            return [z for z in zip(bigX, bigY, bigEvalY, dates)]
        except Exception as ex:
            print("processData: ERROR ", ex)

    def __init__(self, num_mgrs, end_date_str):
        ## iter stuff
        self.preprocess_threads = 4
        self.prefetch_buffer = 1
        ## set up internal data to preserve state
        ## and make a list of dates for which to run binary

    @property
    def provide_data(self):
        return [mx.io.DataDesc(name='seq_var',
                               shape=(args_batch_size * GPU_COUNT,
                                      self.time_steps,
                                      self.num_inputs),
                               layout='NTC')]

    @property
    def provide_label(self):
        return [mx.io.DataDesc(name='bd_return',
                               shape=(args_batch_size * GPU_COUNT)),
                mx.io.DataDesc(name='bd_return',
                               shape=(args_batch_size * GPU_COUNT, num_y_cols)),
                mx.io.DataDesc(name='date',
                               shape=(args_batch_size * GPU_COUNT))]

    def __next__(self):
        try:
            z = self.z.pop(0)
            data = z[0:1]
            label = z[1:]
            return mx.io.DataBatch(data, label)
        except Exception as ex:
            ### if self.z (a list) has no elements to pop we need
            ### to get more data off the queue, process it, and put it
            ### on self.z so it's ready for calls to __next__()
            while True:
                try:
                    d = data_queue.get_nowait()
                    processedData = ProduceDataIter.processData(d,
                                                                self.time_steps,
                                                                self.num_inputs)
                    self.z.extend(processedData)
                    counter_queue.put(counter_queue.get() - 1)
                    z = self.z.pop(0)
                    data = z[0:1]
                    label = z[1:]
                    return mx.io.DataBatch(data, label)
                except queue.Empty:
                    ...this is where new jobs to produce new data and put them
                    ...on the queue would happen if nothing is left on the queue
I have then tried making one of these iterators as well as a prefetch iterator like so:
mgr = ProcessMgr(2, end_date_str)
mgrOuter = mx.io.PrefetchingIter([mgr])
The problem is that mgrOuter immediately throws a StopIteration as soon as __next__() is called the first time, and without invoking mgr.__next__() as I thought it might.
Finally, I also noticed that gluon has a DataLoader object which seems like it might handle prefetching. However, in this case it also seems to assume that the underlying data comes from a Dataset with a finite and unchanging layout (based on the fact that it is implemented in terms of __getitem__, which takes an index). So I have not pursued this option, as it seems unpromising given the dynamic, queue-like nature of the data I am generating as training input.
My questions are:
How do I need to modify my code above so that there will be prefetching for my custom iterator?
Where might I find an example or more detailed documentation of how mx.io.PrefetchingIter works?
Are there other strategies I should be aware of for getting more performance out of my GPUs via a custom iterator? Right now they are only operating at around 50% capacity, and upping (or lowering) the batch size doesn't change this. What other knobs might I be able to turn to increase GPU use efficiency?
Thanks for any feedback and advice.
As you already mentioned, the gluon DataLoader provides prefetching. In your custom DataIterator you are using NumPy arrays as input, so you could do the following:
f = "testdata/data%s.npy"%date_str
data = np.load(f)
train = gluon.data.ArrayDataset(mx.nd.array(data))
train_iter = gluon.data.DataLoader(train, shuffle=True, num_workers=4, batch_size=batch_size, last_batch='rollover')
Since you are creating your data dynamically, you could try resetting the DataLoader in every epoch and loading a new NumPy array.
If GPU utilization is still low, then try to increase the batch_size and the num_workers. Another factor is the size of your dataset: re-creating the DataLoader has a cost, so a larger dataset per epoch increases the time of an epoch and amortizes that overhead.
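A rough sketch of that reset-per-epoch idea (produce_numpy_file and num_epochs are placeholders, not names from the question):

import numpy as np
import mxnet as mx
from mxnet import gluon

batch_size = 128
num_epochs = 10                                    # placeholder

for epoch in range(num_epochs):
    f = produce_numpy_file(epoch)                  # hypothetical helper wrapping the c++ binary
    data = np.load(f)
    train = gluon.data.ArrayDataset(mx.nd.array(data))
    # A fresh DataLoader per epoch; num_workers > 0 prefetches batches in worker processes.
    train_iter = gluon.data.DataLoader(train, shuffle=True, num_workers=4,
                                       batch_size=batch_size, last_batch='rollover')
    for batch in train_iter:
        pass                                       # forward/backward pass goes here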
I built a little function that gathers some data using a 3rd-party API. Call it MyFunc(Symbol, Field); it returns some info based on the symbol given.
The idea was to fill a Pandas df with the returned value using something like:
df['MyNewField'] = df.apply(lambda x: MyFunc(x, 'FieldName'))
All this works, BUT each query takes around 100 ms to run. This seems fast until you realize you may have 30,000 or more to do (3,000 symbols with 10 fields each, for starters).
I was wondering if there would be a way to run this concurrently, as each request is independent? I am not looking for multiprocessing libraries, but rather a way to make multiple queries to the 3rd party at the same time to reduce the time taken to gather all the data. (Also, I suppose this will change the initial structure used to store all the received data; I don't mind skipping apply and my dataframe at first and instead saving the data as it is received in a text or library type structure.)
NOTE: While I wish I could change MyFunc to request multiple symbols/fields at once, this cannot be done in all cases (meaning some fields do not allow it and a single request is the only way to go). This is why I am looking at concurrent execution and not at changing MyFunc.
Thanks!
There are many libraries to parallelize pandas dataframe. However, I prefer native multi-processing pool to do the same. Also, I use tqdm along with it to know the progress.
import numpy as np
import pandas as pd
from multiprocessing import cpu_count, Pool

cores = 4  # Number of CPU cores on your system
partitions = cores  # Define as many partitions as you want

def partition(data, num_partitions):
    partition_len = int(len(data) / num_partitions)
    partitions = []
    num_rows = 0
    for i in range(num_partitions - 1):
        partition = data.iloc[i * partition_len:i * partition_len + partition_len]
        num_rows = num_rows + partition_len
        partitions.append(partition)
    partitions.append(data.iloc[num_rows:len(data)])
    return partitions

def parallelize(data, func):
    data_split = partition(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

df['MyNewField'] = parallelize(df['FieldName'], MyFunc)
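Since each call is an independent, I/O-bound network request, another option (not part of the answer above) is a plain thread pool from concurrent.futures. A minimal sketch, assuming a 'Symbol' column and the MyFunc from the question:

from concurrent.futures import ThreadPoolExecutor

def fetch(symbol):
    # MyFunc is the question's 3rd-party API wrapper; threads work well here
    # because the GIL is released while waiting on the network.
    return MyFunc(symbol, 'FieldName')

with ThreadPoolExecutor(max_workers=20) as executor:
    # executor.map preserves input order, so the results line up with the rows.
    df['MyNewField'] = list(executor.map(fetch, df['Symbol']))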
EDIT: Updated with environment information (see first section)
Environment
I'm using Python 2.7
Ubuntu 16.04
Issue
I have an application which I've simplified into a three-stage process:
Gather data from multiple data sources (HTTP requests, system info, etc)
Compute metrics based on this data
Output these metrics in various formats
Each of these stages must complete before moving on to the next stage; however, each stage consists of multiple sub-tasks that can be run in parallel (I can send off 3 HTTP requests and read system logs while waiting for them to return).
I've divided up the stages into modules and the sub-tasks into submodules, so my project hierarchy looks like so:
+ datasources
|-- __init__.py
|-- data_one.py
|-- data_two.py
|-- data_three.py
+ metrics
|-- __init__.py
|-- metric_one.py
|-- metric_two.py
+ outputs
|-- output_one.py
|-- output_two.py
- app.py
app.py looks roughly like so (pseudo-code for brevity):
import datasources
import metrics
import outputs

for datasource in dir(datasources):
    datasource.refresh()

for metric in dir(metrics):
    metric.calculate()

for output in dir(outputs):
    output.dump()
(There's additional code wrapping the dir call to ignore system modules, there's exception handling, etc -- but this is the gist of it)
Each datasource sub-module looks roughly like so:
data = []

def refresh():
    # Populate the "data" member somehow
    data = [1, 2, 3]
    return
Each metric sub-module looks roughly like so:
import datasources.data_one as data_one
import datasources.data_two as data_two

data = []

def calculate():
    # Use the datasources to compute the metric
    data = [sum(x) for x in zip(data_one, data_two)]
    return
In order to parallelize the first stage (datasources) I wrote something simple like the following:
def run_thread(datasource):
    datasource.refresh()

threads = []
for datasource in dir(datasources):
    thread = threading.Thread(target=run_thread, args=(datasource,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
This works, and afterwards I can compute any metric and the datasources.x.data attribute is populated
In order to parallelize the second stage (metrics) because it depends less on I/O and more on CPU, I felt like simple threading wouldn't actually speed things up and I would need the multiprocessing module in order to take advantage of multiple cores. I wrote the following:
def run_pool(calculate):
    calculate()

pool = multiprocessing.Pool()
pool.map(run_pool, [m.calculate for m in dir(metrics)])
pool.close()
pool.join()
This code runs for a few seconds (so I think it's working?) but then when I try:
metrics.metric_one.data
it returns [], like the module was never run
Somehow by using the multiprocessing module it seems to be scoping the threads so that they no longer share the data attribute. How should I go about rewriting this so that I can compute each metric in parallel, taking advantage of multiple cores, but still have access to the data when it's done?
Updated again, per comments:
Since you're in 2.7, and you're dealing with modules instead of objects, you're having problems pickling what you need. The workaround is not pretty. It involves passing the name of each module to your operating function. I updated the partial section, and also updated to remove the with syntax.
A few things:
First, in general, it's better to use multiple processes than threads. With threading, you always run a risk of dealing with the Global Interpreter Lock, which can be extremely inefficient. This becomes a non-issue if you use multiple processes.
Second, you've got the right concept, but you make it strange by having a global-to-the-module data member. Make your sources return the data you're interested in, and make your metrics (and outputs) take a list of data as input and output the resultant list.
This would turn your pseudocode into something like this:
app.py:

import multiprocessing

import datasources
import metrics
import outputs

pool = multiprocessing.Pool()
data_list = pool.map(lambda o: o.refresh(), list(dir(datasources)))
pool.close()
pool.join()

pool = multiprocessing.Pool()
metrics_funcs = [(m, data_list) for m in dir(metrics)]
metrics_list = pool.map(lambda m: m[0].calculate(m[1]), metrics_funcs)
pool.close()
pool.join()

pool = multiprocessing.Pool()
output_funcs = [(o, data_list, metrics_list) for o in dir(outputs)]
output_list = pool.map(lambda o: o[0].dump(o[1], o[2]), output_funcs)
pool.close()
pool.join()
Once you do this, your data source would look like this:
def refresh():
    # Populate the "data" member somehow
    return [1, 2, 3]
And your metrics would look like this:
def calculate(data_list):
    # Use the datasources to compute the metric
    return [sum(x) for x in zip(*data_list)]
And finally, your output could look like this:
def dump(data_list, metrics_list):
    # do whatever; you now have all the information
Removing the data "global" and passing it makes each piece a lot cleaner (and a lot easier to test). This highlights making each piece completely independent. As you can see, all I'm doing is changing what's in the list that gets passed to map, and in this case, I'm injecting all the previous calculations by passing them as a tuple and unpacking them in the function.
You don't have to use lambdas, of course. You can define each function separately, but there's really not much to define. However, if you do define each function, you could use partial functions to reduce the number of arguments you pass. I use that pattern a lot, and in your more complicated code, you may need to. Here's one example:
from functools import partial

def do_dump(module_name, data_list, metrics_list):
    globals()[module_name].dump(data_list, metrics_list)

invoke = partial(do_dump, data_list=data_list, metrics_list=metrics_list)

pool = multiprocessing.Pool()
output_list = pool.map(invoke, [o.__name__ for o in dir(outputs)])
pool.close()
pool.join()
Update, per comments:
When you use map, you're guaranteed that the order of your inputs matches the order of your outputs, i.e. data_list[i] is the output for running dir(datasources)[i].refresh(). Rather than importing the datasources modules into metrics, I would make this change to app.py:
data_list = ...
pool.close()
pool.join()
data_map = {name: data_list[i] for i, name in enumerate(dir(datasources))}
And then pass data_map into each metric. Then the metric gets the data that it wants by name, e.g.
d1 = data_map['data_one']
d2 = data_map['data_two']
return [sum(x) for x in zip(d1, d2)]
Process and Thread behave quite differently in Python. If you want to use multiprocessing, you will need to use a synchronized data type to pass information around.
For example you could use multiprocessing.Array, which can be shared between your processes.
For detail see the docs: https://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes
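For illustration, a minimal sketch (not from the question's code) of two processes sharing a numeric buffer through multiprocessing.Array:

from multiprocessing import Array, Process

def worker(shared):
    # Changes are visible to the parent because the array lives in shared memory.
    for i in range(len(shared)):
        shared[i] = shared[i] * 2

if __name__ == '__main__':
    shared = Array('d', [1.0, 2.0, 3.0])    # 'd' -> C double
    p = Process(target=worker, args=(shared,))
    p.start()
    p.join()
    print(list(shared))                     # prints [2.0, 4.0, 6.0]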