I'm trying to evaluate an ANN. I get the accuracies if I use n_jobs = 1; however, when I use n_jobs = -1 I get the following error.
BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
I have tried other values, but it only works with n_jobs = 1.
This is the code I am running:
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
This is the error I am getting:
Traceback (most recent call last):
  File "<ipython-input-12-cc51c2d2980a>", line 1, in <module>
    accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
  File "C:\Users\javie\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 402, in cross_val_score
    error_score=error_score)
  File "C:\Users\javie\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 240, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "C:\Users\javie\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 930, in __call__
    self.retrieve()
  File "C:\Users\javie\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "C:\Users\javie\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "C:\Users\javie\Anaconda3\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\javie\Anaconda3\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Spyder should have analyzed each batch in parallel, but even when I use n_jobs = 1 it only analyzes 10 epochs.
This always happens when using multiprocessing from an IPython console in Spyder. A workaround is to run the script from the command line instead.
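For reference, a minimal standalone script for that workaround might look like the sketch below. The data and estimator are placeholders; swap in your own X_train, y_train and wrapped Keras classifier.

# run_cv.py -- launch from a terminal: python run_cv.py
# The __main__ guard keeps spawned worker processes (needed on Windows)
# from re-executing the script body.
from sklearn.datasets import make_classification      # placeholder data
from sklearn.neural_network import MLPClassifier      # placeholder estimator
from sklearn.model_selection import cross_val_score

if __name__ == '__main__':
    X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)
    classifier = MLPClassifier(max_iter=300, random_state=0)
    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10, n_jobs=-1)
    print(accuracies.mean(), accuracies.std())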
Just posting this for others, in case it's helpful. I ran into the same issue today running a GridSearchCV on a Dask array / cluster (scikit-learn v0.24).
I solved it by using the joblib parallel_backend context manager, as described here: https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism
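Roughly what that looked like for me, as a sketch; the estimator and parameter grid are placeholders, and 'threading' is just one of the backends described in the docs (with a Dask cluster you would use the dask backend instead).

from joblib import parallel_backend
from sklearn.datasets import make_classification      # placeholder data
from sklearn.linear_model import LogisticRegression   # placeholder estimator
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.1, 1.0, 10.0]}, cv=5, n_jobs=-1)

# Switch joblib from process-based to thread-based workers so the
# estimator no longer has to be pickled into child processes.
with parallel_backend('threading'):
    search.fit(X, y)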
I get this error as well, but only on Windows. I am using joblib to run a function (call it func_x) in parallel. That function is imported from a module, let's call it module_a.
module_a also uses a function (call it func_y) from another module, module_b, which it imports using the syntax import module_b.
I found that I can avoid the BrokenProcessPool error if I edit module_a and change the import line to from module_b import func_y.
I also had to remove the if __name__ == '__main__': guard from the main script that imports module_a.
I think this subtle difference in how modules are imported into the namespace determines whether that module can then be pickled by joblib for parallel processing on Windows.
I hope this helps!
--
A minimal reproducible example is below:
Original main.py
from joblib import Parallel, delayed
import module_a

if __name__ == '__main__':
    Parallel(n_jobs=4, verbose=3)(delayed(module_a.func_x)(i) for i in range(50))
Original module_a.py (fails on Windows with BrokenProcessPool error; kernel restart required)
import module_b

def func_x(i):
    j = i ** 3
    k = module_b.func_y(j)
    return k
Edited main.py
from joblib import Parallel, delayed
import module_a

Parallel(n_jobs=4, verbose=3)(delayed(module_a.func_x)(i) for i in range(50))
Edited module_a.py (succeeds on Windows)
from module_b import func_y  # changed

def func_x(i):
    j = i ** 3
    k = func_y(j)  # changed
    return k
module_b.py
def func_y(m):
    k = m ** 3
    return k
If you use the Spyder IDE, simply switch to an external terminal in the settings (Run > Execute in an external system terminal).
I got a similar error when using RandomizedSearchCV with all available cores and threads. I realized that I only get this error when the number of combinations sampled from the parameter space is small, e.g. 5. When I selected a higher n_iter value, the issue was fixed. I believe it has something to do with the fact that the number of fits should be at least as large as the number of cores and parallel compute units on your system.
Hope this might help someone later.
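In other words, keep n_iter at least as large as the number of parallel workers. A rough sketch of what I mean; the estimator and parameter space are placeholders.

from scipy.stats import uniform
from sklearn.datasets import make_classification       # placeholder data
from sklearn.linear_model import LogisticRegression    # placeholder estimator
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# n_iter chosen to be comfortably larger than the number of cores
# used by n_jobs=-1, which is what avoided the error for me.
search = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                            {'C': uniform(0.01, 10)},
                            n_iter=32, cv=5, n_jobs=-1, random_state=0)
search.fit(X, y)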
Related
I'm trying to run a sequence of more than one ParallelRunStep in an AzureML pipeline. To do so, I create each step with the following helper:
def create_step(name, script, inp, inp_ds):
    out = pip_core.PipelineData(name=f"{name}_out", datastore=dstore, is_directory=True)
    out_ds = out.as_dataset()
    out_ds_named = out_ds.as_named_input(f"{name}_out")

    config = cont_steps.ParallelRunConfig(
        source_directory="src",
        entry_script=script,
        mini_batch_size="1",
        error_threshold=0,
        output_action="summary_only",
        compute_target=compute_target,
        environment=component_env,
        node_count=2,
        logging_level="DEBUG"
    )

    step = cont_steps.ParallelRunStep(
        name=name,
        parallel_run_config=config,
        inputs=[inp_ds],
        output=out,
        arguments=[],
        allow_reuse=False,
    )
    return step, out, out_ds_named
As an example, I create two steps like this:
step1, out1, out1_ds_named = create_step("step1", "demo_s1.py", input_ds, named_input_ds)
step2, out2, out2_ds_named = create_step("step2", "demo_s2.py", out1, out1_ds_named)
Creating an experiment and submitting it to an existing workspace and Azure ML compute cluster works. The first step, step1, also uses the input_ds, runs its script demo_s1.py (which produces its output files), and finishes successfully.
However, the second step, step2, never gets started,
and there is a final exception:
The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.16968441009521484 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 394
Traceback (most recent call last):
File "driver/amlbi_main.py", line 52, in <module>
main()
File "driver/amlbi_main.py", line 44, in main
JobStarter().start_job()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job_starter.py", line 48, in start_job
job.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job.py", line 70, in start
master.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 174, in start
self._start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 149, in _start
self.wait_for_input_init()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 124, in wait_for_input_init
raise exc
exception.FirstTaskCreationTimeout: Unable to create any task within 600 seconds.
Load the datasource and read the first row locally to see how long it will take.
Set the advanced argument '--first_task_creation_timeout' to a larger value in arguments in ParallelRunStep.
I have the impression that the second step is waiting for some data. However, the first step does create the supplied output directory and also a file; its entry script is:
import argparse
import os

def init():
    pass

def run(parallel_input):
    print(f"*** Running {os.path.basename(__file__)} with input {parallel_input}")
    parser = argparse.ArgumentParser(description="Data Preparation")
    parser.add_argument('--output', type=str, required=True)
    args, unknown_args = parser.parse_known_args()

    out_path = os.path.join(args.output, "1.data")
    os.makedirs(args.output, exist_ok=True)
    open(out_path, "a").close()
    return [out_path]
I have no idea how to debug this further. Does anybody have an idea?
You can check this notebook for a parallel run example and make sure that you are using the same packages:
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
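Separately, the exception itself suggests raising the first-task-creation timeout. In the helper from the question, that would mean passing the advanced argument through the (currently empty) arguments list, roughly like this; the 1200-second value is just a guess.

step = cont_steps.ParallelRunStep(
    name=name,
    parallel_run_config=config,
    inputs=[inp_ds],
    output=out,
    # advanced argument named in the error message; the value is an assumption
    arguments=["--first_task_creation_timeout", "1200"],
    allow_reuse=False,
)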
I've been trying to use a custom OpenAI Gym environment for a fixed-wing UAV from https://github.com/eivindeb/fixed-wing-gym by testing it with the OpenAI stable-baselines algorithms, but I have been running into issues for several days now. My baseline is the CartPole example "Multiprocessing: Unleashing the Power of Vectorized Environments" from https://stable-baselines.readthedocs.io/en/master/guide/examples.html#multiprocessing-unleashing-the-power-of-vectorized-environments, since I need to supply arguments and I am trying to use multiprocessing, which I believe this example covers.
I have modified the baseline example as follows:
import gym
import numpy as np
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds
from stable_baselines import ACKTR, PPO2
from gym_fixed_wing.fixed_wing import FixedWingAircraft
def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param num_env: (int) the number of environments you wish to have in subprocesses
    :param seed: (int) the initial seed for RNG
    :param rank: (int) index of the subprocess
    """
    def _init():
        env = FixedWingAircraft("fixed_wing_config.json")
        #env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

if __name__ == '__main__':
    env_id = "fixed_wing"
    #env_id = "CartPole-v1"
    num_cpu = 4  # Number of processes to use
    # Create the vectorized environment
    env = SubprocVecEnv([lambda: FixedWingAircraft for i in range(num_cpu)])
    #env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

    model = PPO2(MlpPolicy, env, verbose=1)
    model.learn(total_timesteps=25000)

    obs = env.reset()
    for _ in range(1000):
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
and the error I keep getting is the following:
Traceback (most recent call last):
File "/home/bonie/PycharmProjects/deepRL_fixedwing/fixed-wing-gym/gym_fixed_wing/ACKTR_fixedwing.py", line 38, in <module>
model = PPO2(MlpPolicy, env, verbose=1)
File "/home/bonie/PycharmProjects/deepRL_fixedwing/stable-baselines/stable_baselines/ppo2/ppo2.py", line 104, in __init__
self.setup_model()
File "/home/bonie/PycharmProjects/deepRL_fixedwing/stable-baselines/stable_baselines/ppo2/ppo2.py", line 134, in setup_model
n_batch_step, reuse=False, **self.policy_kwargs)
File "/home/bonie/PycharmProjects/deepRL_fixedwing/stable-baselines/stable_baselines/common/policies.py", line 660, in __init__
feature_extraction="mlp", **_kwargs)
File "/home/bonie/PycharmProjects/deepRL_fixedwing/stable-baselines/stable_baselines/common/policies.py", line 540, in __init__
scale=(feature_extraction == "cnn"))
File "/home/bonie/PycharmProjects/deepRL_fixedwing/stable-baselines/stable_baselines/common/policies.py", line 221, in __init__
scale=scale)
File "/home/bonie/PycharmProjects/deepRL_fixedwing/stable-baselines/stable_baselines/common/policies.py", line 117, in __init__
self._obs_ph, self._processed_obs = observation_input(ob_space, n_batch, scale=scale)
File "/home/bonie/PycharmProjects/deepRL_fixedwing/stable-baselines/stable_baselines/common/input.py", line 51, in observation_input
type(ob_space).__name__))
NotImplementedError: Error: the model does not support input space of type NoneType
I am not sure what to really pass as the env_id and to the make_env(env_id, rank, seed=0) function. I also suspect that the VecEnv setup for parallel processes is not done properly.
I am coding with Python 3.6 using the PyCharm IDE on Ubuntu 18.04.
Any suggestions would really help at this point!
Thank you.
You created a custom environment all right, but you didn't register it with the OpenAI Gym interface. That's what the env_id refers to: all environments in Gym are set up by calling their registered name.
So basically what you need to do is follow the setup instructions here, create the appropriate __init__.py and setup.py scripts, and follow the same file structure.
At the end, locally install your package using pip install -e . from within your environment directory.
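For illustration, a minimal sketch of such a registration, assuming the package is laid out as gym_fixed_wing/ with the FixedWingAircraft class in fixed_wing.py; the id string and the constructor keyword are assumptions, so adjust them to your package.

# gym_fixed_wing/__init__.py
from gym.envs.registration import register

register(
    id='FixedWing-v0',  # this becomes the env_id you pass to gym.make
    entry_point='gym_fixed_wing.fixed_wing:FixedWingAircraft',
    # keyword name is a guess; match the FixedWingAircraft constructor signature
    kwargs={'config_path': 'fixed_wing_config.json'},
)

After pip install -e ., make_env can then build the environment with gym.make('FixedWing-v0') inside _init instead of instantiating the class directly.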
The python pint module implements physical quantities. I would like to use it together with multiprocessing. However, I don't know how to handle creating a UnitRegistry in the new process. If I do the intuitive:
from multiprocessing import Process
from pint import UnitRegistry, set_application_registry
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
def f(one, two):
    print(one / two)

if __name__ == '__main__':
    p = Process(target=f, args=(Q(50, 'ms'), Q(50, 'ns')))
    p.start()
    p.join()
Then I get the following exception:
Traceback (most recent call last):
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\multiprocessing\process.py", line 254, in _bootstrap
self.run()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\pmaunz\PyCharmProjects\IonControl34\tests\pintmultiprocessing.py", line 12, in f
print(one / two)
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 738, in __truediv__
return self._mul_div(other, operator.truediv)
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 675, in _mul_div
offset_units_self = self._get_non_multiplicative_units()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 1312, in _get_non_multiplicative_units
offset_units = [unit for unit in self._units.keys()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 1313, in <listcomp>
if not self._REGISTRY._units[unit].is_multiplicative]
KeyError: 'millisecond'
I assume this originates from the UnitRegistry not being initialized in the child process before the arguments are unpickled. (Initializing the UnitRegistry inside the function f does not work, as the arguments have already been unpickled by then.)
How would I go about sending a pint Quantity to a child process?
Edit after Tim Peters' answer:
The problem is not tied to multiprocessing. Simply pickling quantities:
from pint import UnitRegistry, set_application_registry
import pickle
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
with open("pint.pkl", 'wb') as f:
pickle.dump(Q(50, 'ms'), f)
pickle.dump(Q(50, 'ns'), f)
and then unpickling in a new script leads to the same problem:
from pint import UnitRegistry, set_application_registry
import pickle
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
with open("pint.pkl", 'rb') as f:
t1 = pickle.load(f)
t2 = pickle.load(f)
print(t1 / t2)
results in the same exception. As Tim points out, it is sufficient to add a line Q(50, 'ns'); Q(50, 'ms') before unpickling. Digging into the pint source code shows that, upon creation of a quantity with unit ms, this unit is added to an internal registry. Pickling uses a UnitContainer instance to save the units, but when a Quantity is created via unpickling, the unit is not added to the registry.
A simple fix (in the pint source code) is to change the function Quantity.__reduce__ to return a string:
diff --git a/pint/quantity.py b/pint/quantity.py
index 3f30a25..695866a 100644
--- a/pint/quantity.py
+++ b/pint/quantity.py
@@ -57,7 +57,7 @@ class _Quantity(SharedRegistryObject):

     def __reduce__(self):
         from . import _build_quantity
-        return _build_quantity, (self.magnitude, self._units)
+        return _build_quantity, (self.magnitude, str(self._units))

     def __new__(cls, value, units=None):
         if units is None:
I have opened an issue on pint's GitHub site.
I never used pint before, but this looked interesting ;-) The first thing I noted is that I have no problem if I stick to units explicitly listed by this line:
print(dir(ureg.sys.mks))
For example, "hour" and "second" are both in the output of that, and your program runs fine if the Process line is changed to:
p = Process(target=f, args=(Q(50, 'hour'), Q(50, 'second')))
You're on Windows, so multiprocessing is using the "spawn" method: the entire program is imported fresh by the worker process, so in particular the:
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
lines were executed in the worker process too. So the unit registry is initialized in the worker, but it's not the same (identical) registry used in the main program - no memory is shared between processes.
To get much deeper we really need an expert in how pint is implemented. My guess is that for units that are "made up" by parsing strings (i.e. not in the output produced by the dir() line above), new state is added to the registry at some level, which is needed later to reconstruct the values. "ns" and "ms" are of this nature: they are not in the dir() output.
Your program works fine as-is if I add a line like this immediately after your Q=ureg.Quantity line:
Q(1, 'ms'); Q(1, 'ns')
That was a shot in the dark (an "educated guess") that worked: it just forced the worker process to parse the same "made up" units used in the main process, to try to force its unit registry into a similar state.
I hope there's a cleaner way to get it to work, but I can't help more. I'd ask the pint authors about it.
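For completeness, here is the question's program with that shot-in-the-dark workaround folded in; the only addition is the Q(1, 'ms'); Q(1, 'ns') line.

from multiprocessing import Process
from pint import UnitRegistry, set_application_registry

ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
# Force both the parent and the spawned worker to parse these units,
# so the worker's registry knows about 'ms' and 'ns' before unpickling.
Q(1, 'ms'); Q(1, 'ns')

def f(one, two):
    print(one / two)

if __name__ == '__main__':
    p = Process(target=f, args=(Q(50, 'ms'), Q(50, 'ns')))
    p.start()
    p.join()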
I run the following file (called test_func.py) with py.test:
import findspark
findspark.init()
from pyspark.context import SparkContext
def filtering(data):
    return data.map(lambda p: modif(p)).count()

def modif(row):
    row.split(",")

class Test(object):
    sc = SparkContext('local[1]')

    def test_filtering(self):
        data = self.sc.parallelize(['1', '2', ''])
        assert filtering(data) == 2
And, because the modif function is used inside the map transform, it fails with the following error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named clustering.test_func
pyspark does not manage to find the modif function. Note that the file test_func.py is in the directory clustering, and I run py.test from inside the clustering directory.
What astonishes me is that if I use the modif function outside a map, it works fine. For instance, if I do modif(data.first()).
Any idea why I get such ImportErrors and how I could fix them?
EDIT
I had already tested what is proposed in Avihoo Mamka's answer, i.e. adding test_func.py to the files copied to all workers. However, it had no effect. This is no surprise to me, because I think the main file, from which the Spark application is created, is always sent to all workers.
I think it can come from the fact that pyspark is looking for clustering.test_func and not test_func.
The key here is the Traceback you got.
PySpark is telling you that the worker process doesn't have access to clustering.test_func.py. When you initialize the SparkContext you can pass a list of files that should be copied to the worker:
sc = SparkContext("local[1]", "App Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])
More information: https://spark.apache.org/docs/1.5.2/programming-guide.html
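Since the worker is looking for the package-qualified name clustering.test_func, shipping the single file may not be enough; one sketch (the zip path and its contents are assumptions) is to zip the whole clustering package and pass that instead:

from pyspark.context import SparkContext

# 'clustering.zip' is assumed to contain clustering/__init__.py and
# clustering/test_func.py, so workers can resolve clustering.test_func.
sc = SparkContext('local[1]', 'tests', pyFiles=['clustering.zip'])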
All I need to do is train two regression models (using scikit-learn) on the same data at the same time, using different cores. I've tried to figure it out by myself using Process, without success.
gb1 = GradientBoostingRegressor(n_estimators=10)
gb2 = GradientBoostingRegressor(n_estimators=100)
def train_model(model, data, target):
    model.fit(data, target)
live_data # Pandas DataFrame object
target # Numpy array object
p1 = Process(target=train_model, args=(gb1, live_data, target)) # same data
p2 = Process(target=train_model, args=(gb2, live_data, target)) # same data
p1.start()
p2.start()
If I run the code above, I get the following error while trying to start the p1 process:
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
p1.start()
File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
self._popen = Popen(self)
File "C:\Python27\lib\multiprocessing\forking.py", line 274, in __init__
to_child.close()
IOError: [Errno 22] Invalid argument
I'm running all of this as a script (in IDLE) on Windows. Any suggestions on how I should proceed?
OK, after hours spent trying to get this working, I'll post my solution.
First thing: if you're on Windows and you're using the interactive interpreter, you need to encapsulate all your code under the 'main' condition, with the exception of function definitions and imports. This is because, when a new process is spawned, the module is re-imported, and without the guard the spawning would go into a loop.
My solution is below:
from sklearn.ensemble import GradientBoostingRegressor
from multiprocessing import Pool
from itertools import repeat
def train_model(params):
    model, data, target = params
    # Pool.map passes only a single argument to the function,
    # so we pass one tuple and unpack it as above
    model.fit(data, target)
    return model
if __name__ == '__main__':
    gb1 = GradientBoostingRegressor(n_estimators=10)
    gb2 = GradientBoostingRegressor(n_estimators=100)

    live_data  # Pandas DataFrame object
    target     # Numpy array object

    po = Pool(2)  # 2 is the number of processes we want to spawn
    gb, gb2 = po.map_async(train_model,
                           zip([gb1, gb2], repeat(live_data), repeat(target))
                           # this zips everything into one iterable object
                           ).get()
    # get() will start the processes and execute them
    po.terminate()
    # kill the spawned processes