Second `ParallelRunStep` in pipeline times out at start - python

I'm trying to run a sequence of more than one ParallelRunStep in an AzureML pipeline. To do so, I create each step with the following helper:
def create_step(name, script, inp, inp_ds):
out = pip_core.PipelineData(name=f"{name}_out", datastore=dstore, is_directory=True)
out_ds = out.as_dataset()
out_ds_named = out_ds.as_named_input(f"{name}_out")
config = cont_steps.ParallelRunConfig(
step = cont_steps.ParallelRunStep(
return step, out, out_ds_named
As an example, I create two steps like this:
step1, out1, out1_ds_named = create_step("step1", "", input_ds, named_input_ds)
step2, out2, out2_ds_named = create_step("step2", "", out1, out1_ds_named)
Creating an experiment and submitting it to an existing workspace and Azure ML compute cluster works. Also, the first step, step1, uses input_ds, runs its script (which produces its output files), and finishes successfully.
However, the second step, step2, never gets started.
Eventually the run fails with this exception:
The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.16968441009521484 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 394
Traceback (most recent call last):
File "driver/", line 52, in <module>
File "driver/", line 44, in main
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/", line 48, in start_job
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/", line 70, in start
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/", line 174, in start
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/", line 149, in _start
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/", line 124, in wait_for_input_init
raise exc
exception.FirstTaskCreationTimeout: Unable to create any task within 600 seconds.
Load the datasource and read the first row locally to see how long it will take.
Set the advanced argument '--first_task_creation_timeout' to a larger value in arguments in ParallelRunStep.
I have the impression that the second step is waiting for some data. However, the first step creates the supplied output directory and also a file. This is the script used by the steps:
import argparse
import os

def init():
    pass

def run(parallel_input):
    print(f"*** Running {os.path.basename(__file__)} with input {parallel_input}")
    parser = argparse.ArgumentParser(description="Data Preparation")
    parser.add_argument('--output', type=str, required=True)
    args, unknown_args = parser.parse_known_args()
    out_path = os.path.join(args.output, "")
    os.makedirs(args.output, exist_ok=True)
    open(out_path, "a").close()
    return [out_path]
I have no idea how to debug further. Does anybody have an idea?

You can check this notebook for parallel run and make sure that you are using the same packages.
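The timeout message itself suggests loading the data source locally and timing how long it takes to reach the first item. Here is a minimal sketch of that check, assuming the output directory of the first step has been downloaded or mounted to a local path. The helper name time_to_first_file is hypothetical and not part of the AzureML SDK; it only uses the standard library.

```python
import os
import time


def time_to_first_file(root):
    """Walk `root` and return (path_of_first_file_or_None, seconds_elapsed).

    Hypothetical helper: it only measures how long a plain directory walk
    needs to reach and open the first file.
    """
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Read one byte so that a lazily mounted file is actually fetched.
            with open(path, "rb") as fh:
                fh.read(1)
            return path, time.perf_counter() - start
    # No file anywhere under root: the next step would have nothing to process.
    return None, time.perf_counter() - start
```

If this returns None for the directory that step1 wrote, step2 may have no input to create a task from, which would be consistent with the FirstTaskCreationTimeout above.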


Pbixrefresher for Power BI Desktop report refresh/publish error

I am trying to refresh Power BI more frequently than the gateway's scheduled refresh currently allows.
I found this:
I installed it and verified that I have all the required packages.
Right now it works fine until the end: after the refresh, the Save function seems to execute correctly, but the report is not saved. When the Publish function runs, a prompt asks whether the user would like to save, and then there is a timeout.
I have tried increasing the time-out argument and adding more wait time in the routine (along with a couple of other suggested ideas from the GitHub issues thread).
Below is what cmd looks like along with the error. I also added the main routine of the pbixrefresher file in case there is a different way to save (hotkeys) or something else worth trying. I tried this both as my user and as admin in CMD, but wasn't sure whether a permissions setting could block the report from saving. Thank you for reading; any help is greatly appreciated.
Starting Power BI
Waiting 15 sec
Identifying Power BI window
Waiting for refresh end (timeout in 100000 sec)
Traceback (most recent call last):
File "c:\python36\lib\site-packages\pywinauto\", line 258, in __resolve_control
File "c:\python36\lib\site-packages\pywinauto\", line 458, in wait_until_passes
raise err
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\python36\lib\", line 193, in _run_module_as_main
"main", mod_spec)
File "c:\python36\lib\", line 85, in run_code
exec(code, run_globals)
File "C:\Python36\Scripts\", line 9, in
File "c:\python36\lib\site-packages\pbixrefresher\", line 77, in main
publish_dialog.child_window(title = WORKSPACE, found_index=0).click_input()
File "c:\python36\lib\site-packages\pywinauto\", line 379, in getattribute
ctrls = self.__resolve_control(self.criteria)
File "c:\python36\lib\site-packages\pywinauto\", line 261, in __resolve_control
raise e.original_exception
File "c:\python36\lib\site-packages\pywinauto\", line 436, in wait_until_passes
func_val = func(*args, **kwargs)
File "c:\python36\lib\site-packages\pywinauto\", line 222, in __get_ctrl
ctrl = self.backend.generic_wrapper_class(findwindows.find_element(**ctrl_criteria))
File "c:\python36\lib\site-packages\pywinauto\", line 87, in find_element
raise ElementNotFoundError(kwargs)
pywinauto.findwindows.ElementNotFoundError: {'auto_id': 'KoPublishToGroupDialog', 'top_level_only': False, 'parent': <uia_element_info.UIAElementInfo - 'Simple - Power BI Desktop',, 8914246>, 'backend': 'uia'}
The main routine from pbixrefresher:
def main():
# Parse arguments from cmd
parser = argparse.ArgumentParser()
parser.add_argument("workbook", help = "Path to .pbix file")
parser.add_argument("--workspace", help = "name of online Power BI service work space to publish in", default = "My workspace")
parser.add_argument("--refresh-timeout", help = "refresh timeout", default = 30000, type = int)
parser.add_argument("--no-publish", dest='publish', help="don't publish, just save", default = True, action = 'store_false' )
parser.add_argument("--init-wait", help = "initial wait time on startup", default = 15, type = int)
args = parser.parse_args()
timings.after_clickinput_wait = 1
WORKBOOK = args.workbook
WORKSPACE = args.workspace
INIT_WAIT = args.init_wait
REFRESH_TIMEOUT = args.refresh_timeout
# Kill running PBI
PROCNAME = "PBIDesktop.exe"
for proc in psutil.process_iter():
# check whether the process name matches
# Start PBI and open the workbook
print("Starting Power BI")
os.system('start "" "' + WORKBOOK + '"')
print("Waiting ",INIT_WAIT,"sec")
# Connect pywinauto
print("Identifying Power BI window")
app = Application(backend = 'uia').connect(path = PROCNAME)
win = app.window(title_re = '.*Power BI Desktop')
win.wait("enabled", timeout = 300)
win.Save.wait("enabled", timeout = 300)
win.Save.wait("enabled", timeout = 300)
win.wait("enabled", timeout = 300)
# Refresh
print("Waiting for refresh end (timeout in ", REFRESH_TIMEOUT,"sec)")
win.wait("enabled", timeout = REFRESH_TIMEOUT)
# Save
type_keys("%1", win)
win.wait("enabled", timeout = REFRESH_TIMEOUT)
# Publish
if args.publish:
publish_dialog = win.child_window(auto_id = "KoPublishToGroupDialog")
publish_dialog.child_window(title = WORKSPACE).click_input()
win.Replace.wait('visible', timeout = 10)
except Exception:
if win.Replace.exists():
win["Got it"].wait('visible', timeout = REFRESH_TIMEOUT)
win["Got it"].click_input()
# Force close
for proc in psutil.process_iter():
if __name__ == '__main__':
except Exception as e:
Had the same issue; saving with win.type_keys("^S") instead solved it.
I think I finally figured out what was happening. My prior post about version compatibility did not solve the issue.
It's a very strange bug: pywinauto sometimes misses the correct button. I was able to reproduce this multiple times, although it didn't happen every time; it sometimes happened on win.Refresh.click_input() and sometimes on win.Publish.click_input(). This is why the popup dialog 'KoPublishToGroupDialog' cannot be found after clicking Publish and the script fails. It also means that the Refresh won't work consistently, because the script doesn't check whether the dialog opens; it just waits for a predetermined amount of time to allow the refresh to finish.
I implemented a check to see if the dialog window actually opened, e.g.
if not win.child_window(auto_id="KoPublishToGroupDialog").exists():
    raise AttributeError("publish dialog failed to open")
along with a retry and maximum wait loop but this didn't fix the issue. What finally worked was much simpler:
The toolbar of the Power BI Desktop application has two modes, expanded and small. The default setup is expanded mode; you can minimize the toolbar to use small icons with the small arrow in the top right of Power BI Desktop.
Switching to the small toolbar seems to take care of the pywinauto bug (or whatever actually causes it) where the Refresh or Publish buttons are missed when clicking them.
Hope this helps someone, it took way too long to figure this out.
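For reference, the "retry and maximum wait loop" around the dialog check mentioned above could look roughly like the sketch below. The names wait_until and click_until_open are hypothetical helpers, not pywinauto functions, and as noted this check alone did not fix the missed clicks; it only makes the failure show up earlier and more clearly.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it is truthy or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def click_until_open(click, is_open, attempts=3, timeout=10.0, interval=0.5):
    """Call click(), then wait for is_open(); repeat up to `attempts` times."""
    for _ in range(attempts):
        click()
        if wait_until(is_open, timeout=timeout, interval=interval):
            return True
    return False


# Possible use with the names from the script above (not executed here):
# opened = click_until_open(
#     win.Publish.click_input,
#     lambda: win.child_window(auto_id="KoPublishToGroupDialog").exists(),
# )
# if not opened:
#     raise AttributeError("publish dialog failed to open")
```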
Unfortunately this was NOT the full solution. What I didn't realize (and what probably had a lot to do with missing the buttons) was that the screen resolution of the virtual machine I'm using to run this script was changed when the connection was closed. I'm using the tscon.exe solution to connect the desktop session to the console via a batch script so it isn't shut down after disconnecting. To keep the resolution I installed qres.exe (added to PATH) and added it to the batch script:
for /f "skip=1 tokens=3" %%s in ('query user %USERNAME%') do (
  %windir%\System32\tscon.exe %%s /dest:console
)
timeout 5
qres /X 1920 /Y 1080 /C 32
I use this script to disconnect from the remote desktop session but keep the screen resolution. I also set up my RDP client to connect using this resolution.
At the moment I'm still not 100% sure what else might be necessary to get this to work consistently. For example, after setting up the new fixed resolution I had to open Power BI Desktop manually and set it to windowed (non-maximized) mode so it wouldn't completely miss the Refresh button (it was at least close before). My last test (staying connected) worked; when disconnected from RDP, it still seems to have issues.

How to fix pickle.unpickling error caused by calls to subprocess.Popen in parallel script that uses mpi4py

Repeated serial calls to subprocess.Popen() in a script parallelized with mpi4py eventually cause what seems to be data corruption during communication, manifesting as a pickle.unpickling error of various types (I have seen the unpickling errors: EOF, invalid unicode character, invalid load key, unpickling stack underflow). It seems to only happen when the data being communicated is large, the number of serial calls to subprocess is large, or the number of mpi processes is large.
I can reproduce the error with python>=2.7, mpi4py>=3.0.1, and openmpi>=3.0.0. I would ultimately like to communicate Python objects, so I am using the lowercase mpi4py methods. Here is a minimal script which reproduces the error:
#!/usr/bin/env python
from mpi4py import MPI
from copy import deepcopy
import subprocess

nr_calcs = 4
tasks_per_calc = 44
data_size = 55000

# --------------------------------------------------------------------
def run_test(nr_calcs, tasks_per_calc, data_size):
    # Init MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    comm_size = comm.Get_size()

    # Run Moc Calcs
    icalc = 0
    while True:
        if icalc > nr_calcs - 1: break
        index = icalc
        icalc += 1

        # Init Moc Tasks
        task_list = []
        moc_task = data_size*"x"
        if rank==0:
            task_list = [deepcopy(moc_task) for i in range(tasks_per_calc)]
        task_list = comm.bcast(task_list)

        # Moc Run Tasks
        itmp = rank
        while True:
            if itmp > len(task_list)-1: break
            itmp += comm_size
            proc = subprocess.Popen(["echo", "TEST CALL TO SUBPROCESS"],
                    stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
            out,err = proc.communicate()

        print("Rank {:3d} Finished Calc {:3d}".format(rank, index))

# --------------------------------------------------------------------
if __name__ == '__main__':
    run_test(nr_calcs, tasks_per_calc, data_size)
Running this on one 44 core node with 44 mpi processes completes the first 3 "calculations" successfully, but on the final loop some processes raise:
Traceback (most recent call last):
File "./", line 54, in <module>
run_test(nr_calcs, tasks_per_calc, data_size)
File "./", line 39, in run_test
task_list = comm.bcast(task_list)
File "mpi4py/MPI/Comm.pyx", line 1257, in mpi4py.MPI.Comm.bcast
File "mpi4py/MPI/msgpickle.pxi", line 639, in mpi4py.MPI.PyMPI_bcast
File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
Sometimes the UnpicklingError has a descriptor, such as invalid load key "x", or EOF Error, invalid unicode character, or unpickling stack underflow.
Edit: It appears the problem disappears with openmpi<3.0.0, and using mvapich2, but it would still be good to understand what's going on.
I had the same problem. In my case, I got my code to work by installing mpi4py in a Python virtual environment, and by setting mpi4py.rc.recv_mprobe = False as suggested by Intel:
However, in the end I just switched to using the capital letter methods Recv and Send with NumPy arrays. They work fine with subprocess and they don't need any additional tricks.
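The uppercase methods work on raw buffers instead of pickled objects, so Python data such as the task list of strings in the question has to be packed into a buffer first. I used NumPy arrays; the sketch below shows the same idea with only the standard library, packing strings into a length-prefixed bytearray. The names pack_strings and unpack_strings are hypothetical helpers, not part of mpi4py, and the MPI calls themselves are left out, so this only demonstrates the packing step.

```python
import struct


def pack_strings(items):
    """Pack a list of str into one bytearray: a count, then (length, bytes) pairs."""
    out = bytearray(struct.pack("<I", len(items)))
    for text in items:
        data = text.encode("utf-8")
        out += struct.pack("<I", len(data))
        out += data
    return out


def unpack_strings(buf):
    """Inverse of pack_strings: rebuild the list of str from the buffer."""
    view = memoryview(buf)
    (count,) = struct.unpack_from("<I", view, 0)
    offset = 4
    items = []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", view, offset)
        offset += 4
        items.append(bytes(view[offset:offset + size]).decode("utf-8"))
        offset += size
    return items
```

With a buffer-based broadcast the receiving ranks need to allocate a buffer of the right size up front, so the length of the packed buffer would have to be communicated first (an assumption about how you would wire this into your own code).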

Writing to file and running the files sometimes works, mostly only first one

This code raises WindowsError most of the time; only rarely (often the first time it is run) does it work.
""" Running hexlified codes from codefiles module prepared previously """
import tempfile
import subprocess
import threading
import os
import codefiles

if __name__ == '__main__':
    for ind, c in enumerate(codefiles.exes):
        fn = tempfile.mktemp() + '.exe'
        # for some reason hexlified code is sometimes odd length and one nibble
        if len(c) & 1:
            c += '0'
        c = c.decode('hex')
        with open(fn, 'wb') as f:
            f.write(c)
        threading.Thread(target=lambda:subprocess.Popen("", executable=fn)).start()
""" One time works, one time WindowsError 32
132096 c:\docume~1\admin\locals~1\temp\tmpkhhxxo.exe
991232 c:\docume~1\admin\locals~1\temp\tmp9ow6zz.exe
>>> ================================ RESTART ================================
132096 c:\docume~1\admin\locals~1\temp\tmp3hb0cf.exe
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Python27\lib\", line 810, in __bootstrap_inner
File "C:\Python27\lib\", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Documents and Settings\Admin\My Documents\Google Drive\Python\Tools\runner.pyw", line 18, in <lambda>
threading.Thread(target=lambda:subprocess.Popen("", executable=fn)).start()
File "C:\Python27\lib\", line 710, in __init__
errread, errwrite)
File "C:\Python27\lib\", line 958, in _execute_child
WindowsError: [Error 32] The process cannot access the file because it is being used by another process
991232 c:\docume~1\admin\locals~1\temp\tmpnkfuon.exe
Hexlification is done with this script; it sometimes seems to produce an odd number of nibbles, which is also odd.
# bootstrapper for running hexlified executables,
# hexlification can be done by running this file directly or calling function make
# running can be done by run function in runner module, which imports the code,
# runs them from temporary files
import os
modulename = ''
def code_file_names(d='.', ext='.exe'):
return [n for n in os.listdir(d)
if n.lower().endswith(ext)]
def make():
codes = code_file_names(os.curdir)
with open(modulename, 'a') as f:
f.write('exes = (')
hex_codes = [open(n, 'rb').read().encode('hex') for n in codes]
assert all(len(c) & 1 == 0 for c in hex_codes)
print len(hex_codes),map(len, hex_codes)
hex_codes = [repr(c) for c in hex_codes]
if __name__ == '__main__':
import make_exe
# prepare hexlified exes for exes in directory if codefiles not prepared
if modulename not in os.listdir('.'):
# prepare script for py2_exe to execute the scripts by run from runner
make_exe.py2_exe('runner.pyw', 'tools/dvd.ico')
After Mr. Martelli's suggestion, the error disappeared, but I still don't get the expected result.
In the new version of the code I skip creating the exe file if it already exists (the file names are saved in the new creation routine). It launches both codes (two hexlified codes) when creating the files, but multiple copies of the first code afterwards.
tempfile.mktemp(), besides having been deprecated for many years because of its security problems, guarantees unique names only within a single run of a program, since, per the tempfile documentation, "The module uses a global variable that tell it how to construct a temporary name". So on the second run, as the very clear error message tells you, "The process cannot access the file because it is being used by another process" (specifically, the process started by the previous run; i.e., you cannot rewrite an .exe that is currently running in some process).
The fix is to make sure each run uses its own unique directory for temporary files. See tempfile.mkdtemp in the same module. How to eventually clean up those temporary directories is a different issue, since that can't be done as long as any process is running an .exe file within such a directory -- you'll probably need a separate "clean-up script" that does what it can for the purpose, run periodically (e.g., on Unix I'd use cron), and a repository (e.g., a small sqlite DB, or even just a file) to record which of the temporary directories created in previous runs still exist and need to be cleared up (catching the exceptions seen when they can't get cleaned up yet, so as to retry in the future).
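A minimal sketch of the per-run directory idea, written for Python 3 and using only tempfile.mkdtemp from the standard library. The function name write_payloads and the payload_N.exe file names are hypothetical; launching the files and the later clean-up are deliberately left out.

```python
import os
import tempfile


def write_payloads(payloads, prefix="runner_"):
    """Write each bytes payload into a fresh directory that is unique to this run.

    Returns (run_dir, paths). mkdtemp creates a new directory on every call,
    so a later run never tries to overwrite a file that an earlier run's
    process may still be executing.
    """
    run_dir = tempfile.mkdtemp(prefix=prefix)
    paths = []
    for index, payload in enumerate(payloads):
        path = os.path.join(run_dir, "payload_%d.exe" % index)
        with open(path, "wb") as fh:
            fh.write(payload)
        paths.append(path)
    return run_dir, paths
```

The returned run_dir is what a clean-up script would later need to know about, for example by appending it to a small file of directories still to be removed.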

Train two models concurrently

All I need to do is train two regression models (using scikit-learn) on the same data at the same time, using different cores. I've tried to figure it out by myself using Process, without success.
gb1 = GradientBoostingRegressor(n_estimators=10)
gb2 = GradientBoostingRegressor(n_estimators=100)
def train_model(model, data, target):
    model.fit(data, target)
live_data # Pandas DataFrame object
target # Numpy array object
p1 = Process(target=train_model, args=(gb1, live_data, target)) # same data
p2 = Process(target=train_model, args=(gb2, live_data, target)) # same data
If I run the code above I get the following error while trying to start the p1 process.
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
File "C:\Python27\lib\multiprocessing\", line 130, in start
self._popen = Popen(self)
File "C:\Python27\lib\multiprocessing\", line 274, in __init__
IOError: [Errno 22] Invalid argument
I'm running all this as a script (in IDLE) on Windows. Any suggestions on how should I proceed?
OK, after hours spent trying to get this working, I'll post my solution.
First thing: if you're on Windows and you're using the interactive interpreter, you need to put all your code under the if __name__ == '__main__': condition, with the exception of function definitions and imports. Otherwise each newly spawned process re-imports the module, runs the top-level code again, and you end up spawning processes in a loop.
My solution below:
from sklearn.ensemble import GradientBoostingRegressor
from multiprocessing import Pool
from itertools import repeat

def train_model(params):
    # since the Pool worker accepts only one argument, we pass a single
    # tuple and unroll it here
    model, data, target = params
    model.fit(data, target)
    return model

if __name__ == '__main__':
    gb1 = GradientBoostingRegressor(n_estimators=10)
    gb2 = GradientBoostingRegressor(n_estimators=100)
    live_data  # Pandas DataFrame object
    target  # Numpy array object
    po = Pool(2)  # 2 is the number of processes we want to spawn
    gb, gb2 = po.map_async(train_model,
                           zip([gb1, gb2], repeat(live_data), repeat(target))
                           # this will zip in one iterable object
                           ).get()
    # get() waits for the workers to finish and returns the fitted models
    po.terminate()
    # kill the spawned processes

Sequentially running an external independent process using Tkinter and python

*I'm creating a batch simulation job chooser + scheduler using Tkinter (Portable PYscripter, python v2.7.3)
*This program will function as a front end, to a commercial solver program
*The program needs to allow the user to choose a bunch of files to simulate, sequentially, one after the other.
*It also needs to have the facility to modify (Add/delete) jobs from an existing/running job list.
*Each simulation will definitely run for several hours.
*The output of the simulation will be viewed on separate programs and I do not need any pipe to the output. The external viewer will be called from the GUI, when desired.
***I have a main GUI window, which allows the user to :
choose job files, submit jobs, view the submission log, stop running jobs(one by one)
The above works well.
*If I use subprocess.Popen("command") : all the simulation input files are launched at the same time. It MUST be sequential (due to license and memory limitations)
*If I use subprocess.call("command") or the wait() method, then the GUI hangs and there is no scope to stop/add/modify the job list. Even if the "job submit" command is on an independent window, both the parent windows hang until the job completes.
*How do I launch the simulation jobs sequentially (like subprocess.call) AND allow the main GUI window to function for the purpose of job list modification or stopping a job?
The jobs are in a list, taken using "askopenfilenames" and then run using a For loop.
Relevant parts of the Code :
def file_chooser_default():
global flist1
flist1=askopenfilename(parent = root2, filetypes =[('.def', '*.def'),('All', '*.*'),('.res', '*.res')], title ="Select Simulation run files...", multiple = True)[1:-1].split('} {')
def ext_process():
while i < len(flist1):
p[i]='"%s" -def "%s"'%(cfx5solvepath,flist1[i])
while i < len(p):
root2 = Tk()
root2.title("NEW WINDOW")
frame21=Frame(root2, borderwidth=3, relief="solid").pack()
w21= Button(root2,fg="blue", text="Choose files to submit",command=file_chooser_default).pack()
w2a1=Button(root2,fg="white", text= 'Display chosen file names and order', command=lambda:print_var(flist1)).pack()
w2b1= Button (root2,fg="white", bg="red", text="S U B M I T", command=ext_process).pack()
Please let me know if you require anything else. Look forward to your help.
On incorporating the changes suggested by @Tim, the GUI is left free. Since there is a specific sub-program associated with the main solver program to stop the job, I am able to stop the job using the right command.
Once the currently running job is stopped, the next job on the list starts up, automatically, as I was hoping.
This is the code used for stopping the job :
def stop_select(): #Choose the currently running files which are to be stopped
global flist3
flist3=askdirectory().split('} {')
def sim_stop(): #STOP the chosen simulation
st='"%s" -directory "%s"'%(defcfx5stoppath,flist3[0]))
ret1=tkMessageBox.showinfo("INFO","Chosen simulation stopped successfully")
os.chdir("%s" %currentwd)
*Once the above jobs are Completed, using start_new_thread, the GUI doesn't respond. The GUI works while the jobs are running in the background. But the start_new_thread documentation says that the thread is supposed to exit silently when the function returns.
*Additionally, I have a HTML log file that is written into/updated as each job completes. When I use start_new_thread ,the log file content is visible only AFTER all the jobs complete. The contents, along with the time stamps are however correct. Without using start_new_thread, I was able to refresh the HTML file to get the updated submission log.
***On exiting the GUI program using the Task manager several times, I am suddenly unable to use the start_new_thread function !! I have tried reinstalling PYscripter and restarting the computer as well. I can't figure out anything sensible from the traceback, which is:
Traceback (most recent call last):
File "<string>", line 532, in write
File "C:\Portable Python\App\lib\site-packages\rpyc\core\", line 439, in _async_request
seq = self._send_request(handler, args)
File "C:\Portable Python\App\lib\site-packages\rpyc\core\", line 229, in _send_request
self._send(consts.MSG_REQUEST, seq, (handler, self._box(args)))
File "C:\Portable Python\App\lib\site-packages\rpyc\core\", line 244, in _box
if brine.dumpable(obj):
File "C:\Portable Python\App\lib\site-packages\rpyc\core\", line 369, in dumpable
return all(dumpable(item) for item in obj)
File "C:\Portable Python\App\lib\site-packages\rpyc\core\", line 369, in <genexpr>
return all(dumpable(item) for item in obj)
File "C:\Portable Python\App\lib\site-packages\rpyc\core\", line 369, in dumpable
return all(dumpable(item) for item in obj)
File "C:\Portable Python\App\lib\site-packages\rpyc\core\", line 369, in <genexpr>
return all(dumpable(item) for item in obj)
File "C:\Portable Python\App\Python_Working_folder\", line 138, in ext_process
File "C:\Portable Python\App\lib\", line 493, in call
return Popen(*popenargs, **kwargs).wait()
File "C:\Portable Python\App\lib\", line 679, in __init__
errread, errwrite)
File "C:\Portable Python\App\lib\", line 896, in _execute_child
WindowsError: [Error 2] The system cannot find the file specified
I'd suggest using a separate thread for the job launching. The simplest way would be to use the start_new_thread method from the thread module.
Change the submit button's command to command=lambda:thread.start_new_thread(ext_process, ())
You will probably want to disable the button when it's clicked and enable it when the launching is complete. This can be done inside ext_process.
It becomes more complicated if you want to allow the user to cancel jobs. This solution won't handle that.
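As a rough sketch of the background-thread idea, here is a version written for Python 3 with the threading and queue modules instead of thread.start_new_thread. The class name SequentialJobRunner and its methods are hypothetical; run_job defaults to subprocess.call so that each job blocks the worker thread (and therefore runs strictly one after the other) while the GUI thread stays free. Cancelling a running job is not handled here.

```python
import queue
import subprocess
import threading


class SequentialJobRunner:
    """Run submitted commands one at a time on a single background thread."""

    def __init__(self, run_job=subprocess.call):
        self._run_job = run_job          # blocking call that runs one job
        self._jobs = queue.Queue()       # pending commands, in submit order
        self.finished = []               # (command, return value) per done job
        self._worker = threading.Thread(target=self._work, daemon=True)
        self._worker.start()

    def submit(self, command):
        """Queue a command; returns immediately, so it is safe in a button callback."""
        self._jobs.put(command)

    def _work(self):
        while True:
            command = self._jobs.get()
            if command is None:          # sentinel from close(): stop the worker
                break
            result = self._run_job(command)
            self.finished.append((command, result))

    def close(self):
        """Let already queued jobs finish, then stop the worker thread."""
        self._jobs.put(None)
        self._worker.join()
```

The submit button's command would then call runner.submit(...) for each chosen file. Tkinter widgets should only be touched from the main thread, so any status display would have to poll runner.finished from the GUI side (for example with root.after) rather than being updated from the worker.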
