Concurrent.futures opens new windows in tkinter instead of running the function - python

I am trying to run a function concurrently over multiple files in a GUI built with tkinter, using concurrent.futures.
Outside of the GUI, this script works fine. However, whenever I move it into the GUI script, instead of running the function in parallel, the script opens up 5 new tkinter GUI windows (the number of windows it opens is equal to the number of processors I allow the program to use).
I've looked over the code thoroughly and just can't understand why it is opening new windows instead of just running the function over the files.
Can anyone see something I am missing?
An abridged version of the code is below. I have cut out a significant part of the code and only left in the parts pertinent to parallelization. This code undoubtedly has variables in it that I have not defined in this example.
import pandas as pd
import numpy as np
import glob
from pathlib import Path
from tkinter import *
from tkinter import filedialog
from concurrent.futures import ProcessPoolExecutor

window = Tk()
window.title('Problem with parralelizing')
window.geometry('1000x700')

def calculate():
    # establish where the files are coming from to operate on
    folder_input = folder_entry_var.get()
    # establish the number of processors to use
    nbproc = int(np_var.get())
    # loop over files to get a list of files to be worked on by concurrent.futures
    files = []
    for file in glob.glob(rf'{folder_input}'+'//*'):
        files.append(file)

    # this function gets passed to concurrent.futures. I have taken out a significant portion of the
    # function itself as I do not believe the problem resides in the function itself.
    def process_file(filepath):
        excel_input = excel_entry_var.get()
        minxv = float(min_x_var.get())
        maxxv = float(man_x_var.get())
        output_dir = odir_var.get()
        path = filepath
        event_name = Path(path).stem
        event['event_name'] = event_name
        min_x = 292400
        max_x = 477400
        list_of_objects = list(event.object.unique())
        missing_master_surface = []
        for line in list_of_objects:
            df = event.loc[event.object == line]
            current_y = df.y.max()
            y_cl = df.x.values.tolist()
            full_ys = np.arange(min_x, max_x + 200, 200).tolist()
            for i in full_ys:
                missing_values = []
                missing_v_y = []
                exist_yn = []
                event_name_list = []
                if i in y_cl:
                    next
                elif i not in y_cl:
                    missing_values.append(i)
                    missing_v_y.append(current_y)
                    exist_yn.append(0)
                    event_name_list.append(event_name)

    # feed the function to ProcessPoolExecutor to run. At this point, I hear the processors
    # spin up, but all it does is open 5 new tkinter windows (the number of windows is proportionate
    # to the number of processors I give it to run)
    if __name__ == '__main__':
        with ProcessPoolExecutor(max_workers=nbproc) as executor:
            executor.map(process_file, files)

window.mainloop()

I've looked over the code thoroughly and just can't understand why it is opening new windows as opposed to just running the function over the files.
Each worker process has to re-import your code (that is how new processes are started on Windows and macOS). At the very top of your script you call window = Tk() at module level, so every worker process builds a window of its own. That is why you get one window per process.
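Below is a minimal sketch of one way to restructure the script so the workers do not build the GUI: the window (and the entry variables from the question, which are assumed to be created there as well) is constructed only inside the if __name__ == '__main__': block, and the worker function lives at module level and receives plain values instead of reading tkinter variables:

import glob
from pathlib import Path
from tkinter import *
from concurrent.futures import ProcessPoolExecutor

# Worker defined at module level so it can be pickled; it gets plain values,
# because worker processes have no access to the GUI or its variables.
def process_file(filepath, minxv, maxxv, output_dir):
    event_name = Path(filepath).stem
    # ... the rest of the per-file work from the question ...
    return event_name

def calculate():
    folder_input = folder_entry_var.get()
    nbproc = int(np_var.get())
    minxv = float(min_x_var.get())
    maxxv = float(man_x_var.get())
    output_dir = odir_var.get()
    files = glob.glob(rf'{folder_input}' + '//*')
    with ProcessPoolExecutor(max_workers=nbproc) as executor:
        executor.map(process_file, files, [minxv] * len(files),
                     [maxxv] * len(files), [output_dir] * len(files))

if __name__ == '__main__':
    # The window is created only in the main process, so re-importing this
    # module in a worker no longer opens an extra window.
    window = Tk()
    window.title('Problem with parallelizing')
    window.geometry('1000x700')
    # ... entry variables and a button wired to calculate() go here ...
    window.mainloop()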

Related

How to call a linux command line program in parallel with python

I have a command-line program which runs on a single core. It takes an input file, does some calculations, and returns several files which I need to parse to store the produced output.
I have to call the program several times, changing the input file. To speed things up, I was thinking parallelization would be useful.
Until now I have performed this task by calling every run separately within a loop using the subprocess module.
I wrote a script which creates a new working folder on every run and then calls the program, directing its output to that folder, and returning some data which I need to store. My question is: how can I adapt the following code, found here, to execute my script, always using the indicated number of CPUs, and store the output?
Note that each run has a unique running time.
Here is the mentioned code:
import subprocess
import multiprocessing as mp
from tqdm import tqdm

NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)

def work(sec_sleep):
    command = ['python', 'worker.py', sec_sleep]
    subprocess.call(command)

def update_progress_bar(_):
    progress_bar.update()

if __name__ == '__main__':
    pool = mp.Pool(NUMBER_OF_TASKS)
    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(work, (seconds,), callback=update_progress_bar)
    pool.close()
    pool.join()
I am not entirely clear what your issue is. I have some recommendations for improvement below, but you seem to claim on the page that you link to that everything works as expected, and I don't see anything very wrong with the code as long as you are running on Linux.
Since the subprocess.call method is already creating a new process, you should just be using multithreading to invoke your worker function, work. But had you been using multiprocessing, and your platform was one that uses the spawn method to create new processes (such as Windows), then creating the progress bar outside of the if __name__ == '__main__': block would have resulted in the creation of 4 additional progress bars that did nothing. Not good! So for portability it would have been best to move its creation inside the if __name__ == '__main__': block.
import subprocess
from multiprocessing.pool import ThreadPool
from tqdm import tqdm

def work(sec_sleep):
    command = ['python', 'worker.py', sec_sleep]
    subprocess.call(command)

def update_progress_bar(_):
    progress_bar.update()

if __name__ == '__main__':
    NUMBER_OF_TASKS = 4
    progress_bar = tqdm(total=NUMBER_OF_TASKS)
    pool = ThreadPool(NUMBER_OF_TASKS)
    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(work, (seconds,), callback=update_progress_bar)
    pool.close()
    pool.join()
Note: If your worker.py program prints to the console, it will mess up the progress bar (the progress bar will be re-written repeatedly on multiple lines).
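One possible workaround (a hedged sketch, not part of the original answer) is to capture the worker's output in the parent and emit it through tqdm.write, which prints without corrupting the bar; the trade-off is that the output only appears once the worker finishes:

import subprocess
from tqdm import tqdm

def work(sec_sleep):
    # capture the worker's stdout rather than letting it write straight to the console
    result = subprocess.run(['python', 'worker.py', sec_sleep],
                            capture_output=True, text=True)
    if result.stdout:
        tqdm.write(result.stdout.rstrip())  # prints above the progress bar, then redraws it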
Have you considered importing worker.py (some refactoring of that code might be necessary) instead of invoking a new Python interpreter to execute it (in this case you would want to be explicitly using multiprocessing)? On Windows this might not save you anything, since a new Python interpreter would be executed for each new process anyway, but it could save you something on Linux:
import subprocess
from multiprocessing.pool import Pool
from worker import do_work
from tqdm import tqdm

def update_progress_bar(_):
    progress_bar.update()

if __name__ == '__main__':
    NUMBER_OF_TASKS = 4
    progress_bar = tqdm(total=NUMBER_OF_TASKS)
    pool = Pool(NUMBER_OF_TASKS)
    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(do_work, (seconds,), callback=update_progress_bar)
    pool.close()
    pool.join()

Accessing variables in a thread that are being modified in the main program

I'm creating a communication program to make a Raspberry Pi send information to my computer. This information will need to be manipulated by other threads. So, in the main thread I'm initializing the server and so on and manipulating the variable, and in the secondary thread I'm manipulating the data.
The code I'm using is the following:
import sys  # Necessary to insert the path of files that are in other folders
from communication_class import Server
import pickle
import threading
import time

sys.path.append('/home/pablo/Desktop/MY_THESIS/MasterThesisFiles_AR/measurement_processing/')
import filtering

computer = Server()  # Needs to be initialized before-hand so it can be used in the
computer.ChangePort(5020)

def DataEvaluation(n_var, n_states):
    print("Inside Data Evaluation")
    while True:
        print(computer.data)
        #time.sleep(1)

data_thread = threading.Thread(target=DataEvaluation, args=(9, 5))

computer.create_socket()
computer.create_server()
computer.start(computer.handle_readings)
The problem is that data_thread isn't doing what it should, or at least it doesn't print any information (that is, the print(computer.data) and the print("Inside Data Evaluation") don't appear to be executed). Are they being executed but not showing their output because they are in another thread, or what is going on?

Creating main process for a for loop

This program returns the resolution of a video, but since I need it for a large-scale project I need multiprocessing. I have tried parallel processing using a different function, but that would just run it multiple times without making it efficient. I am posting the entire code. Can you help me create a main process that uses all cores?
import os
from tkinter.filedialog import askdirectory
from moviepy.editor import VideoFileClip

if __name__ == "__main__":
    dire = askdirectory()
    d = dire[:]
    print(dire)
    death = os.listdir(dire)
    print(death)
    for i in death:  # multiprocess this loop
        dire = d
        dire += f"/{i}"
        v = VideoFileClip(dire)
        print(f"{i}: {v.size}")
This code works fine, but I need help creating a main process (one that uses all cores) for the for loop alone. Please excuse the variable names; I was angry at multiprocessing. Also, if you have tips on making the code more efficient, I would appreciate them.
You are, I suppose, assuming that every file in the directory is a video clip. I am assuming that processing a video clip is an I/O-bound "process" for which threading is appropriate. Here I have rather arbitrarily created a thread pool of 20 threads this way:
MAX_WORKERS = 20 # never more than this
N_WORKERS = min(MAX_WORKERS, len(death))
You would have to experiment with how large MAX_WORKERS can be before performance degrades. This might be a low number, not because your system cannot support lots of threads, but because concurrent access to multiple files spread across the disk may be inefficient.
import os
from tkinter.filedialog import askdirectory
from moviepy.editor import VideoFileClip
from concurrent.futures import ThreadPoolExecutor as Executor
from functools import partial

def process_video(parent_dir, file):
    v = VideoFileClip(f"{parent_dir}/{file}")
    print(f"{file}: {v.size}")

if __name__ == "__main__":
    dire = askdirectory()
    print(dire)
    death = os.listdir(dire)
    print(death)
    worker = partial(process_video, dire)
    MAX_WORKERS = 20  # never more than this
    N_WORKERS = min(MAX_WORKERS, len(death))
    with Executor(max_workers=N_WORKERS) as executor:
        results = executor.map(worker, death)  # results is a list: [None, None, ...]
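To find a reasonable pool size empirically, a small timing harness could be used. This is a hedged sketch, not part of the original answer; it assumes the worker and death variables defined above:

import time
from concurrent.futures import ThreadPoolExecutor

def time_pool_size(fn, items, n_workers):
    # run fn over items with n_workers threads and return the elapsed seconds
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        list(executor.map(fn, items))
    return time.perf_counter() - start

# example: compare a few candidate pool sizes
# for n in (4, 8, 16, 20):
#     print(n, time_pool_size(worker, death, n))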
Update
According to @Reishin, moviepy ends up executing the ffmpeg executable, and so a separate process is already created in which the work is being done. So there is no point in also using multiprocessing here.
moviepy is just a wrapper around ffmpeg, designed to edit clips and therefore working with one file at a time, so its performance here is quite poor. Invoking a new process for each of a number of files is time consuming. In the end, the need for multiple processes might be the result of choosing the wrong library.
I'd recommend using the PyAV library instead, which provides direct Python bindings for ffmpeg and good performance:
import av
import os
from tkinter.filedialog import askdirectory
import multiprocessing
from concurrent.futures import ThreadPoolExecutor as Executor

MAX_WORKERS = int(multiprocessing.cpu_count() * 1.5)

def get_video_resolution(path):
    container = None
    try:
        container = av.open(path)
        frame = next(container.decode(video=0))
        return path, f"{frame.width}x{frame.height}"
    finally:
        if container:
            container.close()

def files_to_proccess():
    video_dir = askdirectory()
    return (full_file_path for f in os.listdir(video_dir)
            if (full_file_path := os.path.join(video_dir, f)) and not os.path.isdir(full_file_path))

def main():
    for f in files_to_proccess():
        print(f"{os.path.basename(f)}: {get_video_resolution(f)[1]}")

def main_multi_threaded():
    with Executor(max_workers=MAX_WORKERS) as executor:
        for path, resolution in executor.map(get_video_resolution, files_to_proccess()):
            print(f"{os.path.basename(path)}: {resolution}")

if __name__ == "__main__":
    #main()
    main_multi_threaded()
Above are single-threaded and multi-threaded implementations, with an optimal parallelism setting (in case multithreading is absolutely required).

Progress check while running a loop with multiprocessing pool.apply_async

I have dug around everywhere but now I am stuck, and I need the help of the community. I am not a programmer; I barely use Python inside a VFX program called Houdini.
Using multiprocessing, I am running wedges of tasks in batches using another program called hython.
The task creates n folders and populates each of them with x files, so every folder ends up with the same total number of files, such as
/files/wedge_1/file1...file2
/files/wedge_2/file1...file2
The pool decides how many of these tasks it can run per batch. I am trying to implement a progress bar that runs alongside my code and checks the files every x amount of time until the total number of files equals the total number of files required.
Another possible option is that the hython task can already output an Alfred progress report, but since everything runs together I get several copies of the same frame printed in the terminal, which doesn't tell me which loop they are coming from.
Here is the code for your consideration.
# Importing all needed modules
import multiprocessing
from multiprocessing.pool import ThreadPool
import time, timeit
import os

# My Variables
hou.hipFile.load("/home/tricecold/pythonTest/HoudiniWedger/HoudiniWedger.hiplc")  # scene file
wedger = hou.parm('/obj/geo1/popnet/source_first_input/seed')  # wedged parameter
cache = hou.node('/out/cacheme')  # runs this node
total_tasks = 10  # Wedge Amount
max_number_processes = 5  # Batch Size
FileRange = abs(hou.evalParmTuple('/out/cacheme/f')[0] - hou.evalParmTuple('/out/cacheme/f')[1])
target_dir = os.path.dirname(hou.evalParm('/out/cacheme/sopoutput')) + "/"
totals = FileRange * total_tasks

def cacheHoudini(wedge=total_tasks):  # houdini task definition
    wedger.set(wedge)
    time.sleep(0.1)
    cache.render(verbose=False)

def files(wedge=total_tasks):  # figure out remaining files
    count = 0
    while (count < totals):
        num = len([name for name in os.listdir(target_dir)])
        count = count + 1
        print(count)

if __name__ == '__main__':
    pool = multiprocessing.Pool(max_number_processes)  # define pool size
    # do I need to run my progress function files here?
    for wedge in range(0, total_tasks):
        # or do I add function files here? not really sure
        pool.apply_async(cacheHoudini, args=(wedge,))  # run tasks
    pool.close()  # After all threads started we close the pool
    pool.join()   # And wait until all threads are done

Autodesk's Fbx Python and threading

I'm trying to use the fbx Python module from Autodesk, but it seems I can't thread any operation. This appears to be due to the GIL not being released. Has anyone run into the same issue, or am I doing something wrong? When I say it doesn't work, I mean the code doesn't release the thread and I'm not able to do anything else while the fbx code is running.
There isn't much code to post; I just want to know whether this has happened to anyone else who has tried.
Update:
Here is the example code; please note each fbx file is something like 2 GB.
import os
import fbx
import threading

file_dir = r'../fbxfiles'

def parse_fbx(filepath):
    print '-' * (len(filepath) + 9)
    print 'parsing:', filepath
    manager = fbx.FbxManager.Create()
    importer = fbx.FbxImporter.Create(manager, '')
    status = importer.Initialize(filepath)
    if not status:
        raise IOError()
    scene = fbx.FbxScene.Create(manager, '')
    importer.Import(scene)

    # freeup memory
    rootNode = scene.GetRootNode()

    def traverse(node):
        print node.GetName()
        for i in range(0, node.GetChildCount()):
            child = node.GetChild(i)
            traverse(child)

    # RUN
    traverse(rootNode)

    importer.Destroy()
    manager.Destroy()

files = os.listdir(file_dir)
tt = []
for file_ in files:
    filepath = os.path.join(file_dir, file_)
    t = threading.Thread(target=parse_fbx, args=(filepath,))
    tt.append(t)
    t.start()
One problem I see is with your traverse() function. It's calling itself recursively potentially a huge number of times. Another is having all the threads printing stuff at the same time. Doing that properly requires coordinating access to the shared output device (i.e. the screen). A simple way to do that is by creating and using a global threading.Lock object.
First create a global Lock to prevent threads from printing at same time:
file_dir = '../fbxfiles' # an "r" prefix needed only when path contains backslashes
print_lock = threading.Lock() # add this here
Then make a non-recursive version of traverse() that uses it:
def traverse(rootNode):
    with print_lock:
        print rootNode.GetName()
    for i in range(rootNode.GetChildCount()):
        child = rootNode.GetChild(i)
        with print_lock:
            print child.GetName()
It's not clear to me exactly where the reading of each fbxfile takes place. If it all happens as a result of the importer.Import(scene) call, then that is the only time any other threads will be given a chance to run — unless some I/O is [also] done within the traverse() function.
Since printing is most definitely a form of output, thread switching will also be able to occur when it is done. However, if all the function did was perform computations of some kind, no real thread switching would take place within it during its execution, because of the GIL.
Once you get the multithreaded reading working, you may encounter insufficient-memory issues if multiple 2 GB fbx files are being read into memory simultaneously by the various threads.
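If memory does become the limiting factor, one option (a hedged sketch, not from the original answer, written for Python 3) is to cap how many files are parsed at once instead of starting one thread per file:

import os
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_FILES = 2  # hypothetical cap; tune to how many ~2 GB scenes fit in RAM

# parse_fbx and file_dir are the names from the question
paths = [os.path.join(file_dir, f) for f in os.listdir(file_dir)]
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_FILES) as executor:
    list(executor.map(parse_fbx, paths))  # at most MAX_PARALLEL_FILES files are in memory at once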
