I have written a Python script that takes a directory as input and lists all files in that directory; it then decompresses each of these files and does some extra processing on it. The code is very straightforward: it gets a list of files from os.listdir(directory), and for each file in the list it decompresses it and then executes a bunch of different system calls on it. My question is: is there any way to make the loop iterations run in parallel, or otherwise make the code run faster by leveraging the cores on the CPU, and what might that be? Below is some demo code to depict what I am aiming to optimize:
files = os.listdir(directory)
for file in files:
    os.system("tar -xvf %s" % file)
    os.system("Some other sys call")
    os.system("One more sys call")
EDIT: The sys calls are the only way possible, since I am using certain custom-made CLI utilities that expect decompressed files as input, hence the decompression.
Note that os.system() is synchronous, i.e. Python waits for the task to complete before going to the next line.
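For comparison, a portable way to spread that loop over the CPU cores is the standard-library multiprocessing module. This is only a sketch: the directory and the extra commands are placeholders, and it assumes each file can be processed independently.

import os
import subprocess
from multiprocessing import Pool

def process_file(path):
    # decompress one archive, then run the follow-up commands on it
    subprocess.call(["tar", "-xvf", path])
    # subprocess.call(["some-custom-utility", path])  # placeholder for the other sys calls
    return path

if __name__ == '__main__':
    directory = "."                                    # placeholder for the real directory
    files = [os.path.join(directory, f) for f in os.listdir(directory)]
    pool = Pool()                                      # one worker per CPU core by default
    pool.map(process_file, files)                      # blocks until every file has been handled
    pool.close()
    pool.join()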
Here is a simplification of what I do, on Windows 7 and Python 2.6.6.
You should be able to easily modify this for your needs.
1. create and run a process for each task I want to run in parallel
2. after they are all started I wait for them to complete
import win32api, win32con, win32process, win32event
def CreateMyProcess2(cmd):
    ''' create a process with no window that runs a task, with or without arguments '''
    si = win32process.STARTUPINFO()
    info = win32process.CreateProcess(
        None,                                  # AppName
        cmd,                                   # Command line
        None,                                  # Process Security
        None,                                  # Thread Security
        0,                                     # inherit Handles?
        win32process.NORMAL_PRIORITY_CLASS,
        None,                                  # New environment
        None,                                  # Current directory
        si)                                    # startup info
    # info is the tuple (hProcess, hThread, processId, threadId)
    return info[0]
if __name__ == '__main__':
    handles = []

    cmd = 'cmd /c "dir/w"'
    handle = CreateMyProcess2(cmd)
    handles.append(handle)

    cmd = 'cmd /c "path"'
    handle = CreateMyProcess2(cmd)
    handles.append(handle)

    rc = win32event.WaitForMultipleObjects(
        handles,   # sequence of objects (here: handles) to wait for
        1,         # wait for them all (use 0 to wait for just one)
        15000)     # timeout in milliseconds

    print rc       # rc = 0 if all tasks have completed before the timeout
I need to wait until the user is done editing a text file in the default graphical application (Debian and derivatives).
If I use xdg-open with subprocess.call (which usually waits), it continues right after opening the file in the editor, I assume because xdg-open itself starts the editor asynchronously.
I finally got more or less working code by retrieving the launcher for the text/plain MIME type and using that with Gio.DesktopAppInfo.new to get the command for the editor. This works, provided the editor is not already open; in that case the process ends while the editor is still open.
I have added solutions checking the process.pid and polling the process. Both end in an indefinite loop.
It seems such an overly complicated way to wait for the process to finish. So, is there a more robust way to do this?
#! /usr/bin/env python3
import subprocess
from gi.repository import Gio
import os
from time import sleep
import sys

def open_launcher(my_file):
    print('launcher open')
    app = subprocess.check_output(['xdg-mime', 'query', 'default', 'text/plain']).decode('utf-8').strip()
    print(app)
    launcher = Gio.DesktopAppInfo.new(app).get_commandline().split()[0]
    print(launcher)
    subprocess.call([launcher, my_file])
    print('launcher close')

def open_xdg(my_file):
    print('xdg open')
    subprocess.call(['xdg-open', my_file])
    print('xdg close')

def check_pid(pid):
    """ Check for the existence of a unix pid. """
    try:
        os.kill(int(pid), 0)
    except OSError:
        return False
    else:
        return True

def open_pid(my_file):
    pid = subprocess.Popen(['xdg-open', my_file]).pid
    while check_pid(pid):
        print(pid)
        sleep(1)

def open_poll(my_file):
    proc = subprocess.Popen(['xdg-open', my_file])
    while not proc.poll():
        print(proc.poll())
        sleep(1)

def open_ps(my_file):
    subprocess.call(['xdg-open', my_file])
    pid = subprocess.check_output("ps -o pid,cmd -e | grep %s | head -n 1 | awk '{print $1}'" % my_file, shell=True).decode('utf-8')
    while check_pid(pid):
        print(pid)
        sleep(1)

def open_popen(my_file):
    print('popen open')
    process = subprocess.Popen(['xdg-open', my_file])
    process.wait()
    print(process.returncode)
    print('popen close')

# This will end the open_xdg function while the editor is open.
# However, if the editor is already open, open_launcher will finish while the editor is still open.
#open_launcher('test.txt')

# This solution opens the file but the process terminates before the editor is closed.
#open_xdg('test.txt')

# This will loop indefinitely, printing the pid even after closing the editor.
# If you check for the pid in another terminal you see the pid with: [xdg-open] <defunct>.
#open_pid('test.txt')

# This will print None once, after which 0 is printed indefinitely: the subprocess ends immediately.
#open_poll('test.txt')

# This seems to work, even when the editor is already open.
# However, I had to use head -n 1 to prevent returning multiple pids.
#open_ps('test.txt')

# Like open_xdg, this opens the file but the process terminates before the editor is closed.
open_popen('test.txt')
Instead of trying to poll a PID, you can simply wait for the child process to terminate, using subprocess.Popen.wait():
Wait for child process to terminate. Set and return returncode attribute.
Additionally, the first part of get_commandline() is not guaranteed to be the launcher. The string returned by get_commandline() will match the Exec key spec, meaning the %u, %U, %f, and %F field codes in the returned string should be replaced with the correct values.
Here is some example code, based on your xdg-mime approach:
#!/usr/bin/env python3
import subprocess
import shlex
from gi.repository import Gio

my_file = 'test.txt'

# Get the default application
app = subprocess.check_output(['xdg-mime', 'query', 'default', 'text/plain']).decode('utf-8').strip()

# Get the command to run
command = Gio.DesktopAppInfo.new(app).get_commandline()

# Handle file paths with spaces by quoting the file path
my_file_quoted = "'" + my_file + "'"

# Replace field codes with the file path
# Also handle the special case of the atom editor
command = command.replace('%u', my_file_quoted)\
                 .replace('%U', my_file_quoted)\
                 .replace('%f', my_file_quoted)\
                 .replace('%F', my_file_quoted if app != 'atom.desktop' else '--wait ' + my_file_quoted)

# Run the default application, and wait for it to terminate
process = subprocess.Popen(
    shlex.split(command), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
process.wait()

# Now the exit code of the text editor process is available as process.returncode
I have a few remarks on my sample code.
Remark 1: Handling spaces in file paths
It is important that the file path to be opened is wrapped in quotes; otherwise shlex.split(command) will split the filename on spaces.
Remark 2: Escaped % characters
The Exec key spec states
Literal percentage characters must be escaped as %%.
My use of replace() could therefore also replace % characters that were escaped. For simplicity, I chose to ignore this edge case.
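If that edge case mattered, one way to handle it is sketched below; fill_field_codes is a hypothetical helper, not something the example above uses.

import re

def fill_field_codes(command, path):
    """Replace %u/%U/%f/%F with the quoted path while leaving escaped %% untouched."""
    quoted = "'" + path + "'"
    def repl(match):
        return '%' if match.group(0) == '%%' else quoted
    # %% is matched first, so it is never mistaken for a field code
    return re.sub(r'%%|%[uUfF]', repl, command)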
Remark 3: atom
I assumed the desired behaviour is to always wait until the graphical editor has closed. In the case of the atom text editor, it will terminate immediately on launching the window unless the --wait option is provided. For this reason, I conditionally add the --wait option if the default editor is atom.
Remark 4: subprocess.DEVNULL
subprocess.DEVNULL is new in Python 3.3. For older Python versions, the following can be used instead:
import os

with open(os.devnull, 'w') as DEVNULL:
    process = subprocess.Popen(
        shlex.split(command), stdout=DEVNULL, stderr=DEVNULL)
Testing
I tested my example code above on Ubuntu with the GNOME desktop environment. I tested with the following graphical text editors: gedit, mousepad, and atom.
I'm trying to execute several batch scripts in a Python loop. However, the batch scripts in question contain cmd /K and thus do not "terminate" (for lack of a better word). Therefore Python calls the first script and waits forever...
Here is a pseudo-code that gives an idea of what I am trying to do:
import subprocess
params = [MYSCRIPT, os.curdir]
for level in range(10):
    subprocess.call(params)
My question is: "Is there a pythonic solution to get the console command back and resume looping?"
EDIT: I am now aware that it is possible to launch child processes and continue without waiting for them to return, using
Popen(params, shell=False, stdin=None, stdout=None, stderr=None, close_fds=True)
However, this would launch my entire loop almost simultaneously. Is there a way to wait for the child process to execute its task and return once it hits the cmd /K and becomes idle?
There is no built-in way, but it's something you can implement.
The examples use bash since I don't have access to a Windows machine, but it should be similar for cmd /K.
It might be as easy as:
import subprocess

# start the process in the background
process = subprocess.Popen(
    ['bash', '-i'],
    stdout=subprocess.PIPE,
    stdin=subprocess.PIPE
)

# will throw IO error if process terminates by this time for some reason
process.stdin.write("exit\n")
process.wait()
This will send an exit command to the shell, which should be processed just as the script terminates, causing it to exit (effectively cancelling out the /K).
Here's a more elaborate answer in case you need a solution that checks for some output:
import subprocess

# start the process in the background
process = subprocess.Popen(
    ['bash', '-i'],
    stdout=subprocess.PIPE,
    stdin=subprocess.PIPE
)

# Wait for the process to terminate
process.poll()
while process.returncode is None:
    # read the output from the process
    # note that we can't use readlines() here as it would block waiting for the process
    lines = [x for x in process.stdout.read(5).split("\n") if x]
    if lines:
        # if you want the output to show, you'll have to print it yourself
        print(lines)
        # check for some condition in the output
        if any((":" in x for x in lines)):
            # terminate the process
            process.kill()
            # alternatively could send it some input to have it terminate
            # process.stdin.write("exit\n")
    # Check for new return code
    process.poll()
The complication here is reading the output: if you try to read more than is available, the read will block.
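One way around that, as a rough sketch, is to move the reads onto a background thread that feeds a queue, so the main loop never blocks; the names and the ':' condition below are only illustrative.

import subprocess
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def start_reader(proc):
    """Read the process output on a separate thread and hand lines over via a queue."""
    q = queue.Queue()
    def pump():
        for line in iter(proc.stdout.readline, b''):
            q.put(line)
    t = threading.Thread(target=pump)
    t.daemon = True
    t.start()
    return q

process = subprocess.Popen(['bash', '-i'],
                           stdout=subprocess.PIPE,
                           stdin=subprocess.PIPE)
output = start_reader(process)
while process.poll() is None:
    try:
        line = output.get(timeout=0.5)   # waits at most half a second, never indefinitely
    except queue.Empty:
        continue
    print(line)
    if b':' in line:                     # stand-in for whatever condition you care about
        process.kill()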
Here is something I use where I start a bunch of processes (2 in this example) and wait for them at the end before the program terminates. It can be modified to wait for specific processes at different times (see comments). In this example one process prints out the %path% and the other prints the directory contents.
import win32api, win32con, win32process, win32event
def CreateMyProcess2(cmd):
    ''' create a process with no window that runs sdelete with a bunch of arguments '''
    si = win32process.STARTUPINFO()
    info = win32process.CreateProcess(
        None,                                  # AppName
        cmd,                                   # Command line
        None,                                  # Process Security
        None,                                  # Thread Security
        0,                                     # inherit Handles?
        win32process.NORMAL_PRIORITY_CLASS,
        None,                                  # New environment
        None,                                  # Current directory
        si)                                    # startup info
    # info is the tuple (hProcess, hThread, processId, threadId)
    return info[0]

if __name__ == '__main__':
    handles = []

    cmd = 'cmd /c "dir/w"'
    handle = CreateMyProcess2(cmd)
    handles.append(handle)

    cmd = 'cmd /c "path"'
    handle = CreateMyProcess2(cmd)
    handles.append(handle)

    rc = win32event.WaitForMultipleObjects(
        handles,   # sequence of objects (here: handles) to wait for
        1,         # wait for them all (use 0 to wait for just one)
        15000)     # timeout in milliseconds

    print rc       # rc = 0 if all tasks have completed before the timeout
Approximate Output (edited for clarity):
PATH=C:\Users\Philip\algs4\java\bin;C:\Users\Philip\bin;C:\Users\Philip\mksnt\ etc......
Volume in drive C has no label.
Volume Serial Number is 4CA0-FEAD
Directory of C:\Users\Philip\AppData\Local\Temp
[.]
[..]
FXSAPIDebugLogFile.txt
etc....
1 File(s) 0 bytes
3 Dir(s) 305,473,040,384 bytes free
0 <-- value of "rc"
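For reference, the same start-everything-then-wait pattern can be sketched with only the standard-library subprocess module, in case pywin32 is not available; this is an untested equivalent, not the code I actually run.

import subprocess

# start both commands without waiting for either one
procs = [
    subprocess.Popen(['cmd', '/c', 'dir', '/w']),
    subprocess.Popen(['cmd', '/c', 'path']),
]

# wait for all of them, much like WaitForMultipleObjects with "wait for all" set
for p in procs:
    p.wait()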
I have a small git_cloner script that clones my company's projects correctly. In all my scripts, I use a func that hasn't given me problems yet:
def call_sp(command, **arg_list):
    p = subprocess.Popen(command, shell=True, **arg_list)
    p.communicate()
At the end of this individual script, I use:
call_sp('cd {}'.format(branch_path))
This line does not change the terminal I ran my script in to the directory branch_path; in fact, even worse, it annoyingly asks me for my password! When I remove the cd yadayada line above, my script no longer demands a password before completing. I wonder:
1. How are these python scripts actually running? Since the cd command had no permanent effect, I assume the script spawns its own private subprocess separate from what the terminal is doing, then kills itself when the script finishes?
2. Based on how #1 works, how do I force my scripts to change the terminal directory permanently to save me time?
3. Why would merely running a change directory ask me for my password?
The full script is below, thank you,
Cody
#!/usr/bin/env python
import subprocess
import sys
import time
from os.path import expanduser

home_path = expanduser('~')
project_path = home_path + '/projects'
d = {'cwd': ''}

#calling from script:
# ./git_cloner.py projectname branchname
# to make a new branch say ./git_cloner.py project branchname
#interactive:
# just run ./git_cloner.py

if len(sys.argv) == 3:
    project = sys.argv[1]
    branch = sys.argv[2]
if len(sys.argv) < 3:
    while True:
        project = raw_input('Enter a project name (i.e., mainworkproject):\n')
        if not project:
            continue
        break
    while True:
        branch = raw_input('Enter a branch name (i.e., dev):\n')
        if not branch:
            continue
        break

def call_sp(command, **arg_list):
    p = subprocess.Popen(command, shell=True, **arg_list)
    p.communicate()

print "making new branch \"%s\" in project \"%s\"" % (branch, project)
this_project_path = '%s/%s' % (project_path, project)
branch_path = '%s/%s' % (this_project_path, branch)

d['cwd'] = project_path
call_sp('mkdir %s' % branch, **d)
d['cwd'] = branch_path
git_string = 'git clone ssh://git#git/home/git/repos/{}.git {}'.format(project, d['cwd'])

#see what you're doing to maybe need to cancel
print '\n'
print "{}\n\n".format(git_string)

call_sp(git_string)
time.sleep(30)
call_sp('git checkout dev', **d)
time.sleep(2)
call_sp('git checkout -b {}'.format(branch), **d)
time.sleep(5)

#...then I make some symlinks, which work
call_sp('cp {}/dev/settings.py {}/settings.py'.format(project_path, branch_path))
print 'dont forget "git push -u origin {}"'.format(branch)
call_sp('cd {}'.format(branch_path))
You cannot use Popen to change the current directory of the running script. Popen will create a new process with its own environment. If you do a cd within that, it will change directory for that running process, which will then immediately exit.
If you want to change the directory for the script you could use os.chdir(path), then all subsequent commands in the script will be run from that new path.
Child processes cannot alter the environment of their parents though, so you can't have a process you create change the environment of the caller.
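A minimal sketch of that approach; the path here is a placeholder, and note that it still won't touch the terminal the script was launched from.

import os
import subprocess

branch_path = os.path.expanduser('~/projects/someproject/dev')   # placeholder for your real path

os.chdir(branch_path)                        # changes the cwd of this Python process only
subprocess.call('git status', shell=True)    # now runs inside branch_path

# or leave the script's own cwd alone and set it per call, as you already do with d['cwd']
subprocess.call('git status', shell=True, cwd=branch_path)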
I'm trying to use Python to control numerous compiled executables, but I'm running into timing issues! I need to be able to run two executables simultaneously, and also be able to 'wait' until an executable has finished prior to starting another one. Also, some of them require superuser. Here is what I have so far:
import os
sudoPassword = "PASS"
executable1 = "EXEC1"
executable2 = "EXEC2"
executable3 = "EXEC3"
filename = "~/Desktop/folder/"
commandA = filename+executable1
commandB = filename+executable2
commandC = filename+executable3
os.system('echo %s | sudo %s; %s' % (sudoPassword, commandA, commandB))
os.system('echo %s | sudo %s' % (sudoPassword, commandC))
print ('DONESIES')
Assuming that os.system() waits for the executable to finish prior to moving to the next line, this should run EXEC1 and EXEC2 simultaneously, and after they finish run EXEC3...
But it doesn't. Actually, it even prints 'DONESIES' in the shell before commandB even finishes...
Please help!
Your script will still execute all 3 commands sequentially. In shell scripts, the semicolon is just a way to put more than one command on one line. It doesn't do anything special, it just runs them one after the other.
If you want to run external programs in parallel from a Python program, use the subprocess module: https://docs.python.org/2/library/subprocess.html
Use subprocess.Popen to run multiple commands in the background. If you just want the programs' stdout/stderr to go to the screen (or get dumped completely), it's pretty straightforward. If you want to process the output of the commands, that gets more complicated, and you'd likely start a thread per command.
But here is the case that matches your example:
import os
import subprocess as subp

sudoPassword = "PASS"
executable1 = "EXEC1"
executable2 = "EXEC2"
executable3 = "EXEC3"
filename = os.path.expanduser("~/Desktop/folder/")
commandA = os.path.join(filename, executable1)
commandB = os.path.join(filename, executable2)
commandC = os.path.join(filename, executable3)

def sudo_cmd(cmd, password):
    p = subp.Popen(['sudo', '-S'] + cmd, stdin=subp.PIPE)
    p.stdin.write(password + '\n')
    p.stdin.close()
    return p

# run A and B in parallel
exec_A = sudo_cmd([commandA], sudoPassword)
exec_B = sudo_cmd([commandB], sudoPassword)

# wait for A before starting C
exec_A.wait()
exec_C = sudo_cmd([commandC], sudoPassword)

# wait for the stragglers
exec_B.wait()
exec_C.wait()

print('DONESIES')
My company works in visual effects, and we set up internal shot playback via a browser for our clients. For that, we need to upload the video file to an FTP server.
I want to convert an image sequence to mp4 and upload this file directly after the rendering finishes.
For that I use:
one command prompt to convert
one command prompt to get an md5 hash
one for uploading the file
I already achieved that on my local computer, where I just chained os.system('command').
After noticing that the program freezes for a long time with longer image sequences, I changed the script to spawn a thread that runs the os.system chain.
But on the Render Farm Server this script does not actually work.
The RenderFarm Server runs Python 2.5
Here are some code examples:
class CopraUpload(threading.Thread):

    # initializing Thread
    # via super constructor
    def __init__(self):
        threading.Thread.__init__(self)

    # return the size of the
    # created mp4 file
    #
    # #return: the file size in bytes
    def _getFileSize(self):

    # creates a random id for organising
    # the server upload used as flag
    #
    # #return: a hash
    def _getHash(self):
        self.fileLoc = str(self.outputfileName + '.mp4')
        self.fileLoc = os.path.normpath(self.fileLoc)
        return str(os.path.getsize(self.fileLoc))

    # integrates the "missing" data for the xml file
    # generated post render from the mp4 file
    def _setPreviewDataToXML(self):
        self.xmlFile = str(self.outputfileName + '_copraUpload.xml')
        self.xmlFile = os.path.normpath(self.xmlFile)
        ett = ET.parse(self.xmlFile)
        root = ett.getroot()
        for child in root.getiterator('preview_size'):
            child.text = self._getFileSize()
        for child in root.getiterator('preview_md5hash'):
            child.text = self._getHash()
        ett.write(self.xmlFile)

    # create a connection to an ftp server
    # and copy the mp4 file and the xml file
    # to the server
    def _uploadToCopra(self):
        os.system(self.uploadCommand)
        #process = Popen(self.uploadCommand)

    # the main function of the program,
    # called via start() on a Thread object
    def run(self):
        # the command which will be sent to the command shell
        # for further adjustments see the ffmpeg help with ffmpeg.exe -h
        FinalCommand = self.ffmpegLocation + " -r " + self.framerate + " -i " + self.inputPath + " -an -strict experimental -s hd720 -vcodec libx264 -preset slow -profile:v baseline -level 31 -refs 1 -maxrate 6M -bufsize 10M -vb 6M -threads 0 -g 8 -r " + self.framerate + " " + self.outputfileName + ".mp4 -y"
        FinalCommandList = FinalCommand.split(" ")

        # calling the program
        print "Start ffmpeg conversion"
        outInfo = os.path.normpath("C:\\Users\\sarender\\Desktop\\stdoutFFMPEG.txt")
        outError = os.path.normpath("C:\\Users\\sarender\\Desktop\\stderrFFMPEG.txt")
        stdoutFile = open(outInfo, "w")
        stderrFile = open(outError, "w")
        handle = subp.check_all(FinalCommandList, stdout=stdoutFile, stderr=stderrFile)
        handle.communicate()
        stdoutFile.close()
        stderrFile.close()
        print "Conversion from ffmpeg done"

        # fill the xml file with the missing data
        # - preview file size
        # - preview md5hash
        self._setPreviewDataToXML()
        self._uploadToCopra()
        print "---------------------------------->FINISHED------------------------------------------------------>"

# Creates a callable Thread for the Copra Upload.
# start() calls the run method, which starts the uploading.
and the main start:
if "$(RenderSet.writenode)" == "PREVIEW":
print "---------------------------------->Initializing Script------------------------------------------------------>"
process = CopraUpload()
process.start()
What happens:
The script starts after the rendering, and ffmpeg converts the image sequence and creates an mp4. But it stops after that: it does not print "Conversion from ffmpeg done", the script just stops.
It actually should create the Thread, convert with ffmpeg, and wait until that finishes. Afterwards it should write some data into an xml file and upload both files to the server.
Am I missing something? Is subprocess within a thread not the way to go? But I need a Thread because I must not block the render management server.
My guess is that the command fails and throws an exception in handle.communicate(). Make sure you catch all exceptions and log them somewhere because threads have no way to pass on exceptions.
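For example, a bare-bones sketch of wrapping the body of run(); the log path is just an example, any file the render user can write to works.

import os
import threading
import traceback

class CopraUpload(threading.Thread):
    def run(self):
        try:
            pass  # ... the existing ffmpeg / xml / upload steps go here ...
        except Exception:
            # threads swallow exceptions, so dump the traceback somewhere you can read it
            log = open(os.path.normpath("C:\\Users\\sarender\\Desktop\\copra_error.txt"), "w")
            traceback.print_exc(file=log)
            log.close()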
Also, you shouldn't use FinalCommand.split(" ") - if the file names contain spaces, this will fail in odd (and unpredictable) ways. Create a list instead and pass that list to subp:
FinalCommand = [
    self.ffmpegLocation,
    "-r", self.framerate,
    "-i", self.inputPath,
    "-an",
    "-strict", "experimental",
    ...
]
Much more readable as well.
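The list can then go straight to Popen with no extra splitting; FinalCommand, stdoutFile and stderrFile below are the names from your script:

import subprocess as subp

handle = subp.Popen(FinalCommand, stdout=stdoutFile, stderr=stderrFile)
handle.communicate()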