I want to make a command that searches, in parallel, a given number of files for a given word, where...
ppatternsearch [-p n] word {files}
ppatternsearch is the command name
-p is an option that defines the level of parallelization
n is the number of processes/threads that the -p option will
create for the word search
word is the word I'll be searching for
files is, as you can imagine, the files I'll be searching through.
I want to do this in 2 ways - one with processes and another with threads. In the end, the parent process/main thread returns the number of lines where it found the word that was being searched.
Thing is, I've developed some code already and I've hit a wall. I have no idea where to go from here.
import argparse, os, sys, time
num_lines_with_pattern = []
def pattern_finder(pattern, file_searched):
counter = 0
with open(file_searched, 'r') as ficheiro_being_read:
for line in ficheiro_being_read:
if pattern in line:
                print(line)
counter += 1
num_lines_with_pattern.append(counter)
parser = argparse.ArgumentParser()
parser.add_argument('-p', type=int, default=1, help='Defines command parallelization.')
args = parser.parse_args()
The next step is to import threading or multiprocessing and launch pattern_finder the appropriate number of times.
You'll probably also want to look into queue.Queue so your results aren't printed jumbled up.
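For the threaded variant, here is a minimal sketch of that idea, using one thread per file for brevity (your -p option would cap the thread count) and a queue.Queue to collect the per-file counts:
import queue
import threading

def pattern_finder(pattern, file_searched, results):
    # Count matching lines and report the result through the shared queue.
    counter = 0
    with open(file_searched, 'r') as fh:
        for line in fh:
            if pattern in line:
                counter += 1
    results.put((file_searched, counter))

def search_with_threads(word, files):
    results = queue.Queue()
    threads = [threading.Thread(target=pattern_finder, args=(word, f, results))
               for f in files]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = 0
    while not results.empty():
        filename, count = results.get()
        print(filename, count)
        total += count
    return total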
The problem may be I/O-bound, in which case introducing multiple threads/processes won't make your hard disk work any faster, though that should be easy to check. To run pattern_finder() using a process pool:
#!/usr/bin/env python
from functools import partial
from multiprocessing import Pool, cpu_count
def pattern_finder(pattern, file_searched):
...
return file_searched, number_of_lines_with_pattern
if __name__ == "__main__":
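    # n, word and files are assumed to come from the parsed command-line arguments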
pool = Pool(n or cpu_count() + 1)
search = partial(pattern_finder, word)
for filename, count in pool.imap_unordered(search, files):
print("Found {count} lines in {filename}".format(**vars()))
Related
I have to run about 200-300 python scripts daily having different arguments, for example:
python scripts/foo.py -a bla -b blabla ..
python scripts/foo.py -a lol -b lolol ..
....
Let's say I already have all these arguments for every script present inside a list, and I would like to execute them concurrently such that the CPU is always busy. How can I do so?
My current solution:
script for running multiple processes:
import subprocess

workers = 15
for i in range(0,len(jobs),workers):
job_string = ""
for j in range(i,min(i+workers,len(jobs))):
job_string += jobs[j] + " & "
if len(job_string) == 0:
continue
print(job_string)
val = subprocess.check_call("./scripts/parallelProcessing.sh '%s'" % job_string,shell=True)
scripts/parallelProcessing.sh (used in the above script)
#!/bin/bash
echo $1
echo "running scripts in parallel"
eval $1
wait
echo "done processing"
Drawback:
I am executing K processes in a batch, then another K, and so on. But CPU-core utilization is much lower than it could be, because the number of running processes keeps shrinking within a batch, until eventually only one process is running at a time. As a result, the time taken to complete all the processes is significant.
One simple solution is to ensure K processes are always running, i.e. once a process completes, a new one is scheduled in its place. But I am not sure how to implement such a solution.
Expectations:
As the task is not very latency-sensitive, I am looking for a simple solution that keeps the CPU mostly busy.
Note: Any two of those processes can execute simultaneously without any concurrency issues. The host where these processes run has Python 2.
This is a technique I developed for calling many external programs using subprocess.Popen. In this example, I'm calling convert to make JPEG images from DICOM files.
In short: it uses manageprocs to keep checking a list of running subprocesses. If one has finished, it is removed and a new one is started, as long as unprocessed files remain. After that, the remaining processes are watched until they have all finished.
from datetime import datetime
from functools import partial
import argparse
import logging
import os
import subprocess as sp
import sys
import time
def main():
"""
Entry point for dicom2jpg.
"""
args = setup()
if not args.fn:
logging.error("no files to process")
sys.exit(1)
if args.quality != 80:
logging.info(f"quality set to {args.quality}")
if args.level:
logging.info("applying level correction.")
start_partial = partial(start_conversion, quality=args.quality, level=args.level)
starttime = str(datetime.now())[:-7]
logging.info(f"started at {starttime}.")
# List of subprocesses
procs = []
# Do not launch more processes concurrently than your CPU has cores.
# That will only lead to the processes fighting over CPU resources.
maxprocs = os.cpu_count()
    # Launch and manage subprocesses for all files.
for path in args.fn:
while len(procs) == maxprocs:
manageprocs(procs)
procs.append(start_partial(path))
# Wait for all subprocesses to finish.
while len(procs) > 0:
manageprocs(procs)
endtime = str(datetime.now())[:-7]
logging.info(f"completed at {endtime}.")
def start_conversion(filename, quality, level):
"""
Convert a DICOM file to a JPEG file.
Removing the blank areas from the Philips detector.
Arguments:
filename: name of the file to convert.
quality: JPEG quality to apply
        level: Boolean to indicate whether level adjustment should be done.
Returns:
Tuple of (input filename, output filename, subprocess.Popen)
"""
outname = filename.strip() + ".jpg"
size = "1574x2048"
args = [
"convert",
filename,
"-units",
"PixelsPerInch",
"-density",
"300",
"-depth",
"8",
"-crop",
size + "+232+0",
"-page",
size + "+0+0",
"-auto-gamma",
"-quality",
str(quality),
]
if level:
args += ["-level", "-35%,70%,0.5"]
args.append(outname)
proc = sp.Popen(args, stdout=sp.DEVNULL, stderr=sp.DEVNULL)
return (filename, outname, proc)
def manageprocs(proclist):
"""Check a list of subprocesses for processes that have ended and
remove them from the list.
Arguments:
proclist: List of tuples. The last item in the tuple must be
a subprocess.Popen object.
"""
    # Iterate over a copy, since removing items while iterating skips entries.
    for item in proclist[:]:
        filename, outname, proc = item
        if proc.poll() is not None:
            logging.info(f"conversion of “{filename}” to “{outname}” finished.")
            proclist.remove(item)
# since manageprocs is called from a loop, keep CPU usage down.
time.sleep(0.05)
if __name__ == "__main__":
main()
I've left out setup(); it's using argparse to deal with command-line arguments.
Here the thing to be processed is just a list of file names.
But it could also be (in your case) a list of tuples of script names and arguments.
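For your case, a rough sketch of the same keep-N-running idea applied directly to a list of command strings (the jobs list below is a stand-in for your real list, and shlex.split turns each string into an argument list so no shell is needed):
import multiprocessing
import shlex
import subprocess
import time

jobs = [
    "python scripts/foo.py -a bla -b blabla",
    "python scripts/foo.py -a lol -b lolol",
]  # stand-in; use your real list of command strings

maxprocs = multiprocessing.cpu_count()  # keep this many processes running
running = []

def reap(running):
    # Drop finished processes from the list and yield the CPU briefly.
    running[:] = [p for p in running if p.poll() is None]
    time.sleep(0.05)

for cmd in jobs:
    while len(running) >= maxprocs:
        reap(running)
    running.append(subprocess.Popen(shlex.split(cmd)))

while running:
    reap(running)
The final while loop plays the same role as the wait in your shell script, and the code runs on both Python 2 and Python 3.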
I have this Python program, and now I want to use multiprocessing or multithreading for it. Please help me achieve this.
import os, sys, codecs, random, time ,subprocess
years = ["2016","2017","2018","2019","2020"]
rf = open('URL.txt', 'r')
lines = rf.readlines()
rf.close()
list = []
for element in lines:
list.append(element.strip())
files=["myfile1.txt","myfile2.txt"]
for url in list:
    for year in years:
        for file in files:
            os.system('python myfile.py -u' + url + ' -y' + year + ' -f' + file)
            time.sleep(5)
I want to finish one url in one process or one thread.
You would add:
from multiprocessing import Pool
You would separate your work into a function:
def myfunc(url, year, file):
    os.system('python myfile.py -u' + url + ' -y' + year + ' -f' + file)
And then in place of the loop, you would make a list of argument tuples and send it to a pool using starmap:
pool = Pool(4) # <== number of processes to run in parallel
args = [(url, year, file) for url in lst for year in years for file in files]
pool.starmap(myfunc, args)
(Here I changed list to lst -- please also change the lines in your code that use list to lst instead, because list is a builtin.)
Update - just noticed "I want to finish one url in one process or one thread."
You can do a more coarse-grained division by putting some of the looping into the payload function:
def myfunc(url):
for year in years:
for file in files:
            os.system('python myfile.py -u' + url + ' -y' + year + ' -f' + file)
and then call it with just the URL. As there is only one argument, you don't need starmap any more; plain map with the list of URLs will work:
pool.map(myfunc, lst)
However, there is not much reason to divide it up in this way if the years and files can be done independently in parallel, because the coarse-grained division might mean that the job takes longer to complete (some processes are idle at the end while one is still working on a URL that is slow for some reason). I would still suggest the first approach.
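If you also want to see per-task results as each one finishes with the fine-grained approach, a small sketch (run_one is a hypothetical wrapper that takes a single tuple and swaps os.system for a subprocess argument list; lst, years and files are assumed to be defined as above):
from multiprocessing import Pool
import subprocess

def run_one(job):
    # job is a (url, year, file) tuple; return it together with the exit code.
    url, year, file = job
    rc = subprocess.call(['python', 'myfile.py', '-u', url, '-y', year, '-f', file])
    return job, rc

if __name__ == '__main__':
    args = [(url, year, file) for url in lst for year in years for file in files]
    with Pool(4) as pool:
        for job, rc in pool.imap_unordered(run_one, args):
            print(job, 'exited with', rc)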
I have more than 10,000 C files, and I need to pass each one of them to some application foo.exe for processing, generating a disassembly file for each C file; i.e. at the end of this process I will have 10,000 lst/output files. I am assuming that this process is not I/O-bound (despite the fact that foo.exe writes a new lst file to disk for each C file). Is that a correct assumption?
My task is
To implement a parallel Python program that gets the job done in minimum time, by utilizing all CPU cores.
My approach
I have implemented this program and it works for me; the pseudocode is listed below:
Iterate over all C files and push the absolute path of each one into a global list, files_list.
Calculate the number of logical CPU cores (with the psutil module); this will be the maximum number of threads to dispatch later. Let's assume it is 8 threads.
Generate a new list, workers_list (a list of lists), which contains the index intervals (L_index, R_index) produced by dividing files_list into 8 parts. E.g. if I have 800 C files, workers_list will look like this: workers_list = [[0,99],[100,199],...,[700,799]].
Dispatch 8 worker threads, each of which handles a single entry in workers_list. Each thread opens a process (subprocess.call(...)) and calls foo.exe on the current C file.
posting the relevant code below:
The relevant Code
import multiprocessing
import subprocess
import psutil
import threading
import os
class LstGenerator(object):
def __init__(self):
self.elfdumpExePath = r"C:\.....\elfdump.exe" #abs path to the executable
self.output_dir = r"C:\.....\out" #abs path to where i want the lst files to be generated
self.files = [] # assuming that i have all the files in this list (abs path for each .C file)
def slice(self, files):
files_len = len(files)
j = psutil.cpu_count()
slice_step = files_len / j
workers_list = []
lhs = 0
rhs = slice_step
while j:
workers_list.append(files[lhs:rhs])
lhs += slice_step
rhs += slice_step
j -= 1
if j == 1: # last iteration
workers_list.append(files[lhs:files_len])
break
for each in workers_list: #for debug only
print len(each)
return workers_list
def disassemble(self, objectfiles):
for each_object in objectfiles:
cmd = "{elfdump} -T {object} -o {lst}".format(
elfdump=self.elfdumpExePath,
object=each_object,
                lst=os.path.join(self.output_dir, os.path.basename(each_object).rstrip('o') + 'lst'))
p = subprocess.call(cmd, shell=True)
def execute(self):
        class FuncThread(threading.Thread):
            def __init__(self, target, *args):
                threading.Thread.__init__(self)
                self._target = target
                self._args = args

            def run(self):
                # Thread's default run() ignores _target set this way on Python 2.
                self._target(*self._args)
workers = []
for portion in self.slice(self.files):
workers.append(FuncThread(self.disassemble, portion))
# dispatch the workers
for worker in workers:
worker.start()
# wait or join the previous dispatched workers
for worker in workers:
worker.join()
if __name__ == '__main__':
lst_gen = LstGenerator()
lst_gen.execute()
My Questions
Can I do this in a more efficient way?
Does Python have a standard library module that can get the job done and reduce my code/logic complexity? Maybe multiprocessing.Pool?
I am running on Windows, with Python 2.7.
Thanks.
Yes, multiprocessing.Pool can help with this. That also does the work of sharding the list of inputs for each CPU. Here is python code (untested) that should get you on your way.
import multiprocessing
import os
import subprocess
def convert(objectfile):
    elfdumpExePath = r"C:\.....\elfdump.exe"
    output_dir = r"C:\.....\out"
cmd = "{elfdump} -T {obj} -o {lst}".format(
elfdump=elfdumpExePath,
obj=objectfile,
lst=os.path.join(output_dir, os.path.basename(objectfile).rstrip('o') + 'lst'))
    return subprocess.call(cmd, shell=True)
if __name__ == '__main__':
    files = ["foo.c", "foo1.c", "foo2.c"]
    p = multiprocessing.Pool()
    outputs = p.map(convert, files)
Keep in mind that your worker function (convert above) must accept one argument. So if you need to pass in an input path and output path, that must be done as a single argument, and your list of filenames will have to be transformed into a list of pairs, where each pair is input and output.
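For example, a minimal sketch of that single-argument pattern (convert_pair and the example paths are purely illustrative):
import multiprocessing

def convert_pair(in_out):
    # Unpack the single (input_path, output_path) argument inside the worker.
    input_path, output_path = in_out
    # ... build and run the command using both paths here ...
    return output_path

if __name__ == '__main__':
    pairs = [("foo.c", "out/foo.lst"), ("foo1.c", "out/foo1.lst")]
    p = multiprocessing.Pool()
    results = p.map(convert_pair, pairs)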
The answer above is for Python 2.7, but keep in mind that Python 2 has reached its end of life. In Python 3, you can use multiprocessing.Pool in a with statement so that it cleans up on its own.
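A minimal Python 3 sketch of that with-statement usage:
import multiprocessing

def convert(objectfile):
    # ... run elfdump on objectfile, as in the function above ...
    return objectfile

if __name__ == '__main__':
    files = ["foo.c", "foo1.c", "foo2.c"]
    with multiprocessing.Pool() as pool:  # the pool is terminated automatically on exit
        outputs = pool.map(convert, files)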
Posting an answer to my own question after struggling with it for a while, and noticing that I can import concurrent.futures in Python 2.x. This approach reduces code complexity to a minimum and even improves the execution time. Unlike my first thoughts, these processes are more I/O-bound than CPU-bound; still, the time efficiency I got was convenient enough to run the program with multiple processes.
concurrent.futures
The concurrent.futures module provides a high-level interface for asynchronously executing callables.
The asynchronous execution can be performed with threads, using
ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor.
Both implement the same interface, which is defined by the abstract
Executor class.
class concurrent.futures.Executor
An abstract class that provides
methods to execute calls asynchronously. It should not be used
directly, but through its concrete subclasses.
submit(fn, *args, **kwargs)
Schedules the callable, fn, to be executed as fn(*args, **kwargs) and
returns a Future object representing the execution of the callable.
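A tiny usage sketch of submit with a plain function (square is just a placeholder):
import concurrent.futures

def square(x):
    return x * x

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Pass the callable and its arguments separately; do not call it yourself.
        future = executor.submit(square, 3)
        print(future.result())  # prints 9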
For further reading, please follow the link below:
parallel tasks with concurrent.futures
import subprocess
import os
import concurrent.futures
class LstGenerator(object):
def __init__(self):
self.elfdumpExePath = r"C:\.....\elfdump.exe" #abs path to the executable
self.output_dir = r"C:\.....\out" #abs path to where i want the lst files to be generated
self.files = [] # assuming that i have all the files in this list (abs path for each .C file)
def disassemble(self, objectfile):
cmd = "{elfdump} -T {object} -o {lst}".format(
elfdump=self.elfdumpExePath,
object=objectfile,
            lst=os.path.join(self.output_dir, os.path.basename(objectfile).rstrip('o') + 'lst'))
return subprocess.call(cmd, shell=True,stdout=subprocess.PIPE)
def execute(self):
with concurrent.futures.ProcessPoolExecutor() as executor:
            results = [executor.submit(self.disassemble, file) for file in self.files]
if __name__ == '__main__':
lst_gen = LstGenerator()
lst_gen.execute()
I want to read and process a file by using multiprocessing with low memory consumption, high throughput (sentence/s), and - especially important - ordered results.
I was wondering whether we can use linecache's getline for this purpose. The following code reads a file, hopefully in parallel, and executes some function on the lines that are gathered in the subprocess. Here I opted for running some tokenisation on the files with spaCy.
import datetime
from multiprocessing import Pool, current_process
from os import cpu_count
from pathlib import Path
from functools import partial
from linecache import getline
import spacy
class Processor:
def __init__(self, spacy_model='en_core_web_sm', batch_size=2048):
self.nlp = spacy.load(spacy_model, disable=['ner', 'textcat'])
self.batch_size = batch_size
    @staticmethod
def get_n_lines(pfin):
with pfin.open(encoding='utf-8') as fhin:
for line_idx, _ in enumerate(fhin, 1):
pass
return line_idx
def process_file(self, fin):
pfin = Path(fin).resolve()
total_lines = self.get_n_lines(pfin)
start_time = datetime.datetime.now()
procfunc = partial(self.process_batch, pfin)
with Pool(cpu_count() - 1) as pool:
            # map the starting indexes to the processes
for _ in pool.imap(procfunc, range(0, total_lines+1, self.batch_size)):
pass
print('done', (datetime.datetime.now() - start_time).total_seconds())
def process_batch(self, pfin, start):
lines = [getline(str(pfin), i) for i in range(start, start+self.batch_size)]
# Parse text with spaCy
docs = list(self.nlp.pipe(lines))
# Chop into sentences
spacy_sents = [str(sent) for doc in docs for sent in doc.sents]
return str(current_process()), spacy_sents
if __name__ == '__main__':
fn = r'data/train.tok.low.en'
proc = Processor()
proc.process_file(fn)
I found that on my work laptop, running with 3 active cores on a file of 140K sentences, the duration is 261 seconds. When running with a single core (n_workers=1), the processing time is 431 seconds. I am not sure how to interpret this difference, but I guess it comes down to the question: does linecache.getline allow for concurrent reading? Parallel execution is faster, but considering that getline expects a file name (rather than a file object), I expect it to have to open the file every time and as such block access for other processes. Is this assumption correct, given that parallel execution still seems much faster? Is there a better way to read files fast and in parallel whilst also keeping the results ordered?
You don't need linecache, and it doesn't help.
First, you don't need any special tricks to read the same file simultaneously from multiple processes. You can just do it. It'll work.
Second, linecache loads a whole file immediately as soon as a single line is requested from that file. You're not splitting the work of reading the file at all. You're doing more I/O than if you just had the parent process read the file and let the workers inherit the data. If you're getting any speedup from multiprocessing, it's probably due to parallelizing the NLP work, not the file reading.
Third, linecache is designed to support the traceback module, and it does a lot of stuff that doesn't make sense for a general-purpose file reading module, including searching the import path for a file if it doesn't find the file in the current directory.
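As a rough sketch of that simpler alternative, reading in the parent process and handing ordered batches to the workers (the batching helper and the stand-in process_batch below are illustrative, not your exact spaCy pipeline):
from itertools import islice
from multiprocessing import Pool

def batches(iterable, size):
    # Yield successive lists of `size` lines.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def process_batch(lines):
    # Stand-in for the per-batch NLP work.
    return [line.strip() for line in lines]

if __name__ == '__main__':
    with open('data/train.tok.low.en', encoding='utf-8') as fhin, Pool(3) as pool:
        # imap (not imap_unordered) returns results in batch order.
        for result in pool.imap(process_batch, batches(fhin, 2048)):
            pass  # collect or write out results here, already in order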
I want to run several python script at the same time using concurrent.futures.
The serial version of my code goes and looks for a specific Python file in each folder and executes it.
import os
import re
import time
import subprocess
from glob import glob
import concurrent.futures as cf

FileList = []
start_dir = os.getcwd()
pattern = "Read.py"
for dir,_,_ in os.walk(start_dir):
    FileList.extend(glob(os.path.join(dir, pattern)))
i=0
for file in FileList:
dir=os.path.dirname((file))
dirname1 = os.path.basename(dir)
print(dirname1)
i=i+1
Str='python '+ file
print(Str)
    completed_process = subprocess.run(Str)
For the parallel version of my code:
def Python_callback(future):
print(future.run_type, future.jid)
return "One Folder finished executing"
def Python_execute():
from concurrent.futures import ProcessPoolExecutor as Pool
args = FileList
pool = Pool(max_workers=1)
future = pool.submit(subprocess.call, args, shell=1)
future.run_type = "run_type"
future.jid = FileList
future.add_done_callback(Python_callback)
print("Python executed")
if __name__ == '__main__':
import subprocess
Python_execute()
The issue is that I am not sure how to pass each element of FileList to a separate CPU.
Thanks for your help in advance.
The smallest change is to use submit once for each element, instead of once for the whole list:
futures = []
for file in FileList:
future = pool.submit(subprocess.call, file, shell=1)
future.blah blah
futures.append(future)
The futures list is only necessary if you want to do something with the futures—wait for them to finish, check their return values, etc.
Meanwhile, you're explicitly creating the pool with max_workers=1. Not surprisingly, this means you'll only get 1 worker child process, so it'll end up waiting for one subprocess to finish before grabbing the next one. If you want to actually run them concurrently, remove that max_workers and let it default to one per core (or pass max_workers=8 or some other number that's not 1, if you have a good reason to override the default).
While we're at it, there are a lot of ways to simplify what you're doing:
Do you really need multiprocessing here? If you need to communicate with each subprocess, that can be painful to do in a single thread—but threads, or maybe asyncio, will work just as well as processes here.
More to the point, it doesn't look like you actually do need anything but launch the process and wait for it to finish, and that can be done in simple, synchronous code.
Why are you building a string and using shell=1 instead of just passing a list and not using the shell? Using the shell unnecessarily creates overhead, safety problems, and debugging annoyances.
You really don't need the jid on each future—it's just the list of all of your invocation strings, which can't be useful. What might be more useful is some kind of identifier, or the subprocess return code, or… probably lots of other things, but they're all things that could be done by reading the return value of subprocess.call or a simple wrapper.
You really don't need the callback either. If you just gather all the futures in a list and as_completed it, you can print the results as they show up more simply.
If you do both of the above, you've got nothing left but a pool.submit inside the loop—which means you can replace the entire loop with pool.map (see the sketch after this list).
You rarely need, or want, to mix os.walk and glob. When you actually have a glob pattern, apply fnmatch over the files list from os.walk. But here, you're just looking for a specific filename in each dir, so really, all you need to filter on is file == 'Read.py'.
You're not using the i in your loop. But if you do need it, it's better to do for i, file in enumerate(FileList): than to do for file in FileList: and manually increment an i.
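Putting those suggestions together, a rough sketch of the simplified version (the result handling and helper names are just illustrative choices):
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_script(path):
    # Run one Read.py and return its path and exit code.
    completed = subprocess.run(['python', path])  # argument list, no shell needed
    return path, completed.returncode

def find_scripts(start_dir):
    # Yield the full path of every Read.py below start_dir.
    for dirpath, _, filenames in os.walk(start_dir):
        if 'Read.py' in filenames:
            yield os.path.join(dirpath, 'Read.py')

if __name__ == '__main__':
    scripts = list(find_scripts(os.getcwd()))
    with ProcessPoolExecutor() as pool:  # defaults to one worker per core
        for path, returncode in pool.map(run_script, scripts):
            print(path, 'finished with exit code', returncode)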