I have a function generate(file_path) which returns an integer index and a NumPy array. A simplified version of the generate function is as follows:
def generate(file_path):
    temp = np.load(file_path)
    # get the index from the string file_path
    idx = int(file_path.split("_")[0])
    # do some mathematical operation on temp
    result = operate(temp)
    return idx, result
I need to glob through a directory and collect the results of generate(file_path) into an HDF5 file. My serialization code is as follows:
for path in glob.glob(directory):
    idx, result = generate(path)
    hdf5_file["results"][idx, :] = result
hdf5_file.close()
I hope to write multi-threaded or multi-process code to speed up the above. How could I modify it? Many thanks!
My attempt is to modify my generate function and my "main" as follows:
def generate(file_path):
    temp = np.load(file_path)
    # get the index from the string file_path
    idx = int(file_path.split("_")[0])
    # do some mathematical operation on temp
    result = operate(temp)
    hdf5_path = "./result.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file["results"][idx, :] = result
    hdf5_file.close()
if __name__ == '__main__':
    # construct the hdf5 file
    hdf5_path = "./output.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file.create_dataset("results", [2000, 15000], np.uint8)
    hdf5_file.close()

    path_ = "./compute/*"
    p = Pool(mp.cpu_count())
    p.map(generate, glob.glob(path_))
    hdf5_file.close()
    print("finished")
However, it does not work. It throws this error:
KeyError: "Unable to open object (object 'results' doesn't exist)"
You can use a thread or process pool to execute multiple function calls concurrently. Here is an example which uses a process pool:
from concurrent.futures import ProcessPoolExecutor
from time import sleep
def generate(file_path: str) -> int:
    sleep(1.0)
    return int(file_path.split("_")[1])
def main():
    file_paths = ["path_1", "path_2", "path_3"]
    with ProcessPoolExecutor() as pool:
        results = pool.map(generate, file_paths)
        for result in results:
            # Write to the HDF5 file
            print(result)

if __name__ == "__main__":
    main()
Note that you should not write to the same HDF5 file concurrently, i.e. the file writing should not happen in the generate function.
After examining your code, I found an error in how the dataset is initialised:
You create the HDF5 file with the path "./result.hdf5" inside the generate function.
However, you never create a "results" dataset in that file (you only create one in "./output.hdf5"), and that is what causes the "object doesn't exist" error.
Please reply if you still face the same issue, and include the error message.
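For completeness, here is a minimal, untested sketch of that single-writer layout, where every HDF5 write stays in the parent process. operate(), the "_"-based index parsing, the ./compute/* glob pattern and the (2000, 15000) uint8 dataset are all taken from the question above:

import glob
import multiprocessing as mp

import h5py
import numpy as np

def generate(file_path):
    # worker: load, compute, and return (idx, result); no file writing here
    temp = np.load(file_path)
    idx = int(file_path.split("_")[0])
    result = operate(temp)  # operate() is the function from the question
    return idx, result

if __name__ == "__main__":
    with h5py.File("./output.hdf5", "w") as hdf5_file:
        hdf5_file.create_dataset("results", (2000, 15000), dtype=np.uint8)
        with mp.Pool(mp.cpu_count()) as pool:
            # only the parent process touches the HDF5 file
            for idx, result in pool.imap_unordered(generate, glob.glob("./compute/*")):
                hdf5_file["results"][idx, :] = result
    print("finished")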
I'm trying to read thousands of JSON files from a directory, process each file separately, and store the results in a dictionary. I have already written working code for sequential execution. Now I want to leverage multiprocessing to speed up the whole process.
So far, what I did:
import json
import os
from multiprocessing import Process, Manager

def read_file(file_name):
    '''
    Read the given json file and return data
    '''
    with open(file_name) as file:
        data = json.load(file)
    return data

def do_some_process(data):
    '''
    Some calculation will be done here
    and return the result
    '''
    return some_result

def process_each_file(file, result):
    file_name = file.split('.')[0]
    # reading data from file
    data = read_file('../data/{}'.format(file))
    processed_result = do_some_process(data)
    result[file_name] = processed_result

if __name__ == '__main__':
    manager = Manager()
    result = manager.dict()
    file_list = os.listdir("../data")

    all_process = [Process(target=process_each_file, args=(file, result, ))
                   for file in file_list if file.endswith(".json")]
    for p in all_process:
        p.start()
    for p in all_process:
        p.join()
    '''
    Do some further work with the 'result' variable
    '''
When I run this code, it shows OSError: [Errno 24] Too many open files.
How can I achieve my goal?
To read and process multiple JSON files using Python's multiprocessing module, you can use the following approach:
import os
import json
from multiprocessing import Pool

def process_data(data):
    return data

def process_json_file(filename):
    with open(filename, 'r') as f:
        data = json.load(f)
    # Process the data here...
    processed_data = process_data(data)
    return processed_data

if __name__ == '__main__':
    # List all the JSON files in the current directory
    json_files = [f for f in os.listdir('.') if f.endswith('.json')]

    # Create a pool of workers to process the files concurrently
    with Pool() as pool:
        # Apply the processing function to each JSON file concurrently
        results = pool.map(process_json_file, json_files)

    # Do something with the results
    for result in results:
        print(result)
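Since the question collects results into a dictionary keyed by file name, and Pool.map preserves input order, the map results can be zipped back onto the file list afterwards; a small follow-up sketch:

# Pool.map preserves input order, so results line up with json_files
result_dict = {os.path.splitext(f)[0]: r for f, r in zip(json_files, results)}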
I am trying to open up some huge json files
import json

papers0 = []
papers1 = []
papers2 = []
papers3 = []
papers4 = []
papers5 = []
papers6 = []
papers7 = []

for x in range(8):
    for line in open(f'part_00{x}.json', 'r'):
        globals()['papers%s' % x].append(json.loads(line))
However, the process above is slow. I wonder if there is some parallelization trick or something else to speed it up.
Thank you
If the JSON files are very large, then loading them (as Python dictionaries) will be I/O bound. Therefore, multithreading would be appropriate for parallelisation.
Rather than having discrete variables for each dictionary, why not have a single dictionary keyed on the significant numeric part of the filename(s)?
For example:
from concurrent.futures import ThreadPoolExecutor as TPE
from json import load as LOAD
from sys import stderr as STDERR

NFILES = 8
JDATA = {}

def get_json(n):
    try:
        with open(f'part_00{n}.json') as j:
            return n, LOAD(j)
    except Exception as e:
        print(e, file=STDERR)
        return n, None

def main():
    global JDATA  # without this, the dict built below would only be a local variable
    with TPE() as tpe:
        JDATA = dict(tpe.map(get_json, range(NFILES)))

if __name__ == '__main__':
    main()
After running this, the dictionary representation of the JSON file part_005.json (for example) would be accessible as JDATA[5].
Note that if an exception arises while accessing or parsing any of the files, the relevant dictionary value will be None.
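For example, once main() has run, a quick check of what loaded (a hypothetical snippet, just to show the single dictionary being used):

loaded = [n for n, d in JDATA.items() if d is not None]
failed = [n for n, d in JDATA.items() if d is None]
print('loaded parts:', loaded, '| failed parts:', failed)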
Below is my most recent attempt; but alas, when I print 'current_file' it's always the same (first) .zip file in my directory.
Why/how can I iterate this to get to the next file in my zip directory?
My DIRECTORY_LOCATION has 4 zip files in it.
def find_file(cls):
    listOfFiles = os.listdir(config.DIRECTORY_LOCATION)
    total_files = 0
    for entry in listOfFiles:
        total_files += 1
        # if fnmatch.fnmatch(entry, pattern):
        current_file = entry
        print(current_file)
        """Finds the excel file to process"""
        archive = ZipFile(config.DIRECTORY_LOCATION + "/" + current_file)
        for file in archive.filelist:
            if file.filename.__contains__('Contact Frog'):
                return archive.extract(file.filename, config.UNZIP_LOCATION)
    return FileNotFoundError
find_file usage:
excel_data = pandas.read_excel(self.find_file())
Update:
I just tried changing return to yield at:
yield archive.extract(file.filename, config.UNZIP_LOCATION)
and now getting the below error at my find_file line.
ValueError: Invalid file path or buffer object type: <class 'generator'>
then I altered it to use the generator object as suggested in the comments, i.e.:
generator = self.find_file(); excel_data = pandas.read_excel(generator())
and now getting this error:
generator = self.find_file(); excel_data = pandas.read_excel(generator())
TypeError: 'generator' object is not callable
Here is my /main.py if helpful
"""Start Point"""
from data.find_pending_records import FindPendingRecords
from vital.vital_entry import VitalEntry
import sys
import os
import config
import datetime
# from csv import DictWriter
if __name__ == "__main__":
try:
for file in os.listdir(config.DIRECTORY_LOCATION):
if 'VCCS' in file:
PENDING_RECORDS = FindPendingRecords().get_excel_data()
# Do operations on PENDING_RECORDS
# Reads excel to map data from excel to vital
MAP_DATA = FindPendingRecords().get_mapping_data()
# Configures Driver
VITAL_ENTRY = VitalEntry()
# Start chrome and navigate to vital website
VITAL_ENTRY.instantiate_chrome()
# Begin processing Records
VITAL_ENTRY.process_records(PENDING_RECORDS, MAP_DATA)
except:
print("exception occured")
raise
It is not tested.
def find_file(cls):
    listOfFiles = os.listdir(config.DIRECTORY_LOCATION)
    total_files = 0
    for entry in listOfFiles:
        total_files += 1
        # if fnmatch.fnmatch(entry, pattern):
        current_file = entry
        print(current_file)
        """Finds the excel file to process"""
        archive = ZipFile(config.DIRECTORY_LOCATION + "/" + current_file)
        for file in archive.filelist:
            if file.filename.__contains__('Contact Frog'):
                yield archive.extract(file.filename, config.UNZIP_LOCATION)
This is just your function rewritten with yield instead of return.
I think it should be used in the following way:
for extracted_archive in self.find_file():
    excel_data = pandas.read_excel(extracted_archive)
    # do whatever you want to do with excel_data here
self.find_file() is a generator, and should be used like an iterator (read this answer for more details).
Try to integrate the previous loop into your main script. On each iteration of the loop, it will read a different file into excel_data, so in the body of the loop you should also do whatever you need to do with the data.
Not sure what you mean by:
just one each time the script is executed
Even with yield, if you execute the script multiple times, you will always start from the beginning (and always get the first file). You should read all of the files in the same execution.
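For instance, reading all of the zip files in the same execution might look roughly like this (a hypothetical sketch; it assumes find_file is a method of FindPendingRecords, which the question only partly shows):

import pandas
from data.find_pending_records import FindPendingRecords  # import path as in the question's main.py

if __name__ == "__main__":
    finder = FindPendingRecords()
    for extracted_archive in finder.find_file():
        excel_data = pandas.read_excel(extracted_archive)
        # do whatever you need with this zip file's excel_data here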
I'm new to Python, and especially new to multiprocessing/multithreading. I have trouble reading the documentation and finding a sufficiently similar example to work from.
The part that I am trying to divide among multiple cores is italicized (marked with *** in the code below); the rest is there for context. There are three functions that are defined elsewhere in the code: NextFunction(), QualFunction(), and PrintFunction(). I don't think what they do is critical to parallelizing this code, so I did not include their definitions.
Can you help me parallelize this?
So far, I've looked at
https://docs.python.org/2/library/multiprocessing.html
Python Multiprocessing a for loop
and I've tried the equivalents for multithreading, and I've tried ipython.parallel as well.
The code is intended to pull data from a file, process it through a few functions and print it, checking for various conditions along the way.
The code looks like:
def main(arg, obj1Name, obj2Name):
    global dblen
    records = fasta(refName)
    for i, r in enumerate(records):
        s = r.fastasequence
        idnt = s.name.split()[0]
        reference[idnt] = s.seq
        names[i] = idnt
        dblen += len(s.seq)
        if taxNm == None: taxid[idnt] = GetTaxId(idnt).strip()
    records.close()
    print >> stderr, "Read it"

    # read the taxids
    if taxNm != None:
        file = open(taxNm, "r")
        for line in file:
            idnt, tax = line.strip().split()
            taxid[idnt] = tax
        file.close()

    File1 = pysam.Samfile(obj1Name, "rb")
    File2 = pysam.Samfile(obj2Name, "rb")

    ***for obj1s, obj2s in NextFunction(File1, File2):
        qobj1 = []
        qobj2 = []
        lobj1s = list(QualFunction(obj1s))
        lobj2s = list(QualFunction(obj2s))
        for obj1, ftrs1 in lobj1s:
            for obj2, ftrs2 in lobj2s:
                if (obj1.tid == obj2.tid):
                    qobj1.append((obj1, ftrs1))
                    qobj2.append((obj2, ftrs2))
        for obj, ftrs in qobj1:
            PrintFunction(obj, ftrs, "1")
        for obj, ftrs in qobj2:
            PrintFunction(obj, ftrs, "2")***

    File1.close()
    File2.close()
And is called by
if __name__ == "__main__":
etc
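No answer is recorded here, but one common direction (not from the original thread) is to keep the sequential setup as-is and hand each (obj1s, obj2s) pair from NextFunction to a process pool, while the parent stays the only process that prints. A rough, untested sketch, reusing the names from the question and assuming each pair can be processed independently and pickled:

from multiprocessing import Pool

def process_pair(pair):
    # hypothetical worker: the pairing/filtering logic from the marked loop
    obj1s, obj2s = pair
    qobj1, qobj2 = [], []
    lobj1s = list(QualFunction(obj1s))
    lobj2s = list(QualFunction(obj2s))
    for obj1, ftrs1 in lobj1s:
        for obj2, ftrs2 in lobj2s:
            if obj1.tid == obj2.tid:
                qobj1.append((obj1, ftrs1))
                qobj2.append((obj2, ftrs2))
    return qobj1, qobj2

def parallel_section(File1, File2):
    pool = Pool()
    try:
        # the parent is the single writer, so PrintFunction output is not interleaved
        for qobj1, qobj2 in pool.imap(process_pair, NextFunction(File1, File2)):
            for obj, ftrs in qobj1:
                PrintFunction(obj, ftrs, "1")
            for obj, ftrs in qobj2:
                PrintFunction(obj, ftrs, "2")
    finally:
        pool.close()
        pool.join()

Whether this helps depends on how expensive QualFunction is relative to pickling the objects between processes; pysam alignment objects may not pickle cleanly in all versions, in which case the pairs would need converting to plain tuples first.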
I'm having the following problem in Python.
I need to do some calculations in parallel whose results need to be written sequentially to a file. So I created a function that receives a multiprocessing.Queue and a file handle, does the calculation, and prints the result to the file:
import multiprocessing
from multiprocessing import Process, Queue
from mySimulation import doCalculation

# doCalculation(pars) is a function I must run for many different sets of parameters and collect the results in a file

def work(queue, fh):
    while True:
        try:
            parameter = queue.get(block=False)
            result = doCalculation(parameter)
            print >>fh, result
        except:
            break

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fh = open("foo", "w")
    workQueue = Queue()
    parList = # list of conditions for which I want to run doCalculation()
    for x in parList:
        workQueue.put(x)
    processes = [Process(target=work, args=(workQueue, fh)) for i in range(nthreads)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    fh.close()
But the file ends up empty after the script runs. I tried to change the work() function to:
def work(queue, filename):
    while True:
        try:
            fh = open(filename, "a")
            parameter = queue.get(block=False)
            result = doCalculation(parameter)
            print >>fh, result
            fh.close()
        except:
            break
and pass the filename as a parameter. Then it works as I intended. When I try to do the same thing sequentially, without multiprocessing, it also works normally.
Why didn't it work in the first version? I can't see the problem.
Also: can I guarantee that two processes won't try to write to the file simultaneously?
EDIT:
Thanks. I got it now. This is the working version:
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep
from random import uniform

def doCalculation(par):
    t = uniform(0, 2)
    sleep(t)
    return par * par  # just to simulate some calculation

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block=False)
            print "dealing with ", par, ""
            res = doCalculation(par)
            queueOut.put((par, res))
        except:
            break

def write(queue, fname):
    fhandle = open(fname, "w")
    while True:
        try:
            par, res = queue.get(block=False)
            print >>fhandle, par, res
        except:
            break
    fhandle.close()

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fname = "foo"
    workerQueue = Queue()
    writerQueue = Queue()
    parlist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    feedProc = Process(target=feed, args=(workerQueue, parlist))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue, fname))

    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()

    feedProc.join()
    for p in calcProc:
        p.join()
    writProc.join()
You really should use two queues and three separate kinds of processing.
Put stuff into Queue #1.
Get stuff out of Queue #1 and do calculations, putting stuff in Queue #2. You can have many of these, since they get from one queue and put into another queue safely.
Get stuff out of Queue #2 and write it to a file. You must have exactly 1 of these and no more. It "owns" the file, guarantees atomic access, and absolutely assures that the file is written cleanly and consistently.
If anyone is looking for a simple way to do the same, this can help you.
I don't think there are any disadvantages to doing it in this way. If there are, please let me know.
import multiprocessing
import re

def mp_worker(item):
    # Do something
    return item, count

def mp_handler():
    cpus = multiprocessing.cpu_count()
    p = multiprocessing.Pool(cpus)
    # The below 2 lines populate the list. This listX will later be accessed in parallel. This can be replaced as long as listX is passed on to the next step.
    with open('ExampleFile.txt') as f:
        listX = [line for line in (l.strip() for l in f) if line]
    with open('results.txt', 'w') as f:
        for result in p.imap(mp_worker, listX):
            # (item, count) tuples from worker
            f.write('%s: %d\n' % result)

if __name__ == '__main__':
    mp_handler()
Source: Python: Writing to a single file with queue while using multiprocessing Pool
There is a mistake in the write worker code: if block is False, the worker may never get any data. It should be as follows:
par, res = queue.get(block=True)
You can check it by adding the line
print "QSize", queueOut.qsize()
after the
queueOut.put((par, res))
With block=False you would see the queue length keep growing until it fills up, unlike with block=True where you always get "1".
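Note that with block=True and no sentinel, the calc and write workers never know when to stop. A minimal, untested sketch of one common fix (not from the original answers): push a sentinel through each queue and let the loops exit when they see it.

STOP = None  # hypothetical sentinel value

def calc(queueIn, queueOut):
    while True:
        par = queueIn.get()           # block until work (or the sentinel) arrives
        if par is STOP:
            break
        queueOut.put((par, doCalculation(par)))

def write(queue, fname):
    with open(fname, "w") as fhandle:
        while True:
            item = queue.get()        # block until a result (or the sentinel) arrives
            if item is STOP:
                break
            par, res = item
            fhandle.write("%s %s\n" % (par, res))

The main process would then put one STOP per calc process into workerQueue after feeding the real parameters, and put a single STOP into writerQueue once all calc processes have joined.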