I'm trying to read thousands of JSON files from a directory, process each file separately, and store the results in a dictionary. I already have working code for sequential execution. Now I want to leverage multiprocessing to speed up the whole process.
So far, this is what I have:
import json
import os
from multiprocessing import Process, Manager

def read_file(file_name):
    '''
    Read the given json file and return data
    '''
    with open(file_name) as file:
        data = json.load(file)
    return data

def do_some_process(data):
    '''
    Some calculation will be done here
    and return the result
    '''
    return some_result

def process_each_file(file, result):
    file_name = file.split('.')[0]
    # reading data from file
    data = read_file('../data/{}'.format(file))
    processed_result = do_some_process(data)
    result[file_name] = processed_result

if __name__ == '__main__':
    manager = Manager()
    result = manager.dict()
    file_list = os.listdir("../data")

    all_process = [Process(target=process_each_file, args=(file, result, ))
                   for file in file_list if file.endswith(".json")]

    for p in all_process:
        p.start()

    for p in all_process:
        p.join()

    '''
    Do some further work with 'result' variable
    '''
When I run this code, it raises OSError: [Errno 24] Too many open files.
How can I achieve my goal?
To read and process multiple JSON files using Python's multiprocessing module, you can use the following approach:
import os
import json
from multiprocessing import Pool

# List all the JSON files in the current directory
json_files = [f for f in os.listdir('.') if f.endswith('.json')]

def process_data(data):
    return data

def process_json_file(filename):
    with open(filename, 'r') as f:
        data = json.load(f)
    # Process the data here...
    processed_data = process_data(data)
    return processed_data

# Create a pool of workers to process the files concurrently
with Pool() as pool:
    # Apply the processing function to each JSON file concurrently
    results = pool.map(process_json_file, json_files)

# Do something with the results
for result in results:
    print(result)
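Unlike starting one Process per file, Pool caps the number of worker processes (at os.cpu_count() by default), so only a handful of files are open at any moment, which is why the Errno 24 from the original approach goes away. If you also want the results keyed by file name, as in your sequential version, a minimal sketch (assuming the same ../data layout from the question) could look like this:

import os
import json
from multiprocessing import Pool

DATA_DIR = '../data'  # directory layout assumed from the question

def do_some_process(data):
    # placeholder for your existing calculation
    return data

def process_json_file(file):
    with open(os.path.join(DATA_DIR, file)) as f:
        data = json.load(f)
    # return a (key, value) pair so the parent can build the dictionary
    return file.split('.')[0], do_some_process(data)

if __name__ == '__main__':
    json_files = [f for f in os.listdir(DATA_DIR) if f.endswith('.json')]
    with Pool() as pool:
        result = dict(pool.map(process_json_file, json_files))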
I have a function generate(file_path) which returns an integer index and a numpy array. The simplified version of the generate function is as follows:
def generate(file_path):
    temp = np.load(file_path)
    # get index from the string file_path
    idx = int(file_path.split("_")[0])
    # do some mathematical operation on temp
    result = operate(temp)
    return idx, result
I need to glob through a directory and collect the results of generate(file_path) into an HDF5 file. My serial code is as follows:
for path in glob.glob(directory):
    idx, result = generate(path)
    hdf5_file["results"][idx, :] = result
hdf5_file.close()
I would like to write multi-threaded or multi-process code to speed this up. How should I modify it? Thanks!
My attempt is to modify the generate function and my "main" as follows:
def generate(file_path):
    temp = np.load(file_path)
    # get index from the string file_path
    idx = int(file_path.split("_")[0])
    # do some mathematical operation on temp
    result = operate(temp)

    hdf5_path = "./result.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file["results"][idx, :] = result
    hdf5_file.close()

if __name__ == '__main__':
    # construct hdf5 file
    hdf5_path = "./output.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file.create_dataset("results", [2000, 15000], np.uint8)
    hdf5_file.close()

    path_ = "./compute/*"
    p = Pool(mp.cpu_count())
    p.map(generate, glob.glob(path_))
    hdf5_file.close()
    print("finished")
However, it does not work. It throws the error:
KeyError: "Unable to open object (object 'results' doesn't exist)"
You can use a thread or process pool to execute multiple function calls concurrently. Here is an example which uses a process pool:
from concurrent.futures import ProcessPoolExecutor
from time import sleep

def generate(file_path: str) -> int:
    sleep(1.0)
    return file_path.split("_")[1]

def main():
    file_paths = ["path_1", "path_2", "path_3"]

    with ProcessPoolExecutor() as pool:
        results = pool.map(generate, file_paths)
        for result in results:
            # Write to the HDF5 file
            print(result)

if __name__ == "__main__":
    main()
Note that you should not write to the same HDF5 file concurrently, i.e. the file writing should not happen in the generate function.
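As a minimal sketch of keeping all HDF5 writes in the parent process (the shape, dtype, glob pattern, and generate behaviour below are taken from the question and should be treated as assumptions):

import glob
import h5py
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def generate(file_path):
    # assumed to return (idx, result) as in the question's original version
    temp = np.load(file_path)
    idx = int(file_path.split("_")[0])
    return idx, operate(temp)  # operate() is the question's own function

def main():
    file_paths = glob.glob("./compute/*")
    with h5py.File("./output.hdf5", "w") as hdf5_file:
        hdf5_file.create_dataset("results", shape=(2000, 15000), dtype=np.uint8)
        with ProcessPoolExecutor() as pool:
            # workers only compute; the parent process is the single HDF5 writer
            for idx, result in pool.map(generate, file_paths):
                hdf5_file["results"][idx, :] = result

if __name__ == "__main__":
    main()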
After examining your code, I noticed an error in how the dataset is initialised.
You create the HDF5 file at the path "./result.hdf5" inside the generate function.
However, you never create a "results" dataset in that file, and that is what causes the "object doesn't exist" error.
Kindly reply if you still face the same issue after creating the dataset.
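A minimal sketch of that fix (the shape and dtype are borrowed from the question's output.hdf5 setup, so treat them as assumptions) is to create the dataset before assigning to it:

import h5py
import numpy as np

# the "results" dataset must exist before hdf5_file["results"][idx, :] can be assigned
with h5py.File("./result.hdf5", "w") as hdf5_file:
    hdf5_file.create_dataset("results", shape=(2000, 15000), dtype=np.uint8)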
I am trying to open up some huge JSON files:
papers0 = []
papers1 = []
papers2 = []
papers3 = []
papers4 = []
papers5 = []
papers6 = []
papers7 = []

for x in range(8):
    for line in open(f'part_00{x}.json', 'r'):
        globals()['papers%s' % x].append(json.loads(line))
However, the process above is slow. I wonder if there is some parallelization trick or something else to speed it up.
Thank you
If the JSON files are very large, then loading them (as Python dictionaries) will be I/O bound. Therefore, multithreading would be appropriate for parallelisation.
Rather than having discrete variables for each dictionary, why not have a single dictionary keyed on the significant numeric part of the filename(s)?
For example:
from concurrent.futures import ThreadPoolExecutor as TPE
from json import load as LOAD
from sys import stderr as STDERR

NFILES = 8
JDATA = {}

def get_json(n):
    try:
        with open(f'part_00{n}.json') as j:
            return n, LOAD(j)
    except Exception as e:
        print(e, file=STDERR)
        return n, None

def main():
    global JDATA  # populate the module-level dictionary, not a local one
    with TPE() as tpe:
        JDATA = dict(tpe.map(get_json, range(NFILES)))

if __name__ == '__main__':
    main()
After running this, the dictionary representation of the JSON file part_005.json (for example) would be accessible as JDATA[5].
Note that if an exception arises while accessing or processing any of the files, the relevant dictionary value will be None.
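If it turns out that the json parsing (rather than disk I/O) dominates the run time, the same pattern can be tried with processes instead of threads. This is only a sketch of that alternative, assuming the same part_00{n}.json naming:

import json
from concurrent.futures import ProcessPoolExecutor

NFILES = 8

def get_json(n):
    # each worker process parses one file; the resulting dict is pickled back to the parent
    with open(f'part_00{n}.json') as j:
        return n, json.load(j)

if __name__ == '__main__':
    with ProcessPoolExecutor() as ppe:
        JDATA = dict(ppe.map(get_json, range(NFILES)))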
I have a huge zip file with a large number of files. Parsing all these files takes a lot of time, so I thought about using multiprocessing to speed things up. I am not sure how to approach it, as a zipfile.ZipFile in Python is not an iterable.
I am aware that I could extract all contents from the zip file and then iterate over the list of filenames, however, I'd prefer to not have to keep extra free space to hold the extracted data and would like to operate on the ZipFile.
Maybe there is another solution to this problem, so I am open to suggestions.
EDIT:
The code below technically works, but the problem is that each time the get_content() function runs, the large zip file seems to be opened again, ultimately taking as long as 15 seconds to reach each file.
import multiprocessing
from zipfile import ZipFile
from multiprocessing import Pool
import time

path = 'zipfile.zip'

def get_file_list(zip_path):
    with ZipFile(zip_path, 'r') as zipObj:
        listOfiles = zipObj.namelist()
    return listOfiles

def get_content(file_name):
    start_time = time.time()
    with ZipFile(path, 'r') as zipObject:
        with zipObject.open(file_name) as file:
            content = file.read()
    end_time = time.time()
    print(f"It took {end_time - start_time} to open this file")
    return content

def parse_files():
    file_list = get_file_list(path)
    with Pool(multiprocessing.cpu_count()) as p:
        contents = p.map(get_content, file_list)
    print(contents)

parse_files()
import os
import shutil
from zipfile import ZipFile
from multiprocessing import Pool

def create_dummy_zip():
    os.mkdir("dummy")
    for i in range(100):
        with open(f"dummy/{i}.file", "w") as f:
            f.write(f"Content: {i}")
    shutil.make_archive("dummy", 'zip', "dummy")
    shutil.rmtree('dummy')

def delete_dummy():
    try:
        os.remove("dummy.zip")
        shutil.rmtree('dummy')
    except:
        pass

def get_file_list(zip_path):
    with ZipFile(zip_path, 'r') as zipObj:
        listOfiles = zipObj.namelist()
    return listOfiles

def get_content(file_name):
    with ZipFile("dummy.zip", 'r') as zipObject:
        with zipObject.open(file_name) as file:
            content = file.read()
    return content

if __name__ == '__main__':
    try:
        create_dummy_zip()
        file_list = get_file_list("dummy.zip")
        with Pool(5) as p:
            contents = p.map(get_content, file_list)
        print(contents)
        delete_dummy()
    except:
        delete_dummy()
I am a beginner in Python and am trying to add a few lines of code to convert JSON to CSV and back to JSON. I have thousands of files (size 300 MB) to be converted and processed. With the current program (using 1 CPU), I am not able to use the 16 CPUs of the server and need suggestions to fine-tune the program for multiprocessing. Below is my code with Python 3.7.
import json
import csv
import os

os.chdir('/stagingData/Scripts/test')

for JsonFile in os.listdir(os.getcwd()):
    PartialFileName = JsonFile.split('.')[0]

    j = 1
    with open(PartialFileName + ".csv", 'w', newline='') as Output_File:
        with open(JsonFile) as fileHandle:
            i = 1
            for Line in fileHandle:
                try:
                    data = json.loads(Line, parse_float=str)
                except:
                    print("Can't load line {}".format(i))

                if i == 1:
                    header = data.keys()
                    output = csv.writer(Output_File)
                    output.writerow(header)  # Writes header row
                    i += 1

                output.writerow(data.values())  # writes values row
                j += 1
I would appreciate suggestions on the multiprocessing logic.
If you have a single big file that you want to process more effectively, I suggest the following:
Split file into chunks
Create a process to process each chunk
(if necessary) merge the processed chunks back into a single file
Something like this:
import csv
import json
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

source_big_file = Path('/path/to/file')

def chunk_file_by_line(source_filepath: Path, chunk_size: int = 10_000):
    intermediate_file_handlers = {}
    last_chunk_filepath = None
    with source_filepath.open('r', encoding='utf8') as big:
        for line_number, line in enumerate(big):
            group = line_number - (line_number % chunk_size)
            chunk_filename = f'{source_filepath.stem}.g{group}{source_filepath.suffix}'
            chunk_filepath = source_filepath.parent / chunk_filename
            if chunk_filepath not in intermediate_file_handlers:
                file_handler = chunk_filepath.open('w', encoding='utf8')
                intermediate_file_handlers[chunk_filepath] = file_handler
                if last_chunk_filepath:
                    # the previous chunk is complete: close it and hand it to the caller
                    last_file_handler = intermediate_file_handlers[last_chunk_filepath]
                    last_file_handler.close()
                    yield last_chunk_filepath
            else:
                file_handler = intermediate_file_handlers[chunk_filepath]
            file_handler.write(line)
            last_chunk_filepath = chunk_filepath
    # output the last chunk
    if last_chunk_filepath:
        intermediate_file_handlers[last_chunk_filepath].close()
        yield last_chunk_filepath

def json_to_csv(json_filepath: Path) -> Path:
    csv_filename = f'{json_filepath.stem}.csv'
    csv_filepath = json_filepath.parent / csv_filename
    with csv_filepath.open('w', encoding='utf8', newline='') as csv_out, json_filepath.open('r', encoding='utf8') as json_in:
        dwriter = None
        for json_line in json_in:
            data = json.loads(json_line)
            if dwriter is None:
                # create header record from the first line's keys
                dwriter = csv.DictWriter(csv_out, fieldnames=list(data.keys()))
                dwriter.writeheader()
            dwriter.writerow(data)
    return csv_filepath

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        futures = []
        for chunk_filepath in chunk_file_by_line(source_big_file):
            future = pool.submit(json_to_csv, chunk_filepath)
            futures.append(future)

        # wait for all to finish
        for future in futures:
            csv_filepath = future.result(timeout=None)  # waits until complete
            print(f'conversion complete> csv filepath: {csv_filepath}')
Since you have many files, the simplest multiprocessing example from the documentation should work for you. https://docs.python.org/3.4/library/multiprocessing.html?highlight=process
def f(JsonFile):
    # open input and output files and convert
    ...

with Pool(16) as p:
    p.map(f, os.listdir(os.getcwd()))
You could also try replacing listdir with os.scandir(), which doesn't have to return all directory entries before starting.
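As a rough sketch of that suggestion (the body of f and the .json filter are placeholders, not part of the answer above), os.scandir can be combined with imap_unordered so entries are fed to the workers as they are discovered:

import os
from multiprocessing import Pool

def f(path):
    # placeholder: open the input file at 'path', convert it, write the output
    return path

if __name__ == '__main__':
    with Pool(16) as p:
        # os.scandir yields entries as it goes instead of building the whole list first
        entries = (e.path for e in os.scandir(os.getcwd()) if e.name.endswith('.json'))
        for result in p.imap_unordered(f, entries):
            pass  # or collect/log the results here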
I have to perform some processing on each line of a file, and I have many files present in an input directory. I have to dump the response I get from processing each line (from multiple input files) into a single result file.
I have decided on this flow: dump all the input file names into a queue and fork 3-4 workers, where each worker picks a unique file, reads its content, and after processing dumps the response into a writer queue. There will be a separate process which reads this queue and writes the results to an output file.
I have come up with this code:
def write_to_csv(queue):
    file_path = os.path.join(os.getcwd(), 'test_dir', "writer.csv")
    ofile = open(file_path, "w")
    job_writer = csv.writer(ofile, delimiter='\a')
    while 1:
        line = queue.get()
        if line == 'kill':
            print("Kill Signal received")
            break
        if line:
            job_writer.writerow([str(line).strip()])
    ofile.close()

def worker_main(file_queue, writer_queue):
    print(os.getpid(), "working")
    while not file_queue.empty():
        file_name = file_queue.get(True)
        # somewhere in process_file writer_queue.put(line_resp) is called
        # for every line in file_name
        process_file(file_name, writer_queue)

if __name__ == "__main__":
    file_queue = multiprocessing.Queue()
    output_queue = multiprocessing.Queue()

    writer_pool = multiprocessing.Pool(1, write_to_csv, (output_queue,))

    cwd = os.getcwd()
    test_dir = 'test_dir'
    file_list = os.listdir(os.path.join(cwd, test_dir))
    for file_name in file_list:
        file_queue.put(file_name)

    reader_pool = multiprocessing.Pool(3, worker_main, (file_queue, output_queue))
    reader_pool.close()
    reader_pool.join()

    output_queue.put("kill")
    print("Finished execution")
The code is working fine, but I wonder if it is possible to do the same thing with a single multiprocessing Pool instead of the reader_pool and writer_pool in the code above.
You can do that with apply_async. Also, don't set an initializer (write_to_csv or worker_main in your case) when creating the Pool object, or every worker would run that task by default.
manager = multiprocessing.Manager()
file_queue = manager.Queue()    # use Manager queues: a plain multiprocessing.Queue cannot be pickled into pool tasks
output_queue = manager.Queue()

cwd = os.getcwd()
test_dir = 'test_dir'
file_list = os.listdir(os.path.join(cwd, test_dir))
for file_name in file_list:
    file_queue.put(file_name)

pool = multiprocessing.Pool(4)
writer = pool.apply_async(write_to_csv, (output_queue,))
readers = [pool.apply_async(worker_main, (file_queue, output_queue)) for i in range(3)]

for reader in readers:
    reader.wait()          # wait until every file has been processed
output_queue.put("kill")   # then tell the single writer task to stop

pool.close()
pool.join()