Speed up importing huge json files - python

I am trying to open some huge JSON files:
import json

papers0 = []
papers1 = []
papers2 = []
papers3 = []
papers4 = []
papers5 = []
papers6 = []
papers7 = []

for x in range(8):
    for line in open(f'part_00{x}.json', 'r'):
        globals()['papers%s' % x].append(json.loads(line))
However, the process above is slow. I wonder if there is some parallelization trick or other approach to speed it up.
Thank you

If the JSON files are very large then loading them (as Python dictionaries) will be I/O bound. Therefore, multithreading would be appropriate for parallelisation.
Rather than having discrete variables for each dictionary, why not have a single dictionary keyed on the significant numeric part of the filename(s)?
For example:
from concurrent.futures import ThreadPoolExecutor as TPE
from json import load as LOAD
from sys import stderr as STDERR

NFILES = 8
JDATA = {}

def get_json(n):
    try:
        with open(f'part_00{n}.json') as j:
            return n, LOAD(j)
    except Exception as e:
        print(e, file=STDERR)
        return n, None

def main():
    global JDATA
    with TPE() as tpe:
        JDATA = dict(tpe.map(get_json, range(NFILES)))

if __name__ == '__main__':
    main()
After running this, the dictionary representation of the JSON file part_005.json (for example) would be accessible as JDATA[5].
Note that if an exception arises while opening or parsing any of the files, the relevant dictionary value will be None.
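For example, failed loads can be picked out afterwards; a small sketch reusing the JDATA and STDERR names from the code above:
# Sketch: list the indices whose JSON could not be loaded (value is None)
# and keep only the successfully loaded files. Assumes main() has filled JDATA.
failed = [n for n, data in JDATA.items() if data is None]
for n in failed:
    print(f'part_00{n}.json could not be loaded', file=STDERR)
loaded = {n: data for n, data in JDATA.items() if data is not None}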

Related

How to parallel the following code using Multiprocessing in Python

I have a function generate(file_path) which returns an integer index and a numpy array. A simplified version of the generate function is as follows:
import numpy as np

def generate(file_path):
    temp = np.load(file_path)
    # get index from the string file_path
    idx = int(file_path.split("_")[0])
    # do some mathematical operation on temp
    result = operate(temp)
    return idx, result
I need to glob through a directory and collect the results of generate(file_path) into an hdf5 file. My serial code is as follows:
for path in glob.glob(directory):
    idx, result = generate(path)
    hdf5_file["results"][idx, :] = result
hdf5_file.close()
I would like to write multi-threaded or multi-process code to speed this up. How could I modify it? Many thanks!
My attempt is to modify my generate function and my "main" as follows:
import glob
import multiprocessing as mp
from multiprocessing import Pool

import h5py
import numpy as np

def generate(file_path):
    temp = np.load(file_path)
    # get index from the string file_path
    idx = int(file_path.split("_")[0])
    # do some mathematical operation on temp
    result = operate(temp)
    hdf5_path = "./result.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file["results"][idx, :] = result
    hdf5_file.close()

if __name__ == '__main__':
    # construct hdf5 file
    hdf5_path = "./output.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file.create_dataset("results", [2000, 15000], np.uint8)
    hdf5_file.close()
    path_ = "./compute/*"
    p = Pool(mp.cpu_count())
    p.map(generate, glob.glob(path_))
    hdf5_file.close()
    print("finished")
However, it does not work. It throws this error:
KeyError: "Unable to open object (object 'results' doesn't exist)"
You can use a thread or process pool to execute multiple function calls concurrently. Here is an example which uses a process pool:
from concurrent.futures import ProcessPoolExecutor
from time import sleep

def generate(file_path: str) -> int:
    sleep(1.0)
    return int(file_path.split("_")[1])

def main():
    file_paths = ["path_1", "path_2", "path_3"]
    with ProcessPoolExecutor() as pool:
        results = pool.map(generate, file_paths)
        for result in results:
            # Write to the HDF5 file
            print(result)

if __name__ == "__main__":
    main()
Note that you should not write to the same HDF5 file concurrently, i.e. the file writing should not happen in the generate function.
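For example, a minimal sketch of that pattern applied to the question's setup (it assumes generate() is the compute-only version that returns (idx, result), that the "results" dataset already exists in output.hdf5, and the helper name write_results is just for illustration):
from concurrent.futures import ProcessPoolExecutor
import glob
import h5py

def write_results():
    # Workers only compute; the main process is the single writer.
    paths = glob.glob("./compute/*")
    with ProcessPoolExecutor() as pool, h5py.File("./output.hdf5", "a") as hdf5_file:
        for idx, result in pool.map(generate, paths):
            hdf5_file["results"][idx, :] = result

if __name__ == "__main__":
    write_results()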
After examining your code, I found some errors in how the dataset is initialised.
You create the HDF5 file with the path "./result.hdf5" inside the generate function, but I think you neglected to create a "results" dataset in that file, which is what causes the "object 'results' doesn't exist" error.
Please reply if you still face the same issue, including the error message.
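For example, a small sketch of creating the dataset before writing (shape and dtype taken from your create_dataset call; note that the other answer's point about concurrent writes still applies):
import h5py
import numpy as np

# Open in append mode so an existing file is not truncated, and create the
# "results" dataset once if it is not there yet.
with h5py.File("./result.hdf5", "a") as hdf5_file:
    if "results" not in hdf5_file:
        hdf5_file.create_dataset("results", shape=(2000, 15000), dtype=np.uint8)
    hdf5_file["results"][0, :] = np.zeros(15000, dtype=np.uint8)  # example write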

Using ray to speed up checking json files

I have over a million JSON files, and I'm trying to find the fastest way to check, first, whether they load, and then whether they contain key_A, key_B, or neither. I thought I might be able to use ray to speed up this process, but opening a file seems to fail with ray.
As a simplification, here's my attempt at just checking whether or not a file will load:
import json
import ray

ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.good = 0
        self.bad = 0
    def increment(self, j):
        try:
            with open(j, 'r') as f:
                l = json.load(f)
                self.good += 1
        except:  # all files end up here
            self.bad += 1
    def read(self):
        return (self.good, self.bad)

counter = Counter.remote()
[counter.increment.remote(j) for j in json_paths]
futures = counter.read.remote()
print(ray.get(futures))
But I end up with (0, len(json_paths)) as a result.
For reference, the slightly more complicated actual end goal I have is to check:
new, old, bad = 0, 0, 0
try:
    with open(json_path, 'r') as f:
        l = json.load(f)
    ann = l['frames']['FrameLabel']['annotations']
    first_object = ann[0][0]
except:
    bad += 1
    return
if 'object_category' in first_object:
    new += 1
elif 'category' in first_object:
    old += 1
else:
    bad += 1
I'd recommend not using Python for this at all, but rather a tool like jq.
A command like
jq -c "[input_filename, (.frames.FrameLabel.annotations[0][0]|[.object_category,.category])]" good.json bad.json old.json
outputs
["good.json",["good",null]]
["bad.json",[null,null]]
["old.json",[null,"good"]]
for each of your categories of data, which will be significantly easier to parse.
You can use e.g. the GNU find tool, or if you're feeling fancy, parallel, to come up with the command lines to run.
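The jq output above is one JSON array per line, so a short Python tally over it (a hypothetical follow-up step, piping jq's output into the script on stdin) could look like:
import json
import sys

# Reads jq's output (one JSON array per line, as shown above) from stdin and
# counts the three categories from the question.
new = old = bad = 0
for line in sys.stdin:
    try:
        fname, (object_category, category) = json.loads(line)
    except ValueError:
        bad += 1
        continue
    if object_category is not None:
        new += 1
    elif category is not None:
        old += 1
    else:
        bad += 1
print(new, old, bad)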
You could use Python's built-in concurrent.futures module instead to perform your task; ray might not be best suited for it. Example:
from concurrent.futures import ThreadPoolExecutor

numThreads = 10

def checkFile(path):
    return True  # parse and check here

with ThreadPoolExecutor(max_workers=numThreads) as pool:
    good = sum(pool.map(checkFile, json_paths))
    bad = len(json_paths) - good
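For the asker's actual check, the per-file function could be fleshed out along these lines (a sketch using the key names from the question; json_paths is assumed to be the list of file paths, and the name classify is illustrative):
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def classify(path):
    # Returns 'new', 'old' or 'bad' for one file, mirroring the question's logic.
    try:
        with open(path, 'r') as f:
            first_object = json.load(f)['frames']['FrameLabel']['annotations'][0][0]
    except Exception:
        return 'bad'
    if 'object_category' in first_object:
        return 'new'
    if 'category' in first_object:
        return 'old'
    return 'bad'

with ThreadPoolExecutor(max_workers=10) as pool:
    counts = Counter(pool.map(classify, json_paths))
print(counts['new'], counts['old'], counts['bad'])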

Appending to list during multiprocessing

I want to check if some element is already present in some list, while I am constantly updating that list. I am using multiprocessing to achieve this, but currently my list gets reinitialised every time. Any suggestions on how I could append to the list without it being reinitialised would be very helpful. Thanks in advance.
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

from PIL import Image
import hashlib
import os
import urllib.request

image_hash_list = []
url_list = []
some_dict = dict()

def getImages(val):
    # import pdb; pdb.set_trace()
    # Download images
    f = open('image_files.txt', 'a')
    try:
        url = val  # preprocess the url from the input val
        local = url.split('/')[-1]  # filename generation from global variables and rand stuffs...
        urllib.request.urlretrieve(url, local)
        md5hash = hashlib.md5(Image.open(local).tobytes())
        image_hash = md5hash.hexdigest()
        global image_hash_list
        global url_list
        if image_hash not in image_hash_list:
            image_hash_list.append(image_hash)
            some_dict[image_hash] = 0
            os.remove(local)
            f.write(url + '\n')
            return 1
        else:
            os.remove(local)
            print(some_dict.keys())
    except Exception as e:
        return 0

# if __name__ == '__main__':
files = "Identity.txt"
lst = list(open(files))
lst = [l.replace("\n", "") for l in lst]
pool = mp.Pool(processes=12)
res = pool.map(getImages, lst)
print("tempw")
Here, image_hash_list gets reinitialised in every worker process.
Use a Manager to create shared lists and dicts (and other types too): Sharing state between processes.
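A minimal sketch of that approach (the check-then-append is not atomic, but the shared objects do survive across the pool's worker processes; the names and toy data are just for illustration):
import multiprocessing as mp

def check_and_add(args):
    # The proxies returned by Manager() can be passed to pool workers;
    # appends and key assignments are visible in every process.
    value, shared_list, shared_dict = args
    if value not in shared_list:
        shared_list.append(value)
        shared_dict[value] = 0
        return 1
    return 0

if __name__ == '__main__':
    with mp.Manager() as manager:
        shared_list = manager.list()
        shared_dict = manager.dict()
        items = ['a', 'b', 'a', 'c', 'b']  # stand-ins for the image hashes
        with mp.Pool(processes=4) as pool:
            added = pool.map(check_and_add, [(v, shared_list, shared_dict) for v in items])
        print(list(shared_list), dict(shared_dict), added)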

Trying to parallelize a for loop, which calls a function with arguments

I'm new to Python, and especially new to multiprocessing/multithreading. I'm having trouble reading the documentation and finding a sufficiently similar example to work from.
The part that I am trying to divide among multiple cores is marked between *** below; the rest is there for context. There are three functions that are defined elsewhere in the code, NextFunction(), QualFunction(), and PrintFunction(). I don't think what they do is critical to parallelizing this code, so I did not include their definitions.
Can you help me parallelize this?
So far, I've looked at
https://docs.python.org/2/library/multiprocessing.html
Python Multiprocessing a for loop
and I've tried the equivalents for multithreading, and I've tried ipython.parallel as well.
The code is intended to pull data from a file, process it through a few functions and print it, checking for various conditions along the way.
The code looks like:
def main(arg, obj1Name, obj2Name):
    global dblen
    records = fasta(refName)
    for i, r in enumerate(records):
        s = r.fastasequence
        idnt = s.name.split()[0]
        reference[idnt] = s.seq
        names[i] = idnt
        dblen += len(s.seq)
        if taxNm == None: taxid[idnt] = GetTaxId(idnt).strip()
    records.close()
    print >> stderr, "Read it"
    # read the taxids
    if taxNm != None:
        file = open(taxNm, "r")
        for line in file:
            idnt, tax = line.strip().split()
            taxid[idnt] = tax
        file.close()
    File1 = pysam.Samfile(obj1Name, "rb")
    File2 = pysam.Samfile(obj2Name, "rb")
    ***for obj1s, obj2s in NextFunction(File1, File2):
        qobj1 = []
        qobj2 = []
        lobj1s = list(QualFunction(obj1s))
        lobj2s = list(QualFunction(obj2s))
        for obj1, ftrs1 in lobj1s:
            for obj2, ftrs2 in lobj2s:
                if (obj1.tid == obj2.tid):
                    qobj1.append((obj1, ftrs1))
                    qobj2.append((obj2, ftrs2))
        for obj, ftrs in qobj1:
            PrintFunction(obj, ftrs, "1")
        for obj, ftrs in qobj2:
            PrintFunction(obj, ftrs, "2")***
    File1.close()
    File2.close()
And it is called by
if __name__ == "__main__":
etc
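One general way to structure this kind of parallelisation is sketched below. It is only a sketch: it assumes the pairs yielded by NextFunction() can be pickled, which is often not the case for pysam objects; otherwise regions or file offsets would have to be passed and the files reopened in each worker. The names process_pair and run_parallel are illustrative.
import multiprocessing as mp

def process_pair(pair):
    obj1s, obj2s = pair
    # ... body of the *** loop, returning the lines to print instead of printing ...
    return []

def run_parallel(pairs):
    pool = mp.Pool()
    for lines in pool.imap(process_pair, pairs):
        for line in lines:  # print in the parent so the output stays ordered
            print(line)
    pool.close()
    pool.join()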

Writing to a file with multiprocessing

I'm having the following problem in Python.
I need to do some calculations in parallel, and their results need to be written sequentially to a file. So I created a function that receives a multiprocessing.Queue and a file handle, does the calculation and prints the result to the file:
import multiprocessing
from multiprocessing import Process, Queue
from mySimulation import doCalculation

# doCalculation(pars) is a function I must run for many different sets of parameters and collect the results in a file

def work(queue, fh):
    while True:
        try:
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, result
        except:
            break

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fh = open("foo", "w")
    workQueue = Queue()
    parList = # list of conditions for which I want to run doCalculation()
    for x in parList:
        workQueue.put(x)
    processes = [Process(target = work, args = (workQueue, fh)) for i in range(nthreads)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    fh.close()
But the file ends up empty after the script runs. I tried to change the work() function to:
def work(queue, filename):
    while True:
        try:
            fh = open(filename, "a")
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, result
            fh.close()
        except:
            break
and pass the filename as parameter. Then it works as I intended. When I try to do the same thing sequentially, without multiprocessing, it also works normally.
Why didn't it work in the first version? I can't see the problem.
Also: can I guarantee that two processes won't try to write the file simultaneously?
EDIT:
Thanks. I got it now. This is the working version:
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep
from random import uniform

def doCalculation(par):
    t = uniform(0, 2)
    sleep(t)
    return par * par  # just to simulate some calculation

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block = False)
            print "dealing with ", par, ""
            res = doCalculation(par)
            queueOut.put((par, res))
        except:
            break

def write(queue, fname):
    fhandle = open(fname, "w")
    while True:
        try:
            par, res = queue.get(block = False)
            print >>fhandle, par, res
        except:
            break
    fhandle.close()

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fname = "foo"
    workerQueue = Queue()
    writerQueue = Queue()
    parlist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    feedProc = Process(target = feed, args = (workerQueue, parlist))
    calcProc = [Process(target = calc, args = (workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target = write, args = (writerQueue, fname))

    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()

    feedProc.join()
    for p in calcProc:
        p.join()
    writProc.join()
You really should use two queues and three separate kinds of processing.
Put stuff into Queue #1.
Get stuff out of Queue #1 and do calculations, putting stuff in Queue #2. You can have many of these, since they get from one queue and put into another queue safely.
Get stuff out of Queue #2 and write it to a file. You must have exactly 1 of these and no more. It "owns" the file, guarantees atomic access, and absolutely assures that the file is written cleanly and consistently.
If anyone is looking for a simple way to do the same, this can help you.
I don't think there are any disadvantages to doing it in this way. If there are, please let me know.
import multiprocessing
import re

def mp_worker(item):
    # Do something
    return item, count

def mp_handler():
    cpus = multiprocessing.cpu_count()
    p = multiprocessing.Pool(cpus)
    # The below 2 lines populate the list. This listX will later be accessed in
    # parallel. This can be replaced as long as listX is passed on to the next step.
    with open('ExampleFile.txt') as f:
        listX = [line for line in (l.strip() for l in f) if line]
    with open('results.txt', 'w') as f:
        for result in p.imap(mp_worker, listX):
            # (item, count) tuples from worker
            f.write('%s: %d\n' % result)

if __name__ == '__main__':
    mp_handler()
Source: Python: Writing to a single file with queue while using multiprocessing Pool
There is a mistake in the write worker code: with block set to False, the writer may never get any data, because it can find the queue empty and exit before the calc workers have produced anything. It should be as follows:
par, res = queue.get(block = True)
You can check this by adding the line
print "QSize", queueOut.qsize()
after the
queueOut.put((par, res))
With block=False you would see an ever-increasing queue length until it fills up, unlike with block=True, where you always get "1".
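If you want the blocking get without the writer hanging at shutdown, one option (a sketch, not from the answers above) is to send a sentinel once all calc processes have joined, e.g. writerQueue.put(None) just before writProc.join():
SENTINEL = None

def write(queue, fname):
    # The writer owns the file; a blocking get() plus a sentinel avoids both the
    # premature exit (block=False) and an endless wait when the queue runs dry.
    fhandle = open(fname, "w")
    while True:
        item = queue.get()  # blocks until something arrives
        if item is SENTINEL:
            break
        par, res = item
        fhandle.write("%s %s\n" % (par, res))
    fhandle.close()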
