Using ray to speed up checking json files - python

I have over a million json files, and I'm trying to find the fastest way to check first, if they load, and then, if there exists either key_A, key_B, or neither. I thought I might be able to use ray to speed up this process, but opening a file seems to fail with ray.
As a simplification, here's my attempt at just checking whether or not a file will load:
import ray
ray.init()
@ray.remote
class Counter(object):
    def __init__(self):
        self.good = 0
        self.bad = 0
    def increment(self, j):
        try:
            with open(j, 'r') as f:
                l = json.load(f)
            self.good += 1
        except: # all files end up here
            self.bad += 1
    def read(self):
        return (self.good, self.bad)
counter = Counter.remote()
[counter.increment.remote(j) for j in json_paths]
futures = counter.read.remote()
print(ray.get(futures))
But I end up with (0, len(json_paths)) as a result.
For reference, the slightly more complicated actual end goal I have is to check:
new, old, bad = 0,0,0
try:
    with open(json_path, 'r') as f:
        l = json.load(f)
    ann = l['frames']['FrameLabel']['annotations']
    first_object = ann[0][0]
except:
    bad += 1
    return
if 'object_category' in first_object:
    new += 1
elif 'category' in first_object:
    old += 1
else:
    bad += 1

I'd recommend not using Python for this at all, but a command-line tool such as jq.
A command like
jq -c "[input_filename, (.frames.FrameLabel.annotations[0][0]|[.object_category,.category])]" good.json bad.json old.json
outputs
["good.json",["good",null]]
["bad.json",[null,null]]
["old.json",[null,"good"]]
for each of your categories of data, which will be significantly easier to parse.
You can use e.g. the GNU find tool, or if you're feeling fancy, parallel, to come up with the command lines to run.
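If the counts are still wanted in Python afterwards, the jq output above could be tallied with something like the following minimal sketch. The function name classify and the hard-coded sample lines are illustrative assumptions, not part of the answer. Note that jq prints null both for a missing key and for a key whose value is null, so this sketch cannot tell those two cases apart.

```python
import json

def classify(jq_line):
    """Classify one jq output line shaped like ["name.json", [object_category, category]]."""
    filename, (object_category, category) = json.loads(jq_line)
    if object_category is not None:
        return filename, "new"
    if category is not None:
        return filename, "old"
    return filename, "bad"

# Sample lines copied from the jq output shown above.
sample_output = [
    '["good.json",["good",null]]',
    '["bad.json",[null,null]]',
    '["old.json",[null,"good"]]',
]

results = dict(classify(line) for line in sample_output)
print(results)
```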

You could use Python's built-in concurrent.futures module instead to perform your task, which ray might not be best suited for. Example:
from concurrent.futures import ThreadPoolExecutor
numThreads = 10
def checkFile(path):
    return True # parse and check here
with ThreadPoolExecutor(max_workers=numThreads) as pool:
    good = sum(pool.map(checkFile, json_paths))
    bad = len(json_paths) - good
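The placeholder check above could be filled in along these lines for the three-way new/old/bad classification described in the question. This is a sketch under assumptions: the names check_file and count_categories are made up for illustration, and the isinstance guard is an addition that treats a non-dictionary first object as bad.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def check_file(path):
    """Return 'new', 'old', or 'bad' for a single JSON file."""
    try:
        with open(path, "r") as f:
            data = json.load(f)
        first_object = data["frames"]["FrameLabel"]["annotations"][0][0]
    except (OSError, ValueError, KeyError, IndexError, TypeError):
        # Unreadable file, invalid JSON, or unexpected structure.
        return "bad"
    if not isinstance(first_object, dict):
        # Assumption: anything other than a JSON object counts as bad.
        return "bad"
    if "object_category" in first_object:
        return "new"
    if "category" in first_object:
        return "old"
    return "bad"

def count_categories(json_paths, num_threads=10):
    """Classify all files in a thread pool and tally the results."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return Counter(pool.map(check_file, json_paths))
```

Catching specific exception types rather than using a bare except means that an unrelated error, such as a missing import, surfaces as a traceback instead of being silently counted as a bad file.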

Related

Speed up importing huge json files

I am trying to open up some huge json files
papers0 = []
papers1 = []
papers2 = []
papers3 = []
papers4 = []
papers5 = []
papers6 = []
papers7 = []
for x in range(8):
    for line in open(f'part_00{x}.json', 'r'):
        globals()['papers%s' % x].append(json.loads(line))
However the process above is slow. I wonder if there is some parallelization trick or some other in order to speed it up.
Thank you
If the JSON files are very large then loading them (as Python dictionaries) will be I/O bound. Therefore, multithreading would be appropriate for parallelisation.
Rather than having discrete variables for each dictionary, why not have a single dictionary keyed on the significant numeric part of the filename(s)?
For example:
from concurrent.futures import ThreadPoolExecutor as TPE
from json import load as LOAD
from sys import stderr as STDERR
NFILES = 8
JDATA = {}
def get_json(n):
    try:
        with open(f'part_00{n}.json') as j:
            return n, LOAD(j)
    except Exception as e:
        print(e, file=STDERR)
        return n, None
def main():
    with TPE() as tpe:
        # update the module-level dictionary in place; a plain assignment
        # here would only bind a local name and leave JDATA empty
        JDATA.update(tpe.map(get_json, range(NFILES)))
if __name__ == '__main__':
    main()
After running this, the dictionary representation of the JSON file part_005.json (for example) would be accessible as JDATA[5].
Note that if an exception arises while accessing or processing any of the files, the relevant dictionary value will be None.
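The loop in the question parses each line separately, which suggests the files hold one JSON document per line; json.load on a whole file of that shape would raise an error. A minimal variant for that layout might look like the sketch below. The names load_json_lines and load_parts are hypothetical, and whether threads give a real speedup depends on how much of the time is spent parsing rather than reading from disk; if parsing dominates, swapping in ProcessPoolExecutor may be worth trying.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def load_json_lines(path):
    """Parse a file that holds one JSON document per line, skipping blank lines."""
    with open(path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_parts(paths, max_workers=None):
    """Load several such files concurrently and return a dict keyed by path."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(load_json_lines, paths)))
```

With the file names from the question, the call would be load_parts([f'part_00{x}.json' for x in range(8)]).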

Python: Write content to an open stream

I am working with an API that takes an open binary file as a parameter and then performs blocking reads on that until EOF.
Rather than opening an existing file (io.open mode 'rb') I want to pass it a stream that I write calculated/constructed content to - in effect I want something that is conceptually a unidirectional pipe where the output is delivered via an inputstream that is interchangeable with an open file.
I looked at BufferedRWPair but the few examples I could find all violate its warnings not to use the same object for the input and output sides.
If anyone has an appropriate example or better suggestion, it's welcome!
I've looked at BufferedRandom based on comments here, but I'm obviously doing something wrong as....
import io
buf = io.BufferedRandom(io.BytesIO())
buf.write("a")
buf.write("b")
buf.flush()
while True:
    print "reading"
    a = buf.read(1024)
    if not a: break
    print "read: {}".format(a)
buf.close()
This exits after the first read
Update:
This admittedly messy example shows the solution, which has to maintain independent read and write positions:
import io
buf = io.BufferedRandom(io.BytesIO())
read = 0
wrote = 0
buf.seek(wrote)
wrote += buf.write(b"a")
wrote += buf.write(b"b")
buf.seek(read)
data = buf.read(1)
read += len(data)
buf.seek(wrote)
wrote += buf.write(b"c")
print "read: {}".format(data)
buf.seek(read)
data = buf.read(512)
read += len(data)
wrote += buf.write(b"d")
buf.seek(wrote)
wrote += buf.write(b"efghihjlmnop")
while data:
    print "read: {}".format(data)
    buf.seek(read)
    data = buf.read(1024)
    read += len(data)
buf.close()
Comment: ... allow me to interleave reads and writes to the stream without ... managing the current read and write positions myself.
This is the behavior of io.BufferedRandom: it keeps a single file position that reads and writes share.
But you can encapsulate the logic in your own class StreamRW(io.BufferedRandom),
for instance:
class StreamRW(io.BufferedRandom):
    def __init__(self, raw):
        super().__init__(raw)
        self.seek(0)
    def read(self, size=1):
        super().seek(self.read_offset)
        data = super().read(size)
        self.read_offset = self.tell()
        return data
    def write(self, data):
        super().seek(self.write_offset)
        written = super().write(data)
        self.write_offset = self.tell()
        return written
    def seek(self, offset):
        super().seek(offset)
        self.read_offset = self.write_offset = self.tell()
#Usage:
buf = StreamRW(io.BytesIO())
...
Further code as below, but without buf.seek(0)!
You have to use buf.seek(0) to rewind the file position.
Note: the data has to be written as bytes, using the b"" prefix!
This is working for me:
import io
buf = io.BufferedRandom(io.BytesIO())
buf.write(b"a")
buf.write(b"b")
buf.seek(0)
while True:
    print("reading")
    a = buf.read(1024)
    if not a: break
    print("read: {}".format(a))
buf.close()
Output:
read: b'ab'
Tested with Python: 3.4.2 and 2.7.9
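Since the question describes something that is conceptually a unidirectional pipe, a different technique from the BufferedRandom approach above is an operating-system pipe fed by a writer thread. The following Python 3 sketch assumes the consuming API only needs a binary file object it can read until EOF; make_stream and produce are hypothetical names introduced for illustration.

```python
import os
import threading

def make_stream(produce):
    """Return a readable binary file object fed by produce(writer) in a background thread."""
    read_fd, write_fd = os.pipe()
    reader = os.fdopen(read_fd, "rb")
    writer = os.fdopen(write_fd, "wb")

    def run():
        # Closing the writer (on leaving the with block) is what signals EOF to the reader.
        with writer:
            produce(writer)

    threading.Thread(target=run, daemon=True).start()
    return reader

def produce(writer):
    # Stand-in for the calculated/constructed content.
    for chunk in (b"a", b"b", b"c"):
        writer.write(chunk)

with make_stream(produce) as stream:
    data = stream.read()  # blocks until the writer side is closed
print(data)
```

The reader returned by make_stream would be passed to the API in place of an opened file. Because the pipe has a bounded buffer, the writer thread blocks when the consumer falls behind, so no read or write offsets need to be tracked by hand.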

Disco chaining skips reduce

I recently found Disco Project and really like it in comparison to Hadoop but I have a problem. My project is setup like so (I'll be happy to cut/paste real code if it would help):
myfile.py
from disco.core import Job, result_iterator
import collections, sys
from disco.worker.classic.func import chain_reader
from disco.worker.classic.worker import Params
def helper1():
    #do stuff
def helper2():
    #do stuff
.
.
.
def helperN():
    #do stuff
class A(Job):
    @staticmethod
    def map_reader(fd, params):
        #Read input file
        yield line
    def map(self, line, params):
        #Process lines into dictionary
        #Iterate dictionary
        yield k, v
    def reduce(self, iter, out, params):
        #iterate iter
        #Process k, v into dictionary, aggregating values
        #Process dictionary
        #Iterate dictionary
        out.add(k,v)
class B(Job):
    map_reader = staticmethod(chain_reader)
    map = staticmethod(nop_map)
    def reduce(self, iter, out, params):
        #Process iter
        #iterate results
        out.add(k,v)
if __name__ == '__main__':
    from myfile import A, B
    job1 = A().run(input=[input_filename], params=Params(k=k))
    job2 = B().run(input=[job1.wait()], params=Params(k=k))
    with open(output_filename, 'w') as fp:
        for count, line in result_iterator(job2.wait(show=True)):
            fp.write(str(count) + ',' + line + '\n')
My problem is the job flow completely skips A's reduce and goes down to B's reduce.
Any ideas what is going on here?
This was an easy but subtle problem: I didn't have a
show = True
for job1. For some reason, with show set for job2, the output displayed the map() and map-shuffle() steps from job1. Since I wasn't getting the final result I expected and the input to one of the job2 functions looked wrong, I jumped to the conclusion that the job1 steps weren't run properly. (That conclusion seemed even more plausible because I had verified the accuracy of job1's output before adding job2.)

Trying to parallelize a for loop, which calls a function with arguments

I'm new to python, and especially new to multiprocessing/multithreading. I have trouble reading the documentation, or finding a sufficiently similar example to work off of.
The part that I am trying to divide among multiple cores is delimited by *** markers; the rest is there for context. There are three functions that are defined elsewhere in the code: NextFunction(), QualFunction(), and PrintFunction(). I don't think what they do is critical to parallelizing this code, so I did not include their definitions.
Can you help me parallelize this?
So far, I've looked at
https://docs.python.org/2/library/multiprocessing.html
Python Multiprocessing a for loop
and I've tried the equivalents for multithreading, and I've tried ipython.parallel as well.
The code is intended to pull data from a file, process it through a few functions and print it, checking for various conditions along the way.
The code looks like:
def main(arg, obj1Name, obj2Name):
    global dblen
    records = fasta(refName)
    for i,r in enumerate(records):
        s = r.fastasequence
        idnt = s.name.split()[0]
        reference[idnt] = s.seq
        names[i] = idnt
        dblen += len(s.seq)
        if taxNm == None: taxid[idnt] = GetTaxId(idnt).strip()
    records.close()
    print >> stderr, "Read it"
    # read the taxids
    if taxNm != None:
        file = open(taxNm, "r")
        for line in file:
            idnt,tax = line.strip().split()
            taxid[idnt] = tax
        file.close()
    File1 = pysam.Samfile(obj1Name, "rb")
    File2 = pysam.Samfile(obj2Name, "rb")
    # *** start of the part to parallelize
    for obj1s,obj2s in NextFunction(File1, File2):
        qobj1 = []
        qobj2 = []
        lobj1s = list(QualFunction(obj1s))
        lobj2s = list(QualFunction(obj2s))
        for obj1,ftrs1 in lobj1s:
            for obj2,ftrs2 in lobj2s:
                if (obj1.tid == obj2.tid):
                    qobj1.append((obj1,ftrs1))
                    qobj2.append((obj2,ftrs2))
        for obj,ftrs in qobj1:
            PrintFunction(obj, ftrs, "1")
        for obj,ftrs in qobj2:
            PrintFunction(obj, ftrs, "2")
    # *** end of the part to parallelize
    File1.close()
    File2.close()
And is called by
if __name__ == "__main__":
etc

How to get the internal position while reading bzip2 file

I've got a script to decompress and parse data contained in a bunch of very large bzip2-compressed files. Since it can take a while, I'd like to have some way to monitor the progress. I know I can get the file size with os.path.getsize(), but bz2.BZ2File.tell() returns the position within the uncompressed data. Is there any way to get the current position within the compressed file so I can monitor the progress?
Bonus points if there's a python equivalent to Java's ProgressMonitorInputStream.
If you only need to parse the data in the bzipped file, I think it should be possible to avoid decompressing the whole file before reading it. I have tested this only on gzipped files, not on bzip2, but I hope the same is possible with bzip2 files.
See, for instance:
How to write csv in python efficiently?
This is the solution I came up with that seems to work.
import bz2
class SimpleBZ2File(object):
    def __init__(self, path, readsize=1024):
        self.decomp = bz2.BZ2Decompressor()
        self.rawinput = open(path, 'rb')
        self.eof = False
        self.readsize = readsize
        self.leftover = b''
    def tell(self):
        # position within the compressed file, comparable to os.path.getsize()
        return self.rawinput.tell()
    def __iter__(self):
        while not self.eof:
            rawdata = self.rawinput.read(self.readsize)
            if not rawdata:
                self.eof = True
            else:
                data = self.decomp.decompress(rawdata)
                if not data:
                    continue #we need to supply more raw to decompress
                # prepend the partial line carried over from the previous chunk
                newlines = (self.leftover + data).splitlines(True)
                self.leftover = b''
                if not newlines[-1].endswith(b'\n'):
                    # last piece is incomplete; hold it back until more data arrives
                    self.leftover = newlines.pop()
                for l in newlines:
                    yield l
        if self.leftover:
            yield self.leftover
        self.rawinput.close()
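On Python 3 a shorter route may be possible, because bz2.BZ2File accepts an already opened file object, and the position of that underlying file can be compared with the compressed size. The sketch below is an alternative to the class above, not the same technique. The name iter_lines_with_progress is hypothetical, and the reported fraction is approximate because the compressed data is read in buffered blocks ahead of the lines being yielded.

```python
import bz2
import os

def iter_lines_with_progress(path):
    """Yield (line, fraction) pairs, where fraction is the share of compressed bytes read so far."""
    total = os.path.getsize(path)
    with open(path, "rb") as raw, bz2.BZ2File(raw) as lines:
        for line in lines:
            # raw.tell() is the position in the compressed file, not in the decompressed data.
            yield line, (raw.tell() / total if total else 1.0)
```

A caller could print or log the fraction every few thousand lines to get a rough progress indicator.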
