Disco chaining skips reduce - python

I recently found the Disco Project and really like it compared to Hadoop, but I have a problem. My project is set up like so (I'll be happy to cut/paste real code if it would help):
myfile.py
from disco.core import Job, result_iterator
import collections, sys
from disco.worker.classic.func import chain_reader
from disco.worker.classic.worker import Params

def helper1():
    #do stuff

def helper2():
    #do stuff
.
.
.
def helperN():
    #do stuff

class A(Job):
    @staticmethod
    def map_reader(fd, params):
        #Read input file
        yield line

    def map(self, line, params):
        #Process lines into dictionary
        #Iterate dictionary
        yield k, v

    def reduce(self, iter, out, params):
        #iterate iter
        #Process k, v into dictionary, aggregating values
        #Process dictionary
        #Iterate dictionary
        out.add(k, v)

class B(Job):
    map_reader = staticmethod(chain_reader)
    map = staticmethod(nop_map)

    def reduce(self, iter, out, params):
        #Process iter
        #iterate results
        out.add(k, v)

if __name__ == '__main__':
    from myfile import A, B
    job1 = A().run(input=[input_filename], params=Params(k=k))
    job2 = B().run(input=[job1.wait()], params=Params(k=k))
    with open(output_filename, 'w') as fp:
        for count, line in result_iterator(job2.wait(show=True)):
            fp.write(str(count) + ',' + line + '\n')
My problem is the job flow completely skips A's reduce and goes down to B's reduce.
Any ideas what is going on here?

This was an easy but subtle problem: I didn't pass
show=True
for job1. For some reason, with show set for job2, it was showing me the map() and map-shuffle() steps from job1. Since I wasn't getting the final result I expected, and the input to one of the job2 functions looked wrong, I jumped to the conclusion that the job1 steps weren't being run properly (further supported by the fact that I had verified the accuracy of job1's output before adding job2).
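For reference, a minimal sketch of the same driver with show=True also passed to job1's wait(), so that job1's own stages are displayed (same hypothetical input_filename, output_filename, and Params as above):

# Same driver as above, with show=True also passed to job1.wait()
job1 = A().run(input=[input_filename], params=Params(k=k))
job2 = B().run(input=[job1.wait(show=True)], params=Params(k=k))
with open(output_filename, 'w') as fp:
    for count, line in result_iterator(job2.wait(show=True)):
        fp.write(str(count) + ',' + line + '\n')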

Related

How to parallel the following code using Multiprocessing in Python

I have a function generate(file_path) which returns an integer index and a numpy array. A simplified version of the generate function is as follows:
def generate(file_path):
    temp = np.load(file_path)
    #get index from the string file_path
    idx = int(file_path.split("_")[0])
    #do some mathematical operation on temp
    result = operate(temp)
    return idx, result
I need to glob through a directory and collect the results of generate(file_path) into an hdf5 file. My serial (single-process) code is as follows:
for path in glob.glob(directory):
    idx, result = generate(path)
    hdf5_file["results"][idx,:] = result
hdf5_file.close()
I hope to write multi-threaded or multi-process code to speed up the above. How could I modify it? Thanks!
My attempt was to modify the generate function and my "main" as follows:
import glob
import multiprocessing as mp
from multiprocessing import Pool
import h5py
import numpy as np

def generate(file_path):
    temp = np.load(file_path)
    #get index from the string file_path
    idx = int(file_path.split("_")[0])
    #do some mathematical operation on temp
    result = operate(temp)
    hdf5_path = "./result.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file["results"][idx,:] = result
    hdf5_file.close()

if __name__ == '__main__':
    ##construct hdf5 file
    hdf5_path = "./output.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file.create_dataset("results", [2000,15000], np.uint8)
    hdf5_file.close()
    path_ = "./compute/*"
    p = Pool(mp.cpu_count())
    p.map(generate, glob.glob(path_))
    hdf5_file.close()
    print("finished")
However, it does not work. It throws this error:
KeyError: "Unable to open object (object 'results' doesn't exist)"
You can use a thread or process pool to execute multiple function calls concurrently. Here is an example which uses a process pool:
from concurrent.futures import ProcessPoolExecutor
from time import sleep

def generate(file_path: str) -> int:
    sleep(1.0)
    return int(file_path.split("_")[1])

def main():
    file_paths = ["path_1", "path_2", "path_3"]
    with ProcessPoolExecutor() as pool:
        results = pool.map(generate, file_paths)
        for result in results:
            # Write to the HDF5 file
            print(result)

if __name__ == "__main__":
    main()
Note that you should not write to the same HDF5 file concurrently, i.e. the file writing should not happen in the generate function.
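Building on that, here is a rough sketch of how the original task might look with the workers returning (idx, result) and only the parent process touching the HDF5 file. The dataset shape, dtype, "./compute/*" glob, and index logic are taken from the question, and operate() is assumed to exist there:

import glob
from concurrent.futures import ProcessPoolExecutor

import h5py
import numpy as np

def generate(file_path):
    temp = np.load(file_path)
    idx = int(file_path.split("_")[0])   # index recovered from the file name, as in the question
    result = operate(temp)               # operate() is assumed from the question
    return idx, result                   # return instead of writing to HDF5 here

def main():
    file_paths = glob.glob("./compute/*")
    with h5py.File("./output.hdf5", "w") as hdf5_file:
        dset = hdf5_file.create_dataset("results", (2000, 15000), dtype=np.uint8)
        with ProcessPoolExecutor() as pool:
            # Only this parent process writes to the file
            for idx, result in pool.map(generate, file_paths):
                dset[idx, :] = result

if __name__ == "__main__":
    main()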
After examining your code, I noticed some errors in how the dataset is initialised:
You create the hdf5 file with the path "./result.hdf5" inside the generate function.
However, I think you neglected to create a "results" dataset in that file, and that is what is causing the "object doesn't exist" error.
Kindly reply if you still face the same issue, with the error message.

Exporting API output to txt file

I have created a simple API with FastAPI and I want to export the output to a text file (txt).
This is a simplified version of the code:
import sys
from clases.sequence import Sequence
from clases.read_file import Read_file
from fastapi import FastAPI

app = FastAPI()

@app.get("/DNA_toolkit")
def sum(input: str):  # pass the sequence in, this time as a query param
    DNA = Sequence(input)  # get the result (i.e., 4)
    return {"Length": DNA.length(),  # return the response
            "Reverse": DNA.reverse(),
            "complement": DNA.complement(),
            "Reverse and complement": DNA.reverse_and_complement(),
            "gc_percentage": DNA.gc_percentage()
            }
And this is the output
{"Length":36,"Reverse":"TTTTTTTTTTGGGGGGGAAAAAAAAAAAAAAAATAT","complement":"ATATTTTTTTTTTTTTTTTCCCCCCCAAAAAAAAAA","Reverse and complement":"AAAAAAAAAACCCCCCCTTTTTTTTTTTTTTTTATA","gc_percentage":5.142857142857143}
The file I would like to get
Length 36
Reverse TTTTTTTTTTGGGGGGGAAAAAAAAAAAAAAAATAT
complement ATATTTTTTTTTTTTTTTTCCCCCCCAAAAAAAAAA
Reverse and complement AAAAAAAAAACCCCCCCTTTTTTTTTTTTTTTTATA
Is there a simple way to do this? This is my first time working with APIs and I don't even know how possible this is.
dict1 = {"Length":36,"Reverse":"TTTTTTTTTTGGGGGGGAAAAAAAAAAAAAAAATAT","complement":"ATATTTTTTTTTTTTTTTTCCCCCCCAAAAAAAAAA","Reverse and complement":"AAAAAAAAAACCCCCCCTTTTTTTTTTTTTTTTATA","gc_percentage":5.142857142857143}

with open("output.txt", "w") as data:
    for k, v in dict1.items():
        append_data = k + " " + str(v)
        data.write(append_data)
        data.write("\n")
Output:
Length 36
Reverse TTTTTTTTTTGGGGGGGAAAAAAAAAAAAAAAATAT
complement ATATTTTTTTTTTTTTTTTCCCCCCCAAAAAAAAAA
Reverse and complement AAAAAAAAAACCCCCCCTTTTTTTTTTTTTTTTATA
gc_percentage 5.142857142857143
You can use the open() function to create a new file and write your output. And as @Blackgaurd told you, this isn't a code-writing service.
Also, I wrote this code really quickly, so some syntax errors may occur.
import sys
import datetime
from clases.sequence import Sequence
from clases.read_file import Read_file
from fastapi import FastAPI

app = FastAPI()

@app.get("/DNA_toolkit")
def sum(input: str):  # pass the sequence in, this time as a query param
    DNA = Sequence(input)  # get the result (i.e., 4)
    res = {"Length": DNA.length(),  # the response
           "Reverse": DNA.reverse(),
           "complement": DNA.complement(),
           "Reverse and complement": DNA.reverse_and_complement(),
           "gc_percentage": DNA.gc_percentage()
           }

    # with open('result.txt', 'w+') as resFile:
    #     for i in res:
    #         resFile.write(i + " " + str(res[i]) + "\n")
    # Uncomment the block above if you don't want to save the result into
    # a file with a unique name; otherwise go with the method I wrote below...
    filename = str(datetime.datetime.now().date()) + '_' + str(datetime.datetime.now().time()).replace(':', '.')
    with open(filename + '.txt', 'w+') as resFile:
        for i in res:
            resFile.write(i + " " + str(res[i]) + "\n")

    return res  # return the response
I am going to assume that you have already got your data somehow by calling your API.
# data = requests.get(...).json()

# save to file:
with open("DNA_insights.txt", 'w') as f:
    for k, v in data.items():
        f.write(f"{k}: {v}\n")
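For completeness, a small client-side sketch of that idea, assuming the app from the question is running locally at a hypothetical http://127.0.0.1:8000 and using a made-up query sequence:

import requests

# Hypothetical local URL and query parameter for the /DNA_toolkit endpoint above
resp = requests.get("http://127.0.0.1:8000/DNA_toolkit",
                    params={"input": "ATATTTTTTTTTTTTTTTTCCCCCCCAAAAAAAAAA"})
data = resp.json()

with open("DNA_insights.txt", "w") as f:
    for k, v in data.items():
        f.write(f"{k}: {v}\n")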

Using ray to speed up checking json files

I have over a million json files, and I'm trying to find the fastest way to check, first, whether they load, and then whether key_A, key_B, or neither exists. I thought I might be able to use ray to speed up this process, but opening a file seems to fail with ray.
As a simplification, here's my attempt at just checking whether or not a file will load:
import json
import ray

ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.good = 0
        self.bad = 0

    def increment(self, j):
        try:
            with open(j, 'r') as f:
                l = json.load(f)
                self.good += 1
        except:  # all files end up here
            self.bad += 1

    def read(self):
        return (self.good, self.bad)

counter = Counter.remote()
[counter.increment.remote(j) for j in json_paths]
futures = counter.read.remote()
print(ray.get(futures))
But I end up with (0, len(json_paths)) as a result.
For reference, the slightly more complicated actual end goal I have is to check:
new, old, bad = 0, 0, 0
try:
    with open(json_path, 'r') as f:
        l = json.load(f)
    ann = l['frames']['FrameLabel']['annotations']
    first_object = ann[0][0]
except:
    bad += 1
    return
if 'object_category' in first_object:
    new += 1
elif 'category' in first_object:
    old += 1
else:
    bad += 1
I'd recommend not using Python for this at all, but rather something like jq.
A command like
jq -c "[input_filename, (.frames.FrameLabel.annotations[0][0]|[.object_category,.category])]" good.json bad.json old.json
outputs
["good.json",["good",null]]
["bad.json",[null,null]]
["old.json",[null,"good"]]
for each of your categories of data, which will be significantly easier to parse.
You can use e.g. the GNU find tool, or if you're feeling fancy, parallel, to come up with the command lines to run.
You could use Python's built-in concurrent.futures module instead to perform your task, which ray might not be best suited for. Example:
from concurrent.futures import ThreadPoolExecutor

numThreads = 10

def checkFile(path):
    return True  # parse and check here

with ThreadPoolExecutor(max_workers=numThreads) as pool:
    good = sum(pool.map(checkFile, json_paths))
    bad = len(json_paths) - good
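For the original three-way check, a fuller checkFile might look roughly like this. This is only a sketch, assuming the 'frames'/'FrameLabel'/'annotations' layout and key names from the question, and the same json_paths list:

import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def checkFile(path):
    # Returns 'new', 'old', or 'bad' for a single json file
    try:
        with open(path, 'r') as f:
            data = json.load(f)
        first_object = data['frames']['FrameLabel']['annotations'][0][0]
    except (OSError, ValueError, KeyError, IndexError):
        return 'bad'
    if 'object_category' in first_object:
        return 'new'
    if 'category' in first_object:
        return 'old'
    return 'bad'

with ThreadPoolExecutor(max_workers=10) as pool:
    counts = Counter(pool.map(checkFile, json_paths))
print(counts['new'], counts['old'], counts['bad'])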

Trying to parallelize a for loop, which calls a function with arguments

I'm new to Python, and especially new to multiprocessing/multithreading. I'm having trouble reading the documentation or finding a sufficiently similar example to work from.
The part that I am trying to divide among multiple cores is marked with asterisks (***) below; the rest is there for context. There are three functions that are defined elsewhere in the code, NextFunction(), QualFunction(), and PrintFunction(). I don't think what they do is critical to parallelizing this code, so I did not include their definitions.
Can you help me parallelize this?
So far, I've looked at
https://docs.python.org/2/library/multiprocessing.html
Python Multiprocessing a for loop
and I've tried the equivalents for multithreading, and I've tried ipython.parallel as well.
The code is intended to pull data from a file, process it through a few functions and print it, checking for various conditions along the way.
The code looks like:
def main(arg, obj1Name, obj2Name):
global dblen
records = fasta(refName)
for i,r in enumerate(records):
s = r.fastasequence
idnt = s.name.split()[0]
reference[idnt] = s.seq
names[i] = idnt
dblen += len(s.seq)
if taxNm == None: taxid[idnt] = GetTaxId(idnt).strip()
records.close()
print >> stderr, "Read it"
# read the taxids
if taxNm != None:
file = open(taxNm, "r")
for line in file:
idnt,tax = line.strip().split()
taxid[idnt] = tax
file.close()
File1 = pysam.Samfile(obj1Name, "rb")
File2 = pysam.Samfile(obj2Name, "rb")
***for obj1s,obj2s in NextFunction(File1, File2):
qobj1 = []
qobj2 = []
lobj1s = list(QualFunction(obj1s))
lobj2s = list(QualFunction(obj2s))
for obj1,ftrs1 in lobj1s:
for obj2,ftrs2 in lobj2s:
if (obj1.tid == obj2.tid):
qobj1.append((obj1,ftrs1))
qobj2.append((obj2,ftrs2))
for obj,ftrs in qobj1:
PrintFunction(obj, ftrs, "1")
for obj,ftrs in qobj2:
PrintFunctiont(obj, ftrs, "2")***
File1.close()
File2.close()
And it is called by:
if __name__ == "__main__":
    etc
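A rough sketch of one possible approach (not from the original thread): move the starred loop body into a helper and hand the (obj1s, obj2s) pairs to a multiprocessing.Pool. This assumes those pairs and the returned objects can be pickled, and it keeps all printing in the parent process:

from multiprocessing import Pool

def process_pair(pair):
    # Worker: the starred loop body, returning matched pairs instead of printing
    obj1s, obj2s = pair
    qobj1, qobj2 = [], []
    lobj1s = list(QualFunction(obj1s))
    lobj2s = list(QualFunction(obj2s))
    for obj1, ftrs1 in lobj1s:
        for obj2, ftrs2 in lobj2s:
            if obj1.tid == obj2.tid:
                qobj1.append((obj1, ftrs1))
                qobj2.append((obj2, ftrs2))
    return qobj1, qobj2

def parallel_section(File1, File2):
    pool = Pool()
    try:
        # NextFunction yields the (obj1s, obj2s) pairs consumed by the original loop
        for qobj1, qobj2 in pool.imap(process_pair, NextFunction(File1, File2)):
            for obj, ftrs in qobj1:
                PrintFunction(obj, ftrs, "1")
            for obj, ftrs in qobj2:
                PrintFunction(obj, ftrs, "2")
    finally:
        pool.close()
        pool.join()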

Writing to a file with multiprocessing

I'm having the following problem in Python.
I need to do some calculations in parallel whose results I need to be written sequentially to a file. So I created a function that receives a multiprocessing.Queue and a file handle, does the calculation and prints the result to the file:
import multiprocessing
from multiprocessing import Process, Queue
from mySimulation import doCalculation

# doCalculation(pars) is a function I must run for many different sets of
# parameters and collect the results in a file

def work(queue, fh):
    while True:
        try:
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, result
        except:
            break

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fh = open("foo", "w")
    workQueue = Queue()
    parList = # list of conditions for which I want to run doCalculation()
    for x in parList:
        workQueue.put(x)
    processes = [Process(target = work, args = (workQueue, fh)) for i in range(nthreads)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    fh.close()
But the file ends up empty after the script runs. I tried changing the work() function to:
def work(queue, filename):
    while True:
        try:
            fh = open(filename, "a")
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, result
            fh.close()
        except:
            break
and pass the filename as a parameter. Then it works as I intended. When I try to do the same thing sequentially, without multiprocessing, it also works normally.
Why didn't it work in the first version? I can't see the problem.
Also: can I guarantee that two processes won't try to write to the file simultaneously?
EDIT:
Thanks. I got it now. This is the working version:
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep
from random import uniform

def doCalculation(par):
    t = uniform(0, 2)
    sleep(t)
    return par * par  # just to simulate some calculation

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block = False)
            print "dealing with ", par, ""
            res = doCalculation(par)
            queueOut.put((par, res))
        except:
            break

def write(queue, fname):
    fhandle = open(fname, "w")
    while True:
        try:
            par, res = queue.get(block = False)
            print >>fhandle, par, res
        except:
            break
    fhandle.close()

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fname = "foo"
    workerQueue = Queue()
    writerQueue = Queue()
    parlist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    feedProc = Process(target = feed, args = (workerQueue, parlist))
    calcProc = [Process(target = calc, args = (workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target = write, args = (writerQueue, fname))

    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()

    feedProc.join()
    for p in calcProc:
        p.join()
    writProc.join()
You really should use two queues and three separate kinds of processing.
1. Put stuff into Queue #1.
2. Get stuff out of Queue #1 and do calculations, putting stuff into Queue #2. You can have many of these, since they get from one queue and put into another queue safely.
3. Get stuff out of Queue #2 and write it to a file. You must have exactly 1 of these and no more. It "owns" the file, guarantees atomic access, and absolutely assures that the file is written cleanly and consistently.
If anyone is looking for a simple way to do the same, this can help you.
I don't think there are any disadvantages to doing it in this way. If there are, please let me know.
import multiprocessing
import re

def mp_worker(item):
    # Do something
    return item, count

def mp_handler():
    cpus = multiprocessing.cpu_count()
    p = multiprocessing.Pool(cpus)
    # The 2 lines below populate the list. listX will later be accessed in parallel.
    # This can be replaced as long as listX is passed on to the next step.
    with open('ExampleFile.txt') as f:
        listX = [line for line in (l.strip() for l in f) if line]
    with open('results.txt', 'w') as f:
        for result in p.imap(mp_worker, listX):
            # (item, count) tuples from worker
            f.write('%s: %d\n' % result)

if __name__ == '__main__':
    mp_handler()
Source: Python: Writing to a single file with queue while using multiprocessing Pool
There is a mistake in the write worker code: if block is False, the worker will never get any data. It should be as follows:
par, res = queue.get(block = True)
You can check this by adding the line
print "QSize", queueOut.qsize()
after
queueOut.put((par, res))
With block=False you would see the length of the queue keep increasing until it fills up, whereas with block=True you always get "1".
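A rough sketch (not from the original answers) of a common alternative: have the writer block on the queue and stop on a sentinel value that the main process puts in after all calc workers have been joined, which avoids both the bare-except polling of block=False and an indefinite blocking get:

# Writer that blocks on the queue and exits when it sees a sentinel (None here)
def write(queue, fname):
    with open(fname, "w") as fhandle:
        while True:
            item = queue.get()       # block=True by default
            if item is None:         # sentinel: no more results coming
                break
            par, res = item
            fhandle.write("%s %s\n" % (par, res))

# In the main process, after all calc workers have been joined:
#     writerQueue.put(None)
#     writProc.join()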
