I want to split a huge file into small files. There are approximately 2 million IDs in the file and I want to separate them by modulo. When you run the program it should ask for the number of files the main file should be divided into (x = int(input())). I want to split the file using the modulo operation, i.e. if ID % x == 1 the ID should be added to q1 and then to f1. But it only adds the first line that matches the requirement.
import multiprocessing

def createlist(x, i, queue_name):
    with open("events.txt") as f:
        next(f)
        for line in f:
            if int(line) % x == i:
                queue_name.put(line)

def createfile(x, i, queue_name):
    for i in range(x):
        file_name = "file{}.txt".format(i+1)
        with open(file_name, "w") as text:
            text.write(queue_name.get())

if __name__ == '__main__':
    x = int(input("number of parts "))
    i = 0
    for i in range(x):
        queue_name = "q{}".format(i+1)
        queue_name = multiprocessing.Queue()
        p0 = multiprocessing.Process(target=createlist, args=(x, i, queue_name,))
        process_name = "p{}".format(i+1)
        process_name = multiprocessing.Process(target=createfile, args=(x, i, queue_name,))
        p0.start()
        process_name.start()
Your createfile has two functional issues:
1. it only reads from the queue once, then terminates
2. it iterates over the range of the desired number of subsets a second time, hence even after fixing the single queue-read you get one written file and parts - 1 empty files.
To fix your approach, make createfile look like this:
def createfile(i, queue_name):  # Note: x has been removed from the args
    file_name = "file{}.txt".format(i + 1)
    with open(file_name, "w") as text:
        while True:
            if queue_name.empty():
                break
            text.write(queue_name.get())
Since x has been removed from createfile's arguments, you'd also remove it from the process instantiation:
process_name = multiprocessing.Process(target = createfile, args = (i,queue_name,))
However ... do not do it like this. The more subsets you want, the more processes and queues you create (two processes and one queue per subset). That is a lot of overhead you create.
Also, while having one responsible process per output file for writing might still make some sense, having multiple processes reading the same (huge) file completely does not.
I did some timing and testing with an input file containing 1000 lines, each consisting of one random integer between 0 and 9999. I created three algorithms and ran each in ten iterations while tracking execution time. I did this for a desired number of subsets of 1, 2, 3, 4, 5 and 10. For the graph below I took the mean value of each series.
orwqpp (green): Is one-reader-writer-queue-per-part. Your approach. It saw an average increase in execution time of 0.48 seconds per additional subset.
orpp (blue): Is one-reader-per-part. This one had a common writer process that took care of writing to all files. It saw an average increase in execution time of 0.25 seconds per additional subset.
ofa (yellow): Is one-for-all. One single function, not run in a separate process, reading and writing in one go. It saw an average increase in execution time of 0.0014 seconds per additional subset.
Keep in mind that these figures were created with an input file 1/2000 the size of yours. The processes in what resembles your approach completed so quickly, they barely got in each other's way. Once the input is large enough to make the processes run for a longer amount of time, contention for CPU resources will increase and so will the penalty of having more processes as more subsets are requested.
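For reference, here is a minimal sketch of the orpp idea described above (one reader process per part feeding a single writer process through one queue). The function and file names are illustrative, not the exact code used for the timings:

import multiprocessing

def reader(parts, part, q):
    # each reader scans the whole file and forwards only its own part
    with open("events.txt") as f:
        next(f)
        for line in f:
            if int(line) % parts == part:
                q.put((part, line))
    q.put((part, None))  # signal that this reader is done

def writer(parts, q):
    # a single writer owns all output files
    handles = {p: open("file{}.txt".format(p + 1), "w") for p in range(parts)}
    done = 0
    while done < parts:
        part, line = q.get()
        if line is None:
            done += 1
        else:
            handles[part].write(line)
    for h in handles.values():
        h.close()

if __name__ == '__main__':
    parts = int(input("number of parts "))
    q = multiprocessing.Queue()
    w = multiprocessing.Process(target=writer, args=(parts, q))
    w.start()
    readers = [multiprocessing.Process(target=reader, args=(parts, p, q)) for p in range(parts)]
    for r in readers:
        r.start()
    for r in readers:
        r.join()
    w.join()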
Here's the one-for-all approach:
def one_for_all(parts):
    handles = dict({el: open("file{}.txt".format(el), "w") for el in range(parts)})
    with open("events.txt") as f:
        next(f)
        for line in f:
            fnum = int(line) % parts
            handles.get(fnum).write(line)
    [h.close() for h in handles.values()]

if __name__ == '__main__':
    x = int(input("number of parts "))
    one_for_all(x)
This currently names the files based on the result of the modulo operation, so numbers where int(line) % parts is 0 will be in file0.txt and so on.
If you don't want that, simply add 1 when formatting the file name:
handles = dict({el: open("file{}.txt".format(el+1), "w") for el in range(parts)})
I have this code that reads and processes a text file line by line. The problem is that my text file has between 1.5 and 2 billion lines and it is taking forever. Is there a way to process over 100 million lines at the same time?
from cryptotools.BTC.HD import check, WORDS

with open("input.txt", "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):
                print(mnemonic)
                with open("print.txt", "a") as i:
                    i.write(mnemonic)
                    i.write("\n")
Input file has the following sample lines:
gloom document {x} stomach uncover peasant sock minor decide special roast rural
happy seven {x} gown rally tennis yard patrol confirm actress pledge luggage
tattoo time {x} other horn motor symbol dice update outer fiction sign
govern wire {x} pill valid matter tomato scheme girl garbage action pulp
To process 100 million lines at once you would need to have 100 million threads. Another approach to improve the speed of your code is to split the work among a smaller number of threads.
Because read and write operations on files are not asynchronous, you are better off reading the whole file at the beginning of your program and writing the processed data out at the end. In the code below I assume you do not care about the order in which the lines are written out. If order is important, you could keep a dictionary whose keys are the positional index of the line being processed by a specific thread and sort accordingly at the end (a minimal order-preserving sketch follows the code below).
import concurrent.futures as cf

N_THREADS = 20
result = []

def doWork(data):
    for line in data:
        # do what you have to do
        result.append(mnemonic)

m_input = open("input.txt", "r")
lines = [line for line in m_input]

# the data for the threads will be here
# as a list of rows for each thread
m_data = {i: [] for i in range(0, N_THREADS)}

for l, n in zip(lines, range(0, len(lines))):
    m_data[n % N_THREADS].append(l)

'''
If you have to trim the number of threads uncomment these lines
m_data = {k: v for k, v in m_data.items() if len(v) != 0}
N_THREADS = N_THREADS if len(m_data) > N_THREADS else len(m_data)
if N_THREADS == 0:
    exit()
'''

with cf.ThreadPoolExecutor(max_workers=N_THREADS) as tp:
    for d in m_data.keys():
        tp.submit(doWork, m_data[d])

# work done
output = open("print.txt", "w")
for item in result:
    output.write(f"{item}\n")
output.close()
Change the number of threads as you find most efficient.
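If order does matter, a minimal sketch of the dictionary-keyed-by-index idea might look like this (process_line is a hypothetical stand-in for the per-line work on the mnemonics, not part of the code above):

import concurrent.futures as cf

def process_line(line):
    # hypothetical stand-in for the real per-line work from the question
    return line.strip()

with open("input.txt", "r") as f:
    lines = f.readlines()

ordered_results = {}  # line index -> processed result

def doWork(index, line):
    ordered_results[index] = process_line(line)

with cf.ThreadPoolExecutor(max_workers=20) as tp:
    for idx, line in enumerate(lines):
        tp.submit(doWork, idx, line)

with open("print.txt", "w") as out:
    for idx in sorted(ordered_results):
        out.write(f"{ordered_results[idx]}\n")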
Edit (with memory optimizations):
The code above, while extremely fast, uses a significant amount of memory because it loads the whole file into memory and then works on it.
You then have two options:
1. split your file into multiple smaller files; from my testing (see below) with a test file of ~10 million lines, the program ran extremely fast but used up to 1.3 GB of RAM.
2. use the code further down, where I load one line at a time, assign that line to a thread that works on it, and then push the data to a thread that is only responsible for writing to the file. This way the memory usage drops significantly, but the execution time rises.
The code below reads a single line at a time from the file (the one with 10 million lines, approximately ~500 MB) and sends that data to a class that manages a fixed number of threads. Currently I spawn a new thread every time one finishes; it could be made more efficient by always reusing the same threads and giving each thread its own queue. I then spawn a writer thread whose only job is to write to the out.txt file that will contain the result. In my testing I only read the text file and wrote the same lines to another file.
What I found out is the following (using the 10 million line file):
Original code: it took 14.20630669593811 seconds and used 1.301 GB of RAM (average usage) and 10% CPU.
Updated code: it took 1230.4356942176819 seconds and used 4.3 MB of RAM (average usage) and 10% CPU, with the internal parameters as in the code below.
The timed results were obtained using the same number of threads for both programs.
From those results it is evident that the memory-optimized code runs significantly slower while using far less RAM. You can tune the internal parameters, such as the number of threads or the maximum queue size, to improve performance, keeping in mind that this affects memory usage. After a lot of tests, I would suggest splitting the file into multiple subfiles that fit in your memory (a minimal splitting sketch follows) and running the original version of the code (see above), because the trade-off between time and memory is simply not justified in my opinion.
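For reference, a minimal sketch of that splitting step (the chunk size and the part_XXX.txt names are illustrative assumptions):

CHUNK_LINES = 1000000  # illustrative: pick a chunk size that fits in memory

with open("input.txt", "r") as infile:
    part = 0
    out = None
    for i, line in enumerate(infile):
        if i % CHUNK_LINES == 0:
            if out:
                out.close()
            part += 1
            out = open("part_{:03d}.txt".format(part), "w")
        out.write(line)
    if out:
        out.close()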
Here I put the code I optimized for memory consumption, but keep in mind that it is NOT optimized in any significant way as far as thread management goes; one suggestion would be to always reuse the same threads and use multiple queues to pass the data to those threads.
Here is that code (and yes, it is way more complex than the one above XD, and maybe more than it needs to be):
from threading import Thread
import time
import os
import queue

MAX_Q_SIZE = 100000
m_queue = queue.Queue(maxsize=MAX_Q_SIZE)
end_thread = object()
written_lines = 0  # counter incremented by the writer thread

def doWork(data):
    # do your work here, before
    # checking if the queue is full,
    # otherwise when you finish the
    # queue might be full again
    while m_queue.full():
        time.sleep(0.1)
    m_queue.put(data)

def writer():
    # check if the file exists or create it
    try:
        out = open("out.txt", "r")
        out.close()
    except FileNotFoundError:
        out = open("out.txt", "w")
        out.close()
    out = open("out.txt", "w")
    _end = False
    while True:
        if m_queue.qsize() == 0:
            if _end:
                break
            continue
        try:
            item = m_queue.get()
            if item is end_thread:
                out.close()
                _end = True
                break
            global written_lines
            written_lines += 1
            out.write(item)
        except:
            break

class Spawner:
    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.current_threads = [None] * max_threads
        self.active_threads = 0
        self.writer = Thread(target=writer)
        self.writer.start()

    def sendWork(self, data):
        m_thread = Thread(target=doWork, args=(data, ))
        replace_at = -1
        if self.active_threads >= self.max_threads:
            # wait for at least 1 thread to finish
            while True:
                for index in range(self.max_threads):
                    if self.current_threads[index].is_alive():
                        pass
                    else:
                        self.current_threads[index] = None
                        self.active_threads -= 1
                        replace_at = index
                        break
                if replace_at != -1:
                    break
                # else: no threads have finished, keep waiting
        if replace_at == -1:
            # only if len(current_threads) < max_threads
            for i in range(len(self.current_threads)):
                if self.current_threads[i] is None:
                    replace_at = i
                    break
        self.current_threads[replace_at] = m_thread
        self.active_threads += 1
        m_thread.start()

    def waitEnd(self):
        for t in self.current_threads:
            if t is not None and t.is_alive():
                t.join()
            self.active_threads -= 1
        while True:
            if m_queue.qsize() == MAX_Q_SIZE:
                time.sleep(0.1)
                continue
            m_queue.put(end_thread)
            break
        if self.writer.is_alive():
            self.writer.join()

start_time = time.time()

spawner = Spawner(50)
with open("input.txt", "r") as infile:
    for line in infile:
        spawner.sendWork(line)
spawner.waitEnd()

print("--- %s seconds ---" % (time.time() - start_time))
You can remove the time prints; I left them just for reference of how I computed the program's run time. Below are the task manager screenshots of the two programs running.
Memory optimized version: (screenshot)
Original version: (screenshot; I forgot to expand the terminal process when taking it, but the memory usage of the terminal's subprocesses is negligible compared to the program's, and the 1.3 GB of RAM figure is accurate)
Let us consider the following code, where I calculate the factorial of 4 really large numbers, saving each output to a separate .txt file (out_mp_{idx}.txt). I use multiprocessing (4 processes) to reduce the computation time. Though this works fine, I want to output all 4 results in one file.
One way is to open each of the generated (4) files (from the code below) and append to a new file, but that's not my choice (below is just a simplistic version of my code; I have too many files to handle, which defeats the purpose of the time saved via multiprocessing). Is there a better way to automate this so that the results from the processes are all dumped/appended to some file? Also, in my case the result returned from each process could be several lines, so how would we avoid an open-file conflict when one process is appending its result to the output file while a second process finishes and wants to open/access the same file?
As an alternative, I tried the Pool.imap route, but that's not as computationally efficient as the code below. Something like this SO post.
from multiprocessing import Process
import os
import time

tic = time.time()

def factorial(n, idx):  # function to calculate the factorial
    num = 1
    while n >= 1:
        num *= n
        n = n - 1
    with open(f'out_mp_{idx}.txt', 'w') as f0:  # saving output to a separate file
        f0.writelines(str(num))

def My_prog():
    jobs = []
    N = [10000, 20000, 40000, 50000]  # numbers for which factorial is desired
    n_procs = 4
    # executing multiple processes
    for i in range(n_procs):
        p = Process(target=factorial, args=(N[i], i))
        jobs.append(p)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    print(f'Exec. Time:{time.time()-tic} [s]')

if __name__ == '__main__':
    My_prog()
You can do this:
1) Create a Queue
   a) manager = Manager()
   b) data_queue = manager.Queue()
   c) put all data in this queue.
2) Create a thread and start it before the multiprocessing
   a) create a function which waits on data_queue.
Something like:

def fun():
    while True:
        data = data_queue.get()
        if isinstance(data, Sentinel):
            break
        # write to a file

3) Remember to send some Sentinel object after all the processes are done.
You can also make this thread a daemon thread and skip the sentinel part.
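Applied to the factorial example above, a hedged sketch might look like this (the string sentinel, the writer thread, and the combined output file name are all illustrative choices, not the only way to do it):

from multiprocessing import Process, Manager
from threading import Thread

SENTINEL = "STOP"  # illustrative sentinel value

def factorial(n, idx, data_queue):
    num = 1
    while n >= 1:
        num *= n
        n -= 1
    data_queue.put(f"{idx}: {num}\n")  # send the result instead of writing a separate file

def writer(data_queue):
    with open("out_mp_all.txt", "w") as f:  # single combined output file (name is illustrative)
        while True:
            item = data_queue.get()
            if item == SENTINEL:
                break
            f.write(item)

if __name__ == '__main__':
    manager = Manager()
    data_queue = manager.Queue()
    t = Thread(target=writer, args=(data_queue,))
    t.start()
    jobs = [Process(target=factorial, args=(n, i, data_queue))
            for i, n in enumerate([10000, 20000, 40000, 50000])]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    data_queue.put(SENTINEL)  # tell the writer thread it can stop
    t.join()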
I am trying to use multiprocessing in handling csv files in excess of 2GB. The problem is that the input is only being consumed in one process while the others seem to be idling.
The following recreates the problem I am encountering. Is it possible to use multiprocessing with an iterator? Consuming the full input into memory is not desired.
import csv
import multiprocessing
import time

def something(row):
    # print row[0]
    # pass
    return row

def main():
    start = time.time()
    i = open("input.csv")
    reader = csv.reader(i, delimiter='\t')
    print reader.next()

    p = multiprocessing.Pool(16)
    print "Starting processes"

    j = p.imap(something, reader, chunksize=10000)
    count = 1
    while j:
        print j.next()

    print time.time() - start

if __name__ == '__main__':
    main()
I think you are confusing "processes" with "processors".
Your program is definitely spawning multiple processes at the same time, as you can verify in the system or resource monitor while your program is running. How many processors or CPU cores are used depends mainly on the OS, and has a lot to do with how CPU-intensive the task you are delegating to each process is.
Make a little modification to your something function to introduce a sleep time that simulates the work being done in the function:
def something(row):
    time.sleep(.4)
    return row
Now, first run your function sequentially on each row in your file, and notice that each result comes one by one, every 400 ms.
def main():
    with open("input.csv") as i:
        reader = csv.reader(i)
        print (next(reader))

        # SEQUENTIALLY:
        for row in reader:
            result = something(row)
            print (result)
Now try with the pool of workers. Keep it at a low number, say 4 workers, and you will see that results still come every 400 ms, but in groups of 4 (or roughly the number of workers in the pool):
def main():
    with open("input.csv") as i:
        reader = csv.reader(i)
        print (next(reader))

        # IN PARALLEL
        print ("Starting processes")
        p = multiprocessing.Pool(4)
        results = p.imap(something, reader)
        for result in results:
            print(result)  # results arrive in batches of roughly 4 rows processed in parallel
While running in parallel, check the system monitor and look for how many "python" processes are being executed. Should be one plus the number of workers.
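If you prefer to confirm this from within the code rather than the system monitor, a small hedged sketch (an addition for illustration, not part of the timing discussion above) is to have something report the worker's process ID:

import os
import time

def something(row):
    time.sleep(.4)
    # each distinct PID printed here corresponds to one worker process in the pool
    print("handled by PID {}".format(os.getpid()))
    return row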
I hope this explanation is useful.
System Monitor during the process (screenshot).
I am a novice when it comes to programming. I've worked through the book Practical Computing for Biologists and am playing around with some slightly more advanced concepts.
I've written a Python (2.7) script which reads in a .fasta file and calculates GC-content. The code is provided below.
The file I'm working with is cumbersome (~3.9 GB), and I was wondering if there's a way to take advantage of multiple processors, or whether it would be worthwhile. I have a four-core (hyperthreaded) Intel i7 2600K processor.
I ran the code and looked at system resources (picture attached) to see what the load on my CPU is. Is this process CPU limited? Is it IO limited? These concepts are pretty new to me. I played around with the multiprocessing module and Pool(), to no avail (probably because my function returns a tuple).
Here's the code:
def GC_calc(InFile):
    Iteration = 0
    GC = 0
    Total = 0
    for Line in InFile:
        if Line[0] != ">":
            GC = GC + Line.count('G') + Line.count('C')
            Total = Total + len(Line)
        Iteration = Iteration + 1
        print Iteration
    GCC = 100 * GC / Total
    return (GC, Total, GCC)

InFileName = "WS_Genome_v1.fasta"
InFile = open(InFileName, 'r')
results = GC_calc(InFile)
print results
Currently, the major bottleneck of your code is print Iteration. Printing to stdout is really, really slow. I would expect a major performance boost if you remove this line, or at least, if you absolutely need it, move it to another thread. However, thread management is an advanced topic and I would advise against going into it right now.
Another possible bottleneck is the fact that you read data from a file. File IO can be slow, especially if you have a single HDD on your machine. With a single HDD you won't need to use multiprocessing at all, because you won't be able to feed the processor cores enough data. Performance-oriented RAIDs and SSDs can help here.
The final suggestion is to try grep and similar text-processing programs instead of Python. They have been through decades of optimization and have a good chance of being much faster. There are a bunch of questions on SO where grep outperforms Python. Or at least you can filter out the FASTA headers before you pass the data to the script:

$ grep -v "^[>]" WS_Genome_v1.fasta | python gc_calc.py

(Took grep "^[>]" from here.) In this case you shouldn't open files in the script, but rather read lines from sys.stdin, almost like you do now (a minimal sketch follows).
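A minimal sketch of that stdin-reading variant (assuming the FASTA headers were already filtered out by grep, so every incoming line is sequence):

import sys

GC = 0
Total = 0
for Line in sys.stdin:  # lines piped in by grep, headers already removed
    GC += Line.count('G') + Line.count('C')
    Total += len(Line.rstrip('\n'))

GCC = 100.0 * GC / Total
print (GC, Total, GCC)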
Basically you're counting the number of C and G in every line and you're calculating the length of the line.
Only at the end you calculate a total.
Such a process is easy to do in parallel, because the calculation for each line is independent of the others.
Assuming the calculations are done in CPython (the one from python.org), threading won't improve performance much because of the GIL.
These calculations could be done in parallel with multiprocessing.Pool.
Processes don't share data like threads do. And we don't want to send parts of a 3.9 GB file to each worker process!
So you want each worker process to open the file by itself. The operating system's cache should take care that pages from the same file aren't loaded into memory multiple times.
If you have N cores, I would create the worker function so as to process every N-th line, with an offset.
import os

def worker(arguments):
    n = os.cpu_count() + 1
    infile, offset = arguments
    with open(infile) as f:
        cg = 0
        totlen = 0
        count = 1
        for line in f:
            if (count % n) - offset == 0:
                if not line.startswith('>'):
                    cg += line.count('C') + line.count('G')
                    totlen += len(line)
            count += 1
    return (cg, totlen)
You could run the pool like this:

import multiprocessing as mp
from os import cpu_count

pool = mp.Pool()
results = pool.map(worker, [('infile', n) for n in range(1, cpu_count()+1)])
By default, a Pool creates as many workers as the CPU has cores.
The results would be a list of (cg, len) tuples, which you can easily sum.
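For example, summing them could look like this (a tiny sketch, with results being the list returned by pool.map above):

total_cg = sum(cg for cg, _ in results)
total_len = sum(length for _, length in results)
gc_percent = 100.0 * total_cg / total_len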
Edit: updated to fix modulo zero error.
I now have working code for the parallelization (special thanks to @Roland Smith). I just had to make two small modifications to the code, and there's a caveat with respect to how .fasta files are structured. The final (working) code is below:
###ONLY WORKS WHEN THERE ARE NO BREAKS IN SEQUENCE LINES###
def GC_calc(arguments):
    n = mp.cpu_count()
    InFile, offset = arguments
    with open(InFile) as f:
        GC = 0
        Total = 0
        count = 0
        for Line in f:
            if (count % n) - offset == 0:
                if Line[0] != ">":
                    Line = Line.strip('\n')
                    GC += Line.count('G') + Line.count('C')
                    Total += len(Line)
            count += 1
    return (GC, Total)

import time
import multiprocessing as mp

startTime = time.time()
pool = mp.Pool()
results = pool.map(GC_calc, [('WS_Genome_v2.fasta', n) for n in range(1, mp.cpu_count()+1)])
endTime = time.time()
workTime = endTime - startTime

# Takes the tuples, parses them out, adds them
GC_List = []
Tot_List = []
# x = GC count, y = total count: results = [(x,y), (x,y), (x,y),...(x,y)]
for x, y in results:
    GC_List.append(x)
    Tot_List.append(y)
GC_Final = sum(GC_List)
Tot_Final = sum(Tot_List)
GCC = 100*float(GC_Final)/float(Tot_Final)

print results
print
print "Number GC = ", GC_Final
print "Total bp = ", Tot_Final
print "GC Content = %.3f%%" % (GCC)
print

endTime = time.time()
workTime = endTime - startTime
print "The job took %.5f seconds to complete" % (workTime)
The caveat is that the .fasta file cannot have line breaks within the sequences themselves. My original code didn't have an issue with that, but this code does not work properly when a sequence is broken across multiple lines. That was simple enough to fix via the command line (a Python alternative is sketched below).
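For reference, a Python alternative to the command-line fix might look like this (a minimal sketch; the file names are illustrative, and it simply joins wrapped sequence lines so each record ends up on a single line):

# join wrapped sequence lines so each record is one header line plus one sequence line
with open("WS_Genome_v1.fasta") as inf, open("WS_Genome_v2.fasta", "w") as outf:
    seq_parts = []
    for line in inf:
        if line.startswith(">"):
            if seq_parts:
                outf.write("".join(seq_parts) + "\n")
                seq_parts = []
            outf.write(line)
        else:
            seq_parts.append(line.strip())
    if seq_parts:
        outf.write("".join(seq_parts) + "\n")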
I also had to modify the code in two spots:
n = mp.cpu_count()
and
count = 0
Originally, count was set to 1 and n was set to mp.cpu_count()+1. This resulted in inaccurate counts, even after the file correction. The downside is that the original settings also let all 8 cores (well, threads) work, whereas the new code only allows 4 to work at any given time.
But it DID speed up the process from about 23 seconds to about 13 seconds! So I'd say it was a success (except for the amount of time it took to correct the original .fasta file).
In my Python script I call a function many times with different input parameters. For this I use multiprocessing with pool.apply_async and call it many times inside nested for loops. However, I want to check how many output files have been created so far as an indicator of progress. It seems that once the processes are running I cannot call this external function at any point.
How can I interrupt and ask: "hello, let's check the number of files with FileCount()"?
Here is my code:
def FileCount(path):
    Path = os.getenv("HOME") + "/hbar_gshfs/" + path
    list_dir = []
    list_dir = os.listdir(Path)
    count = 0
    for file in list_dir:
        if file.endswith('.root'):
            count += 1
    return count

main()

if args.run:
    mkdir_p(args.outdir)
    pool = Pool(10)
    for n in reversed(qnumbers):
        for pos in positions:
            for temp in temperatures:
                for fmap in BFieldmap:
                    for fseek in fieldSeek:
                        for lH, uH in zip(lowerHysteresis, upperHysteresis):
                            if BFieldmap.index(fmap) != positions.index(pos):
                                continue
                            pool.apply_async(g4Run, args=(paramToOutputName(args.macrodir, temp, n, pos, fmap, fseek, lH, uH),))
    pool.close()
    pool.join()
Thanks a lot for any input on this.
Why not? Your FileCount function returns the number of files created, though not necessarily completely finished and filled with data (assuming g4Run creates one file at the beginning and then goes on to fill it with data). But it should be enough for a simple progress indicator.
Otherwise you can use multiprocessing.Value. Pass it to every g4Run as an additional argument, and have g4Run increase it by one after it has finished its work.
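A hedged sketch of that counter idea with a Pool (g4Run here is a placeholder stub, and the Value is shared through the pool initializer because a synchronized Value generally cannot be passed directly as an apply_async argument):

from multiprocessing import Pool, Value
import time

def g4Run(macro_name):
    # placeholder standing in for the real g4Run from the question
    time.sleep(1)

counter = None

def init_counter(shared_counter):
    # runs once in each worker process; stores the shared Value globally
    global counter
    counter = shared_counter

def g4Run_with_progress(macro_name):
    g4Run(macro_name)              # the original work
    with counter.get_lock():       # then bump the shared progress counter
        counter.value += 1

if __name__ == '__main__':
    progress = Value('i', 0)
    pool = Pool(4, initializer=init_counter, initargs=(progress,))
    for i in range(8):
        pool.apply_async(g4Run_with_progress, args=("job{}".format(i),))
    pool.close()
    while progress.value < 8:      # poll the counter as a progress indicator
        print("completed: {}".format(progress.value))
        time.sleep(0.5)
    pool.join()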
To call your FileCount at fixed intervals, you can make use of the Pool.map_async call. For that, transform your nested loops into a generator:
def generate_params():
    for n in reversed(qnumbers):
        for pos in positions:
            for temp in temperatures:
                for fmap in BFieldmap:
                    for fseek in fieldSeek:
                        for lH, uH in zip(lowerHysteresis, upperHysteresis):
                            if BFieldmap.index(fmap) != positions.index(pos):
                                continue
                            yield paramToOutputName(args.macrodir, temp, n, pos, fmap, fseek, lH, uH)

async_result = pool.map_async(g4Run, generate_params())
pool.close()
while not async_result.ready():
    async_result.wait(1)  # Wait 1 second
    # Call FileCount here and output progress
pool.join()
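For instance, filling in that comment might look like this (a small sketch reusing FileCount and async_result from above; whether args.outdir is the right argument for FileCount depends on your directory layout):

while not async_result.ready():
    async_result.wait(1)                   # wait 1 second between checks
    done = FileCount(args.outdir)          # count the .root files written so far
    print("progress: {} output files finished".format(done))
pool.join()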