I have about 4 input text files that I want to read and then write all of their contents into one separate output file.
I use two threads so it runs faster!
Here are my questions and the code in Python:
1- Does each thread have its own version of variables such as "lines" inside the function "writeInFile"?
2- Since I copied some parts of the code from Tutorialspoint, I don't understand what "while 1: pass" in the last line does. Can you explain? Link to the original code: http://www.tutorialspoint.com/python/python_multithreading.htm
3- Does it matter what delay I use for the threads?
4- If I have about 400 input text files and want to do some operations on them before writing all of them into a separate file, how many threads can I use?
5- Assuming I use 10 threads, is it better to have the inputs in different folders (10 folders with 40 input text files each) and have each thread handle one folder, OR to use what I have already done in the code below, where each thread reads one of the 400 input text files if it has not already been read by another thread?
import glob
import time
import timeit
import thread

processedFiles = []  # this list tracks which files in the folder have already been read by one thread so the other thread doesn't read them

# Function run by the threads
def writeInFile(threadName, delay):
    for file in glob.glob("*.txt"):
        if file not in processedFiles:
            processedFiles.append(file)
            f = open(file, "r")
            lines = f.readlines()
            f.close()
            time.sleep(delay)
            # open the file to write in
            f = open('myfile', 'a')
            f.write("%s \n" % lines)
            f.close()
            print "%s: %s" % (threadName, time.ctime(time.time()))

# Create two threads as follows
try:
    f = open('myfile', 'r+')
    f.truncate()
    start = timeit.default_timer()
    thread.start_new_thread(writeInFile, ("Thread-1", 0,))
    thread.start_new_thread(writeInFile, ("Thread-2", 0,))
    stop = timeit.default_timer()
    print stop - start
except:
    print "Error: unable to start thread"
while 1:
    pass
Yes. Each of the local variables is on the thread's stack and is not shared between threads.
This loop makes the parent thread wait for each of the child threads to finish before the program terminates. The construct you should actually use for this is join, not a busy-wait loop. See "What is the use of join() in Python threading?".
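For illustration, here is a minimal sketch of that join-based alternative, using the higher-level threading module instead of thread.start_new_thread (the worker body is just a placeholder):

import threading
import time

def writeInFile(threadName, delay):
    # placeholder for the real per-thread work
    time.sleep(delay)
    print("%s: %s" % (threadName, time.ctime(time.time())))

threads = []
for name in ("Thread-1", "Thread-2"):
    t = threading.Thread(target=writeInFile, args=(name, 0))
    t.start()
    threads.append(t)

# Instead of "while 1: pass", block here until both workers have finished.
for t in threads:
    t.join()

print("All threads finished")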
In practice, yes, especially if the threads are writing to a common set of files (e.g., both thread 1 and thread 2 will be reading/writing the same file). Depending on the hardware, the size of the files, and the amount of data you're trying to write, different delays may make your program feel more responsive to the user than not. The best bet is to start with a simple value and adjust it as you watch the program work in a real-world setting.
While you can technically use as many threads as you want, you generally won’t get any performance benefits over 1 thread per core per CPU.
Different folders won't matter much for only 400 files. If you're talking about 4,000,000 files, then it might matter for cases where you want to run ls on those directories. What will matter for performance is whether each thread is working on its own file or whether two or more threads might be operating on the same file.
General thought: while it is a more advanced architecture, you may want to try to learn/use celery for these types of tasks in a production environment http://www.celeryproject.org/.
Related
The program monitors a folder, Receive_Dir, and processes the files received in real time. After processing a file, the original file should be deleted to save disk space.
I am trying to use Python multiprocessing and Pool.
I want to check if there is any technical flaw in current approach.
One of the problems with the current code is that the program has to wait until all 20 files in the queue are processed before starting the next round, so it may be inefficient under certain conditions (e.g., with varying file sizes).
import glob
import gzip
import os
import os.path
from multiprocessing import Pool

Parse_OUT = "/opt/out/"
Receive_Dir = "/opt/receive/"

def parser(infile):
    out_dir = date_of(filename)
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
    fout = gzip.open(out_dir + '/' + filename + 'csv.gz', 'wb')
    with gzip.open(infile) as fin:
        for line in fin:
            data = line.split(',')
            fout.write(data)
    fout.close()
    os.remove(infile)

if __name__ == '__main__':
    pool = Pool(20)
    while True:
        targets = glob.glob(Receive_Dir)[:10]
        pool.map(parser, targets)
    pool.close()
I see several issues:
if not os.path.exists(out_dir): os.mkdir(out_dir): This is a race condition. If two workers try to create the same directory at the same time, one will raise an exception. Don't use the if condition; simply call os.makedirs(out_dir, exist_ok=True).
Don't assemble file paths with string addition. Simply do os.path.join(out_dir, filename + 'csv.gz'). This is cleaner and has fewer failure states.
Instead of spinning in your while True loop even if no new directories appear, you can use the inotify mechanism on Linux to monitor the directory for changes. That would only wake your process when there is actually something to do. Check out pyinotify: https://github.com/seb-m/pyinotify
Since you mentioned that you are dissatisfied with the batching: you can use pool.apply_async to start new operations as they become available. Your main loop doesn't do anything with the results, so you can just "fire and forget" (see the sketch after this list).
Incidentally, why are you starting a pool with 20 workers and then launching only 10 directory operations at a time?
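To illustrate the apply_async idea, here is a rough sketch only; it reuses the question's parser, Receive_Dir, and Pool(20), and the one-second polling pause is an arbitrary stand-in for a pyinotify-based trigger:

import glob
import os
import time
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(20)
    submitted = set()
    while True:
        for path in glob.glob(os.path.join(Receive_Dir, "*")):
            if path not in submitted:
                submitted.add(path)
                # fire and forget: each file is handed to a worker as soon as it appears,
                # without waiting for the rest of a batch
                pool.apply_async(parser, (path,))
        time.sleep(1)  # arbitrary pause; in a long-running service, prune 'submitted' for deleted files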
I have a big text file that needs to be processed. I first read all the text into a list and then use ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel() and get_relations().
I am on a Mac and my observations show that it doesn't really speed up the processing (a CPU with 8 cores, yet only 15% CPU is used). If there is a performance bottleneck in either is_channel or get_relations, then the multithreading won't help much. Is that the reason for no performance gain? Should I try multiprocessing instead of multithreading to speed this up?
import itertools
from concurrent.futures import ThreadPoolExecutor

def process_file(file_name):
    all_lines = []
    with open(file_name, 'r', encoding='utf8') as f:
        for index, line in enumerate(f):
            line = line.strip()
            all_lines.append(line)
    # Classify text
    all_results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
            all_results.append(result)
    for index, entities_relations_list in enumerate(all_results):
        pass  # print out results

def process_text(text, channel):
    global channel_text
    global non_channel_text
    is_right_channel = is_channel(text, channel)
    entities = ()
    relations = None
    entities_relations_list = set()
    entities_relations_list.add((entities, relations))
    if is_right_channel:
        channel_text += 1
        entities_relations_list = get_relations(text, channel)
        return (text, entities_relations_list, is_right_channel)
    non_channel_text += 1
    return (text, entities_relations_list, is_right_channel)
The first thing that should be done is finding out how much time it takes to:
Read the file in memory (T1)
Do all processing (T2)
Printing result (T3)
The third point (printing), if you are really doing it, can slow things down. It's fine as long as you are not printing to the terminal and are just piping the output to a file or something else.
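A rough way to measure T1 and T2 (reusing the names from the question; T3 would be timed the same way around the printing loop):

import time

t0 = time.perf_counter()
with open(file_name, 'r', encoding='utf8') as f:
    all_lines = [line.strip() for line in f]
t1 = time.perf_counter()

all_results = [process_text(line, channel) for line in all_lines]
t2 = time.perf_counter()

print("T1 (read)    :", t1 - t0)
print("T2 (process) :", t2 - t1)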
Based on timings, we'll get to know:
T1 >> T2 => IO bound
T2 >> T1 => CPU bound
T1 and T2 are close => Neither.
by x >> y I mean x is significantly greater than y.
Based on above and the file size, you can try a few approaches:
Threading based
Even this can be done in two ways; which one works faster can again be found by benchmarking/looking at the timings.
Approach-1 (T1 >> T2 or even when T1 and T2 are similar)
Run the code that reads the file in its own thread and let it push the lines to a queue instead of a list.
This thread inserts a None at the end when it is done reading from the file. This is important to tell the workers that they can stop.
Now run the processing workers and pass them the queue.
The workers keep reading from the queue in a loop and processing the lines. Similar to the reader thread, these workers put their results in a queue.
Once a worker encounters a None, it stops its loop and re-inserts the None into the queue (so that the other workers can stop themselves).
The printing part can again be done in a thread.
The above is an example of a single producer and multiple consumer threads; a sketch follows.
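A minimal sketch of this producer/consumer layout, reusing process_text, file_name and channel from the question (the printing consumer is left out for brevity):

import queue
import threading

line_q = queue.Queue(maxsize=1000)
result_q = queue.Queue()
N_WORKERS = 10

def reader(file_name):
    # single producer: push stripped lines, then a single None sentinel
    with open(file_name, 'r', encoding='utf8') as f:
        for line in f:
            line_q.put(line.strip())
    line_q.put(None)

def worker(channel):
    while True:
        text = line_q.get()
        if text is None:
            line_q.put(None)  # re-insert so the other workers can stop too
            break
        result_q.put(process_text(text, channel))

threads = [threading.Thread(target=reader, args=(file_name,))]
threads += [threading.Thread(target=worker, args=(channel,)) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()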
Approach-2 (This is just another way of doing what the code snippet in the question is already doing.)
Read the entire file into a list.
Divide the list into index ranges based on the number of threads.
Example: if the file has 100 lines in total and we use 10 threads,
then 0-9, 10-19, ..., 90-99 are the index ranges.
Pass the complete list and these index ranges to the threads, each processing its own range. Since you are not modifying the original list, this works.
This approach can give better results than running a worker for each individual line; a sketch follows.
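A sketch of this index-range variant, again assuming all_lines, channel and process_text from the question; each thread writes only into its own slice of a preallocated results list, so no locking is needed:

import threading

def worker(start, end):
    for i in range(start, end):
        results[i] = process_text(all_lines[i], channel)

n_threads = 10
chunk = (len(all_lines) + n_threads - 1) // n_threads
results = [None] * len(all_lines)

threads = []
for t in range(n_threads):
    lo, hi = t * chunk, min((t + 1) * chunk, len(all_lines))
    th = threading.Thread(target=worker, args=(lo, hi))
    th.start()
    threads.append(th)
for th in threads:
    th.join()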
Multiprocessing based
(CPU bound)
Split the file into multiple files before processing.
Run a new process for each file.
Each process gets the path of the file it should read and process.
This requires the additional step of combining all results/files at the end.
The process creation can be done from within Python using the multiprocessing module,
or from a driver script that spawns a Python process for each file, like a shell script (a sketch of the multiprocessing variant follows this list).
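A sketch of the multiprocessing variant, assuming the input has already been split into hypothetical part_*.txt chunk files and that process_text and channel are available at module level:

import glob
from multiprocessing import Pool

def process_chunk(chunk_path):
    # each process reads one pre-split chunk and writes its own output file
    with open(chunk_path, 'r', encoding='utf8') as f:
        results = [process_text(line.strip(), channel) for line in f]
    with open(chunk_path + '.out', 'w', encoding='utf8') as out:
        for r in results:
            out.write(repr(r) + '\n')

if __name__ == '__main__':
    with Pool() as pool:
        pool.map(process_chunk, sorted(glob.glob('part_*.txt')))
    # combining the *.out files afterwards is the extra step mentioned above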
Just by looking at the code, it seems to be CPU bound. Hence, I would prefer multiprocessing here. I have used both approaches in practice:
Multiprocessing: when processing huge text files (GBs) stored on disk (like what you are doing).
Threading (Approach-1): when reading from multiple databases, as that is more IO bound than CPU bound (I used multiple producer and multiple consumer threads).
I have implemented the following code:
lines = []
with open('path_to_file', 'r+') as source:
    for line in source:
        line = line.replace('\n', '').strip()
        if line.split()[-1] != 'sent':
            # do some operation on line without 'sent' tag
            upload(data1.zip)
            upload(data2.zip)
            do_operation(line)
            # tag the line
            line += '\tsent'
        line += '\n'
        # temporary save lines in a list
        lines.append(line)
    # move position to start of the file
    source.seek(0)
    # write back lines to the file
    source.writelines(lines)
I am calling the upload methods in the section "# do some operation on line without 'sent' tag" to upload data to the cloud. As the data is a bit large (around 1 GB), it takes a while for the upload to finish. In the meantime, does the for loop go ahead and call upload(data2.zip)? I am getting errors because I cannot upload simultaneously.
If yes, how can I avoid this?
EDIT:
I have changed the upload function to return the status "done" after uploading. So how can I modify my main loop so that it waits after calling upload(data1.zip) and only then moves on to upload(data2.zip)? I want to synchronize them.
I think your problem might be that you don't want to try to upload more than one file at a time.
Your code doesn't try to do any parallel uploads. So I suspect that your upload() function is starting an upload process and then letting it run in the background while it returns to you.
If this is true, you can try some of these options:
Pass an option to the upload function that tells it to wait until the upload finishes before returning.
Discover (research) some attribute that you can use to synchronize your program with the process started by the upload function. For example, if the function returns the child process id, you could wait on that pid to complete. Or perhaps it writes the pid out to a pidfile - you could read in the number and wait for it.
If you can't make the upload function do what you want synchronously, you might consider replacing calls to upload() with print statements to have your code generate some kind of script that could be executed separately, possibly with a different environment or using a different upload utility.
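If the modified upload() really does block until the transfer is complete and then returns "done" (as described in the edit above), the loop only needs to check that status before starting the next upload. A sketch, assuming that return-value contract:

def upload_and_wait(payload):
    # assumes upload() does not return until the transfer has finished
    status = upload(payload)
    if status != 'done':
        raise RuntimeError('upload failed with status %r' % (status,))

upload_and_wait(data1.zip)  # the next line is not reached until this upload is finished
upload_and_wait(data2.zip)
do_operation(line)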
You can use multiprocessing to do the time-consuming work.
import multiprocessing

# create a process for each file; each file gets its own process
processes = [multiprocessing.Process(target=upload, args=(zip_file,))
             for zip_file in [data1.zip, data2.zip]]

# start the processes
for p in processes:
    p.start()

# wait for all processes to finish their work
for p in processes:
    p.join()

# execution will not get here until all files have finished uploading
...
...
You can send them off as independent processes. Use the Python multiprocessing module; there are nice tutorials, too.
Your inner loop might look something like this:
up1 = Process(target=upload, args=(data1.zip,))
up2 = Process(target=upload, args=(data2.zip,))
up1.start()
up2.start()
# Now, do other stuff while these run
do_operation(line)
# tag the line
line += '\tsent'
# Wait for the uploads to finish -- in case they're slower than do_operation.
up1.join()
up2.join()
@Prune yes, it's me who is confused.. I want to synchronize.
Excellent; we have that cleared up. The things you synchronize are separate processes. You have your main process waiting for the result of your child process, the upload. Multiple processes is called ... :-)
Are we at a solution point now? I think the pieces you need are in one (or at most two) of these answers.
TL;DR: Getting different results after running code with threading, with multiprocessing, and single-threaded. Need guidance on troubleshooting.
Hello, I apologize in advance if this may be a bit too generic, but I need a bit of help troubleshooting an issue and I am not sure how best to proceed.
Here is the story: I have a bunch of data indexed into a Solr collection (~250M items); all items in that collection have a sessionid, and some items can share the same session id. I am combing through the collection to extract all items that have the same session, massaging the data a bit and spitting out another JSON file for indexing later.
The code has two main functions:
proc_day - accepts a day and processes all the sessions for that day
and
proc_session - does everything that needs to happen for a single session.
Multiprocessing is implemented on proc_day, so each day is processed by a separate process; the proc_session function can be run with threads. Below is the code I am using for threading/multiprocessing. It accepts a function, a list of arguments, and a number of threads/processes. It then creates a queue from the input args, creates the processes/threads, and lets them work through it. I am not posting the actual processing code, since it generally runs fine single-threaded without any issues, but I can post it if needed.
autoprocs.py
import sys
import logging
from multiprocessing import Process, Queue, JoinableQueue
import time
import multiprocessing
import os

def proc_proc(func, data, threads, delay=10):
    if threads < 0:
        return
    q = JoinableQueue()
    procs = []
    for i in range(threads):
        thread = Process(target=proc_exec, args=(func, q))
        thread.daemon = True
        thread.start()
        procs.append(thread)
    for item in data:
        q.put(item)
    logging.debug(str(os.getpid()) + ' *** Processes started and data loaded into queue waiting')
    s = q.qsize()
    while s > 0:
        logging.info(str(os.getpid()) + " - Proc Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(delay)
    for p in procs:
        logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
        p.join(1)
    logging.debug(str(os.getpid()) + ' - *** Main Proc waiting')
    q.join()
    logging.debug(str(os.getpid()) + ' - *** Done')

def proc_exec(func, q):
    p = multiprocessing.current_process()
    logging.debug(str(os.getpid()) + ' - Starting:{},{}'.format(p.name, p.pid))
    while True:
        d = q.get()
        try:
            logging.debug(str(os.getpid()) + " - Starting to Process {}".format(d))
            func(d)
            sys.stdout.flush()
            logging.debug(str(os.getpid()) + " - Marking Task as Done")
            q.task_done()
        except:
            logging.error(str(os.getpid()) + " - Exception in subprocess execution")
            logging.error(sys.exc_info()[0])
    logging.debug(str(os.getpid()) + 'Ending:{},{}'.format(p.name, p.pid))
autothreads.py:
import threading
import logging
import time
from queue import Queue

def thread_proc(func, data, threads):
    if threads < 0:
        return "Thread count not specified"
    q = Queue()
    for i in range(threads):
        thread = threading.Thread(target=thread_exec, args=(func, q))
        thread.daemon = True
        thread.start()
    for item in data:
        q.put(item)
    logging.debug('*** Main thread waiting')
    s = q.qsize()
    while s > 0:
        logging.debug("Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(1)
    logging.debug('*** Main thread waiting')
    q.join()
    logging.debug('*** Done')

def thread_exec(func, q):
    while True:
        d = q.get()
        # logging.debug("Working...")
        try:
            func(d)
        except:
            pass
        q.task_done()
I am running into problems validating the data after the program runs under different multiprocessing/threading configurations. There is a lot of data, so I really need to get multiprocessing working. Here are the results of my test yesterday.
Only with multiprocessing - 10 procs:
Days Processed 30
Sessions Found 3,507,475
Sessions Processed 3,514,496
Files 162,140
Data Output: 1.9G
multiprocessing and multithreading - 10 procs 10 threads
Days Processed 30
Sessions Found 3,356,362
Sessions Processed 3,272,402
Files 424,005
Data Output: 2.2GB
just threading - 10 threads
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 733,664
Data Output: 3.3GB
Single process/ no threading
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 162,190
Data Output: 1.9GB
These counts were gathered by grepping and counting entries in the log files (1 per main process). The first thing that jumps out is that the days processed don't match. However, I manually checked the log files and it looks like a log entry was missing; there are follow-on log entries indicating that the day was actually processed. I have no idea why it was omitted.
I really don't want to write more code just to validate this code, which seems like a terrible waste of time. Is there any alternative?
I gave some general hints in the comments above. I think there are multiple problems with your approach, at very different levels of abstraction. You are also not showing all of the relevant code.
The issue might very well be
in the method you are using to read from Solr or in preparing the read data before feeding it to your workers.
in the architecture you have come up with for distributing the work among multiple processes.
in your logging infrastructure (as you have pointed out yourself).
in your analysis approach.
You have to go through all of these points, and given the complexity of the issue, surely nobody here will be able to identify the exact problems for you.
Regarding points (3) and (4):
If you are not sure about the completeness of your log files, you should perform the analysis based on the payload output of your processing engine. What I am trying to say: the log files probably are just a side product of your data processing. The primary product is the thing you should analyze. Of course it is also important to get your logs right. But these two problems should be treated independently.
My contribution regarding point (2) in the list above:
What is especially suspicious about your multiprocessing-based solution is the way you wait for the workers to finish. You seem not to be sure which method you should use to wait for your workers, so you apply three different methods:
First, you are monitoring the size of the queue in a while loop and wait for it to become 0. This is a non-canonical approach, which might actually work.
Secondly, you join() your processes in a weird way:
for p in procs:
    logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
    p.join(1)
Why are you specifying a timeout of one second here without responding to whether the process actually terminated within that time frame? You should either really join a process, i.e. wait until it has terminated, or specify a timeout and, if that timeout expires before the process finishes, treat that situation specially. Your code does not distinguish these situations, so p.join(1) is effectively the same as writing time.sleep(1) instead.
Thirdly, you join the queue.
So, after making sure that q.qsize() returns 0 and after waiting for another second, do you really think joining the queue is important? Does it make any difference? One of these approaches should be enough; you need to think about which of these criteria is most important to your problem. That is, one of these conditions should deterministically imply the other two.
All this looks like a quick & dirty hack of a multiprocessing solution, whereas you yourself are not really sure how that solution should behave. One of the most important insights I have obtained while working on concurrency architectures: You, the architect, must be 100 % aware of how the communication and control flow works in your system. Not properly monitoring and controlling the state of your worker processes may very well be the source of the issues you are observing.
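For contrast, one canonical pattern (a sketch only, not a drop-in replacement for your proc_proc/proc_exec) is to use one sentinel per worker and a single blocking join, with no qsize polling and no queue join:

from multiprocessing import Process, Queue

def proc_proc(func, data, workers):
    q = Queue()
    procs = [Process(target=proc_exec, args=(func, q)) for _ in range(workers)]
    for p in procs:
        p.start()
    for item in data:
        q.put(item)
    for _ in range(workers):
        q.put(None)      # one sentinel per worker: "no more work"
    for p in procs:
        p.join()         # the only wait: block until each worker has exited

def proc_exec(func, q):
    while True:
        item = q.get()
        if item is None:  # sentinel received; this worker is done
            break
        func(item)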
I figured it out. I followed Jan-Philip's advice and started examining the output data of the multiprocess/multithreaded runs. It turned out that an object that does all these things with the data from Solr was shared among threads. I did not have any locking mechanisms, so in some cases it contained mixed data from multiple sessions, which caused inconsistent output. I validated this by instantiating a new object for every thread, and the counts matched up. It is a bit slower, but still workable.
Thanks
I have a command line program I'm running and I pipe in text as arguments:
somecommand.exe < someparameters_tin.txt
It runs for a while (typically a good fraction of an hour to several hours) and then writes results in a number of text files. I'm trying to write a script to launch several of these simultaneously, using all the cores on a many core machine. On other OSs I'd fork, but that's not implemented in many scripting languages for Windows. Python's multiprocessing looks like it might do the trick so I thought I'd give it a try, although I don't know python at all. I'm hoping someone can tell me what I'm doing wrong.
I wrote a script (below) which I point at a directory; it finds the executable and input files, and launches them using pool.map with a pool of n and a function using call. What I see is that initially (with the first set of n processes launched) it seems fine, using n cores at 100%. But then I see the processes go idle, using none or only a few percent of their CPUs. There are always n processes there, but they aren't doing much. It appears to happen when they go to write the output data files, and once it starts everything bogs down, and overall core utilization ranges from a few percent to occasional peaks of 50-60%, but never gets near 100%.
If I can attach it (edit: I can't, at least for now) here's a plot of run times for the processes. The lower curve was when I opened n command prompts and manually kept n processes going at a time, easily keeping the computer near 100%. (The line is regular, slowly increasing from near 0 to 0.7 hours across 32 different processes varying a parameter.) The upper line is the result of some version of this script -- the runs times are inflated by about 0.2 hours on average and are much less predictable, like I'd taken the bottom line and added 0.2 + a random number.
Here's a link to the plot:
Run time plot
Edit: and now I think I can add the plot.
What am I doing wrong?
from multiprocessing import Pool, cpu_count, Lock
from subprocess import call
import glob, time, os, shlex, sys
import random

def launchCmd(s):
    mypid = os.getpid()
    try:
        retcode = call(s, shell=True)
        if retcode < 0:
            print >>sys.stderr, "Child was terminated by signal", -retcode
        else:
            print >>sys.stderr, "Child returned", retcode
    except OSError, e:
        print >>sys.stderr, "Execution failed:", e

if __name__ == '__main__':
    # ******************************************************************
    # change this to the path you have the executable and input files in
    mypath = 'E:\\foo\\test\\'
    # ******************************************************************
    startpath = os.getcwd()
    os.chdir(mypath)
    # find list of input files
    flist = glob.glob('*_tin.txt')
    elist = glob.glob('*.exe')
    # this will not act as expected if there's more than one .exe file in that directory!
    ex = elist[0] + ' < '
    print
    print 'START'
    print 'Path: ', mypath
    print 'Using the executable: ', ex
    nin = len(flist)
    print 'Found ', nin, ' input files.'
    print '-----'
    clist = [ex + s for s in flist]
    cores = cpu_count()
    print 'CPU count ', cores
    print '-----'
    # ******************************************************
    # change this to the number of processes you want to run
    nproc = cores - 1
    # ******************************************************
    pool = Pool(processes=nproc, maxtasksperchild=1)  # start nproc worker processes
    # mychunk = int(nin/nproc)  # this didn't help
    # list.reverse(clist)       # neither did this, or randomizing the list
    pool.map(launchCmd, clist)  # launch processes
    os.chdir(startpath)  # return to original working directory
    print 'Done'
Is there any chance that the processes are trying to write to a common file? Under Linux it would probably just work, clobbering data but not slowing down; but under Windows one process might get the file and all the other processes might hang waiting for the file to become available.
If you replace your actual task list with some silly tasks that use CPU but don't write to disk, does the problem reproduce? For example, you could have tasks that compute the md5sum of some large file; once the file is cached, the other tasks would be pure CPU with a single line of output to stdout. Or compute some expensive function or something.
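For example, a throwaway pure-CPU replacement for launchCmd might look like this (bigfile.bin is a hypothetical large file of your own; nothing is written to disk):

import hashlib
from multiprocessing import Pool

def cpu_task(path):
    # hash the same large file repeatedly: once cached, this is pure CPU work
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()

if __name__ == '__main__':
    bigfile = 'E:\\foo\\test\\bigfile.bin'  # hypothetical stand-in for your inputs
    pool = Pool(processes=7)
    print(pool.map(cpu_task, [bigfile] * 32))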
I think I know what this is. When you call map, it breaks the list of tasks into 'chunks' for each process. By default, it uses chunks large enough that it can send one to each process. This works on the assumption that all the tasks take about the same length of time to complete.
In your situation, presumably the tasks can take very different amounts of time to complete. So some workers finish before others, and those CPUs sit idle. If that's the case, then this should work as expected:
pool.map(launchCmd, clist, chunksize=1)
Less efficient, but it should mean that each worker gets more tasks as it finishes until they're all complete.