I would expect the following code to run the checks simultaneously: every filename from the os.walk iteration that happens to get a timeout of 0 should end up in the result list, and every thread with a longer timeout should go into daemon mode and be killed as soon as the script reaches its end. However, the script respects the full timeout of every thread.
Why is this happening? Shouldn't it put all the threads in the background and kill the ones that don't finish and return a result before the end of the script? Thank you.
import threading
import os
import time
import random
def check_file(file_name, timeout):
    time.sleep(timeout)
    print file_name
    result.append(file_name)

result = []

for home, dirs, files in os.walk("."):
    for ifile in files:
        filename = '/'.join([home, ifile])
        t = threading.Thread(target=check_file(filename, random.randint(0, 5)))
        t.setDaemon(True)
        t.start()

print result
Solution: I found my mistake:
t = threading.Thread(target=check_file(filename,random.randint(0,5)))
has to be
t = threading.Thread(target=check_file, args=(filename,random.randint(0,5)))
In this case, threading spawns a thread with the function as an object and passes it the arguments separately. In my initial example, the call check_file(...) had to be resolved BEFORE the thread was spawned, so it ran in the main thread. And this is fair.
However, the example above works for me on 2.7.3, but on 2.7.2 I could not get it working.
I was getting an exception saying that
function check_file accepts exactly 1 argument (34 given).
Solution:
On 2.7.2 I had to put a trailing comma in the args tuple, since I was passing only one variable. God knows why this does not affect 2.7.3. It was
t = threading.Thread(target=check_file, args=(filename))
and started to work with
t = threading.Thread(target=check_file, args=(filename,))
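(A quick illustration, not from the original post, of why the trailing comma matters: without it the parentheses are only grouping, so args is the bare string and Thread unpacks it into one positional argument per character.)
filename = "./example/path.txt"

print type((filename))   # <type 'str'>   -> unpacked into one argument per character
print type((filename,))  # <type 'tuple'> -> check_file receives it as a single argument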
I understand what you were trying to do, but you're not using the right format for threading. I fixed your example... look up the Queue class to see how to do this properly.
Secondly, never ever do string manipulation on file paths. Use the os.path module; there's a lot more to it than adding separators between strings, and plenty of cases you and I don't think about most of the time.
Good luck!
import threading
import os
import time
import random
import Queue
def check_file():
    while True:
        item = q.get()
        time.sleep(item[1])
        print item
        result.append(item[0])  # collect the filename, as in the original intent
        q.task_done()

q = Queue.Queue()
result = []

for home, dirs, files in os.walk("."):
    for ifile in files:
        filename = os.path.join(home, ifile)
        q.put((filename, random.randint(0, 5)))

number_of_threads = 25
for i in range(number_of_threads):
    t = threading.Thread(target=check_file)
    t.daemon = True
    t.start()

q.join()
print result
I have main_script.py, which imports scripts that get data from web pages. I want to do this with multithreading. I came up with this solution, but it does not work:
main_script:
import threading
import script1

temp_path = ''
thread1 = threading.Thread(target=script1.Main,
                           name='Script1',
                           args=(temp_path, ))
thread1.start()
thread1.join()
script1:
class Main:
    def __init__(self, temp_path):
        ...

    def some_func(self):
        ...

    def some_func2(self):
        ...

    def __main__(self):
        self.some_func()
        self.some_func2()
        return callback
Right now the only way I know to get the value of callback from script1 back into main_script is:
main_script:
import script1
temp_path = ''
# make instance of class with temp_path
inst_script1 = script1.Main(temp_path)
print("instance1:")
print(inst_script1.callback)
It works, but then the script instances run one by one, not concurrently.
Does anybody have an idea how to handle that? :)
First off, if you are using threading in Python, make sure you read: https://docs.python.org/2/glossary.html#term-global-interpreter-lock. Unless you are using C modules or doing a lot of I/O, you won't see the scripts run concurrently. Generally speaking, multiprocessing.pool is a better approach.
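(For reference, a minimal hedged sketch of that pool route, using the thread-backed Pool from multiprocessing.dummy; script1.Main and temp_path are the names from the question, so treat this as an illustration rather than a drop-in solution.)
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but backed by threads

import script1

temp_path = ''
with Pool(processes=2) as pool:
    # apply_async returns immediately; .get() blocks until script1.Main has finished
    async_result = pool.apply_async(script1.Main, (temp_path,))
    inst_script1 = async_result.get()
print(inst_script1.callback)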
If you are certain you want threads rather than processes, you can use a mutable variable to store the result, for example a dictionary that keeps track of the result of each thread.
import threading

result = {}

def test(val, name, target):
    target[name] = val * 4

temp_path = 'ASD'
thread1 = threading.Thread(target=test,
                           name='Script1',
                           args=(temp_path, 'A', result))
thread1.start()
thread1.join()
print(result)
Thanks for the response. Yes, I read about the GIL, but it hasn't caused me any problems yet. I actually solved my problem with a solution I found on another website. The code looks like this now:
Main_script:
import threading
import queue

import script1
import script2

queue_callbacks = queue.Queue()
threads_list = list()

temp_path1 = ''
thread1 = threading.Thread(target=lambda q, arg1: q.put(script1.Main(arg1)),
                           name='Script1',
                           args=(queue_callbacks, temp_path1, ))
thread1.start()
threads_list.append(thread1)

temp_path2 = ''
thread2 = threading.Thread(target=lambda q, arg1: q.put(script2.Main(arg1)),
                           name='Script2',
                           args=(queue_callbacks, temp_path2, ))
thread2.start()
threads_list.append(thread2)

for t in threads_list:
    t.join()

callbacks = []
while not queue_callbacks.empty():
    result = queue_callbacks.get()
    callbacks.append({"service": result.service, "callback": result.callback, "error": result.error})
And this works fine. Now I have another problem: I want this to work at a larger scale, where I have hundreds of scripts and handle them with, e.g., 5 threads.
In general, is there any limit to the number of threads running at any one time?
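(Not from the thread above: a hedged sketch of one way to cap the worker count, using concurrent.futures.ThreadPoolExecutor so at most max_workers scripts run at a time; the script modules and result fields mirror the question's code.)
from concurrent.futures import ThreadPoolExecutor

import script1
import script2

scripts = [(script1, ''), (script2, '')]  # (module, temp_path) pairs; extend for hundreds of scripts

callbacks = []
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(mod.Main, path) for mod, path in scripts]
    for future in futures:
        result = future.result()  # blocks until that script finishes
        callbacks.append({"service": result.service,
                          "callback": result.callback,
                          "error": result.error})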
I am using the following code to complete a task using multithreading with Queue and JoinableQueue. Sometimes the script executes perfectly; other times it stalls at the end of the task without ending the worker and will not continue on to the next portion of the script. I am new to working with Queue and JoinableQueue and I need to find out why this stalling happens.
Before this part of the code I run another Queue/JoinableQueue worker to download some data, and it works perfectly fine every time. Do I need to close() anything from the first Queue/JoinableQueue? Is there a way to check whether it stalls and, if so, continue on?
Here is my code:
import multiprocessing
from multiprocessing import Queue
from multiprocessing import JoinableQueue
from threading import Thread
def run_this_definition(hr):
    #do things here
    return()

def worker():
    while True:
        item = jq.get()
        run_this_definition(item)
        jq.task_done()
    return()

q = Queue()
jq = JoinableQueue()

number_of_threads = 8
for i in range(number_of_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

input_list = [0,1,2,3,4]
for item in input_list:
    jq.put(item)

jq.join()
print "finished"
The script never prints "finished" when it stalls, but it seems to finish all the tasks and then stalls at the end of run_this_definition on the very last item in the queue.
My guess is you are using multiprocessing.JoinableQueue()!? Use Queue.Queue() instead for threading; it has .join() and .task_done() methods as well. Furthermore, you should pass your queue as an argument to your threads. See the following example:
import threading
from threading import Thread
from Queue import Queue
def worker(jq):
    while True:
        item = jq.get()
        # Do whatever you have to do.
        print '{}: {}'.format(threading.currentThread().name, item)
        jq.task_done()
    return()

number_of_threads = 4
input_list = [1,2,3,4,5]

jq = Queue()
for i in range(number_of_threads):
    t = Thread(target=worker, args=(jq,))
    t.daemon = True
    t.start()

for item in input_list:
    jq.put(item)

jq.join()
print "finished"
The print output from multiple threads might look messy, but as an example it should be fine.
For the future: please provide a comprehensive example of your code. Neither your imports nor number_of_threads, run_this_definition, or input_list were defined in your example.
I have a Linux script that I'm looking to automate through subprocess. Each iteration of subprocess should run the Linux script in each subdirectory of a parent directory, and each of these subprocesses should run in a separate thread.
The way my directory is organized is as follows:
/parent/p1
/parent/p2....and so on till
/parent/p[n]
The first part of my code aims to run the process across all the subdirectories (p1, p2, p3, etc.). It works fine for a fast process. However, many of my jobs need to run in the background, for which I usually use nohup and run them manually on a separate node, so every node in my terminal runs the same job on a different directory (p1, p2, p3, etc.). The latter part of my code (using threading) aims to achieve this, but what ends up happening is that every node runs the same process (p1, p1, p1, etc.); basically my entire 'jobs' function is being passed through runSimThreads when I want the jobs separated out over the threads. Would someone know how I could further iterate the threading function to place different jobs on each node?
import os
import sys
import subprocess
import os.path
import threading
#takes the argument: python FOLDER_NAME #ofThreads
#Example: python /parent 8
directory = sys.argv[1] #in my case input is /parent
threads = int(sys.argv[2]) #input is 8
category_name = directory.split('/')[-1] #splits parent as a word
folder_list = next(os.walk(directory))[1] #makes a list of subdirectories [p1,p2,p3..]
def jobs(cmd):
    for i in folder_list:
        f = open("/vol01/bin/dir/nohup.out", "w")
        cmd = subprocess.call(['nohup','python','np.py','{0}/{1}' .format(directory,i)], cwd='/vol01/bin/dir', stdout=f)
    return cmd

def runSimThreads(numThreads):
    threads = []
    for i in range(numThreads):
        t = threading.Thread(target=jobs, args=(i,))
        threads.append(t)
        t.start()
    #Wait for all threads to complete
    main_thread = threading.currentThread()
    for t in threads:
        if t is main_thread:
            continue
        t.join()

runSimThreads(threads)
That can't be your code.
import os
import sys
import subprocess
import os.path
import threading
#takes the argument: python FOLDER_NAME #ofThreads
#Example: python /parent 8
threads = 8 #input is 8
...
...
for t in threads:
    print("hello")
--output:--
TypeError: 'int' object is not iterable
You are using the same variable names everywhere, and that is confusing you (or me?).
You also do this:
def jobs(cmd):
    for i in folder_list:
        f = open("/vol01/bin/dir/nohup.out", "w")
        cmd = "something"
You are overwriting your cmd parameter variable, which means that jobs() shouldn't have a parameter variable.
Response to comment1:
import threading as thr
import time
def greet():
    print("hello world")
t = thr.Thread(target=greet)
t.start()
t.join()
--output:--
hello world
import threading as thr
import time
def greet(greeting):
    print(greeting)
t = thr.Thread(target=greet, args=("Hello, Newman.",) )
t.start()
t.join()
--output:--
Hello, Newman.
Below is the equivalent of what you are doing:
import threading as thr
import time
def greet(greeting):
    greeting = "Hello, Jerry."
    print(greeting)
t = thr.Thread(target=greet, args=("Hello, Newman.",) )
t.start()
t.join()
--output:--
Hello, Jerry.
And anyone reading that code would ask, "Why are you passing an argument to the greet() function when you don't use it?"
I'm relatively new to python
Well, your code does this:
threads = 8
#Other irrelevant stuff here
for t in threads:
    print("hello")
and that will produce the error:
TypeError: 'int' object is not iterable
Do you know why?
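(Not part of the answer above, just a hedged sketch of one way to hand each thread its own share of the work: split folder_list across the threads instead of looping over the whole list in every thread. directory, folder_list, and the nohup.out path are taken from the question.)
import subprocess
import threading

def jobs(folders):
    # Each thread works through its own subset of subdirectories.
    for name in folders:
        # Note: all threads share this output file, as in the question's code.
        with open("/vol01/bin/dir/nohup.out", "w") as f:
            subprocess.call(['nohup', 'python', 'np.py',
                             '{0}/{1}'.format(directory, name)],
                            cwd='/vol01/bin/dir', stdout=f)

def run_sim_threads(num_threads):
    threads = []
    for n in range(num_threads):
        chunk = folder_list[n::num_threads]  # round-robin split of the subdirectories
        t = threading.Thread(target=jobs, args=(chunk,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

run_sim_threads(8)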
I have the problem that I need to write values generated by a consumer to disk. I do not want to open a new file handle for every write, so I thought I would use a second queue and another consumer to write to disk from a single greenlet. The problem with my code is that the second queue does not get consumed asynchronously alongside the first queue: the first queue finishes first, and only then does the second queue get consumed.
I want to write values to disk at the same time as the other values are being generated.
Thanks for the help!
#!/usr/bin/python
#- * -coding: utf-8 - * -
import gevent #pip install gevent
from gevent.queue import *
import gevent.monkey
from timeit import default_timer as timer
from time import sleep
import cPickle as pickle
gevent.monkey.patch_all()
def save_lineCount(count):
    with open("count.p", "wb") as f:
        pickle.dump(count, f)

def loader():
    for i in range(0,3):
        q.put(i)

def writer():
    while True:
        task = q_w.get()
        print "writing",task
        save_lineCount(task)

def worker():
    while not q.empty():
        task = q.get()
        if task%2:
            q_w.put(task)
            print "put",task
        sleep(10)

def asynchronous():
    threads = []
    threads.append(gevent.spawn(writer))
    for i in range(0, 1):
        threads.append(gevent.spawn(worker))

    start = timer()
    gevent.joinall(threads,raise_error=True)
    end = timer()
    #pbar.close()
    print "\n\nTime passed: " + str(end - start)[:6]

q = gevent.queue.Queue()
q_w = gevent.queue.Queue()

gevent.spawn(loader).join()
asynchronous()
In general, that approach should work fine. There are some problems with this specific code, though:
Calling time.sleep will cause all greenlets to block. You either need to call gevent.sleep or monkey-patch the process in order to have just one greenlet block (I see gevent.monkey imported, but patch_all is not called). I suspect that's the major problem here.
Writing to a file is also synchronous and causes all greenlets to block. You can use FileObjectThread if that's a major bottleneck.
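(A minimal sketch of the first point, not from the answer itself: replace time.sleep with gevent.sleep in the worker so only that greenlet yields; q and q_w are the queues from the question.)
import gevent

def worker():
    while not q.empty():
        task = q.get()
        if task % 2:
            q_w.put(task)
            print "put", task
        gevent.sleep(10)  # cooperative sleep: yields to the writer greenlet instead of blocking the process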
Below is my code; I'm really new to Python. The code below actually creates a large number of threads (above 1000), but at some point, at nearly 800 threads, I get an error message saying "error: cannot start new thread". I did read a bit about thread pools, but I couldn't really understand them. How can I implement a thread pool in my code? Or at least please explain it to me in a simple way.
#!/usr/bin/python
import threading
import urllib
lock = threading.Lock()
def get_wip_info(query_str):
    try:
        temp = urllib.urlopen(query_str).read()
    except:
        temp = 'ERROR'
    return temp

def makeURLcall(arg1, arg2, arg3, file_output, dowhat, result):
    url1 = "some URL call with args"
    url2 = "some URL call with args"
    if dowhat == "IN":
        result = get_wip_info(url1)
    elif dowhat == "OUT":
        result = get_wip_info(url2)
    lock.acquire()
    report = open(file_output, "a")
    report.writelines("%s - %s\n" % (serial, result))
    report.close()
    lock.release()
    return

testername = "arg1"
stationcode = "arg2"
dowhat = "OUT"
result = "PASS"
file_source = "sourcefile.txt"
file_output = "resultfile.txt"

readfile = open(file_source, "r")
Data = readfile.readlines()

threads = []
for SNs in Data:
    SNs = SNs.strip()
    print SNs
    thread = threading.Thread(target=makeURLcall, args=(SNs, args1, testername, file_output, dowhat, result))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
Don't implement your own thread pool, use the one that ships with Python.
On Python 3, you can use concurrent.futures.ThreadPoolExecutor to use threads explicitly, on Python 2.6 and higher, you can import Pool from multiprocessing.dummy which is similar to the multiprocessing API, but backed by threads instead of processes.
Of course, if you need to do CPU bound work in CPython (the reference interpreter), you'd want to use multiprocessing proper, not multiprocessing.dummy; Python threads are fine for I/O bound work, but the GIL makes them pretty bad for CPU bound work.
Here's code to replace your explicit use of Threads with multiprocessing.dummy's Pool, using a fixed number of workers that each complete tasks as fast as possible one after another, rather than having an infinite number of one job threads. First off, since the local I/O is likely to be fairly cheap, and you want to synchronize the output, we'll make the worker task return the resulting data rather than write it out itself, and have the main thread do the write to local disk (removing the need for locking, as well as the need for opening the file over and over). This changes makeURLcall to:
# Accept args as a single sequence to ease use of imap_unordered,
# and unpack on first line
def makeURLcall(args):
    arg1, arg2, arg3, dowhat, result = args
    url1 = "some URL call with args"
    url2 = "some URL call with args"
    if dowhat == "IN":
        result = get_wip_info(url1)
    elif dowhat == "OUT":
        result = get_wip_info(url2)
    return "%s - %s\n" % (serial, result)
And now for the code that replaces your explicit thread use:
import multiprocessing.dummy as mp
from contextlib import closing
# Open input and output files and create pool
# Odds are that 32 is enough workers to saturate the connection,
# but you can play around; somewhere between 16 and 128 is likely to be the
# sweet spot for network I/O
with open(file_source) as inf,\
     open(file_output, 'w') as outf,\
     closing(mp.Pool(32)) as pool:
    # Define generator that creates tuples of arguments to pass to makeURLcall
    # We also read the file in lazily instead of using readlines, to
    # start producing results faster
    tasks = ((SNs.strip(), args1, testername, dowhat, result) for SNs in inf)
    # Pulls and writes results from the workers as they become available
    outf.writelines(pool.imap_unordered(makeURLcall, tasks))

# Once we leave the with block, input and output files are closed, and
# pool workers are cleaned up