Confusion regarding the output of threads in python - python

I am currently working with python v.2.7 on windows 8.
My programme is using threads. I am providing a name to these threads during their creation. The first thread is named First-Thread and second one is named Second-Thread. The threads execute a method named as getData() that does the following:
makes the current thread to sleep for some time
calls the compareValues()
retrieve the information from the compareValues() and adds them to a
list called myList
The compareValues() does the following:
generates a random number
checks if it is less than 5 or if it is greater than or equal to 5
and yields the result along with the current thread's name
I save the results of these threads to a list named as myList and then finally print this myList.
Problem: Why I never see the Second-Thread in myList? I don't understand this behavior. Please try to execute this code to see the output in order to understand my problem.
Code:
import time
from random import randrange
import threading
myList = []
def getData(i):
print "Sleep for %d"%i
time.sleep(i)
data = compareValues()
for d in list(data):
myList.append(d)
def compareValues():
number = randrange(10)
if number >= 5:
yield "%s: Greater than or equal to 5: %d "%(t.name, number)
else:
yield "%s: Less than 5: %d "%(t.name, number)
threadList = []
wait = randrange(10)+1
t = threading.Thread(name = 'First-Thread', target = getData, args=(wait,))
threadList.append(t)
t.start()
wait = randrange(3)+1
t = threading.Thread(name = 'Second-Thread', target = getData, args=(wait,))
threadList.append(t)
t.start()
for t in threadList:
t.join()
print
print "The final list"
print myList
Sample output:
Sleep for 4Sleep for 1
The final list
['First-Thread: Greater than or equal to 5: 7 ', 'First-Thread: Greater than or equal to 5: 8 ']
Thank you for your time.

def compareValues():
number = randrange(10)
if number >= 5:
yield "%s: Greater than or equal to 5: %d "%(t.name, number)
else:
yield "%s: Less than 5: %d "%(t.name, number)
In the body of compareValues the code refers to t.name. By the time compareValues() gets called by the threads, t, which is looked up according to the LEGB rule and found in the global scope, references the first thread because the t.join() is waiting on the first thread. t.name thus has the value First-Thread.
To get the current thread name, use threading.current_thread().name:
def compareValues():
number = randrange(10)
name = threading.current_thread().name
if number >= 5:
yield "%s: Greater than or equal to 5: %d "%(name, number)
else:
yield "%s: Less than 5: %d "%(name, number)
Then you will get output like
Sleep for 4
Sleep for 2
The final list
['Second-Thread: Less than 5: 3 ', 'First-Thread: Greater than or equal to 5: 5 ']

Related

Python multiprocessing: how to create x number of processes and get return value back

I have a program that I created using threads, but then I learned that threads don't run concurrently in python and processes do. As a result, I am trying to rewrite the program using multiprocessing, but I am having a hard time doing so. I have tried following several examples that show how to create the processes and pools, but I don't think it's exactly what I want.
Below is my code with the attempts I have tried. The program tries to estimate the value of pi by randomly placing points on a graph that contains a circle. The program takes two command-line arguments: one is the number of threads/processes I want to create, and the other is the total number of points to try placing on the graph (N).
import math
import sys
from time import time
import concurrent.futures
import random
import multiprocessing as mp
def myThread(arg):
# Take care of imput argument
n = int(arg)
print("Thread received. n = ", n)
# main calculation loop
count = 0
for i in range (0, n):
x = random.uniform(0,1)
y = random.uniform(0,1)
d = math.sqrt(x * x + y * y)
if (d < 1):
count = count + 1
print("Thread found ", count, " points inside circle.")
return count;
# end myThread
# receive command line arguments
if (len(sys.argv) == 3):
N = sys.argv[1] # original ex: 0.01
N = int(N)
totalThreads = sys.argv[2]
totalThreads = int(totalThreads)
print("N = ", N)
print("totalThreads = ", totalThreads)
else:
print("Incorrect number of arguments!")
sys.exit(1)
if ((totalThreads == 1) or (totalThreads == 2) or (totalThreads == 4) or (totalThreads == 8)):
print()
else:
print("Invalid number of threads. Please use 1, 2, 4, or 8 threads.")
sys.exit(1)
# start experiment
t = int(time() * 1000) # begin run time
total = 0
# ATTEMPT 1
# processes = []
# for i in range(totalThreads):
# process = mp.Process(target=myThread, args=(N/totalThreads))
# processes.append(process)
# process.start()
# for process in processes:
# process.join()
# ATTEMPT 2
#pool = mp.Pool(mp.cpu_count())
#total = pool.map(myThread, [N/totalThreads])
# ATTEMPT 3
#for i in range(totalThreads):
#total = total + pool.map(myThread, [N/totalThreads])
# p = mp.Process(target=myThread, args=(N/totalThreads))
# p.start()
# ATTEMPT 4
# with concurrent.futures.ThreadPoolExecutor() as executor:
# for i in range(totalThreads):
# future = executor.submit(myThread, N/totalThreads) # start thread
# total = total + future.result() # get result
# analyze results
pi = 4 * total / N
print("pi estimate =", pi)
delta_time = int(time() * 1000) - t # calculate time required
print("Time =", delta_time, " milliseconds")
I thought that creating a loop from 0 to totalThreads that creates a process for each iteration would work. I also wanted to pass in N/totalThreads (to divide the work), but it seems that processes take in an iterable list rather than an argument to pass to the method.
What is it I am missing with multiprocessing? Is it at all possible to even do what I want to do with processes?
Thank you in advance for any help, it is greatly appreciated :)
I have simplified your code and used some hard-coded values which may or may not be reasonable.
import math
import concurrent.futures
import random
from datetime import datetime
def myThread(arg):
count = 0
for i in range(0, arg[0]):
x = random.uniform(0, 1)
y = random.uniform(0, 1)
d = math.sqrt(x * x + y * y)
if (d < 1):
count += 1
return count
N = 10_000
T = 8
_start = datetime.now()
with concurrent.futures.ThreadPoolExecutor() as executor:
futures = {executor.submit(myThread, (int(N / T),)): _ for _ in range(T)}
total = 0
for future in concurrent.futures.as_completed(futures):
total += future.result()
_end = datetime.now()
print(f'Estimate for PI = {4 * total / N}')
print(f'Run duration = {_end-_start}')
A typical output on my machine looks like this:-
Estimate for PI = 3.1472
Run duration = 0:00:00.008895
Bear in mind that the number of threads you start is effectively managed by the ThreadPoolExecutor (TPE) [ when constructed with no parameters ]. It makes decisions about the number of threads that can run based on your machine's processing capacity (number of cores etc). Therefore you could, if you really wanted to, set T to a very high number and the TPE will block execution of any new threads until it determines that there is capacity.

I'm confused with the questions about python multiprocessing problem

import multiprocessing
import time
import os
def whoami(name):
print("I'm %s, in process %s" % (name, os.getpid()))
def loopy(name):
whoami(name)
start = 1
stop = 1000000
for num in range(start, stop):
print("\tNumber %s of %s. Honk!!!" % (num, stop))
time.sleep(1)
if __name__ == "__main__":
whoami("main")
p = multiprocessing.Process(target=loopy, args=("loopy",))
p.start()
time.sleep(5)
p.terminate()
This code was from 'Introducing python' Book, 2nd versions by Bill Lubanovic. According to the book, when I run this code, the result occurs:
I'm main, in process xxxx
I'm loopy, in process ----
Number 1 of 1000000. Honk!
Number 2 of 1000000. Honk!
Number 3 of 1000000. Honk!
Number 4 of 1000000. Honk!
Number 5 of 1000000. Honk!
However, In my case, only 'Process finished with exit code 0' printed. I want to know which point of this code is wrong.

How to condense this string compression code to make it more efficient?

I just wrote up code for problem 1.6 String Compression from Cracking the Coding Interview. I am wondering how I can condense this code to make it more efficient. Also, I want to make sure that this code is O(n) because I am not concatenating to a new string.
The problem states:
Implement a method to perform basic string compression using the counts of repeated characters. For example, the string 'aabcccccaaa' would become a2b1c5a3. If the "compressed" string would not become smaller than the original string, your method should return the original string. You can assume the string has only uppercase and lowercase letters (a - z).
My code works. My first if statement after the else checks to see if the count for the character is 1, and if it is then to just append the character. I do this so when checking the length of the end result and the original string to decide which one to return.
import string
def stringcompress(str1):
res = []
d = dict.fromkeys(string.ascii_letters, 0)
main = str1[0]
for char in range(len(str1)):
if str1[char] == main:
d[main] += 1
else:
if d[main] == 1:
res.append(main)
d[main] = 0
main = str1[char]
d[main] += 1
else:
res.append(main + str(d[main]))
d[main] = 0
main = str1[char]
d[main] += 1
res.append(main + str(d[main]))
return min(''.join(res), str1)
Again, my code works as expected and does what the question asks. I just want to see if there are certain lines of code I can take out to make the program more efficient.
I messed around testing different variations with the timeit module. Your variation worked fantastically when I generated test data that did not repeat often, but for short strings, my stringcompress_using_string was the fastest method. As the strings grow longer everything flips upside down, and your method of doing things becomes the fastest, and stringcompress_using_string is the slowest.
This just goes to show the importance of testing under different circumstances. My initial conclusions where incomplete, and having more test data showed the true story about the effectiveness of these three methods.
import string
import timeit
import random
def stringcompress_original(str1):
res = []
d = dict.fromkeys(string.ascii_letters, 0)
main = str1[0]
for char in range(len(str1)):
if str1[char] == main:
d[main] += 1
else:
if d[main] == 1:
res.append(main)
d[main] = 0
main = str1[char]
d[main] += 1
else:
res.append(main + str(d[main]))
d[main] = 0
main = str1[char]
d[main] += 1
res.append(main + str(d[main]))
return min(''.join(res), str1, key=len)
def stringcompress_using_list(str1):
res = []
count = 0
for i in range(1, len(str1)):
count += 1
if str1[i] is str1[i-1]:
continue
res.append(str1[i-1])
res.append(str(count))
count = 0
res.append(str1[i] + str(count+1))
return min(''.join(res), str1, key=len)
def stringcompress_using_string(str1):
res = ''
count = 0
# we can start at 1 because we already know the first letter is not a repition of any previous letters
for i in range(1, len(str1)):
count += 1
# we keep going through the for loop, until a character does not repeat with the previous one
if str1[i] is str1[i-1]:
continue
# add the character along with the number of times it repeated to the final string
# reset the count
# and we start all over with the next character
res += str1[i-1] + str(count)
count = 0
# add the final character + count
res += str1[i] + str(count+1)
return min(res, str1, key=len)
def generate_test_data(min_length=3, max_length=300, iterations=3000, repeat_chance=.66):
assert repeat_chance > 0 and repeat_chance < 1
data = []
chr = 'a'
for i in range(iterations):
the_str = ''
# create a random string with a random length between min_length and max_length
for j in range( random.randrange(min_length, max_length+1) ):
# if we've decided to not repeat by randomization, then grab a new character,
# otherwise we will continue to use (repeat) the character that was chosen last time
if random.random() > repeat_chance:
chr = random.choice(string.ascii_letters)
the_str += chr
data.append(the_str)
return data
# generate test data beforehand to make sure all of our tests use the same test data
test_data = generate_test_data()
#make sure all of our test functions are doing the algorithm correctly
print('showing that the algorithms all produce the correct output')
print('stringcompress_original: ', stringcompress_original('aabcccccaaa'))
print('stringcompress_using_list: ', stringcompress_using_list('aabcccccaaa'))
print('stringcompress_using_string: ', stringcompress_using_string('aabcccccaaa'))
print()
print('stringcompress_original took', timeit.timeit("[stringcompress_original(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_list took', timeit.timeit("[stringcompress_using_list(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_string took', timeit.timeit("[stringcompress_using_string(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
The following results where all taken on an Intel i7-5700HQ CPU # 2.70GHz, quad core processor. Compare the different functions within each blockquote, but don't try to cross compare results from one blockquote to another because the size of the test data will be different.
Using long strings
Test data generated with generate_test_data(10000, 50000, 100, .66)
stringcompress_original took 7.346990528497378 seconds
stringcompress_using_list took 7.589927956366313 seconds
stringcompress_using_string took 7.713812443264496 seconds
Using short strings
Test data generated with generate_test_data(2, 5, 10000, .66)
stringcompress_original took 0.40272931026355685 seconds
stringcompress_using_list took 0.1525574881739265 seconds
stringcompress_using_string took 0.13842854253813164 seconds
10% chance of repeating characters
Test data generated with generate_test_data(10, 300, 10000, .10)
stringcompress_original took 4.675965586924492 seconds
stringcompress_using_list took 6.081609410376534 seconds
stringcompress_using_string took 5.887430301813865 seconds
90% chance of repeating characters
Test data generated with generate_test_data(10, 300, 10000, .90)
stringcompress_original took 2.6049783549783547 seconds
stringcompress_using_list took 1.9739111725413099 seconds
stringcompress_using_string took 1.9460854974553605 seconds
It's important to create a little framework like this that you can use to test changes to your algorithm. Often changes that don't seem useful will make your code go much faster, so the key to the game when optimizing for performance is to try out different things, and time the results. I'm sure there are more discoveries that could be found if you play around with making different changes, but it really matters on the type of data you want to optimize for -- compressing short strings vs long strings vs strings that don't repeat as often vs those that do.

python time results not as expected: time.time() - time.time()

In playing around with the python execution of time, I found an odd behavior when calling time.time() twice within a single statement. There is a very small processing delay in obtaining time.time() during statement execution.
E.g. time.time() - time.time()
If executed immediately in a perfect world, would compute in a result of 0.
However, in real world, this results in a very small number as there is an assumed delay in when the processor executes the first time.time() computation and the next. However, when running this same execution and comparing it to a variable computed in the same way, the results are skewed in one direction.
See the small code snippet below.
This also holds true for very large data sets
import time
counts = 300000
def at_once():
first = 0
second = 0
x = 0
while x < counts:
x += 1
exec_first = time.time() - time.time()
exec_second = time.time() - time.time()
if exec_first > exec_second:
first += 1
else:
second += 1
print('1sts: %s' % first)
print('2nds: %s' % second)
prints:
1sts: 39630
2nds: 260370
Unless I have my logic incorrect, I would expect the results to very close to 50:50, but it does not seem to be the case. Is there anyone who could explain what causes this behavior or point out a potential flaw with the code logic that is making the results skewed in one direction?
Could it be that exec_first == exec_second? Your if-else would add 1 to second in that case.
Try changing you if-else to something like:
if exec_first > exec_second:
first += 1
elif exec_second > exec_first:
second += 1
else:
pass
You assign all of the ties to one category. Try it with a middle ground:
import time
counts = 300000
first = 0
second = 0
same = 0
for _ in range(counts):
exec_first = time.time() - time.time()
exec_second = time.time() - time.time()
if exec_first == exec_second:
same += 1
elif exec_first > exec_second:
first += 1
else:
second += 1
print('1sts: %s' % first)
print('same: %s' % same)
print('2nds: %s' % second)
Output:
$ python3 so.py
1sts: 53099
same: 194616
2nds: 52285
$ python3 so.py
1sts: 57529
same: 186726
2nds: 55745
Also, I'm confused as to why you think that a function call might take 0 time. Every invocation requires at least access to the system clock and copying that value to a temporary location of some sort. This isn't free of overhead on any current computer.

Multi-threading in python with loop

I'm trying to solve Problem 8 in project euler with multi-threading technique in python.
Find the greatest product of five consecutive digits in the 1000-digit number. The number can be found here.
My approach is to generate product from chunks of 5 from the original list and repeat this process 5 times, each with the starting index shifted one to the right.
Here is my thread class
class pThread(threading.Thread):
def __init__(self, l):
threading.Thread.__init__(self)
self.l = l
self.p = 0
def run(self):
def greatest_product(l):
"""
Divide the list into chunks of 5 and find the greatest product
"""
def product(seq):
return reduce(lambda x,y : x*y, seq)
def chunk_product(l, n=5):
for i in range(0, len(l), n):
yield product(l[i:i+n])
result = 0
for p in chunk_product(num):
result = result > p and result or p
return result
self.p = greatest_product(self.l)
When I try to create 5 threads to cover all 5-digit chunks in my original list, the manual approach below gives the correct answer, with num being the list of single-digit numbers that I parse from the text:
thread1 = pThread(num)
del num[0]
thread2 = pThread(num)
del num[0]
thread3 = pThread(num)
del num[0]
thread4 = pThread(num)
del num[0]
thread5 = pThread(num)
thread1.start()
thread2.start()
thread3.start()
thread4.start()
thread5.start()
thread1.join()
thread2.join()
thread3.join()
thread4.join()
thread5.join()
def max(*args):
result = 0
for i in args:
result = i > result and i or result
return result
print max(thread1.p, thread2.p, thread3.p, thread4.p, thread5.p)
But this doesn't give the correct result:
threads = []
for i in range(0, 4):
tmp = num[:]
del tmp[0:i+1]
thread = pThread(tmp)
thread.start()
threads.append(thread)
for i in range(0, 4):
threads[i].join()
What did I do wrong here? I'm very new to multithreading so please be gentle.
There are 3 problems:
The first is that the "manual" approach does not give the correct answer. It just happens that the correct answer to the problem is at the offset 4 from the start of your list. You can see this by using:
import operator as op
print max(reduce(op.mul, num[i:i+5]) for i in range(1000))
for k in range(5):
print max(reduce(op.mul, num[i:i+5]) for i in range(k, 1000, 5))
One problem with your "manual" approach is that the threads share the num variable, each has the same list. So when you do del num[0], all threadX.l are affected. The fact that you consistently get the same answer is due to the second problem.
The line
for p in chunk_product(num):
should be:
for p in chunk_product(l):
since you want to use the parameter of function greatest_product(l) and not the global variable num.
In the second method you only spawn 4 threads since the loops range over [0, 1, 2, 3]. Also, you want to delete the values tmp[0:i] and not tmp[0:i+1]. Here is the code:
threads = []
for i in range(5):
tmp = num[:]
del tmp[0:i]
thread = pThread(tmp)
thread.start()
threads.append(thread)
for i in range(5):
threads[i].join()
print len(threads), map(lambda th: th.p, threads)
print max(map(lambda th: th.p, threads))
I took a stab at this mainly to get some practice multiprocessing, and to learn how to use argparse.
This took around 4-5 gigs of ram just in case your machine doesn't have a lot.
python euler.py -l 50000000 -n 100 -p 8
Took 5.836833333969116 minutes
The largest product of 100 consecutive numbers is: a very large number
If you type python euler.py -h at the commandline you get:
usage: euler.py [-h] -l L [L ...] -n N [-p P]
Calculates the product of consecutive numbers and return the largest product.
optional arguments:
-h, --help show this help message and exit
-l L [L ...] A single number or list of numbers, where each # is seperated
by a space
-n N A number that specifies how many consecutive numbers should be
multiplied together.
-p P Number of processes to create. Optional, defaults to the # of
cores on the pc.
And the code:
"""A multiprocess iplementation for calculation the maximum product of N consecutive
numbers in a given range (list of numbers)."""
import multiprocessing
import math
import time
import operator
from functools import reduce
import argparse
def euler8(alist,lenNums):
"""Returns the largest product of N consecutive numbers in a given range"""
return max(reduce(operator.mul, alist[i:i+lenNums]) for i in range(len(alist)))
def split_list_multi(listOfNumbers,numLength,threads):
"""Split a list into N parts where N is the # of processes."""
fullLength = len(listOfNumbers)
single = math.floor(fullLength/threads)
results = {}
counter = 0
while counter < threads:
if counter == (threads-1):
temp = listOfNumbers[single*counter::]
if counter == 0:
results[str(counter)] = listOfNumbers[single*counter::]
else:
prevListIndex = results[str(counter-1)][-int('{}'.format(numLength-1))::]
newlist = prevListIndex + temp
results[str(counter)] = newlist
else:
temp = listOfNumbers[single*counter:single*(counter+1)]
if counter == 0:
newlist = temp
else:
prevListIndex = results[str(counter-1)][-int('{}'.format(numLength-1))::]
newlist = prevListIndex + temp
results[str(counter)] = newlist
counter += 1
return results,threads
def worker(listNumbers,number,output):
"""A worker. Used to run seperate processes and put the results in the queue"""
result = euler8(listNumbers,number)
output.put(result)
def main(listOfNums,lengthNumbers,numCores=multiprocessing.cpu_count()):
"""Runs the module.
listOfNums must be a list of ints, or single int
lengthNumbers is N (an int) where N is the # of consecutive numbers to multiply together
numCores (an int) defaults to however many the cpu has, can specify a number if you choose."""
if isinstance(listOfNums,list):
if len(listOfNums) == 1:
valuesToSplit = [i for i in range(int(listOfNums[0]))]
else:
valuesToSplit = [int(i) for i in listOfNums]
elif isinstance(listOfNums,int):
valuesToSplit = [i for i in range(listOfNums)]
else:
print('First arg must be a number or a list of numbers')
split = split_list_multi(valuesToSplit,lengthNumbers,numCores)
done_queue = multiprocessing.Queue()
jobs = []
startTime = time.time()
for num in range(split[1]):
numChunks = split[0][str(num)]
thread = multiprocessing.Process(target=worker, args=(numChunks,lengthNumbers,done_queue))
jobs.append(thread)
thread.start()
resultlist = []
for i in range(split[1]):
resultlist.append(done_queue.get())
for j in jobs:
j.join()
resultlist = max(resultlist)
endTime = time.time()
totalTime = (endTime-startTime)/60
print("Took {} minutes".format(totalTime))
return print("The largest product of {} consecutive numbers is: {}".format(lengthNumbers, resultlist))
if __name__ == '__main__':
#To call the module from the commandline with arguments
parser = argparse.ArgumentParser(description="""Calculates the product of consecutive numbers \
and return the largest product.""")
parser.add_argument('-l', nargs='+', required=True,
help='A single number or list of numbers, where each # is seperated by a space')
parser.add_argument('-n', required=True, type=int,
help = 'A number that specifies how many consecutive numbers should be \
multiplied together.')
parser.add_argument('-p', default=multiprocessing.cpu_count(), type=int,
help='Number of processes to create. Optional, defaults to the # of cores on the pc.')
args = parser.parse_args()
main(args.l, args.n, args.p)

Categories