I have a Python program with a function takeScreenshot that takes screenshots of 10 webpages entered by the user. I want to use threading so that the web-scraping part of taking the screenshot runs in the background while the program goes on accepting more webpages. After taking 10 screenshots, they should be displayed in the program.
The question is how to make the program display them after the last takeScreenshot thread (the tenth thread) is done, so as not to cause an error. In other words, how do I make sure that all the threads are finished? I tried making a list of all the started threads and calling .join() on each of them after the last webpage is entered (in the last loop iteration). However, this makes the program freeze after the last webpage is entered.
threads = []
n = 0
while n < 10:
    webpage = input("Enter the webpage")
    thread = threading.Thread(target=takeScreenshot, args=webpage)
    thread.start()
    threads.append(thread)
    if n == 9:
        for thread in threads:
            thread.join()
    n += 1
I tried to investigate more and discovered that the program freezes in the last loop when it sets an attribute of a class equal to the screenshot: self.graph = PhotoImage(file='screenshot.png'). Note that the screenshot of the last webpage downloads normally, so the error isn't due to a missing screenshot. That line of code is inside the takeScreenshot function.
Here's the takeScreenshot method (it's part of a class called scraping):
ublockPath = r'C:\Users\Bassel Attia\Documents\Trading Core\1.37.2_0'
chromeOptions = Options()
chromeOptions.add_argument("--log-level=3")
chromeOptions.add_argument('load-extension=' + ublockPath)
self.driver = webdriver.Chrome(ChromeDriverManager().install(),
                               chrome_options=chromeOptions)
self.driver.get(webpage)
self.driver.get_screenshot_as_file('screenshot.png')
self.image = Image.open('screenshot.png')
# crop screenshot
area = (20, 290, 1250, 800)
croppedImage = self.image.crop(area)
os.remove('currentStock.png')
croppedImage.save('screenshot.png')
self.image = Image.open('screenshot.png')
# resize image
newHeight = 300
newWidth = int(newHeight / self.image.height * self.image.width)
resizedImage = self.image.resize((newWidth, newHeight))
os.remove('currentStock.png')
resizedImage.save('screenshot.png')
self.image = Image.open('screenshot.png')
self.image.close()
self.driver.quit()
Inspecting the code you provided, I unfortunately wasn't able to reproduce the issue, but there are some things that could be problematic:
the file currentStock.png is deleted twice (I'm surprised it doesn't raise an exception the second time you try to delete it)
you keep overwriting the same 'screenshot.png' file, so concurrent threads clobber each other's output
args is not a tuple or list
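On that last point, a bare string passed as args gets iterated character by character, so takeScreenshot receives single characters instead of the url. A minimal sketch of the fix (a one-element tuple):
thread = threading.Thread(target=takeScreenshot, args=(webpage,))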
In case this helps, here is a minimal working example:
import os, io, threading, uuid
from PIL import Image
from selenium import webdriver

def screen(wid, webpage):
    opts = webdriver.FirefoxOptions()
    opts.add_argument('--headless')
    print(wid, 'starting webdriver')
    driver = webdriver.Firefox(options=opts)
    driver.get(webpage)
    print(wid, 'taking screenshot')
    image_data = driver.get_screenshot_as_png()
    image = Image.open(io.BytesIO(image_data))
    area = (20, 290, 1250, 800)
    cropped = image.crop(area)
    h = 300
    w = int(h / cropped.height * cropped.width)
    resized = cropped.resize((w, h))
    if not os.path.isdir('screen'):
        os.mkdir('screen')
    fname = os.path.join('screen', f'{uuid.uuid4()}.png')
    resized.save(fname)
    print(wid, f'screenshot saved to {fname}')
    driver.quit()

def main():
    threads = []
    for wid in range(4):
        webpage = input('Webpage: ')
        thread = threading.Thread(target=screen, args=[wid, webpage])
        thread.start()
        threads.append(thread)
    for i, thread in enumerate(threads):
        thread.join()
        print('joined thread:', i)

if __name__ == '__main__':
    main()
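Note that each thread gets its own driver instance and saves its screenshot to a unique uuid-named file, which avoids the shared screenshot.png problem, and the main thread only joins the workers after all the inputs have been collected.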
import pystray
import PIL.Image
from datetime import datetime
from text_to_speech import speak
from time import time, sleep
import os
from gtts import gTTS
import vlc

image = PIL.Image.open('hourglass.jpg')

def on_clicked(icon, item):
    icon.stop()

icon = pystray.Icon('Hourglass', image, menu=pystray.Menu(
    pystray.MenuItem('Exit', on_clicked)))
icon.run()

stop = False  ## To loop forever
while stop == False:
    print('test')
    now = datetime.now()
    second = now.second
    minute = now.minute
    if second == 0:
        myText = 'It is now ' + (now.strftime("%I %p"))
        print(myText)
        output = gTTS(text=myText, lang='en', slow=False)
        output.save("Time.mp3")
        p = vlc.MediaPlayer("Time.mp3")
        p.play()
        sleep(10)
        os.remove("Time.mp3")
This is my code. For some reason I can't figure out, the rest of the code won't run until I press the icon and choose Exit. I was trying to make a tray icon for when I run this in the background.
icon.run() internally runs a loop, so until that loop breaks (when the icon is stopped) the code below it will not be executed. If you want the icon and the rest of the code to run independently, you can use threads.
import threading

def run_icon():
    icon = pystray.Icon('Hourglass', image, menu=pystray.Menu(
        pystray.MenuItem('Exit', on_clicked)))
    icon.run()

def run_second():
    stop = False  ## To loop forever
    while stop == False:
        print('test')
        now = datetime.now()
        second = now.second
        minute = now.minute
        if second == 0:
            myText = 'It is now ' + (now.strftime("%I %p"))
            print(myText)
            output = gTTS(text=myText, lang='en', slow=False)
            output.save("Time.mp3")
            p = vlc.MediaPlayer("Time.mp3")
            p.play()
            sleep(10)
            os.remove("Time.mp3")

Thread1 = threading.Thread(target=run_icon)
Thread2 = threading.Thread(target=run_second)
Thread1.start()  # threads must be started before they can be joined
Thread2.start()
Thread1.join()  # wait for thread to stop
Thread2.join()  # wait for thread to stop
You can use icon.run_detached(). Then just run your main code underneath.
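A minimal sketch of that variant, reusing the same image and on_clicked from above:

icon = pystray.Icon('Hourglass', image, menu=pystray.Menu(
    pystray.MenuItem('Exit', on_clicked)))
icon.run_detached()  # returns immediately; the icon runs on its own thread
# ...the hourly announcement loop can follow here...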
I am making a tkinter program that gets a response from a server and, depending on the answer, changes the background color: green for success or red for error. The problem is that window.after() doesn't wait until it's done before continuing. When I make the request to the server, it has to be done three times to check whether the response is correct, and the window background is supposed to change color each time, but it only does it once. And the background color change isn't the only thing failing: I also want to change a label's text while the request is running, but it happens so quickly that I can't distinguish the changes. So the question is:
How can I make the program wait until one line finishes running before going on to the next one, so that everything doesn't happen at the same time and so fast?
Here is a piece of my code; I removed the request part because I'm trying to solve this problem first:
# import gpiozero
# import picamera
import json
import requests
import tkinter as tk

with open("config.json") as file:
    config = json.load(file)

ENDPOINT = config["ENDPOINT"]
USUARIO = config["USUARIO"]
ESTACION = config["ESTACION"]
TIEMPO_ESPERA = config["TIEMPO_ESPERA"]
PIN_RELE = config["PIN_RELE"]
PATH_SALIDA = ENDPOINT + "Salida.getTicket/" + ESTACION + "/" + USUARIO + "/"
barcode = ""

# RELAY = gpiozero.OutputDevice(PIN_RELE, active_high=True, initial_value=False)
# CAMERA = picamera.PiCamera()

def check_scan_barcode(event=None):
    info_label.config(text="Wait...")
    barcode = barcode_entry.get()
    barcode_entry.delete(0, "end")
    for i in range(3):
        response = get_request(ENDPOINT + barcode)
        if response["data"] == "True":
            success()
            open_barrier(barcode)
        else:
            error()
    info_label.config(text="Scan barcode")

def get_request(url):
    response = requests.get(url)
    response.raise_for_status()
    response = response.json()
    return response

def normal():
    window.configure(bg="white")
    info_label.configure(bg="white")

def success():
    window.configure(bg="green")
    info_label.configure(bg="green")
    window.after(1000, normal)

def error():
    window.configure(bg="red")
    info_label.configure(bg="red")
    window.after(1000, normal)

def open_barrier(barcode):
    # CAMERA.capture(f"/home/pi/Pictures{barcode}.jpg")
    # RELAY.on()
    # window.after(TIEMPO_ESPERA, RELAY.off)
    pass

window = tk.Tk()
# window.attributes('-fullscreen', True)
info_label = tk.Label(window, text="Scan barcode.", font=("Arial", 40))
info_label.pack()
barcode_entry = tk.Entry(window, width=50)
barcode_entry.bind('<Return>', check_scan_barcode)
barcode_entry.pack(expand=True)
barcode_entry.focus()
window.mainloop()
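For reference, a minimal sketch of the usual tkinter approach: instead of a blocking for loop, let the event loop drive the retries with window.after(), so each color change stays visible (a hypothetical rewrite of check_scan_barcode, not the asker's code):

def check_scan_barcode(event=None, attempt=0):
    if attempt == 0:
        info_label.config(text="Wait...")
    if attempt >= 3:
        info_label.config(text="Scan barcode")
        return
    response = get_request(ENDPOINT + barcode_entry.get())
    if response["data"] == "True":
        success()
        open_barrier(barcode_entry.get())
    else:
        error()
    # schedule the next attempt after the 1 s color flash has finished,
    # instead of looping synchronously
    window.after(1100, lambda: check_scan_barcode(attempt=attempt + 1))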
Hello, I have a script that does a GET request, and I need to measure the execution time of the thread that runs that function. This is the code I have written, but it doesn't show the correct time: it shows 0, and sometimes something like 0.001.
import requests
import threading
import time

def functie():
    URL = "http://10.250.100.170:9082/SPVWS2/rest/listaMesaje"
    r = requests.get(url=URL)
    data = r.json()

threads = []
for i in range(5):
    start = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    t = threading.Thread(target=functie)
    threads.append(t)
    t.start()
    end = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    print(end - start)
I need an example of how to get the exact thread execution time in my code. Thanks.
The code in this script runs on the main thread, and you are trying to measure the timing of thread t. To do that, you can tell the main thread to wait until thread t has finished, like this:
import requests
import threading
import time

threads = []
start = []
end = []

def functie():
    URL = "http://10.250.100.170:9082/SPVWS2/rest/listaMesaje"
    r = requests.get(url=URL)
    data = r.json()

for i in range(5):
    start.append(time.clock_gettime_ns(time.CLOCK_MONOTONIC))
    t = threading.Thread(target=functie)
    threads.append(t)
    t.start()

for (i, t) in enumerate(threads):
    t.join()
    end.append(time.clock_gettime_ns(time.CLOCK_MONOTONIC))
    print(end[i] - start[i])
The other answer would produce incorrect results: if the first thread takes longer than the second, the time of the second will be recorded as the same as the first. This is because the end times are recorded sequentially, after each join() finishes, rather than when each thread's target function actually finishes, which may happen in any order.
A better way would be to wrap the target functions of the threads with code that does this:
def thread_time(target):
    def wrapper(*args, **kwargs):
        st = time.time()
        try:
            return target(*args, **kwargs)
        finally:
            et = time.time()
            print(et - st)
            threading.current_thread().duration = et - st
    return wrapper

def functie():
    print("starting")
    time.sleep(1)
    print("ending")

t = threading.Thread(target=thread_time(functie))
t.start()
t.join()
print(t.duration)
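Because the wrapper records its timestamps inside the thread itself, each thread measures its own duration regardless of the order in which the joins return; the result is then read off the Thread object after join().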
I'm using Python 2.7 and have built a UI using Tkinter. I'm using threads and queues to keep the UI responsive while the main script is working. The basic summary: the script reads a text file, parses out some information, puts that info in a dictionary (and a list in the dictionary), then uses that info to send TCP Modbus requests (using pymodbus). It then writes the responses/results to a text file. The results also get printed to a Text widget included in the UI; the updates to the Text widget are handled by the mainloop.
I'm still fairly new to threads and queues and I'm having trouble figuring out this issue.
The problem I'm running into is that I need to include a ~10 ms sleep after it processes each item in the list for the UI to remain responsive. If I include the sleep it works as expected; if not, the UI freezes up until the threaded process is finished and then updates all at once (as it would if threads weren't used). The 10 ms sleep can be slightly shorter; any amount longer also works.
Here's the code that handles updating the log:
textQueue = Queue.Queue()

def TextDisplay(message, disTime="false", myColor="black", bgColor="white"):
    textQueue.put([message, disTime, myColor, bgColor])

class LogUI:
    def __init__(self, master):
        self.master = master
        '''other ui elements, not relevant'''
        self.mainLogFrame = Frame(self.master)
        self.mainLogFrame.pack(side="top", fill="both", expand="yes", padx=5, pady=2)
        self.logText = Text(self.mainLogFrame, height=2)
        self.logText.pack(side="left", fill="both", expand="yes", padx=5, pady=2)
        self.ThreadSafeTextDisplay()

    def ThreadSafeTextDisplay(self):
        while not textQueue.empty():
            tempText = textQueue.get(0)
            message = tempText[0]
            disTime = tempText[1]
            myColor = tempText[2]
            bgColor = tempText[3]
            message = str(message) + "\n"
            '''bunch of formatting stuff'''
            logUI.logText.insert(END, message)
            print message
            # NOTE: tried to include a sleep time here, no effect
        self.logText.after(10, self.ThreadSafeTextDisplay)
Here's the non-threaded function that's called when the user clicks a button.
def ParseInputFile():
    '''non-threaded function, called when user clicks button'''
    inputList = []
    inputFile = mainUI.fullInFileEntry.get()
    with open(inputFile, 'r') as myInput:
        '''open file and put contents in list'''
        for line in myInput:
            inputList.append(line.strip())
    outFile = mainUI.outFileEntry.get().strip() + '.txt'
    i = 1
    tableBol = False
    inputDict = {}
    inputKeys = []
    tableID = None
    for item in inputList:
        '''parses out inputKeys, inputDict using regular expressions'''
    runInputGetQueue.put([inputKeys, inputDict, outFile, inputFile])
Here's the threaded function that receives the parsed information and handles the Modbus request (note: I tried commenting out the actual Modbus request; it made no difference):
def RunInputThread():
    time.sleep(.1)
    while 1:
        while not runInputGetQueue.empty():
            tempGet = runInputGetQueue.get(0)
            inputKeys = tempGet[0]
            inputDict = tempGet[1]
            outFile = tempGet[2]
            inputFile = tempGet[3]
            outFile = open(outFile, 'w')
            TextDisplay('< Start of %s input file > ' % inputFile, True, 'blue')
            for key in inputKeys:
                '''loops through the keys in the dictionary'''
                TextDisplay(key)  # just used as an example
                for lineIndex in range(len(inputDict[key]['tableLines'])):
                    '''lots of code that loops through the lines of the input file, frequently calls the TextDisplay() function'''
                    TextDisplay(inputDict[key][lineIndex])  # just used as an example
                    time.sleep(0.01)  # UI will become unresponsive if not included
            outFile.close()
        time.sleep(0.001)
Found a way to get the UI mostly responsive. As stated in the comments above, the queue was receiving items so fast that the function would be constantly working, causing the UI to lock up. I made it print at most 5 messages before taking a 1 ms break and re-calling the function, which allows the UI to 'catch up.' The messages are printed almost as fast as they come in.
The UI will be slightly unresponsive if you move it or resize it, but I have no issues interacting with other UI elements while this is running.
You could also change the while loop to an if statement if you don't mind it being slow. The one process I ran went from 14 seconds with an if statement down to around 5 or 6 seconds using the code below. It would be the same as changing the pullCount break point from 5 to 1.
def ThreadSafeTextDisplay(self):
    pullCount = 0
    while not textQueue.empty():
        pullCount += 1
        tempText = textQueue.get(0)
        message = tempText[0]
        disTime = tempText[1]
        myColor = tempText[2]
        bgColor = tempText[3]
        message = str(message) + "\n"
        '''bunch of formatting stuff'''
        logUI.logText.insert(END, message)
        print message
        if pullCount >= 5: break  # you can change the 5 to whatever you want; the higher the number, the faster stuff will print, but the UI will start to become unresponsive
    self.logText.after(1, self.ThreadSafeTextDisplay)
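Capping the number of queue pulls per callback bounds how long each after() callback can block the mainloop, which is what keeps the UI responsive while messages continue to drain almost as fast as they arrive.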
So I have been trying to multithread some internet connections in Python. I have been using the multiprocessing module so I can get around the Global Interpreter Lock, but it seems that the system only gives Python one open connection port, or at least it only allows one connection to happen at once. Here is an example of what I am saying.
*Note that this is running on a Linux server.
from multiprocessing import Process, Queue
import urllib
import random

# Generate 10,000 random urls to test and put them in the queue
queue = Queue()
for each in range(10000):
    rand_num = random.randint(1000, 10000)
    url = ('http://www.' + str(rand_num) + '.com')
    queue.put(url)

# Main function for checking to see if a generated url is active
def check(q):
    while True:
        try:
            url = q.get(False)
            try:
                request = urllib.urlopen(url)
                del request
                print url + ' is an active url!'
            except:
                print url + ' is not an active url!'
        except:
            if q.empty():
                break

# Then start all the threads (50)
for thread in range(50):
    task = Process(target=check, args=(queue,))
    task.start()
So if you run this, you will notice that it starts 50 instances of the function but only runs one at a time. You may think that the Global Interpreter Lock is doing this, but it isn't. Try changing the function to a mathematical function instead of a network request and you will see that all fifty threads run simultaneously.
So will I have to work with sockets? Or is there something I can do that will give Python access to more ports? Or is there something I am not seeing? Let me know what you think! Thanks!
*Edit
So I wrote this script to test things better with the requests library. It seems I hadn't tested it very well before (I had mainly used urllib and urllib2).
from multiprocessing import Process, Queue
from threading import Thread
from Queue import Queue as Q
import requests
import time

# A main timestamp
main_time = time.time()

# Generate 100 urls to test and put them in the queue
queue = Queue()
for each in range(100):
    url = ('http://www.' + str(each) + '.com')
    queue.put(url)

# Timer queue
time_queue = Queue()

# Main function for checking to see if a generated url is active
def check(q, t_q):  # args are queue and time_queue
    while True:
        try:
            url = q.get(False)
            # Make a timestamp
            t = time.time()
            try:
                request = requests.head(url, timeout=5)
                t = time.time() - t
                t_q.put(t)
                del request
            except:
                t = time.time() - t
                t_q.put(t)
        except:
            break

# Then start all the threads (20)
thread_list = []
for thread in range(20):
    task = Process(target=check, args=(queue, time_queue))
    task.start()
    thread_list.append(task)

# Join all the threads so the main process doesn't quit
for each in thread_list:
    each.join()

main_time_end = time.time()

# Put the timerQueue into a list to get the average
time_queue_list = []
while True:
    try:
        time_queue_list.append(time_queue.get(False))
    except:
        break

# Results of the time
average_response = sum(time_queue_list) / float(len(time_queue_list))
total_time = main_time_end - main_time
line = "Multiprocessing: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line

# A main timestamp
main_time = time.time()

# Generate 100 urls to test and put them in the queue
queue = Q()
for each in range(100):
    url = ('http://www.' + str(each) + '.com')
    queue.put(url)

# Timer queue
time_queue = Queue()

# Main function for checking to see if a generated url is active
def check(q, t_q):  # args are queue and time_queue
    while True:
        try:
            url = q.get(False)
            # Make a timestamp
            t = time.time()
            try:
                request = requests.head(url, timeout=5)
                t = time.time() - t
                t_q.put(t)
                del request
            except:
                t = time.time() - t
                t_q.put(t)
        except:
            break

# Then start all the threads (20)
thread_list = []
for thread in range(20):
    task = Thread(target=check, args=(queue, time_queue))
    task.start()
    thread_list.append(task)

# Join all the threads so the main process doesn't quit
for each in thread_list:
    each.join()

main_time_end = time.time()

# Put the timerQueue into a list to get the average
time_queue_list = []
while True:
    try:
        time_queue_list.append(time_queue.get(False))
    except:
        break

# Results of the time
average_response = sum(time_queue_list) / float(len(time_queue_list))
total_time = main_time_end - main_time
line = "Standard Threading: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line

# Do the same thing all over again, but this time do one url at a time

# A main timestamp
main_time = time.time()

# Generate 100 urls and test them
timer_list = []
for each in range(100):
    url = ('http://www.' + str(each) + '.com')
    t = time.time()
    try:
        request = requests.head(url, timeout=5)
        timer_list.append(time.time() - t)
    except:
        timer_list.append(time.time() - t)

main_time_end = time.time()

# Results of the time
average_response = sum(timer_list) / float(len(timer_list))
total_time = main_time_end - main_time
line = "Not using threads: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line
As you can see, it is multithreading very well. Actually, most of my tests show that the threading module is faster than the multiprocessing module (I don't understand why!). Here are some of my results:
Multiprocessing: Average response time: 2.40511314869 sec. -- Total time: 25.6876308918 sec.
Standard Threading: Average response time: 2.2179402256 sec. -- Total time: 24.2941861153 sec.
Not using threads: Average response time: 2.1740363431 sec. -- Total time: 217.404567957 sec.
This was done on my home network; the response time on my server is much faster. I think my question has been answered indirectly, since I was having my problems on a much more complex script. All of the suggestions helped me optimize it very well. Thanks to everyone!
it starts 50 instances of the function but only runs one at a time
You have misinterpreted the results of htop. Only a few, if any, copies of Python will be runnable at any given instant. Most of them will be blocked waiting for network I/O.
The processes are, in fact, running in parallel.
Try changing the function to a mathematical function instead of a network request and you will see that all fifty threads run simultaneously.
Changing the task to a mathematical function merely illustrates the difference between CPU-bound (e.g. math) and I/O-bound (e.g. urlopen) processes. The former is always runnable; the latter is rarely runnable.
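For example, a hypothetical stand-in for the math case, in the same style as the question's code:

def check(q):
    while True:
        try:
            n = q.get(False)
        except:
            break
        # CPU-bound work is always runnable, so all 50 processes stay busy
        total = sum(i * i for i in xrange(10 ** 6))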
it only prints one at a time. If it was actually running multiple processes it would print many out at once.
It prints one line at a time because you are writing lines to a terminal. Since the lines are indistinguishable, you wouldn't be able to tell whether they are all written by one thread, or each by a separate thread in turn.
First of all, using multiprocessing to parallelize network I/O is overkill. Using the built-in threading module or a lightweight greenlet library like gevent is a much better option, with less overhead. The GIL has nothing to do with blocking I/O calls, so you don't have to worry about it at all.
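For instance, a minimal gevent sketch (assuming a urls list and a check_url helper wrapping the urlopen logic above; both names are hypothetical):

import gevent
from gevent import monkey
monkey.patch_all()  # make blocking network calls cooperative

# spawn one lightweight greenlet per url, then wait for all of them
jobs = [gevent.spawn(check_url, url) for url in urls]
gevent.joinall(jobs)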
Secondly, an easy way to see whether your subprocesses/threads/greenlets are running in parallel, if you are monitoring stdout, is to print something at the very beginning of the function, right after the subprocess/thread/greenlet is spawned. For example, modify your check() function like so:
def check(q):
    print 'Start checking urls!'
    while True:
        ...
If your code is correct, you should see many Start checking urls! lines printed out before any of the url + ' is [not] an active url!' lines. It works on my machine, so it looks like your code is correct.
It appears that your issue is actually with the serial behavior of gethostbyname(3). This is discussed in this SO thread.
Try this code that uses the Twisted asynchronous I/O library:
import random
import sys

from twisted.internet import reactor
from twisted.internet import defer
from twisted.internet.task import cooperate
from twisted.web import client

SIMULTANEOUS_CONNECTIONS = 25

# Generate 10,000 random urls to test and put them in the queue
pages = []
for each in range(10000):
    rand_num = random.randint(1000, 10000)
    url = ('http://www.' + str(rand_num) + '.com')
    pages.append(url)

# Main function for checking to see if a generated url is active
def check(page):
    def successback(data, page):
        print "{} is an active URL!".format(page)

    def errback(err, page):
        print "{} is not an active URL!; errmsg:{}".format(page, err.value)

    d = client.getPage(page, timeout=3)  # timeout in seconds
    d.addCallback(successback, page)
    d.addErrback(errback, page)
    return d

def generate_checks(pages):
    for i in xrange(0, len(pages)):
        page = pages[i]
        #print "Page no. {}".format(i)
        yield check(page)

def work(pages):
    print "started work(): {}".format(len(pages))
    batch_size = len(pages) / SIMULTANEOUS_CONNECTIONS
    for i in xrange(0, len(pages), batch_size):
        task = cooperate(generate_checks(pages[i:i + batch_size]))

print "starting..."
reactor.callWhenRunning(work, pages)
reactor.run()
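Each cooperate() task works through its slice of pages sequentially, waiting for each Deferred to fire before requesting the next, so roughly SIMULTANEOUS_CONNECTIONS requests are in flight at any given time.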