I am trying out the following code to learn threading in Python.
import urllib.request
import re
import threading
from sys import argv, exit

if len(argv[1:]) == 0:
    exit("You haven't entered any arguments. Try again.")
else:
    comps = argv[1:]

def extr(comp):
    url = 'http://finance.yahoo.com/q?s=' + comp
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    print(re.findall(r'<span id="yfs_l84_[^.]*">(.*?)</span>', str(respData)))

for x in comps:
    t = threading.Thread(extr(x))
    t.daemon = True
    t.start()
I get the right results, but one after the other and not all at once. Am I missing something?
t = threading.Thread(extr(x)) is the problem. You are calling extr(x) and passing its return value to the Thread constructor, so each request runs in the main thread before the Thread object is even created. Try Thread(target=extr, args=(x,)) instead.
You'll then need to use something like https://docs.python.org/2/library/queue.html to allow threads to pass the result data back to the main thread before they terminate. You'd create the queue in the main thread, and pass it as an argument into each subthread.
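For example, a minimal sketch of both suggestions together (using Python 3's queue module to match your urllib.request imports; the URL and regex are copied from the question):

import queue
import re
import threading
import urllib.request
from sys import argv

def extr(comp, out_q):
    # fetch the quote page and push the regex matches onto the shared queue
    url = 'http://finance.yahoo.com/q?s=' + comp
    respData = urllib.request.urlopen(url).read()
    out_q.put(re.findall(r'<span id="yfs_l84_[^.]*">(.*?)</span>', str(respData)))

comps = argv[1:]
out_q = queue.Queue()
threads = []
for x in comps:
    t = threading.Thread(target=extr, args=(x, out_q))  # pass the function, do not call it
    t.daemon = True
    t.start()
    threads.append(t)

for _ in threads:
    print(out_q.get())  # blocks until each worker has put its result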
Related
How can I simultaneously run the following code OR run the TTS function after returning the text?
CODE:
def main(q):
    # CODE BEFORE THIS.
    # TTS IS JUST A SIMPLE TEXT TO SPEECH FUNCTION
    time.sleep(random.uniform(0.5, 2))
    response = 'BOT: ' + response
    # TTS
    # SIMULTANEOUSLY RUN BELOW
    if(responsetts!=None):
        tts(responsetts)
    else:
        tts(response)
    return response

if __name__ == '__main__':
    while True:
        query = input('U: ')
        print(main(query))
The simple solution, in case you want your tts function to run only after the response has been printed, would be to just let main print the response before calling tts. But for more flexibility and better responsiveness of your prompt, you can use a separate thread for your tts call.
The threading module offers a Timer, which is a subclass of Thread. Timer has an interval parameter for adding a sleep before the target function gets executed. You could use this to add a delay if you want, or just use Thread if you don't need this feature. I use espeak in my example instead of tts:
import time
import random
import subprocess
from threading import Timer
from functools import partial

def _espeak(msg):
    # Speak slowly in a female english voice
    cmd = ["espeak", '-s130', '-ven+f5', msg]
    subprocess.run(cmd)

def _vocalize(response, responsetts=None, interval=0):
    # "Comparisons to singletons like None should always be done with is or
    # is not, never the equality operators." -PEP 8
    if responsetts is not None:
        response = responsetts
    Timer(interval=interval, function=_espeak, args=(response,)).start()

def _get_response(q):
    time.sleep(random.uniform(0.5, 2))
    response = '42'
    response = 'BOT: ' + response
    return response

def _handle_query(q):
    response = _get_response(q)
    print(response)
    _vocalize(response, interval=0)

def main():
    prompt = partial(input, 'U: ')
    # alternative to using partial: iter(lambda: input('U: '), 'q')
    for query in iter(prompt, 'q'):  # quits on input 'q'
        _handle_query(query)

if __name__ == '__main__':
    main()
I've tried to create a scraper using Python in combination with Thread to make the execution time faster. The scraper is supposed to parse all the shop names along with their phone numbers, traversing multiple pages.
The script runs without any issues. As I'm very new to working with Thread, I can hardly tell whether I'm doing it the right way.
This is what I've tried so far with:
import requests
from lxml import html
import threading
from urllib.parse import urljoin

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    for pagelink in [url.format(page) for page in range(20)]:
        response = requests.get(pagelink).text
        tree = html.fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span[itemprop=name]")[0].text
            try:
                phone = title.cssselect("div[itemprop=telephone]")[0].text
            except Exception:
                phone = ""
            print(f'{name} {phone}')

thread = threading.Thread(target=get_information, args=(link,))
thread.start()
thread.join()
The problem is that I can't see any difference in time or performance whether I run the above script with Thread or without it. If I'm going about it wrong, how can I execute the above script using Thread?
EDIT: I've tried to change the logic to use multiple links. Is it possible now? Thanks in advance.
You can use threading to scrape several pages in parallel, as below:
import requests
from lxml import html
import threading
from urllib.parse import urljoin

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception:
            phone = ""
        print(f'{name} {phone}')

threads = []
for url in [link.format(page) for page in range(20)]:
    thread = threading.Thread(target=get_information, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
Note that the order of the output will not be preserved. If you scrape the pages one by one, the extracted data comes out in sequence:
page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3
while with threading the data will be interleaved (a sketch of how to keep the output grouped by page is shown after this example output):
page_1_name_1
page_2_name_1
page_1_name_2
page_2_name_2
page_3_name_1
page_2_name_3
page_1_name_3
page_3_name_2
page_3_name_3
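If you do need the rows grouped by page, one option (a rough sketch, not the only way) is to have each thread store its rows in a shared dict keyed by page number and only print after every thread has been joined. This assumes get_information is changed to return a list of "name phone" strings instead of printing them:

results = {}

def collect(page, url):
    # each thread writes to its own key, so no two threads touch the same entry
    results[page] = get_information(url)  # assumes get_information now returns a list

threads = []
for page in range(20):
    t = threading.Thread(target=collect, args=(page, link.format(page)))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

for page in range(20):  # print in page order once everything has finished
    for row in results[page]:
        print(row)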
I have a Python program that I have written. This program calls a function within a module I have also written and passes it some data.
program:
def Response(Response):
    Resp = Response

def main():
    myModule.process_this("hello")  # Send string to myModule process_this function
    # Should wait around here for Resp to contain the Response
    print Resp
That function processes it and passes it back as a response to function Response in the main program.
myModule:
def process_this(data):
    # process data
    program.Response(data)
I checked and all the data is being passed correctly. I have left out all the imports and the data processing to keep this question as concise as possible.
I need to find some way of having Python wait for Resp to actually contain the response before proceeding with the program. I've been looking at threading with semaphores, and at the Queue module, but I'm not 100% sure how I would incorporate either into my program.
Here's a working solution with queues and the threading module. Note: if your tasks are CPU-bound rather than IO-bound, you should use multiprocessing instead.
import threading
import Queue

def worker(in_q, out_q):
    """ threadsafe worker """
    abort = False
    while not abort:
        try:
            # make sure we don't wait forever
            task = in_q.get(True, .5)
        except Queue.Empty:
            abort = True
        else:
            # process task
            response = task
            # return result
            out_q.put(response)
            in_q.task_done()

# one queue to pass tasks, one to get results
task_q = Queue.Queue()
result_q = Queue.Queue()

# start threads
t = threading.Thread(target=worker, args=(task_q, result_q))
t.start()

# submit some work
task_q.put("hello")

# wait for results
task_q.join()
print "result", result_q.get()
I am trying to develop a downloader app in PyGTK.
When a user adds a URL, the following actions happen:
addUrl()
which calls
validateUrl()
getUrldetails()
It takes a little while to add the URL to the list because of the urllib.urlopen delay,
so I tried to implement threads. I added the following code to the main window:
thread.start_new_thread(addUrl, (self,url, ))
I passed a reference to the main window so that I can access the list from the thread,
but nothing seems to happen.
I think that you should check this thread first: How to use threading in Python?
For example:
import Queue
import threading
import urllib2

# called by each thread
def get_url(q, url):
    q.put(urllib2.urlopen(url).read())

theurls = '''http://google.com http://yahoo.com'''.split()

q = Queue.Queue()

for u in theurls:
    t = threading.Thread(target=get_url, args=(q, u))
    t.daemon = True
    t.start()

s = q.get()
print s
Hope this helps you.
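If you want all of the results rather than just whichever page finishes first, call q.get() once per URL (same names as in the snippet above; results arrive in completion order, not submission order):

for _ in theurls:
    print(q.get())  # blocks until the next download finishes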
I need to make a blocking XML-RPC call from my Python script to several physical servers simultaneously and perform actions based on the response from each server independently.
To explain in detail, let us assume the following pseudocode:
while True:
    response = call_to_server1()  # blocking and takes very long time
    if response == this:
        do that
I want to do this for all the servers simultaneously and independently, but from the same script.
Use the threading module.
Boilerplate threading code (I can tailor this if you give me a little more detail on what you are trying to accomplish)
import threading

def run_me(func):
    while not stop_event.isSet():
        response = func()  # blocking and takes very long time
        if response == this:
            do that

def call_to_server1():
    # code to call server 1...
    return magic_server1_call()

def call_to_server2():
    # code to call server 2...
    return magic_server2_call()

# used to stop your loop.
stop_event = threading.Event()

t = threading.Thread(target=run_me, args=(call_to_server1,))
t.start()

t2 = threading.Thread(target=run_me, args=(call_to_server2,))
t2.start()

# wait for threads to return.
t.join()
t2.join()

# we are done....
You can use the multiprocessing module:
import multiprocessing

def call_to_server(ip, port):
    ....
    ....

process = []
for i in xrange(server_count):
    process.append(multiprocessing.Process(target=call_to_server, args=(ip, port)))
    process[i].start()

# waiting for processes to stop
for p in process:
    p.join()
You can use multiprocessing plus queues. With a single sub-process, this is the example:
import multiprocessing
import time

def processWorker(input, result):
    def remoteRequest(params):
        ## this is my remote request
        return True
    while True:
        work = input.get()
        if 'STOP' in work:
            break
        result.put(remoteRequest(work))

input = multiprocessing.Queue()
result = multiprocessing.Queue()
p = multiprocessing.Process(target=processWorker, args=(input, result))
p.start()

requestlist = ['1', '2']
for req in requestlist:
    input.put(req)

for i in xrange(len(requestlist)):
    res = result.get(block=True)
    print 'retrieved ', res

input.put('STOP')
time.sleep(1)
print 'done'
To have more than one sub-process, simply use a list object to store all the sub-processes you start.
The multiprocessing queue is a process-safe object.
You can then keep track of which request is being executed by each sub-process simply by storing the request together with a work id (the work id can be a counter incremented each time the queue is filled with new work). Using multiprocessing.Queue is robust since you do not need to rely on stdout/stderr parsing, and you also avoid the related limitations.
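For example, a rough sketch of the work-id idea (shown in Python 3 syntax; the remote call is just a placeholder, and the worker is adapted to unpack the tag):

import multiprocessing

def remoteRequest(params):
    # placeholder for the real remote call
    return True

def processWorker(in_q, out_q):
    while True:
        work = in_q.get()
        if work == 'STOP':
            break
        workid, params = work                        # unpack the tag added below
        out_q.put((workid, remoteRequest(params)))   # echo the tag back with the result

if __name__ == '__main__':
    in_q = multiprocessing.Queue()
    out_q = multiprocessing.Queue()
    p = multiprocessing.Process(target=processWorker, args=(in_q, out_q))
    p.start()

    requestlist = ['1', '2']
    for workid, req in enumerate(requestlist):       # the counter is the work id
        in_q.put((workid, req))

    for _ in requestlist:
        workid, res = out_q.get(block=True)
        print('request', workid, 'returned', res)

    in_q.put('STOP')
    p.join()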
You can also set a timeout on how long you want a get call to wait at most, e.g.:
import Queue

try:
    res = result.get(block=True, timeout=10)
except Queue.Empty:
    print error
Use twisted.
It has a lot of useful functionality for working with networks, and it is also very good at working asynchronously.
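For reference, a minimal sketch of a non-blocking XML-RPC call with Twisted (the URL and method name here are placeholders, not anything from the question):

from twisted.web.xmlrpc import Proxy
from twisted.internet import reactor

def on_response(response):
    # runs when the server answers; decide what to do based on the response here
    print('server said:', response)

proxy = Proxy(b'http://server1.example.com:8000/RPC2')  # placeholder URL
d = proxy.callRemote('some_method')                     # placeholder method name
d.addCallback(on_response)
d.addErrback(print)                   # just print the failure for this sketch
d.addBoth(lambda _: reactor.stop())   # stop the event loop once this call is done

reactor.run()

With one Proxy per server you can issue all the calls before reactor.run() and handle each response independently in its own callback.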