simple multi-threading in python 3 - python

I've created a simple python program that scrapes my favorite recipe website and returns the individual recipe URLs from the main site. While this is a relatively quick and simple process, I've tried scaling this out to scrape multiple webpages within the site. When I do this, it takes about 45 seconds to scrape all of the recipe URLs from the whole site. I'd like this process to be much quicker so I tried implementing threads into my program.
I realize there is something wrong here, as each thread returns the whole list of URLs over and over again instead of 'splitting up' the work. Does anyone have any suggestions on how to better implement the threads? I've included my work below. Using Python 3.
from bs4 import BeautifulSoup
import urllib.request
from urllib.request import urlopen
from datetime import datetime
import threading
startTime = datetime.now()
quote_page='http://thepioneerwoman.com/cooking_cat/all-pw-recipes/'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
all_recipe_links = []
#get all recipe links on current page
def get_recipe_links():
    for link in soup.find_all('a', attrs={'post-card-permalink'}):
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])
    print(datetime.now() - startTime)
    return all_recipe_links
def worker():
    """thread worker function"""
    print(get_recipe_links())
    return
threads = []
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

I was able to distribute the work to the workers by having the workers all process data from a single list, instead of having them all run the whole method individually. Below are the parts that I changed. The method get_recipe_links is no longer needed, since its tasks have been moved to other methods.
all_recipe_links = []
links_to_process = []
def worker():
    """thread worker function"""
    while True:
        try:
            # pop() and catch IndexError, so that two threads cannot both
            # pass a length check and then race for the last item
            link = links_to_process.pop()
        except IndexError:
            break
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])
threads = []
links_to_process = soup.find_all('a', attrs={'post-card-permalink'})
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
# wait for the workers to finish instead of busy-waiting on the list
for t in threads:
    t.join()
print(all_recipe_links)
I ran the new methods several times, and on average it takes 0.02 seconds to run this.

Related

Turn for-loop code into multi-threading code with max number of threads

Background: I'm trying to run hundreds of Dymola simulations with the python-dymola interface. I managed to run them in a for-loop. Now I want to run them with multi-threading so I can run multiple models in parallel (which will be much faster). Since probably nobody uses the interface, I wrote some simple code that also shows my problem:
1: Turn a for-loop into a function (def) that is run inside another for-loop, where both the function and the for-loop share the same variable 'i'.
2: Turn a for-loop into a function and use multi-threading to execute it. A for-loop runs the commands one by one. I want to run them in parallel with a maximum of x threads at the same time. The result should be the same as when executing the for-loop.
Example-code:
import os
nSim = 100
ndig='{:01d}'
for i in range(nSim):
    os.makedirs(str(ndig.format(i)))
Note that the names of the created directories are just the numbers from the for-loop (this is important). Now, instead of using the for-loop, I would love to create the directories with multi-threading (note: probably not interesting for this short code, but when calling and executing hundreds of simulation models it definitely is interesting to use multi-threading).
So I started with something simple I thought, turning the for-loop into a function that then is run inside another for-loop and hoped to have the same result as with the for-loop code above but got this error:
AttributeError: 'NoneType' object has no attribute 'start'
(note: I just started with this, because I did not use the def-statement before and the thread package is also new. After this I would evolve towards the multi-threading.)
1:
import os
nSim = 100
ndig='{:01d}'
def simulation(i):
    os.makedirs(str(ndig.format(i)))
for i in range(nSim):
    simulation(i=i).start
After that failed, I tried to evolve to multi-threading (converting the for-loop into something that does the same but with multi-threading and by that running the code parallel instead of one by one and with a maximum number of threads):
2:
import os
import threading
nSim = 100
ndig='{:01d}'
def simulation(i):
    os.makedirs(str(ndig.format(i)))
if __name__ == '__main__':
    i in range(nSim)
    simulation_thread[i] = threading.Thread(target=simulation(i=i))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
Unfortunately that attempt failed as well and now I got the error:
NameError: name 'i' is not defined
Does anybody have suggestions for issues 1 or 2?
Both examples are incomplete. Here's a complete example. Note that target gets passed the name of the function target=simulation and a tuple of its arguments args=(i,). Don't call the function target=simulation(i=i) because that just passes the result of the function, which is equivalent to target=None in this case.
import threading
nSim = 100
def simulation(i):
    print(f'{threading.current_thread().name}: {i}')
if __name__ == '__main__':
    threads = [threading.Thread(target=simulation,args=(i,)) for i in range(nSim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
Output:
Thread-1: 0
Thread-2: 1
Thread-3: 2
.
.
Thread-98: 97
Thread-99: 98
Thread-100: 99
Note that for CPU-bound work you usually don't want more threads than CPUs; multiprocessing.cpu_count() returns the number of CPUs. You can create a thread pool and use queue.Queue to post work that the threads execute. An example is in the Python queue documentation.
You cannot call .start like this
simulation(i=i).start
on a non-threading object: simulation() returns None, which has no start attribute. You also have to import the threading module.
It seems you forgot to add 'for' and to indent the code in your loop. Change
i in range(nSim)
simulation_thread[i] = threading.Thread(target=simulation(i=i))
simulation_thread[i].daemon = True
simulation_thread[i].start()
to
simulation_thread = {}
for i in range(nSim):
    # pass the function itself plus its arguments; do not call it here
    simulation_thread[i] = threading.Thread(target=simulation, args=(i,))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
If you would like to have a maximum number of threads in a pool and run all items in a queue, you can build on @mark-tolonen's answer like this:
import threading
import queue
import time
def main():
    size_of_threads_pool = 10
    num_of_tasks = 30
    task_seconds = 1
    q = queue.Queue()
    def worker():
        while True:
            item = q.get()
            print(my_st)
            print(f'{threading.current_thread().name}: Working on {item}')
            time.sleep(task_seconds)
            print(f'Finished {item}')
            q.task_done()
    my_st = "MY string"
    threads = [threading.Thread(target=worker, daemon=True) for i in range(size_of_threads_pool)]
    for t in threads:
        t.start()
    # send the task requests to the workers
    for item in range(num_of_tasks):
        q.put(item)
    # block until all tasks are done
    q.join()
    print('All work completed')
    # No need for this: the workers loop forever, so join() would never return.
    # for t in threads:
    #     t.join()
if __name__ == '__main__':
    main()
This runs 30 tasks of 1 second each using 10 threads, so the total time is about 3 seconds.
$ time python3 q_test.py
...
All work completed
real 0m3.064s
user 0m0.033s
sys 0m0.016s
EDIT: I found another higher-level interface for asynchronously executing callables.
Use concurrent.futures, see the example in the docs:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']
# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
Note that max_workers=5 sets the maximum number of threads, and that the loop for url in URLS can be replaced with a loop over your own inputs.
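Applied to the directory-creation loop from the question, the same pattern might look like the sketch below. It is only an illustration: the temporary base directory, the make_dir helper, and the pool size of 4 are assumptions made here so the example can run anywhere without touching the working directory.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

N_SIM = 100        # number of "simulations", as in the question
MAX_THREADS = 4    # assumed cap on simultaneously running threads

# Hypothetical scratch location so the sketch does not litter the cwd.
base_dir = tempfile.mkdtemp()

def make_dir(i):
    """Stand-in for one simulation: create a directory named after i."""
    path = os.path.join(base_dir, str(i))
    os.makedirs(path)
    return path

# The executor never runs more than MAX_THREADS calls at once;
# map() returns results in the same order as range(N_SIM).
with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    created = list(executor.map(make_dir, range(N_SIM)))

print(len(created))
```

Because map() keeps the input order, created[i] is the directory for simulation i, just as in the plain for-loop.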

Unable to execute my script in the right way using thread

I've tried to create a scraper using python in combination with Thread to make the execution time faster. The scraper is supposed to parse all the shop names along with their phone numbers traversing multiple pages.
The script runs without any issues. As I'm very new to working with Thread, I can hardly tell whether I'm doing it the right way.
This is what I've tried so far with:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
    for pagelink in [url.format(page) for page in range(20)]:
        response = requests.get(pagelink).text
        tree = html.fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span[itemprop=name]")[0].text
            try:
                phone = title.cssselect("div[itemprop=telephone]")[0].text
            except Exception: phone = ""
            print(f'{name} {phone}')
thread = threading.Thread(target=get_information, args=(link,))
thread.start()
thread.join()
The problem being I can't find any difference in time or performance whether I run the above script using Thread or without using Thread. If I'm going wrong, how can I execute the above script using Thread?
EDIT: I've tried to change the logic to use multiple links. Is it possible now? Thanks in advance.
You can use threading to scrape several pages in parallel as below:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception: phone = ""
        print(f'{name} {phone}')
threads = []
for url in [link.format(page) for page in range(20)]:
    thread = threading.Thread(target=get_information, args=(url,))
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()
Note that the order of the data will not be preserved. If you scrape the pages one by one, the extracted data will come out in this order:
page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3
while with Threading data will be mixed:
page_1_name_1
page_2_name_1
page_1_name_2
page_2_name_2
page_3_name_1
page_2_name_3
page_1_name_3
page_3_name_2
page_3_name_3
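If the original page order matters, one option is to return the rows from each worker instead of printing them, and let ThreadPoolExecutor.map hand the results back in input order. The sketch below assumes a hypothetical fetch_names function that only simulates a download with a short sleep; in the real scraper it would contain the requests/lxml code from above.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_names(page):
    """Hypothetical stand-in for scraping one page; returns its rows."""
    time.sleep(random.uniform(0.01, 0.05))  # simulate network latency
    return [f'page_{page}_name_{n}' for n in range(1, 4)]

pages = range(1, 4)
with ThreadPoolExecutor(max_workers=3) as executor:
    # Pages are fetched concurrently, but map() yields results
    # in the order of `pages`, not in completion order.
    results = list(executor.map(fetch_names, pages))

# Flatten the per-page lists while keeping page order.
ordered = [name for rows in results for name in rows]
for name in ordered:
    print(name)
```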

How can I use threading to parse multiple webpages in Python?

Most of the time the number of webpages I have to scrape is below 100, so using a for loop I parse them in a reasonable time. But now I have to parse over 1000 webpages.
Searching for a way to do this, I found that threading might help. I have watched and read some tutorials and I believe that I have understood the general logic.
I know that if I have 100 webpages, I can create 100 threads. It's not recommended, especially for a very large number of webpages. What I haven't really figured out is, for example, how I can create 5 threads with 200 webpages on each thread.
Below is a simple code sample using threading and Selenium:
import threading
from selenium import webdriver
def parse_page(page_url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    text = driver.page_source
    ..........
    return parsed_items
def threader():
    worker = q.get()
    parse_page(page_url)
    q.task_one()
urls = [.......]
q = Queue()
for x in range(len(urls)):
    t = threading.Thread(target=threader)
    t.daemon = True
    t.start()
for worker in range(20):
    q.put(worker)
q.join()
Another thing that I am not clear on and it is shown in the above code sample is how I use arguments in thread.
Probably the simplest way is to use ThreadPool from the multiprocessing.pool module or, if you are on Python 3, ThreadPoolExecutor from the concurrent.futures module.
ThreadPool has (almost) the same api as regular Pool but uses threads instead of processes.
e.g.
def f(i):
    return i * i
from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=10)
res = pool.map(f, [2, 3, 4, 5])
print(res)
[4, 9, 16, 25]
And for ThreadPoolExecutor check this example.
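For the literal request in the question (5 threads, each handling 200 of 1000 pages), a rough sketch with plain threading.Thread could look like this. It also shows how arguments reach a thread via args=(...). The parse_page body and the example.com URLs are placeholders assumed here, not the asker's real parser or site.

```python
import threading

def parse_page(page_url):
    """Hypothetical parser; a real one would download and parse the page."""
    return len(page_url)

def parse_chunk(chunk, results, lock):
    """Run by one thread: parse every URL in its own share of the list."""
    for page_url in chunk:
        parsed = parse_page(page_url)
        with lock:  # protect the shared dict while writing
            results[page_url] = parsed

# Placeholder URLs, assumed for illustration only.
urls = [f'http://example.com/page/{n}' for n in range(1000)]
n_threads = 5
results = {}
lock = threading.Lock()

# urls[i::n_threads] gives thread i every 5th URL, i.e. 200 URLs each.
threads = [
    threading.Thread(target=parse_chunk, args=(urls[i::n_threads], results, lock))
    for i in range(n_threads)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))
```

A fixed split like this is simple, but if some pages are much slower than others a pool (ThreadPool or ThreadPoolExecutor) usually balances the work better, since idle threads pick up the next page.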

Synchronise multi-threads in Python

The class BrokenLinkTest in the code below does the following:
takes a web page URL
finds all the links in the web page
gets the headers of the links concurrently (this is done to check whether each link is broken)
prints 'completed' when all the headers are received.
from bs4 import BeautifulSoup
import requests
import threading
class BrokenLinkTest(object):
    def __init__(self, url):
        self.url = url
        self.thread_count = 0
        self.lock = threading.Lock()
    def execute(self):
        soup = BeautifulSoup(requests.get(self.url).text)
        self.lock.acquire()
        for link in soup.find_all('a'):
            url = link.get('href')
            threading.Thread(target=self._check_url(url))
        self.lock.acquire()
    def _on_complete(self):
        self.thread_count -= 1
        if self.thread_count == 0: #check if all the threads are completed
            self.lock.release()
            print "completed"
    def _check_url(self, url):
        self.thread_count += 1
        print url
        result = requests.head(url)
        print result
        self._on_complete()
BrokenLinkTest("http://www.example.com").execute()
Can the concurrency/synchronization part be done in a better way? I did it using threading.Lock. This is my first experiment with Python threading.
def execute(self):
    soup = BeautifulSoup(requests.get(self.url).text)
    threads = []
    for link in soup.find_all('a'):
        url = link.get('href')
        t = threading.Thread(target=self._check_url, args=(url,))
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()
You could use the join method to wait for all the threads to finish.
Note I also added a start call, and passed the bound method object to the target param. In your original example you were calling _check_url in the main thread and passing the return value to the target param.
In CPython, the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so you won't gain any performance for CPU-bound work by doing it this way. Also, it's very unclear what is actually happening:
You never actually start the threads; you only initialize them.
The threads themselves do nothing other than decrement the thread count.
You may only gain performance in a thread-based scenario if your program is waiting on I/O (sending requests, writing to a file, and so on), where other threads can work in the meantime.
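The I/O point can be seen in a small timing sketch. Here time.sleep stands in for a blocking network call (an assumption made so the example runs offline); like real socket I/O, it lets other threads run while it waits. The fake_io name and the 0.2 second delay are made up for illustration.

```python
import threading
import time

def fake_io(seconds):
    """Stand-in for a blocking request; other threads run while it sleeps."""
    time.sleep(seconds)

N_TASKS = 5
DELAY = 0.2

# One after another: total time is roughly N_TASKS * DELAY.
start = time.perf_counter()
for _ in range(N_TASKS):
    fake_io(DELAY)
sequential = time.perf_counter() - start

# In threads: the waits overlap, so total time is roughly one DELAY.
start = time.perf_counter()
threads = [threading.Thread(target=fake_io, args=(DELAY,)) for _ in range(N_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f'sequential: {sequential:.2f}s, threaded: {threaded:.2f}s')
```

Replacing the sleep with a pure-Python computation would typically remove the speed-up, which is the CPU-bound case described above.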

python apscheduler, an easier way to run jobs?

I have jobs scheduled through apscheduler. I have 3 jobs so far, but soon will have many more. I'm looking for a way to scale my code.
Currently, each job is its own .py file, and in the file, I have turned the script into a function with run() as the function name. Here is my code.
from apscheduler.scheduler import Scheduler
import logging
import job1
import job2
import job3
logging.basicConfig()
sched = Scheduler()
@sched.cron_schedule(day_of_week='mon-sun', hour=7)
def runjobs():
    job1.run()
    job2.run()
    job3.run()
sched.start()
This works. Right now the code is just stupid, but it gets the job done. But when I have 50 jobs, the code will be stupid long. How do I scale it?
Note: the actual names of the jobs are arbitrary and don't follow a pattern. The name of the file is scheduler.py and I run it using execfile('scheduler.py') in the Python shell.
import urllib
import threading
import datetime
pages = ['http://google.com', 'http://yahoo.com', 'http://msn.com']
#------------------------------------------------------------------------------
# Getting the pages WITHOUT threads
#------------------------------------------------------------------------------
def job(url):
    response = urllib.urlopen(url)
    html = response.read()
def runjobs():
    for page in pages:
        job(page)
start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
# total_seconds() gives the full elapsed time; .microseconds is only the
# sub-second component of the timedelta
print "jobs run in {} seconds WITHOUT threads" \
    .format((end - start).total_seconds())
#------------------------------------------------------------------------------
# Getting the pages WITH threads
#------------------------------------------------------------------------------
def job(url):
    response = urllib.urlopen(url)
    html = response.read()
def runjobs():
    threads = []
    for page in pages:
        t = threading.Thread(target=job, args=(page,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} seconds WITH threads" \
    .format((end - start).total_seconds())
Look at:
http://furius.ca/pubcode/pub/conf/bin/python-recursive-import-test
This will help you import all Python (.py) files.
While importing, you can build a list that keeps the job functions, for example:
[job1.run, job2.run]
Then iterate through the list and call each function :)
Thanks Arjun
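A minimal sketch of that idea, keeping references to the functions rather than calling them while the list is built. The three run_* functions below are hypothetical stand-ins for job1.run, job2.run, and so on; with real job modules the list would hold their run functions instead.

```python
def run_backup():
    return 'backup done'

def run_report():
    return 'report done'

def run_cleanup():
    return 'cleanup done'

# Note: no parentheses here, so the functions are stored, not called.
jobs = [run_backup, run_report, run_cleanup]

def runjobs():
    """Call every registered job in order and collect the results."""
    return [job() for job in jobs]

results = runjobs()
print(results)
```

Adding a 50th job then means appending one more entry to the list, and runjobs() itself never changes.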
