I've tried to create a scraper using python in combination with Thread to make the execution time faster. The scraper is supposed to parse all the shop names along with their phone numbers traversing multiple pages.
The script is running without any issues. As I'm very new to work with Thread, I can hardly understand I'm doing it in the right way.
This is what I've tried so far with:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
for pagelink in [url.format(page) for page in range(20)]:
response = requests.get(pagelink).text
tree = html.fromstring(response)
for title in tree.cssselect("div.info"):
name = title.cssselect("a.business-name span[itemprop=name]")[0].text
try:
phone = title.cssselect("div[itemprop=telephone]")[0].text
except Exception: phone = ""
print(f'{name} {phone}')
thread = threading.Thread(target=get_information, args=(link,))
thread.start()
thread.join()
The problem being I can't find any difference in time or performance whether I run the above script using Thread or without using Thread. If I'm going wrong, how can I execute the above script using Thread?
EDIT: I've tried to change the logic to use multiple links. Is it possible now? Thanks in advance.
You can use Threading to scrape several pages in paralel as below:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
response = requests.get(url).text
tree = html.fromstring(response)
for title in tree.cssselect("div.info"):
name = title.cssselect("a.business-name span[itemprop=name]")[0].text
try:
phone = title.cssselect("div[itemprop=telephone]")[0].text
except Exception: phone = ""
print(f'{name} {phone}')
threads = []
for url in [link.format(page) for page in range(20)]:
thread = threading.Thread(target=get_information, args=(url,))
threads.append(thread)
thread.start()
for thread in threads:
thread.join()
Note that sequence of data will not be preserved. It means that if to scrape pages one by one sequence of extracted data will be:
page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3
while with Threading data will be mixed:
page_1_name_1
page_2_name_1
page_1_name_2
page_2_name_2
page_3_name_1
page_2_name_3
page_1_name_3
page_3_name_2
page_3_name_3
Related
I have a some threads that are downloading content for various websites using Python's built-in urllib module. The code looks something like this:
from urllib.request import Request, urlopen
from threading import Thread
##do stuff
def download(url):
req = urlopen(Request(url))
return req.read()
url = "somerandomwebsite"
Thread(target=download, args=(url,)).start()
Thread(target=download, args=(url,)).start()
#Do more stuff
The user should have an option to stop loading data, and while I can use flags/events to prevent using the data when it is done downloading if the user stops the download, I can't actually stop the download itself.
Is there a way to either stop the download (and preferably do something when the download is stopped) or forcibly (and safely) kill the thread the download is running in?
Thanks in advance.
you can use urllib.request.urlretrieve instead witch takes a reporthook argument:
from urllib.request import urlretrieve
from threading import Thread
url = "someurl"
flag = 0
def dl_progress(count, blksize, filesize):
global flag
if flag:
raise Exception('downlaod canceled')
Thread(target=urlretrieve, args=(url, "test.rar",dl_progress)).start()
if cancel_download():
flag = 1
I've created a simple python program that scrapes my favorite recipe website and returns the individual recipe URLs from the main site. While this is a relatively quick and simple process, I've tried scaling this out to scrape multiple webpages within the site. When I do this, it takes about 45 seconds to scrape all of the recipe URLs from the whole site. I'd like this process to be much quicker so I tried implementing threads into my program.
I realize there is something wrong here as each thread returns the whole URL thread over and over again instead of 'splitting up' the work. Does anyone have any suggestions on how to better implement the threads? I've included my work below. Using Python 3.
from bs4 import BeautifulSoup
import urllib.request
from urllib.request import urlopen
from datetime import datetime
import threading
from datetime import datetime
startTime = datetime.now()
quote_page='http://thepioneerwoman.com/cooking_cat/all-pw-recipes/'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
all_recipe_links = []
#get all recipe links on current page
def get_recipe_links():
for link in soup.find_all('a', attrs={'post-card-permalink'}):
if link.has_attr('href'):
if 'cooking/' in link.attrs['href']:
all_recipe_links.append(link.attrs['href'])
print(datetime.now() - startTime)
return all_recipe_links
def worker():
"""thread worker function"""
print(get_recipe_links())
return
threads = []
for i in range(5):
t = threading.Thread(target=worker)
threads.append(t)
t.start()
I was able to distribute the work to the workers by having the workers all process data from a single list, instead of having them all run the whole method individually. Below are the parts that I changed. The method get_recipe_links is no longer needed, since its tasks have been moved to other methods.
all_recipe_links = []
links_to_process = []
def worker():
"""thread worker function"""
while(len(links_to_process) > 0):
link = links_to_process.pop()
if link.has_attr('href'):
if 'cooking/' in link.attrs['href']:
all_recipe_links.append(link.attrs['href'])
threads = []
links_to_process = soup.find_all('a', attrs={'post-card-permalink'})
for i in range(5):
t = threading.Thread(target=worker)
threads.append(t)
t.start()
while len(links_to_process)>0:
continue
print(all_recipe_links)
I ran the new methods several times, and on average it takes .02 seconds to run this.
The class BrokenLinkTest in the code below does the following.
takes a web page url
finds all the links in the web page
get the headers of the links concurrently (this is done to check if the link is broken or not)
print 'completed' when all the headers are received.
from bs4 import BeautifulSoup
import requests
class BrokenLinkTest(object):
def __init__(self, url):
self.url = url
self.thread_count = 0
self.lock = threading.Lock()
def execute(self):
soup = BeautifulSoup(requests.get(self.url).text)
self.lock.acquire()
for link in soup.find_all('a'):
url = link.get('href')
threading.Thread(target=self._check_url(url))
self.lock.acquire()
def _on_complete(self):
self.thread_count -= 1
if self.thread_count == 0: #check if all the threads are completed
self.lock.release()
print "completed"
def _check_url(self, url):
self.thread_count += 1
print url
result = requests.head(url)
print result
self._on_complete()
BrokenLinkTest("http://www.example.com").execute()
Can the concurrency/synchronization part be done in a better way. I did it using threading.Lock. This is my first experiment with python threading.
def execute(self):
soup = BeautifulSoup(requests.get(self.url).text)
threads = []
for link in soup.find_all('a'):
url = link.get('href')
t = threading.Thread(target=self._check_url, args=(url,))
t.start()
threads.append(t)
for thread in threads:
thread.join()
You could use the join method to wait for all the threads to finish.
Note I also added a start call, and passed the bound method object to the target param. In your original example you were calling _check_url in the main thread and passing the return value to the target param.
All threads in Python run on the same core, so you won't be gaining any performance by doing it this way. Also - it's very unclear what is actually happening?
You are never actually starting a threads, you are just initializing it
The threads themselves do absolutely nothing other than decrementing the thread count
You may only gain performance in a thread-based scenario if your program is delivering work to the IO (sending requests, writing to file and so on), where other threads can work in the meanwhile.
I have jobs scheduled thru apscheduler. I have 3 jobs so far, but soon will have many more. i'm looking for a way to scale my code.
Currently, each job is its own .py file, and in the file, I have turned the script into a function with run() as the function name. Here is my code.
from apscheduler.scheduler import Scheduler
import logging
import job1
import job2
import job3
logging.basicConfig()
sched = Scheduler()
#sched.cron_schedule(day_of_week='mon-sun', hour=7)
def runjobs():
job1.run()
job2.run()
job3.run()
sched.start()
This works, right now the code is just stupid, but it gets the job done. But when I have 50 jobs, the code will be stupid long. How do I scale it?
note: the actual names of the jobs are arbitrary and doesn't follow a pattern. The name of the file is scheduler.py and I run it using execfile('scheduler.py') in python shell.
import urllib
import threading
import datetime
pages = ['http://google.com', 'http://yahoo.com', 'http://msn.com']
#------------------------------------------------------------------------------
# Getting the pages WITHOUT threads
#------------------------------------------------------------------------------
def job(url):
response = urllib.urlopen(url)
html = response.read()
def runjobs():
for page in pages:
job(page)
start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} microseconds WITHOUT threads" \
.format((end - start).microseconds)
#------------------------------------------------------------------------------
# Getting the pages WITH threads
#------------------------------------------------------------------------------
def job(url):
response = urllib.urlopen(url)
html = response.read()
def runjobs():
threads = []
for page in pages:
t = threading.Thread(target=job, args=(page,))
t.start()
threads.append(t)
for t in threads:
t.join()
start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} microsecond WITH threads" \
.format((end - start).microseconds)
Look #
http://furius.ca/pubcode/pub/conf/bin/python-recursive-import-test
This will help you import all python / .py files.
while importing you can create a list which keeps keeps a function call, for example.
[job1.run(),job2.run()]
Then iterate through them and call function :)
Thanks Arjun
I am trying to develop a downloader app in pygtk
So when a user adds a url following actions happen
addUrl()
which calls
validateUrl()
getUrldetails()
So it took a little while to add the url to the list because of urllib.urlopen delay
so i tried to implement threads. I added the following code to main window
thread.start_new_thread(addUrl, (self,url, ))
I passed a reference to the main window so that i can access the list from thread
but nothing seems to happen
I think that you check this thread first How to use threading in Python?.
for example:
import Queue
import threading
import urllib2
# called by each thread
def get_url(q, url):
q.put(urllib2.urlopen(url).read())
theurls = '''http://google.com http://yahoo.com'''.split()
q = Queue.Queue()
for u in theurls:
t = threading.Thread(target=get_url, args = (q,u))
t.daemon = True
t.start()
s = q.get()
print s
Hope this helps you.