The class BrokenLinkTest in the code below does the following:
takes a web page URL
finds all the links in the web page
gets the headers of the links concurrently (this is done to check whether a link is broken or not)
prints 'completed' when all the headers are received.
from bs4 import BeautifulSoup
import requests
import threading

class BrokenLinkTest(object):
    def __init__(self, url):
        self.url = url
        self.thread_count = 0
        self.lock = threading.Lock()

    def execute(self):
        soup = BeautifulSoup(requests.get(self.url).text)
        self.lock.acquire()
        for link in soup.find_all('a'):
            url = link.get('href')
            threading.Thread(target=self._check_url(url))
        self.lock.acquire()

    def _on_complete(self):
        self.thread_count -= 1
        if self.thread_count == 0:  # check if all the threads are completed
            self.lock.release()
            print "completed"

    def _check_url(self, url):
        self.thread_count += 1
        print url
        result = requests.head(url)
        print result
        self._on_complete()

BrokenLinkTest("http://www.example.com").execute()
Can the concurrency/synchronization part be done in a better way? I did it using threading.Lock. This is my first experiment with Python threading.
def execute(self):
    soup = BeautifulSoup(requests.get(self.url).text)
    threads = []
    for link in soup.find_all('a'):
        url = link.get('href')
        t = threading.Thread(target=self._check_url, args=(url,))
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()
You could use the join method to wait for all the threads to finish.
Note that I also added a start() call and passed the bound method object to the target parameter. In your original example you were calling _check_url in the main thread and passing its return value to target.
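If you want to avoid the manual start/join bookkeeping altogether, a thread pool is another option. Below is a minimal sketch (a rough variant, not your original class) of the same broken-link check using concurrent.futures; the pool size and the status handling are assumptions on my part:

from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup
import requests

def check_url(url):
    # A broken link typically shows up as a 4xx/5xx status or a request exception
    try:
        return url, requests.head(url, allow_redirects=True).status_code
    except requests.RequestException as exc:
        return url, exc

def broken_link_test(page_url):
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    urls = [a.get("href") for a in soup.find_all("a") if a.get("href")]
    with ThreadPoolExecutor(max_workers=10) as pool:  # assumed pool size
        futures = [pool.submit(check_url, u) for u in urls]
        for future in as_completed(futures):
            print(future.result())
    print("completed")

broken_link_test("http://www.example.com")

as_completed yields each future as soon as its HEAD request finishes, and the with block only exits once all of them are done, so 'completed' still prints last.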
Because of the GIL, Python threads do not run in parallel on multiple cores, so you won't gain any CPU performance by doing it this way. It's also very unclear what is actually happening:
You never actually start the threads, you only create them.
The threads themselves do absolutely nothing other than decrementing the thread count.
You only gain performance in a thread-based scenario when your program hands work off to I/O (sending requests, writing to files and so on), so that other threads can run in the meantime.
I have a script to traverse an AWS S3 bucket to do some aggregation at the file level.
from threading import Semaphore, Thread

class Spider:
    def __init__(self):
        self.sem = Semaphore(120)
        self.threads = list()

    def crawl(self, root_url):
        self.recursive_harvest_subroutine(root_url)
        for thread in self.threads:
            thread.join()

    def recursive_harvest_subroutine(self, url):
        children = get_direct_subdirs(url)
        self.sem.acquire()
        if len(children) == 0:
            queue_url_to_do_something_later(url)  # Done
        else:
            for child_url in children:
                thread = Thread(target=self.recursive_harvest_subroutine, args=(child_url,))
                self.threads.append(thread)
                thread.start()
        self.sem.release()
This used to run okay, until I encountered a bucket of several TB of data with hundreds of thousands of sub-directories. The number of Thread objects in self.threads grows very fast, and soon the server reported:
RuntimeError: can't start new thread
There is some extra processing I have to do in the script, so I can't just list all the files from the bucket.
Currently I'm requiring a depth of at least 2 before the script goes parallel, but that's just a workaround. Any suggestion is appreciated.
The way the original piece of code worked was BFS, which created a lot of waiting threads in the queue. I changed it to DFS and everything is working fine. Pseudo code in case someone needs it in the future:
def __init__(self):
    self.sem = Semaphore(120)
    self.urls = list()
    self.mutex = Lock()

def crawl(self, root_url):
    self.recursive_harvest_subroutine(root_url)
    while not is_done():
        self.sem.acquire()
        url = self.urls.pop(0)
        thread = Thread(target=self.recursive_harvest_subroutine, args=(url,))
        thread.start()
        self.sem.release()

def recursive_harvest_subroutine(self, url):
    children = get_direct_subdirs(url)
    if len(children) == 0:
        queue_url_to_do_something_later(url)  # Done
    else:
        self.mutex.acquire()
        for child_url in children:
            self.urls.insert(0, child_url)
        self.mutex.release()
No join() so I implemented my own is_done() check.
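For anyone hitting the same thread-count limit: another common pattern (a sketch of my own, not the code above) is to keep a fixed pool of worker threads and push URLs through a queue.Queue, whose join()/task_done() pair replaces a hand-rolled is_done() check. get_direct_subdirs and queue_url_to_do_something_later are the question's own helpers, and num_workers=120 just mirrors the Semaphore(120):

import threading
import queue

def crawl(root_url, num_workers=120):
    todo = queue.Queue()
    todo.put(root_url)

    def worker():
        while True:
            url = todo.get()
            if url is None:  # sentinel: no more work
                todo.task_done()
                return
            try:
                children = get_direct_subdirs(url)
                if not children:
                    queue_url_to_do_something_later(url)  # Done
                else:
                    for child_url in children:
                        todo.put(child_url)
            finally:
                todo.task_done()

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    todo.join()           # returns once every queued URL has been processed
    for _ in workers:
        todo.put(None)    # wake the workers so they can exit
    for w in workers:
        w.join()

Because the pool size is fixed, the process never creates more than num_workers threads no matter how many sub-directories the bucket has.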
I've tried to create a scraper using Python in combination with Thread to make the execution faster. The scraper is supposed to parse all the shop names along with their phone numbers, traversing multiple pages.
The script runs without any issues. As I'm very new to working with Thread, I can hardly tell whether I'm doing it the right way.
This is what I've tried so far:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
    for pagelink in [url.format(page) for page in range(20)]:
        response = requests.get(pagelink).text
        tree = html.fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span[itemprop=name]")[0].text
            try:
                phone = title.cssselect("div[itemprop=telephone]")[0].text
            except Exception:
                phone = ""
            print(f'{name} {phone}')

thread = threading.Thread(target=get_information, args=(link,))
thread.start()
thread.join()
The problem is that I can't find any difference in time or performance whether I run the above script with Thread or without it. If I'm doing it wrong, how can I execute the above script using Thread?
EDIT: I've tried to change the logic so it can use multiple links. Is it possible now? Thanks in advance.
You can use threading to scrape several pages in parallel, as below:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception:
            phone = ""
        print(f'{name} {phone}')

threads = []
for url in [link.format(page) for page in range(20)]:
    thread = threading.Thread(target=get_information, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
Note that the order of the data will not be preserved. If you scrape the pages one by one, the sequence of extracted data will be:
page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3
while with threading the data will be interleaved:
page_1_name_1
page_2_name_1
page_1_name_2
page_2_name_2
page_3_name_1
page_2_name_3
page_1_name_3
page_3_name_2
page_3_name_3
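If the page order matters, one option (my own addition, not part of the answer above) is ThreadPoolExecutor.map, which still fetches the pages concurrently but yields the results in submission order. A minimal sketch, reusing the same URL template and selectors as above, with results collected instead of printed from inside the workers:

import requests
from lxml import html
from concurrent.futures import ThreadPoolExecutor

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    tree = html.fromstring(requests.get(url).text)
    results = []
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except IndexError:
            phone = ""
        results.append(f"{name} {phone}")
    return results

with ThreadPoolExecutor(max_workers=10) as executor:  # assumed pool size
    pages = [link.format(page) for page in range(20)]
    # map() runs the fetches concurrently but yields per-page results in order
    for page_results in executor.map(get_information, pages):
        for row in page_results:
            print(row)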
I'm new to Scrapy. I have thousands of (url, xpath) tuples and values in a database.
These URLs are from different domains (not always; there can be 100 URLs from the same domain).
x.com/a //h1
y.com/a //div[@class='1']
z.com/a //div[@href='...']
x.com/b //h1
x.com/c //h1
...
Now I want to fetch these values every 2 hours, as fast as possible, but without overloading any of these domains.
I can't figure out how to do that.
My thoughts:
I could create one spider for every different domain, set its parsing rules and run them all at once.
Is that good practice?
EDIT:
I'm not sure how outputting data into the database would work with regard to concurrency.
EDIT2:
I can do something like this - for every domain there is a new spider. But this is impractical with thousands of different URLs and their xpaths.
import scrapy
from scrapy.selector import HtmlXPathSelector

class WikiScraper(scrapy.Spider):
    name = "wiki_headers"

    def start_requests(self):
        urls = [
            'https://en.wikipedia.org/wiki/Spider',
            'https://en.wikipedia.org/wiki/Data_scraping',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select('//h1/text()').extract()
        print(header)
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

class CraigslistScraper(scrapy.Spider):
    name = "craigslist_headers"

    def start_requests(self):
        urls = [
            'https://columbusga.craigslist.org/act/6062657418.html',
            'https://columbusga.craigslist.org/acc/6060297390.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select('//span[@id="titletextonly"]/text()').extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
From the example you posted in EDIT2, it looks like all your classes can easily be abstracted one more level. How about this?
from urllib.parse import urlparse

import scrapy
from scrapy.selector import HtmlXPathSelector

class GenericScraper(scrapy.Spider):
    def __init__(self, urls, xpath):
        super().__init__()
        self.name = self._create_scraper_name_from_url(urls[0])
        self.urls = urls
        self.xpath = xpath

    @staticmethod
    def _create_scraper_name_from_url(url):
        '''Generate scraper name from url
        www.example.com/foobar/bar -> www_example_com'''
        netloc = urlparse(url).netloc
        return netloc.replace('.', '_')

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select(self.xpath).extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
Next, you could group the data from the database by xpath:
for urls, xpath in grouped_data:
    scraper = GenericScraper(urls, xpath)
    # do whatever you need with scraper
As for concurrency: your database should handle concurrent writes, so I do not see a problem there.
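For completeness, here is a hedged sketch of how grouped_data might be built from the (url, xpath) rows described in the question; the row format and the grouping key of (domain, xpath) are my assumptions, not part of the original answer:

from collections import defaultdict
from urllib.parse import urlparse

# hypothetical rows in the shape the question describes: (url, xpath)
rows = [
    ("http://x.com/a", "//h1"),
    ("http://y.com/a", "//div[@class='1']"),
    ("http://x.com/b", "//h1"),
]

groups = defaultdict(list)
for url, xpath in rows:
    groups[(urlparse(url).netloc, xpath)].append(url)

# one (urls, xpath) pair per domain/xpath combination
grouped_data = [(urls, xpath) for (_netloc, xpath), urls in groups.items()]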
Edit:
Related to the timeouts: I do not know how Scrapy works under the hood, i.e. whether it uses some sort of parallelization and whether it runs asynchronously in the background. But from what you wrote I guess it does, and when you fire up 1k scrapers, each firing multiple requests at a time, your hardware can't handle that much traffic (disclaimer, this is just a guess!).
There might be a native way to do this, but a possible workaround is to use multiprocessing + Queue:
from multiprocessing import JoinableQueue, Process

NUMBER_OF_CPU = 4  # change this to your number.
SENTINEL = None

class Worker(Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        while True:
            # blocking wait! You have to use sentinels if you use blocking waits!
            item = self.queue.get()
            if item is SENTINEL:
                # we got the sentinel, there are no more scrapers to process
                self.queue.task_done()
                return
            else:
                # item is a scraper, run it
                item.run_spider()  # or however you run your scrapers
                # This assumes that each scraper is **not** running in the background!
                # Tell the JoinableQueue we have processed one more item.
                # In the main process, queue.join() waits until a queue.task_done()
                # has been called for each item taken from the queue.
                self.queue.task_done()

def run():
    queue = JoinableQueue()
    # if putting that many things in the queue gets slow (I imagine
    # it can) you can fire up a separate Thread/Process to fill the
    # queue in the background while workers are already consuming it.
    for urls, xpath in grouped_data:
        scraper = GenericScraper(urls, xpath)
        queue.put(scraper)
    for sentinel in range(NUMBER_OF_CPU):
        # None or a sentinel of your choice to tell the workers there are
        # no more scrapers to process
        queue.put(SENTINEL)
    workers = []
    for _ in range(NUMBER_OF_CPU):
        worker = Worker(queue)
        workers.append(worker)
        worker.start()
    # We have to wait until the queue is processed
    queue.join()
But please bear in mind that this is a vanilla approach to parallel execution that completely ignores Scrapy's abilities. I found a blog post which uses Twisted to achieve (what I think is) the same thing, but since I've never used Twisted I can't comment on it.
If you are thinking that Scrapy can't handle multiple domains at once because of the allowed_domains parameter, remember that it is optional.
If no allowed_domains parameter is set in the spider, it can work with every domain it gets.
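A minimal sketch to illustrate (my own example, reusing URLs from the question): a spider that sets no allowed_domains will happily crawl both domains.

import scrapy

class AnyDomainSpider(scrapy.Spider):
    name = "any_domain"
    # no allowed_domains attribute, so Scrapy's off-site filtering is not applied

    start_urls = [
        'https://en.wikipedia.org/wiki/Spider',
        'https://columbusga.craigslist.org/act/6062657418.html',
    ]

    def parse(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}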
If I understand correctly, you have a map of domain to xpath values and you want to pick the xpath depending on which domain you crawl?
Try something like:
import logging

DOMAIN_DATA = [('domain.com', '//div')]

def get_domain(url):
    for domain, xpath in DOMAIN_DATA:
        if domain in url:
            return xpath

def parse(self, response):
    xpath = get_domain(response.url)
    if not xpath:
        logging.error('no xpath for url: {}; unknown domain'.format(response.url))
        return
    item = dict()
    item['some_field'] = response.xpath(xpath).extract()
    yield item
I've created a simple Python program that scrapes my favorite recipe website and returns the individual recipe URLs from the main site. While this is a relatively quick and simple process, I've tried scaling it out to scrape multiple webpages within the site. When I do this, it takes about 45 seconds to scrape all of the recipe URLs from the whole site. I'd like this process to be much quicker, so I tried implementing threads in my program.
I realize there is something wrong here, as each thread returns the whole URL list over and over again instead of 'splitting up' the work. Does anyone have any suggestions on how to better implement the threads? I've included my work below. Using Python 3.
from bs4 import BeautifulSoup
import urllib.request
from urllib.request import urlopen
from datetime import datetime
import threading

startTime = datetime.now()

quote_page = 'http://thepioneerwoman.com/cooking_cat/all-pw-recipes/'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
all_recipe_links = []

#get all recipe links on current page
def get_recipe_links():
    for link in soup.find_all('a', attrs={'post-card-permalink'}):
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])
    print(datetime.now() - startTime)
    return all_recipe_links

def worker():
    """thread worker function"""
    print(get_recipe_links())
    return

threads = []
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
I was able to distribute the work to the workers by having the workers all process data from a single list, instead of having them all run the whole method individually. Below are the parts that I changed. The method get_recipe_links is no longer needed, since its tasks have been moved to other methods.
all_recipe_links = []
links_to_process = []

def worker():
    """thread worker function"""
    while(len(links_to_process) > 0):
        link = links_to_process.pop()
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])

threads = []
links_to_process = soup.find_all('a', attrs={'post-card-permalink'})
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

while len(links_to_process) > 0:
    continue

print(all_recipe_links)
I ran the new methods several times, and on average it takes .02 seconds to run this.
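One caveat worth adding (my own note, not part of the answer): checking len() and then calling pop() on a shared list can race, with one worker popping the last item between another worker's length check and its pop, and the trailing while loop busy-waits on the CPU. A minimal sketch of the same idea using queue.Queue and join(), assuming soup is built as in the question:

import threading
import queue

links_to_process = queue.Queue()
all_recipe_links = []

def worker():
    while True:
        try:
            link = links_to_process.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        if link.has_attr('href') and 'cooking/' in link.attrs['href']:
            all_recipe_links.append(link.attrs['href'])  # list.append is thread-safe in CPython

# soup comes from the question's code above
for link in soup.find_all('a', attrs={'post-card-permalink'}):
    links_to_process.put(link)

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(all_recipe_links)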
I am playing around with concurrent.futures.
Currently my future calls time.sleep(secs).
It seems that Future.cancel() does less than I thought.
If the future is already executing, then time.sleep() does not get cancelled by it.
The same goes for the timeout parameter of wait(): it does not cancel my time.sleep().
How can I cancel a time.sleep() that is being executed in a concurrent.futures worker?
For testing I use the ThreadPoolExecutor.
If you submit a function to a ThreadPoolExecutor, the executor will run the function in a thread and store its return value in the Future object. Since the number of concurrent threads is limited, you have the option to cancel the pending execution of a future, but once control in the worker thread has been passed to the callable, there's no way to stop execution.
Consider this code:
import concurrent.futures as f
import time
T = f.ThreadPoolExecutor(1) # Run at most one function concurrently
def block5():
time.sleep(5)
return 1
q = T.submit(block5)
m = T.submit(block5)
print q.cancel() # Will fail, because q is already running
print m.cancel() # Will work, because q is blocking the only thread, so m is still queued
In general, whenever you want something to be cancellable, you yourself are responsible for making sure that it is.
There are some off-the-shelf options available, though. For example, consider using asyncio; it also has an example using sleep. The concept circumvents the issue: whenever a potentially blocking operation is to be called, control is instead returned to an event loop running in the outermost context, together with a note that execution should be continued whenever the result is available - or, in your case, after n seconds have passed.
I do not know much about concurrent.futures, but you can use this logic to break up the waiting time. Use a loop instead of time.sleep() or wait():

for i in range(secs):
    sleep(1)

An interrupt or break can then be used to come out of the loop.
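A minimal sketch of the same idea using threading.Event (my own variant): Event.wait(timeout) returns as soon as the event is set, so the "sleep" can be interrupted at any point rather than only at one-second boundaries.

import threading

cancel = threading.Event()

def interruptible_sleep(secs):
    # Returns True if the full duration elapsed, False if cancelled early.
    return not cancel.wait(timeout=secs)

# From another thread, cancel.set() wakes up any worker currently "sleeping".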
I figured it out.
Here is an example:
from concurrent.futures import ThreadPoolExecutor
import queue
import time
class Runner:
    def __init__(self):
        self.q = queue.Queue()
        self.exec = ThreadPoolExecutor(max_workers=2)

    def task(self):
        while True:
            try:
                self.q.get(block=True, timeout=1)
                break
            except queue.Empty:
                pass
            print('running')

    def run(self):
        self.exec.submit(self.task)

    def stop(self):
        self.q.put(None)
        self.exec.shutdown(wait=False, cancel_futures=True)
r = Runner()
r.run()
time.sleep(5)
r.stop()
As the documentation example shows, you can use a with statement to ensure threads are cleaned up promptly, like the example below:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
I've faced this same problem recently. I had 2 tasks to run concurrently and one of them had to sleep from time to time. In the code below, suppose task2 is the one that sleeps.
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=2)
executor.submit(task1)
executor.submit(task2)
executor.shutdown(wait=True)
In order to avoid the endless sleep, I extracted task2 to run synchronously. I don't know whether it's good practice, but it's simple and fits perfectly in my scenario.
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=1)
executor.submit(task1)
task2()
executor.shutdown(wait=True)
Maybe it's useful to someone else.