Most of the time the number of webpages I have to scrape is below 100, so a simple for loop parses them in a reasonable time. But now I have to parse over 1000 webpages.
Searching for a way to do this, I found that threading might help. I have watched and read some tutorials, and I believe that I have understood the general logic.
I know that I could create one thread per webpage, but that is not recommended, especially for a very large number of pages. What I haven't really figured out is how to create, for example, 5 threads that each handle 200 webpages.
Below is a simple code sample using threading and Selenium:
import threading
from selenium import webdriver

def parse_page(page_url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    text = driver.page_source
    ..........
    return parsed_items

def threader():
    worker = q.get()
    parse_page(page_url)
    q.task_one()

urls = [.......]

q = Queue()

for x in range(len(urls)):
    t = threading.Thread(target=threader)
    t.daemon = True
    t.start()

for worker in range(20):
    q.put(worker)

q.join()
Another thing that I am not clear on, and which shows in the code sample above, is how to pass arguments to a thread.
Probably the simplest way is to use ThreadPool from the multiprocessing.pool module or, if you are on Python 3, ThreadPoolExecutor from the concurrent.futures module.
ThreadPool has (almost) the same API as the regular Pool but uses threads instead of processes.
e.g.
from multiprocessing.pool import ThreadPool

def f(i):
    return i * i

pool = ThreadPool(processes=10)
res = pool.map(f, [2, 3, 4, 5])
print(res)
[4, 9, 16, 25]
And for ThreadPoolExecutor check this example.
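For comparison, here is a minimal ThreadPoolExecutor sketch of the same computation (assuming Python 3.2+); executor.map mirrors pool.map, and max_workers caps the number of threads:
from concurrent.futures import ThreadPoolExecutor

def f(i):
    return i * i

# max_workers caps the number of threads; map preserves the input order
with ThreadPoolExecutor(max_workers=10) as executor:
    res = list(executor.map(f, [2, 3, 4, 5]))

print(res)  # [4, 9, 16, 25]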
Related
Background: I'm trying to run hundreds of Dymola simulations with the python-dymola interface. I managed to run them in a for-loop. Now I want to run them with multi-threading so that multiple models run in parallel (which will be much faster). Since probably nobody uses the interface, I wrote some simple code that also shows my problem:
1: Turn a for-loop into a function that is then run inside another for-loop, where both the function and the for-loop share the same variable 'i'.
2: Turn a for-loop into a function and use multi-threading to execute it. A for-loop runs the commands one by one; I want to run them in parallel with a maximum of x threads at the same time. The result should be the same as when executing the for-loop.
Example-code:
import os

nSim = 100
ndig = '{:01d}'

for i in range(nSim):
    os.makedirs(str(ndig.format(i)))
Note that the names of the created directories are just the numbers from the for-loop (this is important). Now, instead of using the for-loop, I would love to create the directories with multi-threading (note: probably not interesting for this short piece of code, but when calling and executing hundreds of simulation models it definitely is worth using multi-threading).
So I started with something I thought was simple: turning the for-loop into a function that is then run inside another for-loop, hoping to get the same result as with the for-loop code above. But I got this error:
AttributeError: 'NoneType' object has no attribute 'start'
(Note: I just started with this because I had not used the def statement before and the threading package is also new to me. After this I would evolve towards multi-threading.)
1:
import os

nSim = 100
ndig = '{:01d}'

def simulation(i):
    os.makedirs(str(ndig.format(i)))

for i in range(nSim):
    simulation(i=i).start
After that failed, I tried to move on to multi-threading (converting the for-loop into something that does the same thing, but with multi-threading so the code runs in parallel instead of one by one, with a maximum number of threads):
2:
import os
import threading

nSim = 100
ndig = '{:01d}'

def simulation(i):
    os.makedirs(str(ndig.format(i)))

if __name__ == '__main__':
    i in range(nSim)
    simulation_thread[i] = threading.Thread(target=simulation(i=i))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
Unfortunately that attempt failed as well, and now I got this error:
NameError: name 'i' is not defined
Does anybody have suggestions for issues 1 or 2?
Both examples are incomplete. Here's a complete example. Note that target gets passed the name of the function, target=simulation, and a tuple of its arguments, args=(i,). Don't call the function (target=simulation(i=i)), because that just passes the result of the function, which is equivalent to target=None in this case.
import threading

nSim = 100

def simulation(i):
    print(f'{threading.current_thread().name}: {i}')

if __name__ == '__main__':
    threads = [threading.Thread(target=simulation, args=(i,)) for i in range(nSim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
Output:
Thread-1: 0
Thread-2: 1
Thread-3: 2
.
.
Thread-98: 97
Thread-99: 98
Thread-100: 99
Note that you usually don't want more threads than CPUs, a number you can get from multiprocessing.cpu_count(). You can create a thread pool and use queue.Queue to post work that the threads execute. An example is in the Python Queue documentation.
You cannot call .start like this:
simulation(i=i).start
on a non-threading object. Also, you have to import the threading module as well.
It seems like you forgot to add 'for' and to indent the code in your loop:
i in range(nSim)
simulation_thread[i] = threading.Thread(target=simulation(i=i))
simulation_thread[i].daemon = True
simulation_thread[i].start()
to
for i in range(nSim):
    simulation_thread[i] = threading.Thread(target=simulation(i=i))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
If you would like to have a maximum number of threads in a pool and run all items in the queue, we can continue @Mark-Tolonen's answer and do it like this:
import threading
import queue
import time

def main():
    size_of_threads_pool = 10
    num_of_tasks = 30
    task_seconds = 1

    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            print(my_st)
            print(f'{threading.current_thread().name}: Working on {item}')
            time.sleep(task_seconds)
            print(f'Finished {item}')
            q.task_done()

    my_st = "MY string"
    threads = [threading.Thread(target=worker, daemon=True) for i in range(size_of_threads_pool)]
    for t in threads:
        t.start()

    # send the task requests to the workers
    for item in range(num_of_tasks):
        q.put(item)

    # block until all tasks are done
    q.join()
    print('All work completed')

    # No need for this: the workers loop forever (while True), so they would never stop
    # for t in threads:
    #     t.join()

if __name__ == '__main__':
    main()
This will run 30 tasks of 1 second each, using 10 threads, so the total time should be about 3 seconds.
$ time python3 q_test.py
...
All work completed
real 0m3.064s
user 0m0.033s
sys 0m0.016s
EDIT: I found another higher-level interface for asynchronously executing callables.
Use concurrent.futures, see the example in the docs:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
Note the max_workers=5 argument, which sets the maximum number of threads, and note the for url in URLS loop that you can adapt to your own list of URLs.
I've tried to create a scraper in Python, in combination with Thread, to make the execution faster. The scraper is supposed to parse all the shop names along with their phone numbers, traversing multiple pages.
The script runs without any issues. As I'm very new to working with Thread, I can hardly tell whether I'm doing it the right way.
This is what I've tried so far:
import requests
from lxml import html
import threading
from urllib.parse import urljoin

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    for pagelink in [url.format(page) for page in range(20)]:
        response = requests.get(pagelink).text
        tree = html.fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span[itemprop=name]")[0].text
            try:
                phone = title.cssselect("div[itemprop=telephone]")[0].text
            except Exception:
                phone = ""
            print(f'{name} {phone}')

thread = threading.Thread(target=get_information, args=(link,))
thread.start()
thread.join()
The problem is that I can't find any difference in time or performance whether I run the above script with Thread or without it. If I'm doing it wrong, how can I execute the above script using Thread?
EDIT: I've tried to change the logic to use multiple links. Is it possible now? Thanks in advance.
You can use Threading to scrape several pages in parallel as below:
import requests
from lxml import html
import threading
from urllib.parse import urljoin

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception:
            phone = ""
        print(f'{name} {phone}')

threads = []
for url in [link.format(page) for page in range(20)]:
    thread = threading.Thread(target=get_information, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
Note that the sequence of the data will not be preserved. If pages are scraped one by one, the sequence of the extracted data will be:
page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3
while with Threading the data will be mixed:
page_1_name_1
page_2_name_1
page_1_name_2
page_2_name_2
page_3_name_1
page_2_name_3
page_1_name_3
page_3_name_2
page_3_name_3
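If the original page order matters, a rough sketch of one way to keep it (using the same libraries as above) is to have each thread write its page's results into a slot indexed by the page number and print only after all threads have joined:
import requests
from lxml import html
import threading

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
results = [None] * 20  # one slot per page, each filled by its own thread

def get_information(url, slot):
    tree = html.fromstring(requests.get(url).text)
    items = []
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception:
            phone = ""
        items.append(f'{name} {phone}')
    results[slot] = items  # each thread writes to a distinct index

threads = [threading.Thread(target=get_information, args=(link.format(page), page))
           for page in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# print in page order once every thread has finished
for page_items in results:
    for line in page_items or []:
        print(line)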
I am new to Python and I am currently working on a multiple-webpage scraper. While I was playing around with Python I found out about Threading, and it really speeds up the code. The problem is that the script scrapes lots of sites, and I'd like to do this in 'batches' while using Threading.
When I've got an array of 1000 items, I'd like to grab 10 of them. When the script is done with these 10 items, grab 10 new items until there is nothing left.
I hope someone can help me. Thanks in advance!
import subprocess
import threading
from multiprocessing import Pool

def scrape(url):
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

if __name__ == '__main__':
    pool = Pool()
    sites = ["http://site1.com", "http://site2.com", "http://site3.com", "http://site4.com"]
    results = pool.imap(scrape, sites)
    for result in results:
        print(result)
In the future I will use an SQLite database to store all the URLs (this will replace the array). When I run the script, I want to be able to stop the process and continue whenever I want to. This is not my question, but the context of my problem.
Question: ... array of 1000 items I'd like to grab 10 items
for p in range(0, 1000, 10):
    process10(sites[p:p+10])
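process10 is left undefined above; a rough sketch of what it could look like (assuming the scrape function from the question) is to hand each batch to a pool of 10 workers and let the batch finish before the next one starts:
from multiprocessing import Pool

def process10(batch):
    # one pool of 10 workers per batch; consuming imap waits for the whole batch
    with Pool(processes=10) as pool:
        for result in pool.imap(scrape, batch):
            print(result)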
How about using Process and Queue from multiprocessing? Writing a worker function and calling it from a loop makes it run in batches. With Process, jobs can be started and stopped when needed, which gives you more control over them.
import subprocess
from multiprocessing import Process, Queue

def worker(url_queue, result_queue):
    for url in iter(url_queue.get, 'DONE'):
        scrape_result = scrape(url)
        result_queue.put(scrape_result)

def scrape(url):
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

if __name__ == '__main__':
    sites = ["http://site1.com", "http://site2.com", "http://site3.com", "http://site4.com", "http://site5.com",
             "http://site6.com", "http://site7.com", "http://site8.com", "http://site9.com", "http://site10.com",
             "http://site11.com", "http://site12.com", "http://site13.com", "http://site14.com", "http://site15.com",
             "http://site16.com", "http://site17.com", "http://site18.com", "http://site19.com", "http://site20.com"]

    url_queue = Queue()
    result_queue = Queue()
    processes = []

    for url in sites:
        url_queue.put(url)

    for i in range(10):
        p = Process(target=worker, args=(url_queue, result_queue))
        p.start()
        processes.append(p)

    # one 'DONE' sentinel per worker so that every process can exit its loop
    for i in range(10):
        url_queue.put('DONE')

    for p in processes:
        p.join()

    result_queue.put('DONE')

    for response in iter(result_queue.get, 'DONE'):
        print(response)
Note that Queue is a FIFO queue that supports putting and getting elements.
I've created a simple Python program that scrapes my favorite recipe website and returns the individual recipe URLs from the main site. While this is a relatively quick and simple process, I've tried scaling it out to scrape multiple webpages within the site. When I do this, it takes about 45 seconds to scrape all of the recipe URLs from the whole site. I'd like this process to be much quicker, so I tried implementing threads in my program.
I realize there is something wrong here, as each thread returns the whole list of URLs over and over again instead of 'splitting up' the work. Does anyone have any suggestions on how to better implement the threads? I've included my work below. Using Python 3.
from bs4 import BeautifulSoup
import urllib.request
from urllib.request import urlopen
from datetime import datetime
import threading

startTime = datetime.now()

quote_page = 'http://thepioneerwoman.com/cooking_cat/all-pw-recipes/'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

all_recipe_links = []

# get all recipe links on the current page
def get_recipe_links():
    for link in soup.find_all('a', attrs={'post-card-permalink'}):
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])
    print(datetime.now() - startTime)
    return all_recipe_links

def worker():
    """thread worker function"""
    print(get_recipe_links())
    return

threads = []
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
I was able to distribute the work to the workers by having the workers all process data from a single list, instead of having them all run the whole method individually. Below are the parts that I changed. The method get_recipe_links is no longer needed, since its tasks have been moved to other methods.
all_recipe_links = []
links_to_process = []

def worker():
    """thread worker function"""
    while len(links_to_process) > 0:
        link = links_to_process.pop()
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])

threads = []
links_to_process = soup.find_all('a', attrs={'post-card-permalink'})

for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

while len(links_to_process) > 0:
    continue

print(all_recipe_links)
I ran the new version several times, and on average it takes 0.02 seconds to run.
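One caveat: checking len() and then popping from a shared list can still race, because another thread may take the last item between the check and the pop. A queue.Queue avoids that; the following is only a rough sketch of that variant (reusing soup from the question's code):
import queue
import threading

all_recipe_links = []
links_to_process = queue.Queue()

def worker():
    """Pull links off the queue until it is empty."""
    while True:
        try:
            link = links_to_process.get_nowait()
        except queue.Empty:
            return
        if link.has_attr('href') and 'cooking/' in link.attrs['href']:
            all_recipe_links.append(link.attrs['href'])

for link in soup.find_all('a', attrs={'post-card-permalink'}):
    links_to_process.put(link)

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(all_recipe_links)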
I'm going to write a program that has multiple processes (CPU-bound) and multiple threads (I/O-bound). (The code below is just a sample, not the actual program.)
But when the code reaches join(), the program ends up in a deadlock.
My code is posted below.
import requests
import time
from multiprocessing import Process, Queue
from multiprocessing.dummy import Pool

start = time.time()

queue = Queue()
rQueue = Queue()
url = 'http://www.bilibili.com/video/av'

for i in xrange(10):
    queue.put(url + str(i))

def goURLsCrawl(queue, rQueue):
    threadPool = Pool(7)
    while not queue.empty():
        threadPool.apply_async(urlsCrawl, args=(queue.get(), rQueue))
    threadPool.close()
    threadPool.join()
    print 'end'

def urlsCrawl(url, rQueue):
    response = requests.get(url)
    rQueue.put(response)

p = Process(target=goURLsCrawl, args=(queue, rQueue))
p.start()
p.join()  # join() is here

end = time.time()
print 'total time %0.4f' % (end - start,)
Thanks in advance.😊
I finally found the reason. As you can see, I imported Queue from multiprocessing, so that Queue should only be used with Processes, yet I let the threads access it in my code, so something unexpected must be happening behind the scenes.
To correct it, just use the standard-library Queue instead of multiprocessing.Queue.