I have the following code:
final = []
with futures.ThreadPoolExecutor(max_workers=self.number_threads) as executor:
    _futures = [executor.submit(self.get_attribute, listing,
                                self.proxies[listings.index(listing) % len(self.proxies)])
                for listing in listings]
    for result in futures.as_completed(_futures):
        try:
            listing = result.result()
            final.append(listing)
        except Exception as e:
            print traceback.format_exc()
return final
The self.get_attribute function that's submitted to the executor takes a dictionary and a proxy as input, makes one or two HTTP requests to get some data, and returns an edited dictionary. The problem is that the workers/threads hang towards the end of completing all the submitted tasks. If I submit 400 dictionaries, it will complete ~380 tasks and then hang. If I submit 600, it will complete ~570-580. However, if I submit 25, it will complete all of them. I'm not sure at what threshold it goes from finishing to not finishing.
I have also tried using a queue and threading system like this:
def _get_attribute_thread(self):
    while self.q.not_empty:
        job = self.q.get()
        listing = job['listing']
        proxy = job['proxy']
        self.threaded_results.put(self.get_attribute(listing, proxy))
        self.q.task_done()

def _get_attributes_threaded_with_proxies(self, listings):
    for listing in listings:
        self.q.put({'listing': listing, 'proxy': self.proxies[listings.index(listing) % len(self.proxies)]})
    for _ in xrange(self.number_threads):
        thread = threading.Thread(target=self._get_attribute_thread)
        thread.daemon = True
        thread.start()
    self.q.join()
    final = []
    while self.threaded_results.not_empty:
        final.append(self.threaded_results.get())
    return final
However, the result is the same. What can I do to fix/debug the problem? Thanks in advance.
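For what it's worth, two things that usually expose this kind of hang: give every HTTP request inside get_attribute an explicit timeout= (a requests call without a timeout can block forever on a dead proxy), and dump the live thread stacks once the pool stalls to see exactly where the workers are stuck. A minimal, version-agnostic sketch of the stack dump (get_attribute itself is not shown here, and the helper name is made up):

import sys
import traceback

def dump_thread_stacks():
    """Print the current stack of every live thread (works on Python 2 and 3).
    Call this from the main thread when the pool appears stuck to see which
    line the workers are blocked on."""
    for thread_id, frame in sys._current_frames().items():
        sys.stderr.write('--- thread %s ---\n' % thread_id)
        sys.stderr.write(''.join(traceback.format_stack(frame)))

If every stuck worker turns out to be sitting in a socket read inside requests, the missing timeout is almost certainly the culprit.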
The situation is that sometimes a request does not load or gets stuck in Python. If that happens, or any other error occurs, I would like to retry it "n" times, waiting up to a maximum of 3 seconds for each attempt, and once the attempts are exhausted print a message like f"Could not process {type_1} and {type_2}". Everything runs in parallel with concurrent.futures. Could you help me with that?
import requests
import concurrent.futures
import json

data = [['PEN','USD'],['USD','EUR']]

def currency(element):
    type_1 = element[0]
    type_2 = element[1]
    s = requests.Session()
    url = f'https://usa.visa.com/cmsapi/fx/rates?amount=1&fee=0&utcConvertedDate=07%2F26%2F2022&exchangedate=07%2F26%2F2022&fromCurr={type_1}&toCurr={type_2}'
    a = s.get(url)
    response = json.loads(a.text)  # parse the response body, not the Response object
    value = response["convertedAmount"]
    return value

with concurrent.futures.ProcessPoolExecutor() as executor:
    results = executor.map(currency, data)
    for value in results:
        print(value)
Your code is almost there. Here, I modified a few things:
from concurrent.futures import ThreadPoolExecutor
import time

import requests

def convert_currency(tup):
    from_currency, to_currency = tup
    url = (
        "https://usa.visa.com/cmsapi/fx/rates?amount=1&fee=0"
        "&utcConvertedDate=07%2F26%2F2022&exchangedate=07%2F26%2F2022&"
        f"fromCurr={from_currency}&toCurr={to_currency}"
    )
    session = requests.Session()
    for _ in range(3):
        try:
            response = session.get(url, timeout=3)
            if response.ok:
                return response.json()["convertedAmount"]
        except requests.exceptions.ConnectTimeout:
            time.sleep(3)
    return f"Could not process {from_currency} and {to_currency}"

data = [["VND", "XYZ"], ['PEN','USD'], ["ABC", "XYZ"], ['USD','EUR'], ["USD", "XXX"]]

with ThreadPoolExecutor() as executor:
    results = executor.map(convert_currency, data)
    for value in results:
        print(value)
Notes
The request is retried 3 times (see the for loop)
Use timeout= to specify the timeout (in seconds)
The .ok attribute tells you whether the call was successful
No need to import json, as the response object can decode JSON itself with its .json() method
You might experiment with both ThreadPoolExecutor and ProcessPoolExecutor to see which one performs better
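Note that except requests.exceptions.ConnectTimeout only retries failures to connect; a read timeout or a dropped connection would still escape the loop and kill that task. If you want to retry on any request error, as the question describes, a slightly wider net could look like this (just a sketch of the same loop):

import time
import requests

def convert_currency_any_error(tup, retries=3, wait=3):
    from_currency, to_currency = tup
    url = (
        "https://usa.visa.com/cmsapi/fx/rates?amount=1&fee=0"
        "&utcConvertedDate=07%2F26%2F2022&exchangedate=07%2F26%2F2022&"
        f"fromCurr={from_currency}&toCurr={to_currency}"
    )
    session = requests.Session()
    for _ in range(retries):
        try:
            response = session.get(url, timeout=3)
            if response.ok:
                return response.json()["convertedAmount"]
        except requests.exceptions.RequestException:
            # Base class for ConnectTimeout, ReadTimeout, ConnectionError, ...
            pass
        time.sleep(wait)
    return f"Could not process {from_currency} and {to_currency}"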
I have this snippet
config = {10: 'https://www.youtube.com/', 5: 'https://www.youtube.com/', 7: 'https://www.youtube.com/',
3: 'https://sportal.com/', 11: 'https://sportal.com/'}
def test(arg):
    for key in arg.keys():
        requests.get(arg[key], timeout=key)

test(config)
This way, things happen synchronously. I want to do it asynchronously: iterate through the loop without waiting for the response for each address and move on to the next one, until I have gone through all the addresses in the dictionary. Then I want to wait until I get all the responses for all the addresses, and only after that return from the test function. I know I can do this with threading, but I read that it can be done better with the asyncio library; however, I couldn't implement it. If anyone has even better suggestions, I am open to them. Here is my try:
async def test(arg):
    loop = asyncio.get_event_loop()
    tasks = [loop.run_in_executor(requests.get(arg[key], timeout=key) for key in arg.keys())]
    await asyncio.gather(*tasks)

asyncio.run(test(config))
Here is the solution:
def addresses(adr, to):
    requests.get(adr, timeout=to)

async def test(arg):
    loop = asyncio.get_event_loop()
    tasks = [loop.run_in_executor(None, addresses, arg[key], key) for key in arg.keys()]
    await asyncio.gather(*tasks)

asyncio.run(test(config))
Now the requests run concurrently through the asyncio library instead of hand-written threading (under the hood, run_in_executor(None, ...) still hands each call to asyncio's default thread pool).
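If you want to drop the thread pool entirely, a fully asynchronous variant would use an async HTTP client. A minimal sketch with aiohttp (assuming it is installed; the config dict mirrors the one above):

import asyncio
import aiohttp

config = {10: 'https://www.youtube.com/', 5: 'https://www.youtube.com/',
          3: 'https://sportal.com/'}

async def fetch(session, url, timeout):
    # Each request gets its own timeout, mirroring the timeout=key idea above
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
        return await resp.text()

async def test(arg):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url, key) for key, url in arg.items()]
        # return_exceptions=True keeps one timeout from cancelling the rest
        return await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(test(config))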
Some good answers here. I had trouble with this myself (I do a lot of web scraping), so I created a package to help me, async-scrape (https://pypi.org/project/async-scrape/).
It supports GET and POST. I tried to make it as easy to use as possible. You just need to specify a handler function for the response when you instantiate, and then use the scrape_all method to do the work.
It uses the term scrape because I've built in some handlers for common errors that come up when scraping websites.
It can also do things like limit the call rate if you find you're getting blocked.
An example of its use is:
# Create an instance
from async_scrape import AsyncScrape
def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

resps = async_Scrape.scrape_all(urls)
To do this inside a loop, I collect the results, add them to a set, and pop off the ones that have already been processed. E.g.:
from async_scrape import AsyncScrape
from bs4 import BeautifulSoup as bs
def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    soup = bs(html, "html.parser")
    new_urls = [a.get("href") for a in soup.find_all("a", {"class": "new_link_on_website"})]
    return [new_urls, resp]

async_scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={}
)

# Run the loop
urls = set(["https://initial_webpage.com/"])
processed = set()
all_resps = []

while len(urls):
    resps = async_scrape.scrape_all(urls)
    # Separate successful and failed requests
    success_reqs = set([
        r["req"] for r in resps
        if not r["error"]
    ])
    errored_reqs = set([
        r["req"] for r in resps
        if r["error"]
    ])
    # Get what you want from the responses
    for r in success_reqs:
        # Add found urls to urls
        urls |= set(r["func_resp"][0])  # "func_resp" is the key to the return from your handler function
        # Collect the response
        all_resps.extend(r["func_resp"][1])
        # Add to processed urls
        processed.add(r["url"])  # "url" is the key to the url from the response
    # Remove processed urls
    urls = urls - processed
I am using Python 3.9 with imaplib in order to retrieve emails and scrape links from them. It works fine but can become quite slow for large amounts of email (I'm doing ~40,000). To speed it up I'd like to use some concurrency so I can get all the emails at once.
To do this I get the IDs of all the emails beforehand, then assign each ID to a task in my pool. I close the previously used imaplib connection before scrape_link_mp() is called. I have tried using a lock and a manager lock, but I still get the same error.
Am I missing something fundamental here? Let me know if anything else needs to be explained, thanks.
My code looks like this:
def scrape_link_mp(self):
    self.file_counter = 0
    self.login_session.logout()
    self.Manager = multiprocessing.Manager()
    self.lock = self.Manager.Lock()
    futures = []
    with concurrent.futures.ProcessPoolExecutor() as Executor:
        for self.num_message in self.arr_of_emails[self.start_index:]:
            task_params = self.current_user, self.current_password, self.counter, self.imap_url, self.num_message, self.substring_filter, self.link_regex, self.lock
            futures.append(
                Executor.submit(
                    self.scrape_link_from_email_single,
                    *task_params
                )
            )
        for future in concurrent.futures.as_completed(futures):
            self.counter += 1
            self.timestamp = time.strftime('%H:%M:%S')
            print(f'[{self.timestamp}] DONE: {self.counter}/{len(self.num_mails)}')
            print(future.result())

def scrape_link_from_email_single(self, current_user, current_password, counter, imap_url, num_message, substring_filter, link_regex, lock):
    login_session_mp.logout()
    current_user_mp = self.current_user
    current_password_mp = self.current_password
    self.lock.acquire()
    login_session_mp = imaplib.IMAP4_SSL(self.imap_url, 993)
    login_session_mp.login(current_user_mp, current_password_mp)
    self.search_mail_status, self.amount_matching_criteria = login_session_mp.search(Mail.CHARSET, search_criteria)
    _, individual_response_data = login_session_mp.fetch(self.num_message, '(RFC822)')
    self.lock().release
    raw = email.message_from_bytes(individual_response_data[0][1])
    scraped_email_value = str(email.message_from_bytes(Mail.scrape_email(raw)))
    print(scraped_email_value)
    returned_links = str(link_regex.findall(scraped_email_value))
    for i in returned_links:
        if substring_filter:
            self.lock.acquire()
            with open('out.txt', 'a+') as link_file:
                link_file.write(i + '\n')
                link_file.close()
            self.lock.release()
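For reference, one way to avoid sharing self, the Manager lock, and the IMAP connection across processes at all is to give each worker process its own connection through the executor's initializer. A rough sketch with hypothetical names, not a drop-in replacement for the code above:

import concurrent.futures
import imaplib

_session = None  # one connection per worker process

def _init_worker(imap_url, user, password):
    # Runs once in each worker process when the pool starts it
    global _session
    _session = imaplib.IMAP4_SSL(imap_url, 993)
    _session.login(user, password)

def _fetch_one(num_message):
    # Uses the per-process connection; no lock is needed because
    # nothing is shared between processes
    _, data = _session.fetch(num_message, '(RFC822)')
    return data[0][1]

def fetch_all(imap_url, user, password, message_ids):
    with concurrent.futures.ProcessPoolExecutor(
            initializer=_init_worker,
            initargs=(imap_url, user, password)) as executor:
        return list(executor.map(_fetch_one, message_ids))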
I'm considering a fan-out proxy in Tornado that queries multiple backend servers, with the possible use-case of not waiting for all responses before returning.
Is there a problem with the remaining futures if you use a WaitIterator but stop waiting after receiving a useful response?
Perhaps the results of the other futures will not be cleaned up? Perhaps callbacks could be added to any remaining futures to discard their results?
#!./venv/bin/python
from tornado import gen
from tornado import httpclient
from tornado import ioloop
from tornado import web
import json

class MainHandler(web.RequestHandler):
    @gen.coroutine
    def get(self):
        r1 = httpclient.HTTPRequest(
            url="http://apihost1.localdomain/api/object/thing",
            connect_timeout=4.0,
            request_timeout=4.0,
        )
        r2 = httpclient.HTTPRequest(
            url="http://apihost2.localdomain/api/object/thing",
            connect_timeout=4.0,
            request_timeout=4.0,
        )
        http = httpclient.AsyncHTTPClient()
        wait = gen.WaitIterator(
            r1=http.fetch(r1),
            r2=http.fetch(r2)
        )
        while not wait.done():
            try:
                reply = yield wait.next()
            except Exception as e:
                print("Error {} from {}".format(e, wait.current_future))
            else:
                print("Result {} received from {} at {}".format(
                    reply, wait.current_future, wait.current_index))
                if reply.code == 200:
                    result = json.loads(reply.body)
                    self.write(json.dumps(dict(result, backend=wait.current_index)))
                    return

def make_app():
    return web.Application([
        (r'/', MainHandler)
    ])

if __name__ == '__main__':
    app = make_app()
    app.listen(8888)
    ioloop.IOLoop.current().start()
So I've checked through the source for WaitIterator.
It tracks the futures by adding a callback to each one; when a callback fires, the iterator either queues the result or (if you've already called next()) fulfils the future it has handed to you.
As the future you wait on only gets created by calling .next(), it appears you can exit out of the while not wait.done() and not leave any futures without observers.
Reference counting ought to allow the WaitIterator instance to remain until after all the futures have fired their callbacks and then be reclaimed.
Update 2017/08/02
Having tested further by subclassing WaitIterator with extra logging: yes, the iterator will be cleaned up when all the futures return, but if any of those futures returns an exception, it will be logged that the exception was never observed.
ERROR:tornado.application:Future exception was never retrieved: HTTPError: HTTP 599: Timeout while connecting
In summary and answering my question: completing the WaitIterator isn't necessary from a clean-up point of view, but it is probably desirable to do so from a logging point of view.
If you wanted to be sure, passing the wait iterator to a new coroutine that finishes consuming it, and adding an observer to that, may suffice. For example:
@gen.coroutine
def complete_wait_iterator(wait):
    rounds = 0
    while not wait.done():
        rounds += 1
        try:
            reply = yield wait.next()
        except Exception as e:
            print("Not needed Error {} from {}".format(e, wait.current_future))
        else:
            print("Not needed result {} received from {} at {}".format(
                reply, wait.current_future, wait.current_index))
    log.info('completer finished after {n} rounds'.format(n=rounds))

class MainHandler(web.RequestHandler):
    @gen.coroutine
    def get(self):
        r1 = httpclient.HTTPRequest(
            url="http://apihost1.localdomain/api/object/thing",
            connect_timeout=4.0,
            request_timeout=4.0,
        )
        r2 = httpclient.HTTPRequest(
            url="http://apihost2.localdomain/api/object/thing",
            connect_timeout=4.0,
            request_timeout=4.0,
        )
        http = httpclient.AsyncHTTPClient()
        wait = gen.WaitIterator(
            r1=http.fetch(r1),
            r2=http.fetch(r2)
        )
        while not wait.done():
            try:
                reply = yield wait.next()
            except Exception as e:
                print("Error {} from {}".format(e, wait.current_future))
            else:
                print("Result {} received from {} at {}".format(
                    reply, wait.current_future, wait.current_index))
                if reply.code == 200:
                    result = json.loads(reply.body)
                    self.write(json.dumps(dict(result, backend=wait.current_index)))
                    consumer = complete_wait_iterator(wait)
                    consumer.add_done_callback(lambda f: f.exception())
                    return
I'm working on a simple scraper that crawls a YouTube video's comment page. The crawler uses Ajax to go through every comment on the page and then saves them to a JSON file. Even with a small number of comments (< 10), it still takes 3+ minutes for the comments to be parsed.
I've tried adding requests-cache and using ujson instead of json to see if there is any benefit, but there's no noticeable difference.
Here's the code I'm using currently:
import os
import sys
import time
import ujson
import requests
import requests_cache
import argparse
import lxml.html

requests_cache.install_cache('comment_cache')

from lxml.cssselect import CSSSelector

YOUTUBE_COMMENTS_URL = 'https://www.youtube.com/all_comments?v={youtube_id}'
YOUTUBE_COMMENTS_AJAX_URL = 'https://www.youtube.com/comment_ajax'

def find_value(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('"', pos_begin)
    return html[pos_begin: pos_end]

def extract_comments(html):
    tree = lxml.html.fromstring(html)
    item_sel = CSSSelector('.comment-item')
    text_sel = CSSSelector('.comment-text-content')
    photo_sel = CSSSelector('.user-photo')
    for item in item_sel(tree):
        yield {'cid': item.get('data-cid'),
               'name': item.get('data-name'),
               'ytid': item.get('data-aid'),
               'text': text_sel(item)[0].text_content(),
               'photo': photo_sel(item)[0].get('src')}

def extract_reply_cids(html):
    tree = lxml.html.fromstring(html)
    sel = CSSSelector('.comment-replies-header > .load-comments')
    return [i.get('data-cid') for i in sel(tree)]

def ajax_request(session, url, params, data, retries=10, sleep=20):
    for _ in range(retries):
        response = session.post(url, params=params, data=data)
        if response.status_code == 200:
            response_dict = ujson.loads(response.text)
            return response_dict.get('page_token', None), response_dict['html_content']
        else:
            time.sleep(sleep)

def download_comments(youtube_id, sleep=1, order_by_time=True):
    session = requests.Session()

    # Get Youtube page with initial comments
    response = session.get(YOUTUBE_COMMENTS_URL.format(youtube_id=youtube_id))
    html = response.text
    reply_cids = extract_reply_cids(html)

    ret_cids = []
    for comment in extract_comments(html):
        ret_cids.append(comment['cid'])
        yield comment

    page_token = find_value(html, 'data-token')
    session_token = find_value(html, 'XSRF_TOKEN', 4)

    first_iteration = True

    # Get remaining comments (the same as pressing the 'Show more' button)
    while page_token:
        data = {'video_id': youtube_id,
                'session_token': session_token}
        params = {'action_load_comments': 1,
                  'order_by_time': order_by_time,
                  'filter': youtube_id}
        if order_by_time and first_iteration:
            params['order_menu'] = True
        else:
            data['page_token'] = page_token

        response = ajax_request(session, YOUTUBE_COMMENTS_AJAX_URL, params, data)
        if not response:
            break

        page_token, html = response
        reply_cids += extract_reply_cids(html)
        for comment in extract_comments(html):
            if comment['cid'] not in ret_cids:
                ret_cids.append(comment['cid'])
                yield comment

        first_iteration = False
        time.sleep(sleep)

    # Get replies (the same as pressing the 'View all X replies' link)
    for cid in reply_cids:
        data = {'comment_id': cid,
                'video_id': youtube_id,
                'can_reply': 1,
                'session_token': session_token}
        params = {'action_load_replies': 1,
                  'order_by_time': order_by_time,
                  'filter': youtube_id,
                  'tab': 'inbox'}
        response = ajax_request(session, YOUTUBE_COMMENTS_AJAX_URL, params, data)
        if not response:
            break
        _, html = response
        for comment in extract_comments(html):
            if comment['cid'] not in ret_cids:
                ret_cids.append(comment['cid'])
                yield comment
        time.sleep(sleep)

def main(argv):
    parser = argparse.ArgumentParser(add_help=False, description=('Download Youtube comments without using the Youtube API'))
    parser.add_argument('--help', '-h', action='help', default=argparse.SUPPRESS, help='Show this help message and exit')
    parser.add_argument('--youtubeid', '-y', help='ID of Youtube video for which to download the comments')
    parser.add_argument('--output', '-o', help='Output filename (output format is line delimited JSON)')
    parser.add_argument('--timeorder', '-t', action='store_true', help='Download Youtube comments ordered by time')

    try:
        args = parser.parse_args(argv)
        youtube_id = args.youtubeid
        output = args.output
        start_time = time.time()

        if not youtube_id or not output:
            parser.print_usage()
            raise ValueError('you need to specify a Youtube ID and an output filename')

        print 'Downloading Youtube comments for video:', youtube_id
        count = 0
        with open(output, 'wb') as fp:
            for comment in download_comments(youtube_id, order_by_time=bool(args.timeorder)):
                print >> fp, ujson.dumps(comment, escape_forward_slashes=False)
                count += 1
                sys.stdout.write('Downloaded %d comment(s)\r' % count)
                sys.stdout.flush()
        elapsed_time = time.time() - start_time
        print '\nDone! Elapsed time (seconds):', elapsed_time
    except Exception, e:
        print 'Error:', str(e)
        sys.exit(1)

if __name__ == "__main__":
    main(sys.argv[1:])
I'm new to Python so I'm not sure where the bottlenecks are. The finished script will be used to parse through 100,000+ comments so performance is a large factor.
Would using multithreading solve the issue? And if so how would I refactor this to benefit from it?
Is this strictly a network issue?
Yes, multithreading will speed up the process. Run the network operations (i.e. the downloading) in separate threads.
Yes, it is a network-related issue.
Your requests are I/O bound. You make a request to YouTube, and it takes some time to get back the response; that delay depends mostly on the network, and you can't make an individual request faster. However, you can use threads to send multiple requests in parallel. That will not make any single request faster, but you will process more of them in less time.
Threading tutorial:
https://pymotw.com/2/threading/
http://www.tutorialspoint.com/python/python_multithreading.htm
An example somewhat similar to your task -- http://www.toptal.com/python/beginners-guide-to-concurrency-and-parallelism-in-python
Also, since you will be doing a lot of scraping and processing, I would recommend using something like Scrapy; I personally use it for these kinds of tasks.
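To make the threading suggestion concrete, here is a minimal sketch (not a drop-in change to the script above) of fetching the reply threads in parallel with concurrent.futures, where fetch_reply_html is a hypothetical helper wrapping the script's existing ajax_request call:

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_reply_html(cid):
    # Hypothetical helper: build the same data/params dicts the script already
    # uses for one reply thread and return the html from ajax_request(...)
    ...

def fetch_all_replies(reply_cids, max_workers=8):
    """Fetch every reply thread concurrently instead of one at a time."""
    pages = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_reply_html, cid): cid for cid in reply_cids}
        for future in as_completed(futures):
            cid = futures[future]
            try:
                pages.append((cid, future.result()))
            except Exception as exc:
                print('reply thread %s failed: %s' % (cid, exc))
    return pages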
Making multiple requests at once will speed up the process, but if it's taking 3 minutes to parse 10 comments you have some other issue, and parsing 100,000 comments will take days. Unless there's a pressing reason to use lxml, I'd suggest you look at BeautifulSoup and let it provide you with lists of the comment tags and their text content rather than doing it yourself. I'm guessing most of the slowness is in lxml transforming the content you pass to it and then in your manual counting to find positions in a string. I'm also suspicious of the calls to sleep: what are those for?
Assuming this
print >> fp, ujson.dumps(comment, escape_forward_slashes=False)
count += 1
sys.stdout.write('Downloaded %d comment(s)\r' % count)
is just for debugging, move it into download_comments and use logging so you can turn it on and off. Dumping each individual comment to JSON as you go is going to be slow; you might want to start dumping these into a database now to avoid that. And re-examine why you're doing things one comment at a time: BeautifulSoup should give you a full list of comments and their text with each page load, so you can handle them in batches, which will be handy once you start parsing larger groups.
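For illustration only, a BeautifulSoup version of the script's extract_comments could look roughly like this (assuming bs4 is installed and the page still uses the same CSS classes):

from bs4 import BeautifulSoup

def extract_comments_bs(html):
    """Yield the same dicts as the original extract_comments, using BeautifulSoup."""
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.select('.comment-item'):
        yield {'cid': item.get('data-cid'),
               'name': item.get('data-name'),
               'ytid': item.get('data-aid'),
               'text': item.select_one('.comment-text-content').get_text(),
               'photo': item.select_one('.user-photo').get('src')}

Whether this is actually faster than the lxml selectors is worth timing on a real page before committing to the switch.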