I have some code that retrieves news results from this newspaper using a query and a time frame (which could be up to a year).
The results are paginated at 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then retrieve the title, URL and date of each article. Each cycle (the HTTP request and the parsing) takes 30 seconds to a minute, which is extremely slow, and eventually the server stops responding with a 500 error. I am wondering if there is a way to speed it up, or maybe make multiple requests at once. I simply want to retrieve the article details from all the pages.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
import csv

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'

def run(**params):
    countryFile = open("EgyptDaybyDay.csv", "a")
    i = 1
    results = True
    while results:
        params["index"] = str(i)
        response = requests.get(URL.format(**params))
        print response.status_code
        htmlFile = BeautifulSoup(response.content)
        articles = htmlFile.findAll("div", {"class": "newslist"})

        for article in articles:
            url = (article.a['href']).encode('utf-8', 'ignore')
            title = (article.img['alt']).encode('utf-8', 'ignore')
            dateline = article.find("div", {"class": "floatright"})
            m = re.search("([0-9]{2}\-[0-9]{2}\-[0-9]{4})", dateline.string)
            date = m.group(1)
            w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            w.writerow((date, title, url))

        if not articles:
            results = False
        i += 1
    countryFile.close()

run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")
This is a good opportunity to try out gevent.
You should have a separate routine for the requests.get part so that your application doesn't have to wait on blocking IO.
You can then spawn multiple workers and have queues to pass requests and articles around.
Maybe something similar to this:
import gevent.monkey
from gevent.queue import Queue
from gevent import sleep

gevent.monkey.patch_all()

MAX_REQUESTS = 10

requests = Queue(MAX_REQUESTS)
articles = Queue()

mock_responses = range(100)
mock_responses.reverse()

def request():
    print "worker started"
    while True:
        print "request %s" % requests.get()
        sleep(1)

        try:
            articles.put('article response %s' % mock_responses.pop())
        except IndexError:
            articles.put(StopIteration)
            break

def run():
    print "run"

    i = 1
    while True:
        requests.put(i)
        i += 1

if __name__ == '__main__':
    for worker in range(MAX_REQUESTS):
        gevent.spawn(request)

    gevent.spawn(run)

    for article in articles:
        print "Got article: %s" % article
The most probable slowdown is the server, so parallelising the HTTP requests is the best way to make the code run faster, although there's very little you can do to speed up the server response itself. There's a good tutorial over at IBM for doing exactly this.
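As a rough sketch of that idea (not the tutorial's code), the standard-library thread pool can fetch several pages at once; the page range and search parameters below are simply the ones from the question:

import requests
from concurrent.futures import ThreadPoolExecutor

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'

def fetch_page(index):
    # Build the URL for one result page and fetch it.
    return requests.get(URL.format(index=str(index), query="Egypt",
                                   datefrom="12-01-2010", dateto="12-01-2011"))

# Fetch pages 1-10 with at most 5 requests in flight at a time.
with ThreadPoolExecutor(max_workers=5) as pool:
    responses = list(pool.map(fetch_page, range(1, 11)))

Each response can then be parsed with BeautifulSoup exactly as in the original loop.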
It seems to me that you're looking for a feed, which that newspaper doesn't advertise. However, it's a problem that has been solved before: there are many sites that will generate feeds for an arbitrary website, thus solving at least one of your problems. Some of these require some human guidance, and others offer less opportunity for tweaking and are more automatic.
If you can at all avoid doing the pagination and parsing yourself, I'd recommend it. If you cannot, I second the use of gevent for simplicity. That said, if they're sending you back 500s, your code is likely less of an issue, and added parallelism may not help.
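If the 500s are the server pushing back, a gentler option than parallelism is to retry with a growing delay. A minimal sketch (the retry count and delays are arbitrary):

import time
import requests

def get_with_backoff(url, retries=4, base_delay=2.0):
    # Retry on 5xx responses, doubling the wait between attempts.
    for attempt in range(retries):
        response = requests.get(url)
        if response.status_code < 500:
            return response
        time.sleep(base_delay * (2 ** attempt))
    return response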
You can try making all the calls asynchronously.
Have a look at this:
http://pythonquirks.blogspot.in/2011/04/twisted-asynchronous-http-request.html
You could use gevent as well rather than Twisted; I'm just mentioning the options.
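If you pick gevent, a minimal sketch of a worker pool over real HTTP requests might look like this (the URLs and pool size are placeholders):

from gevent import monkey
monkey.patch_all()  # patch sockets so requests cooperates with gevent

import requests
from gevent.pool import Pool

def fetch(url):
    return requests.get(url).status_code

urls = ['http://www.gulf-times.com/' for _ in range(10)]  # placeholder URLs
pool = Pool(5)  # at most 5 concurrent requests
print(pool.map(fetch, urls))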
This might very well come close to what you're looking for.
Ideal method for sending multiple HTTP requests over Python? [duplicate]
Source code:
https://github.com/kennethreitz/grequests
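A minimal grequests sketch (the URLs below are placeholders):

import grequests

urls = ['http://www.gulf-times.com/' for _ in range(10)]  # placeholder URLs
pending = (grequests.get(u) for u in urls)
responses = grequests.map(pending, size=5)  # at most 5 concurrent requests
for response in responses:
    if response is not None:
        print(response.status_code)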
Related
For an upcoming project this year, I wanted to look into some languages that I haven't really used yet, but that repeatedly catch my interest. Nim is one of them 😊.
I wrote the following code to make async requests:
import asyncdispatch, httpclient, strformat, times, strutils

let urls = newHttpClient().getContent("https://gist.githubusercontent.com/tobealive/b2c6e348dac6b3f0ffa150639ad94211/raw/31524a7aac392402e354bced9307debd5315f0e8/100-popular-urls.txt").splitLines()[0..99]

proc getHttpResp(client: AsyncHttpClient, url: string): Future[string] {.async.} =
  try:
    result = await client.getContent(url)
    echo &"{url} - response length: {len(result)}"
  except Exception as e:
    echo &"Error: {url} - {e.name}"

proc requestUrls(urls: seq[string]) {.async.} =
  let start = epochTime()
  echo "Starting requests..."

  var futures: seq[Future[string]]
  for url in urls:
    var client = newAsyncHttpClient()
    futures.add client.getHttpResp(&"http://www.{url}")

  for i in 0..urls.len-1:
    discard await futures[i]

  echo &"Requested {len(urls)} websites in {epochTime() - start}."

waitFor requestUrls(urls)
Results doing some loops:
Iterations: 10. Total errors: 94.
Average time to request 100 websites: 9.98s.
The finished application will only request from a single resource. So, for example, when requesting Google search queries (for simplicity, just the numbers from 1 to 100), the results look like:
Iterations: 1. Total errors: 0.
Time to request 100 google searches: 3.75s.
Compared to Python, there are still significant differences:
import asyncio, time, requests
from aiohttp import ClientSession

urls = requests.get(
    "https://gist.githubusercontent.com/tobealive/b2c6e348dac6b3f0ffa150639ad94211/raw/31524a7aac392402e354bced9307debd5315f0e8/100-popular-urls.txt"
).text.split('\n')

async def getHttpResp(url: str, session: ClientSession):
    try:
        async with session.get(url) as resp:
            result = await resp.read()
            print(f"{url} - response length: {len(result)}")
    except Exception as e:
        print(f"Error: {url} - {e.__class__}")

async def requestUrls(urls: list[str]):
    start = time.time()
    print("Starting requests...")

    async with ClientSession() as session:
        await asyncio.gather(*[getHttpResp(f"http://www.{url}", session) for url in urls])

    print(f"Requested {len(urls)} websites in {time.time() - start}.")

# await requestUrls(urls) # jupyter
asyncio.run(requestUrls(urls))
Results:
Iterations: 10. Total errors: 10.
Average time to request 100 websites: 7.92s.
When requesting only google search queries:
Iterations: 1. Total errors: 0.
Time to request 100 google searches: 1.38s.
Additionally: the difference in response time remains when comparing a single response to an individual URL and just getting the response status code.
(I'm not big into Python, but when using it, it's often impressive what its C libraries deliver.)
To improve the Nim code, I thought it might be worth trying to add channels and multiple clients (this is from a still very limited point of view on my second day of programming in Nim + generally not having a lot of experience with concurrent requests). But I haven't really figured out how to get it to work.
Doing a lot of requests to the same endpoint in the Nim example (e.g. the Google searches) may also result in a Too Many Requests error if that number of searches is performed repeatedly. In Python this doesn't seem to be the case.
So it would be great if you could share your approach on what can be done to improve the response quota and request time!
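(In the Python version, capping the number of in-flight requests is straightforward with a semaphore; a rough sketch of what I mean, with an arbitrary cap of 10. Doing something similar in Nim with channels is what I haven't figured out yet.)

import asyncio
from aiohttp import ClientSession

async def getHttpRespLimited(url: str, session: ClientSession, limit: asyncio.Semaphore):
    # At most 10 requests run at once; the rest wait their turn.
    async with limit:
        async with session.get(url) as resp:
            return await resp.read()

async def requestUrlsLimited(urls: list[str]):
    limit = asyncio.Semaphore(10)  # arbitrary cap on concurrent requests
    async with ClientSession() as session:
        return await asyncio.gather(*[getHttpRespLimited(f"http://www.{url}", session, limit) for url in urls])

# asyncio.run(requestUrlsLimited(urls))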
If anyone wants a repo for cloning and tinkering, this one contains the example with the loop:
https://github.com/tobealive/nim-async-requests-example
I tried to remember how Nim's async works, and unfortunately I can see no real issue in your code. Compiling with -d:release does not seem to make a big difference. One idea is the timeout, which may be different for Python. From https://nim-lang.org/docs/httpclient.html#timeouts we learn that there is no timeout for async, so a very slow page may keep the connection open for a long time. Maybe Python does time out? I was not able to test the Python module; aiohttp is missing on my box. Below is a test of mine, not that different from yours. I made main() not async by using waitFor all(f). Sorry that I could not really help you; maybe you should really try the Chronos variant.
# nim r -d:ssl -d:release t.nim
import std/[asyncdispatch, httpclient, strutils, strformat, times]

const
  UrlSource = "https://gist.githubusercontent.com/tobealive/" &
    "b2c6e348dac6b3f0ffa150639ad94211/raw/31524a7aac392402e354bced9307debd5315f0e8/" &
    "100-popular-urls.txt"

proc getHttpResp(client: AsyncHttpClient, url: string): Future[string] {.async.} =
  try:
    result = await client.getContent(url)
    echo &"{url} - response length: {len(result)}"
  except Exception as e:
    echo &"Error: {url} - {e.name}"

proc main =
  let start = epochTime()
  echo "Starting requests..."
  var urls = newHttpClient().getContent(UrlSource).splitLines
  if urls.len > 100: # in case that there are more than 100, clamp it
    urls.setLen(100)
  # urls.setLen(3) # for fast tests with only a few urls
  var f: seq[Future[string]]
  for url in urls:
    let client = newAsyncHttpClient()
    f.add(client.getHttpResp(&"http://www.{url}"))
  let res: seq[string] = waitFor all(f)
  for x in res:
    echo x.len
  echo fmt"Requested {len(urls)} websites in {epochTime() - start:.2f} seconds."

main()
Testing with an extended version of the above program, I get the feeling that the total transfer rate is just limited to a few MB/s, and my idea about timeouts was very wrong. I did some Google searching on the topic but was not able to find much useful info. As you already wrote in your initial post, Nim's async from the standard library is not parallel, but it is (theoretically) possible to use it with multiple threads. When I have more free time, I may do a test with Chronos.
# nim r -d:ssl -d:release t.nim
import std/[asyncdispatch, httpclient, strutils, strformat, times]

const
  UrlSource = "https://gist.githubusercontent.com/tobealive/" &
    "b2c6e348dac6b3f0ffa150639ad94211/raw/31524a7aac392402e354bced9307debd5315f0e8/" &
    "100-popular-urls.txt"

proc getHttpResp(client: AsyncHttpClient, url: string): Future[string] {.async.} =
  let start = epochTime()
  try:
    result = await client.getContent(url)
    stdout.write &"{url} - response length: {len(result)}"
  except Exception as e:
    stdout.write &"Error: {url} - {e.name}"
  echo fmt" --- Request took {epochTime() - start:.2f} seconds."

proc main =
  var transferred: int = 0
  let start = epochTime()
  echo "Starting requests..."
  var urls = newHttpClient().getContent(UrlSource).splitLines
  if urls.len > 100: # in case that there are more than 100, clamp it
    urls.setLen(100)
  # urls.setLen(3) # for fast tests with only a few urls
  var f: seq[Future[string]]
  for url in urls:
    let client = newAsyncHttpClient()
    f.add(client.getHttpResp(&"http://www.{url}"))
  let res: seq[string] = waitFor all(f)
  for x in res:
    transferred += x.len
  echo fmt"Sum of transferred data: {transferred} bytes. ({transferred.float / (1024 * 1024).float / (epochTime() - start):.2f} MBytes/s)"
  echo fmt"Requested {len(urls)} websites in {epochTime() - start:.2f} seconds."

main()
References:
https://xmonader.github.io/nimdays/day04_asynclinkschecker.html
Obviously I'm still new to Python, as you can see by looking at my code, but I'm failing my way through it.
I am scraping Amazon jobs search results but keep getting a connection reset error 10054 after about 50 requests to the URL. I added a Crawlera proxy network to prevent getting banned, but it's still not working. I know the URL is long, but it seems to work without having to add too many other separate parts to it. The results page has about 12,000 jobs total with 10 jobs per page, so I don't even know if scraping that much data is the problem to begin with. Amazon shows each page in the URL as 'result_limit=10', so I've been going through each page in steps of 10 instead of 1 page per request. Not sure if that's right. Also, the last page stops at 9,990.
The code works, but I'm not sure how to get past the connection error. As you can see, I've added things like a user agent, but I'm not sure if it even does anything. Any help would be appreciated, as I've been stuck on this for countless days and hours. Thanks!
# Imports inferred from the functions used below
import csv
import json
from datetime import datetime
from random import randint
from time import sleep, time
from warnings import warn

from fake_useragent import UserAgent
from IPython.display import clear_output
from requests import get


def get_all_jobs(pages):
    requests = 0
    start_time = time()
    total_runtime = datetime.now()
    for page in pages:
        try:
            ua = UserAgent()
            header = {
                'User-Agent': ua.random
            }
            response = get('https://www.amazon.jobs/en/search.json?base_query=&city=&country=USA&county=&'
                           'facets%5B%5D=location&facets%5B%5D=business_category&facets%5B%5D=category&'
                           'facets%5B%5D=schedule_type_id&facets%5B%5D=employee_class&facets%5B%5D=normalized_location'
                           '&facets%5B%5D=job_function_id&job_function_id%5B%5D=job_function_corporate_80rdb4&'
                           'latitude=&loc_group_id=&loc_query=USA&longitude=&'
                           'normalized_location%5B%5D=Seattle%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=San+Francisco'
                           '%2C+California%2C+USA&normalized_location%5B%5D=Sunnyvale%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Bellevue%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=East+Palo+Alto%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Santa+Monica%2C+California%2C+USA&offset={}&query_options=&'
                           'radius=24km&region=&result_limit=10&schedule_type_id%5B%5D=Full-Time&'
                           'sort=relevant'.format(page),
                           headers=header,
                           proxies={
                               "http": "http://1ea01axxxxxxxxxxxxxxxxxxx:#proxy.crawlera.com:8010/"
                           })

            # Monitor the frequency of requests
            requests += 1

            # Pauses the loop between 8 and 15 seconds
            sleep(randint(8, 15))
            current_time = time()
            elapsed_time = current_time - start_time
            print("Amazon Request:{}; Frequency: {} request/s; Total Run Time: {}".format(
                requests, requests / elapsed_time, datetime.now() - total_runtime))
            clear_output(wait=True)

            # Throw a warning for non-200 status codes
            if response.status_code != 200:
                warn("Request: {}; Status code: {}".format(requests, response.status_code))

            # Break the loop if number of requests is greater than expected
            if requests > 999:
                warn("Number of requests was greater than expected.")
                break

            yield from get_job_infos(response)

        except AttributeError as e:
            print(e)
            continue


def get_job_infos(response):
    amazon_jobs = json.loads(response.text)
    for website in amazon_jobs['jobs']:
        site = website['company_name']
        title = website['title']
        location = website['normalized_location']
        job_link = 'https://www.amazon.jobs' + website['job_path']
        yield site, title, location, job_link


def main():
    # Page range starts from 0 and the middle value increases by 10 each page.
    pages = [str(i) for i in range(0, 9990, 10)]
    with open('amazon_jobs.csv', "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(pages))


if __name__ == "__main__":
    main()
I'm not an expert on Amazon's anti-bot policies, but once they have flagged you, your IP could stay flagged for a while, and they might limit how many similar requests you can make within a certain time frame.
Google for a patch to urllib so you can see the request headers in real time. Besides limits per IP/domain within a certain time frame, Amazon will look at your request headers to determine whether you're human. Compare what you're sending with a regular browser's request headers.
Just follow standard practice: keep cookies for a normal amount of time, use proper referers and a popular user agent.
All of this can be done with the requests library (pip install requests; see the Session object and the sketch below).
It looks like you're sending a request to an internal Amazon URL without a Referer header; that doesn't happen in a normal browser.
Another example: keeping cookies from one user agent and then switching to another is also not what a browser does.
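A rough sketch of that kind of session setup (the header values are only examples):

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # example: one common browser UA, kept constant
    'Referer': 'https://www.amazon.jobs/en/search',             # example referer for the internal search URL
    'Accept-Language': 'en-US,en;q=0.9',
})

# Cookies set by earlier responses are stored on the session and re-sent automatically.
response = session.get('https://www.amazon.jobs/en/search.json',
                       params={'base_query': '', 'result_limit': 10, 'offset': 0})
print(response.status_code)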
I am using python and cookielib to talk to an HTTP server that has its date incorrectly set. I have no control over this server, so fixing its time is not a possibility. Unfortunately, the server's incorrect time messes up cookielib because the cookies appear to be expired.
Interestingly, if I go to the same website with any web browser, the browser accepts the cookie and it gets saved. I assume that modern web browsers come across misconfigured web servers all the time, notice that the Date header is set incorrectly, and adjust cookie expiration dates accordingly.
Has anyone come across this problem before? Is there any way of handling it within Python?
I hacked together a solution that involves live monkey-patching of the cookie handling in urllib/cookielib. Definitely not ideal, but if anyone finds a better way, please let me know:
import cookielib, logging, threading, time, urllib2  # imports inferred from the code below

cook_proc = urllib2.HTTPCookieProcessor(cookielib.LWPCookieJar())

cookie_processing_lock = threading.Lock()

def _process_cookies(request, response):
    '''Process cookies, but do so in a way that can handle servers with bad
    clocks set.'''
    # We do some real monkey hacking here, so put it in a lock.
    with cookie_processing_lock:
        # Get the server date.
        date_header = cookielib.http2time(
            response.info().getheader('Date') or '')

        # Save the old cookie parsing function.
        orig_parse = cookielib.parse_ns_headers

        # If the server date is off by more than an hour, we'll adjust it.
        if date_header:
            off_by = time.time() - date_header
            if abs(off_by) > 3600:
                logging.warning("Server off %.1f hrs."%(abs(off_by)/3600))

                # Create our monkey-patched parsing function.
                def hacked_parse(ns_headers):
                    try:
                        results = orig_parse(ns_headers)
                        for r in results:
                            for r_i, (key, val) in enumerate(r):
                                if key == 'expires':
                                    r[r_i] = key, val + off_by
                                    logging.info("Fixing bad cookie "
                                                 "expiration time for: %s"%r[0][0])
                        logging.info("COOKIE RESULTS: %s", results)
                        return results
                    except Exception as e:
                        logging.error("Problem parse cookie: %s"%e)
                        raise

                cookielib.parse_ns_headers = hacked_parse

        response = cook_proc.http_response(request, response)

        # Make sure we set the cookie processor back.
        cookielib.parse_ns_headers = orig_parse
Edit 2
Second approach: for now, I gave up on using multiple instances and configured the scrapy settings not to use concurrent requests. It's slow but stable. I have opened a bounty. Can anyone help make this work concurrently? If I configure scrapy to run concurrently, I get segmentation faults.
class WebkitDownloader( object ):

    def __init__(self):
        os.environ["DISPLAY"] = ":99"
        self.proxyAddress = "a:b#" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)

    def process_response(self, request, response, spider):
        self.request = request
        self.response = response
        if 'cached' not in response.flags:
            webkitBrowser = webkit.WebkitBrowser(proxy=self.proxyAddress, gui=False, timeout=0.5, delay=0.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt'])
            #print "added to queue: " + str(self.counter)
            webkitBrowser.get(html=response.body, num_retries=0)
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
            webkitBrowser.setPage(None)
            del webkitBrowser
        return response
Edit:
I tried to answer my own question in the meantime and implemented a queue, but it does not run asynchronously for some reason. Basically, when webkitBrowser.get(html=response.body, num_retries=0) is busy, scrapy is blocked until the method has finished. New requests are not assigned to the remaining free instances in self.queue.
Can anyone please point me in the right direction to make this work?
class WebkitDownloader( object ):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.queue = list()
        for i in range(8):
            self.queue.append(webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=5.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt']))

    def process_response(self, request, response, spider):
        i = 0
        for webkitBrowser in self.queue:
            i += 1
            if webkitBrowser.status == "WAITING":
                break
        webkitBrowser = self.queue[i]

        if webkitBrowser.status == "WAITING":
            # load webpage
            print "added to queue: " + str(i)
            webkitBrowser.get(html=response.body, num_retries=0)
            webkitBrowser.scrapyResponse = response

        while webkitBrowser.status == "PROCESSING":
            print "waiting for queue: " + str(i)

        if webkitBrowser.status == "DONE":
            print "fetched from queue: " + str(i)
            #response = webkitBrowser.scrapyResponse
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            #response = response.replace(**kwargs)
            webkitBrowser.status = "WAITING"

        return response
I am using WebKit in a scrapy middleware to render JavaScript. Currently, scrapy is configured to process 1 request at a time (no concurrency).
I'd like to use concurrency (e.g. 8 requests at a time) but then I need to make sure that 8 instances of WebkitBrowser() receive requests based on their individual processing state (a fresh request as soon as WebkitBrowser.get() is done and ready to receive the next request)
How would I achieve that with Python? This is my current middleware:
class WebkitDownloader( object ):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.w = webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=0.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt'])

    def process_response(self, request, response, spider):
        if not ".pdf" in response.url:
            # load webpage
            self.w.get(html=response.body, num_retries=0)
            html = self.w.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)

        return response
I don't follow everything in your question because I don't know scrapy and I don't understand what would cause the segfault, but I think I can address one question: why is scrapy blocked when webkitBrowser.get is busy?
I don't see anything in your "queue" example that would give you the possibility of parallelism. Normally, one would use either the threading or multiprocessing module so that multiple things can run "in parallel". Instead of simply calling webkitBrowser.get, I suspect that you may want to run it in a thread. Retrieving web pages is a case where python threading should work reasonably well. Python can't do two CPU-intensive tasks simultaneously (due to the GIL), but it can wait for responses from web servers in parallel.
Here's a recent SO Q/A with example code that might help.
Here's an idea of how to get you started. Create a Queue. Define a function which takes this queue as an argument, gets the web page and puts the response in the queue. In the main program, enter a while True: loop after spawning all the get threads: check the queue and process the next entry, or time.sleep(.1) if it's empty.
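A minimal sketch of that pattern (the URLs and the plain requests.get fetch are placeholders for your WebKit call):

import threading
import time
from queue import Queue, Empty

import requests

def fetch_into_queue(url, out_queue):
    # Fetch one page and hand the response back to the main loop via the queue.
    out_queue.put(requests.get(url))

urls = ['http://example.com/page%d' % i for i in range(8)]  # placeholder URLs
out_queue = Queue()
threads = [threading.Thread(target=fetch_into_queue, args=(u, out_queue)) for u in urls]
for t in threads:
    t.start()

done = 0
while done < len(urls):
    try:
        response = out_queue.get_nowait()
        done += 1
        # ... render/parse the response here ...
    except Empty:
        time.sleep(0.1)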
I am aware this is an old question, but I had a similar one and hope the information I stumbled upon helps others with the same problem:
1. If scrapyjs + splash works for you (given you are using a webkit browser, it likely does, as splash is webkit-based), it is probably the easiest solution.
2. If 1 does not work, you may be able to run multiple spiders at the same time with scrapyd or do multiprocessing with scrapy (see the sketch after this list).
3. Depending on whether your browser render is primarily waiting (for pages to render), IO-intensive or CPU-intensive, you may want to use non-blocking sleep with twisted, multithreading or multiprocessing. For the latter, the value of sticking with scrapy diminishes and you may want to hack a simple scraper (e.g. the web crawler authored by A. Jesse Jiryu Davis and Guido van Rossum: code and document) or create your own.
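For option 2, a minimal sketch of running several spiders concurrently in one process with Scrapy's CrawlerProcess (the spider classes and project module are hypothetical placeholders from your own project):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical spiders from your own project.
from myproject.spiders import MySpiderA, MySpiderB

process = CrawlerProcess(get_project_settings())
process.crawl(MySpiderA)
process.crawl(MySpiderB)
process.start()  # runs both crawls on the same Twisted reactor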
I am trying to do batch searching: go over a list of strings and print the first address that a Google search returns:
#!/usr/bin/python

import json
import urllib
import time
import pandas as pd

df = pd.read_csv("test.csv")
saved_column = df.Name  # you can also use df['column_name']

for name in saved_column:
    query = urllib.urlencode({'q': name})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    address = data[u'results'][0][u'url']
    print address
I get a 403 error from the server:
'Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors', u'responseStatus': 403
Is what I'm doing is not allowed according to google's terms of service?
I also tried to put time.sleep(5) in the loop but I get the same error.
Thank you in advance
Not allowed by Google's TOS. You really can't scrape Google without them getting angry. It's also a pretty sophisticated blocker, so you can get around it for a little while with random delays, but it fails pretty quickly.
Sorry, you're out of luck on this one.
https://developers.google.com/errors/?csw=1
The Google Search and Language APIs shown to the right have been officially deprecated.
Also
We received automated requests, such as scraping and prefetching. Automated requests are prohibited; all requests must be made as a result of an end-user action.