Scraper getting stuck in request loop - python

So I am trying to build a very basic scraper that pulls information from my server, uses that information to create a link, and then yields a request for it. After parsing that page, it grabs a single link from it and uploads that back to the server with a GET request. The problem I am encountering is that it will pull info from the server, create the link, and yield the request, and then, depending on the response time (which is inconsistent), it will bail out and start over with another GET request to the server. My server logic pulls the next data set that needs to be worked on, and until a course of action is decided for that data set, it keeps serving it, so the spider pulls and parses the same data over and over. I am fairly new to Scrapy and in need of assistance. I know that my code is wrong, but I haven't been able to come up with another approach without changing a lot of server code and creating unnecessary hassle, and unfortunately I am not super savvy with Scrapy or Python.
My start_requests method:
name = "scrapelevelone"
start_urls = []
def start_requests(self):
print("Start Requests is initiatied")
while True:
print("Were looping")
r = requests.get('serverlink.com')
print("Sent request")
pprint(r.text)
print("This is the request response text")
print("Now try to create json object: ")
try:
personObject = json.loads(r.text)
print("Made json object: ")
pprint(personObject)
info = "streetaddress=" + '+'.join(personObject['address1'].split(" ")) + "&citystatezip=" + '+'.join(personObject['city'].split(" ")) + ",%20" + personObject['state'] + "%20" + personObject['postalcodeextended']
nextPage = "https://www.webpage.com/?" + info
print("Creating info")
newRequest = scrapy.Request(nextPage, self.parse)
newRequest.meta['item'] = personObject
print("Yielding request")
yield newRequest
except Exception:
print("Reach JSON exception")
time.sleep(10)
Every time the parse function gets called it does all the logic and makes a requests.get call at the end that is supposed to send data back to the server, and it all works if it gets that far. I tried a lot of different things to get the scraper to loop and constantly ask the server for more information. I want the scraper to run indefinitely, but that defeats the purpose when I can't step away from the computer because it chokes on a request. Any recommendations for keeping the scraper running 24/7 without the stupid while loop in start_requests? And on top of that, can anyone tell me why it gets stuck in a loop of requests? :( I have a huge headache trying to troubleshoot this and finally gave in to a forum...

What you should do is start with your server URL and keep retrying it constantly by yielding Request objects. If the data you get is new, parse it and schedule your requests:
import json

import scrapy
from scrapy import Request


class MyCrawler(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['http://myserver.com']
    past_data = None

    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        if data == self.past_data:  # if data is the same, retry
            # time.sleep(10)  # you could add a delay, but sleep would block everything
            yield Request(response.url, dont_filter=True, priority=-100)
            return
        self.past_data = data
        for url in data['urls']:
            yield Request(url, self.parse_url)
        # keep retrying
        yield Request(response.url, dont_filter=True, priority=-100)

    def parse_url(self, response):
        # ...
        yield {'scrapy': 'item'}
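If you do want a delay between polls, a side note (not part of the original answer, just a hedged sketch): let Scrapy pace the requests instead of calling time.sleep. DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are standard Scrapy settings and custom_settings is a standard Spider attribute; the values here are placeholders:
# Hedged sketch: add these attributes to the spider above so Scrapy itself
# spaces out the retries instead of blocking the reactor with time.sleep().
custom_settings = {
    'DOWNLOAD_DELAY': 10,                  # roughly 10 seconds between requests
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,   # one request to the server at a time
}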

Related

call several callback functions in scrapy

I am using scrapy and I have several problems.
First problem: I put start_requests in a loop, but the function is not started on each iteration.
Second problem: I need to call a different callback depending on which start_urls list the loop picked, but I can't give the callback a dynamic name. I would like to write callback=parse_i, where i comes from the loop above.
liste = [[liste1], [liste2], [liste3]]

for i in range(0, 2):
    start_urls = liste[i]

    def start_requests(self):
        # print(self.start_urls)
        for u in self.start_urls:
            try:
                req = requests.get(u)
            except requests.exceptions.ConnectionError:
                print("Connection refused")
            if req.status_code != 200:
                print("Request failed, status code is :", req.status_code)
                continue
            yield scrapy.Request(u, callback=self.parse, meta={'dont_merge_cookies': True}, dont_filter=False)
thanks
I need to call a different callback depending on the start_urls given by the loop, but I can't give the callback a dynamic name. I would like to write callback=parse_i, where i comes from the loop above.
The callback attribute just needs to be a callable, so you can use getattr just like normal:
my_callback = getattr(self, 'parse_{}'.format(i))
yield Request(u, callback=my_callback)
Separately, while you didn't ask this, it is highly unusual to make a URL call from within start_requests, since (a) dealing with all that non-200 stuff is why one would use Scrapy to begin with, and (b) doing so with requests will not honor the throttling, proxy, user-agent, resumption, or host of other knobs through which one would want to influence a scraping job.
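To make the getattr approach concrete, here is a minimal hedged sketch (the spider name, the URL lists, and the parse_0/parse_1/parse_2 methods are placeholders, not taken from the question):
import scrapy


class MultiCallbackSpider(scrapy.Spider):
    # Hypothetical spider illustrating getattr-based callback selection.
    name = 'multi_callback'
    listes = [['http://example.com/a'], ['http://example.com/b'], ['http://example.com/c']]

    def start_requests(self):
        for i, urls in enumerate(self.listes):
            my_callback = getattr(self, 'parse_{}'.format(i))
            for u in urls:
                yield scrapy.Request(u, callback=my_callback, meta={'dont_merge_cookies': True})

    def parse_0(self, response):
        self.logger.info('parse_0 handled %s', response.url)

    def parse_1(self, response):
        self.logger.info('parse_1 handled %s', response.url)

    def parse_2(self, response):
        self.logger.info('parse_2 handled %s', response.url)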

Simplify a streamed request.get and JSON response decode

I have been working on some code that will grab emergency incident information from a service called PulsePoint. It works with software built into computer-controlled dispatch centers.
This is an app that empowers citizen heroes who are CPR trained to help before a first responder arrives on scene. I'm merely using it to get other emergency incidents.
I reverse-engineered their app, as they have no documentation on how to make your own requests. Because of this, I have knowingly left in the API key and auth info, since it's in plain text in the Android manifest file.
I will definitely make a Python module for interfacing with this service eventually; for now it's just messy.
Anyhow, sorry for that long, boring intro.
My real question is: how can I simplify this function so that it looks and runs a bit cleaner, while still making a timed request and returning a JSON object that can be used through subscripts?
import requests, time, json

def getjsonobject(agency):
    startsecond = time.strftime("%S")
    url = REDACTED
    body = []
    currentagency = requests.get(url=url, verify=False, stream=True,
                                 auth=requests.auth.HTTPBasicAuth(REDACTED, REDACTED),
                                 timeout=13)
    for chunk in currentagency.iter_content(1024):
        body.append(chunk)
        if(int(startsecond) + 5 < int(time.strftime("%S"))):  # Shitty internet proof, with timeout above
            raise Exception("Server sent too much data")
    jsonstringforagency = str(b''.join(body))[2:][:-1]  # Removes the characters that wrap the response body so that the next line doesn't error
    currentagencyjson = json.loads(jsonstringforagency)  # Loads response as decodable JSON
    return currentagencyjson

currentincidents = getjsonobject("lafdw")

for inci in currentincidents["incidents"]["active"]:
    print(inci["FullDisplayAddress"])
Requests handles acquiring the body data, checking for JSON, and parsing the JSON for you automatically, and since you're giving the timeout arg I don't think you need separate timeout handling. Requests also handles constructing the URL for GET requests, so you can put your query information into a dictionary, which is much nicer. Combining those changes and removing unused imports gives you this:
import requests

params = dict(both=1,
              minimal=1,
              apikey=REDACTED)

url = REDACTED

def getjsonobject(agency):
    myParams = dict(params, agency=agency)
    return requests.get(url, verify=False, params=myParams, stream=True,
                        auth=requests.auth.HTTPBasicAuth(REDACTED, REDACTED),
                        timeout=13).json()
Which gives the same output for me.
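For completeness, a usage sketch that mirrors the loop from the question (the agency code and field names come from the question and are not otherwise verified here):
# Fetch one agency with the simplified function and print its active incidents.
currentincidents = getjsonobject("lafdw")
for inci in currentincidents["incidents"]["active"]:
    print(inci["FullDisplayAddress"])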

Scrapy: Sending information to a prior function

I am using scrapy 1.1 to scrape a website. The site requires periodic relogin. I can tell when this is needed because a 302 redirection occurs when login is required. Based on http://sangaline.com/post/advanced-web-scraping-tutorial/, I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:
request.meta['redirect_urls']
My problem is that after logging in, I have set up a function to loop through 100 pages to scrape. Let's say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like:
def test1(self, response):
    ......
    for row in empties:  # 100 records
        d = object_as_dict(row)
        AA
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup, meta={'d': d}, dont_filter=True)

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])
        BB
    d = response.meta['d']
So as you can see, I get 'notified' of the need to relogin in parse_lookup at BB, but need to feed this information back to cancel the loop that creates the requests in test1 (AA). How can I make the information in parse_lookup available in the prior callback function?
Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request


class CustomDownloaderMiddleware(object):

    def process_response(self, request, response, spider):
        if 'redirect_urls' in request.meta:
            # assuming your spider has a method for handling the login
            original_url = request.meta["redirect_urls"][0]
            return Request(url="login_url",
                           callback=spider.login,
                           meta={"original_url": original_url})
        return response
So you "intercept" the response before it goes to the parse_lookup and relogin/fix what is wrong and yield new requests...
Like Tomáš Linhart said the requests are asynchronous so I don't know if you could run into problems by "reloging in" several times in a row, as multiple requests might be redirected at the same time.
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
You can't achieve what you want because Scrapy uses asynchronous processing.
In theory you could use the approach partially suggested in a comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to code a spider middleware and handle this exception in its process_spider_exception method to log back in and retry the failed requests.
But I think a better and simpler approach would be to do the same as soon as you detect the need to log in, i.e. in parse_lookup. I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests, as you always log back in when you need to.
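A minimal hedged sketch of that setting applied per spider (custom_settings and CONCURRENT_REQUESTS_PER_DOMAIN are standard Scrapy names; the spider name is a placeholder):
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'  # hypothetical spider name
    custom_settings = {
        # One in-flight request per domain, so a re-login can finish
        # before the next request is scheduled.
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }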
Don't iterate over the 100 items and create requests for all of them. Instead, just create a request for the first item, process it in your callback function, yield the item, and only after that's done create the request for the second item and yield it. With this approach, you can check for the location header in your callback and either make the request for the next item or login and repeat the current item request.
For example:
def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # It's a redirect
        yield Request(url=your_login_url, callback=self.parse_login_response,
                      meta={'current_item_url': response.request.url})
    else:
        # It's a normal response
        item = YourItem()
        ...  # Extract your item fields from the response
        yield item
        next_item_url = ...  # Extract the next page URL from the response
        yield Request(url=next_item_url, callback=self.parse_lookup)
This assumes that you can get the next item URL from the current item page, otherwise just put the list of URLs in the first request's META dict and pass it along.
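A hedged sketch of that meta-chaining variant (the spider name, the pending_urls meta key, and the placeholder URLs are illustrative, not from the answer):
import scrapy
from scrapy import Request


class SerializedSpider(scrapy.Spider):
    # Hypothetical spider: walks the item URLs one at a time by passing the
    # remaining list along in the request's meta.
    name = 'serialized'
    item_urls = ['http://example.com/item/1', 'http://example.com/item/2']  # placeholders

    def start_requests(self):
        yield Request(self.item_urls[0], callback=self.parse_lookup,
                      meta={'pending_urls': self.item_urls[1:]}, dont_filter=True)

    def parse_lookup(self, response):
        # ... extract and yield the item for the current page here ...
        pending = response.meta['pending_urls']
        if pending:
            yield Request(pending[0], callback=self.parse_lookup,
                          meta={'pending_urls': pending[1:]}, dont_filter=True)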
I think it would be better not to fire all 100 requests at once; instead you should try to "serialize" the requests. For example, you could add all your empties to the request's meta and pop them out as necessary, or keep the empties as a field of your spider.
Another alternative would be to use the scrapy-inline-requests package to accomplish what you want, but you should probably extend your middleware to perform the login.

How to send a repeat request in urllib2.urlopen in Python if the first call is just stuck?

I am making a call to a URL in Python using urllib2.urlopen in a while True loop.
My URL keeps changing every time (as there is a change in a particular parameter of the URL on every call).
My code looks as follows:
def get_url(url):
    '''Get json page data using a specified API url'''
    response = urlopen(url)
    data = str(response.read().decode('utf-8'))
    page = json.loads(data)
    return page
I am calling the above method from the main function by changing the url every time I make the call.
What I observe is that after a few calls to the function, suddenly (I don't know why), the code gets stuck at the statement
response = urlopen(url)
and it just waits and waits...
How do I best handle this situation?
I want to make sure that if it does not respond within say 10 seconds, I make the same call again.
I read about
response = urlopen(url, timeout=10)
but then what about the repeated call if this fails?
Depending on how many retries you want to attempt, use a try/catch inside a loop:
while True:
    try:
        response = urlopen(url, timeout=10)
        break
    except:
        # do something with the error
        pass

# do something with response
data = str(response.read().decode('utf-8'))
...
This will silence all exceptions, which may not be ideal (more on that here: Handling urllib2's timeout? - Python)
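A hedged variant that catches only timeout-related errors and gives up after a bounded number of attempts (the function name, retry count, and re-raise behaviour are choices made here, not from the answer; it assumes Python 2's urllib2 as in the question):
import socket
import urllib2

def get_with_retries(url, retries=3, timeout=10):
    # Retry only on URL errors / socket timeouts instead of silencing every exception.
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url, timeout=timeout)
        except (urllib2.URLError, socket.timeout):
            if attempt == retries - 1:
                raise  # out of attempts, let the caller see the error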
With this method you can retry once.
def get_url(url, trial=1):
    '''Get json page data using a specified API url'''
    try:
        response = urlopen(url, timeout=10)
        data = str(response.read().decode('utf-8'))
        page = json.loads(data)
        return page
    except:
        if trial == 1:
            return get_url(url, trial=2)
        else:
            return

Faster Scraping of JSON from API: Asynchronous or?

I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request)
The API places a limit of 1 request per second.
I would like to accomplish this on a laptop with a 100MB/sec internet connection.
Currently, I am accomplishing this (synchronously & too slowly) by:
-pre-computing all of the (encoded) URLs I want to scrape
-using Python 3's requests library to request each URL and save the resulting JSON one-by-one in separate .txt files.
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
# for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'w') as outfile:
        json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straight-forward way to accomplish this asynchronously but respecting the 1-request-per-second threshold? I have read about grequests (no longer maintained?), twisted, asyncio, etc but do not have enough experience to know whether/if one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hangup is, how can I do an overall rate-limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers 'wake up' and request at the same time. When I tested it, all workers grab a page and break the rate limit, then go to sleep simultaneously).
from datetime import datetime
from datetime import timedelta
from tornado import httpclient, gen, ioloop, queues

URLS = ["https://baconipsum.com/api/?type=meat",
        "https://baconipsum.com/api/?type=filler",
        "https://baconipsum.com/api/?type=meat-and-filler",
        "https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]

concurrency = 2


def handle_request(response):
    if response.code == 200:
        with open("FOO" + '.txt', "wb") as thisfile:  # fix filenames to avoid overwrite
            thisfile.write(response.body)


@gen.coroutine
def request_and_save_url(url):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
        print('fetched {0}'.format(url))
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])


@gen.coroutine
def main():
    q = queues.Queue()
    tstart = datetime.now()
    fetching, fetched = set(), set()

    @gen.coroutine
    def fetch_url(worker_id):
        current_url = yield q.get()
        try:
            if current_url in fetching:
                return
            # print('fetching {0}'.format(current_url))
            print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now() - tstart).seconds))
            fetching.add(current_url)
            yield request_and_save_url(current_url)
            fetched.add(current_url)
        finally:
            q.task_done()

    @gen.coroutine
    def worker(worker_id):
        while True:
            yield fetch_url(worker_id)

    # Fill a queue of URLs to scrape
    list = [q.put(url) for url in URLS]  # this does not make a list...it just puts all the URLs into the Queue

    # Start workers, then wait for the work Queue to be empty.
    for ii in range(concurrency):
        worker(ii)
    yield q.join(timeout=timedelta(seconds=300))
    assert fetching == fetched
    print('Done in {0} seconds, fetched {1} URLs.'.format(
        datetime.now() - tstart, len(fetched)))


if __name__ == '__main__':
    import logging
    logging.basicConfig()
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'wb') as outfile:  # binary mode, since .content is bytes
        outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
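Going one step further (not part of the original answer, just a hedged sketch): with stream=True you can write the body to disk in chunks via iter_content, which is a standard requests API, so large responses never have to sit fully in memory. The index i used for the filename is hypothetical:
import requests

resp = requests.get(encoded_URL_i, timeout=timeout_secs, stream=True)
if resp.ok:
    with open('json_output_{0}.txt'.format(i), 'wb') as outfile:  # 'i' is a hypothetical per-URL index
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MB chunks
            outfile.write(chunk)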
Tornado has a very powerful asynchronous client. Here's some basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado.gen
import tornado.ioloop

URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()


def handle_request(response):
    if response.code == 200:
        with open('json_output.txt', 'ab') as outfile:  # binary append, response.body is bytes
            outfile.write(response.body)


@tornado.gen.coroutine
def queue_requests():
    results = []
    for url in URLS:
        nxt = tornado.gen.sleep(1)  # 1 request per second
        res = http_client.fetch(url, handle_request)
        results.append(res)
        yield nxt
    yield results  # wait for all requests to finish
    loop.add_callback(loop.stop)


loop.add_callback(queue_requests)
loop.start()
This is a straightforward approach that may lead to too many connections with the remote server. You may have to resolve such a problem using a sliding window while queuing the requests.
If you need to handle request timeouts or set specific headers, feel free to read the doc.
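To address the follow-up in the question's edit (a global 1-request-per-second cap shared by all workers), here is a hedged sketch of one way to do it with a shared Tornado lock and a timestamp; tornado.locks.Lock and gen.sleep are standard Tornado APIs, while the function name, the mutable-list trick, and the 1.0-second interval are choices made for this sketch:
import time

from tornado import gen, locks

rate_lock = locks.Lock()     # shared by all workers
last_request_time = [0.0]    # mutable container so coroutines can update it


@gen.coroutine
def wait_for_rate_limit(min_interval=1.0):
    # Only one coroutine at a time may check and update the shared timestamp,
    # so requests are spaced at least min_interval seconds apart globally.
    with (yield rate_lock.acquire()):
        elapsed = time.time() - last_request_time[0]
        if elapsed < min_interval:
            yield gen.sleep(min_interval - elapsed)
        last_request_time[0] = time.time()

Each worker would then yield wait_for_rate_limit() immediately before its fetch call.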
