I watched Raymond Hettinger's Idiomatic Python talk, and learned about the sentinel argument to iter().
I'd like to try applying it to a piece of code I'm working on that iterates over a paginated API (it's Twilio, but that's not relevant to my question).
I have an API that returns a list of data and a next page URL. When the pagination is exhausted, the next page URL comes back as an empty string. I wrote the fetching function as a generator, and it looks roughly like this:
def fetch(url):
    while url:
        data = requests.get(url).json()
        url = data['next_page_uri']
        for row in data[resource]:
            yield row
This code works fine, but I'd like to try to remove the while loop and replace it with a call to iter() using the next_page_uri value as the sentinel argument.
Alternatively, could this be written with a yield from?
I think this might be what you mean… but as stated in the comments, it doesn't help much:
def fetch_paged(url):
    while url:
        res = requests.get(url)
        res.raise_for_status()
        data = res.json()
        yield data
        url = data['next_page_uri']

def fetch(url):
    for data in fetch_paged(url):
        yield from data[resource]
(I've taken the opportunity to put in a call to raise_for_status(), which will raise for non-successful responses, i.e. res.status_code >= 400)
Not sure if it's any "better", but it may pay off if you're going to be reusing the fetch_paged functionality a lot.
Note: lots of other APIs put this next_page_uri into the response headers in standard ways, which the requests library knows how to deal with and exposes via the res.links attribute.
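For example, an API that paginates via standard Link headers could be walked with something like this (a rough sketch, assuming the server sends a rel="next" Link header and each page body is a JSON list of rows):

import requests

def fetch_linked(url):
    # Follow RFC 5988 Link headers: requests parses them into res.links,
    # e.g. {'next': {'url': 'https://...', 'rel': 'next'}}.
    while url:
        res = requests.get(url)
        res.raise_for_status()
        yield from res.json()
        # res.links has no 'next' entry on the last page, which ends the loop
        url = res.links.get('next', {}).get('url')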
Similar to what is done in the link:
How can i use multiple requests and pass items in between them in scrapy python
I am trying to chain requests from spiders like in Dave McLain's answer. Returning a request object from the parse function works fine, allowing the spider to continue with the next request.
def parse(self, response):
    # Some operations
    self.url_index += 1
    if self.url_index < len(self.urls):
        return scrapy.Request(url=self.urls[self.url_index], callback=self.parse)
    return items
However, I have the default spider middleware where I do some caching and logging operations in process_spider_output. The request object returned from the parse function first goes into the middleware, so the middleware has to return the request object as well.
def process_spider_output(self, response, result, spider):
    # Called with the results returned from the Spider, after
    # it has processed the response.
    # Must return an iterable of Request, or item objects.
    if hasattr(spider, 'multiple_urls'):
        if spider.url_index + 1 < len(spider.urls):
            return [result]
            # return [scrapy.Request(url=spider.urls[spider.url_index], callback=spider.parse)]
    # Some operations ...
According to the documentation, it must return an iterable of Request or item objects. However, when I return the result (which is a Request object), or construct a new request object (as in the comment), the spider just terminates (with the 'spider finished' signal) without making a new request.
Documentation link: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#writing-your-own-spider-middleware
I am not sure whether there is an issue with the documentation or with the way I interpret it, but returning request objects from the middleware doesn't make a new request; instead, it terminates the flow.
It was quite simple yet frustrating to solve the problem. The middleware is supposed to return an iterable of request objects. However, putting the request object into a list (which is an iterable) doesn't seem to work. Using yield result in the process_spider_output middleware function instead works.
Since the main issue is resolved, I'll leave this answer as a reference. Better explanations of why this is the case are appreciated.
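For reference, my reading of "using yield" is to turn the middleware method into a generator and yield each entry of result (requests and items alike); a minimal sketch, with the caching/logging operations left as placeholders:

def process_spider_output(self, response, result, spider):
    # Yielding each output makes this method a generator, so Scrapy receives
    # the Request objects one by one and schedules them instead of terminating.
    for r in result:
        # caching / logging operations would go here
        yield r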
The Google AppEngine search API can return asynchronous results. The docs say very little about these futures, but they do have a .get_result() method which looks a lot like an ndb.Future. I thought it would be fun to try to use one in a tasklet:
@ndb.tasklet
def async_query(index):
    results = yield [index.search_async('foo'), index.search_async('bar')]
    raise ndb.Return(results)
Unfortunately, this doesn't work. ndb doesn't like this because the future returned by the search API doesn't seem to be compatible with ndb.Future. However, the tasklet documentation also specifically mentions that they have been made to work with urlfetch futures. Is there a way to get similar behavior with the search api?
It turns out that there is a way to make this work (at least, my tests on the dev_appserver seem to be passing ;-).
@ndb.tasklet
def async_query(index, query):
    fut = index.search_async(query)
    yield fut._rpc
    raise ndb.Return(fut._get_result_hook())
Now if I want to do multiple queries and intermix some datastore queries (e.g. maybe I use the search API to get IDs for model entities):
@ndb.tasklet
def get_entities(index, queries):
    search_results = yield [async_query(index, q) for q in queries]
    ids = set()
    for search_result in search_results:
        for doc in search_result.results:
            ids.add(doc.doc_id)
    entities = yield [FooModel.get_by_id_async(id_) for id_ in ids]
    raise ndb.Return(entities)
This is super hacky -- and I doubt that it is officially supported since I am using internal members of the async search result class ... Use at your own risk :-).
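In case it's useful, a rough sketch of calling it from synchronous code (the index name and query strings are placeholders):

from google.appengine.api import search

# Calling an ndb.tasklet returns an ndb.Future; get_result() runs the event
# loop until the tasklet (and everything it yielded on) has completed.
index = search.Index(name='foo-index')
entities = get_entities(index, ['foo', 'bar']).get_result()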
What's the best approach to write contracts for Scrapy spiders that have more than one method to parse the response?
I saw this answer but it didn't sound very clear to me.
My current example: I have a method called parse_product that extracts the information on a page, but I have more data that I need to extract for the same product on another page, so at the end of this method I yield a new request and let the new callback extract these fields and return the item.
The problem is that if I write a contract for the second method, it will fail because it doesn't have the meta attribute (containing the item with most of the fields). If I write a contract for the first method, I can't check if it returns the fields, because it returns a new request, instead of the item.
def parse_product(self, response):
    il = ItemLoader(item=ProductItem(), response=response)
    # populate the item in here

    # yield the new request sending the ItemLoader to another callback
    yield scrapy.Request(new_url, callback=self.parse_images, meta={'item': il})

def parse_images(self, response):
    """
    @url http://foo.bar
    @returns items 1 1
    @scrapes field1 field2 field3
    """
    il = response.request.meta['item']
    # extract the new fields and add them to the item in here
    yield il.load_item()
In the example, I put the contract in the second method, but it gave me a KeyError exception on response.request.meta['item']; also, the fields field1 and field2 are populated in the first method.
Hope it's clear enough.
Frankly, I don't use Scrapy contracts and I don't really recommend them to anyone either. They have many issues and may someday be removed from Scrapy.
In practice, I haven't had much luck using unit tests for spiders.
For testing spiders during development, I'd enable the cache and then re-run the spider as many times as needed to get the scraping right.
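Concretely, that just means a couple of settings (shown here as an assumed settings.py snippet):

# Assumed settings.py values: enable Scrapy's HTTP cache so repeated runs
# during development replay responses from disk instead of re-downloading.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'     # stored under the project's .scrapy/ directory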
For regression bugs, I had better luck using item pipelines (or spider middlewares) that do validation on-the-fly (there is only so much you can catch in early testing anyway). It's also a good idea to have some strategies for recovering.
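As a rough illustration of the kind of on-the-fly validation I mean (a sketch; the field names are placeholders, not from your spider):

from scrapy.exceptions import DropItem

REQUIRED_FIELDS = ('field1', 'field2', 'field3')

class ValidationPipeline:
    """Drop items that are missing required fields, so gaps show up in the logs."""

    def process_item(self, item, spider):
        missing = [f for f in REQUIRED_FIELDS if not item.get(f)]
        if missing:
            raise DropItem('missing fields: %s' % ', '.join(missing))
        return item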
And for maintaining a healthy codebase, I'd be constantly moving library-like code out from the spider itself to make it more testable.
Sorry if this isn't the answer you're looking for.
So I'm pretty new to Flask and I'm trying to get my head around one thing. If I understand correctly, when you write a function within a Flask app and you use the @app.route decorator on that function, it only runs when you hit that path/URL.
I have a small OAuth app written in Flask that goes through the whole authorization flow and then returns the token.
My question is: how do I get that token out of the decorated function? For example, let's say I have something like this:
@app.route('/token/')
def getToken(code):  # code from the callback url
    #/Stuff to get the Token/
    #/**********************/
    return token
If I hit the /token/ URL path, the function returns the token. But now I need to get that token and use it in another function to read from and write to the API I just got the token from. My initial thought was doing this:
token = getToken(code)
But if I do that, I get this error:
RuntimeError: working outside of request context
So again, my question is: how do I get the token so I can pass it as a parameter to other functions?
Extract the token generation code into a separate function, so that you can call it from anywhere, including the view function. It's a good practice to keep the application logic away from the view, and it also helps with unit testing.
I assume your route includes a placeholder for code, which you skipped:
def generateToken(code):
    #/Stuff to get the Token/
    #/**********************/
    return token

@app.route('/token/<string:code>')
def getToken(code):
    return generateToken(code)
Just keep in mind that generateToken shouldn't depend on the request object. If you need any request data (e.g. HTTP header), you should pass it explicitly in arguments. Otherwise you will get the "working outside of request context" exception you mentioned.
It is possible to call request-dependent views directly, but you need to mock the request object, which is a bit tricky. Read the request context documentation to learn more.
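For example, a rough sketch using Flask's test_request_context (the path and code value are made up):

# Push a request context manually so request-dependent code can run outside
# an actual HTTP request, e.g. in a shell or a test.
with app.test_request_context('/token/abc123'):
    token = getToken('abc123')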
Not sure what the context is. You could just call the method:
from yourmodule import get_token

def yourmethod():
    token = get_token()
Otherwise, you could use the requests library in order to retrieve the data from the route
>>> import requests
>>> response = requests.get('http://www.yoursite.com/yourroute/')
>>> print(response.text)
If you're looking for unittests, Flask comes with a mock client
def test_get_token():
    resp = self.app.get('/yourroute')
    # do something with resp.data
I'm crawling a web site (only two levels deep), and I want to scrape information from sites on both levels. The problem I'm running into, is I want to fill out the fields of one item with information from both levels. How do I do this?
I was thinking of having a list of items as an instance variable that would be accessible by all threads (since it's the same instance of the spider): parse_1 would fill out some fields, and parse_2 would have to check for the correct key before filling out the corresponding value. This method seems burdensome, and I'm still not sure how to make it work.
What I'm thinking is there must be a better way, maybe somehow passing an item to the callback. I don't know how to do that with the Request() method though. Ideas?
From the Scrapy documentation:
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
So basically, you can scrape the first page, store all the information in the item, then send the whole item along with the request for the second-level URL, and end up with all the information in one item.
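Applied to your two-level crawl, it would look roughly like this (a sketch; the field names and the second-level URL are placeholders, and it assumes the same imports and item class as the docs example above):

def parse_1(self, response):
    # First level: fill whatever fields are available on this page,
    # then hand the partially filled item to the second-level callback via meta.
    item = MyItem()
    item['level_one_field'] = response.url          # placeholder field
    request = Request("http://www.example.com/level_two.html",  # placeholder URL
                      callback=self.parse_2)
    request.meta['item'] = item
    return request

def parse_2(self, response):
    # Second level: retrieve the item, fill the remaining fields, and return it.
    item = response.meta['item']
    item['level_two_field'] = response.url          # placeholder field
    return item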