Using GAE's search API with ndb's tasklets - python

The Google AppEngine search API can return asynchronous results. The docs say very little about these futures, but they do have a .get_result() method which looks a lot like an ndb.Future. I thought it would be fun to try to use one in a tasklet:
@ndb.tasklet
def async_query(index):
    results = yield [index.search_async('foo'), index.search_async('bar')]
    raise ndb.Return(results)
Unfortunately, this doesn't work. ndb doesn't like this because the future returned by the search API doesn't seem to be compatible with ndb.Future. However, the tasklet documentation also specifically mentions that they have been made to work with urlfetch futures. Is there a way to get similar behavior with the search api?

It turns out that there is a way to make this work (at least, my tests on the dev_appserver seem to be passing ;-).
@ndb.tasklet
def async_query(index, query):
    fut = index.search_async(query)
    yield fut._rpc
    raise ndb.Return(fut._get_result_hook())
Now if I want to do multiple queries and intermix some datastore queries (e.g. using the search API to get IDs for model entities):
@ndb.tasklet
def get_entities(index, queries):
    search_results = yield [async_query(index, q) for q in queries]
    ids = set()
    for search_result in search_results:
        for doc in search_result.results:
            ids.add(doc.doc_id)
    entities = yield [FooModel.get_by_id_async(id_) for id_ in ids]
    raise ndb.Return(entities)
This is super hacky -- and I doubt that it is officially supported since I am using internal members of the async search result class ... Use at your own risk :-).
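For completeness, a hedged usage sketch showing how the tasklet above could be driven from synchronous code (the index name is hypothetical, and FooModel is assumed to be the model referenced above):
from google.appengine.api import search

index = search.Index(name='foo_index')  # hypothetical index name
# Calling a tasklet returns an ndb.Future; get_result() runs the event loop.
entities = get_entities(index, ['foo', 'bar']).get_result()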

Related

Initializing state on dask-distributed workers

I am trying to do something like
resource = MyResource()
def fn(x):
    something = do_something(x, resource)
    return something
client = Client()
results = client.map(fn, data)
The issue is that resource is not serializable and is expensive to construct.
Therefore I would like to construct it once on each worker and have it available to be used by fn.
How do I do this?
Or is there some other way to make resource available on all workers?
You can always construct a lazy resource, something like
class GiveAResource(object):
    resource = [None]
    def get_resource(self):
        if self.resource[0] is None:
            self.resource[0] = MyResource()
        return self.resource[0]
An instance of this will serialise between processes fine, so you can include it as an input to any function to be executed on workers. Calling .get_resource() on it will get your local expensive resource, which will get remade on any worker that appears later on.
This class would be best defined in a module rather than dynamic code.
There is no locking here, so if several threads ask for the resource at the same time when it has not been needed so far, you will get redundant work.
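A hedged usage sketch, reusing the placeholders from the question (MyResource, do_something, data) and assuming GiveAResource lives in an importable module as suggested above:
from dask.distributed import Client
holder = GiveAResource()  # cheap to serialise; the expensive resource is built lazily
def fn(x):
    resource = holder.get_resource()  # constructed at most once per worker process
    return do_something(x, resource)
client = Client()
futures = client.map(fn, data)
results = client.gather(futures)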

How do I get a list of Google YouTube API methods?

I need to get a list of YouTube API (v3) methods, because I want to implement a simple client library that does not hard-code a URL for every method and instead calls them by name.
I'll use Python for this.
As @sous2817 said, you can see all the methods that the YouTube API supports in this documentation.
This reference guide explains how to use the API to perform all of these operations. The guide is organized by resource type. A resource represents a type of item that comprises part of the YouTube experience, such as a video, a playlist, or a subscription. For each resource type, the guide lists one or more data representations, and resources are represented as JSON objects. The guide also lists one or more supported methods (LIST, POST, DELETE, etc.) for each resource type and explains how to use those methods in your application.
Here are Python Code Samples which use the Google APIs Client Library for Python:
Call the API's captions.list method to list the existing caption tracks.
def list_captions(youtube, video_id):
    results = youtube.captions().list(
        part="snippet",
        videoId=video_id
    ).execute()
    for item in results["items"]:
        id = item["id"]
        name = item["snippet"]["name"]
        language = item["snippet"]["language"]
        print "Caption track '%s(%s)' in '%s' language." % (name, id, language)
    return results["items"]
Call the API's captions.download method to download an existing caption track.
def download_caption(youtube, caption_id, tfmt):
    subtitle = youtube.captions().download(
        id=caption_id,
        tfmt=tfmt
    ).execute()
    print "First line of caption track: %s" % (subtitle)
More code samples are available.
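If you want the list of methods programmatically rather than from the reference guide, one option (a sketch based on the API's public discovery document, not taken from the samples above) is to read the v3 discovery document, which enumerates every resource and its methods:
import json
import urllib2  # urllib.request on Python 3
DISCOVERY_URL = "https://www.googleapis.com/discovery/v1/apis/youtube/v3/rest"
doc = json.load(urllib2.urlopen(DISCOVERY_URL))
# Top-level resources (captions, channels, videos, ...) each list their methods;
# nested resources are ignored here for brevity.
for resource_name, resource in sorted(doc.get("resources", {}).items()):
    for method_name in sorted(resource.get("methods", {})):
        print "%s.%s" % (resource_name, method_name)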

How much should a unit test know about the function it is testing?

I'm writing a test for a caching mechanism. The mechanism has two cache layers, the request cache and Redis. The request cache uses Flask.g, an object that stores values for the duration of the request. It does this by creating a dictionary on the Flask.g._cache attribute.
However, I think that the exact attribute is an implementation detail that my unit test shouldn't care about. I want to make sure it stores its values on Flask.g, but I don't care about how it does that. What would be a good way to test this?
I'm using the Python mock module, so I know I can mock out Flask.g, but I'm not sure if there's a way to test whether there has been any property access on it without caring about which property it is.
Is this even the right approach for tests like this?
Personally, you shouldn't be mocking Flask.g if you are testing endpoints, since you would be creating a test app via self.app (I may be wrong about this part).
Secondly, you will need to mock the Redis client, with something like this being the returned structure.
class RedisTestCase(object):
    saved_dictionary = {}
    def keys(self, pattern):
        found_keys = []
        for key in RedisTestCase.saved_dictionary:  # loop keys for pattern
            if pattern:
                found_keys.append(key)
        return found_keys
    def delete(self, keys):
        for key in keys:
            if key in RedisTestCase.saved_dictionary:
                del RedisTestCase.saved_dictionary[key]
    def mset(self, items):
        RedisTestCase.saved_dictionary.update(items)
    def pipeline(self):
        return RedisTestCase()
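A hedged sketch of wiring this stub into a test (the module path myapp.cache, its redis_client attribute, and the store function are hypothetical placeholders for wherever your caching mechanism gets its Redis client):
import mock  # unittest.mock on Python 3
from myapp import cache  # hypothetical module holding the caching mechanism
def test_redis_layer_stores_values():
    fake_redis = RedisTestCase()
    # Patch whatever name the caching mechanism uses for its Redis client.
    with mock.patch.object(cache, "redis_client", fake_redis):
        cache.store({"key": "value"})  # hypothetical cache API
        assert fake_redis.keys("*") == ["key"]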

How or when to use GAE @classmethod properly

I have this google app-engine datastore NDB model:
class pySessions(ndb.Model):
    # sid is the key/id
    data = ndb.BlobProperty(required=True)
    expiry = ndb.DateTimeProperty(required=True)
    @classmethod
    def get_sid(cls, sid):
        sid_key = ndb.Key(cls, sid)
        return cls.query(cls.key == sid_key,
                         cls.expiry >= datetime.utcnow()).get()
For getting a specific sid I can use something like this:
session = pySessions.get_by_id(sid)
if session and session.expiry >= datetime.utcnow():
    return session.data
return {}
Or I could use the @classmethod get_sid:
data = pySessions.get_sid(sid)
Both work, but while doing some tests I noticed that the @classmethod was behaving slower, or not reading the updated session data.
I was testing with a basic counter: after incrementing it I reloaded the page (via a Location header redirect), and it was here that I noticed that, for some unknown reason, querying NDB via the @classmethod get_sid was having issues. The only way I was able to make it work was with the pdb debugger, since the pause it introduced gave the code time to read the data.
Any idea what the difference is between using the @classmethod and a custom query?
I'm not sure why you think the speed difference has anything to do with using a classmethod. The two methods are doing completely different things: the classmethod is doing a query (by key and expiry), whereas the other code is doing a straight get, which is much quicker, and then only returning the data if the expiry has not yet passed.
Queries are always much slower than gets in the datastore, but the second method has the disadvantage of always fetching the data - and therefore incurring a cost - even if the expiry date has passed.
The other potential downside of the query is that queries (unlike gets) are subject to eventual consistency, so there is a likelihood of not seeing the most recently updated data.
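A hedged sketch of that advice, reusing the model from the question (an assumption, not the answerer's code): keep the classmethod, but build it on a key get plus an in-memory expiry check instead of a query.
from datetime import datetime
from google.appengine.ext import ndb
class pySessions(ndb.Model):
    data = ndb.BlobProperty(required=True)
    expiry = ndb.DateTimeProperty(required=True)
    @classmethod
    def get_sid(cls, sid):
        session = ndb.Key(cls, sid).get()  # strongly consistent and cheaper than a query
        if session and session.expiry >= datetime.utcnow():
            return session
        return None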

Mapreduce on GAE Python - Cause ReducePipeline to issue callback on finalize?

I'd like to execute a custom callback function once a mapreduce job has finalized/completed.
The only useful references I found for this problem are a somewhat outdated Google site and a related, but again seemingly outdated Stackoverflow question.
Both those sources assume that I use control.start_map to kick off Mapreduce jobs, and rely on the fact that start_map takes a keyword argument mapreduce_parameters in which one can specify a done_callback argument to specify the url which should be called on completion. However, I'm using a different method (afaik the more recent, preferred one) in which a custom pipeline's run method yields a Mapreduce pipeline:
yield mapreduce_pipeline.MapreducePipeline(
    "word_count",
    "main.word_count_map",
    "main.word_count_reduce",
    "mapreduce.input_readers.BlobstoreZipInputReader",
    "mapreduce.output_writers.BlobstoreOutputWriter",
    mapper_params={
        "blob_key": blobkey,
    },
    reducer_params={
        "mime_type": "text/plain",
    },
    shards=16)
The signature for MapreducePipeline doesn't allow for a mapreduce_parameters argument. The only place where I can see references to a callback cropping up in the source is in mapper_pipeline.MapperPipeline.run, but it seems to be used internally only.
So, is there a way to get that callback parameter in there?
If not, does someone have good ideas on where and how to extend the library to provide such a functionality?
I set up my Mapreduce pipeline paradigm to look a little like the following:
class MRRecalculateSupportsPipeline(base_handler.PipelineBase):
    def run(self, user_key):
        # ...
        yield mapreduce_pipeline.MapreducePipeline('user_recalculate_supports',
            'myapp.mapreduces.user_recalculate_supports_map',
            'myapp.mapreduces.user_recalculate_supports_reduce',
            'mapreduce.input_readers.DatastoreInputReader', output_writer_spec=None,
            mapper_params={"""..."""})
If you would like to capture the completion of this pipeline you have two options.
A) Use pipeline.After to run a completion pipeline after the MR pipeline completes.
pipe_future = yield mapreduce_pipeline.MapreducePipeline('user_recalculate_supports',
    'myapp.mapreduces.user_recalculate_supports_map',
    'myapp.mapreduces.user_recalculate_supports_reduce',
    'mapreduce.input_readers.DatastoreInputReader', output_writer_spec=None,
    mapper_params={"""..."""})
with pipeline.After(pipe_future):
    yield CalcCompletePipeline(...)  # this could be a mapreduce pipeline, or any pipeline using the same base_handler.PipelineBase parent class.
B) Use the finalized method of the top-level pipeline to handle completion. Personally, I'd stick with option A, because you can trace the path in the /_ah/*/status?root= view.
class EmailNewReleasePipeline(base_handler.PipelineBase):
    """Email followers about a new release"""
    # TODO: product_key is the name of the parameter, but it's built for albums ...
    def run(self, product_key, testing=False):
        # Send those emails ...
        yield mapreduce_pipeline.MapreducePipeline(...)
    def finalized(self):
        """Save product as launched"""
        ...
        product.launched = True
        product.put()
Here are the docs on the finalization of a pipeline.
A low-investment workaround for this issue is to simply yield another Map/Mapreduce pipeline that does the desired postprocessing.
E.g.:
class MainPipeline(base_handler.PipelineBase):
    def run(self):
        mapper_params = { ... }
        reducer_params = { ... }
        yield mapreduce_pipeline.MapreducePipeline(
            ...,
            mapper_params=mapper_params,
            reducer_params=reducer_params)
        yield PostprocessPipeline(reducer_params)
class PostprocessPipeline(base_handler.PipelineBase):
    def run(self, reducer_params):
        do_some_postprocessing(reducer_params)
That workaround doesn't have access to the Mapreduce state, which I suppose could somehow be retrieved from the pipeline ID, but it's not yet obvious to me how. So, you'll have to set another flag/memcache/ds entry to check if the pipeline was completed successfully or not (if that's relevant to the postprocessing).
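One hedged way to record that outcome (an assumption, not part of the workaround above, relying on the pipeline library's pipeline_id and was_aborted attributes): set a flag from the top-level pipeline's finalized() hook and have the postprocessing check it.
from google.appengine.api import memcache
class MainPipeline(base_handler.PipelineBase):
    # run() as above ...
    def finalized(self):
        # Record whether the whole pipeline tree finished without being aborted.
        memcache.set('mr_done_%s' % self.pipeline_id, not self.was_aborted)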
