MapReduce on Google App Engine - Python

I'm very confused by the state and documentation of MapReduce support in GAE.
In the official doc https://developers.google.com/appengine/docs/python/dataprocessing/ there is an example, but:
the application uses mapreduce.input_readers.BlobstoreZipInputReader, whereas I would like to use mapreduce.input_readers.DatastoreInputReader. The documentation mentions the parameters of DatastoreInputReader, but not the value that is passed to the map function...
the "demo" application (the Helloworld page) has a mapreduce.yaml file which IS NOT USED by the application???
So I found http://code.google.com/p/appengine-mapreduce/. There is a complete example with mapreduce.input_readers.DatastoreInputReader, but it says that the reduce phase isn't supported yet!
So I would like to know whether it is possible to implement the first form of MapReduce, with the DatastoreInputReader, and execute a real map/reduce to get a GROUP BY equivalent.

The second example is from the earlier release, which did indeed support only the mapper phase. However, as the first example shows, the full map/reduce functionality is now supported and has been for some time. The mapreduce.yaml file is a leftover from that earlier version; it is not used now.
I'm not sure what your actual question is. The value sent to the map function by DatastoreInputReader is, not surprisingly, the individual entity taken from the kind being mapped over.
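To make that concrete, here is a minimal sketch of a full map/reduce pipeline over a datastore kind, modeled on the pipeline pattern from the official doc; the kind MyEntity, its category property, and the "main." module paths are placeholders for illustration, not anything from your code.

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline


def group_map(entity):
    # DatastoreInputReader hands the map function one entity of the
    # configured kind at a time; emit (key, value) pairs from it.
    yield (entity.category, "1")


def group_reduce(key, values):
    # All values emitted for one key arrive together in the reduce phase,
    # which is effectively a GROUP BY on the mapped property.
    yield "%s: %d\n" % (key, len(values))


class GroupByCategoryPipeline(base_handler.PipelineBase):
    def run(self):
        yield mapreduce_pipeline.MapreducePipeline(
            "group_by_category",
            "main.group_map",                                   # mapper spec
            "main.group_reduce",                                # reducer spec
            "mapreduce.input_readers.DatastoreInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={"entity_kind": "main.MyEntity"},
            reducer_params={"mime_type": "text/plain"},
            shards=16)

Starting it with GroupByCategoryPipeline().start() from a handler runs map, shuffle, and reduce, and each distinct property value ends up as one output line with its count, i.e. a GROUP BY equivalent.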


Batch call Dependency id requirements?

I have a script which does the following:
Create campaign
Create AdSet (requires campaign_id)
Create AdCreative (requires adset_id)
Create Ad (requires creative_id and adset_id)
I am trying to lump all of them into a batch request. However, I realized that none of these gets created except for my campaign (step 1) when I use remote_create(batch=my_batch). This is probably due to the dependencies on the ids needed by each of the subsequent steps.
I read the documentation (https://developers.facebook.com/docs/graph-api/making-multiple-requests) and it mentions "Specifying dependencies between operations in the request", i.e. referencing one call's result from another via {result=(parent operation name):(JSONPath expression)}.
Is this possible with the python API?
Can this be achieved with the way I am using remote_creates?
Unfortunately the Python SDK doesn't currently support this. There is a GitHub issue for it: https://github.com/facebook/facebook-python-ads-sdk/issues/256.
I have also encountered this issue and have described my workaround in the comments on the issue:
"I found a decent workaround for getting this behaviour without too much trouble. Basically I set the id fields that have dependencies to values like "{result=:$,id}", and prior to calling execute() on the batch object I iterate over ._batch and add the corresponding operation name as the 'name' entry. When I run execute, sure enough it works perfectly. Obviously this solution has its limitations, such as where you are doing multiple calls to the same endpoint that need to be fed into other endpoints; you would have duplicated resource names and would need to customize the names further to string them together.
Anyways, hope this helps someone!"
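For reference, a rough sketch of that workaround might look like the following. The operation names, the import paths, and the reliance on the private _batch attribute come from the comment quoted above and may vary between SDK versions, so treat it as illustrative rather than canonical.

from facebookads.api import FacebookAdsApi
from facebookads.adobjects.adset import AdSet
from facebookads.adobjects.campaign import Campaign

FacebookAdsApi.init(access_token='<ACCESS_TOKEN>')
batch = FacebookAdsApi.get_default_api().new_batch()

# Queue the campaign creation first.
campaign = Campaign(parent_id='act_<AD_ACCOUNT_ID>')
campaign[Campaign.Field.name] = 'My campaign'
campaign.remote_create(batch=batch)

# Reference the campaign created earlier in the same request with the
# Graph API's {result=<operation name>:<JSONPath>} syntax.
adset = AdSet(parent_id='act_<AD_ACCOUNT_ID>')
adset[AdSet.Field.name] = 'My ad set'
adset[AdSet.Field.campaign_id] = '{result=create_campaign:$.id}'
adset.remote_create(batch=batch)

# Tag the raw batch operations with the names referenced above before
# executing. This touches a private attribute of the batch object, so it
# may break on SDK updates.
for operation, name in zip(batch._batch, ['create_campaign', 'create_adset']):
    operation['name'] = name

batch.execute()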

Overview of used and unused functions in a Python / Django application

The idea of this question is to gather input on how you would go about determining whether a function and/or class is in use anywhere in the entirety of an application.
Background information
The application is 3-5 years old. It was originally based on Python 2.4 (upgraded over the years to the latest Python 2.7.11) and Django 1.0 (upgraded over the years to 1.4.22), plus some custom frameworks that implement some Ruby on Rails magic (create a controller file with function names and they turn into HTTP endpoints; functions with a _ in front are not visible to the end user). The total number of endpoints, derived from django.url*, tells me I would have to manually create 100 endpoints for various needs and purposes. The number of Django apps/modules is around 20, and they are entangled with each other. I know not all of them are used, but here's the thing: how would I proceed to gather information about which functions are used or not, so that I could refactor the code and reduce the noise?
I've used PyCharm and its inspections, but given how the application and Python work, some of its suggestions break the application.
An example of the above: some functions in models and views don't use self, so PyCharm thinks "well, this function can be changed to a static method", but somewhere else in the code the previous developer calls the function by name through self (self."function_name"), and that call effectively says "please provide me with self and the argument".
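To illustrate the kind of indirect call that trips up static analysis (the names here are hypothetical, not from the real code base):

from django.db import models

class Invoice(models.Model):
    category = models.CharField(max_length=50)

    def format_summary(self, lines):
        # The body never touches self, so PyCharm suggests a staticmethod...
        return ", ".join(lines)

    def render(self, action, lines):
        # ...but elsewhere the method is resolved by name at runtime and
        # called through self. Static analysis cannot see this usage, so
        # blindly applying the refactoring suggestion can break callers.
        handler = getattr(self, action)   # e.g. action == "format_summary"
        return handler(lines)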
TLDR: How do I proceed to weed out dead and unused code in an easy and efficient way? Thanks for all input in advance.

appengine-mapreduce fails with out of memory during shuffle stage

I have ~50M entities stored in the datastore. Each item can be one of 7 types.
Next, I have a simple MapReduce job that counts the number of items of each type. It is written in Python and based on the appengine-mapreduce library. The mapper emits (type, 1). The reducer simply adds up the 1s received for each type.
When I run this job with 5000 shards, the map stage runs fine. It uses a total of 20 instances, which is the maximum possible based on my task-queue configuration.
However, the shuffle-hash stage uses only one instance and fails with an out-of-memory error. I cannot understand why only one instance is used for hashing, and how I can fix this out-of-memory error.
I have tried writing a combiner but I never saw a combiner stage on the mapreduce status page or in the logs.
Also, the wiki for appengine-mapreduce on github is obsolete and I cannot find an active community where I can ask questions.
You are correct that the Python shuffle is in-memory based and does not scale. There is a way to make the Python MR use the Java MR shuffle phase (which is fast and scales). Unfortunately, documentation about it (the setup and how the two libraries communicate) is poor. See this issue for more information.

GAE mapreduce: How to access counters when the counting is done?

I have a mapper pipeline where the map function increments a counter using
yield op.counters.Increment("mycounter")
But I don't know how to access the value of "mycounter" after the pipeline has completed. I have seen examples using a completion handler, but they seem to refer to an older mapreduce library, where one could actually define a completion handler.
My best guess is that I need to define a final stage in the pipeline that has access to the mapper pipeline's counters -- but how exactly?
As answered in this related question, this feature is not available right now. There's a feature request in their issue tracker (Issue 208), which currently has the status "Started". Please star it ;-)

How to check if DataStore Indexes are being served on AppEngine?

How can I check from Python code whether the datastore indexes defined in index.yaml are serving?
I am using the Python App Engine SDK 1.3.6.
Attempt to perform a query that requires that index. If it raises a NeedIndexError, the index has not been uploaded or is not yet serving.
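A minimal sketch of that probe with the old db API (the Greeting model and its properties are placeholders; the real query should be one that needs a composite index from your index.yaml):

from google.appengine.api import datastore_errors
from google.appengine.ext import db

class Greeting(db.Model):
    author = db.StringProperty()
    date = db.DateTimeProperty(auto_now_add=True)

def composite_index_is_serving():
    # A filter plus a sort order on different properties requires a
    # composite index, so this query only succeeds once that index serves.
    try:
        db.Query(Greeting).filter('author =', 'test').order('-date').fetch(1)
        return True
    except datastore_errors.NeedIndexError:
        return False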
I don't think there's a way to check without adding some logging to the SDK code. If you're using the SQLite stub, __FindIndexForQuery, lines 1114-1140, is the part that looks for applicable indices to a query, and (at line 1140), returns, and I quote:
An entity_pb.CompositeIndex PB, if a
suitable index exists; otherwise None
A little logging at that point (and when it's about to fall off the end having exhausted the loop -- that's how it returns None) will give you a trace of all your indices that are actually used, as part of the logs of course. The protocol buffer it returns is an instance of the class defined in this file, starting at line 2576.
If you can explain why you want to know this, it would, I think, be quite reasonable to open a feature request on the App Engine tracker, asking Google to add the logging that I'm suggesting, so you don't have to keep maintaining your edited version of the file!
(If you use the file stub, the relevant file is here, and the part to instrument is around line 824 and following; of course, this part will be used only if you're running the SDK in "require indices" mode, AKA "strict mode", otherwise, indices are created in, not used by, the SDK;-)
