I Have few questions about google classroom batch requests
I see this notice in documentation page "The Classroom API is currently experiencing issues with batch requests. Use multithreading for heavy request loads, instead.". But the notice below, leads me to a blog post saying if I use a proper client library version and only send homogeneous requests, it is still fine. So is this notice about Classroom API batch requests having issues still valid with a proper client library(google-api-python-client==1.7.11)?
This one is not about batch requests, but leads to the third question below. When we list courses/teachers/students there is a page-size parameter. If it's below 30 it returns correct number but anything above 30 it still returns 30 and I have to send second request to get the rest. Is this behavior documented somewhere?
With batch requests when requests have more results like in Q2, is there a proper way to gather rest of the results. What I have so far is something like this.
def callback_s(id, res, exc):
if exc:
print('exception',str(exc))
t = res.get('students',[])
np = res.get('nextPageToken')
if np:
#how to get rest of the results
def get_students(courses):
service = discovery.build('classroom', 'v1', credentials=creds)
br = service.new_batch_http_request(callback=callback_s)
for c in courses:
sr = service.courses().students().list(courseId=c['id'])
br.add(sr, request_id=c['id'])
br.execute()
Any pointers would be greatly appreciated.
Classroom API batch request issues:
The warning on top of the page refers to batch requests in general. This certainly includes whatever libraries you use, as long as they are using the same API (and of course, that's the case for the official Python library).
The blog post you mention is about discontinuing support for global batch endpoints, so that batch requests have to be API-specific from now on. That's totally unrelated to the current issues regarding Classroom API batch requests. It's also older than the warning, and is not taking those problems into account.
pageSize maximum value:
The documentation for pageSize doesn't specify the maximum value. For teachers.list and students.list mentions the default value (30). If you're setting a value higher than 30 and still returning only 30, chances are that's the maximum value too.
This doesn't seem to be documented, though:
pageSize: Maximum number of items to return. The default is 30 if unspecified or 0.
Beware, that doesn't seem to be the limit for courses.list (no default pageSize is mentioned, and a call to it retrieves way more than 30).
Multiple pages and batch requests:
You cannot request multiple pages from a list request at once using batch requests, since you need the nextPageToken from the previous page to request the next page (using pageToken). That is to say, you have to make one request after the other.
Related
would it be possible to implement a rate limiting feature on my tornado app? like limit the number of HTTP request from a specific client if they are identified to send too many requests per second (which red flags them as bots).
I think I could it manually by storing the requests on a database and analyzing the requests per IP address but I was just checking if there is already an existing solution for this feature.
I tried checking the github page of tornado, I have the same questions as this post but no explicit answer was provided. checked tornado's wiki links as well but I think rate limiting is not handled yet.
Instead of storing them in the DB, would be better to have them in a dictionary stored in memory for easy usability.
Also can you share the details whether the api has a load-balancer and which web-server is used.
The enterprise grade solution to your problem is ambassador.
You can use ambassador's solutions like envoy proxy and edge stack and have it set up that can do the needful.
additional to tore the data, you can use any popular cached db, or d that store as key:value pairs, for example redis.
if you doing this for a very small project, can use some npm/pip packages.
Read the docs: https://www.getambassador.io/products/edge-stack/api-gateway/
You should probably do this before your requests reach Tornado.
But if it's an application level feature (limiting requests depending on level of subscription), then you can do it in Tornado in lots of ways, depending on how complex you want the rate limiting to be.
Probably the simplest way is to have a dict on your tornado.web.Application that uses the ip as the key and the timestamp of the last request as the value and check every request against it in prepare- if not enough time passed since last request, raise a tornado.web.HTTPError(429) (ideally with a Retry-After header). If you do this, you will still need to clean up this dict now and then to remove entries that have not made a request recently or it will grow (you could do it finish on every request).
If you have another fast/in-memory storage attached (memcache, redis, sqlite), you should use that, but you definitely should not use an RDBMS as all those writes will not be great for its performance.
I have used python's requests module to do a POST call(within a loop) to a single URL with varying sets of data in each iteration. I have already used session reuse to use the underlying TCP connection for each call to the URL during the loop's iterations.
However, I want to further speed up my processing by 1. caching the URL and the authentication values(user id and password) as they would remain the same in each call 2. Spawning multiple sub-processes which could take a certain number of calls and pass them as a group thus allowing for parallel processing of these smaller sub-processes
Please note that I pass my authentication as headers in base64 format and a pseudo code of my post call would typically look like this:
S=requests.Session()
url='https://example.net/'
Loop through data records:
headers={'authorization':authbase64string,'other headers'}
data="data for this loop"
#Post call
r=s.post(url,data=data,headers=headers)
response=r.json()
#end of loop and program
Please review the scenario and suggest any techniques/tips which might be of help-
Thanks in advance,
Abhishek
You can:
do it as you described (if you want to make it faster then you can run it using multiprocessing) and e.g. add headers to session, not request.
modify target server and allow to send one post request with multiple data (so you're going to limit time spent on connecting, etc)
do some optimalizations on server side, so it's going to reply faster (or just store requests and send you response using some callback)
It would be much easier if you described the use case :)
I'm working on an application that will have to use multiple external APIs for information and after processing the data, will output the result to a client. The client uses a web interface to query, once query is send to server, server process send requests to different API providers and after joining the responses from those APIs then return response to client.
All responses are in JSON.
current approach:
import requests
def get_results(city, country, query, type, position):
#get list of apis with authentication code for this query
apis = get_list_of_apis(type, position)
results = [ ]
for api in apis:
result = requests.get(api)
#parse json
#combine result in uniform format to display
return results
Server uses Django to generate response.
Problem with this approach
(i) This may generate huge amounts of data even though client is not interested in all.
(ii) JSON response has to be parsed based on different API specs.
How to do this efficiently?
Note: Queries are being done to serve job listings.
Most APIs of this nature allow for some sort of "paging". You should code your requests to only draw a single page from each provider. You can then consolidate the several pages locally into a single stream.
If we assume you have 3 providers, and page size is fixed at 10, you will get 30 responses. Assuming you only show 10 listings to the client, you will have to discard and re-query 20 listings. A better idea might be to locally cache the query results for a short time (say 15 minutes to an hour) so that you don't have to requery the upstream providers each time your user advances a page in the consolidated list.
As far as the different parsing required for different providers, you will have to handle that internally. Create different classes for each. The list of providers is fixed, and small, so you can code a table of which provider-url gets which class behavior.
Shameless plug but I wrote a post on how I did exactly this in Durango REST framework here.
I highly recommend using Django REST framework, it makes everything so much easier
Basically, the model on your APIs end is extremely simple and simply contains information on what external API is used and the ID for that API resource. A GenericProvider class then provides an abstract interface to perform CRUD operations on the external source. This GenericProvider uses other providers that you create and determines what provider to use via the provider field on the model. All of the data returned by the GenericProvider is then serialised as usual.
Hope this helps!
I'm using the gmail API to search emails from users. I've created the following search query:
ticket after:2015/11/04 AND -from:me AND -in:trash
When I run this query in the browser interface of Gmail I get 11 messages (as expected). When I run the same query in the API however, I get only 10 messages. The code I use to query the gmail API is written in Python and looks like this:
searchQuery = 'ticket after:2015/11/04 AND -from:me AND -in:trash'
messagesObj = google.get('/gmail/v1/users/me/messages', data={'q': searchQuery}, token=token).data
print messagesObj.resultSizeEstimate # 10
I sent the same message on to another gmail address and tested it from that email address and (to my surprise) it does show up in an API-search with that other email address, so the trouble is not the email itself.
After endlessly emailing around through various test-gmail accounts I *think (but not 100% sure) that the browser-interface search function has a different definition of "me". It seems that in the API-search it does not include emails which come from email addresses with the same name while these results are in fact included in the result of the browser-search. For example: if "Pete Kramer" sends an email from petekramer#icloud.com to pete#gmail.com (which both have their name set to "Pete Kramer") it will show in the browser-search and it will NOT show in the API-search.
Can anybody confirm that this is the problem? And if so, is there a way to circumvent this to get the same results as the browser-search returns? Or does anybody else know why the results from the gmail browser-search differ from the gmail API-search? Al tips are welcome!
I would suspect it is the after query parameter that is giving you trouble. 2015/11/04 is not a valid ES5 ISO 8601 date. You could try the alternative after:<time_in_seconds_since_epoch>
# 2015-11-04 <=> 1446595200
searchQuery = 'ticket AND after:1446595200 AND -from:me AND -in:trash'
messagesObj = google.get('/gmail/v1/users/me/messages', data={'q': searchQuery}, token=token).data
print messagesObj.resultSizeEstimate # 11 hopefully!
The q parameter of the /messages/list works the same as on the web UI for me (tried on https://developers.google.com/gmail/api/v1/reference/users/messages/list#try-it )
I think the problem is that you are calling /messages rather than /messages/list
The first time your application connects to Gmail, or if partial synchronization is not available, you must perform a full sync. In a full sync operation, your application should retrieve and store as many of the most recent messages or threads as are necessary for your purpose. For example, if your application displays a list of recent messages, you may wish to retrieve and cache enough messages to allow for a responsive interface if the user scrolls beyond the first several messages displayed. The general procedure for performing a full sync operation is as follows:
Call messages.list to retrieve the first page of message IDs.
Create a batch request of messages.get requests for each of the messages returned by the list request. If your application displays message contents, you should use format=FULL or format=RAW the first time your application retrieves a message and cache the results to avoid additional retrieval operations. If you are retrieving a previously cached message, you should use format=MINIMAL to reduce the size of the response as only the labelIds may change.
Merge the updates into your cached results. Your application should store the historyId of the most recent message (the first message in the list response) for future partial synchronization.
Note: You can also perform synchronization using the equivalent Threads resource methods. This may be advantageous if your application primarily works with threads or only requires message metadata.
Partial synchronization
If your application has synchronized recently, you can perform a partial sync using the history.list method to return all history records newer than the startHistoryId you specify in your request. History records provide message IDs and type of change for each message, such as message added, deleted, or labels modified since the time of the startHistoryId. You can obtain and store the historyId of the most recent message from a full or partial sync to provide as a startHistoryId for future partial synchronization operations.
Limitations
History records are typically available for at least one week and often longer. However, the time period for which records are available may be significantly less and records may sometimes be unavailable in rare cases. If the startHistoryId supplied by your client is outside the available range of history records, the API returns an HTTP 404 error response. In this case, your client must perform a full sync as described in the previous section.
From gmail API Documentation
https://developers.google.com/gmail/api/guides/sync
I'm using elasticsearch and the RESTful API supports supports reading bodies in GET requests for search criteria.
I'm currently doing
response = urllib.request.urlopen(url, data).read().decode("utf-8")
If data is present, it issues a POST, otherwise a GET. How can I force a GET despite the fact that I'm including data (which should be in the request body as per a POST)
Nb: I'm aware I can use a source property in the Url but the queries we're running are complex and the query definition is quite verbose resulting in extremely long urls (long enough that they can interfere with some older browsers and proxies).
I'm not aware of a nice way to do this using urllib. However, requests makes it trivial (and, in fact, trivial with any arbitrary verb and request content) by using the requests.request* function:
requests.request(method='get', url='localhost/test', data='some data')
Constructing a small test web server will show that the data is indeed sent in the body of the request, and that the method perceived by the server is indeed a GET.
*note that I linked to the requests.api.requests code because that's where the actual function definition lives. You should call it using requests.request(...)