How Do I Download 30,000 Images Using Google Drive API? - python

I need to download 30,000 images using the Google Drive API (I have all of their file_ids saved locally) so that I can upload them to AWS S3, but after only 20-30 image requests to the API I get a 403 error, which means I'm exceeding the API quota (1,000 requests per 100 seconds per user - not sure how I'm exceeding it, but that's beside the point). My code sleeps for 2 seconds between each request and I still get this error. I need to download and upload these files in a reasonable amount of time. Any suggestions?

I am not sure which library you are using to make the request, but as I understand it, urlopen will raise an HTTPError for responses it can't handle, such as 403 (request forbidden).
Reference - List of Errors:
403: ('Forbidden', 'Request forbidden -- authorization will not help')
Instead you can use urlretrieve().
Sharing a small code sample:
import urllib.request
url = 'http://example.com/'
response = urllib.request.urlopen(url)
data = response.read() # a `bytes` object
text = data.decode('utf-8') # a `str`; this step can't be used if data is binary
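For completeness, a minimal sketch using urlretrieve() instead, which saves the response straight to disk (the URL pattern and filename here are placeholders):
import urllib.request

# Hypothetical direct-download URL for a publicly shared Drive file
url = 'https://drive.google.com/uc?export=download&id=FILE_ID'
# urlretrieve() downloads the response body and writes it to the given path
urllib.request.urlretrieve(url, 'image.jpg')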

Downloading images with the Drive API counts as one request per image, so you can easily surpass the quota limit.
Luckily there is a workaround - you can use batch requests, which let you bundle up to 100 calls into a single request.
The documentation provides samples for implementation in Python; a sketch of the pattern is below.
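A minimal sketch of that pattern with google-api-python-client, assuming you already have credentials and your list of file_ids; like the documentation samples it batches files().get() calls, and the callback and fields are illustrative:
from googleapiclient.discovery import build

def callback(request_id, response, exception):
    # Called once per batched request; exception is set if that request failed
    if exception is not None:
        print('Request {} failed: {}'.format(request_id, exception))
    else:
        print(request_id, response.get('name'))

service = build('drive', 'v3', credentials=creds)  # creds obtained elsewhere
batch = service.new_batch_http_request(callback=callback)
for file_id in file_ids[:100]:  # at most 100 calls per batch request
    batch.add(service.files().get(fileId=file_id, fields='id, name'),
              request_id=file_id)
batch.execute()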
Btw., you can check your quota usage in your GCP console.

Related

Qualtrics API, getting "[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)"

I occasionally get the above error when making requests to the Qualtrics APIs with the Python requests library.
In a nutshell, I have a Google Cloud Function that triggers when a csv file is placed in a specific Cloud Storage bucket. The function creates a distribution list on Qualtrics, uploads the contacts and then downloads the distribution links.
Every day, three files are uploaded to Cloud Storage, each for a different survey, so three function instances are started.
My gripes with the issue are:
it doesn't happen regularly; in fact, the workflow has worked correctly for the past year
it doesn't seem to be tied to the processed files: when the function crashes and I manually restart it by reuploading the same csv into the bucket, it works smoothly
The problem started around the time we added the third daily csv to the process, and it tends to happen when two files are being processed at the same time. For these reasons, my suspicions are:
I'm getting rate limited by Qualtrics (but I would expect a clearer message from Qualtrics)
the requests somehow get "crossed up" when two files are processed at once. I'm not sure whether requests.request implicitly opens a session with the APIs; in that case the problem could be caused by multiple sessions being open at the same time while two csv files are processed
As I said, the error seems to happen without a pattern, and it has happened in every part of the code where I make a request to the APIs, so I'm not sure sharing extensive code is helpful, but in general the requests are performed in a pretty standard way:
requests.request("POST", requestUrl, data=requestPayload, headers=headers)
requests.request("GET", requestUrl, headers=headers)
etc
i.e. I'm not using any particular custom options.
In the end I kind of resolved the issue with a workaround:
I separated the processing of the three csv files so that no two files overlap in processing time
I implemented a retry policy on the POST requests (see the sketch below)
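A minimal sketch of the kind of retry policy I mean (the URL, payload and headers are placeholders; tune the number of attempts and the back-off to your own limits):
import time
import requests

def post_with_retry(url, payload, headers, attempts=3, backoff=5):
    for attempt in range(1, attempts + 1):
        try:
            return requests.request("POST", url, data=payload, headers=headers)
        except requests.exceptions.RequestException as exc:
            # SSLError (like the one above) is a subclass of RequestException
            if attempt == attempts:
                raise
            print('Attempt {} failed ({}); retrying in {}s'.format(attempt, exc, backoff))
            time.sleep(backoff)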
Since then, separating the files' processing times has substantially reduced the number of errors (from one or more each day to around one a week), and even when they happen the retry policy circumvents the error at the first retry.
I realize this may not be the ideal solution, so I'm open to alternatives if someone comes up with something better (or even more insight into the root problem).

Google Classroom Api batch requests via client library

I have a few questions about Google Classroom batch requests.
I see this notice on the documentation page: "The Classroom API is currently experiencing issues with batch requests. Use multithreading for heavy request loads, instead." But the notice below it leads to a blog post saying that if I use a proper client library version and only send homogeneous requests, it is still fine. So is the notice about Classroom API batch requests having issues still valid with a proper client library (google-api-python-client==1.7.11)?
This one is not about batch requests, but it leads to the third question below. When we list courses/teachers/students there is a page-size parameter. If it's below 30 the correct number is returned, but for anything above 30 it still returns only 30 and I have to send a second request to get the rest. Is this behavior documented somewhere?
With batch requests, when a request has more results (as in Q2), is there a proper way to gather the rest of the results? What I have so far is something like this:
def callback_s(id, res, exc):
    if exc:
        print('exception', str(exc))
        return
    t = res.get('students', [])
    np = res.get('nextPageToken')
    if np:
        pass  # how to get the rest of the results?

def get_students(courses):
    service = discovery.build('classroom', 'v1', credentials=creds)
    br = service.new_batch_http_request(callback=callback_s)
    for c in courses:
        sr = service.courses().students().list(courseId=c['id'])
        br.add(sr, request_id=c['id'])
    br.execute()
Any pointers would be greatly appreciated.
Classroom API batch request issues:
The warning on top of the page refers to batch requests in general. This certainly includes whatever libraries you use, as long as they are using the same API (and of course, that's the case for the official Python library).
The blog post you mention is about discontinuing support for global batch endpoints, so batch requests have to be API-specific from now on. That's totally unrelated to the current issues regarding Classroom API batch requests. It's also older than the warning and does not take those problems into account.
pageSize maximum value:
The documentation for pageSize doesn't specify the maximum value. For teachers.list and students.list it mentions the default value (30). If you're setting a value higher than 30 and it still returns only 30, chances are that's the maximum value too.
This doesn't seem to be documented, though:
pageSize: Maximum number of items to return. The default is 30 if unspecified or 0.
Beware, that doesn't seem to be the limit for courses.list (no default pageSize is mentioned, and a call to it retrieves way more than 30).
Multiple pages and batch requests:
You cannot request multiple pages from a list request at once using batch requests, since you need the nextPageToken from the previous page to request the next page (using pageToken). That is to say, you have to make one request after the other.
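For example, a minimal sketch of paging through students.list sequentially (reusing the service object from your snippet; the page size of 30 reflects the observed maximum above):
def get_all_students(service, course_id):
    students = []
    page_token = None
    while True:
        res = service.courses().students().list(
            courseId=course_id,
            pageSize=30,
            pageToken=page_token).execute()
        students.extend(res.get('students', []))
        page_token = res.get('nextPageToken')
        if not page_token:
            return students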

openshift 3 django - request too large

I migrated a Django app from OpenShift 2 to OpenShift 3 Online. It has an upload feature that allows users to upload audio files. The files are usually larger than 50 MB. On OpenShift 3, uploading only works for files up to around 12 MB; anything larger leads to an error message in Firefox saying "connection canceled". Chromium gives more details:
Request Entity Too Large
The requested resource
/myApp/upload
does not allow request data with POST requests, or the amount of data provided in the request exceeds the capacity limit.
I'm using mod_wsgi-express. From searching for this error message on Google, I can see that I'm probably hitting some limit in the web server configuration. Which limit could that be, and how would I be able to change it?
As per help messages from running mod_wsgi-express start-server --help:
--limit-request-body NUMBER
The maximum number of bytes which are allowed in a
request body. Defaults to 10485760 (10MB).
Change your app.sh to add the option and set it to a larger value.
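For example (a hypothetical start command; the entry point and size are yours to adjust):
mod_wsgi-express start-server wsgi.py --limit-request-body 104857600  # ~100 MB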

503 error when downloading data from imdb api

I am trying to download the plot for almost 25,000 movies using the IMDbPY module for Python. To speed things up, I'm using the Pool function from the multiprocessing module. However, after almost 100 requests a 503 error occurs with the following message: Service Temporarily Unavailable. After 10-15 minutes I can process again, but after approximately 20 requests the same error occurs again.
I am aware that it might simply be the API blocking me to prevent too many calls; however, I can't find any info on the web about the maximum number of requests per time unit.
Do you have any idea how to make so many calls without being shut down? Moreover, do you know where I can find the documentation of the IMDb API?
Best
Please, don't do it.
Scraping is forbidden by IMDb's terms of service, and IMDbPY was never intended to be used to mass-scrape the web site: in fact it's explicitly designed to fetch a single movie at a time.
In theory IMDbPY can manage the plain text data files they distribute, but unfortunately they recently changed both the format and the content of the data.
IMDb has no APIs that I know of; if you have to manage such a huge portion of their data, you have to get a licence.
Please consider using http://www.omdbapi.com/ instead.
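For example, a minimal sketch of fetching a plot from OMDb with requests (you need to request an API key from the site; the title is a placeholder):
import requests

# OMDb parameters: apikey (yours), t = title, plot = 'short' or 'full'
resp = requests.get('http://www.omdbapi.com/',
                    params={'apikey': 'YOUR_KEY', 't': 'The Matrix', 'plot': 'full'})
print(resp.json().get('Plot'))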

IBM Watson SpeechToTextV1 error - Python

I have been trying my hand at the IBM Watson Speech to Text API. It works with short audio files, but not with audio files that are around 5 minutes long. It gives me the error below:
"watson {'code_description': 'Bad Request', 'code': 400, 'error': 'No speech detected for 30s.'}"
I am using Watson's trial account. Is there a limitation with the trial account, or is there a bug in the code below?
Python code:
from watson_developer_cloud import SpeechToTextV1

speech_to_text = SpeechToTextV1(
    username='XXX',
    password='XXX',
    x_watson_learning_opt_out=False
)

with open('trial.flac', 'rb') as audio_file:
    print(speech_to_text.recognize(
        audio_file,
        content_type='audio/flac',
        model='en-US_NarrowbandModel',
        timestamps=False,
        word_confidence=False,
        continuous=True))
Appreciate any help!
Please see the implementation notes from the Speech to Text API Explorer for the recognize API you are attempting to use:
Implementation Notes
Sends audio and returns transcription results for
a sessionless recognition request. Returns only the final results; to
enable interim results, use session-based requests or the WebSocket
API. The service imposes a data size limit of 100 MB. It automatically
detects the endianness of the incoming audio and, for audio that
includes multiple channels, downmixes the audio to one-channel mono
during transcoding.
Streaming mode
For requests to transcribe live
audio as it becomes available or to transcribe multiple audio files
with multipart requests, you must set the Transfer-Encoding header to
chunked to use streaming mode. In streaming mode, the server closes
the connection (status code 408) if the service receives no data chunk
for 30 seconds and the service has no audio to transcribe for 30
seconds. The server also closes the connection (status code 400) if no
speech is detected for inactivity_timeout seconds of audio (not
processing time); use the inactivity_timeout parameter to change the
default of 30 seconds.
There are two factors here. First, there is a data size limit of 100 MB, so I would make sure you do not send files larger than that to the Speech to Text service. Secondly, you can see the server will close the connection and return a 400 error if no speech is detected for the number of seconds defined by inactivity_timeout. The default value is 30 seconds, so this matches the error you are seeing above.
I would suggest you make sure there is valid speech in the first 30 seconds of your file and/or make the inactivity_timeout parameter larger to see if the problem still exists. To make things easier, you can test the failing file and other sound files by using the API Explorer in a browser:
Speech to Text API Explorer
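For example, a minimal sketch of raising the timeout (assuming your SDK version accepts inactivity_timeout as a keyword argument to recognize(); otherwise it can be passed as a query parameter per the API reference):
with open('trial.flac', 'rb') as audio_file:
    result = speech_to_text.recognize(
        audio_file,
        content_type='audio/flac',
        model='en-US_NarrowbandModel',
        inactivity_timeout=60)  # allow up to 60s without speech before closing
    print(result)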
In the API documentation there is this Python code; it lets you handle the error instead of crashing when the server closes the connection after the default 30 s, and it works for other errors too.
It's like a "try and except", with the extra step of implementing the function in a callback class.
def on_error(self, error):
    print('Error received: {}'.format(error))
Here is the link:
https://cloud.ibm.com/apidocs/speech-to-text?code=python
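For context, in the current ibm-watson SDK that pattern looks roughly like this (a sketch based on the linked docs; class and method names may differ in older SDK versions such as watson_developer_cloud):
from ibm_watson.websocket import RecognizeCallback, AudioSource

class MyRecognizeCallback(RecognizeCallback):
    def on_data(self, data):
        print(data)  # final transcription results

    def on_error(self, error):
        print('Error received: {}'.format(error))  # e.g. the 30s inactivity error

with open('trial.flac', 'rb') as audio_file:
    speech_to_text.recognize_using_websocket(
        audio=AudioSource(audio_file),
        content_type='audio/flac',
        recognize_callback=MyRecognizeCallback())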
