Using my answer to my question on how to download files from a public Google Drive, I managed in the past to download images using their IDs from a Python script, with Google API v3, from a public drive, using the following block of code:
from google_auth_oauthlib.flow import Flow, InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload
from google.auth.transport.requests import Request
import io
import re
SCOPES = ['https://www.googleapis.com/auth/drive']
CLIENT_SECRET_FILE = "myjson.json"
authorized_port = 6006 # authorize URI redirect on the console
flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRET_FILE, SCOPES)
cred = flow.run_local_server(port=authorized_port)
drive_service = build("drive", "v3", credentials=cred)
regex = "(?<=https://drive.google.com/file/d/)[a-zA-Z0-9]+"
for i, l in enumerate(links_to_download):
    url = l
    file_id = re.search(regex, url)[0]
    request = drive_service.files().get_media(fileId=file_id)
    fh = io.FileIO(f"file_{i}", mode='wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))
In the meantime I discovered pydrive and pydrive2, two wrappers around Google API v2 that allow you to do very useful things, such as listing files in folders, and that basically let you do the same thing with lighter syntax:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
import io
import re
CLIENT_SECRET_FILE = "client_secrets.json"
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
regex = "(?<=https://drive.google.com/file/d/)[a-zA-Z0-9]+"
for i, l in enumerate(links_to_download):
    url = l
    file_id = re.search(regex, url)[0]
    file_handle = drive.CreateFile({'id': file_id})
    file_handle.GetContentFile(f"file_{i}")
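For example, listing the files of a folder (one of those useful things) goes roughly like this; folder_id is a placeholder, and the query string follows the Drive v2 search syntax:

# List all non-trashed files contained in a given folder (folder_id is a placeholder).
file_list = drive.ListFile({'q': "'folder_id' in parents and trashed=false"}).GetList()
for f in file_list:
    print(f['title'], f['id'])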
However, now, whether I use pydrive or the raw API, I cannot seem to download the same files; instead I am met with:
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://www.googleapis.com/drive/v3/files/fileID?alt=media returned "File not found: fileID.". Details: "[{'domain': 'global', 'reason': 'notFound', 'message': 'File not found: fileID.', 'locationType': 'parameter', 'location': 'fileId'}]">
I tried everything and registered 3 different apps in the Google console; it seems it might (or might not) be a question of scoping (see for instance this answer, with apps having access only to files in my Google Drive or created by the app). However, I did not have this issue before (last year).
In the Google console, explicitly granting https://www.googleapis.com/auth/drive as a scope to the API requires filling in a ton of fields: the application's website, conditions of use, confidentiality rules, authorized domains, and YouTube videos explaining the app. However, I will be the sole user of this script.
So I could only explicitly grant the following scopes:
/auth/drive.appdata
/auth/drive.file
/auth/drive.install
Is it because of scoping? Is there a solution that doesn't require creating a homepage and a YouTube video?
EDIT 1:
Here is an example of links_to_download:
links_to_download = ["https://drive.google.com/file/d/fileID/view?usp=drivesdk&resourcekey=0-resourceKeyValue"]
EDIT 2:
It is super unstable: sometimes it works without breaking a sweat, sometimes it doesn't. When I relaunch the script multiple times I get different results. Retry policies work to a certain extent, but sometimes it fails repeatedly for hours.
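For reference, the retry policy amounts to wrapping the chunked download in exponential backoff, roughly like this sketch (download_with_retries is a hypothetical helper; it assumes the failures surface as HttpError):

import time
from googleapiclient.errors import HttpError

def download_with_retries(request, fh, max_tries=5):
    # Retry the chunked download with exponential backoff; a sketch only.
    for attempt in range(max_tries):
        try:
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while not done:
                _, done = downloader.next_chunk()
            return True
        except HttpError:
            time.sleep(2 ** attempt)
    return False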
Well, this is thanks to the security update released by Google a few months ago. It makes link sharing stricter, and you need the resource key in addition to the fileId to access the file.
As per the documentation, for newer links you need to provide the resource key as well, in the X-Goog-Drive-Resource-Keys header, in the form fileId1/resourceKey1.
If you apply this change to your code, it will work as before. Example edit below:
regex = "(?<=https://drive.google.com/file/d/)[a-zA-Z0-9]+"
regex_rkey = "(?<=resourcekey=)[a-zA-Z0-9-]+"
for i, l in enumerate(links_to_download):
    url = l
    file_id = re.search(regex, url)[0]
    resource_key = re.search(regex_rkey, url)[0]
    request = drive_service.files().get_media(fileId=file_id)
    request.headers["X-Goog-Drive-Resource-Keys"] = f"{file_id}/{resource_key}"
    fh = io.FileIO(f"file_{i}", mode='wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))
Well, the regex for the resource key was something I made quickly, so I cannot be sure it supports every case, but it gives you the solution.
You may also have to handle both old and new links based on this and apply the change accordingly, as in the sketch below.
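For example, a sketch that applies the header only when a resource key is present, so older links keep working:

for i, l in enumerate(links_to_download):
    file_id = re.search(regex, l)[0]
    rkey_match = re.search(regex_rkey, l)  # None for older links without a resource key
    request = drive_service.files().get_media(fileId=file_id)
    if rkey_match:
        request.headers["X-Goog-Drive-Resource-Keys"] = f"{file_id}/{rkey_match[0]}"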
Hello, I have a Python Cloud Function:
import requests
import pandas as pd
import datetime
from google.cloud import storage
import os
api_key = os.environ['API_KEY']
url = f'https://api.sportsdata.io/v3/nba/scores/json/TeamSeasonStats/2022?key={api_key}'

def export_data(url):
    response = requests.get(url)  # Make a GET request to the URL
    print(response)
    payload = response.json()  # Parse `response.text` into JSON
    # Turn the JSON response into a DataFrame
    new_list = pd.DataFrame(list(payload))
    # Convert your df to str: it is straightforward, just do not provide
    # any value for the first param path_or_buf
    csv_str = new_list.to_csv()
    # Then, upload it to cloud storage
    def upload_blob(bucket_name, data, destination_blob_name):
        storage_client = storage.Client()
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(destination_blob_name)
        # Note the use of upload_from_string here. Please provide
        # the appropriate content type if you wish
        blob.upload_from_string(data, content_type='text/csv')
    upload_blob('basketball_api_data', csv_str, 'data-' + str(datetime.date.today()) + '.csv')

export_data(url)
It basically fetches basketball API data; however, I am getting errors deploying it.
My requirements.txt looks like this:
# Function dependencies, for example:
# package>=version
google-cloud-storage
pandas
requests
datetime
os
google.cloud
Am I writing my Cloud Function the wrong way? What am I doing wrong that causes errors deploying this to GCF?
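As an aside, os and datetime are part of the Python standard library and google.cloud is not a pip-installable package name, so (assuming the deployment error comes from dependency resolution, since the error itself isn't shown) a requirements.txt along these lines may be what the build expects:

google-cloud-storage
pandas
requests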
I am trying to download a file from my Azure Blob Storage account. To do so, I have checked what the URL is, and I am doing the following:
with urllib.request.urlopen("<url_file>") as resp:
    img = np.asarray(bytearray(resp.read()), dtype="uint8")
But I am getting the following error:
urllib.error.HTTPError: HTTP Error 404: The specified resource does not exist.
I have double-checked that the URL is correct. Could this have something to do with not having passed the keys of my subscription or any other info about the storage account?
Any idea?
As of Dec 26, 2019, I am unable to import BaseBlobService from the Azure storage library. Neither BlobPermissions nor generate_blob_shared_access_signature worked for me. Below is what I used; it worked in my case, and I hope it helps.
from azure.storage.blob import generate_blob_sas, AccountSasPermissions
from datetime import datetime, timedelta

def scan_product():
    account_name = '<account_name>'
    container_name = '<container_name>'
    blob_name = '<blob_name>'
    account_key = '<account_key>'
    url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"
    sas_token = generate_blob_sas(
        account_name=account_name,
        account_key=account_key,
        container_name=container_name,
        blob_name=blob_name,
        permission=AccountSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=1)
    )
    url_with_sas = f"{url}?{sas_token}"
    return url_with_sas
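The resulting url_with_sas can then be used the same way as in the question, for instance (a sketch, assuming scan_product returns url_with_sas as above):

import urllib.request
import numpy as np

url_with_sas = scan_product()
with urllib.request.urlopen(url_with_sas) as resp:
    img = np.asarray(bytearray(resp.read()), dtype="uint8")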
Actually, you can generate a blob URL with a SAS token in the Azure Storage SDK for Python and access it directly, as in my sample code below.
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
account_name = '<account name>'
account_key = '<account key>'
container_name = '<container name>'
blob_name = '<blob name>'
url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"
service = BaseBlobService(account_name=account_name, account_key=account_key)
token = service.generate_blob_shared_access_signature(container_name, blob_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1),)
url_with_sas = f"{url}?{token}"
Then,
import urllib.request
import numpy as np

req = urllib.request.urlopen(url_with_sas)
img = np.asarray(bytearray(req.read()), dtype=np.uint8)
To download using the URL directly, you should put the blob in a public container, or, if it is in a private container, generate a SAS token for the blob (the URL then looks like: https://xxx.blob.core.windows.net/aa1/0116.txt?sp=r&st=2019-06-26T09:47:04Z&se=2019-06-26xxxxx).
I tested your code with a URL that contains a SAS token, and the file could be downloaded.
To solve the issue, all I needed to do was change the Blob Storage access level to Blob (anonymous read access for blobs only). Once this is done, it works.
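For reference, the access level can also be changed programmatically with the azure-storage-blob v12 SDK; a minimal sketch, assuming you have the account's connection string (placeholder values below):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection_string>")
container = service.get_container_client("<container_name>")
# 'blob' grants anonymous read access to blobs only (the "Blob" access level);
# the empty dict clears any stored access policies.
container.set_container_access_policy(signed_identifiers={}, public_access='blob')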
I'm running online predictions on the Google Cloud Machine Learning API using the Google API Python client and a model hosted for me on Google Cloud.
When I predict by sending one image, the whole round trip, including all traffic, takes about 40 seconds. When I send two images, after some time, I receive the message:
timeout: The read operation timed out
I would like to set the timeout to another value, but I couldn't find out how.
This is my code:
import base64
import io
import time
from PIL import Image
from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient import discovery
SCOPES = ['https://www.googleapis.com/auth/cloud-platform']
SERVICE_ACCOUNT_FILE = 'mycredentialsfile.json'
credentials = ServiceAccountCredentials.from_json_keyfile_name(
SERVICE_ACCOUNT_FILE, scopes=SCOPES)
ml = discovery.build('ml', 'v1', credentials=credentials)
projectID = 'projects/{}'.format('projectID') + '/models/{}'.format('modelID')
width = 640
height = 480
instances = []
for image in ["image5.jpg", "image6.jpg"]:
    img = Image.open(image)
    img = img.resize((width, height), Image.ANTIALIAS)
    output_str = io.BytesIO()
    img.save(output_str, "JPEG")
    instance = {"b64": base64.b64encode(output_str.getvalue()).decode("utf-8")}
    output_str.close()
    instances.append(instance)
input_json = {"instances": instances}
request = ml.projects().predict(body=input_json, name=projectID)
print("Starting prediction")
start_time = time.time()
response = request.execute()
print("%s seconds" % (time.time() - start_time))
I found a way by researching samples from the Google API Python client on GitHub and trying the same changes.
By using httplib2 to authorize, you can set the timeout.
The modified code follows:
import base64
import io
import time
from PIL import Image
# Need: pip install google-api-python-client
import httplib2
from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient import discovery
SCOPES = ['https://www.googleapis.com/auth/cloud-platform']
# API & Services -> Credentials -> Create Credential -> service account key
SERVICE_ACCOUNT_FILE = 'mycredentialsfile.json'
credentials = ServiceAccountCredentials.from_json_keyfile_name(
SERVICE_ACCOUNT_FILE, scopes=SCOPES)
http = httplib2.Http(timeout=200)
http = credentials.authorize(http)
ml = discovery.build('ml', 'v1', http=http)
projectID = 'projects/{}'.format('projectID') + '/models/{}'.format('modelID')
width = 640
height = 480
instances = []
for image in ["image5.jpg", "image6.jpg"]:
    img = Image.open(image)
    img = img.resize((width, height), Image.ANTIALIAS)
    output_str = io.BytesIO()
    img.save(output_str, "JPEG")
    instance = {"b64": base64.b64encode(output_str.getvalue()).decode("utf-8")}
    output_str.close()
    instances.append(instance)
input_json = {"instances": instances}
request = ml.projects().predict(body=input_json, name=projectID)
print("Starting prediction")
start_time = time.time()
response = request.execute()
print("%s seconds" % (time.time() - start_time))
I think that, with a few modifications, you can use this to set the timeout for almost any Google Cloud API in the Python client.
I hope this helps.
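For example, the same pattern presumably carries over to the Drive API (a sketch, untested; file name and scope are taken from the snippets above):

import httplib2
from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient import discovery

creds = ServiceAccountCredentials.from_json_keyfile_name(
    'mycredentialsfile.json', scopes=['https://www.googleapis.com/auth/drive'])
http = creds.authorize(httplib2.Http(timeout=200))  # timeout in seconds
drive_service = discovery.build('drive', 'v3', http=http)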
You have already solved the problem, but I found another way to do this.
import socket
socket.setdefaulttimeout(150)
If you call discovery.build without http, the http client is instantiated by build_http inside the build method.
https://googleapis.github.io/google-api-python-client/docs/epy/googleapiclient.http-pysrc.html#build_http
As you can see there, build_http creates an http client instance with a timeout if one is set before the http client is created.
So all you have to do is set this value via socket.setdefaulttimeout :)
Yes, I agree with Shohei's answer above. It took me a while to find this simple and elegant solution. You only need to add the following to your code:
import socket
timeout_in_sec = 60*3 # 3 minutes timeout limit
socket.setdefaulttimeout(timeout_in_sec)
# then you could create your ML service object as usually, and it will have the extended timeout limit.
ml_service = discovery.build('ml', 'v1')
I am trying to download files from google drive and all I have is the drive's URL.
I have read about the Google API, which involves some drive_service and MediaIO classes and also requires some credentials (mainly a JSON file / OAuth), but I am unable to get any idea of how it works.
Also, I tried urllib2.urlretrieve, but my case is to get files from Drive. I tried wget too, but to no avail.
I tried the PyDrive library. It has good functions for uploading to Drive but no download options.
Any help will be appreciated.
Thanks.
If by "drive's url" you mean the shareable link of a file on Google Drive, then the following might help:
import requests
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

if __name__ == "__main__":
    file_id = 'TAKE ID FROM SHAREABLE LINK'
    destination = 'DESTINATION FILE ON YOUR DISK'
    download_file_from_google_drive(file_id, destination)
The snippet does not use pydrive, nor the Google Drive SDK, though. It uses the requests module (which is, somehow, an alternative to urllib2).
When downloading large files from Google Drive, a single GET request is not sufficient. A second one is needed - see wget/curl large file from google drive.
I recommend gdown package.
pip install gdown
Take your share link
https://drive.google.com/file/d/0B9P1L--7Wd2vNm9zMTJWOGxobkU/view?usp=sharing
and grab the id (e.g. 1TLNdIufzwesDbyr_nVTR7Zrx9oRHLM_N) by pressing the download button and looking at the link, then swap it in after id= below.
import gdown
url = 'https://drive.google.com/uc?id=0B9P1L--7Wd2vNm9zMTJWOGxobkU'
output = '20150428_collected_images.tgz'
gdown.download(url, output, quiet=False)
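Recent gdown releases can also extract the id from a full share link via fuzzy=True, so (assuming your installed version supports the flag) you can pass the share URL directly:

import gdown

gdown.download(
    'https://drive.google.com/file/d/0B9P1L--7Wd2vNm9zMTJWOGxobkU/view?usp=sharing',
    '20150428_collected_images.tgz',
    quiet=False,
    fuzzy=True,  # parse the file id out of the full share link
)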
Having had similar needs many times, I made an extra simple class GoogleDriveDownloader, starting from the snippet from @user115202 above. You can find the source code here.
You can also install it through pip:
pip install googledrivedownloader
Then usage is as simple as:
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1iytA1n2z4go3uVCwE__vIKouTKyIDjEq',
                                    dest_path='./data/mnist.zip',
                                    unzip=True)
This snippet will download an archive shared on Google Drive. In this case, 1iytA1n2z4go3uVCwE__vIKouTKyIDjEq is the id from the shareable link obtained from Google Drive.
Here's an easy way to do it with no third-party libraries and a service account.
pip install google-api-core google-api-python-client
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from google.oauth2 import service_account
import io
credz = {}  # put json credentials here from a service account or the like
# More info: https://cloud.google.com/docs/authentication
credentials = service_account.Credentials.from_service_account_info(credz)
drive_service = build('drive', 'v3', credentials=credentials)
file_id = '0BwwA4oUTeiV1UVNwOHItT0xfa2M'
request = drive_service.files().get_media(fileId=file_id)
#fh = io.BytesIO() # this can be used to keep in memory
fh = io.FileIO('file.tar.gz', 'wb') # this can be used to write to disk
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
PyDrive allows you to download a file with the function GetContentFile(). You can find the function's documentation here.
See example below:
# Initialize GoogleDriveFile instance with file id.
file_obj = drive.CreateFile({'id': '<your file ID here>'})
file_obj.GetContentFile('cats.png') # Download file as 'cats.png'.
This code assumes that you have an authenticated drive object; the docs on this can be found here and here.
In the general case this is done like so:
from pydrive.auth import GoogleAuth
gauth = GoogleAuth()
# Create local webserver which automatically handles authentication.
gauth.LocalWebserverAuth()
# Create GoogleDrive instance with authenticated GoogleAuth instance.
drive = GoogleDrive(gauth)
Info on silent authentication on a server can be found here and involves writing a settings.yaml (example: here) in which you save the authentication details.
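For instance, a minimal settings.yaml could look like this (a sketch based on the PyDrive docs; the file names are assumptions):

client_config_backend: file
client_config_file: client_secrets.json
save_credentials: True
save_credentials_backend: file
save_credentials_file: credentials.json
get_refresh_token: True
oauth_scope:
  - https://www.googleapis.com/auth/drive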
In the docs there's a function that downloads a file when we provide the ID of the file to download:
from __future__ import print_function
import io
import google.auth
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from googleapiclient.http import MediaIoBaseDownload
def download_file(real_file_id):
    """Downloads a file
    Args:
        real_file_id: ID of the file to download
    Returns : IO object with location.

    Load pre-authorized user credentials from the environment.
    TODO(developer) - See https://developers.google.com/identity
    for guides on implementing OAuth2 for the application.
    """
    creds, _ = google.auth.default()

    try:
        # create drive api client
        service = build('drive', 'v3', credentials=creds)

        file_id = real_file_id

        # pylint: disable=maybe-no-member
        request = service.files().get_media(fileId=file_id)
        file = io.BytesIO()
        downloader = MediaIoBaseDownload(file, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print(F'Download {int(status.progress() * 100)}.')

    except HttpError as error:
        print(F'An error occurred: {error}')
        file = None

    return file.getvalue()

if __name__ == '__main__':
    download_file(real_file_id='1KuPmvGq8yoYgbfW74OENMCB5H0n_2Jm9')
This raises the question:
How do we get the file ID to download the file?
Generally speaking, a URL from a shared file from Google Drive looks like this
https://drive.google.com/file/d/1HV6vf8pB-EYnjcJcH65eGZVMa2v2tcMh/view?usp=sharing
where 1HV6vf8pB-EYnjcJcH65eGZVMa2v2tcMh corresponds to fileID.
You can simply copy it from the URL or, if you prefer, it's also possible to create a function to get the fileID from the URL.
For instance, given the following url = https://drive.google.com/file/d/1HV6vf8pB-EYnjcJcH65eGZVMa2v2tcMh/view?usp=sharing,
def url_to_id(url):
    x = url.split("/")
    return x[5]
Printing x will give
['https:', '', 'drive.google.com', 'file', 'd', '1HV6vf8pB-EYnjcJcH65eGZVMa2v2tcMh', 'view?usp=sharing']
And so, since we want to return the sixth element of the array, we use x[5].
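Alternatively, a regex can handle both the /file/d/<id> and ?id=<id> link forms (a quick sketch, not guaranteed to cover every format):

import re

def url_to_id(url):
    # Matches ids in .../file/d/<id>/... and ...?id=<id> style links.
    m = re.search(r"(?:/file/d/|[?&]id=)([a-zA-Z0-9_-]+)", url)
    return m[1] if m else None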
This has also been described above,
from pydrive.auth import GoogleAuth
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
This creates its own local server to do the dirty work of authenticating:
file_obj = drive.CreateFile({'id': '<Put the file ID here>'})
file_obj.GetContentFile('Demo.txt')
This downloads the file.
import requests
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id, 'confirm': 1}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

if __name__ == "__main__":
    file_id = 'TAKE ID FROM SHAREABLE LINK'
    destination = 'DESTINATION FILE ON YOUR DISK'
    download_file_from_google_drive(file_id, destination)
Just repeating the accepted answer, but adding the confirm=1 parameter so it always downloads, even if the file is too big.
# Importing PyDrive OAuth
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
import zipfile

def download_tracking_file_by_id(file_id, download_dir):
    gauth = GoogleAuth(settings_file='../settings.yaml')
    # Try to load saved client credentials
    gauth.LoadCredentialsFile("../credentials.json")
    if gauth.credentials is None:
        # Authenticate if they're not there
        gauth.LocalWebserverAuth()
    elif gauth.access_token_expired:
        # Refresh them if expired
        gauth.Refresh()
    else:
        # Initialize the saved creds
        gauth.Authorize()
    # Save the current credentials to a file
    gauth.SaveCredentialsFile("../credentials.json")

    drive = GoogleDrive(gauth)
    logger.debug("Trying to download file_id " + str(file_id))
    file6 = drive.CreateFile({'id': file_id})
    file6.GetContentFile(download_dir + 'test.zip')
    zipfile.ZipFile(download_dir + 'test.zip').extractall(UNZIP_DIR)
    tracking_data_location = download_dir + 'test.json'
    return tracking_data_location
The above function downloads the file with the given file_id to a specified download folder. Now the question remains: how do you get the file_id? Simply split the URL on id= to get the file_id.
file_id = url.split("id=")[1]
I tried using google Colaboratory: https://colab.research.google.com/
Suppose your sharable link is https://docs.google.com/spreadsheets/d/12hiI0NK7M0KEfscMfyBaLT9gxcZMleeu/edit?usp=sharing&ouid=102608702203033509854&rtpof=true&sd=true
all you need is id that is 12hiI0NK7M0KEfscMfyBaLT9gxcZMleeu
command in cell
!gdown 12hiI0NK7M0KEfscMfyBaLT9gxcZMleeu
Run the cell and you will see that the file is downloaded to /content/Amazon_Reviews.xlsx.
Note: one should know how to use Google Colab.
This example is based on one similar to RayB's, but it keeps the file in memory and is a little simpler; you can paste it into Colab and it works.
import googleapiclient.discovery
import oauth2client.client
from google.colab import auth
auth.authenticate_user()
def download_gdrive(id):
    creds = oauth2client.client.GoogleCredentials.get_application_default()
    service = googleapiclient.discovery.build('drive', 'v3', credentials=creds)
    return service.files().get_media(fileId=id).execute()
a = download_gdrive("1F-yaQB8fdsfsdafm2l8WFjhEiYSHZrCcr")
You can install https://pypi.org/project/googleDriveFileDownloader/
pip install googleDriveFileDownloader
And download the file; here is the sample code to download:
from googleDriveFileDownloader import googleDriveFileDownloader
a = googleDriveFileDownloader()
a.downloadFile("https://drive.google.com/uc?id=1O4x8rwGJAh8gRo8sjm0kuKFf6vCEm93G&export=download")