I've been testing resumable upload of a 500 MB file to Google Cloud Storage using Python, but it doesn't seem to be working.
As per the official documentation (https://cloud.google.com/storage/docs/resumable-uploads#python): "Resumable uploads occur when the object is larger than 8 MiB, and multipart uploads occur when the object is smaller than 8 MiB. This threshold cannot be changed. The Python client library uses a buffer size that's equal to the chunk size. 100 MiB is the default buffer size used for a resumable upload, and you can change the buffer size by setting the blob.chunk_size property."
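For what it's worth, my understanding is that the chunk/buffer size mentioned above would be set like this before uploading (just a sketch; the bucket and object names are placeholders, and chunk_size must be a multiple of 256 KiB):

from google.cloud import storage

# Sketch: shrink the resumable-upload buffer from the 100 MiB default.
blob = storage.Client().bucket("my-bucket").blob("my-object")  # hypothetical names
blob.chunk_size = 5 * 1024 * 1024  # 5 MiB, a multiple of 256 KiB
blob.upload_from_filename("/path/to/file")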
This is the Python code I've written to test resumable upload:
from google.cloud import storage

def upload_to_bucket(blob_name, path_to_file, bucket_name):
    """Upload a file to the bucket."""
    storage_client = storage.Client.from_service_account_json(RAW_DATA_BUCKET_PERMISSIONS_FILEPATH)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(path_to_file)
The upload took about 84 s using this function. I then deleted the object and re-ran the function, but cut off my internet connection after about 40 s. After re-establishing the connection, I re-ran the upload function, expecting the upload time to be much shorter; instead it took about 84 s again.
Is this how resumable upload is supposed to work?
We have field units in remote locations with spotty cellular connections running Raspberry Pis, and we sometimes have issues getting data out. The data is about 0.2-1 MB in size. Having a resumable solution that works with small file sizes and doesn't have to re-upload the whole file after an initial failure would be great.
Perhaps there is a better way? Thanks for any help, Rich :)
I believe the documentation is trying to say that the client will, within that one function call, resume an upload in the event of a transient network failure. It does not mean that if you re-run the program and attempt to upload the same file to the same blob name a second time, the client library will detect your previous attempt and resume the operation.
In order to resume an operation, you'll need a session ID for an upload session. You can create one by calling blob.create_resumable_upload_session(). That'll get you a URL to which you can upload data, or which you can query for recorded progress on the server. You'll need to save it somewhere your program will notice it on the next run.
You can either use an HTTP utility to do a PUT directly to the URL, or you could use the ResumableUpload class of the google-resumable-media package to manage the upload to that URL for you.
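A minimal sketch of the first step, creating a session and persisting its URL for a later run (the state-file name here is made up):

from google.cloud import storage

def start_resumable_session(bucket_name, blob_name):
    """Create a resumable upload session and save its URL (sketch)."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    # The returned URL identifies the upload session on the server.
    session_url = blob.create_resumable_upload_session(content_type="application/octet-stream")
    with open("resumable_session.txt", "w") as f:  # hypothetical state file
        f.write(session_url)
    return session_url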
There is little info out there demonstrating how this is done. This is how I ended up getting it to work. I'm sure there is a better way, so let me know.
import os

from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import ResumableUpload

# CHUNK_SIZE must be a multiple of 256 KiB; CLIENT is the storage.Client created elsewhere.

def upload_to_bucket(blob_name, path_to_file, bucket_name):
    """Upload a file to the bucket."""
    upload_url = f"https://www.googleapis.com/upload/storage/v1/b/{bucket_name}/o?uploadType=resumable&name={blob_name}"
    file_total_bytes = os.path.getsize(path_to_file)
    print('total bytes of file ' + str(file_total_bytes))
    # initiate a resumable upload session
    upload = ResumableUpload(upload_url, CHUNK_SIZE)
    # provide authentication
    transport = AuthorizedSession(credentials=CLIENT._credentials)
    metadata = {'name': blob_name}
    with open(path_to_file, "rb") as file_to_transfer:
        response = upload.initiate(transport, file_to_transfer, metadata, 'application/octet-stream', total_bytes=file_total_bytes)
        print('Resumable Upload URL ' + response.headers['Location'])
        # save resumable url to json file in case there is an issue
        add_resumable_url_to_file(path_to_file, upload.resumable_url)
        while True:
            try:
                response = upload.transmit_next_chunk(transport)
                if response.status_code == 200:
                    # upload complete
                    break
                if response.status_code != 308:
                    # anything other than "308 Resume Incomplete" is unexpected
                    raise Exception('Failed to upload chunk')
                print(upload.bytes_uploaded)
            except Exception as ex:
                # leave the saved resumable URL in place and try again next time
                print(ex)
                return
    print('cloud upload complete')
    remove_resumable_url_from_file(path_to_file)
import requests

def resume_upload_to_bucket(resumable_upload_url, path_to_file):
    # check resumable upload status
    # (the GCS docs describe the status check as an empty PUT with a
    #  'Content-Range: bytes */TOTAL' header; a plain POST worked for me)
    response = requests.post(resumable_upload_url, timeout=60)
    if response.status_code in (200, 201):
        print('Resumable upload completed successfully')
        remove_resumable_url_from_file(path_to_file)
        return
    # get the number of bytes previously uploaded, e.g. Range: bytes=0-524287
    range_header = response.headers.get('Range')
    previous_amount_bytes_uploaded = int(range_header.split('-')[-1]) + 1 if range_header else 0
    file_total_bytes = os.path.getsize(path_to_file)
    with open(path_to_file, "rb") as file_to_transfer:
        # upload the remaining data chunk by chunk
        for i in range(previous_amount_bytes_uploaded, file_total_bytes, CHUNK_SIZE):
            file_byte_location = file_to_transfer.seek(i)
            print(file_byte_location)
            chunk = file_to_transfer.read(CHUNK_SIZE)
            headers = {'Content-Range': f'bytes {i}-{i + len(chunk) - 1}/{file_total_bytes}'}
            response = requests.put(resumable_upload_url, data=chunk, headers=headers, timeout=60)
            if response.status_code in (200, 201):
                # upload complete
                break
            if response.status_code != 308:
                # keep the saved resumable URL and try again next time
                raise Exception('Failed to upload chunk')
    print('resumable upload completed')
    remove_resumable_url_from_file(path_to_file)
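For context, a hypothetical driver that ties the two functions together might look like this (get_resumable_url_from_file is assumed to read back whatever add_resumable_url_to_file saved; the name is made up here):

def upload_or_resume(blob_name, path_to_file, bucket_name):
    # Hypothetical helper: returns the saved session URL for this file, or None.
    saved_url = get_resumable_url_from_file(path_to_file)
    if saved_url:
        resume_upload_to_bucket(saved_url, path_to_file)
    else:
        upload_to_bucket(blob_name, path_to_file, bucket_name)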
Apologies for any incorrect formatting; it's been a long time since I posted anything on Stack Overflow.
I'm looking to send a JSON payload of data to Azure IoT Hub, which I am then going to process with an Azure Function App to display real-time telemetry data in Azure Digital Twins.
I'm able to post the payload to IoT Hub and view it using the explorer fine; however, my function is unable to take this and display the telemetry data in Azure Digital Twins. From Googling I've found that the JSON payload needs to be UTF-8 encoded and have its content type set to application/json, which I think might be the problem with my current attempt at fixing this.
I've included a snippet of the log stream from my Azure Function App below. As shown, the "body" part of the message is scrambled, which is why I think it may be an issue with how the payload is encoded:
"iothub-message-source":"Telemetry"},"body":"eyJwb3dlciI6ICIxLjciLCAid2luZF9zcGVlZCI6ICIxLjciLCAid2luZF9kaXJlY3Rpb24iOiAiMS43In0="}
2023-01-27T13:39:05Z [Error] Error in ingest function: Cannot access child value on Newtonsoft.Json.Linq.JValue.
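For reference, that "body" value can be base64-decoded to check what was actually delivered; a quick sanity check:

import base64, json

body = "eyJwb3dlciI6ICIxLjciLCAid2luZF9zcGVlZCI6ICIxLjciLCAid2luZF9kaXJlY3Rpb24iOiAiMS43In0="
# Prints: {'power': '1.7', 'wind_speed': '1.7', 'wind_direction': '1.7'}
print(json.loads(base64.b64decode(body)))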
My current test code is below for sending payloads to IoT Hub, with the potential issue being that I'm not encoding the payload properly.
import datetime, requests
import json

deviceID = "JanTestDT"
IoTHubName = "IoTJanTest"
iotHubAPIVer = "2018-04-01"
iotHubRestURI = "https://" + IoTHubName + ".azure-devices.net/devices/" + deviceID + "/messages/events?api-version=" + iotHubAPIVer
SASToken = 'SharedAccessSignature'

Headers = {}
Headers['Authorization'] = SASToken
Headers['Content-Type'] = "application/json"
Headers['charset'] = "utf-8"

datetime = datetime.datetime.now()

payload = {
    'power': "1.7",
    'wind_speed': "1.7",
    'wind_direction': "1.7"
}

payload2 = json.dumps(payload, ensure_ascii=False).encode("utf8")
resp = requests.post(iotHubRestURI, data=payload2, headers=Headers)
I've attempted to encode the payload correctly in several different ways, including passing utf-8 within requests.post; however, this either produces an error that a dict cannot be encoded, or the body still shows up encoded in the Function App log stream, where it cannot be deciphered.
Thanks for any help and/or guidance that can be provided on this - happy to elaborate further on anything that is not clear.
Is there any particular reason why you want to use the Azure IoT Hub REST API endpoint instead of the Python SDK? Also, even though you see the values in JSON format when viewed through Azure IoT Explorer, the message format when viewed through a storage endpoint such as Blob Storage reveals a different format, as you pointed out.
I haven't tested the Python code against the REST API, but I have a Python SDK sample that worked for me. Please refer to the code sample below.
import os
import random
import time
from datetime import date, datetime
from json import dumps

from azure.iot.device import IoTHubDeviceClient, Message


def json_serial(obj):
    """JSON serializer for objects not serializable by default json code"""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError("Type %s not serializable" % type(obj))


CONNECTION_STRING = "<AzureIoTHubDevicePrimaryConnectionString>"
TEMPERATURE = 45.0
HUMIDITY = 60
MSG_TXT = '{{"temperature": {temperature},"humidity": {humidity}, "timesent": {timesent}}}'


def run_telemetry_sample(client):
    print("IoT Hub device sending periodic messages")
    client.connect()
    while True:
        temperature = TEMPERATURE + (random.random() * 15)
        humidity = HUMIDITY + (random.random() * 20)
        x = datetime.now().isoformat()
        timesent = dumps(datetime.now(), default=json_serial)
        msg_txt_formatted = MSG_TXT.format(
            temperature=temperature, humidity=humidity, timesent=timesent)
        message = Message(msg_txt_formatted, content_encoding="utf-8", content_type="application/json")
        print("Sending message: {}".format(message))
        client.send_message(message)
        print("Message successfully sent")
        time.sleep(10)


def main():
    print("IoT Hub Quickstart #1 - Simulated device")
    print("Press Ctrl-C to exit")
    client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
    try:
        run_telemetry_sample(client)
    except KeyboardInterrupt:
        print("IoTHubClient sample stopped by user")
    finally:
        print("Shutting down IoTHubClient")
        client.shutdown()


if __name__ == '__main__':
    main()
You can edit the MSG_TXT variable in the code to match your payload format and pass the values. Note that the SDK uses the Message class from the Azure IoT Device library, which has overloads for content type and content encoding. Here is how I have passed the overloads in the code: message = Message(msg_txt_formatted, content_encoding="utf-8", content_type="application/json")
I have validated the message by routing it to a Blob Storage container and could see the telemetry data in JSON format at that endpoint.
Hope this helps!
Say I have the below function. (Sorry if I included too much.)
import random
import time

import backoff
import requests
from ratelimit import limits, sleep_and_retry
from tqdm import tqdm

import config  # local module providing Paths.pdf_path

# pdf_headers is defined elsewhere, e.g. pdf_headers = {"Connection": "keep-alive"}


@backoff.on_exception(
    backoff.expo,
    Exception,
    max_tries=10,
    factor=3600,
)
@sleep_and_retry
@limits(calls=10, period=60)
def download_file(url: str, filename: str, speedlimit: bool) -> requests.Response:
    """Download PDF files."""
    response = requests.get(url, headers=pdf_headers, stream=True, timeout=None)
    response.raise_for_status()
    if response.status_code != 200:
        response.close()
        raise Exception
    else:
        fin = f"{config.Paths.pdf_path}{filename}.pdf"
        with tqdm.wrapattr(
            open(fin, "wb"),
            "write",
            total=int(response.headers.get("content-length", 0)),
        ) as fout:
            try:
                for chunk in response.iter_content(chunk_size=8):
                    fout.write(chunk)
                    if speedlimit:
                        time.sleep(0.000285)  # Limits bandwidth to ~20 kB/s
                time.sleep(random.uniform(6, 7.2))
            except Exception:
                fout.flush()
                response.close()
    return response
Does separating the response-code check and using stream=True, instead of doing it all in one shot, cause two requests instead of a single one?
I would think not, given the connection should already remain active (pdf_headers = {"Connection": "keep-alive"}) from the requests.get(). But I am troubleshooting rate limiting and seeing what I feel is a really low rate limit (less than 10 calls a minute), possibly even less than 500 calls an hour. Unfortunately, it's a government (state) website (IIS/ASP) that doesn't send Retry-After headers with its 429 responses, and there's no rate limit listed in their robots.txt. (TY for making my job harder, government IT people.)
I am not using threading/multi-processing. So I know it's not an issue with multiple calls to a single server.
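One way to check how many requests actually go over the wire (just a quick sketch; the URL is a placeholder) is to turn on urllib3's debug logging, which prints one line per HTTP request sent:

import logging

import requests

# urllib3 logs a 'GET /path HTTP/1.1' line for every request it actually sends,
# so a single log entry per download would confirm stream=True does not add a request.
logging.basicConfig(level=logging.DEBUG)

resp = requests.get("https://example.com/sample.pdf", stream=True, timeout=30)
resp.raise_for_status()
for chunk in resp.iter_content(chunk_size=8192):
    pass  # the body is fetched over the same request/connection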
I'm trying to get the direct download link for a file in Google Drive using the Google Drive API (v3), but I'm also trying to do this without making the file publicly shared.
Here is what I've tried:
https://www.googleapis.com/drive/v3/files/**FILE_ID**?alt=media&supportsAllDrives=True&includeItemsFromAllDrives=True&key=**API_KEY**
Now this works if the file is shared publicly. But when the file isn't shared publicly you get this message:
{'error': {'errors': [{'domain': 'global', 'reason': 'notFound', 'message': 'File not found: 10k0Qogwcz7k0u86m7W2HK-LO7bk8xAF8.', 'locationType': 'parameter', 'location': 'fileId'}], 'code': 404, 'message': 'File not found: 10kfdsfjDHJ38-UHJ34D82.'}}
After doing some Googling I found a post on Stack Overflow saying that I need to add a request header with my access token, but this doesn't work and the application just hangs.
Here is the full code:
### SETTING UP GOOGLE API
import json
import requests
from googleapiclient.discovery import build
from oauth2client import client, file, tools

scopes = 'https://www.googleapis.com/auth/drive'
store = file.Storage('storage.json')
credentials = store.get()
if not credentials or credentials.invalid:
    flow = client.flow_from_clientsecrets('client_secret.json', scopes)
    credentials = tools.run_flow(flow, store)
accessToken = credentials.access_token
refreshToken = credentials.refresh_token

drive = build('drive', 'v3', credentials=credentials)

### SENDING REQUEST
req_url = "https://www.googleapis.com/drive/v3/files/" + file_id + "?alt=media&supportsAllDrives=True&includeItemsFromAllDrives=True&key=" + GOOGLE_API_KEY
headers = {'Authorization': 'Bearer %s' % accessToken}
request_content = json.loads((requests.get(req_url)).content)
print(request_content)
------------------ EDIT: ------------------
I've gotten really close to an answer, but I can't seem to figure out why this doesn't work.
So I've figured out previously that alt=media generates a download link for the file, but when the file is private this doesn't work.
I just discovered that you can add &access_token=.... to access private files, so I came up with this API call:
https://www.googleapis.com/drive/v3/files/**FILE_ID**?supportsAllDrives=true&alt=media&access_token=**ACCESS_TOKEN**&key=**API_KEY**
When I go to that url on my browser I get this message:
We're sorry...
... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.
I find this confusing because if I remove alt=media, I am able to make that request and I get some metadata about the file.
I believe your goal is as follows.
From "I'm trying to get the direct download link for a file in Google Drive using the Google Drive API (v3)", I understand that you want to retrieve webContentLink.
The file you want to retrieve the webContentLink for is not a Google Docs file.
You have already been able to get the file metadata using the Drive API, so your access token can be used for this.
Modification points:
When the file is not shared, the API key cannot be used. Because of this, https://www.googleapis.com/drive/v3/files/**FILE_ID**?alt=media&supportsAllDrives=True&includeItemsFromAllDrives=True&key=**API_KEY** returns "File not found". I think this is the reason for your issue.
Looking at the script in your question, it seems that you want to download the file content.
In your script, headers is not used, so the access token is never actually sent.
The "Files: get" method has no includeItemsFromAllDrives parameter.
In your script, I think an error occurs at credentials.access_token. If so, please try modifying it to accessToken = credentials.token.
In Drive API v3, the default response does not include webContentLink, so the fields value needs to be set, like fields=webContentLink.
When your script is modified, it becomes as follows.
Modified script:
file_id = '###' # Please set the file ID.
req_url = "https://www.googleapis.com/drive/v3/files/" + file_id + "?supportsAllDrives=true&fields=webContentLink"
headers = {'Authorization': 'Bearer %s' % accessToken}
res = requests.get(req_url, headers=headers)
obj = res.json()
print(obj.get('webContentLink'))
Or, since you already have drive = build('drive', 'v3', credentials=credentials) in your script, you can also use the following:
file_id = '###' # Please set the file ID.
drive = build('drive', 'v3', credentials=credentials)
request = drive.files().get(fileId=file_id, supportsAllDrives=True, fields='webContentLink').execute()
print(request.get('webContentLink'))
Note:
In this modified script:
When the file is in a shared drive and you don't have permission to retrieve the file metadata, an error occurs.
When your access token cannot be used for retrieving the file metadata, an error occurs.
So please be careful about the above points.
When * is used for fields, all file metadata can be retrieved.
Reference:
Files: get
Added:
You want to download the binary data of a file from Google Drive by URL.
The file size is large, around 2-10 gigabytes.
In this case, unfortunately, webContentLink cannot be used, because for such a large file the webContentLink redirects. So the method of publicly sharing the file and using the API key would be suitable for achieving your goal, but you cannot publicly share the file.
Given this situation, as a workaround, I would like to propose the method "One Time Download for Google Drive". With Google Drive, once a publicly shared file has started downloading, the download keeps running even if the file's permissions are removed during the download. This method takes advantage of that.
Flow
In this sample script, the API key is used.
Request the Web App with the API key and the file ID you want to download.
At the Web App, the following functions are run:
The permissions of the file with the received file ID are changed, and the file starts to be publicly shared.
A time-driven trigger is installed; in this case, the trigger runs after 1 minute.
When the function is run by the time-driven trigger, the file's permissions are changed back and sharing is stopped. By this, the file is shared for only one minute.
The Web App returns the endpoint for downloading the file with that file ID.
After you get the endpoint, please download the file using it within 1 minute, because the file is shared for only one minute.
Usage:
1. Create a standalone script
In this workaround, Google Apps Script is used as the server side. Please create a standalone script.
If you want to create it directly, please access https://script.new/. If you are not logged in to Google, the login screen opens, so please log in to Google. The script editor of Google Apps Script then opens.
2. Set the sample script of the server side
Please copy and paste the following script into the script editor, and set your API key in the variable key in the function doGet(e).
In this Web App, the script runs only when the supplied API key matches.
function deletePermission() {
  const forTrigger = "deletePermission";
  const id = CacheService.getScriptCache().get("id");
  const triggers = ScriptApp.getProjectTriggers();
  triggers.forEach(function(e) {
    if (e.getHandlerFunction() == forTrigger) ScriptApp.deleteTrigger(e);
  });
  const file = DriveApp.getFileById(id);
  file.setSharing(DriveApp.Access.PRIVATE, DriveApp.Permission.NONE);
}

function checkTrigger(forTrigger) {
  const triggers = ScriptApp.getProjectTriggers();
  for (var i = 0; i < triggers.length; i++) {
    if (triggers[i].getHandlerFunction() == forTrigger) {
      return false;
    }
  }
  return true;
}

function doGet(e) {
  const key = "###"; // <--- API key. This is also used for checking the user.
  const forTrigger = "deletePermission";
  var res = "";
  if (checkTrigger(forTrigger)) {
    if ("id" in e.parameter && e.parameter.key == key) {
      const id = e.parameter.id;
      CacheService.getScriptCache().put("id", id, 180);
      const file = DriveApp.getFileById(id);
      file.setSharing(DriveApp.Access.ANYONE_WITH_LINK, DriveApp.Permission.VIEW);
      var d = new Date();
      d.setMinutes(d.getMinutes() + 1);
      ScriptApp.newTrigger(forTrigger).timeBased().at(d).create();
      res = "https://www.googleapis.com/drive/v3/files/" + id + "?alt=media&key=" + e.parameter.key;
    } else {
      res = "unavailable";
    }
  } else {
    res = "unavailable";
  }
  return ContentService.createTextOutput(res);
}
3. Deploy Web Apps
In the script editor, open the dialog via "Publish" -> "Deploy as web app".
Select "Me" for "Execute the app as:".
Select "Anyone, even anonymous" for "Who has access to the app:". This is a test case.
If "Only myself" is used, only you can access the Web App; in that case, please use your access token.
Click the "Deploy" button as a new "Project version".
A dialog box "Authorization required" opens automatically.
Click "Review Permissions".
Select own account.
Click "Advanced" at "This app isn't verified".
Click "Go to ### project name ###(unsafe)"
Click "Allow" button.
Click "OK"
4. Test run: Client side
This is a sample Python script. Before you test it, please confirm that the above script has been deployed as a Web App, and set the Web App URL, the file ID, and your API key.
import requests
url1 = "https://script.google.com/macros/s/###/exec"
url1 += "?id=###fileId###&key=###your API key###"
res1 = requests.get(url1)
url2 = res1.text
res2 = requests.get(url2)
with open("###sampleFilename###", "wb") as f:
f.write(res2.content)
In this sample script, a request is first sent to the Web App with the file ID and API key, and the file is shared publicly for 1 minute. The file can then be downloaded. After 1 minute, the file is no longer publicly shared, but an in-progress download can continue.
Note:
When you modify the script of the Web App, please redeploy the Web App as a new version. By this, the latest script is reflected in the Web App. Please be careful about this.
References:
One Time Download for Google Drive
Web Apps
Taking advantage of Web Apps with Google Apps Script
import json

import requests

access_token = ''
session = requests.Session()  # assuming a requests Session, since `session` is not defined in the snippet

r = session.request('get', 'https://www.googleapis.com/drive/v3/files?access_token=%s' % access_token)
response_text = str(r.content, encoding='utf-8')
files_list = json.loads(response_text).get('files')
files_id_list = []
for item in files_list:
    files_id_list.append(item.get('id'))

for item in files_id_list:
    file_r = session.request('get', 'https://www.googleapis.com/drive/v3/files/%s?alt=media&access_token=%s' % (item, access_token))
    print(file_r.content)
I use the above code and Google shows:
We're sorry ...
... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.
I don't know whether this method simply can't download the files, or where the problem is.
The reason you are getting this error is that you are requesting the data in a loop, which causes a large number of requests to Google's server in quick succession, and hence the error:
We're sorry ... ... but your computer or network may be sending automated queries
Also, the access_token should not be passed as a query parameter; put it in the Authorization header instead. You can try this out on the OAuth Playground (oauthplayground).
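For example, a sketch of the same per-file request with the token moved into the header and a pause between downloads (the one-second delay is arbitrary):

import time

headers = {'Authorization': 'Bearer %s' % access_token}
for item in files_id_list:
    file_r = session.get('https://www.googleapis.com/drive/v3/files/%s?alt=media' % item, headers=headers)
    print(file_r.status_code)
    time.sleep(1)  # pace the downloads so they don't look like automated abuse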
I am using the AWS boto Python library.
I am hitting my web app with 10,000 requests/sec from JMeter, and the app writes the data to a Kinesis stream.
I have used 16 shards for the stream. When I stopped JMeter I saw that some of the records were not written to the stream. I also checked my logs but did not find any errors.
This is my sample code:
from boto import kinesis

try:
    p_key = '{0}{1}'.format('partition_key', (1, 10000))
    # Connect to AWS Kinesis region
    kinesis_obj = kinesis.connect_to_region(region_name)
    # Put data on AWS Kinesis stream
    count += 1
    app.logger.debug(count)
    response = kinesis_obj.put_record(stream_name, record, p_key)
    count += 1
    app.logger.debug(count)
    app.logger.debug(response)
except kinesis.exceptions.ResourceNotFoundException as re_ex:
    write_log(record, re_ex)
except kinesis.exceptions.ResourceInUseException as inuse_ex:
    write_log(record, inuse_ex)
except Exception as ex:
    write_log(record, ex)
When I print the count just before this line, I get 25,000:
response = kinesis_obj.put_record(stream_name, record, p_key)
and when I print it just after, I get 24,900.
That's 100 missing records. For those records there is no response in response and no exception is raised.
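A sketch of one way to narrow this down (assuming boto's kinesis module; ProvisionedThroughputExceededException is the per-shard throttling error boto exposes) is to log the outcome of every put explicitly:

from boto.kinesis.exceptions import ProvisionedThroughputExceededException

def put_with_logging(kinesis_obj, stream_name, record, p_key, logger):
    """Log the result of every put so dropped records become visible (sketch)."""
    try:
        response = kinesis_obj.put_record(stream_name, record, p_key)
        # A successful put returns a SequenceNumber and ShardId.
        logger.debug("stored seq=%s shard=%s", response.get("SequenceNumber"), response.get("ShardId"))
        return response
    except ProvisionedThroughputExceededException as ex:
        # Each shard accepts at most 1 MB/s or 1,000 records/s; exceeding that
        # raises this error and the record is NOT stored unless the call is retried.
        logger.error("throttled: %s", ex)
        raise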