FileUploadMiscError while persisting output file from Azure Batch - python

I'm facing the following error while trying to persist log files to Azure Blob storage from Azure Batch execution - "FileUploadMiscError - A miscellaneous error was encountered while uploading one of the output files". This error doesn't give a lot of information as to what might be going wrong. I tried checking the Microsoft Documentation for this error code, but it doesn't mention this particular error code.
Below is the relevant code for adding the task to Azure Batch that I have ported from C# to Python for persisting the log files.
Note: The container that I have configured gets created when the task is added, but there's no blob inside.
import datetime
import logging
import os
import azure.storage.blob.models as blob_model
import yaml
from azure.batch import models
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.common.cloudstorageaccount import CloudStorageAccount
from dotenv import load_dotenv
LOG = logging.getLogger(__name__)
def add_tasks(batch_client, job_id, task_id, io_details, blob_details):
task_commands = "This is a placeholder. Actual code has an actual task. This gets completed successfully."
LOG.info("Configuring the blob storage details")
base_blob_service = BaseBlobService(
account_name=blob_details['account_name'],
account_key=blob_details['account_key'])
LOG.info("Base blob service created")
base_blob_service.create_container(
container_name=blob_details['container_name'], fail_on_exist=False)
LOG.info("Container present")
container_sas = base_blob_service.generate_container_shared_access_signature(
container_name=blob_details['container_name'],
permission=blob_model.ContainerPermissions(write=True),
expiry=datetime.datetime.now() + datetime.timedelta(days=1))
LOG.info(f"Container SAS created: {container_sas}")
container_url = base_blob_service.make_container_url(
container_name=blob_details['container_name'], sas_token=container_sas)
LOG.info(f"Container URL created: {container_url}")
# fpath = task_id + '/output.txt'
fpath = task_id
LOG.info(f"Creating output file object:")
out_files_list = list()
out_files = models.OutputFile(
file_pattern=r"../stderr.txt",
destination=models.OutputFileDestination(
container=models.OutputFileBlobContainerDestination(
container_url=container_url, path=fpath)),
upload_options=models.OutputFileUploadOptions(
upload_condition=models.OutputFileUploadCondition.task_completion))
out_files_list.append(out_files)
LOG.info(f"Output files: {out_files_list}")
LOG.info(f"Creating the task now: {task_id}")
task = models.TaskAddParameter(
id=task_id, command_line=task_commands, output_files=out_files_list)
batch_client.task.add(job_id=job_id, task=task)
LOG.info(f"Added task: {task_id}")

There is a bug in Batch's OutputFile handling which causes it to fail to upload to containers if the full container URL includes any query-string parameters other than the ones included in the SAS token. Unfortunately, the azure-storage-blob Python module includes an extra query string parameter when generating the URL via make_container_url.
This issue was just raised to us, and a fix will be released in the coming weeks, but an easy workaround is instead of using make_container_url to craft the URL, craft it yourself like so: container_url = 'https://{}/{}?{}'.format(blob_service.primary_endpoint, blob_details['container_name'], container_sas).
The resulting URL should look something like this: https://<account>.blob.core.windows.net/<container>?se=2019-01-12T01%3A34%3A05Z&sp=w&sv=2018-03-28&sr=c&sig=<sig> - specifically it shouldn't have restype=container in it (which is what the azure-storage-blob package is including)

Related

Problem triggering nested dependencies in Azure Function

I have a problem using the videohash package for python when deployed to Azure function.
My deployed azure function does not seem to be able to use a nested dependency properly. Specifically, I am trying to use the package “videohash” and the function VideoHash from it. The
input to VideoHash is a SAS url token for a video placed on an Azure blob storage.  
In the monitor of my output it prints: 
Accessing the sas url token directly takes me to the video, so that part seems to be working.  
Looking at the source code for videohash this error seems to occur in the process of downloading the video from a given url (link:
https://github.com/akamhy/videohash/blob/main/videohash/downloader.py). 
.. where self.yt_dlp_path = str(which("yt-dlp")). This to me indicates, that after deploying the function, the package yt-dlp isn’t properly activated. This is a dependency from the videohash
module, but adding yt-dlp directly to the requirements file of the azure function also does not solve the issue. 
Any ideas on what is happening? 
 
Deploying code to Azure function, which resulted in the details highlighted in the issue description.
I have a work around where you download the video file on you own instead of the videohash using azure.storage.blob
To download you will need a BlobServiceClient , ContainerClient and connection string of azure storage account.
Please create two files called v1.mp3 and v2.mp3 before downloading the video.
file structure:
Complete Code:
import logging
from videohash import VideoHash
import azure.functions as func
import subprocess
import tempfile
import os
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
def main(req: func.HttpRequest) -> func.HttpResponse:
# local file path on the server
local_path = tempfile.gettempdir()
filepath1 = os.path.join(local_path, "v1.mp3")
filepath2 = os.path.join(local_path,"v2.mp3")
# Reference to Blob Storage
client = BlobServiceClient.from_connection_string("<Connection String >")
# Reference to Container
container = client.get_container_client(container= "test")
# Downloading the file
with open(file=filepath1, mode="wb") as download_file:
download_file.write(container.download_blob("v1.mp3").readall())
with open(file=filepath2, mode="wb") as download_file:
download_file.write(container.download_blob("v2.mp3").readall())
// video hash code .
videohash1 = VideoHash(path=filepath1)
videohash2 = VideoHash(path=filepath2)
t = videohash2.is_similar(videohash1)
return func.HttpResponse(f"Hello, {t}. This HTTP triggered function executed successfully.")
Output :
Now here I am getting the ffmpeg error which related to my test file and not related to error you are facing.
This work around as far as I know will not affect performance as in both scenario you are downloading blobs anyway

Cannot ingest data using snowpipe more than once

I am using the sample program from the Snowflake document on using Python to ingest the data to the destination table.
So basically, I have to execute put command to load data to the internal stage and then run the Python program to notify the snowpipe to ingest the data to the table.
This is how I create the internal stage and pipe:
create or replace stage exampledb.dbschema.example_stage;
create or replace pipe exampledb.dbschema.example_pipe
as copy into exampledb.dbschema.example_table
from
(
select
t.*
from
#exampledb.dbschema.example_stage t
)
file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE;
put command:
put file://E:\\example\\data\\a.csv #exampledb.dbschema.example_stage OVERWRITE = TRUE;
This is the sample program I use:
from logging import getLogger
from snowflake.ingest import SimpleIngestManager
from snowflake.ingest import StagedFile
from snowflake.ingest.utils.uris import DEFAULT_SCHEME
from datetime import timedelta
from requests import HTTPError
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.hazmat.primitives.serialization import PrivateFormat
from cryptography.hazmat.primitives.serialization import NoEncryption
import time
import datetime
import os
import logging
logging.basicConfig(
filename='/tmp/ingest.log',
level=logging.DEBUG)
logger = getLogger(__name__)
# If you generated an encrypted private key, implement this method to return
# the passphrase for decrypting your private key.
def get_private_key_passphrase():
return '<private_key_passphrase>'
with open("E:\\ssh\\rsa_key.p8", 'rb') as pem_in:
pemlines = pem_in.read()
private_key_obj = load_pem_private_key(pemlines,
get_private_key_passphrase().encode(),
default_backend())
private_key_text = private_key_obj.private_bytes(
Encoding.PEM, PrivateFormat.PKCS8, NoEncryption()).decode('utf-8')
# Assume the public key has been registered in Snowflake:
# private key in PEM format
# List of files in the stage specified in the pipe definition
file_list=['a.csv.gz']
ingest_manager = SimpleIngestManager(account='<account_identifier>',
host='<account_identifier>.snowflakecomputing.com',
user='<user_login_name>',
pipe='exampledb.dbschema.example_pipe',
private_key=private_key_text)
# List of files, but wrapped into a class
staged_file_list = []
for file_name in file_list:
staged_file_list.append(StagedFile(file_name, None))
try:
resp = ingest_manager.ingest_files(staged_file_list)
except HTTPError as e:
# HTTP error, may need to retry
logger.error(e)
exit(1)
# This means Snowflake has received file and will start loading
assert(resp['responseCode'] == 'SUCCESS')
# Needs to wait for a while to get result in history
while True:
history_resp = ingest_manager.get_history()
if len(history_resp['files']) > 0:
print('Ingest Report:\n')
print(history_resp)
break
else:
# wait for 20 seconds
time.sleep(20)
hour = timedelta(hours=1)
date = datetime.datetime.utcnow() - hour
history_range_resp = ingest_manager.get_history_range(date.isoformat() + 'Z')
print('\nHistory scan report: \n')
print(history_range_resp)
After running the program, I just need to remove the file in the internal stage:
REMOVE #exampledb.dbschema.example_stage;
The code works as expected for the first time but when I truncate the data on that table and run the code again, the table on snowflake doesn't have any data in it.
Do I miss something here? How can I make this code can run multiple times?
Update:
I found that if I use a file with a different name each time I run, the data can load to the snowflake table.
So how can I run this code without changing the data filename?
Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days. Therefore:
Files modified and staged again within 14 days:
Snowpipe ignores modified files that are staged again. To reload modified data files, it is currently necessary to recreate the pipe object using the CREATE OR REPLACE PIPE syntax.
Files modified and staged again after 14 days:
Snowpipe loads the data again, potentially resulting in duplicate records in the target table.
For more information have a look here

add delay for a task until particular file is moved from bucket

I am new to Airflow. I have to check for a file which is generated from DAG (eg: sample.txt) is moved from bucket(in my case the file I have generated will be moved away from the bucket when picked up by other system, and then there won't be this output file in the bucket. It might take few minutes for the file to be removed from bucket)
How to add a task in the same DAG where it waits/retires till the file is moved away from the bucket and when the sample.txt file is moved away then proceed with the next task.
Is there any operator which satisfies the above criteria? please throw some light on how to proceed
You can create a custom sensor based on the current GCSObjectExistenceSensor
The modification is simple:
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
class GCSObjectNotExistenceSensor(GCSObjectExistenceSensor):
def poke(self, context: dict) -> bool:
self.log.info('Sensor checks if : %s, %s does not exist', self.bucket, self.object)
hook = GCSHook(
gcp_conn_id=self.google_cloud_conn_id,
delegate_to=self.delegate_to,
impersonation_chain=self.impersonation_chain,
)
return not hook.exists(self.bucket, self.object)
Then use the sensor GCSObjectNotExistenceSensor in your code like:
gcs_object_does_not_exists = GCSObjectNotExistenceSensor(
bucket=BUCKET_1,
object=PATH_TO__FILE,
mode='poke',
task_id="gcs_object_does_not_exists_task",
)
The sensor will not let the pipeline to continue until the object PATH_TO__FILE is removed.
You can use airflow PythonOperator to achieve the task. Make the Python callable continuously poke GCS and check if the file is removed. Return from Python function when the file from GCS is removed.
from airflow.operators.python_operator import PythonOperator
from google.cloud import storage
import google.auth
def check_file_in_gcs():
credentials, project = google.auth.default()
storage_client = storage.Client('your_Project_id', credentials=credentials)
name = 'sample.txt'
bucket_name = 'Your_Bucket_name'
bucket = storage_client.bucket(bucket_name)
while True:
stats = storage.Blob(bucket=bucket, name=name).exists(storage_client)
if not stats:
print("Returning as file is removed!!!!")
return
check_gcs_file_removal = PythonOperator(
task_id='check_gcs_file_removal',
python_callable= check_file_in_gcs,
#op_kwargs={'params': xyz},
#Pass bucket name and other details if needed by commentating above
dag=dag
)
you might need to install Python packages for the google cloud libraries to work. Please install one from below. (Not sure which one to install exactly.Taken from my virtualenv)
google-api-core==1.16.0
google-api-python-client==1.8.0
google-auth==1.12.0
google-auth-httplib2==0.0.3
google-auth-oauthlib==0.4.1
google-cloud-core==1.3.0
google-cloud-storage==1.27.0

Using Python Logging module to save logs to S3, how to capture all levels of logs for all modules?

It's my first time using the logging module in Python(3.7). My code uses imported modules that also have their own log statements. When I first added log statements to my code, I didn't use getLogger(). I just used logging.basicConfig(filename) and called logger.debug() directly to log statements. When I did this, all the logs from both my script and also all the imported modules was output to the same file together.
Now I need to convert my code to save logs to s3 instead of a file. I tried the solution mentioned in How Can I Write Logs Directly to AWS S3 from Memory Without First Writing to stdout? (Python, boto3) - Stack Overflow but I have two issues with it:
None of the 'prefixes' are present in the output when I check on s3.
Only INFO statements are showing up. I was under the impression that logging.basicConfig(level=logging.INFO) would make it would output all logs at or above level INFO, but I'm only seeing INFO. Also, only INFO logs get printed to stdout, when before all levels were. I don't know why the 'prefixes' are missing.
from psaw import PushshiftAPI
api = PushshiftAPI()
import time
import logging
import boto3
import io
import atexit
def write_logs(body, bucket, key):
s3 = boto3.client("s3")
s3.put_object(Body=body.getvalue(), Bucket=bucket, Key=key)
logging.basicConfig(level=logging.INFO)
log = logging.getLogger()
log_stringio = io.StringIO()
handler = logging.StreamHandler(log_stringio)
log.addHandler(handler)
def collectRange(sub,start,end):
atexit.register(write_logs, body=log_stringio, bucket="<...>", key=f'{sub}/log.txt')
s3 = boto3.resource('s3')
object = s3.Object('<...>', f'{sub}/{sub}#{start}-{end}.csv')
now = time.time()
logging.info(f'Start Time:{now}')
logging.debug('First request')
gen = api.search_comments(after=start, before=end,<...>, subreddit=sub)
r=next(gen)
<...>
quit()
Output:
Found credentials in shared credentials file: ~/.aws/credentials
Start Time:1591310443.7060978
https://api.pushshift.io/reddit/comment/search?<...>
https://api.pushshift.io/reddit/comment/search?<...>
Desired output:
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:root:Start Time:1591310443.7060978
DEBUG:root:First request
INFO:psaw.PushshiftAPI:https://api.pushshift.io/reddit/comment/search?<...>
DEBUG:psaw.PushshiftAPI:<whatever is usually here>
DEBUG:psaw.PushshiftAPI:<whatever is usually here>
INFO:psaw.PushshiftAPI:https://api.pushshift.io/reddit/comment/search?<...>
DEBUG:psaw.PushshiftAPI:<whatever is usually here>
DEBUG:psaw.PushshiftAPI:<whatever is usually here>
Any help is appreciated. Thanks.
You can at least add the level name (or time) by following this documentation:
Changing the format of displayed messages.
And to get DEBUG as well you need to use the following instead of INFO
logging.basicConfig(..., level=logging.DEBUG)

TensorFlow Serving: Update model_config (add additional models) at runtime

I'm busy configuring a TensorFlow Serving client that asks a TensorFlow Serving server to produce predictions on a given input image, for a given model.
If the model being requested has not yet been served, it is downloaded from a remote URL to a folder where the server's models are located. (The client does this). At this point I need to update the model_config and trigger the server to reload it.
This functionality appears to exist (based on https://github.com/tensorflow/serving/pull/885 and https://github.com/tensorflow/serving/blob/master/tensorflow_serving/apis/model_service.proto#L22), but I can't find any documentation on how to actually use it.
I am essentially looking for a python script with which I can trigger the reload from client side (or otherwise to configure the server to listen for changes and trigger the reload itself).
So it took me ages of trawling through pull requests to finally find a code example for this. For the next person who has the same question as me, here is an example of how to do this. (You'll need the tensorflow_serving package for this; pip install tensorflow-serving-api).
Based on this pull request (which at the time of writing hadn't been accepted and was closed since it needed review): https://github.com/tensorflow/serving/pull/1065
from tensorflow_serving.apis import model_service_pb2_grpc
from tensorflow_serving.apis import model_management_pb2
from tensorflow_serving.config import model_server_config_pb2
import grpc
def add_model_config(host, name, base_path, model_platform):
channel = grpc.insecure_channel(host)
stub = model_service_pb2_grpc.ModelServiceStub(channel)
request = model_management_pb2.ReloadConfigRequest()
model_server_config = model_server_config_pb2.ModelServerConfig()
#Create a config to add to the list of served models
config_list = model_server_config_pb2.ModelConfigList()
one_config = config_list.config.add()
one_config.name= name
one_config.base_path=base_path
one_config.model_platform=model_platform
model_server_config.model_config_list.CopyFrom(config_list)
request.config.CopyFrom(model_server_config)
print(request.IsInitialized())
print(request.ListFields())
response = stub.HandleReloadConfigRequest(request,10)
if response.status.error_code == 0:
print("Reload sucessfully")
else:
print("Reload failed!")
print(response.status.error_code)
print(response.status.error_message)
add_model_config(host="localhost:8500",
name="my_model",
base_path="/models/my_model",
model_platform="tensorflow")
Add a model to TF Serving server and to the existing config file conf_filepath: Use arguments name, base_path, model_platform for the new model. Keeps the original models intact.
Notice a small difference from #Karl 's answer - using MergeFrom instead of CopyFrom
pip install tensorflow-serving-api
import grpc
from google.protobuf import text_format
from tensorflow_serving.apis import model_service_pb2_grpc, model_management_pb2
from tensorflow_serving.config import model_server_config_pb2
def add_model_config(conf_filepath, host, name, base_path, model_platform):
with open(conf_filepath, 'r+') as f:
config_ini = f.read()
channel = grpc.insecure_channel(host)
stub = model_service_pb2_grpc.ModelServiceStub(channel)
request = model_management_pb2.ReloadConfigRequest()
model_server_config = model_server_config_pb2.ModelServerConfig()
config_list = model_server_config_pb2.ModelConfigList()
model_server_config = text_format.Parse(text=config_ini, message=model_server_config)
# Create a config to add to the list of served models
one_config = config_list.config.add()
one_config.name = name
one_config.base_path = base_path
one_config.model_platform = model_platform
model_server_config.model_config_list.MergeFrom(config_list)
request.config.CopyFrom(model_server_config)
response = stub.HandleReloadConfigRequest(request, 10)
if response.status.error_code == 0:
with open(conf_filepath, 'w+') as f:
f.write(request.config.__str__())
print("Updated TF Serving conf file")
else:
print("Failed to update model_config_list!")
print(response.status.error_code)
print(response.status.error_message)
While the solutions mentioned here works fine, there is one more method that you can use to hot-reload your models. You can use --model_config_file_poll_wait_seconds
As mentioned here in the documentation -
By setting the --model_config_file_poll_wait_seconds flag to instruct the server to periodically check for a new config file at --model_config_file filepath.
So, you just have to update the config file at model_config_path and tf-serving will load any new models and unload any models removed from the config file.
Edit 1: I looked at the source code and it seems that the flag is present from the very early version of tf-serving but there have been instances where some users were not able to use this flag (see this). So, try to use the latest version if possible.
If you're using the method described in this answer, please note that you're actually launching multiple tensorflow model server instances instead of a single model server, effectively making the servers compete for resources instead of working together to optimize tail latency.

Categories