Load pickle file in Azure function from Azure Blob Storage - python

I am trying to build a service with Azure Functions that performs a matrix multiplication using a vector given by the HTTP request and a fixed numpy matrix. The matrix is stored in Azure Blob Storage as a pickle file and I want to load it via an input binding. However, I have not managed to load the pickle file, although I am able to load plain text files.
Right now my approach looks like this:
def main(req: func.HttpRequest, blobIn: func.InputStream) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    matrix = pickle.loads(blobIn.read())
    vector = req.params.get('vector')
    result = matrix.dot(vector)
    return func.HttpResponse(json.dumps(result))
The error I get when running it that way is UnpicklingError: invalid load key, '\xef'. Another approach I tried after some googling was the following:
def main(req: func.HttpRequest, blobIn: func.InputStream) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    blob_bytes = blobIn.read()
    blob_to_read = BytesIO(blob_bytes)
    with blob_to_read as f:
        A = pickle.load(f)
    vector = req.params.get('vector')
    result = A.dot(vector)
    return func.HttpResponse(json.dumps(result))
But it yields the same error. I also tried to save the matrix in a text file, get the string and build the matrix based on the string, but I encountered other issues.
So how can I load a pickle file in my Azure function? Is it even the correct approach to use input bindings to load such files or is there a better way? Many thanks for your help!

Thanks to evilSnobu for the contribution.
When you face this problem, it means the pickle data your code receives is corrupted.
The solution is to add "dataType": "binary" to the input binding in function.json.
Like this:
{
    "name": "inputBlob",
    "type": "blob",
    "dataType": "binary",
    "direction": "in",
    "path": "xxx/xxx.xxx",
    "connection": "AzureWebJobsStorage"
}
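With the binding declared as binary, the function body from the question should then deserialize cleanly. The following is only a minimal sketch, assuming the binding is named blobIn and that the vector arrives as a JSON array in the query string (that parsing step is an assumption, since req.params.get returns a plain string):
import json
import logging
import pickle

import numpy as np
import azure.functions as func

def main(req: func.HttpRequest, blobIn: func.InputStream) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # With "dataType": "binary" the input stream yields the raw pickle bytes.
    matrix = pickle.loads(blobIn.read())

    # Assumption: the vector is passed as a JSON array, e.g. ?vector=[1,2,3]
    vector = np.array(json.loads(req.params.get('vector')))

    result = matrix.dot(vector)
    return func.HttpResponse(json.dumps(result.tolist()))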

Related

AttributeError: 'InputStream' object has no attribute 'decode' using pandas_read_xml to flatten xml

I'm still a beginner to this, but I will try to explain my problem as coherently as I can.
In case you're not familiar with Azure Cloud programming, I have a "blob trigger" where this script runs or triggers when a file is uploaded into a container in Azure. When this script triggers it passes an InputStream object to the function:
def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")
My problem lies when passing this InputStream object to a pandas_read_xml method.
import pandas_read_xml as pdx
df = pdx.read_xml(myblob)
df = pdx.fully_flatten(df)
The goal here is to pass an xml file to a dataframe and then flatten the xml so that I can get all of the data inside of the XML. This works when the file can be found locally on my own machine, but when I go to pass the InputStream object "myblob" to the read_xml() method I get this error:
AttributeError: 'InputStream' object has no attribute 'decode'
I've also tried downloading the blob to memory and pass that to the method like so:
# Connect to storage container / download blob
container_str_url = 'REDACTED'
container_client = ContainerClient.from_container_url(container_str_url)

# Blob client accessing a specific blob
blob_client = container_client.get_blob_client(blob=blob_name)

# Download blob into memory
stream_downloader = blob_client.download_blob()
stream = BytesIO()
stream_downloader.readinto(stream)

df = pdx.read_xml(stream)
df = pdx.fully_flatten(df)
but this also doesn't work. Any idea on how I can use this library in this context? It works perfectly whenever I use it on local files, so I would love to find a way to use it here as well.
AttributeError: 'InputStream' object has no attribute 'decode'
In general, this error occurs when you try to decode an already decoded string. If your Azure Functions Python version is 3.x, there is no need to decode.
If it is still throwing the decode error, there is an issue with the data type of that object, so you should handle both encoding and decoding for the stream_downloader object you have defined in the code.
Encoding should be done in UTF-8 or whichever format is required, such as binary.
Also, function.json should declare the data type of the input stream object (the blob file) in your Azure Functions Python project.
Sample Code Snippet:
{
    "name": "inputblob",
    "dataType": "binary",
    "type": "blob",
    "direction": "in",
    "path": "blobdata/blobfile.xml",
    "connection": "blobcontainer_conn_str"
},
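Once the binding is declared as binary, one way to feed the trigger payload to pandas_read_xml is to read the bytes yourself and pass decoded XML text. This is only a sketch under the assumption that pdx.read_xml accepts an XML string rather than a file path; if it does not, writing the bytes to a temporary file and passing that path is a fallback:
import logging

import azure.functions as func
import pandas_read_xml as pdx

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob: {myblob.name}")

    # Read the raw bytes from the trigger and decode them to text.
    xml_text = myblob.read().decode("utf-8")

    # Assumption: pdx.read_xml can take raw XML text as its first argument.
    df = pdx.read_xml(xml_text)
    df = pdx.fully_flatten(df)

    logging.info(f"Flattened dataframe shape: {df.shape}")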

Problem triggering nested dependencies in Azure Function

I have a problem using the videohash package for python when deployed to Azure function.
My deployed Azure Function does not seem to be able to use a nested dependency properly. Specifically, I am trying to use the package "videohash" and the function VideoHash from it. The input to VideoHash is a SAS URL token for a video placed on Azure Blob Storage.
In the monitor of my output it prints an error. Accessing the SAS URL token directly takes me to the video, so that part seems to be working.
Looking at the source code for videohash, this error seems to occur in the process of downloading the video from a given URL (https://github.com/akamhy/videohash/blob/main/videohash/downloader.py), where self.yt_dlp_path = str(which("yt-dlp")). This indicates to me that after deploying the function, the yt-dlp package isn't properly activated. It is a dependency of the videohash module, but adding yt-dlp directly to the requirements file of the Azure Function also does not solve the issue.
Any ideas on what is happening? 
 
I tried deploying the code to an Azure Function, which resulted in the behaviour described above.
I have a workaround where you download the video files on your own with azure.storage.blob instead of letting videohash download them.
To download, you will need a BlobServiceClient, a ContainerClient and the connection string of the Azure Storage account.
Please create two files called v1.mp3 and v2.mp3 in the storage container before downloading the videos.
Complete Code:
import logging
import os
import tempfile

import azure.functions as func
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
from videohash import VideoHash

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Local file path on the server
    local_path = tempfile.gettempdir()
    filepath1 = os.path.join(local_path, "v1.mp3")
    filepath2 = os.path.join(local_path, "v2.mp3")

    # Reference to Blob Storage
    client = BlobServiceClient.from_connection_string("<Connection String>")

    # Reference to container
    container = client.get_container_client(container="test")

    # Downloading the files
    with open(file=filepath1, mode="wb") as download_file:
        download_file.write(container.download_blob("v1.mp3").readall())
    with open(file=filepath2, mode="wb") as download_file:
        download_file.write(container.download_blob("v2.mp3").readall())

    # Video hash code
    videohash1 = VideoHash(path=filepath1)
    videohash2 = VideoHash(path=filepath2)
    t = videohash2.is_similar(videohash1)

    return func.HttpResponse(f"Hello, {t}. This HTTP triggered function executed successfully.")
In my output I am now getting an ffmpeg error, but that is related to my test file and not to the error you are facing.
As far as I know, this workaround will not affect performance, since in both scenarios you are downloading the blobs anyway.

Navigating Event in AWS Lambda Python

So I'm fairly new to both AWS and Python. I'm on a uni assignment and have hit a road block.
I'm uploading data to AWS S3, this information is being sent to an SQS Queue and passed into AWS Lambda. I know, it would be much easier to just go straight from S3 to Lambda...but apparently "that's not the brief".
So I've got my event accurately coming into AWS Lambda, but no matter how deep I dig, I can't reach the information I need. In AWS Lambda, I run the following code.
def lambda_handler(event, context):
    print(event)
Via CloudWatch, I get the output
{'Records': [{'messageId': '1d8e0a1d-d7e0-42e0-9ff7-c06610fccae0', 'receiptHandle': 'AQEBr64h6lBEzLk0Xj8RXBAexNukQhyqbzYIQDiMjJoLLtWkMYKQp5m0ENKGm3Icka+sX0HHb8gJoPmjdTRNBJryxCBsiHLa4nf8atpzfyCcKDjfB9RTpjdTZUCve7nZhpP5Fn7JLVCNeZd1vdsGIhkJojJ86kbS3B/2oBJiCR6ZfuS3dqZXURgu6gFg9Yxqb6TBrAxVTgBTA/Pr35acEZEv0Dy/vO6D6b61w2orabSnGvkzggPle0zcViR/shLbehROF5L6WZ5U+RuRd8tLLO5mLFf5U+nuGdVn3/N8b7+FWdzlmLOWsI/jFhKoN4rLiBkcuL8UoyccTMJ/QTWZvh5CB2mwBRHectqpjqT4TA3Z9+m8KNd/h/CIZet+0zDSgs5u', 'body': '{"Records":[{"eventVersion":"2.1","eventSource":"aws:s3","awsRegion":"eu-west-2","eventTime":"2021-03-26T01:03:53.611Z","eventName":"ObjectCreated:Put","userIdentity":{"principalId":"MY_ID"},"requestParameters":{"sourceIPAddress":"MY_IP_ADD"},"responseElements":{"x-amz-request-id":"BQBY06S20RYNH1XJ","x-amz-id-2":"Cdo0RvX+tqz6SZL/Xw9RiBLMCS3Rv2VOsu2kVRa7PXw9TsIcZeul6bzbAS6z4HF6+ZKf/2MwnWgzWYz+7jKe07060bxxPhsY"},"s3":{"s3SchemaVersion":"1.0","configurationId":"test","bucket":{"name":"MY_BUCKET","ownerIdentity":{"principalId":"MY_ID"},"arn":"arn:aws:s3:::MY_BUCKET"},"object":{"key":"test.jpg","size":246895,"eTag":"c542637a515f6df01cbc7ee7f6e317be","sequencer":"00605D33019AD8E4E5"}}}]}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1616720643174', 'SenderId': 'AIDAIKZTX7KCMT7EP3TLW', 'ApproximateFirstReceiveTimestamp': '1616720648174'}, 'messageAttributes': {}, 'md5OfBody': '1ab703704eb79fbbb58497ccc3f2c555', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-2:ARN', 'awsRegion': 'eu-west-2'}]}
[Disclaimer, I've tried to edit out any identifying information but if there's any sensitive data I'm not understanding or missed, please let me know]
Anyways, just for a sample, I want to get the Object Key, which is test.jpg. I tried to drill down as much as I can, finally getting to:
def lambda_handler(event, context):
    print(event['Records'][0]['body'])
This returned the following (which was nice to see fully stylized):
{
    "Records": [
        {
            "eventVersion": "2.1",
            "eventSource": "aws:s3",
            "awsRegion": "eu-west-2",
            "eventTime": "2021-03-26T01:08:16.823Z",
            "eventName": "ObjectCreated:Put",
            "userIdentity": {
                "principalId": "MY_ID"
            },
            "requestParameters": {
                "sourceIPAddress": "MY_IP"
            },
            "responseElements": {
                "x-amz-request-id": "ZNKHRDY8GER4F6Q5",
                "x-amz-id-2": "i1Cazudsd+V57LViNWyDNA9K+uRbSQQwufMC6vf50zQfzPaH7EECsvw9SFM3l3LD+TsYEmnjXn1rfP9GQz5G5F7Fa0XZAkbe"
            },
            "s3": {
                "s3SchemaVersion": "1.0",
                "configurationId": "test",
                "bucket": {
                    "name": "MY_BUCKET",
                    "ownerIdentity": {
                        "principalId": "MY_ID"
                    },
                    "arn": "arn:aws:s3:::MY_BUCKET"
                },
                "object": {
                    "key": "test.jpg",
                    "size": 254276,
                    "eTag": "b0052ab9ba4b9395e74082cfd51a8f09",
                    "sequencer": "00605D3407594DE184"
                }
            }
        }
    ]
}
However, from this stage on, if I try to write print(event['Records'][0]['body']['Records']) or print(event['Records'][0]['s3']), I'm told I need an integer, not a string. If I try to write print(event['Records'][0]['body'][0]), I'm given a single character every time (in this case the first { bracket).
I'm not sure if this has something to do with tuples, or if at this stage it's all saved as one large string, but at least in the output view it doesn't appear to be saved that way.
Does anyone have any idea what I'd do from this stage to access the further information? In the full release after I'm done testing, I'll be wanting to save an audio file and the file name as opposed to a picture.
Thanks.
You are having this problem because the content of the body is JSON, but in string format. You should parse it to be able to access it like a normal dictionary. Like so:
import json

def handler(event: dict, context: object):
    body = event['Records'][0]['body']
    body = json.loads(body)
    # use the body as a normal dictionary
You are getting only a single character when using integer indexes because the body is a string, so using [n] on a string returns the nth character.
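For example, to pull out the object key (test.jpg in the sample event above), parse the body first and then index into the resulting dictionary. A minimal sketch against the event structure shown above:
import json

def lambda_handler(event, context):
    # The SQS record's body is a JSON string containing the S3 event.
    body = json.loads(event['Records'][0]['body'])

    # Drill into the parsed S3 event to reach the object key.
    key = body['Records'][0]['s3']['object']['key']
    print(key)  # e.g. "test.jpg"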
It's because you're getting stringified JSON data. You need to load it back into its Python dict format.
There is a useful package called lambda_decorators. You can install it with pip install lambda_decorators,
so you can do this:
from lambda_decorators import load_json_body

@load_json_body
def lambda_handler(event, context):
    print(event['Records'][0]['body'])
    # Now you can access the items in the body using their indexes and keys.
This will extract the JSON for you.

malloc(): unsorted double linked list corrupted Aborted (core dumped) python

While trying to download files from Google Drive concurrently using the concurrent.futures module, the script below throws malloc(): unsorted double linked list corrupted.
files = [
    {"id": "2131easd232", "name": "image1.jpg"},
    {"id": "2131easdfsd232", "name": "image2.jpg"},
    {"id": "2131ea32cesd232", "name": "image3.jpg"}
]

def download_file(data):
    request = drive_service.files().get_media(fileId=data['id'])
    fh = io.FileIO(data['name'], 'wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(download_file, files)
malloc(): unsorted double linked list corrupted Aborted (core dumped)
The script executes quickly (within 2 seconds) and only junk files (files with a size of 0 bytes) get created. But I am able to download the files synchronously without any interruption.
I also ran into this issue while working with the Google Directory API. I think what solved it for me was to create the service object inside the function that gets threaded.
def get_user_data(useremail):
    Threadedservice = build('admin', 'directory_v1', credentials=delegated_credentials)
    userresults = Threadedservice.users().get(userKey=useremail, viewType='admin_view', fields='recoveryPhone, name(fullName)').execute()
My case was also similar; I was using the YouTube Data API v3. It works only when you create the service object either inside the function that gets threaded, or outside the class and in the same module.
It also works if you copy your service/resource object using copy.deepcopy() and use a separate copied object in each thread.
from copy import deepcopy
object_copy = deepcopy(object)
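Applied to the original Drive download snippet, a minimal sketch of the per-thread approach could look like the following (creds stands in for whatever credentials object the original drive_service was built from, and files is the list from the question):
import concurrent.futures
import io

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

def download_file(data):
    # Build a fresh Drive client per thread instead of sharing drive_service.
    service = build('drive', 'v3', credentials=creds)
    request = service.files().get_media(fileId=data['id'])
    fh = io.FileIO(data['name'], 'wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(download_file, files)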

Blob name patterns of Azure Functions in Python

I am implementing an Azure Function in Python which is triggered by a file uploaded to blob storage. I want to specify the pattern of the filename and use its parts inside my code as follows:
function.json:
{
    "scriptFile": "__init__.py",
    "bindings": [
        {
            "name": "inputblob",
            "type": "blobTrigger",
            "direction": "in",
            "path": "dev/sources/{filename}.csv",
            "connection": "AzureWebJobsStorage"
        }
    ]
}
The executed __init__.py file looks as follows:
import logging
import azure.functions as func

def main(inputblob: func.InputStream):
    logging.info('Python Blob trigger function processed %s', inputblob.filename)
The error message that I get is: AttributeError: 'InputStream' object has no attribute 'filename'.
As a reference, I used this documentation.
Did I do something wrong or is it not possible to achieve what I want in Python?
Your function code should be this:
import logging
import os
import azure.functions as func

def main(myblob: func.InputStream):
    head, filename = os.path.split(myblob.name)
    name = os.path.splitext(filename)[0]
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name without extension: {name}\n"
                 f"Filename: {filename}")
It should be name instead of filename.:)
I know it's really late, but I was going through the same problem and found a way around it, so I decided to answer you here.
You can just reassemble the string in Python.
Inside __init__.py:
filenameraw = inputblob.name
filenameraw = filenameraw.split('/')[-1]
filenameraw = filenameraw.replace(".csv","")
with this you'll get your desired output. :)
