Automatically create zip file with dependencies for AWS Lambda - Python - python

I am building a script which automatically builds AWS Lambda function. I follow this github repo as an inspiration.
However, the lambda_handler that I want to deploy is having extra dependencies, such as numpy, pandas, or even lgbm. Simple example below:
import numpy as np
def lambda_handler(event, context):
result = np.power(event['data'], 2)
response = {'result': result}
return response
example of an event and its response:
event = {'data' = [1,2,3]}
lambda_handler(event)
> {"result": [1,4,9]}
I would like to automatically add the needed layer, while creating the AWS Lambda. For that I think I need to change create_lambda_deployment_package function that is in the lambda_basic.py in the repo. What I was thinking of doing is a following:
import zipfile
import glob
import io
def create_full_lambda_deployment_package(function_file_name):
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zipped:
# Adding all the files around the lambda_handler directory
for i in glob.glob(function_file_name):
zipped.write(i)
# Adding the numpy directory
for i in glob.glob('./venv/lib/python3.7/site-packages/numpy/*'):
zipped.write(i, f'numpy/{i[41:]}')
buffer.seek(0)
return buffer.read()
Despite the fact that lambda is created and 'numpy' folder appears my lambda environment, unfortunately this doesn't work (error cannot import name 'integer' from partially initialized module 'numpy' (most likely due to a circular import) (/var/task/numpy/__init__.py)").
How could I fix this issue? Or is there maybe another way to solve my problem?

Related

Problem triggering nested dependencies in Azure Function

I have a problem using the videohash package for python when deployed to Azure function.
My deployed azure function does not seem to be able to use a nested dependency properly. Specifically, I am trying to use the package “videohash” and the function VideoHash from it. The
input to VideoHash is a SAS url token for a video placed on an Azure blob storage.  
In the monitor of my output it prints: 
Accessing the sas url token directly takes me to the video, so that part seems to be working.  
Looking at the source code for videohash this error seems to occur in the process of downloading the video from a given url (link:
https://github.com/akamhy/videohash/blob/main/videohash/downloader.py). 
.. where self.yt_dlp_path = str(which("yt-dlp")). This to me indicates, that after deploying the function, the package yt-dlp isn’t properly activated. This is a dependency from the videohash
module, but adding yt-dlp directly to the requirements file of the azure function also does not solve the issue. 
Any ideas on what is happening? 
 
Deploying code to Azure function, which resulted in the details highlighted in the issue description.
I have a work around where you download the video file on you own instead of the videohash using azure.storage.blob
To download you will need a BlobServiceClient , ContainerClient and connection string of azure storage account.
Please create two files called v1.mp3 and v2.mp3 before downloading the video.
file structure:
Complete Code:
import logging
from videohash import VideoHash
import azure.functions as func
import subprocess
import tempfile
import os
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
def main(req: func.HttpRequest) -> func.HttpResponse:
# local file path on the server
local_path = tempfile.gettempdir()
filepath1 = os.path.join(local_path, "v1.mp3")
filepath2 = os.path.join(local_path,"v2.mp3")
# Reference to Blob Storage
client = BlobServiceClient.from_connection_string("<Connection String >")
# Reference to Container
container = client.get_container_client(container= "test")
# Downloading the file
with open(file=filepath1, mode="wb") as download_file:
download_file.write(container.download_blob("v1.mp3").readall())
with open(file=filepath2, mode="wb") as download_file:
download_file.write(container.download_blob("v2.mp3").readall())
// video hash code .
videohash1 = VideoHash(path=filepath1)
videohash2 = VideoHash(path=filepath2)
t = videohash2.is_similar(videohash1)
return func.HttpResponse(f"Hello, {t}. This HTTP triggered function executed successfully.")
Output :
Now here I am getting the ffmpeg error which related to my test file and not related to error you are facing.
This work around as far as I know will not affect performance as in both scenario you are downloading blobs anyway

Running Selenium as Lambda layer

I am trying to run Python Selenium with lambda layers.
I have followed this GitHub page: https://github.com/alvaroseparovich/AWS-LAMBDA-LAYER-Selenium
First, I have uploaded the zip as lambda layer and then created a function as:
import starter as st
def lambda_handler(event, context):
driver = st.start_drive()
driver.get('https://www.google.com.br')
page_data = driver.page_source
driver.close()
return page_data
but, it's returning the error as:
Unable to import module 'lambda_function': No module named 'starter'
How can I determine the cause of it?
you can package your program with all the dependencies and make a zip file and upload it to lamda.
Try below,
https://docs.aws.amazon.com/lambda/latest/dg/python-package.html

module google.cloud has no attribute storage

I'm trying to run a beam script in python on GCP following this tutorial:
[https://levelup.gitconnected.com/scaling-scikit-learn-with-apache-beam-251eb6fcf75b][1]
but I keep getting the following error:
AttributeError: module 'google.cloud' has no attribute 'storage'
I have google-cloud-storage in my requirements.txt so really not sure what I'm missing here.
My full script:
import apache_beam as beam
import json
query = """
SELECT
year,
plurality,
apgar_5min,
mother_age,
father_age,
gestation_weeks,
ever_born,
case when mother_married = true then 1 else 0 end as mother_married,
weight_pounds as weight,
current_timestamp as time,
GENERATE_UUID() as guid
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 100
"""
class ApplyDoFn(beam.DoFn):
def __init__(self):
self._model = None
from google.cloud import storage
import pandas as pd
import pickle as pkl
self._storage = storage
self._pkl = pkl
self._pd = pd
def process(self, element):
if self._model is None:
bucket = self._storage.Client().get_bucket('bqr_dump')
blob = bucket.get_blob('natality/sklearn-linear')
self._model = self._pkl.loads(blob.download_as_string())
new_x = self._pd.DataFrame.from_dict(element,
orient='index').transpose().fillna(0)
pred_weight = self._model.predict(new_x.iloc[:, 1:8])[0]
return [ {'guid': element['guid'],
'predicted_weight': pred_weight,
'time': str(element['time'])}]
# set up pipeline options
options = {'project': my-project-name,
'runner': 'DataflowRunner',
'temp_location': 'gs://bqr_dump/tmp',
'staging_location': 'gs://bqr_dump/tmp'
}
pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
with beam.Pipeline(options=pipeline_options) as pipeline:
(
pipeline
| 'ReadTable' >> beam.io.Read(beam.io.BigQuerySource(
query=query,
use_standard_sql=True))
| 'Apply Model' >> beam.ParDo(ApplyDoFn())
| 'Save to BigQuery' >> beam.io.WriteToBigQuery(
'pzn-pi-sto:beam_test.weight_preds',
schema='guid:STRING,weight:FLOAT64,time:STRING',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
and my requirements.txt:
google-cloud==0.34.0
google-cloud-storage==1.30.0
apache-beam[GCP]==2.20.0
This issue is usually related to two main reasons: the modules not being well installed, which means that something broke during the installation and the second reason, the import of the module not being correctly done.
To fix the issue, in case the reason is the broken modules, reinstalling or checking it in a virtual environment would be the solution. As indicated here, a similar case as yours, this should fix your case.
For the second reason, try to change your code and import all the modules in the beginning of the code, as demonstrated in this official example here. Your code should be something like this:
import apache_beam as beam
import json
import pandas as pd
import pickle as pkl
from google.cloud import storage
...
Let me know if this information helped you!
Make sure u have installed the correct version. Because the modules Google maintains will have constant updates. If u just give pip install for the required package it is directly going to install the latest version of the package.

Fastai - failed initiation of language model in Sentence Piece Processor, cache_dir parameter

I've been already browsing web for hours to find a solution for my, which i believe so might be a pretty petty issue.
I'm using fastai's Sentence Piece Processor (SPProcesor) at the very first steps of initiation of a language model.
My code for these steps looks like this:
bs = 48
processor = SPProcessor(lang='pl')
data_lm = (TextList.from_csv('', target_corpus, processor=processor)
.split_by_rand_pct(0.1)
.label_for_lm()
.databunch(bs=bs)
)
data_lm.save(data_lm_file)
After execution i get an error which is as follows:
~/x/miniconda3/envs/fastai/lib/python3.6/site-packages/fastai/text/data.py in process(self, ds)
466 self.sp_model,self.sp_vocab = cache_dir/'spm.model',cache_dir/'spm.vocab'
467 if not getattr(self, 'vocab', False):
--> 468 with open(self.sp_vocab, 'r', encoding=self.enc) as f: self.vocab = Vocab([line.split('\t')[0] for line in f.readlines()])
469 if self.n_cpus <= 1: ds.items = self._encode_batch(ds.items)
470 else:
FileNotFoundError: [Errno 2] No such file or directory: 'tmp/spm/spm.vocab'
The proper outcome of the code executed above should be as following:
created folder named 'tmp', containing folder 'spm', within which should be placed 2 files named respectively: spm.vocab and spm.model.
What happens instead is that 'tmp' folder is created along with files named "cache_dir".vocab and "cache_dir".model inside my current directory.
Folder 'spm' is nowhere to be found.
I've found a sort of workaround solution.
It consists of manually creating a 'spm' folder inside 'tmp' and moving those 2 other mentioned above files into it, and changing their names to spm.vocab and spm.model.
That allows me to carry on with my processing yet I'd like to find a way to skip that neccessity of manually moving created files and else.
Maybe I need to pass some paramateres (probably cache_dir) with specific values before processing?
If you'd have any idea on how to solve that issue, please point me those.
I'd be grateful.
I can see similar error if I switch the code in fastai/text/data.py to an earlier version of this commit. Then, if I apply changes from the same commit it all works nicely. Now, the most recent version of the same file (the one which supposed to help with paths with spaces) seems to have yet another bug introduced there.
So pretty much it seems that the problem is that fastai is trying to give argument --model_prefix with quotes to the sentencepiece .SentencePieceTrainer.Train which makes it "misbehave".
One possibility for you would be to either (1) update to the later version of fastai (which might not help due to another bug in a newer version), or (2) manually apply changes from here to your installation's fastai/text/data.py. It's a very small change - just delete the line:
cache_dir = cache_dir/'spm'
and replace
f'--model_prefix="cache_dir" --vocab_size={vocab_sz} --model_type={model_type}']))
with the
f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
In case you are not comfortable with updating the code of the installation you can monkey-patch the module by substituting existing train_sentencepiece function by writing fixed version in your code and then doing something like fastai.text.data.train_sentencepiece = my_fixed_train_sentencepiece before other calls.
So if you are using newer version of the library the code might look like this:
import fastai
from fastai.core import PathOrStr
from fastai.text.data import ListRules, get_default_size, quotemark, full_char_coverage_langs
from typing import Collection
def train_sentencepiece(texts:Collection[str], path:PathOrStr, pre_rules: ListRules=None, post_rules:ListRules=None,
vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en',
char_coverage=None, tmp_dir='tmp', enc='utf8'):
"Train a sentencepiece tokenizer on `texts` and save it in `path/tmp_dir`"
from sentencepiece import SentencePieceTrainer
cache_dir = Path(path)/tmp_dir
os.makedirs(cache_dir, exist_ok=True)
if vocab_sz is None: vocab_sz=get_default_size(texts, max_vocab_sz)
raw_text_path = cache_dir / 'all_text.out'
with open(raw_text_path, 'w', encoding=enc) as f: f.write("\n".join(texts))
spec_tokens = ['\u2581'+s for s in defaults.text_spec_tok]
SentencePieceTrainer.Train(" ".join([
f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
f"--user_defined_symbols={','.join(spec_tokens)}",
f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
raw_text_path.unlink()
return cache_dir
fastai.text.data.train_sentencepiece = train_sentencepiece
And if you are using older version, then like the following:
import fastai
from fastai.core import PathOrStr
from fastai.text.data import ListRules, get_default_size, full_char_coverage_langs
from typing import Collection
def train_sentencepiece(texts:Collection[str], path:PathOrStr, pre_rules: ListRules=None, post_rules:ListRules=None,
vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en',
char_coverage=None, tmp_dir='tmp', enc='utf8'):
"Train a sentencepiece tokenizer on `texts` and save it in `path/tmp_dir`"
from sentencepiece import SentencePieceTrainer
cache_dir = Path(path)/tmp_dir
os.makedirs(cache_dir, exist_ok=True)
if vocab_sz is None: vocab_sz=get_default_size(texts, max_vocab_sz)
raw_text_path = cache_dir / 'all_text.out'
with open(raw_text_path, 'w', encoding=enc) as f: f.write("\n".join(texts))
spec_tokens = ['\u2581'+s for s in defaults.text_spec_tok]
SentencePieceTrainer.Train(" ".join([
f"--input={raw_text_path} --max_sentence_length={max_sentence_len}",
f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
f"--user_defined_symbols={','.join(spec_tokens)}",
f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
raw_text_path.unlink()
return cache_dir
fastai.text.data.train_sentencepiece = train_sentencepiece

FileUploadMiscError while persisting output file from Azure Batch

I'm facing the following error while trying to persist log files to Azure Blob storage from Azure Batch execution - "FileUploadMiscError - A miscellaneous error was encountered while uploading one of the output files". This error doesn't give a lot of information as to what might be going wrong. I tried checking the Microsoft Documentation for this error code, but it doesn't mention this particular error code.
Below is the relevant code for adding the task to Azure Batch that I have ported from C# to Python for persisting the log files.
Note: The container that I have configured gets created when the task is added, but there's no blob inside.
import datetime
import logging
import os
import azure.storage.blob.models as blob_model
import yaml
from azure.batch import models
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.common.cloudstorageaccount import CloudStorageAccount
from dotenv import load_dotenv
LOG = logging.getLogger(__name__)
def add_tasks(batch_client, job_id, task_id, io_details, blob_details):
task_commands = "This is a placeholder. Actual code has an actual task. This gets completed successfully."
LOG.info("Configuring the blob storage details")
base_blob_service = BaseBlobService(
account_name=blob_details['account_name'],
account_key=blob_details['account_key'])
LOG.info("Base blob service created")
base_blob_service.create_container(
container_name=blob_details['container_name'], fail_on_exist=False)
LOG.info("Container present")
container_sas = base_blob_service.generate_container_shared_access_signature(
container_name=blob_details['container_name'],
permission=blob_model.ContainerPermissions(write=True),
expiry=datetime.datetime.now() + datetime.timedelta(days=1))
LOG.info(f"Container SAS created: {container_sas}")
container_url = base_blob_service.make_container_url(
container_name=blob_details['container_name'], sas_token=container_sas)
LOG.info(f"Container URL created: {container_url}")
# fpath = task_id + '/output.txt'
fpath = task_id
LOG.info(f"Creating output file object:")
out_files_list = list()
out_files = models.OutputFile(
file_pattern=r"../stderr.txt",
destination=models.OutputFileDestination(
container=models.OutputFileBlobContainerDestination(
container_url=container_url, path=fpath)),
upload_options=models.OutputFileUploadOptions(
upload_condition=models.OutputFileUploadCondition.task_completion))
out_files_list.append(out_files)
LOG.info(f"Output files: {out_files_list}")
LOG.info(f"Creating the task now: {task_id}")
task = models.TaskAddParameter(
id=task_id, command_line=task_commands, output_files=out_files_list)
batch_client.task.add(job_id=job_id, task=task)
LOG.info(f"Added task: {task_id}")
There is a bug in Batch's OutputFile handling which causes it to fail to upload to containers if the full container URL includes any query-string parameters other than the ones included in the SAS token. Unfortunately, the azure-storage-blob Python module includes an extra query string parameter when generating the URL via make_container_url.
This issue was just raised to us, and a fix will be released in the coming weeks, but an easy workaround is instead of using make_container_url to craft the URL, craft it yourself like so: container_url = 'https://{}/{}?{}'.format(blob_service.primary_endpoint, blob_details['container_name'], container_sas).
The resulting URL should look something like this: https://<account>.blob.core.windows.net/<container>?se=2019-01-12T01%3A34%3A05Z&sp=w&sv=2018-03-28&sr=c&sig=<sig> - specifically it shouldn't have restype=container in it (which is what the azure-storage-blob package is including)

Categories