AWS - OS Error permission denied Lambda Script - python

I'm trying to execute a Lambda script in Python that calls a bundled executable, however I'm getting permission errors.
I am also getting some alerts about the database, but the database queries are called after the subprocess, so I don't think they are related. Could someone explain why I get this error?
Alert information
Alarm:Database-WriteCapacityUnitsLimit-BasicAlarm
State changed to INSUFFICIENT_DATA at 2016/08/16. Reason: Unchecked: Initial alarm creation
Lambda Error
[Errno 13] Permission denied: OSError
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 36, in lambda_handler
    xml_output = subprocess.check_output(["./mediainfo", "--full", "--output=XML", signed_url])
  File "/usr/lib64/python2.7/subprocess.py", line 566, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied
Lambda code
import logging
import subprocess

import boto3

SIGNED_URL_EXPIRATION = 300  # The number of seconds that the Signed URL is valid
DYNAMODB_TABLE_NAME = "TechnicalMetadata"
DYNAMO = boto3.resource("dynamodb")
TABLE = DYNAMO.Table(DYNAMODB_TABLE_NAME)

logger = logging.getLogger('boto3')
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """
    :param event:
    :param context:
    """
    # Loop through records provided by S3 Event trigger
    for s3_record in event['Records']:
        logger.info("Working on new s3_record...")
        # Extract the Key and Bucket names for the asset uploaded to S3
        key = s3_record['s3']['object']['key']
        bucket = s3_record['s3']['bucket']['name']
        logger.info("Bucket: {} \t Key: {}".format(bucket, key))
        # Generate a signed URL for the uploaded asset
        signed_url = get_signed_url(SIGNED_URL_EXPIRATION, bucket, key)
        logger.info("Signed URL: {}".format(signed_url))
        # Launch MediaInfo
        # Pass the signed URL of the uploaded asset to MediaInfo as an input
        # MediaInfo will extract the technical metadata from the asset
        # The extracted metadata will be outputted in XML format and
        # stored in the variable xml_output
        xml_output = subprocess.check_output(["./mediainfo", "--full", "--output=XML", signed_url])
        logger.info("Output: {}".format(xml_output))
        save_record(key, xml_output)


def save_record(key, xml_output):
    """
    Save record to DynamoDB

    :param key: S3 Key Name
    :param xml_output: Technical Metadata in XML Format
    :return:
    """
    logger.info("Saving record to DynamoDB...")
    TABLE.put_item(
        Item={
            'keyName': key,
            'technicalMetadata': xml_output
        }
    )
    logger.info("Saved record to DynamoDB")


def get_signed_url(expires_in, bucket, obj):
    """
    Generate a signed URL

    :param expires_in: URL Expiration time in seconds
    :param bucket:
    :param obj: S3 Key name
    :return: Signed URL
    """
    s3_cli = boto3.client("s3")
    presigned_url = s3_cli.generate_presigned_url('get_object',
                                                  Params={'Bucket': bucket, 'Key': obj},
                                                  ExpiresIn=expires_in)
    return presigned_url

I'm fairly certain that this is a restriction imposed by the Lambda execution environment, but it can be worked around by executing the binary through the shell.
Try providing shell=True to your subprocess call, and pass the command as a single string (with shell=True, a list argument is not interpreted the way you'd expect):
xml_output = subprocess.check_output("./mediainfo --full --output=XML '{}'".format(signed_url), shell=True)

I encountered a similar situation. I was receiving the error:
2016-11-28T01:49:01.304Z d4505c71-b50c-11e6-b0a1-65eecf2623cd Error: Command failed: /var/task/node_modules/youtube-dl/bin/youtube-dl --dump-json -f best https://soundcloud.com/bla/blabla
python: can't open file '/var/task/node_modules/youtube-dl/bin/youtube-dl': [Errno 13] Permission denied
For my (and every other) Node Lambda project containing third-party libraries, there is a directory called "node_modules" (most tutorials, such as this one, detail how this directory is created) that holds all the third-party packages and their dependencies. The same principle applies to the other supported languages (currently Python and Java). These are the files that Amazon actually puts on the Lambda AMIs and attempts to use. So, to fix the issue, run this on the node_modules directory (or whatever directory your third-party libraries live in):
chmod -R 777 /Users/bla/bla/bla/lambdaproject/node_modules
This command makes the files readable, writable, and executable by all users, which is apparently what the servers that execute Lambda functions need in order to work. Hopefully this helps!
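If you'd rather not open up permissions on everything, another option is to fix it at runtime. This is only a sketch, assuming the binary ships at the root of the deployment package as mediainfo: since /var/task is mounted read-only, copy the binary to /tmp and set the execute bits before calling it.
import os
import shutil
import stat
import subprocess

def run_mediainfo(signed_url):
    # The deployment package under /var/task is read-only, but /tmp is writable
    src = os.path.join(os.path.dirname(os.path.abspath(__file__)), "mediainfo")
    dst = "/tmp/mediainfo"
    if not os.path.exists(dst):
        shutil.copy2(src, dst)
        # Add the execute bits so the subprocess call is allowed to run the binary
        os.chmod(dst, os.stat(dst).st_mode | stat.S_IEXEC | stat.S_IXGRP | stat.S_IXOTH)
    return subprocess.check_output([dst, "--full", "--output=XML", signed_url])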

Related

Document AI process document fails with invalid argument when processing docs from GCS

I am getting an error very similar to the below, but I am not in EU:
Document AI: google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument
When I use the raw_document and process a local pdf file, it works fine. However, when I specify a pdf file on a GCS location, it fails.
Error message:
the processor name: projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97
the form process request: name: "projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97"
inline_document {
uri: "gs://xxxx/temp/test1.pdf"
}
Traceback (most recent call last):
File "C:\Python39\lib\site-packages\google\api_core\grpc_helpers.py", line 66, in error_remapped_callable
return callable_(*args, **kwargs)
File "C:\Python39\lib\site-packages\grpc\_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "C:\Python39\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Request contains an invalid argument."
debug_error_string = "{"created":"#1647296055.582000000","description":"Error received from peer ipv4:142.250.80.74:443","file":"src/core/lib/surface/call.cc","file_line":1070,"grpc_message":"Request contains an invalid argument.","grpc_status":3}"
>
Code:
client = documentai.DocumentProcessorServiceClient(client_options=opts)
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
print(f'the processor name: {name}')
# document = {"uri": gcs_path, "mime_type": "application/pdf"}
document = {"uri": gcs_path}
inline_document = documentai.Document()
inline_document.uri = gcs_path
# inline_document.mime_type = "application/pdf"
# Configure the process request
# request = {"name": name, "inline_document": document}
request = documentai.ProcessRequest(
    inline_document=inline_document,
    name=name
)
print(f'the form process request: {request}')
result = client.process_document(request=request)
I do not believe I have permission issues on the bucket since the same set up works fine for a document classification process on the same bucket.
This is a known issue for Document AI and is already reported in this issue tracker. Unfortunately, the only workaround for now is to either:
Download your file, read it as bytes, and use process_document(). See Document AI local processing for the sample code, and the sketch below.
Use batch_process_documents(), since by default it only accepts files from GCS. This avoids the extra step of downloading the file.
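A minimal sketch of Option 1, assuming the google-cloud-storage and google-cloud-documentai client libraries and a GCS path already split into bucket and blob names:
from google.cloud import documentai_v1 as documentai
from google.cloud import storage

def process_from_gcs(project_id, location, processor_id, bucket_name, blob_name):
    # Work around the inline uri limitation by downloading the bytes first
    pdf_bytes = storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()

    # Send the bytes as a raw_document instead of referencing the GCS uri
    client = documentai.DocumentProcessorServiceClient()
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    raw_document = documentai.RawDocument(content=pdf_bytes, mime_type="application/pdf")
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    return client.process_document(request=request)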
This is still an issue 5 months later, and something not mentioned in the accepted answer (I could be wrong, but it seems to me) is that batch processes can only write their results to GCS, so you'll still incur the extra step of downloading something from a bucket (either the input document under Option 1 or the result under Option 2). On top of that, you'll have to clean up the bucket if you don't want the results there, so in many circumstances Option 2 offers little advantage beyond the fact that the result download will probably be smaller than the input file download.
I'm using the client library in a Python Cloud Function and I'm affected by this issue. I'm implementing Option 1 because it seems simplest, and I'm holding out for the fix. I also considered using the Workflows client library to fire a workflow that runs a Document AI process, or calling the Document AI REST API directly, but it's all very suboptimal.
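For reference, a rough sketch of Option 2 with GCS output (the URIs are placeholders, and the exact message names may vary slightly between client library versions):
from google.cloud import documentai_v1 as documentai

def batch_process(project_id, location, processor_id, gcs_input_uri, gcs_output_uri):
    client = documentai.DocumentProcessorServiceClient()
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Input: one or more documents already sitting in GCS
    input_config = documentai.BatchDocumentsInputConfig(
        gcs_documents=documentai.GcsDocuments(
            documents=[documentai.GcsDocument(gcs_uri=gcs_input_uri,
                                              mime_type="application/pdf")]
        )
    )
    # Output: results are written back to GCS as JSON shards
    output_config = documentai.DocumentOutputConfig(
        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
            gcs_uri=gcs_output_uri
        )
    )
    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )
    operation = client.batch_process_documents(request=request)
    operation.result(timeout=300)  # block until the batch job finishes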

AWS ClientError when using Lambda and S3 to insert data to bucket

I am trying to put a JSON blob into an S3 bucket using Lambda, and I am getting the following error in the CloudWatch logs:
[ERROR] ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
Traceback (most recent call last):
File "/var/task/main.py", line 147, in lambda_handler
save_articles_and_comments(sub, submissions)
File "/var/task/main.py", line 125, in save_articles_and_comments
object.put(Body=json.dumps(articles))
File "/var/task/boto3/resources/factory.py", line 520, in do_action
response = action(self, *args, **kwargs)
File "/var/task/boto3/resources/action.py", line 83, in __call__
response = getattr(parent.meta.client, operation_name)(*args, **params)
File "/var/task/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/var/task/botocore/client.py", line 635, in _make_api_call
raise error_class(parsed_response, operation_name)
All of the block public access settings are set to "off", and the bucket name in the code is the same as in S3. This is the code that puts the JSON blobs into their respective folders in my S3 bucket, along with the Lambda handler:
def save_articles_and_comments(sub, submissions):
    """
    """
    s3 = boto3.resource('s3')
    now = dt.datetime.utcnow()
    formatted_date = now.strftime("%Y-%m-%d-%H-%M-%S")
    articles, comments = data_for_subreddit(submissions)
    print("Number of articles, comments {}, {}".format(len(articles), len(comments)))
    articles_name = 'articles/' + formatted_date + '_' + sub + '_articles.json'
    comments_name = 'comments/' + formatted_date + '_' + sub + '_comments.json'
    object = s3.Object('diegos-reddit-bucket', articles_name)
    object.put(Body=json.dumps(articles))
    print("Finished writing articles to {}".format(articles_name))
    object = s3.Object('diegos-reddit-bucket', comments_name)
    object.put(Body=json.dumps(comments))
    print("Finished writing comments to {}".format(comments_name))

def lambda_handler(x, y):
    """
    """
    import time
    import random
    idx = random.randint(0, len(SUBREDDITS)-1)
    start = time.time()
    assert PRAW_KEY is not None
    sub = SUBREDDITS[idx]
    red = reddit_instance()
    subreddit = red.subreddit(sub)
    print("Pulling posts from {}, {}.".format(sub, "hot"))
    submissions = subreddit.hot()
    save_articles_and_comments(sub, submissions)
    print("="*50)
    print("Pulling posts from {}, {}.".format(sub, "new"))
    submissions = subreddit.new()
    save_articles_and_comments(sub, submissions)
    print("="*50)
    print("Pulling posts from {}, {}.".format(sub, "top"))
    submissions = subreddit.top()
    save_articles_and_comments(sub, submissions)
    print("="*50)
    print("Pulling posts from {}, {}.".format(sub, "rising"))
    submissions = subreddit.rising()
    save_articles_and_comments(sub, submissions)
    end = time.time()
    print("Elapsed time {}".format(end - start))
I do not see what the problem in my code is that would cause this error. I swapped out my lambda_handler function with a main to test locally. With the main it works and writes to the S3 bucket and its respective folders. When I run via AWS Lambda, I get the error after the function finishes pulling posts from the first subreddit and tries to put the JSON blob into its folder in the S3 bucket. This is what my output is supposed to look like:
Pulling posts from StockMarket, hot.
Number of articles, comments 101, 909
Finished writing articles to articles/2020-06-03-02-48-44_StockMarket_articles.json
Finished writing comments to comments/2020-06-03-02-48-44_StockMarket_comments.json
==================================================
Pulling posts from StockMarket, new.
Number of articles, comments 101, 778
Finished writing articles to articles/2020-06-03-02-49-10_StockMarket_articles.json
Finished writing comments to comments/2020-06-03-02-49-10_StockMarket_comments.json
==================================================
Pulling posts from StockMarket, top.
Number of articles, comments 101, 5116
Finished writing articles to articles/2020-06-03-02-49-36_StockMarket_articles.json
Finished writing comments to comments/2020-06-03-02-49-36_StockMarket_comments.json
==================================================
Pulling posts from StockMarket, rising.
Number of articles, comments 24, 170
Finished writing articles to articles/2020-06-03-02-52-10_StockMarket_articles.json
Finished writing comments to comments/2020-06-03-02-52-10_StockMarket_comments.json
Elapsed time 215.6588649749756
Is there a problem in my code or is this a problem on the AWS side?
The problem occurs because you have no permissions to write objects to the bucket:
PutObject operation: Access Denied
To rectify the issue, you have to look at the Lambda execution role: does it have permission to write to S3? You can also inspect the bucket policy.
With the main it works, and writes to the S3 bucket and its respected folders. When I try and run via AWS Lambda, I get said error
When you test locally, your code uses your own permissions (your IAM user) to write to S3, thus it works. When you execute the code on Lambda, your function does not use your permissions. Instead it uses the permissions defined in the Lambda execution role.
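As a sketch of what the fix can look like (the role and policy names here are hypothetical, and you can just as easily attach the statement in the IAM console), the execution role needs an s3:PutObject allowance on the bucket:
import json
import boto3

# Hypothetical role name: replace with the function's actual execution role
ROLE_NAME = "my-lambda-execution-role"

# Minimal inline policy allowing the function to write objects into the bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::diegos-reddit-bucket/*"
    }]
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="allow-put-to-reddit-bucket",
    PolicyDocument=json.dumps(policy)
)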

boto3 textract start_document_text_detection doesn't accept folders for input files on s3

I've written a Lambda function to extract text from image files stored in S3. The Lambda is triggered by new objects. The images are stored in folders.
When I test with files stored at the root of my S3 bucket, everything works fine. When I use a folder, things break.
When the documentLocation looks like this:
{'S3Object': {'Bucket': 'extractbucket', 'Name': 'img024.jpg'}}
everything works.
When it looks like this:
{'S3Object': {'Bucket': 'extractbucket', 'Name': 'afold/img024.jpg'}}
I get an InvalidParameterException.
Steps to reproduce
Here's my lambda function (Python3.8, region:us-east-2):
import json
import boto3

def lambda_handler(event, context):
    bucket = "extractbucket"
    client = boto3.client('textract')
    jobFile = event['Records'][0]['s3']['object']['key']
    # process using S3 object
    docLoc = {
        "S3Object": {
            "Bucket": bucket,
            "Name": jobFile
        }
    }
    response = client.start_document_text_detection(
        DocumentLocation=docLoc,
        JobTag=jobFile,
        NotificationChannel={
            "RoleArn": "arn:aws:iam::xxxxx:role/Textract_demo_sns",
            "SNSTopicArn": "arn:aws:sns:us-east-2:xxxxx:TxtExtractComplete"
        }
    )
    return {
        'statusCode': 200,
        'body': json.dumps("sent filejobID:" + jobFile + " to queue")
    }
I test this using an S3 trigger test, putting the filename in the object key. When I test with root-level files it all works; when I test with files in a folder, it breaks as shown below.
Debug logs
InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 18, in lambda_handler
response = client.start_document_text_detection(
File "/var/runtime/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/var/runtime/botocore/client.py", line 626, in _make_api_call
raise error_class(parsed_response, operation_name)END
Any help would be appreciated, thanks for your time.
My testing shows that start_document_text_detection() works fine with objects in subdirectories.
I suspect that the Key contains URL-like characters rather than a pure slash. You can test this by printing the value of jobFile and looking in the logs to view the value.
Here is code that will avoid this problem:
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
You will also need to import urllib.parse.
The JobTag parameter doesn't accept spaces or symbols, including '/'.
So when a file inside a folder is uploaded and its key is assigned to JobTag, the tag contains a slash and the request is rejected with the invalid parameter error.
Resolution: remove or replace the slashes in the JobTag, for example with Python's str.replace().
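Putting the two answers together, a sketch of the relevant part of the handler (the NotificationChannel is omitted here for brevity, and deriving the tag by replacing slashes with hyphens is just one choice):
import json
import urllib.parse
import boto3

def lambda_handler(event, context):
    bucket = "extractbucket"
    client = boto3.client('textract')

    # Decode the key: S3 event notifications URL-encode it (spaces arrive as '+')
    jobFile = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # JobTag cannot contain '/', so build a tag with the slashes replaced
    jobTag = jobFile.replace("/", "-")

    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": jobFile}},
        JobTag=jobTag
    )
    return {
        'statusCode': 200,
        'body': json.dumps("sent job for " + jobFile)
    }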

Using Python to Manage AWS

I’m trying to use Python to create EC2 instances but I keep getting these errors.
Here is my code:
#!/usr/bin/env python
import boto3
ec2 = boto3.resource('ec2')
instance = ec2.create_instances(
    ImageId='ami-0922553b7b0369273',
    MinCount=1,
    MaxCount=1,
    InstanceType='t2.micro')
print instance[0].id
Here are the errors I'm getting
Traceback (most recent call last):
File "./createinstance.py", line 8, in <module>
InstanceType='t2.micro')
File "/usr/lib/python2.7/site-packages/boto3/resources/factory.py", line 520, in do_action
response = action(self, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/boto3/resources/action.py", line 83, in __call__
response = getattr(parent.meta.client, operation_name)(**params)
File "/usr/lib/python2.7/site-packages/botocore/client.py", line 320, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python2.7/site-packages/botocore/client.py", line 623, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-0922553b7b0369273]' does not exist
I also get an error when trying to create a key pair
Here's my code for creating the keypair
import boto3
ec2 = boto3.resource('ec2')
# create a file to store the key locally
outfile = open('ec2-keypair.pem','w')
# call the boto ec2 function to create a key pair
key_pair = ec2.create_key_pair(KeyName='ec2-keypair')
# capture the key and store it in a file
KeyPairOut = str(key_pair.key_material)
print(KeyPairOut)
outfile.write(KeyPairOut)
response = ec2.instance-describe()
print response
Here are the error messages:
./createkey.py: line 1: import: command not found
./createkey.py: line 2: syntax error near unexpected token `('
./createkey.py: line 2: `ec2 = boto3.resource('ec2')'
What am I missing?
For your first script, one of two possibilities could be occurring:
1. The AMI you are referencing by ID is not available because the ID is incorrect or the AMI doesn't exist.
2. The AMI is unavailable in the region your machine is configured for.
You are most likely running your script from a machine that is not configured for the correct region. If you are running the script locally or on a server that does not have roles configured, and you are using the aws-cli, you can run the aws configure command to set your access keys and region appropriately. If you are running your instance on a server with roles configured, your server needs to run in the correct region, and your roles need to allow access to EC2 AMIs.
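A quick way to check both possibilities is to ask EC2 whether the AMI is visible from the client's configured region. This is only a sketch; the region below is an example, so substitute the one the AMI is supposed to live in:
import boto3
from botocore.exceptions import ClientError

REGION = 'us-east-1'  # example region; replace as needed
AMI_ID = 'ami-0922553b7b0369273'

client = boto3.client('ec2', region_name=REGION)
try:
    client.describe_images(ImageIds=[AMI_ID])
    print("AMI {} is visible in {}".format(AMI_ID, REGION))
except ClientError as e:
    # InvalidAMIID.NotFound / InvalidAMIID.Malformed mean the ID is wrong for this region
    if e.response['Error']['Code'].startswith('InvalidAMIID'):
        print("AMI {} is not available in {}".format(AMI_ID, REGION))
    else:
        raise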
For your second question (which in the future should probably be posted separately), the syntax error is a side effect of not following the same format as your first script: your Python script is most likely not being interpreted as a Python script at all. Add the shebang at the top of the file and remove the spacing preceding your import boto3 statement.
#!/usr/bin/env python
import boto3

ec2 = boto3.resource('ec2')

# create a file to store the key locally
outfile = open('ec2-keypair.pem', 'w')

# call the boto ec2 function to create a key pair
key_pair = ec2.create_key_pair(KeyName='ec2-keypair')

# capture the key and store it in a file
KeyPairOut = str(key_pair.key_material)
print(KeyPairOut)
outfile.write(KeyPairOut)

# describe existing instances via the EC2 client
client = boto3.client('ec2')
response = client.describe_instances()
print(response)

Get 400 error when try to upload a image file to a collection

I want to write a script to upload my photos to Google Drive. After a couple of hours of digging into the Google Documents List API, I chose gdata-python-client 2.0.17 (latest) to build my script. Everything works well, except that I cannot upload a file to a collection. Here is the exception.
Traceback (most recent call last):
File "./upload.py", line 27, in
upload(sys.argv[1])
File "./upload.py", line 22, in upload
client.create_resource(p, collection=f, media=ms)
File "/usr/local/lib/python2.7/dist-packages/gdata/docs/client.py", line 300, in create_resource
return uploader.upload_file(create_uri, entry, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/gdata/client.py", line 1090, in upload_file
start_byte, self.file_handle.read(self.chunk_size))
File "/usr/local/lib/python2.7/dist-packages/gdata/client.py", line 1048, in upload_chunk
raise error
gdata.client.RequestError: Server responded with: 400, <errors xmlns='http://schemas.google.com/g/2005'><error><domain>GData</domain><code>InvalidEntryException</code><internalReason>We're sorry, a server error occurred. Please try again.</internalReason></error></errors>
After hacking into the source code of gdata, I printed some info for debugging.
Range: bytes 0-524287/729223
PUT TO: https://docs.google.com/feeds/upload/create-session/default/private/full/folder%3A0B96cfHivZx6ddGFwYXVCbzc4U3M/contents?upload_id=AEnB2UqnYRFTOyCCIGIESUIctWg6hvQIHY4JRMnL-CUQhHii3RGMFWZ12a7lXWd1hgOChd1Vqlr8d-BmvyfmhFhzhYK9Vnw4Xw
Range: bytes 524288-729222/729223
PUT TO: https://docs.google.com/feeds/upload/create-session/default/private/full/folder%3A0B96cfHivZx6ddGFwYXVCbzc4U3M/contents?upload_id=AEnB2UqnYRFTOyCCIGIESUIctWg6hvQIHY4JRMnL-CUQhHii3RGMFWZ12a7lXWd1hgOChd1Vqlr8d-BmvyfmhFhzhYK9Vnw4Xw
The exception is raised when the last part of the file is PUT.
I would advise you to try the new Google Drive API v2, which makes this much easier and has better support for media upload: https://developers.google.com/drive/v2/reference/files/insert
Once you get an authorized service instance, you can simply insert a new file like so:
from apiclient import errors
from apiclient.http import MediaFileUpload
# ...

def insert_file(service, title, description, parent_id, mime_type, filename):
    """Insert new file.

    Args:
        service: Drive API service instance.
        title: Title of the file to insert, including the extension.
        description: Description of the file to insert.
        parent_id: Parent folder's ID.
        mime_type: MIME type of the file to insert.
        filename: Filename of the file to insert.
    Returns:
        Inserted file metadata if successful, None otherwise.
    """
    media_body = MediaFileUpload(filename, mimetype=mime_type, resumable=True)
    body = {
        'title': title,
        'description': description,
        'mimeType': mime_type
    }
    # Set the parent folder.
    if parent_id:
        body['parents'] = [{'id': parent_id}]
    try:
        file = service.files().insert(
            body=body,
            media_body=media_body).execute()
        return file
    except errors.HttpError, error:
        print 'An error occurred: %s' % error
        return None
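For completeness, a hypothetical call once you have an authorized service object (the folder ID here is taken from the upload URL in the question, and the file names are only examples):
inserted = insert_file(service,
                       title='photo1.jpg',
                       description='Holiday photo',
                       parent_id='0B96cfHivZx6ddGFwYXVCbzc4U3M',
                       mime_type='image/jpeg',
                       filename='photo1.jpg')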
