AWS Python Lambda times out forever after first timeout

Problem:
I have a Python Lambda that receives data every second and writes it into DynamoDB. I noticed that after the first time DynamoDB takes a little longer and the function times out, all the following invocations also time out and it never recovers.
The only way to bring the Lambda back to normal is to redeploy it.
When it starts timing out, it does not produce any logs; it times out without executing any of the code.
(A console screenshot illustrating the issue accompanied the original post.)
In order to reproduce the issue faster with this function I did the following:
Redeploy it and confirm it is working fine.
Reduce the memory available to the Lambda to the minimum and the timeout to 1 second. This causes the first timeout.
Increase the memory of the Lambda back to normal and even increase the timeout. The timeouts nevertheless persist.
Is there a way to resolve this issue without having to redeploy?
I have seen the same issue described, but with Node.js, in this post: https://forums.aws.amazon.com/thread.jspa?threadID=234417.
I haven't seen any description related to the Python Lambda environment.
More information about the setup:
Lambda environments tested: python3.6 and python3.7
Tool to deploy lambda: serverless 1.57.0
serverless plugins used: serverless-python-requirements, serverless-wsgi
I am not using any VPC for the lambda
Thank you for the help,

Figured out the trigger for the bug.
When the uploaded Lambda function zip is too large, then after the first time it times out, it never recovers!
My solution was to carefully strip out the unnecessary dependencies to make the package smaller.
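Since serverless-python-requirements is already in the plugin list (see the setup above), its slim option is one way to shrink the package. A hedged sketch of the relevant serverless.yml fragment, assuming the plugin's standard configuration:

# serverless.yml (fragment)
custom:
  pythonRequirements:
    slim: true  # strips *.pyc, __pycache__ and dist-info from the bundle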
I created a repository using a docker container for people to reproduce the issue more easily:
https://github.com/pedrohbtp/bug-aws-lambda-infinite-timeout
Thanks for the messages in the comments. I appreciate everyone who takes the time to try to help here on SO.

Related

Invoke Lambda Function for 30 minutes

I am trying to create a Lambda function that scrapes data from Wikipedia. A number of scripts run in the same Lambda, with a total execution time of more than 30 minutes. The issue is that Lambda times out after 15 minutes. I got the idea of using a Step Function to re-run the Lambda, but I have no idea how to make the Lambda start from where it left off the previous time.
I don't have the option to use any other AWS services.
Runtime: Python
You cannot run a Lambda for more than 900 seconds (15 minutes). That is a hard limit at this time.
As others mentioned, you could use Step Functions, use other services like EC2, or change the design of your "application".
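To make the Step Functions option concrete, the usual trick is to carry a checkpoint in the event payload so each invocation resumes where the previous one stopped. A minimal sketch, where the event fields and the process() function are hypothetical:

def process(page):
    # hypothetical work: scrape one page
    pass

def handler(event, context):
    # Hypothetical checkpoint pattern: a Step Functions loop re-invokes
    # this function with the output of the previous run as its input.
    pages = event['pages']              # work items (hypothetical field)
    start = event.get('next_index', 0)  # where the previous run stopped

    for i in range(start, len(pages)):
        # Stop early, leaving a margin before the 15-minute limit hits.
        if context.get_remaining_time_in_millis() < 60000:
            return {'done': False, 'next_index': i, 'pages': pages}
        process(pages[i])

    return {'done': True}

A Choice state in the state machine then loops back into this function while 'done' is false.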
But maybe you should stop scraping Wikipedia and instead use Wikidata (which is basically an API for all the data in Wikipedia).
Check it out: https://www.wikidata.org

dynamodb simple query execution time

I have a Python AWS Lambda function that queries an AWS DynamoDB table.
As my API now takes about 1 second to respond to a very simple query/table setup, I wanted to understand where I can optimize.
The table has only 3 items (users) at the moment and the following structure:
user_id (Primary Key, String),
details ("[{
"_nested_atrb1_str": "abc",
"_nested_atrb2_str": "def",
"_nested_map": [nested_item1,nested_item2]},
{..}]
The query is super simple:
response = table.query(
    KeyConditionExpression=Key('userid').eq("xyz")
)
The query takes 0.8-0.9 seconds.
Is this a normal query time for a table with only 3 items, where each user has at most 5 attributes (incl. nested)?
If yes, can I expect similar times if the structure stays the same but the number of items (users) increases a hundred-fold?
There are a few things to investigate. First off, is your timing of 0.8-0.9 seconds based on timing the query directly, by wrapping the query in a time- or timeit-style timer? If the query itself is truly taking that long, then there is definitely something not quite right in the interaction between Lambda and DynamoDB.
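For reference, a minimal way to time the query itself (the table name is a placeholder; the key name matches the query from the question):

import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('users')  # placeholder table name

start = time.perf_counter()
response = table.query(
    KeyConditionExpression=Key('userid').eq('xyz')
)
print('query took %.1f ms' % ((time.perf_counter() - start) * 1000))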
If the time you're seeing is actually the invoke time of your Lambda (I assume this goes through API Gateway as a REST API, since you mentioned "api"), then it could be due to many factors. Can you profile the API call? I would use Postman or even browser dev tools to profile the call and see the time spent on DNS lookup, SSL setup, etc. Additionally, CloudWatch will give you metrics for the call times of your Lambda once the request has reached it. You could also enable X-Ray, which will give you more detail on the execution of your Lambda. If your Lambda runs in a VPC, you could also be encountering cold starts that lead to the latency you're seeing.
X-Ray:
https://aws.amazon.com/xray/
Cold Starts: just Google "AWS Lambda cold starts" and you'll find all kinds of info
For anyone with similar experiences: I received the AWS developer support response below, with some useful references. It didn't solve my problem, but I now understand that this is mainly related to the low (test) volume and Lambda startup time.
1) Is this a normal query time for a table with only 3 items where each user only has max 5 attributes(incl nested)?
The time is slow but could be due to a number of factors based on your setup. Since you are using Lambda you need to keep in mind that every time you trigger your Lambda function it sets up your environment and then executes the code.
An AWS Lambda function runs within a container: an execution environment that is isolated from other functions. When you run a function for the first time, AWS Lambda creates a new container and begins executing the function's code. A Lambda function has a handler that is executed once per invocation. After the function executes, AWS Lambda may opt to reuse the container for subsequent invocations of the function. In this case, your function handler might be able to reuse the resources that you defined in your initialization code. (Note that you cannot control how long AWS Lambda will retain the container, or whether the container will be reused at all.)
Your table is really small, I had a look at it. [1]
2) Can I expect similar times if the structure stays the same but the number of items (users) increases hundred-fold?
If the code takes longer to execute and you have more data in DynamoDB eventually it could slow down, again based on your set up.
Some of my recommendations on optimizing your set up.
1) Have Lambda and DynamoDB within the same VPC. You can query your DynamoDB via a VPC endpoint. This will cut out any network latencies. [2][3]
2) Increase memory on lambda for faster startup and execution times.
3) As your application scales, make sure to enable auto-scaling on your DynamoDB table and also increase your RCUs and WCUs to improve DynamoDB's performance when handling requests. [4]
Additionally, have a look at DynamoDB best practices. [5]
Please feel free to contact me with any additional questions and for further guidance. Thank you. Have a great day.
References
[1] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.BestPracticesWithDynamoDB.html
[2] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/vpc-endpoints-dynamodb.html
[3] https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
[4] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
[5] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
Profiling my small Lambda code (outside of Lambda), I got these results, which you may find interesting.
Times are in milliseconds.
# Initially: 3 calls to the DB
1350 ms  1st call (read)
1074 ms  2nd call (write)
1051 ms  3rd call (read)

# After creating this once, outside the DB calls, and providing it to each one:
# dynamodb = boto3.resource('dynamodb', region_name=REGION_NAME)
  12 ms  executing the line above
1324 ms  1st call (read)
 285 ms  2nd call (write)
 270 ms  3rd call (read)

# Seeing that the reuse produced savings, I did the same with:
# tableusers = dynamodb.Table(TABLE_USERS)
  12 ms  create dynamodb handler
   3 ms  create table handler
1078 ms  read (reusing dynamodb and table)
 280 ms  write (reusing dynamodb and table)
 270 ms  read (reusing dynamodb, not table)
So initially it took ~3.5 seconds; now it takes ~1.6 seconds, for just adding 2 lines of code.
I got these results using %lprun on Jupyter / Colab:
# The -u 0.001 sets the time unit to 1 ms (default is 1 microsecond)
%lprun -u 0.001 -f lambdaquick lambdaquick()
If you only do one DB request and nothing else with the DB, try to put the two DB handlers outside the lambda handler, as amittn recommends.
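Putting that together, a minimal sketch of the layout (the region and table name are placeholders):

import boto3
from boto3.dynamodb.conditions import Key

# Created once per container, outside the handler, so warm
# invocations reuse the client and its connections.
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')  # placeholder region
tableusers = dynamodb.Table('users')                            # placeholder table name

def lambda_handler(event, context):
    response = tableusers.query(
        KeyConditionExpression=Key('user_id').eq(event['user_id'])
    )
    return response['Items']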
Disclaimer: I just learned all this, including deep profiling. So all this may be nonsense.
Note: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. -- Donald Knuth" from https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html
https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.03.html
If you are seeing this issue only on the first invocation, then it's definitely due to Lambda cold starts. Otherwise, on subsequent requests there should be an improvement, which might help you diagnose the actual pain point. CloudWatch logs will also help in tracking the request.
I am assuming that you are reusing your connections, as that cuts several milliseconds off your execution time. If not, the following will help you achieve that.
Any variable outside the lambda_handler function will be frozen between Lambda invocations and possibly reused. The documentation says to "not assume that AWS Lambda always reuses the container because AWS Lambda may choose not to reuse the container", but it has been observed that, depending on the volume of executions, the container is almost always reused.

Why is Spark's show() function very slow?

I have
df.select("*").filter(df.itemid==itemid).show()
and that never terminated, however if I do
print df.select("*").filter(df.itemid==itemid)
It prints in less than a second. Why is this?
That's because select and filter are just building up the execution plan; they aren't doing anything with the data yet. When you call show, it actually executes those instructions. If it isn't terminating, I'd review the logs to see if there are any errors or connection issues. Or maybe the dataset is still too large: try taking only 5 rows to see if that comes back quickly.
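A small self-contained illustration of that laziness (modern PySpark; the data and itemid value are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['itemid', 'value'])

filtered = df.select('*').filter(df.itemid == 1)  # lazy: only builds the plan
print(filtered)          # returns instantly: just prints the schema
filtered.show()          # action: this is what actually runs the job
print(filtered.take(5))  # smaller action, handy as a quick sanity check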
This usually happens if you don't have enough available memory on the computer. Free up some memory and try again.

Python watch-dog script : load url asynchronously

I have a simple Python script which checks a few URLs:
f = urllib2.urlopen(urllib2.Request(url))
As I have the socket timeout set to 5 seconds, it sometimes gets tiresome to wait 5 sec times the number of URLs for the results.
Is there any easy, standardized way to run those URL checks asynchronously, without big overhead? The script must use standard Python components on a vanilla Ubuntu distribution (no additional installations).
Any ideas?
I wrote something called multibench a long time ago. I used it for almost the same thing you want to do here, which was to call multiple concurrent instances of wget and see how long they take to complete. It is a crude load-testing and performance-monitoring tool. You will need to adapt it somewhat, because it runs the same command n times.
Install additional software. It's a waste of time to re-invent something just because of packaging decisions made by someone else.
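For the record, a minimal stdlib-only sketch of the threaded approach (Python 2, to match the urllib2 usage in the question; the URLs are placeholders):

import threading
import urllib2

def check(url, results):
    try:
        f = urllib2.urlopen(urllib2.Request(url), timeout=5)
        results[url] = f.getcode()
    except Exception as e:
        results[url] = e

urls = ['http://example.com', 'http://example.org']  # placeholders
results = {}
threads = [threading.Thread(target=check, args=(u, results)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Total wall time is roughly one timeout, not timeout * len(urls).
print results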

Python: Increasing timeout value in EMR using yelps MRJOB

I am using Yelp's mrjob for writing some of my MapReduce programs. I am running it on EMR. My program has reducer code which takes a long time to execute, and because of the default timeout period in EMR I am getting this error:
Task attempt_201301171501_0001_r_000000_0 failed to report status for 600 seconds.Killing!
I want a way to increase the timeout in EMR. I read mrjob's official documentation about this, but I was not able to understand the procedure. Can someone suggest a way to solve this issue?
I've dealt with a similar issue with EMR in the past. The property you are looking for is mapred.task.timeout, which corresponds to the number of milliseconds before a task is terminated if it neither reads an input, writes an output, nor updates its status string.
With MRJob, you could add the following option:
--jobconf mapred.task.timeout=1800000
EDIT: It appears that some EMR AMIs do not support setting parameters like the timeout with jobconf at run time. Instead, you must use bootstrap-time configuration like this:
--bootstrap-action="s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.task.timeout=1800000"
I would still try the first one to start with and see if you can get it to work; otherwise, try the bootstrap action.
To use any of these parameters, just create your job extending MRJob; this class has a jobconf method that reads your --jobconf parameters, so you should specify them as regular options on the command line:
python job.py --num-ec2-instances 42 --python-archive t.tar.gz -r emr --jobconf mapred.task.timeout=1800000 /path/to/input.txt
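For completeness, a minimal job skeleton that the command line above could target (the character-count logic is just a placeholder):

from mrjob.job import MRJob

class MyJob(MRJob):
    def mapper(self, _, line):
        # placeholder logic: count characters per input line
        yield 'chars', len(line)

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MyJob.run()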
