I am trying to create a Lambda function that scrapes data from Wikipedia. Several scripts run in the same Lambda, with a total execution time of more than 30 minutes. The issue is that Lambda times out after 15 minutes. I got the idea to use Step Functions to re-run the Lambda, but I have no idea how to start it from where it left off the previous time.
I don't have the option to use any other AWS services.
Runtime: Python
You cannot run a Lambda for more than 900 seconds (15 minutes). That is a hard limit at this time.
As others mentioned, you could use Step Functions or use other services like EC2 or change the design of your "application".
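If you go the Step Functions route, the usual pattern is checkpointing: each invocation processes a slice of the work and returns a cursor, and a Choice state loops until everything is done. A hedged sketch, where load_titles() and scrape_page() are hypothetical stand-ins for your scraping code:

def lambda_handler(event, context):
    cursor = event.get("cursor", 0)  # where the previous run stopped
    titles = load_titles()           # hypothetical: the full list of pages to scrape
    i = cursor
    while i < len(titles):
        scrape_page(titles[i])       # hypothetical: scrape a single page
        i += 1
        # leave a safety margin before the 15-minute hard limit
        if context.get_remaining_time_in_millis() < 60_000:
            break
    # a Choice state can loop while done is False, feeding cursor back in
    return {"cursor": i, "done": i >= len(titles)}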
But maybe you should stop scraping Wikipedia and instead use Wikidata (which is basically an API for all the data in Wikipedia).
Check it out: https://www.wikidata.org
I have written a Lambda in Python that depends on external APIs which can occasionally go down. It is triggered once a day by EventBridge to gather data from yesterday, and it updates a file in S3 at the same time every day.
I was wondering how I could re-run the Lambda (which checks at the start whether the external API is functioning) every 1-2 hours for the rest of the day until it succeeds. It would need to stop at 11pm so as not to go into the next calendar day.
Specifically, I am using the Google Search Console API, which should update every 4 hours but in this case hasn't for 30.
Appreciate the help!
A CloudWatch event schedule can trigger your lambda periodically throughout the day. Using a cron expression, you can have your lambda invoked at suitable intervals.
The lambda first looks up the last_modified property on the target S3 file. If last_modified is not today, proceed and call the API. If last_modified is today, we have nothing to do, so the lambda returns without doing anything.
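A minimal sketch of that guard, assuming boto3 and placeholder bucket/key names; the schedule itself could be an expression like cron(0 9-23/2 * * ? *), which fires every 2 hours and stops at 11pm:

import datetime
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"         # placeholder
KEY = "data/yesterday.json"  # placeholder

def lambda_handler(event, context):
    last_modified = s3.head_object(Bucket=BUCKET, Key=KEY)["LastModified"]
    today = datetime.datetime.now(datetime.timezone.utc).date()
    if last_modified.date() == today:
        return {"status": "already refreshed today"}  # nothing to do
    # ... otherwise call the external API and rewrite the S3 file here ...
    return {"status": "refreshed"}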
I have one httpTrigger where I have implemented a cache. We have a requirement to update the cache every 2 hours.
Solution 1:
We can expire the cache after 2 hours, but we don't want to use this solution.
Solution 2:
We want a function (update_cache()) to be triggered every 2 hours.
I found the schedule library, but I am unable to work out how I can implement this:
import logging
import schedule

# I want to trigger this function every 2 hours
def trigger_scheduler_update():
    logging.info("Hi, I am the scheduler and got triggered...")

schedule.every(2).hours.do(trigger_scheduler_update)
But the problem I am facing is that we then have to write this kind of loop:
# ref: https://www.geeksforgeeks.org/python-schedule-library/
import time

while True:
    # Check whether a scheduled task is pending to run
    schedule.run_pending()
    time.sleep(1)
As it's an infinite loop, I can't place it in the httpTrigger. Is there a way I can implement a scheduler that runs every 2 hours?
I don't know: can it be done using threading?
I found one more library, but it looks like it also won't work.
Your function is shut down after a period of time, unless you are on a premium plan. Even then you cannot guarantee your function keeps on running.
What cache are you referring to?
Note that you cannot do threading in Azure Functions, and you shouldn't anyway. Abandon the idea of refreshing the cache from your httpTrigger function and create a separate schedule-triggered function to update the cache that your HTTP function is using, as sketched below.
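A minimal sketch of such a timer-triggered function, assuming the classic Python v1 programming model where the schedule lives in function.json, and assuming update_cache() from the question is importable by both functions:

# function.json binding (assumption), NCRONTAB syntax for every 2 hours:
# { "name": "mytimer", "type": "timerTrigger", "direction": "in",
#   "schedule": "0 0 */2 * * *" }
import logging
import azure.functions as func

def main(mytimer: func.TimerRequest) -> None:
    if mytimer.past_due:
        logging.info("The timer is past due")
    update_cache()  # the shared cache-refresh routine from the question
    logging.info("Cache refreshed by the timer trigger")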
Problem:
I have a Python Lambda that receives data every second and puts it into DynamoDB. I noticed that after the first time DynamoDB takes a little longer and the function times out, all the following calls also time out and it never recovers.
The only way to bring the Lambda back to normal is to redeploy it.
When it starts timing out, it does not display any logs. It times out without executing any of the code.
Below is a picture of our console that represents the issue.
In order to reproduce the issue faster with this function I did the following:
Redeploy it and see that it works fine.
Reduce the memory available to the Lambda to the minimum and the timeout to 1 second. This causes the first timeout.
Increase the Lambda's memory back to normal and even increase the timeout. However, the timeouts persist.
Is there a way to resolve this issue without having to redeploy?
I have seen the same issue described, but with Node.js, in this post: https://forums.aws.amazon.com/thread.jspa?threadID=234417.
I haven't seen any description of it related to the Python Lambda environment.
More information about the setup:
Lambda environments tested: python3.6 and python3.7
Tool to deploy lambda: serverless 1.57.0
serverless plugins used: serverless-python-requirements, serverless-wsgi
I am not using any VPC for the lambda
Thank you for the help,
Figured out the trigger for the bug.
When the uploaded Lambda function zip is too large, then after the first time it times out, it never recovers!
My solution was to carefully strip out the unnecessary dependencies to make the package smaller.
I created a repository with a Docker container so people can reproduce the issue more easily:
https://github.com/pedrohbtp/bug-aws-lambda-infinite-timeout
Thanks for the messages in the comments. I appreciate whoever takes the time to try to help here on SO.
I am working on a web scraping project using Python and an API.
I want the Python script to run as a job for 12 hours a day, every day, for 5 days.
I don't want to keep my system alive to run it in CMD or in Jupyter, so I was looking for a cloud service that would help me automate the process.
One way to do this would be to write a web scraper in Python, and run it on an AWS Lambda, which is essentially a serverless function with no underlying ops to manage. Depending on your use case, you could either perform some action based on the contents of that page data, or you could write the result out to S3 as a file.
To have your function execute in a recurring fashion, you can then set your AWS Lambda event trigger to be a CloudWatch event (in this case, some recurring timer at whatever frequencies/times you'd like, such as once each hour for a 12 hour window during Mon-Fri).
This is typically going to be an easier approach when compared to spinning up a virtual server (EC2 instance), and managing a persistent process that could error during waits/operation for any number of reasons.
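As a sketch, and only a sketch: the URL and bucket below are placeholders, and requests must be bundled into the deployment package. A rule such as cron(0 8-19 ? * MON-FRI *) would fire it hourly over a 12-hour weekday window.

import datetime
import boto3
import requests  # not in the Lambda runtime; package it with the function

s3 = boto3.client("s3")

def lambda_handler(event, context):
    html = requests.get("https://example.com/target-page", timeout=30).text  # placeholder URL
    key = f"scrapes/{datetime.datetime.utcnow():%Y-%m-%d-%H}.html"
    s3.put_object(Bucket="my-scrape-bucket", Key=key, Body=html)  # placeholder bucket
    return {"stored": key, "bytes": len(html)}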
I have a python aws lambda function that queries a aws dynamodb.
As my API now takes about 1 second to respond to a very simple query/table setup, I wanted to understand where I can optimize.
The table has only 3 items (users) at the moment and the following structure:
user_id (Primary Key, String),
details ([{
    "_nested_atrb1_str": "abc",
    "_nested_atrb2_str": "def",
    "_nested_map": [nested_item1, nested_item2]
}, {..}])
The query is super simple:
from boto3.dynamodb.conditions import Key

response = table.query(
    KeyConditionExpression=Key('userid').eq("xyz")
)
The query takes 0.8-0.9 seconds.
Is this a normal query time for a table with only 3 items, where each user only has max 5 attributes (incl. nested)? If yes, can I expect similar times if the structure stays the same but the number of items (users) increases a hundred-fold?
There are a few things to investigate. First off, is your timing of 0.8-0.9 seconds based on timing the query directly, by wrapping it in a time- or timeit-style timer? If the query is truly taking that long, then there is definitely something not quite right with the interaction to DynamoDB from Lambda.
If the time you're seeing is actually from the invoke of your Lambda (I assume this is through API Gateway as a REST API, since you mentioned "api"), then the time could be due to many factors. Can you profile the API call? I would check, through Postman or even browser tools, the time spent on DNS lookup, SSL setup, etc.

Additionally, CloudWatch will give you metrics on the call times of your Lambda once the request has reached it. You could also enable X-Ray, which will give you more detail on the execution of your Lambda. If your Lambda is running in a VPC, you could also be encountering cold starts that lead to the latency you're seeing.
X-Ray:
https://aws.amazon.com/xray/
Cold Starts: just Google "AWS Lambda cold starts" and you'll find all kinds of info
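To rule Lambda and API Gateway overhead in or out, you can time just the query inside the handler; a sketch (the table name is a placeholder):

import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("users")  # placeholder table name

def lambda_handler(event, context):
    start = time.perf_counter()
    response = table.query(KeyConditionExpression=Key("userid").eq("xyz"))
    print(f"query took {(time.perf_counter() - start) * 1000:.1f} ms")  # lands in CloudWatch Logs
    return response["Items"]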
For anyone with similar experiences: I received the AWS developer support response below, with some useful references. It didn't solve my problem, but I now understand that this is mainly related to the low (test) volume and Lambda startup time.
1) Is this a normal query time for a table with only 3 items where each user only has max 5 attributes(incl nested)?
The time is slow but could be due to a number of factors based on your setup. Since you are using Lambda, you need to keep in mind that every time you trigger your Lambda function it sets up your environment and then executes the code.

An AWS Lambda function runs within a container, an execution environment that is isolated from other functions. When you run a function for the first time, AWS Lambda creates a new container and begins executing the function's code. A Lambda function has a handler that is executed once per invocation. After the function executes, AWS Lambda may opt to reuse the container for subsequent invocations of the function. In this case, your function handler might be able to reuse the resources that you defined in your initialization code. (Note that you cannot control how long AWS Lambda will retain the container, or whether the container will be reused at all.) Your table is really small; I had a look at it. [1]
2) Can I expect similar times if the structure stays the same but the number of items (users) increases hundred-fold?
If the code takes longer to execute and you have more data in DynamoDB, it could eventually slow down, again based on your setup.
Some of my recommendations for optimizing your setup:
1) Run your Lambda in a VPC and query DynamoDB via a VPC endpoint. This will cut out network latencies. [2][3]
2) Increase the memory on your Lambda for faster startup and execution times.
3) As your application scales, make sure to enable auto scaling on your DynamoDB table, and increase your RCUs and WCUs to improve DynamoDB's performance when handling requests. [4]
Additionally, have a look at DynamoDB best practices. [5]
Please feel free to contact me with any additional questions and for further guidance. Thank you. Have a great day.
References
[1] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.BestPracticesWithDynamoDB.html
[2] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/vpc-endpoints-dynamodb.html
[3] https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
[4] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
[5] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
Profiling my small Lambda code (outside of Lambda), I got these results that you may find interesting.
Times are in milliseconds.
# Initially
3 calls to DB
1350 ms 1st call (read)
1074 ms 2nd call (write)
1051 ms 3rd call (read)

# After doing this outside the DB calls and providing it to each one
dynamodb = boto3.resource('dynamodb', region_name=REGION_NAME)
12 ms executing the line above
1324 ms 1st call (read)
285 ms 2nd call (write)
270 ms 3rd call (read)

# Seeing that reusing was producing savings, I did the same with
tableusers = dynamodb.Table(TABLE_USERS)
12 ms create dynamodb handler
3 ms create table handler
1078 ms read, reusing dynamodb and table
280 ms write, reusing dynamodb and table
270 ms read, reusing dynamodb (not table)
So initially it took 3.4 seconds, now ~1.6 seconds for just adding 2 lines of code.
I got these results using %lprun in Jupyter / Colab:
# The -u 0.001 sets the time unit at 1ms (default is 1 microsecond)
%lprun -u 0.001 -f lambdaquick lambdaquick()
If you only do one DB request and nothing else with the DB, try to put the two DB handles outside the lambda handler, as amittn recommends; see the sketch below.
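A sketch of that layout, reusing the names from the measurements above (REGION_NAME and TABLE_USERS stand in for the poster's constants; the values and the key are placeholders): both handles are created at module level, so warm invocations reuse them and the underlying connection instead of paying the first-call setup cost.

import boto3

REGION_NAME = "us-east-1"  # placeholder
TABLE_USERS = "users"      # placeholder

# Created once per container; warm invocations reuse both handles.
dynamodb = boto3.resource("dynamodb", region_name=REGION_NAME)
tableusers = dynamodb.Table(TABLE_USERS)

def lambda_handler(event, context):
    # on a warm start, only the read itself is paid here
    response = tableusers.get_item(Key={"user_id": event["user_id"]})
    return response.get("Item")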
Disclaimer: I just learned all this, including deep profiling. So all this may be nonsense.
Note: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. -- Donald Knuth" from https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html
https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.03.html
If you are seeing this issue only on the first invocation, then it's definitely due to Lambda cold starts. Otherwise, on subsequent requests there should be an improvement, which might help you diagnose the actual pain point. CloudWatch Logs will also help in tracking the request.
I am assuming that you are reusing your connections, as that cuts several milliseconds off your execution time. If not, the note below will help you achieve it.
Any variable outside the lambda_handler function is frozen between Lambda invocations and possibly reused. The documentation says to "not assume that AWS Lambda always reuses the container because AWS Lambda may choose not to reuse the container", but it's observed that, depending on the volume of executions, the container is almost always reused.