AWS Lambda initialization code (Python) mystery

I have a Python 3.6 Lambda function that needs to download dependencies into /tmp (I use layers as well, but /tmp is needed due to size limitations) and import them. I have the code that does the download-zip-and-extract-to-temp part before the handler with the expectation that it only needs to be downloaded on cold start. Say it looks like below (pseudocode):
log('Cold start')
download_deps()  # has some log statements of its own
log('init end')

def handler(event, context):
    ...
Most of the time it works fine. Every now and then, however, the logs stop partway through initialization (e.g. "Cold start" appears, but "init end" does not; it 'dies' somewhere in download_deps). I have exception handling in there and log everything, but nothing shows up. When the handler runs the next time, it hits an ImportError.
While trying to fix this, I noticed something peculiar. The initialization code is running twice on a single invocation of the Lambda. Given the above pseudocode, the logs look like:
Cold start
<logs from download_deps that indicate it downloaded things into /tmp>
START <RequestId> ...
<RequestId> Cold start
<logs from download_deps that indicate it skipped download because /tmp was already populated by deps>
init end
END <RequestId>
The "init end" part doesn't show up the first time, so logs somehow vanish again. Since the download is skipped the second time (/tmp is preserved), I know these are not two actual cold starts. The second "Cold start" log line includes the RequestId, but the first doesn't; it's almost as if the first initialization wasn't caused by a request, even though the timing of the request on API Gateway matches the timing of the first "Cold start". What is going on here?

I noticed the two 'Cold start's were always ~10 seconds apart. It looks like if the initialization code takes more than ~10 seconds, it is abandoned and re-run as part of handling the request. Consistent with that, the duration reported in the logs shows the time taken by the second initialization counted toward the billed duration.
To solve my problem I moved download_deps() inside the handler, making sure that it only does anything if it needs to.

Related

My Azure Function in Python v2 doesn't show any signs of running, but it probably is

I have a simple function app in Python v2. The plan is to process millions of images, but right now I just want to get the scaffolding right, i.e. no image processing, just dummy data. So I have two functions:
process, with an HTTP trigger (@app.route); this inserts 3 random image URLs into the Azure Queue Storage,
process_image, with a Queue trigger (@app.queue_trigger), which processes one image URL from above (currently it only logs the event).
I trigger the first one with a curl request and, as expected, I can see the invocation in the Azure portal in the function's invocation section, and I can see the items in the Storage Explorer's queue.
But unexpectedly, I do not see any invocations for the second function, even though after a few seconds the items disappear from the images queue and end up in the images-poison queue. So something did run with each queue item 5 times. Checking traces and exceptions in Application Insights, I see the following warning:
Message has reached MaxDequeueCount of 5. Moving message to queue 'case-images-deduplication-poison'.
Can anyone help with what's going on? Here's the gist of the code.
If I were to guess, something else is consuming that storage queue, like your dev machine or another function. Can you put logging into the second function? (Sorry, I'm a C# guy, so I don't know the code for logging.)
Have you checked the individual function metrics in the portal, under Function App >> Functions >> Function name >> Overview >> Total Execution Count, expanded to the relevant time period?
Do note that it can take up to 5 minutes for executions to show up, but after that you'll see them in the metrics.
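For the Python side of that logging suggestion: the v2 programming model uses the standard logging module inside the trigger body, and those records show up in the Application Insights traces. A stripped-down sketch of just the function body (the surrounding @app.queue_trigger decorator and the azure.functions QueueMessage wrapper are assumed and omitted here):

```python
import logging

def process_image(body):
    # 'body' stands in for the queue message payload the real trigger
    # would receive via azure.functions.QueueMessage.get_body().
    # logging.info(...) inside a Functions v2 trigger body is routed
    # to Application Insights traces.
    logging.info("process_image got: %s", body)
    return body
```

If this log line never appears in traces while messages still land in the poison queue, that supports the theory that some other consumer is dequeuing them.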

Persist Completed Pipeline in Luigi Visualiser

I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.
Further, if in the visualiser I go directly to the last job's URL, it can't find any history that it ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.
One thing to note is that I have a cron that runs this pipeline every 15 minutes to check for a file on S3. If it exists, it runs, otherwise it stops. I'm not sure if that is causing the removal of tasks from the visualiser or not. I've noticed it generates a new PID every run, but I couldn't find a way to persist one PID/day in the docs.
So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?
Appreciate all the help
I'm not 100% positive this is correct, but it's what I would try first. When you call luigi.run, pass it --scheduler-remove-delay. I'm guessing this controls how long the scheduler waits before forgetting a task after all of its dependents have completed. Looking through Luigi's source, the default is 600 seconds. For example:
luigi.run(["--workers", "8", "--scheduler-remove-delay", "86400"], main_task_cls=task_name)
If you configure the remove_delay setting in your luigi.cfg then it will keep the tasks around for longer.
[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400
Note: there is a typo in the documentation ("remove-delay" instead of "remove_delay"), which is being fixed under https://github.com/spotify/luigi/issues/2133

Call a function at end of the script in Python

So far I have been getting a lot of help and have been able to successfully put together a Python script. The script calls a Windows executable and then does some actions like pulling down files from a remote server. At the end of the script I have a function that compresses the retrieved files and moves them to another server. So far the script was working great, but now it looks like I have hit a roadblock.
The script accepts a ParentNumber as input and finds 1 or more ChildNumbers. Once the list of ChildNumbers is gathered, the script calls the Windows executable with each number in turn until it has pulled data for all of them.
As mentioned above, the function I built for archiving, moving files, and email notification is called at the end of the script. It works perfectly fine if there is only one ChildNumber. If there are many ChildNumbers, then when the executable moves on to the 2nd ChildNumber, the command line seems to treat that as the end and starts a new run, something like below:
.........
C:\Scripts\startscript.py
Input> ParentNumber
Retrieval started
Retrieval finished
**Email Sent Successfully**
Preparing ParentNumber #childNumber
C:\Scritps\ParentNumber123\childNumber2
Retrieval Started
Retrieval finished
.........
If you look at the script flow above, the "Email Sent Successfully" message shows up after the first ChildNumber only, which means the function is called well before the completion of the script.
The behavior I actually want is for archiveMoveEmailNotification to be called once, after all of the ChildNumbers are processed, but I'm not sure where it's going wrong.
My archive/move/email function is below, and it sits at the end of the script, after all the other code:
def archiveMoveEmailNotification(startTime, sender, receivers):
    """
    Function to archive, move and email
    """
    # Code for Archive
    # Code for Move to remote server
    # Code for email

archiveMoveEmailNotification(startTime, sender, receivers)
Please let me know if I am missing something about when exactly this function is executed. As mentioned, it works fine if the ParentNumber has only 1 ChildNumber, so I'm not sure if the jump to the second retrieval is causing the issue. Is there a way I can have this function wait until the rest of the functions in the script have run, or would it be more logical to move this function to another script entirely and call it from the master script?
Here is the exe call part:
def execExe(childNumb):
    cmd = "myExe retriveeAll -u \"%s\" -l \"%s\"" % (childNumb.Url(), childNumb.workDir)
    return os.system(cmd)

def retriveChildNumb(childNumb):
    # Run the retrieve
    if not (execExe(childNumb) == 0):
        DisplayResult(childNumb, 3)
    else:
        DisplayResult(childNumb, 0)
    return 0
Any inputs thoughts on this is very helpful.
Your question is verbose but hard to understand; providing the code would make this much easier to troubleshoot.
That said, my suspicion is that the code you're using to call the Windows executable is asynchronous, meaning your program continues (and finishes) without waiting for the executable to return a value.
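If the call does turn out to be asynchronous (e.g. a bare Popen with no wait()), the fix is to use a blocking call and invoke the archive/email step only after the loop over ChildNumbers has finished. A minimal sketch, with echo standing in for the real executable and all names being illustrative:

```python
import subprocess

def exec_exe(child_numb):
    # subprocess.call blocks until the command exits and returns its
    # exit code; os.system behaves the same way, but Popen without
    # wait() does not.
    cmd = ["echo", str(child_numb)]  # stand-in for the real myExe command
    return subprocess.call(cmd, stdout=subprocess.DEVNULL)

def process_all(child_numbers):
    failures = []
    for child in child_numbers:
        if exec_exe(child) != 0:
            failures.append(child)
    # Only after every child has been processed should the
    # archive/move/email function be called.
    return failures
```

process_all returns the list of children whose retrieval failed; calling the archive/move/email function right after it guarantees it runs exactly once, at the end.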

GAE Backend fails to respond to start request

This is probably a truly basic thing that I'm simply having an odd time figuring out in a Python 2.5 app.
I have a process that will take roughly an hour to complete, so I made a backend. To that end, I have a backend.yaml that has something like the following:
- name: mybackend
  options: dynamic
  start: /path/to/script.py
(The script is just raw computation. There's no notion of an active web session anywhere.)
On toy data, this works just fine.
This used to be public, so I would navigate to the page, the script would start, and it would time out after about a minute (the HTTP timeout plus the 30s shutdown grace period, I assume). I figured this was a browser issue, so I repeated the same thing with a cron job. No dice. I then switched to using a push queue and adding a targeted task, since on paper it looks like that would wait for 10 minutes. Same thing.
All 3 time out after that minute, which means I'm not decoupling the request from the backend like I believe I am.
I'm assuming that I need to write a proper handler for the backend to do the work, but I don't exactly know how to write the handler/webapp2 route. Do I handle _ah/start/ or make a new endpoint for the backend? How do I handle the subdomain? It still seems like the wrong thing to do (I'm sticking a long-running process directly into a request of sorts), but I'm at a loss otherwise.
So the root cause ended up being doing the following in the script itself:
models = MyModel.all()
for model in models:
    # Magic happens
I had taken for granted that the query would automatically batch my Query.all() over many entities, but it was dying at around the 1000th entity. I originally wrote that the backend was computation-only because I completely ignored the fact that the reads can fail.
The actual solution for solving the problem we wanted ended up being "Use the map-reduce library", since we were trying to look at each model for analysis.

RW-locking a Windows file in Python, so that at most one test instance runs per night

I have written a custom test harness in Python (existing stuff was not a good fit due to lots of custom logic). Windows task scheduler kicks it off once per hour every day. As my tests now take more than 2 hours to run and are growing, I am running into problems. Right now I just check the system time and do nothing unless hour % 3 == 0, but I do not like that. I have a text file that contains:
# This is a comment
LatestTestedBuild = 25100
# Blank lines are skipped too
LatestTestRunStartedDate = 2011_03_26_00:01:21
# This indicates that it has not finished yet.
LatestTestRunFinishDate =
Sometimes, when I kick off a test manually, it can happen at any time, including 12:59:59.99
I want to remove race conditions as much as possible. I would rather put in some extra effort once than worry about the practical probability of something happening. So I think atomically locking this text file is the best approach.
I am using Python 2.7, Windows Server 2008R2 Pro and Windows 7 Pro. I prefer not to install extra libraries (Python has not been "sold" to my co-workers yet, but I could copy over a file locally that implements it all, granted that the license permits it).
So, please suggest a good, bullet-proof way to solve this.
When you start running a test make a file called __LOCK__ or something. Delete it when you finish, using a try...finally block to ensure that it always gets cleared up. Don't run the test if the file exists. If the computer crashes or similar, delete the file by hand. I doubt you need more cleverness than that.
Are you sure you need 2 hours of tests?! I think 2 minutes is a more reasonable amount of time to spend, though I guess if you are running some complicated numerics you might need more.
example code:
import os

if os.path.exists("__LOCK__"):
    raise RuntimeError("Already running.")  # or whatever
try:
    open("__LOCK__", "w").write("Put some info here if you want.")
    # ... run the tests here ...
finally:
    if os.path.exists("__LOCK__"):
        os.unlink("__LOCK__")
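Note that a check-then-create still has a small window where two processes can both pass the os.path.exists test before either creates the file. If you want the creation itself to be atomic with no extra libraries, os.open with O_CREAT | O_EXCL does it on both Windows and POSIX. A sketch (the lock-file name is illustrative):

```python
import os

LOCK_PATH = "__LOCK__"  # illustrative lock-file name

def acquire_lock():
    # O_CREAT | O_EXCL makes creation atomic: the call raises OSError
    # if the file already exists, so two processes cannot both win
    # the check-then-create race.
    try:
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError:
        return False
    os.write(fd, ("pid %d" % os.getpid()).encode())
    os.close(fd)
    return True

def release_lock():
    if os.path.exists(LOCK_PATH):
        os.unlink(LOCK_PATH)
```

acquire_lock() returns False instead of raising when another run already holds the lock; wrap the test run in try...finally with release_lock(), as in the example above.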
