Persist Completed Pipeline in Luigi Visualiser - python

I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.
Further, if in the visualiser I go directly to the last job's URL, it can't find any history that it ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.
One thing to note is that I have a cron that runs this pipeline every 15 minutes to check for a file on S3. If it exists, it runs, otherwise it stops. I'm not sure if that is causing the removal of tasks from the visualiser or not. I've noticed it generates a new PID every run, but I couldn't find a way to persist one PID/day in the docs.
So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?
Appreciate all the help

I'm not 100% positive if this is correct, but this is what I would try first. When you call luigi.run, pass it --scheduler-remove-delay. I'm guessing this is how long the scheduler waits before forgetting a task after all of its dependents have completed. If you look through luigi's source, the default is 600 seconds. For example:
luigi.run(["--workers", "8", "--scheduler-remove-delay","86400")], main_task_cls=task_name)

If you configure the remove_delay setting in your luigi.cfg then it will keep the tasks around for longer.
[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400
Note, there is a typo in the documentation ("remove-delay" instead of remove_delay") which is being fixed under https://github.com/spotify/luigi/issues/2133

Related

Python/Django Prevent a script from being run twice

I got a big script that retrieves a lot of data via an API (Magento Invoices). This script is launched by a cron every 20 minutes. But sometimes, we need to refresh manually for getting the last invoiced orders. I have a dedicated page for this.
I would like to prevent from manual launching of the script by testing if it's already running because both API and script take a lot of ressources and time.
I tried to add a "process" model with is_active = True/False that would be tested and avoid re-launching if script is already active. At the start of the script, I switch the process status to TRUE and set it to FALSE when the script has finished.
But it seems that the 2nd instance of the script waits for the first to be finished before starting. At the end, both scripts are run because process.is_active always = False
I also tried with request.session variable, but same issue.
Spent lotsa time on this but didn't find a way to achieve my goal.
Has anyone already faced such a case ?

AWS Lambda initialization code (Python) mystery

I have a Python 3.6 Lambda function that needs to download dependencies into /tmp (I use layers as well, but /tmp is needed due to size limitations) and import them. I have the code that does the download-zip-and-extract-to-temp part before the handler with the expectation that it only needs to be downloaded on cold start. Say it looks like below (pseudocode):
log('Cold start')
download_deps() # has some log statements of its own
log('init end')
def handler(event, context):
...
Most of the time it works fine. However, every now and then the logs stop showing up somewhere during initialization. (For e.g. it says "Cold start", but not "init end"; it 'dies' somewhere in download_deps). I have exception handling in there and log everything, but nothing shows up. When the handler runs next time it runs into ImportError.
While trying to fix this, I noticed something peculiar. The initialization code is running twice on a single invocation of the Lambda. Given the above pseudocode, the logs look like:
Cold start
<logs from download_deps that indicate it downloaded things into /tmp>
START <RequestId> ...
<RequestId> Cold start
<logs from download_deps that indicate it skipped download because /tmp was already populated by deps>
init end
END <RequestId>
The "init end" part doesn't show up the first time, so logs somehow vanish again. Since it skips download the second time (/tmp is preserved), I know it's not 2 actual cold starts happening. The second time it logs 'Cold start' it includes the RequestId, but not the first time; almost as if the first initialization wasn't caused by a request, even though timing of the request on API gateway matches the timing of the first "Cold start". What is going on here?
I noticed the two 'Cold start's were always ~10 seconds apart. It looks like if the initialization code takes more than ~10 seconds, it is restarted. Also, based on the duration reported in the logs, the time taken by the second initialization is included in the billed duration.
To solve my problem I moved download_deps() inside the handler, making sure that it only does anything if it needs to.

Recommendation on how to write a good python wrapper LSF

I am creating a python wrapper script and was wondering what'd be a good way to create it.
I want to run code serially. For example:
Step 1.
Run same program (in parallel - the parallelization is easy because I work with an LSF system so I just submit three different jobs).
I run the program in parallel, and each run takes one fin.txt and outputs one fout.txt, i.e., when they all run they would produce 3 output files from the three input files, f1in.txt, f2in.txt, f3in.txt, f1out.txt, f2out.txt, f3out.txt.
(in the LSF system) When each run of the program is completed successfully, it produces a log file output, f1log.out, f2log.out, f3log.out.
The log files output are of this form, i.e., f1log.out would look something like this if it runs successfully.
------------------------------------------------------------
# LSBATCH: User input
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 86.20 sec.
Max Memory : 103 MB
Max Swap : 881 MB
Max Processes : 4
Max Threads : 5
The output (if any) is above this job summary.
Thus, I'd like my wrapper to check (every 5 min or so) for each run (1,2,3) if the log file has been created, and if it has been created I'd like the wrapper to check if it was successfully completed (aka, if the string Successfully completed appears in the log file).
Also if the one of the runs finished and produces a log file that was not successfully completed I'd like my wrapper to end and report that run (k=1,2,3) was not completed.
After that,
Step2. If all three runs are successfully completed I would run another program that takes those three files as input... else I'd print an error.
Basically in my question I am looking for two things:
Does it sound like a good way to write a wrapper?
How in python I can check the existence of a file, and search for a pattern every certain time in a good way?
Note. I am aware that LSF has job dependencies but I find this way to be more clear and easy to work with, though may not be optimal.
I'm a user of an LSF system, and my major gripes are exit handling, and cleanup. I think a neat idea would be to send a batch job array that has for instance: Initialization Task, Legwork Task, Cleanup Task. The LSF could complete all three and send a return code to the waiting head node. Alot of times LSF works great to send one job or command, but it isn't really set up to handle systematic processing.
Other than that I wish you luck :)

GAE Backend fails to respond to start request

This is probably a truly basic thing that I'm simply having an odd time figuring out in a Python 2.5 app.
I have a process that will take roughly an hour to complete, so I made a backend. To that end, I have a backend.yaml that has something like the following:
-name: mybackend
options: dynamic
start: /path/to/script.py
(The script is just raw computation. There's no notion of an active web session anywhere.)
On toy data, this works just fine.
This used to be public, so I would navigate to the page, the script would start, and time out after about a minute (HTTP + 30s shutdown grace period I assume, ). I figured this was a browser issue. So I repeat the same thing with a cron job. No dice. Switch to a using a push queue and adding a targeted task, since on paper it looks like it would wait for 10 minutes. Same thing.
All 3 time out after that minute, which means I'm not decoupling the request from the backend like I believe I am.
I'm assuming that I need to write a proper Handler for the backend to do work, but I don't exactly know how to write the Handler/webapp2Route. Do I handle _ah/start/ or make a new endpoint for the backend? How do I handle the subdomain? It still seems like the wrong thing to do (I'm sticking a long-process directly into a request of sorts), but I'm at a loss otherwise.
So the root cause ended up being doing the following in the script itself:
models = MyModel.all()
for model in models:
# Magic happens
I was basically taking for granted that the query would automatically batch my Query.all() over many entities, but it was dying at the 1000th entry or so. I originally wrote it was computational only because I completely ignored the fact that the reads can fail.
The actual solution for solving the problem we wanted ended up being "Use the map-reduce library", since we were trying to look at each model for analysis.

RW-locking a Windows file in Python, so that at most one test instance runs per night

I have written a custom test harness in Python (existing stuff was not a good fit due to lots of custom logic). Windows task scheduler kicks it off once per hour every day. As my tests now take more than 2 hours to run and are growing, I am running into problems. Right now I just check the system time and do nothing unless hour % 3 == 0, but I do not like that. I have a text file that contains:
# This is a comment
LatestTestedBuild = 25100
# Blank lines are skipped too
LatestTestRunStartedDate = 2011_03_26_00:01:21
# This indicates that it has not finished yet.
LatestTestRunFinishDate =
Sometimes, when I kick off a test manually, it can happen at any time, including 12:59:59.99
I want to remove race conditions as much as possible. I would rather put some extra effort once and not worry about practical probability of something happening. So, I think locking a this text file atomically is the best approach.
I am using Python 2.7, Windows Server 2008R2 Pro and Windows 7 Pro. I prefer not to install extra libraries (Python has not been "sold" to my co-workers yet, but I could copy over a file locally that implements it all, granted that the license permits it).
So, please suggest a good, bullet-proof way to solve this.
When you start running a test make a file called __LOCK__ or something. Delete it when you finish, using a try...finally block to ensure that it always gets cleared up. Don't run the test if the file exists. If the computer crashes or similar, delete the file by hand. I doubt you need more cleverness than that.
Are you sure you need 2 hours of tests?! I think 2 minutes is a more reasonable amount of time to spend, though I guess if you are running some complicated numerics you might need more.
example code:
import os
if os.path.exists("__LOCK__"):
raise RuntimeError("Already running.") # or whatever
try:
open("__LOCK__", "w").write("Put some info here if you want.")
finally:
if os.path.exists("__LOCK__"):
os.unlink("__LOCK__")

Categories