Task does not run Luigi - python

I wrote a trivial piece of code to run tasks in Luigi. The code is below:
import luigi
count = 0
class TaskC(luigi.Task):
    def requires(self):
        return None
    def run(self):
        print("Running task C ...")
        global count
        with self.output().open('w') as outfile:
            outfile.write("Finished task C, count = %d", count)
        count += 1
    def output(self):
        return luigi.LocalTarget("./logs/task_c.txt")
class TaskB(luigi.Task):
    def requires(self):
        return None
    def run(self):
        print("Running task B ...")
        global count
        with self.output().open('w') as outfile:
            outfile.write("Finished task B, count = %d ...", count)
        count += 1
    def output(self):
        return luigi.LocalTarget("./logs/task_b.txt")
class TaskA(luigi.Task):
    def requires(self):
        return [TaskB(), TaskC()]
    def run(self):
        print("Running task A ...")
        global count
        with self.output().open('w') as outfile:
            outfile.write("Finished task A, count = %d ...", count)
        count += 1
    def output(self):
        return luigi.LocalTarget("./logs/task_a.txt")
if __name__ == '__main__':
    print("Start the fisrt luigi app :)")
    luigi.run()
Expected: I want to run TaskA. Because TaskA requires TaskB and TaskC, both of them should run first, and TaskA should run only after both have finished.
Actual: Only TaskA runs. The other tasks don't. The console log:
Start the fisrt luigi app :)
DEBUG: Checking if TaskA() is complete
INFO: Informed scheduler that task TaskA__99914b932b has status DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=382715991, workers=1, host=w10tng, username=tng, pid=2096) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 1 tasks of which:
* 1 complete ones were encountered:
- 1 TaskA()
Did not run any tasks
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
The command that I used to run it:
python first_luigi_app.py --local-scheduler TaskA
I don't know if I'm missing something! I would appreciate it if someone could help :)

You can try removing the requires methods from TaskB and TaskC; because they currently return None, the tasks are skipped.
Also, once I switched the write calls to f-string formatting, it worked OK.
Run with: python -m luigi --module l1 TaskA --local-scheduler, where l1 is l1.py (a copy of your code).
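The f-string remark above can be illustrated without Luigi. This is a minimal sketch that uses an in-memory io.StringIO as a stand-in for the file object returned by open('w'); it assumes the Luigi target's file object behaves like an ordinary text file here, whose write() accepts exactly one string.

```python
import io

count = 3
buffer = io.StringIO()  # in-memory stand-in for the file returned by open('w')

# Passing the value as a second argument, as in the question, is rejected,
# because write() takes a single string and does no formatting itself.
error_message = None
try:
    buffer.write("Finished task C, count = %d", count)
except TypeError as exc:
    error_message = str(exc)

# Formatting first produces one string that write() accepts.
buffer.write(f"Finished task C, count = {count}\n")
buffer.write("Finished task C, count = %d\n" % count)

written = buffer.getvalue()
print(error_message)
print(written)
```

Either the f-string or the % operator works; the point is only that the formatting has to happen before the call to write().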

Related

Luigi: Task is never invoked

I have the following setup
class TaskA(luigi.Task):
    def requires(self):
        yield TaskB()
        if not get_results_from_task_B_written_on_S3():
            print('Did not find any results and will exit')
            return
        else:
            print('Found results and will proceed')
            yield TaskC()
            results = get_results_from_task_C_written_on_S3()
            # do other stuff
class TaskB(luigi.Task):
    def run(self):
        # process and write results to s3
        pass
    def output(self):
        return URITarget('b_path')
class TaskC(luigi.Task):
    def run(self):
        # process and write results to s3
        pass
    def output(self):
        return URITarget('c_path')
The Luigi logs show the following:
Did not find any results and will exit
Found results and will proceed
To me it seems like the control flow enters both the if and the else branch. Since this is impossible in principle, I suspect that Luigi attempts to run the pipeline twice. The first time it produces this
Did not find any results and will exit
Since it cannot find any results written on s3 from TaskB.
Then TaskB actually finishes its execution. Writes its results on s3. TaskA reruns. Finds the results from TaskB on s3 and produces
Found results and will proceed
But then it seems like the yield of TaskC is not working. It's just stuck there indefinitely.
This is just my assumption of Luigi's behavior. Please let me know if I'm wrong about this.
I need this modularisation of tasks B and C into separate tasks since it makes testing much easier. TaskC is a fairly complex task whose test setup would be much more involved than testing its constituents separately.
Part of the problem is that requires() can get called multiple times during scheduling. The first time your TaskA.requires() gets called, it yields TaskB and returns. The next time TaskA.requires() is called, it yields TaskB again, and if TaskB's results exist by then, it falls through to the else block. That first call to TaskA.requires() is the only one that gets used for the actual scheduling dependencies.
I wrote a test program just to test this out, and you can see in my output how many times TaskB.output() is called.
import luigi
taskC_complete = False
taskB_complete = False
def get_results_from_task_C_written_on_S3():
    return taskC_complete
def get_results_from_task_B_written_on_S3():
    return taskB_complete
def set_taskB_complete():
    # global is needed; otherwise the assignment only creates a local variable
    global taskB_complete
    taskB_complete = True
def set_taskC_complete():
    global taskC_complete
    taskC_complete = True
class TaskA(luigi.Task):
    def requires(self):
        yield TaskB()
        if not get_results_from_task_B_written_on_S3():
            print('Did not find any results and will exit')
            return
        else:
            print('Found results and will proceed')
            yield TaskC()
            results = get_results_from_task_C_written_on_S3()
class TaskB(luigi.Task):
    def run(self):
        print("Task B")
    def output(self):
        # print() returns None, so there is no real target; this only traces the calls
        return print('b_path')
class TaskC(luigi.Task):
    def run(self):
        print("Task C")
    def output(self):
        return print('c_path')
if __name__ == '__main__':
    luigi_run_results = luigi.build([TaskA()], workers=1,
                                    local_scheduler=True, detailed_summary=True, log_level='INFO')
This code outputs
Did not find any results and will exit
b_path
Task B
Did not find any results and will exit
b_path
Did not find any results and will exit
b_path
Did not find any results and will exit
b_path
Although the code is not a perfect replica of what you are attempting, here's the output from the scheduler which shows what will actually run:
INFO: Informed scheduler that task TaskA__99914b932b has status PENDING
INFO: Informed scheduler that task TaskB__99914b932b has status PENDING
I'm not sure exactly what you're trying to achieve, but read up on the Luigi documentation on task dependencies. You're better off yielding the other tasks from TaskA's run() function.
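The repeated-call behaviour described above is ordinary generator behaviour and can be reproduced without Luigi. In this minimal sketch, results_on_s3 and the string task names are hypothetical stand-ins for the S3 check and the task objects; the requires() function only mirrors the shape of TaskA.requires().

```python
results_on_s3 = {"b": False}  # hypothetical stand-in for the S3 lookup

def requires():
    """Generator with the same shape as TaskA.requires() in the question."""
    yield "TaskB"
    if not results_on_s3["b"]:
        return
    yield "TaskC"

# First call: nothing has been written yet, so only TaskB is reported.
first_call = list(requires())

# Something (here, by hand) produces TaskB's results between the two calls.
results_on_s3["b"] = True

# Second call: the body runs again from the top and now reaches the second yield.
second_call = list(requires())

print(first_call)   # ['TaskB']
print(second_call)  # ['TaskB', 'TaskC']
```

Each call starts a fresh generator, so a requires() whose result depends on external state can report different dependencies on different calls. That matches both messages appearing in the logs.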

python schedule library to stop previously running thread when new scheduled thread starts

I have the same thread function running every 10 minutes. When a new thread starts, I want to stop the previous one so that old threads don't keep accumulating. How can I achieve that? I'm using the Python schedule library to schedule the thread.
This is how I'm scheduling it right now:
schedule.every(10).minutes.do(sts,threadFunc)
There are two aspects to this question:
1. Identify the currently running job, which is fairly easy.
2. Kill a running thread in Python. There's no great solution for this, and the following code implements the 'stop flag' approach.
I'm solving the first challenge by using a global variable. This variable, named running_thread, holds the currently running thread so that a new job can stop it if needed.
The second challenge requires the running thread to constantly check the status of some flag ('the stop flag'). If the stop flag is set on that thread, it exits immediately.
Here's a code skeleton that demonstrates both these ideas. Jobs take a random amount of time, and I've scheduled them to start every 1 second.
import threading
import time
import schedule
import random
running_thread = None
class StoppableThread(threading.Thread):
    """Thread class with a stop() method. The thread itself has to check
    regularly for the stopped() condition."""
    def __init__(self, *args, **kwargs):
        super(StoppableThread, self).__init__(*args, **kwargs)
        self._stop_event = threading.Event()
    def stop(self):
        self._stop_event.set()
    def stopped(self):
        return self._stop_event.is_set()
def job():
    current_thread = threading.current_thread()
    sleep_time = random.random() * 5
    print(f"Starting job, about to sleep {sleep_time} seconds, thread id is {current_thread.ident}")
    counter = 0
    while counter < sleep_time:
        time.sleep(0.1)
        counter += 0.1
        # poll the stop flag between short sleeps
        if current_thread.stopped():
            print("Stopping job")
            break
    print(f"job with thread id {current_thread.ident} done")
def threadFunc():
    global running_thread
    if running_thread:
        print("Trying to stop thread")
        running_thread.stop()
    print("Starting thread")
    running_thread = StoppableThread(target=job)
    running_thread.start()
schedule.every(1).seconds.do(threadFunc)
while True:
    schedule.run_pending()
    time.sleep(.5)
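As a variation on the stop-flag skeleton above, the polling loop can be replaced by Event.wait(timeout), which returns as soon as the flag is set. This is only a sketch of that idea using the standard library; the wait_or_stop method and the outcomes list are names introduced here for illustration.

```python
import threading
import time

class StoppableThread(threading.Thread):
    """Thread with a stop flag that the target can wait on."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def wait_or_stop(self, timeout):
        """Block for up to `timeout` seconds; return True if stop was requested."""
        return self._stop_event.wait(timeout)

outcomes = []

def job():
    me = threading.current_thread()
    # Instead of sleeping in 0.1 s slices, wait on the flag itself.
    if me.wait_or_stop(5.0):
        outcomes.append("stopped early")
    else:
        outcomes.append("ran to completion")

worker = StoppableThread(target=job)
started = time.monotonic()
worker.start()
worker.stop()   # request a stop right away
worker.join()
elapsed = time.monotonic() - started
print(outcomes, f"{elapsed:.2f}s")
```

The job still has to cooperate by checking the flag; the difference is only that it wakes up immediately when stop() is called rather than at the next polling interval.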

How to configure Luigi task retry correctly?

I am trying to configure Luigi's retry mechanism so that failed tasks will be retried a few times. However, while the task is retried successfully, Luigi exits unsuccessfully:
===== Luigi Execution Summary =====
Scheduled 3 tasks of which:
* 2 ran successfully:
- 1 FailOnceThenSucceed(path=/tmp/job-id-18.subtask)
- 1 MasterTask(path=/tmp/job-id-18)
* 1 failed:
- 1 FailOnceThenSucceed(path=/tmp/job-id-18.subtask)
This progress looks :( because there were failed tasks
So the question is: how do I configure Luigi (I have installed version 2.3.3 with pip install) so that when a task fails once but then succeeds on retry, Luigi exits successfully with This progress looks :) instead of failing with This progress looks :(?
Here is a minimal scheduler and worker config I've come up with, as well as tasks to demonstrate the behavior:
[scheduler]
retry_count = 3
retry-delay = 1
[worker]
keep_alive=true
mytasks.py:
import luigi
class FailOnceThenSucceed(luigi.Task):
    path = luigi.Parameter()
    def output(self):
        return luigi.LocalTarget(self.path)
    def run(self):
        failmarker = luigi.LocalTarget(self.path + ".hasfailedonce")
        if failmarker.exists():
            with self.output().open('w') as target:
                target.write('OK')
        else:
            with failmarker.open('w') as marker:
                marker.write('Failed')
            raise RuntimeError("Failed once")
class MasterTask(luigi.Task):
    path = luigi.Parameter()
    def requires(self):
        return FailOnceThenSucceed(path=self.path + '.subtask')
    def output(self):
        return luigi.LocalTarget(self.path)
    def run(self):
        with self.output().open('w') as target:
            target.write('OK')
Example execution:
PYTHONPATH=. luigi --module mytasks MasterTask --workers=2 --path='/tmp/job-id-18'
This is an old Luigi issue: tasks that failed and then succeeded on retry were still reported as failed in the execution summary:
https://github.com/spotify/luigi/issues/1932
It was fixed in version 2.7.2:
https://github.com/spotify/luigi/releases/tag/2.7.2
I suggest you upgrade to the latest Luigi version, e.g. by running pip install -U luigi.
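To see whether an environment already has a release at or above the 2.7.2 named above, something like the following sketch could be used. The helper names version_tuple and has_retry_summary_fix are made up for this example, and the comparison only looks at the first three numeric components of the version string.

```python
import re
from importlib import metadata

FIXED_IN = (2, 7, 2)  # release named in the answer above

def version_tuple(version):
    """Turn '2.3.3' or '3.0.0b2' into a tuple of its first three integers."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])

def has_retry_summary_fix(version):
    """True if the version is at or above the release named in the answer."""
    return version_tuple(version) >= FIXED_IN

try:
    installed = metadata.version("luigi")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("luigi is not installed in this environment")
else:
    print(installed, has_retry_summary_fix(installed))
```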

Luigi Pipeline beginning in S3

My initial files are in AWS S3. Could someone point me to how I need to set this up in a Luigi Task?
I reviewed the documentation and found luigi.S3, but it is not clear to me what to do with it. I then searched the web and only found links to mortar-luigi and implementations on top of Luigi.
UPDATE
After following the example provided by @matagus (I created the ~/.boto file as suggested too):
# coding: utf-8
import luigi
from luigi.s3 import S3Target, S3Client
class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/19170205.txt')
class ProcessS3File(luigi.Task):
    def requieres(self):
        return MyS3File()
    def output(self):
        return luigi.LocalTarget('/tmp/resultado.txt')
    def run(self):
        result = None
        for input in self.input():
            print("Doing something ...")
            with input.open('r') as f:
                for line in f:
                    result = 'This is a line'
        if result:
            out_file = self.output().open('w')
            out_file.write(result)
When I execute it nothing happens
DEBUG: Checking if ProcessS3File() is complete
INFO: Informed scheduler that task ProcessS3File() has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) running ProcessS3File()
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) done ProcessS3File()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task ProcessS3File() has status DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) was stopped. Shutting down Keep-Alive thread
As you can see, the message Doing something... never prints. What is wrong?
The key here is to define an ExternalTask that has no inputs and whose outputs are the files you already have in S3. The Luigi docs mention this in Requiring another Task:
Note that requires() can not return a Target object. If you have a simple Target object that is created externally you can wrap it in a Task class
So, basically you end up with something like this:
import luigi
from luigi.s3 import S3Target
from somewhere import do_something_with
class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/path/to/file')
class ProcessS3File(luigi.Task):
    def requires(self):
        return MyS3File()
    def output(self):
        return S3Target('s3://my-bucket/path/to/output-file')
    def run(self):
        result = None
        # this will return a file stream that reads the file from your aws s3 bucket
        with self.input().open('r') as f:
            result = do_something_with(f)
        # and then you write the result to the output target; the with block
        # closes the file once the write is done
        with self.output().open('w') as out_file:
            # it'd be better to serialize this result before writing it to a file, but this is a pretty simple example
            out_file.write(result)
UPDATE:
Luigi uses boto to read files from and/or write them to AWS S3, so in order to make this code work, you'll need to provide your credentials in your boto config file ~/.boto (look for other possible config file locations here):
[Credentials]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>
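One detail in the question's updated code may explain why "Doing something ..." never prints: the method is spelled requieres rather than requires. A misspelled method does not override anything, so the base class default stays in effect. The sketch below shows that effect with a hypothetical BaseTask class (not Luigi itself), assuming the default requires() declares no dependencies.

```python
class BaseTask:
    """Hypothetical stand-in for a framework base class (not Luigi itself)."""

    def requires(self):
        return []  # default: no dependencies

    def deps(self):
        # The framework only ever calls requires(), under that exact name.
        return list(self.requires())

class Misspelled(BaseTask):
    def requieres(self):  # typo: this defines a new method nobody calls
        return ["MyS3File"]

class Correct(BaseTask):
    def requires(self):  # proper override
        return ["MyS3File"]

misspelled_deps = Misspelled().deps()
correct_deps = Correct().deps()
print(misspelled_deps)  # []
print(correct_deps)     # ['MyS3File']
```

With no dependencies declared, a loop over the task's inputs has nothing to iterate over, which is consistent with the log showing the task finishing without any output from the loop body.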

Where did the Luigi task go?

First time into the realm of Luigi (and Python!) and have some questions. Relevant code is:
from Database import Database
import luigi
class bbSanityCheck(luigi.Task):
    conn = luigi.Parameter()
    date = luigi.Parameter()
    def __init__(self, *args, **kwargs):
        super(bbSanityCheck, self).__init__(*args, **kwargs)
        self.has_run = False
    def run(self):
        print "Entering run of bb sanity check"
        # DB STUFF HERE THAT DOESN"T MATTER
        print "Are we in la-la land?"
    def complete(self):
        print "BB Sanity check being asked for completeness: " , self.has_run
        return self.has_run
class Pipeline(luigi.Task):
    date = luigi.DateParameter()
    def requires(self):
        db = Database('cbs')
        self.conn = db.connect()
        print "I'm about to yield!"
        return bbSanityCheck(conn = self.conn, date = self.date)
    def run(self):
        print "Hello World"
        self.conn.query("""SELECT *
                           FROM log_blackbook""")
        result = conn.store_result()
        print result.fetch_row()
    def complete(self):
        return False
if __name__=='__main__':
    luigi.run()
Output is here (with relevant DB returns removed 'cause):
DEBUG: Checking if Pipeline(date=2013-03-03) is complete
I'm about to yield!
INFO: Scheduled Pipeline(date=2013-03-03)
I'm about to yield!
DEBUG: Checking if bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03) is complete
BB Sanity check being asked for completeness: False
INFO: Scheduled bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 5150] Running bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
Entering run of bb sanity check
Are we in la-la land?
INFO: [pid 5150] Done bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: There are 1 pending tasks possibly being run by other workers
INFO: Worker was stopped. Shutting down Keep-Alive thread
So the questions:
1.) Why does "I'm about to yield" get printed twice?
2.) Why is "hello world" never printed?
3.) What is the "1 pending tasks possibly run by other workers"?
I prefer super-ultra clean output because it is way easier to maintain. I'm hoping I can get these warning equivalents ironed out.
I've also noted that requires can either "yield" or "return item, item2, item3". I've read about yield and understand it. What I don't get is which convention is considered superior here, or whether there are subtle differences that I, being new to the language, am not getting.
I think you're misunderstanding how Luigi works in general.
(1) Hmm, not sure about that. To me it looks more like an issue with printing the same thing in both INFO and DEBUG.
(2)
You're trying to run Pipeline, which depends on bbSanityCheck. bbSanityCheck.complete() never returns True because you never set has_run to True in bbSanityCheck. So the Pipeline task can never run and print Hello World, because its dependencies are never complete.
(3) That's probably because you have this pending task (it's actually Pipeline). Luigi understands that it is impossible for it to run and shuts down.
I would personally not use has_run as a way to check whether a task has run, but instead check for the existence of the result of this job. I.e., if this job does something to the database, then complete() should check that the expected contents are there.
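The suggestion to base complete() on the job's result rather than on an in-memory flag can be sketched in plain Python. MarkerCheckedJob is a hypothetical class that only mimics the run()/complete() shape; it is not a Luigi task, and a marker file stands in for "the expected contents" in the database.

```python
import os
import tempfile

class MarkerCheckedJob:
    """Hypothetical plain-Python class mimicking the run()/complete() shape."""

    def __init__(self, marker_path):
        self.marker_path = marker_path

    def run(self):
        # The real work would go here; the marker stands in for its result.
        with open(self.marker_path, "w") as marker:
            marker.write("done\n")

    def complete(self):
        # Completion is derived from the result, not from instance state.
        return os.path.exists(self.marker_path)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "bb_sanity_check.done")
    before = MarkerCheckedJob(path).complete()
    MarkerCheckedJob(path).run()
    # A fresh instance still sees the job as complete.
    after = MarkerCheckedJob(path).complete()

print(before, after)  # False True
```

Because completeness is read from something outside the object, it survives new instances and new processes, which an attribute such as has_run does not.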
