My initial files are in AWS S3. Could someone point me to how I need to set this up in a Luigi Task?
I reviewed the documentation and found luigi.S3, but it is not clear to me what to do with it. I then searched the web and only found links about mortar-luigi and implementations on top of luigi.
UPDATE
After following the example provided by @matagus (I created the ~/.boto file as suggested, too):
# coding: utf-8
import luigi
from luigi.s3 import S3Target, S3Client

class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/19170205.txt')

class ProcessS3File(luigi.Task):
    def requieres(self):
        return MyS3File()

    def output(self):
        return luigi.LocalTarget('/tmp/resultado.txt')

    def run(self):
        result = None
        for input in self.input():
            print("Doing something ...")
            with input.open('r') as f:
                for line in f:
                    result = 'This is a line'

        if result:
            out_file = self.output().open('w')
            out_file.write(result)
When I execute it, nothing happens:
DEBUG: Checking if ProcessS3File() is complete
INFO: Informed scheduler that task ProcessS3File() has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) running ProcessS3File()
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) done ProcessS3File()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task ProcessS3File() has status DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) was stopped. Shutting down Keep-Alive thread
As you can see, the message Doing something... never prints. What is wrong?
The key here is to define an External Task that has no inputs and whose outputs are the files you already have living in S3. The Luigi docs mention this in Requiring another Task:
Note that requires() can not return a Target object. If you have a simple Target object that is created externally you can wrap it in a Task class
So, basically you end up with something like this:
import luigi
from luigi.s3 import S3Target

from somewhere import do_something_with

class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/path/to/file')

class ProcessS3File(luigi.Task):
    def requires(self):
        return MyS3File()

    def output(self):
        return S3Target('s3://my-bucket/path/to/output-file')

    def run(self):
        # this will return a file stream that reads the file from your AWS S3 bucket
        with self.input().open('r') as f:
            result = do_something_with(f)

        # and then you write the result out
        # (it'd be better to serialize this result before writing it to a file,
        # but this is a pretty simple example)
        with self.output().open('w') as out_file:
            out_file.write(result)
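With that in place, you can run it like any other Luigi task; a hypothetical invocation, assuming you save the module as process_s3.py:

python -m luigi --module process_s3 ProcessS3File --local-scheduler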
UPDATE:
Luigi uses boto to read files from and/or write them to AWS S3, so in order to make this code work you'll need to provide your credentials in your boto config file ~/.boto (look for other possible config file locations here):
[Credentials]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>
Related
I am trying to download a file over CAN bus using python-can. It involves sending data very quickly (on the order of 2-3 messages per millisecond), and I am trying to log these messages to file without impacting the sending speed. The file I/O slows down the sending due to the logging overhead. I tried various methods to improve this (including using queues and reading the queue from another thread, but that was not much better, possibly due to the GIL). Most of these tests started with the Python logging module and tried various handlers (QueueHandler/QueueListener, MemoryHandler, etc.).
I've managed to make some significant improvements by moving the file I/O into a separate process. I initially ran into an issue with the overhead of sending data from one process to another, so I now buffer it. Now, instead of taking 150% longer with direct file I/O in the main process, I see a ~20% increase in time.
I thought that, since this is running in another process, I could also print() the data to the console (which I know is relatively expensive), but I see a huge increase in the file download time.
Why does the print() affect the main process even though it is running in a child process?
Code below:
file_logger_mp() is called from the main process and starts the child process that does the logging. The main process then uses the log_hdl function to add a message to the buffer. When the buffer reaches a certain size (100), it is sent to the child process for logging to file or printing to the console.
Device: rpi4. And the main process uses asyncio, in case that affects it.
import atexit
import multiprocessing

def file_logger_mp(logger_name: str, log_file_pth: str):
    conn_rec, conn_send = multiprocessing.Pipe()
    log_hdl_c = MyLogger(conn_send)
    log_hdl = log_hdl_c.log_hdl  # This is used by main code to provide log messages to the child process
    listener = MyProcess(conn_rec, log_file_pth)
    atexit.register(log_hdl_c.final_flush, listener)
    listener.start()  # Start the child process
    return log_hdl, listener

class MyLogger():
    def __init__(self, conn_send) -> None:
        self.buffer = []
        self.conn_send = conn_send

    def log_hdl(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) > 100:
            self.conn_send.send(self.buffer)
            self.buffer.clear()

    def final_flush(self, listener):
        self.conn_send.send(self.buffer)
        listener.terminate()

class MyProcess(multiprocessing.Process):
    def __init__(self, queue, f_hdl):
        multiprocessing.Process.__init__(self)
        self.exit = multiprocessing.Event()
        self.queue = queue
        self.f_hdl = f_hdl

    def run(self):
        f = open(self.f_hdl, "w+")
        while not self.exit.is_set():
            try:
                record = self.queue.recv()
                for msg in record:
                    output = str(msg)
                    f.write(output + '\n')
                    print(output)  # This print() causes large delays to the main process?!
                record.clear()
            except Exception:
                import sys, traceback
                print('Whoops! Problem:', file=sys.stderr)
                traceback.print_exc(file=sys.stderr)
        for msg in record:  # Flush any pending records before finishing
            f.write(str(msg) + '\n')
        f.close()

    def terminate(self):
        self.exit.set()
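For reference, this is roughly how the pieces above are wired together from the main process (the message contents here are made up for illustration):

log_hdl, listener = file_logger_mp("can_logger", "/tmp/can.log")

# Each call appends to the in-process buffer; once it exceeds 100
# entries, the buffer is pushed through the pipe to the child process.
for i in range(1000):
    log_hdl("msg %d" % i)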
I wrote a trivial piece of code to run tasks in Luigi. The code is below:
import luigi

count = 0

class TaskC(luigi.Task):
    def requires(self):
        return None

    def run(self):
        print("Running task C ...")
        global count
        with self.output().open('w') as outfile:
            outfile.write("Finished task C, count = %d", count)
        count += 1

    def output(self):
        return luigi.LocalTarget("./logs/task_c.txt")

class TaskB(luigi.Task):
    def requires(self):
        return None

    def run(self):
        print("Running task B ...")
        global count
        with self.output().open('w') as outfile:
            outfile.write("Finished task B, count = %d ...", count)
        count += 1

    def output(self):
        return luigi.LocalTarget("./logs/task_b.txt")

class TaskA(luigi.Task):
    def requires(self):
        return [TaskB(), TaskC()]

    def run(self):
        print("Running task A ...")
        global count
        with self.output().open('w') as outfile:
            outfile.write("Finished task A, count = %d ...", count)
        count += 1

    def output(self):
        return luigi.LocalTarget("./logs/task_a.txt")

if __name__ == '__main__':
    print("Start the fisrt luigi app :)")
    luigi.run()
Expected: I want to run TaskA, but TaskA requires TaskB and TaskC, so TaskB and TaskC should run first; only when both tasks B and C are finished can TaskA run.
Actual: Only TaskA runs. The other tasks don't. The console log:
Start the fisrt luigi app :)
DEBUG: Checking if TaskA() is complete
INFO: Informed scheduler that task TaskA__99914b932b has status DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=382715991, workers=1, host=w10tng, username=tng, pid=2096) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 1 tasks of which:
* 1 complete ones were encountered:
- 1 TaskA()
Did not run any tasks
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
Command I used to run it:
python first_luigi_app.py --local-scheduler TaskA
I don't know if I've been missing something! Would appreciate it if someone can help :)
You can try removing the requires() methods from TaskB and TaskC; currently, by returning None, they are skipped.
Also, outfile.write() does not accept printf-style arguments; when I used f-string formatting instead, it worked OK.
Run with python -m luigi --module l1 TaskA --local-scheduler, where l1 is l1.py (a copy of your code).
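For instance, here is a minimal sketch of TaskC with the write() call fixed using an f-string (my reading of the fix; the same change applies to the other tasks):

class TaskC(luigi.Task):
    def run(self):
        print("Running task C ...")
        global count
        with self.output().open('w') as outfile:
            # write() takes a single string, so format it first
            outfile.write(f"Finished task C, count = {count}")
        count += 1

    def output(self):
        return luigi.LocalTarget("./logs/task_c.txt")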
I am new to Luigi and exploring its possibilities. I encountered a problem where I defined a task with requires(), run(), and output() methods. In run(), I'm executing the contents of a file.
However, if the file does not exist, the task does not fail. Is there something I'm missing?
import luigi
import logging
import time
import sys, os

logging.basicConfig(filename='Execution.log', level=logging.DEBUG)
date = time.strftime("%Y%m%d")

class CreateTable(luigi.Task):
    def run(self):
        os.system('./createsample.hql')
        # with self.output().open('w') as f:
        #     f.write('Completed')

    def output(self):
        return luigi.LocalTarget('/tmp/CreateTable_Success_%s.csv' % date)
Output:
INFO: [pid 15553] Worker Worker(salt=747259359, workers=1, host=host-.com, username=root, pid=15553) running CreateTable()
sh: ./createsample.hql: No such file or directory
INFO: [pid 15553] Worker Worker(salt=747259359, workers=1, host=host-.com, username=root, pid=15553) done CreateTable()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task CreateTable__99914b932b has status DONE
Technically your code works: the Python part of your job ran successfully. The problem is that you are making a system call that fails because the file does not exist.
What you need to do here is check the return code of the system call. A return code of 0 means it ran successfully; any other outcome will yield a non-zero return code:
rc = os.system('./createsample.hql')
if rc:
    raise Exception("something went wrong")
You might want to use the subprocess module for system calls to have more flexibility (and complexity): https://docs.python.org/2/library/subprocess.html
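For example, a minimal sketch using subprocess.check_call, which raises CalledProcessError on a non-zero exit status so the task fails instead of silently succeeding (reusing the imports and date variable from your snippet):

import subprocess

class CreateTable(luigi.Task):
    def run(self):
        # check_call raises CalledProcessError if the script is missing
        # or exits non-zero, which makes the Luigi task fail as expected
        subprocess.check_call(['./createsample.hql'])

    def output(self):
        return luigi.LocalTarget('/tmp/CreateTable_Success_%s.csv' % date)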
I have to call the crawler from another python file, for which I use the following code.
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

def crawl_koovs():
    spider = SomeSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
On running this, I get the error:
exceptions.ValueError: signal only works in main thread
The only workaround I could find is to use
reactor.run(installSignalHandlers=False)
which I don't want to use, as I want to call this method multiple times and want the reactor to be stopped before the next call. What can I do to make this work (maybe force the crawler to start in the same 'main' thread)?
The first thing I would say is that when you're executing Scrapy from an external file, the log level is set to INFO; you should change it to DEBUG to see what's happening if your code doesn't work.
You should change the line:
log.start()
to:
log.start(loglevel=log.DEBUG)
To store everything in the log and generate a text file (for debugging purposes) you can do:
log.start(logfile="file.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)
About the signals issue: with the log level changed to DEBUG, maybe you can see some output that can help you fix it. You can also try putting your script into the Scrapy project folder to see if it still crashes.
If you change the line:
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
to:
dispatcher.connect(reactor.stop, signals.spider_closed)
what does it say? Depending on your Scrapy version, it may be deprecated.
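For reference, dispatcher here is the old pydispatch-based helper; in Scrapy releases of that era it was imported like this (this import path is deprecated or removed in newer versions):

from scrapy.xlib.pydispatch import dispatcher

# Equivalent wiring of the spider_closed signal to stop the reactor
dispatcher.connect(reactor.stop, signals.spider_closed)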
For looping, and for use in Azure Functions with a TimerTrigger, use this task:
from twisted.internet import task
from twisted.internet import reactor

loopTimes = 3
failInTheEnd = False
_loopCounter = 0

def runEverySecond():
    """
    Called at every loop interval.
    """
    global _loopCounter
    if _loopCounter < loopTimes:
        _loopCounter += 1
        print('A new second has passed.')
        return
    if failInTheEnd:
        raise Exception('Failure during loop execution.')
    # We looped enough times.
    loop.stop()
    return

def cbLoopDone(result):
    """
    Called when loop was stopped with success.
    """
    print("Loop done.")
    reactor.stop()

def ebLoopFailed(failure):
    """
    Called when loop execution failed.
    """
    print(failure.getBriefTraceback())
    reactor.stop()

loop = task.LoopingCall(runEverySecond)

# Start looping every 1 second.
loopDeferred = loop.start(1.0)

# Add callbacks for stop and failure.
loopDeferred.addCallback(cbLoopDone)
loopDeferred.addErrback(ebLoopFailed)

reactor.run()
If we want a task to run every X seconds repeatedly, we can use twisted.internet.task.LoopingCall (example from https://docs.twisted.org/en/stable/core/howto/time.html).
First time in the realm of Luigi (and Python!), and I have some questions. The relevant code is:
from Database import Database

import luigi

class bbSanityCheck(luigi.Task):
    conn = luigi.Parameter()
    date = luigi.Parameter()

    def __init__(self, *args, **kwargs):
        super(bbSanityCheck, self).__init__(*args, **kwargs)
        self.has_run = False

    def run(self):
        print "Entering run of bb sanity check"
        # DB STUFF HERE THAT DOESN'T MATTER
        print "Are we in la-la land?"

    def complete(self):
        print "BB Sanity check being asked for completeness: ", self.has_run
        return self.has_run

class Pipeline(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        db = Database('cbs')
        self.conn = db.connect()
        print "I'm about to yield!"
        return bbSanityCheck(conn=self.conn, date=self.date)

    def run(self):
        print "Hello World"
        self.conn.query("""SELECT *
                           FROM log_blackbook""")
        result = self.conn.store_result()
        print result.fetch_row()

    def complete(self):
        return False

if __name__ == '__main__':
    luigi.run()
Output is here (with relevant DB returns removed):
DEBUG: Checking if Pipeline(date=2013-03-03) is complete
I'm about to yield!
INFO: Scheduled Pipeline(date=2013-03-03)
I'm about to yield!
DEBUG: Checking if bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03) is complete
BB Sanity check being asked for completeness: False
INFO: Scheduled bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 5150] Running bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
Entering run of bb sanity check
Are we in la-la land?
INFO: [pid 5150] Done bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: There are 1 pending tasks possibly being run by other workers
INFO: Worker was stopped. Shutting down Keep-Alive thread
So the questions:
1.) Why does "I'm about to yield" get printed twice?
2.) Why is "hello world" never printed?
3.) What is the "1 pending tasks possibly run by other workers"?
I prefer super-clean output because it is way easier to maintain. I'm hoping I can get these warning-like messages ironed out.
I've also noted that requires() can either yield or return item1, item2, item3. I've read about yield and understand it. What I don't get is which convention is considered superior here, or whether there are subtle differences that I, being new to the language, am not getting.
I think you're misunderstanding how luigi works in general.
(1) Hmm, not sure about that. It looks more like an issue with printing the same thing at both the INFO and DEBUG levels to me.
(2)
So, you're trying to run Pipeline, which depends on bbSanityCheck. bbSanityCheck.complete() never returns True because you never set has_run to True in bbSanityCheck, even after it runs. So the Pipeline task can never run and output hello world, because its dependency is never considered complete.
(3) That's probably because you have this pending task (it's actually Pipeline). But Luigi understands it is impossible for it to run, and shuts down.
I would personally not use has_run as a way to check whether a task has run, but would instead check for the existence of the result of the job. I.e., if the job writes something to the database, then complete() should check that the expected contents are there.
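A hypothetical sketch of that idea, reusing the Database class and table from the question (the day column is made up; adjust the query to your schema):

class bbSanityCheck(luigi.Task):
    date = luigi.DateParameter()

    def complete(self):
        # The task counts as done only if its expected rows actually exist
        conn = Database('cbs').connect()
        conn.query("SELECT 1 FROM log_blackbook WHERE day = '%s' LIMIT 1" % self.date)
        return len(conn.store_result().fetch_row()) > 0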