My TaskB requires TaskA. On completion, TaskA writes to a MySQL table, and TaskB should then read that table as its input.
I cannot seem to figure out how to do this in Luigi. Can someone point me to an example or give me a quick example here?
The existing MySqlTarget in Luigi uses a separate marker table to indicate when the task is complete. Here's the rough approach I would take, but your question is very abstract, so it is likely to be more complicated in reality.
import luigi
from datetime import datetime
from luigi.contrib.mysqldb import MySqlTarget


class TaskA(luigi.Task):
    rundate = luigi.DateParameter(default=datetime.now().date())
    target_table = "table_to_update"
    host = "localhost:3306"
    db = "db_to_use"
    user = "user_to_use"
    pw = "pw_to_use"

    def get_target(self):
        return MySqlTarget(host=self.host, database=self.db, user=self.user, password=self.pw, table=self.target_table,
                           update_id=str(self.rundate))

    def requires(self):
        return []

    def output(self):
        return self.get_target()

    def run(self):
        # update table
        self.get_target().touch()


class TaskB(luigi.Task):
    def requires(self):
        return [TaskA()]

    def run(self):
        # reading from target_table
        pass
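To make the marker-table idea more concrete, here is a minimal sketch using only the standard library's sqlite3 instead of MySQL. The class `MarkerTarget`, the marker table name `table_updates`, and the sample date strings are all hypothetical names chosen for illustration; this is not Luigi's implementation, just the general pattern of "the task is complete once its update_id has been recorded".

```python
import sqlite3


class MarkerTarget:
    """Hypothetical target: complete once (update_id, table) is in a marker table."""

    def __init__(self, conn, table, update_id, marker_table="table_updates"):
        self.conn = conn
        self.table = table
        self.update_id = update_id
        self.marker_table = marker_table
        # The marker table is separate from the data table it describes.
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {self.marker_table} "
            "(update_id TEXT, target_table TEXT, PRIMARY KEY (update_id, target_table))"
        )

    def exists(self):
        # A scheduler would call this to decide whether the task still has to run.
        row = self.conn.execute(
            f"SELECT 1 FROM {self.marker_table} WHERE update_id = ? AND target_table = ?",
            (self.update_id, self.table),
        ).fetchone()
        return row is not None

    def touch(self):
        # Record that the update identified by update_id has been applied.
        self.conn.execute(
            f"INSERT OR REPLACE INTO {self.marker_table} VALUES (?, ?)",
            (self.update_id, self.table),
        )
        self.conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_to_update (value INTEGER)")
target = MarkerTarget(conn, "table_to_update", "2024-01-01")

before = target.exists()  # nothing recorded yet
conn.execute("INSERT INTO table_to_update VALUES (42)")  # stands in for TaskA's work
target.touch()  # mark this update_id as done
after = target.exists()
other_day = MarkerTarget(conn, "table_to_update", "2024-01-02").exists()
rows = conn.execute("SELECT value FROM table_to_update").fetchall()  # TaskB's read
```

Because completion is keyed on the update_id, a run for a different date is still considered incomplete even though the data table already has rows in it.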
I am using FastAPI and I would like to know if I am using the dependencies correctly.
First, I have a function that yields the database session.
class ContextManager:
    def __init__(self):
        self.db = DBSession()

    def __enter__(self):
        return self.db

    def __exit__(self, exc_type, exc_value, traceback):
        self.db.close()


def get_db():
    with ContextManager() as db:
        yield db
I would like to use that function in another function:
def validate(db=Depends(get_db)):
    is_valid = verify(db)
    if not is_valid:
        raise HTTPException(status_code=400)
    yield db
Finally, I would like to use the last functions as a dependency on the routes:
@router.get('/')
def get_data(db=Depends(validate)):
    data = db.query(...)
    return data
I am using this code and it seems to work, but I would like to know if it is the most appropriate way to use dependencies. In particular, I am not sure whether I have to use 'yield db' inside the function validate or whether it would be better to use return. I would appreciate your help. Thanks a lot.
Look at class ATask
class ATask(luigi.Task):
    config = luigi.Parameter()

    def requires(self):
        # Some Tasks maybe
        return []

    def output(self):
        return luigi.LocalTarget("A.txt")

    def run(self):
        with open("A.txt", "w") as f:
            f.write("Complete")
Now look at class BTask
class BTask(luigi.Task):
    config = luigi.Parameter()

    def requires(self):
        return ATask(config=self.config)

    def output(self):
        return luigi.LocalTarget("B.txt")

    def run(self):
        with open("B.txt", "w") as f:
            f.write("Complete")
The first question: is there a chance that, while ATask is running and has started writing "A.txt", BTask will start before ATask has finished writing?
The second question: if I start execution like this
luigi.build([BTask(config=some_config)], local_scheduler=True)
and the pipeline fails somewhere inside, can I find out about it from outside, for example through the return value of luigi.build or something similar?
No, Luigi won't start executing BTask until ATask has finished (i.e., until it has finished writing the target file).
If you want a detailed response from luigi.build in case of an error, pass the extra keyword argument detailed_summary=True to the build/run methods and then access summary_text, like this:
luigi_run_result = luigi.build(..., detailed_summary=True)
print(luigi_run_result.summary_text)
For details on that, please read "Response of luigi.build()/luigi.run()" in the Luigi documentation.
Also, you may be interested in this answer about how to access the error / exception: https://stackoverflow.com/a/33396642/3219121
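To illustrate the two points above (a downstream task only starts once its dependency is complete, and the caller can learn about failures from the return value), here is a toy model written with the standard library only. `ToyTask` and `toy_build` are hypothetical names; this is a simplified sketch of the idea and not Luigi's scheduler.

```python
import os
import tempfile


class ToyTask:
    """Hypothetical stand-in for a task with one output file and upstream tasks."""

    def __init__(self, path, requires=(), fail=False):
        self.path = path
        self.requires = list(requires)
        self.fail = fail

    def complete(self):
        # Like a file target, the task counts as done once its output exists.
        return os.path.exists(self.path)

    def run(self):
        if self.fail:
            raise RuntimeError(f"{self.path} failed")
        # Write under a temporary name, then rename, so the output never
        # appears half-written to anything that checks complete().
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            f.write("Complete")
        os.replace(tmp_path, self.path)


def toy_build(task, log):
    """Run upstream tasks first; return True only if everything succeeded."""
    if task.complete():
        return True
    for dep in task.requires:
        if not toy_build(dep, log):
            return False  # a failed dependency blocks the downstream task
    try:
        task.run()
    except Exception as exc:
        log.append(f"FAILED {os.path.basename(task.path)}: {exc}")
        return False
    log.append(f"DONE {os.path.basename(task.path)}")
    return True


workdir = tempfile.mkdtemp()

# Happy path: A runs to completion before B starts.
a = ToyTask(os.path.join(workdir, "A.txt"))
b = ToyTask(os.path.join(workdir, "B.txt"), requires=[a])
ok_log = []
ok = toy_build(b, ok_log)

# Failure path: the upstream task raises, so the downstream task never runs.
bad_a = ToyTask(os.path.join(workdir, "bad_A.txt"), fail=True)
bad_b = ToyTask(os.path.join(workdir, "bad_B.txt"), requires=[bad_a])
bad_log = []
bad = toy_build(bad_b, bad_log)
```

The write-then-rename step in `run` is worth copying into real tasks that write files by hand: if completeness is judged by the file existing, a partially written file should never be visible under its final name.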
I'm working on building a simple report generator that will run a query, put the data into a formatted excel sheet and email it.
I am trying to get better at proper coding practices (cohesion, coupling, etc) and I am wondering if this is the proper way to build this or if there's a better way. It feels somewhat redundant to have to pass the arguments of the Extractor twice: once to the main class and then again to the subclass.
Should I be using nested classes? **kwargs? Or is this correct?
from typing_extensions import ParamSpec
import pandas as pd
from sqlalchemy import create_engine
from O365 import Account
from jinjasql import JinjaSql


class Emailer:
    pass


class Extractor:
    '''Executes a sql query and returns the data.'''

    def __init__(self, query: str, database: str, connection: dict, param_query: bool = False) -> None:
        self.query = query
        self.database = database
        # self.conn_params = connection
        self.engine = create_engine(connection)
        # self.engine = Reactor().get_engine(**self.conn_params)

    def parse_query(self) -> None:
        '''If the query needs parameterization, do that here.'''
        pass

    def run_query(self) -> pd.DataFrame:
        '''Run a supplied query. Expects the name of the query.'''
        # TODO: Make this check if it's a query or a query name.
        with open(self.query, "r") as f:
            query = f.read()
        return pd.read_sql_query(query, self.engine)


class ReportGenerator:
    '''Main class'''

    def __init__(self, query: str, database: str, connection: dict, param_query: bool = False) -> None:
        # Every Extractor argument has to be repeated here and passed through
        self.extractor = Extractor(query, database, connection, param_query)
        self.emailer = Emailer()

    def build_report(self) -> None:
        pass
This isn't a bad solution, but I think it can be improved.
If you continue down this path correctly, ReportGenerator should never need to know about query, connection, or the other members of Extractor, so they shouldn't be passed to its constructor.
What you can do instead is create an Extractor and an Emailer before the creation of the ReportGenerator and pass them to the constructor.
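A minimal sketch of that suggestion follows. The stubbed `run_query`, the `send` and `send_report` methods, the sample rows, and the address are all made up for illustration, and the database and mail libraries are replaced with stand-ins so the wiring is easy to see.

```python
class Extractor:
    """Hypothetical extractor: the only class that knows about query and connection."""

    def __init__(self, query, connection):
        self.query = query
        self.connection = connection

    def run_query(self):
        # Stand-in for pd.read_sql_query(...); returns fixed rows here.
        return [{"region": "north", "sales": 10}, {"region": "south", "sales": 7}]


class Emailer:
    """Hypothetical emailer that records messages instead of sending them."""

    def __init__(self):
        self.sent = []

    def send(self, recipient, body):
        self.sent.append((recipient, body))


class ReportGenerator:
    """Depends only on what the collaborators can do, not on how they were built."""

    def __init__(self, extractor, emailer):
        self.extractor = extractor
        self.emailer = emailer

    def build_report(self):
        rows = self.extractor.run_query()
        return "\n".join(f"{row['region']}: {row['sales']}" for row in rows)

    def send_report(self, recipient):
        self.emailer.send(recipient, self.build_report())


# The collaborators are created first and then handed to the generator.
extractor = Extractor(query="sales.sql", connection="sqlite://")
emailer = Emailer()
report = ReportGenerator(extractor, emailer)
report.send_report("team@example.com")
```

With this layout, adding a constructor argument to Extractor no longer touches ReportGenerator, and a fake Extractor or Emailer can be passed in for testing.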
I am building a simple application in Python that can run some tasks in a specific order.
To achieve that, I am creating an executor that decides which task to execute, and a base Task class that holds the common abstraction. Something like this:
class Executor(object):
    def __init__(self, tasks=None):
        # Avoid a mutable default argument shared between instances
        self.tasks = tasks if tasks is not None else []

    def run(self):
        # do stuff
        pass


class Task():
    def __init__(self, source_df, start_date, end_date):
        self.source_df = source_df
        self.start_date = start_date
        self.end_date = end_date

    def filter_source_data(self):
        # return a filtered dataframe
        pass

    def transform(self):
        # return cleaned data
        pass

    def load_data(self):
        # load data in hive table
        pass
The executor should run the whole pipeline or an individual Task if we specify it on the command line as an argument.
The task list can be any or all of these: ['filter', 'transform', 'load']
I am stuck and confused about how to implement the solution. I am very new to object-oriented design.
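One possible shape for this, offered as a sketch rather than the definitive design: map each step name to a method, keep the pipeline order in one list, and let a command-line option select a subset. Several things here are assumptions for the sake of a runnable example: plain lists of dicts stand in for the dataframe, `load_data` appends to a list instead of writing to Hive, and the `--tasks` option and `parse_steps` helper are hypothetical names.

```python
import argparse

STEP_ORDER = ["filter", "transform", "load"]


class Task:
    def __init__(self, source_rows, start_date, end_date):
        self.source_rows = source_rows
        self.start_date = start_date
        self.end_date = end_date
        self.loaded = []

    def filter_source_data(self, rows):
        # Keep rows whose ISO date falls inside [start_date, end_date].
        return [r for r in rows if self.start_date <= r["date"] <= self.end_date]

    def transform(self, rows):
        # Placeholder cleaning step: trim and lowercase the names.
        return [{**r, "name": r["name"].strip().lower()} for r in rows]

    def load_data(self, rows):
        # Stand-in for loading into a Hive table.
        self.loaded.extend(rows)
        return rows


class Executor:
    def __init__(self, task, steps=None):
        self.task = task
        # Run the requested steps in pipeline order; default to all of them.
        requested = set(steps or STEP_ORDER)
        self.steps = [s for s in STEP_ORDER if s in requested]

    def run(self):
        handlers = {
            "filter": self.task.filter_source_data,
            "transform": self.task.transform,
            "load": self.task.load_data,
        }
        rows = self.task.source_rows
        for step in self.steps:
            rows = handlers[step](rows)  # each step feeds the next
        return rows


def parse_steps(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--tasks", nargs="+", choices=STEP_ORDER, default=STEP_ORDER)
    return parser.parse_args(argv).tasks


rows = [
    {"name": " Alice ", "date": "2024-01-05"},
    {"name": "BOB", "date": "2023-12-31"},
]

# No --tasks option: run the whole pipeline.
full_task = Task(rows, "2024-01-01", "2024-01-31")
full = Executor(full_task, parse_steps([])).run()

# Only one step requested on the command line.
single_task = Task(rows, "2024-01-01", "2024-01-31")
only_transform = Executor(single_task, parse_steps(["--tasks", "transform"])).run()
```

In a real script, `parse_steps(sys.argv[1:])` would replace the hard-coded argument lists used here.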
I realize I likely need to use dynamic requirements to accomplish the following task, however I have not been able to wrap my head around what this would look like in practice.
The goal is to use Luigi to generate data and add it to a database, without knowing ahead of time what data will be generated.
Take the following example using mongodb:
import luigi
from uuid import uuid4
from luigi.contrib import mongodb
import pymongo


# Make up IDs, though in practice the IDs may be generated from an API
class MakeID(luigi.Task):
    def run(self):
        with self.output().open('w') as f:
            f.write(','.join([str(uuid4()) for e in range(10)]))

    # Write the data to file
    def output(self):
        return luigi.LocalTarget('data.csv')


class ToDataBase(luigi.Task):
    def requires(self):
        return MakeID()

    def run(self):
        with self.input().open('r') as f:
            ids = f.read().split(',')
        # Add some fake data to simulate generating new data
        count_data = {key: value for value, key in enumerate(ids)}
        # Add data to the database
        self.output().write(count_data)

    def output(self):
        # Attempt to read non-existent file to get the IDs to check if task is complete
        with self.input().open('r') as f:
            valid_ids = f.read().split(',')
        client = pymongo.MongoClient('localhost',
                                     27017,
                                     ssl=False)
        return mongodb.MongoRangeTarget(client,
                                        'myDB',
                                        'myData',
                                        valid_ids,
                                        'myField')


if __name__ == '__main__':
    luigi.run()
The goal is to obtain data, modify it and then add it to a database.
The above code fails when run because the output method of ToDataBase runs before the requires method, so although the function has access to the input, the input does not yet exist. Regardless, I still need to check that the data was added to the database.
This GitHub issue is close to what I am looking for, though as I mentioned, I have not been able to figure out dynamic requirements for this use case in practice.
The solution is to create a third task (Dynamic in the example) that yields the task waiting on the dynamic input, and to make the dependency a parameter rather than a requires method.
class ToDatabase(luigi.Task):
    fp = luigi.Parameter()

    def output(self):
        with open(self.fp, 'r') as f:
            valid_ids = [str(e) for e in f.read().split(',')]
        client = pymongo.MongoClient('localhost', 27017, ssl=False)
        return mongodb.MongoRangeTarget(client, 'myDB', 'myData',
                                        valid_ids, 'myField')

    def run(self):
        with open(self.fp, 'r') as f:
            valid_ids = [str(e) for e in f.read().split(',')]
        self.output().write({k: 5 for k in valid_ids})


class Dynamic(luigi.Task):
    def output(self):
        return self.input()

    def requires(self):
        return MakeID()

    def run(self):
        # ToDatabase is yielded at run time, after MakeID has written its file
        yield ToDatabase(fp=self.input().path)