How to set the run name of a PipelineJob - python

I have this code to start a VertexAI pipeline job:
import google.cloud.aiplatform as vertexai

vertexai.init(project=PROJECT_ID, staging_bucket=PIPELINE_ROOT)
job = vertexai.PipelineJob(
    display_name='pipeline-test-1',
    template_path='xgb_pipe.json'
)
job.run()
which works nicely, but the run name label is a random number. How can I specify the run name?

You can change the value shown in "Run Name" by setting name in the pipeline decorator:
@kfp.dsl.pipeline(name="automl-image-training-v2")
When the name is defined with @kfp.dsl.pipeline, the date and time are automatically appended when the pipeline is run. Compile and run the pipeline to see the change in "Run Name".
I used the code from the Vertex AI pipeline examples. See the pipeline code:
# Imports used by the Vertex AI sample this pipeline comes from
import kfp
from google.cloud import aiplatform as aip
from google_cloud_pipeline_components import aiplatform as gcc_aip


@kfp.dsl.pipeline(name="automl-image-training-v2")
def pipeline(project: str = PROJECT_ID, region: str = REGION):
    ds_op = gcc_aip.ImageDatasetCreateOp(
        project=project,
        display_name="flowers",
        gcs_source="gs://cloud-samples-data/vision/automl_classification/flowers/all_data_v2.csv",
        import_schema_uri=aip.schema.dataset.ioformat.image.single_label_classification,
    )
    training_job_run_op = gcc_aip.AutoMLImageTrainingJobRunOp(
        project=project,
        display_name="train-automl-flowers",
        prediction_type="classification",
        model_type="CLOUD",
        base_model=None,
        dataset=ds_op.outputs["dataset"],
        model_display_name="train-automl-flowers",
        training_fraction_split=0.6,
        validation_fraction_split=0.2,
        test_fraction_split=0.2,
        budget_milli_node_hours=8000,
    )
    endpoint_op = gcc_aip.EndpointCreateOp(
        project=project,
        location=region,
        display_name="train-automl-flowers",
    )
    gcc_aip.ModelDeployOp(
        model=training_job_run_op.outputs["model"],
        endpoint=endpoint_op.outputs["endpoint"],
        automatic_resources_min_replica_count=1,
        automatic_resources_max_replica_count=1,
    )
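For completeness, a minimal sketch of the compile-and-run step (assuming the KFP SDK's v2 compiler and the same file names as in the question; job submission is unchanged from the original snippet):
from kfp.v2 import compiler

# Compile the decorated pipeline function into the JSON template used above.
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path='xgb_pipe.json',
)

# Re-run the job; the "Run Name" shown in the console now comes from the
# @kfp.dsl.pipeline name with the date and time appended.
job = vertexai.PipelineJob(
    display_name='pipeline-test-1',
    template_path='xgb_pipe.json',
)
job.run()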

Related

Dynamic TaskGroup in Airflow 2

I have a function run_step that produces a dynamic number of EMR tasks within a task group. I want to keep this function in a separate file named helpers.py so that other DAGs can use it and I don't have to rewrite it (in the examples below, I have hard-coded certain values for clarity; otherwise they would be variables):
def run_step(my_group_id, config, dependencies):
    task_group = TaskGroup(group_id=my_group_id)
    for c in config:
        task_id = 'start_' + c['task_name']
        task_name = c['task_name']
        add_step = EmrAddStepsOperator(
            task_group=my_group_id,
            task_id=task_id,
            job_flow_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='emr', key='return_value') }}",
            steps=create_emr_step(args=config[c], d=dependencies[c]),
            aws_conn_id='aws_default'
        )
        wait_for_step = EmrStepSensor(
            task_group=my_group_id,
            task_id='wait_for_' + task_name + '_step',
            job_flow_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='emr', key='return_value') }}",
            step_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='" + f"{my_group_id}.{task_id}" + "', key='return_value')[0] }}"
        )
        add_step >> wait_for_step
    return task_group
The code in my_dag.py which calls this function looks like:
execute_my_step = create_emr_step(
    my_group_id='execute_steps',
    config=my_tasks,
    dependencies=my_dependencies
)
some_steps >> execute_my_step
I am expecting this to produce a task group that contains two steps for every item in config, but it only produces one step labeled as create_emr_step with no task group. I tried putting the TaskGroup in the DAG (and made the necessary changes to run_step) as shown below, but that did not work either:
with TaskGroup('execute_steps') as execute_steps:
    execute_my_step = create_emr_step(
        my_group_id='execute_steps',
        config=my_tasks,
        dependencies=my_dependencies
    )
Is it possible to do this? I need to produce steps dynamically because our pipeline is so big. I was doing this successfully with subdags in a similar way, but can't figure out how to get this to work with task groups. Would it be easier to write my own operator?
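For reference, here is a minimal sketch of a helper that builds a dynamic TaskGroup, assuming Airflow 2.x and using EmptyOperator (Airflow >= 2.3) as a stand-in for the EMR operator and sensor; the helper and task names are illustrative, not the asker's actual code:
from airflow.operators.empty import EmptyOperator  # stand-in for the EMR operator and sensor
from airflow.utils.task_group import TaskGroup


def build_group(my_group_id, config):
    # Creating the group as a context manager attaches every task created
    # inside the `with` block to the group (and to the enclosing DAG).
    with TaskGroup(group_id=my_group_id) as task_group:
        for c in config:
            add_step = EmptyOperator(task_id='start_' + c['task_name'])
            wait_for_step = EmptyOperator(task_id='wait_for_' + c['task_name'] + '_step')
            add_step >> wait_for_step
    return task_group
Called inside the DAG's own with DAG(...) block, the returned group can then be wired like any task, e.g. some_steps >> build_group('execute_steps', my_tasks).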

How do I include both data and state dependencies in a Prefect flow?

This seems like it should be simple, but I can't figure out how to include both state and data dependencies in a single flow. Here is what I attempted (simplified):
def main():
    with Flow("load_data") as flow:
        test_results = prepare_file1()
        load_file1(test_results)
        participants = prepare_file2()
        load_file2(participants)
        email = flow.add_task(EmailTask(name='email', subject='Flow succeeded!', msg='flow succeeded', email_to='xxx', email_from='xxx', smtp_server='xxx', smtp_port=25, smtp_type='INSECURE'))
        flow.set_dependencies(task=email, upstream_tasks=[load_file1, load_file2])
    flow.visualize()
The resulting graph shows load_file1 and load_file2 twice, which means they run twice. Can I just set up an additional dependency so that email runs when the two load tasks finish?
The issue is how you add the task to your Flow. When using tasks from the Prefect task library, it's best to first initialize them and then call them in your Flow as follows:
send_email = EmailTask(name='email', subject='Flow succeeded!', msg='flow succeeded', email_to='xxx', email_from='xxx', smtp_server='xxx', smtp_port=25, smtp_type='INSECURE')

with Flow("load_data") as flow:
    send_email()
Alternatively, do it in one step with double round brackets: EmailTask(init_kwargs)(run_kwargs). The first pair of brackets initializes the task, and the second calls the task by invoking its .run() method.
with Flow("load_data") as flow:
    EmailTask(name='email', subject='Flow succeeded!', msg='flow succeeded', email_to='xxx', email_from='xxx', smtp_server='xxx', smtp_port=25, smtp_type='INSECURE')()
The full flow example could look as follows:
from prefect import task, Flow
from prefect.tasks.notifications import EmailTask
from prefect.triggers import always_run


@task(log_stdout=True)
def prepare_file1():
    print("File1 prepared!")
    return "file1"


@task(log_stdout=True)
def prepare_file2():
    print("File2 prepared!")
    return "file2"


@task(log_stdout=True)
def load_file1(file: str):
    print(f"{file} loaded!")


@task(log_stdout=True)
def load_file2(file: str):
    print(f"{file} loaded!")


send_email = EmailTask(
    name="email",
    subject="Flow succeeded!",
    msg="flow succeeded",
    email_to="xxx",
    email_from="xxx",
    smtp_server="xxx",
    smtp_port=25,
    smtp_type="INSECURE",
    trigger=always_run,
)

with Flow("load_data") as flow:
    test_results = prepare_file1()
    load1_task = load_file1(test_results)
    participants = prepare_file2()
    load2_task = load_file2(participants)
    send_email(upstream_tasks=[load1_task, load2_task])

if __name__ == "__main__":
    flow.visualize()
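To actually execute the flow locally rather than only render the graph, Prefect 1.x also supports flow.run(); a minimal usage sketch:
if __name__ == "__main__":
    flow.visualize()  # renders the dependency graph
    flow.run()        # executes the tasks locally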

ti is not defined while pulling xcom variable in S3ToRedshiftOperator

I am using S3ToRedshiftOperator to load a CSV file into a Redshift database. How can I pass an XCom value to S3ToRedshiftOperator? And how can we push an XCom without using a custom function?
Error:
NameError: name 'ti' is not defined
I am using the code below:
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftOperator

def export_db_fn(**kwargs):
    session = settings.Session()
    outkey = S3_KEY.format(MWAA_ENV_NAME, name[6:])
    print(outkey)
    s3_client.put_object(Bucket=S3_BUCKET, Key=outkey, Body=f.getvalue())
    ti.xcom_push(key='FILE_PATH', value=outkey)
    return "OK"

with DAG(dag_id="export_info", schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
    export_info = PythonOperator(
        task_id="export_info",
        python_callable=export_db_fn,
        provide_context=True
    )
    transfer_s3_to_redshift = S3ToRedshiftOperator(
        s3_bucket=S3_BUCKET,
        s3_key="{{ ti.xcom_pull(key='FILE_PATH', task_ids='export_info') }}",
        schema="dw_stage",
        table=REDSHIFT_TABLE,
        copy_options=['csv', "IGNOREHEADER 1"],
        redshift_conn_id='redshift',
        autocommit=True,
        task_id='transfer_s3_to_redshift',
    )
    start >> export_info >> transfer_s3_to_redshift >> end
The error message tells you the problem: ti is not defined.
When you set provide_context=True, Airflow makes the Context available to you in the Python callable. One of its attributes is ti (see the source code). So you need to extract it from kwargs or add it to the function signature.
Your code should be:
def export_db_fn(**kwargs):
    ...
    ti = kwargs['ti']
    ti.xcom_push(key='FILE_PATH', value=outkey)
    ...
Or if you want to use ti directly then:
def export_db_fn(ti, **kwargs):
    ...
    ti.xcom_push(key='FILE_PATH', value=outkey)
    ...
Note: in Airflow >= 2.0 there is no need to set provide_context=True.
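As a small sketch of that note (assuming the same callable with ti in its signature), the operator wiring in Airflow 2 needs no extra flag:
# Airflow >= 2.0: the context is passed to the callable automatically,
# so provide_context is no longer needed on PythonOperator.
export_info = PythonOperator(
    task_id="export_info",
    python_callable=export_db_fn,  # def export_db_fn(ti, **kwargs): ...
)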

How to import my function into a Python file and pass input to it?

I have a function called analyze() which looks like the following:
def analyze():
    for stmt in irsb.statements:
        if isinstance(stmt, pyvex.IRStmt.WrTmp):
            wrtmp(stmt)
        if isinstance(stmt, pyvex.IRStmt.Store):
            address = stmt.addr
            address1 = '{}'.format(address)[1:]
            print address1
            data = stmt.data
            data1 = '{}'.format(data)[1:]
            tmp3 = store64(address1, int64(data1))
        if isinstance(stmt, pyvex.IRStmt.Put):
            expr = stmt.expressions[0]
            putoffset = stmt.offset
            data = stmt.data
            data4 = '{}'.format(data)[1:]
            if (str(data).startswith("0x")):
                #const_1 = ir.Constant(int64, data4)
                tmp = put64(putoffset, ZERO_TAG)
            else:
                put64(putoffset, int64(data4))
            if isinstance(stmt.data, pyvex.IRExpr.Const):
                reg_name = irsb.arch.translate_register_name(stmt.offset, stmt.data.result_size(stmt.data.tag))
                print reg_name
        stmt.pp()
This function gets the following input and tries to analyze it:
CODE = b"\xc1\xe0\x05"
irsb = pyvex.block.IRSB(CODE, 0x80482f0, archinfo.ArchAMD64())
When this input is in the same file as my code (let's call the whole thing analyze.py), it works and python analyze.py produces output. However, I want to make a separate file (call it array.py), call analyze there, put the inputs inside it, and run python array.py to get the same result. I did the following for array.py:
from analyze import analyze
CODE = b"\xc1\xe0\x05"
irsb = pyvex.block.IRSB(CODE, 0x80482f0, archinfo.ArchAMD64())
analyze()
However, when I run array.py, it stops with this error:
NameError: name 'CODE' is not defined
How can I resolve this problem?
A simple change in your function: add a parameter:
def analyze(irsb):  # irsb here is a parameter
    ...
    # The rest is the same
And then pass arguments when calling it:
from analyze import analyze
CODE = b"\xc1\xe0\x05"
irsb_as_arg = pyvex.block.IRSB(CODE, 0x80482f0, archinfo.ArchAMD64())
analyze(irsb_as_arg) # irsb_as_arg is an argument
I have changed irsb to irsb_as_arg here just to draw attention to it, but it can keep the same name.
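For completeness, a sketch of what array.py would look like with the imports it also needs (assuming pyvex and archinfo are installed):
# array.py
import archinfo
import pyvex

from analyze import analyze

CODE = b"\xc1\xe0\x05"
irsb_as_arg = pyvex.block.IRSB(CODE, 0x80482f0, archinfo.ArchAMD64())
analyze(irsb_as_arg)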

TypeError: 'module' object is not callable

I have two scripts. Script 1 is titled schemeDetails. The second script is a test script called temporaryFile that creates a schemeSetup object using the schemeSetup class, which is within schemeDetails. Everything is hunky-dory up to the point where I try to access the method insertScheme, which is within the schemeSetup class.
I have imported the schemeDetails script using the following:
import schemeDetails
reload(schemeDetails)
from schemeDetails import *
I can create the schemeSetup object and access its attributes:
d = schemeDetails.schemeSetup()  # fine
print(d.scheme)                  # fine
d.insertScheme()                 # throws error
but trying to call the insertScheme function throws an error.
I don't know why this is happening, as the import statement looks above board to me. Any advice appreciated.
from sikuli import *
import os


class schemeSetup(object):
    # Uses default values
    def __init__(
            self,
            scheme="GM",
            cardNumber="1234567A",
            month="December",
            year="2015",
            setSchemeAsDefault=True):
        # Provide default values for parameters
        self.scheme = scheme
        self.cardNumber = cardNumber
        self.month = month
        self.year = year
        self.setSchemeAsDefault = setSchemeAsDefault

    # schemeDetails is not a subclass of patient. It is simply defined
    # within the patient class - there is a huge difference.

    # ====================================================#
    # schemeDetails function
    def insertScheme(self):
        print("insertScheme Works")
        # r = Regions()
        # r.description("Patient Maintenance", "schemeDetails")
        # myRegion = r.createRegion()
        # myRegion.highlight(1)
        # click(myRegion.find(insertSchemeButton))
        # click(myRegion.find(blankSchemeEntry))
        # type(self.scheme + Key.ENTER + Key.ENTER)
        # type(self.cardNumber + Key.ENTER)
        # type(self.month + Key.ENTER)
        # type(self.year + Key.ENTER)
        # type(" ")
        # unticks HT link; HT linking should be in a separate function

    # ====================================================#
    # schemeDetails function
    def editScheme(self):
        print("editScheme Works")

    # ====================================================#
    def deleteScheme(self):
        pass

    # ====================================================#
It may be of importance that calling either of the bottom two functions does not produce an error. If I put print("Hello") under editScheme and call that method using s.editScheme, the program compiles but I get no output. If I run print(s.editScheme), it returns None.
Well, it seems to be fixed now after changing the import format to this:
import schemeDetails
from schemeDetails import schemeSetup
s = schemeDetails.schemeSetup()
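For reference, a minimal sketch of the two import styles side by side (the failing call site is not shown in the question, so the exact original cause is an assumption):
# Option 1: import the module and qualify the class name
import schemeDetails
s = schemeDetails.schemeSetup()   # module.Class()

# Option 2: import the class itself and call it directly
from schemeDetails import schemeSetup
s = schemeSetup()                 # Class()

# Calling the module object as if it were the class, e.g. schemeDetails(),
# is what raises: TypeError: 'module' object is not callable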
