I have a Spark stream that reads data from an Azure Data Lake, applies some transformations, and then writes into Azure Synapse (DW).
I want to log some metrics for each processed batch, but I don't want the logs from each batch to be duplicated.
Is there any way to log only once, with some export_interval, instead?
Example:
autoloader_df = (
    spark.readStream.format("cloudFiles")
    .options(**stream_config["cloud_files"])
    .option("recursiveFileLookup", True)
    .option("maxFilesPerTrigger", sdid_workload.max_files_agg)
    .option("pathGlobfilter", "*_new.parquet")
    .schema(stream_config["schema"])
    .load(stream_config["read_path"])
    .withColumn(stream_config["file_path_column"], input_file_name())
)
stream_query = (
    autoloader_df.writeStream.format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", stream_config["checkpoint_location"])
    .foreachBatch(
        lambda df_batch, batch_id: ingestion_process(
            df_batch, batch_id, sdid_workload, stream_config, logger=logger
        )
    )
    .start()
)
Where ingestion_process is as follows:
def ingestion_process(df_batch, batch_id, sdid_workload, stream_config, **kwargs):
    logger: AzureLogger = kwargs.get("logger")
    iteration_start_time = datetime.utcnow()
    sdid_workload.ingestion_iteration += 1
    general_transformations(sdid_workload)
    log_custom_metrics(sdid_workload)
In log_custom_metrics I'm using:
exporter = metrics_exporter.new_metrics_exporter(connection_string=appKey, export_interval=12)
view_manager.register_exporter(exporter)
Again, I don't want duplicated logs.
In case anyone stumbles upon this post:
I was able to find a workaround in this topic:
https://github.com/census-instrumentation/opencensus-python/issues/1070
other related topics:
https://github.com/census-instrumentation/opencensus-python/issues/1029
https://github.com/census-instrumentation/opencensus-python/issues/963
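A minimal sketch of the kind of fix those threads point toward, under the assumption that the duplication comes from creating and registering a fresh exporter on every foreachBatch call; the get_metrics_exporter helper below is hypothetical and not the exact code from those issues. The idea is to create and register the Azure metrics exporter once per process and reuse it across batches:
from opencensus.ext.azure import metrics_exporter
from opencensus.stats import stats as stats_module

_EXPORTER = None

def get_metrics_exporter(app_key):
    # Create and register the exporter only once per driver process;
    # later batches reuse it instead of registering a new one in log_custom_metrics.
    global _EXPORTER
    if _EXPORTER is None:
        _EXPORTER = metrics_exporter.new_metrics_exporter(
            connection_string=app_key, export_interval=12)
        stats_module.stats.view_manager.register_exporter(_EXPORTER)
    return _EXPORTER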
I am new to Azure, and dealing with all these paths is proving to be extremely challenging. I am trying to create a pipeline that contains a dataprep.py step and an AutoML step. What I want to do is (after passing the input to the dataprep block and performing several operations on it) save the resulting TabularDataset in the datastore and have it as an output, so that I can then reuse it in my train block.
My dataprep.py file:
# ----- dataprep stuff and imports -----
parser = argparse.ArgumentParser()
parser.add_argument("--month_train", required=True)
parser.add_argument("--year_train", required=True)
parser.add_argument('--output_path', dest='output_path', required=True)
args = parser.parse_args()
run = Run.get_context()
ws = run.experiment.workspace
datastore = ws.get_default_datastore()
name_dataset_input = 'Customer_data_' + str(args.year_train)
name_dataset_output = 'DATA_PREP_' + str(args.year_train) + '_' + str(args.month_train)
# get the input dataset by name
ds = Dataset.get_by_name(ws, name_dataset_input)
df = ds.to_pandas_dataframe()
# apply is one of my dataprep functions that I defined earlier
df = apply(df, args.month_train)
# this is where I am having issues: I want to save this in the datastore but also have it as an output
ds = Dataset.Tabular.register_pandas_dataframe(df, args.output_path, name_dataset_output)
The pipeline block instructions:
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep
prepped_data_path = OutputFileDatasetConfig(name="output_path", destination=(datastore, 'managed-dataset/{run-id}/{output-name}'))
dataprep_step = PythonScriptStep(
    name="dataprep",
    script_name="dataprep.py",
    compute_target=compute_target,
    runconfig=aml_run_config,
    arguments=["--output_path", prepped_data_path, "--month_train", month_train, "--year_train", year_train],
    allow_reuse=True
)
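For illustration, this is roughly the pattern I am aiming for (a sketch only, assuming OutputFileDatasetConfig's read_delimited_files / register_on_complete are the right way to promote and register the output; the dataset name 'DATA_PREP_output' and the file name 'prepped.csv' are placeholders):
# dataprep.py: write the prepped dataframe into the step's output folder
import os
os.makedirs(args.output_path, exist_ok=True)
df.to_csv(os.path.join(args.output_path, 'prepped.csv'), index=False)

# pipeline definition: the same output, promoted to a tabular dataset and
# registered under a name once the step completes
prepped_data_path = (
    OutputFileDatasetConfig(name="output_path",
                            destination=(datastore, 'managed-dataset/{run-id}/{output-name}'))
    .read_delimited_files()
    .register_on_complete(name='DATA_PREP_output')
)
# a later train step could then consume prepped_data_path.as_input(), or look the
# registered dataset up by name with Dataset.get_by_name(ws, 'DATA_PREP_output')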
I have a function run_step that produces a dynamic number of EMR tasks within a task group. I want to keep this function in a separate file named helpers.py so that other DAGs can use it and I don't have to rewrite it (in the examples below, I have hard-coded certain values for clarity; otherwise they would be variables):
def run_step(my_group_id, config, dependencies):
    task_group = TaskGroup(group_id=my_group_id)
    for c in config:
        task_id = 'start_' + c['task_name']
        task_name = c['task_name']
        add_step = EmrAddStepsOperator(
            task_group=my_group_id,
            task_id=task_id,
            job_flow_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='emr', key='return_value') }}",
            steps=create_emr_step(args=config[c], d=dependencies[c]),
            aws_conn_id='aws_default'
        )
        wait_for_step = EmrStepSensor(
            task_group=my_group_id,
            task_id='wait_for_' + task_name + '_step',
            job_flow_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='emr', key='return_value') }}",
            step_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='" + f"{my_group_id}.{task_id}" + "', key='return_value')[0] }}"
        )
        add_step >> wait_for_step
    return task_group
The code in my_dag.py which calls this function looks like:
execute_my_step = create_emr_step(
    my_group_id='execute_steps',
    config=my_tasks,
    dependencies=my_dependencies
)
some_steps >> execute_my_step
I am expecting this to produce a task group that contains two steps for every item in config, but it only produces one step labeled as create_emr_step with no task group. I tried putting the TaskGroup in the dag (and made the necessary changes to run_step) as shown below, but that did not work either:
with TaskGroup('execute_steps') as execute_steps:
    execute_my_step = create_emr_step(
        my_group_id='execute_steps',
        config=my_tasks,
        dependencies=my_dependencies
    )
Is it possible to do this? I need to produce steps dynamically because our pipeline is so big. I was doing this successfully with subdags in a similar way, but can't figure out how to get this to work with task groups. Would it be easier to write my own operator?
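For reference, here is the variant I am experimenting with in helpers.py, in case it clarifies what I am after: the operators are created while the TaskGroup is the active context, instead of passing the group_id string via task_group=. The EMR import paths assume a recent amazon provider (older versions use the emr_add_steps / emr_step modules), and create_emr_step plus the config/dependencies lookups are unchanged from my helper above:
# helpers.py
from airflow.utils.task_group import TaskGroup
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

def run_step(my_group_id, config, dependencies):
    # Build every operator inside the TaskGroup context so the tasks are attached
    # to the group (and to the DAG that is active when this helper is called).
    with TaskGroup(group_id=my_group_id) as task_group:
        for c in config:
            task_id = 'start_' + c['task_name']
            add_step = EmrAddStepsOperator(
                task_id=task_id,
                job_flow_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='emr', key='return_value') }}",
                # step construction unchanged from my original helper
                steps=create_emr_step(args=config[c], d=dependencies[c]),
                aws_conn_id='aws_default',
            )
            wait_for_step = EmrStepSensor(
                task_id='wait_for_' + c['task_name'] + '_step',
                job_flow_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='emr', key='return_value') }}",
                step_id="{{ task_instance.xcom_pull(dag_id='my_dag', task_ids='"
                        + f"{my_group_id}.{task_id}" + "', key='return_value')[0] }}",
                aws_conn_id='aws_default',
            )
            add_step >> wait_for_step
    return task_group

# my_dag.py, inside the `with DAG(...)` block:
# execute_my_step = run_step('execute_steps', my_tasks, my_dependencies)
# some_steps >> execute_my_step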
I have this code to start a VertexAI pipeline job:
import google.cloud.aiplatform as vertexai
vertexai.init(project=PROJECT_ID,staging_bucket=PIPELINE_ROOT)
job = vertexai.PipelineJob(
    display_name='pipeline-test-1',
    template_path='xgb_pipe.json'
)
job.run()
which works nicely, but the run name label is a random number. How can I specify the run name?
You can change the value shown in "Run Name" by defining a name when defining the pipeline:
@kfp.dsl.pipeline(name="automl-image-training-v2")
When you define the name using @kfp.dsl.pipeline, the date and time are automatically appended when the pipeline is run. Proceed with compiling and running the pipeline to see the change in "Run Name" (a compile-and-run sketch follows the pipeline code below).
I used the code in Vertex AI pipeline examples. See pipeline code:
@kfp.dsl.pipeline(name="automl-image-training-v2")
def pipeline(project: str = PROJECT_ID, region: str = REGION):
    ds_op = gcc_aip.ImageDatasetCreateOp(
        project=project,
        display_name="flowers",
        gcs_source="gs://cloud-samples-data/vision/automl_classification/flowers/all_data_v2.csv",
        import_schema_uri=aip.schema.dataset.ioformat.image.single_label_classification,
    )
    training_job_run_op = gcc_aip.AutoMLImageTrainingJobRunOp(
        project=project,
        display_name="train-automl-flowers",
        prediction_type="classification",
        model_type="CLOUD",
        base_model=None,
        dataset=ds_op.outputs["dataset"],
        model_display_name="train-automl-flowers",
        training_fraction_split=0.6,
        validation_fraction_split=0.2,
        test_fraction_split=0.2,
        budget_milli_node_hours=8000,
    )
    endpoint_op = gcc_aip.EndpointCreateOp(
        project=project,
        location=region,
        display_name="train-automl-flowers",
    )
    gcc_aip.ModelDeployOp(
        model=training_job_run_op.outputs["model"],
        endpoint=endpoint_op.outputs["endpoint"],
        automatic_resources_min_replica_count=1,
        automatic_resources_max_replica_count=1,
    )
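To see the effect end to end, this is roughly the compile-and-run sequence (a sketch, assuming the kfp v2 compiler and the same PROJECT_ID, PIPELINE_ROOT, and xgb_pipe.json values as in the question):
from kfp.v2 import compiler
import google.cloud.aiplatform as vertexai

# Compile the decorated pipeline function into the JSON template consumed by PipelineJob.
compiler.Compiler().compile(pipeline_func=pipeline, package_path='xgb_pipe.json')

vertexai.init(project=PROJECT_ID, staging_bucket=PIPELINE_ROOT)
job = vertexai.PipelineJob(
    display_name='pipeline-test-1',
    template_path='xgb_pipe.json'
)
job.run()
# "Run Name" now starts with "automl-image-training-v2" plus the appended date and time.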
I am trying to create an Azure Batch job with a task that uses output_files as a task parameter:
tasks = list()
command_task = (r"cmd /c dir")
# Not providing actual property value for security purpose
containerName = r'ContainerName'
azureStorageAccountName = r'AccountName'
azureStorageAccountKey = r'AccountKey'
sas_Token = generate_account_sas(account_name=azureStorageAccountName, account_key=azureStorageAccountKey, resource_types=ResourceTypes(object=True), permission=AccountSasPermissions(read=True, write=True), expiry=datetime.datetime.utcnow() + timedelta(hours=1))
url = f"https://{azureStorageAccountName}.blob.core.windows.net/{containerName}?{sas_Token}"
output_file = batchmodels.OutputFile(
    file_pattern=r"..\std*.txt",
    destination=batchmodels.OutputFileDestination(
        container=batchmodels.OutputFileBlobContainerDestination(container_url=url),
        path="abc"),
    upload_options='taskCompletion')
tasks.append(batchmodels.TaskAddParameter(id='Task1', display_name='Task1', command_line=command_task, user_identity=user, output_files=[output_file]))
batch_service_client.task.add_collection(job_id, tasks)
On debugging this code I am getting an exception.
But on removing the output_files parameter, everything works fine and the job is created with the task.
I had missed the OutputFileUploadOptions wrapper while creating the OutputFile object; upload_options needs to be a batchmodels.OutputFileUploadOptions, not a plain string:
output_file = batchmodels.OutputFile(
    file_pattern=r"..\std*.txt",
    destination=batchmodels.OutputFileDestination(
        container=batchmodels.OutputFileBlobContainerDestination(container_url=url),
        path="abc"),
    upload_options=batchmodels.OutputFileUploadOptions('taskCompletion'))
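For completeness, the same fix written with the keyword argument and the upload-condition enum (a sketch, assuming current azure-batch models where OutputFileUploadOptions takes upload_condition and OutputFileUploadCondition.task_completion is the matching enum value):
output_file = batchmodels.OutputFile(
    file_pattern=r"..\std*.txt",
    destination=batchmodels.OutputFileDestination(
        container=batchmodels.OutputFileBlobContainerDestination(container_url=url),
        path="abc"),
    upload_options=batchmodels.OutputFileUploadOptions(
        upload_condition=batchmodels.OutputFileUploadCondition.task_completion))

tasks.append(batchmodels.TaskAddParameter(id='Task1', display_name='Task1', command_line=command_task, user_identity=user, output_files=[output_file]))
batch_service_client.task.add_collection(job_id, tasks)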
I've been banging my head against a wall trying to make this work.
I'm attempting to use Python/boto to create a CloudWatch alarm that recovers a failed EC2 instance.
I'm having difficulty getting the ec2:RecoverInstances action to work. I suspect my topic isn't set up correctly.
topics = sns_conn.get_all_topics()
topic = topics[u'ListTopicsResponse']['ListTopicsResult']['Topics'][0]['TopicArn']
# arn:aws:sns:us-east-1:*********:CloudWatch
status_check_failed_alarm = boto.ec2.cloudwatch.alarm.MetricAlarm(
    connection=cw_conn,
    name=_INSTANCE_NAME + "RECOVERY-High-Status-Check-Failed-Any",
    metric='StatusCheckFailed',
    namespace='AWS/EC2',
    statistic='Average',
    comparison='>=',
    description='status check for %s %s' % (_INSTANCE, _INSTANCE_NAME),
    threshold=1.0,
    period=60,
    evaluation_periods=5,
    dimensions={'InstanceId': _INSTANCE},
    # alarm_actions=[topic],
    ok_actions=[topic],
    insufficient_data_actions=[topic])
# status_check_failed_alarm.add_alarm_action('arn:aws:sns:us-east-1:<acct#>:ec2:recover')
# status_check_failed_alarm.add_alarm_action('arn:aws:sns:us-east-1:<acct#>:ec2:RecoverInstances')
status_check_failed_alarm.add_alarm_action('ec2:RecoverInstances')
cw_conn.put_metric_alarm(status_check_failed_alarm)
Any suggestions would be highly appreciated.
Thank you.
--MIke
I think the issue is that these alarm actions do not have an account ID in the ARN. The CLI reference documents the valid ARNs:
Valid Values: arn:aws:automate:region:ec2:stop | arn:aws:automate:region:ec2:terminate | arn:aws:automate:region:ec2:recover
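So in the original code the recover action would look something like this (assuming us-east-1, to match the topic ARN above):
status_check_failed_alarm.add_alarm_action('arn:aws:automate:us-east-1:ec2:recover')
cw_conn.put_metric_alarm(status_check_failed_alarm)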
I would think it is easier to pull the metric from AWS and create an alarm from that rather than trying to construct it from the ground up, e.g. (untested code):
topics = sns_conn.get_all_topics()
topic = topics[u'ListTopicsResponse']['ListTopicsResult']['Topics'][0]['TopicArn']
metric = cloudwatch_conn.list_metrics(dimensions={'InstanceId': _INSTANCE},
                                      metric_name="StatusCheckFailed")[0]
alarm = metric.create_alarm(name=_INSTANCE_NAME + "RECOVERY-High-Status-Check-Failed-Any",
                            description='status check for {} {}'.format(_INSTANCE, _INSTANCE_NAME),
                            alarm_actions=[topic, 'arn:aws:automate:us-east-1:ec2:recover'],
                            ok_actions=[topic],
                            insufficient_data_actions=[topic],
                            statistic='Average',
                            comparison='>=',
                            threshold=1.0,
                            period=60,
                            evaluation_periods=5)