I made a minimal Pipeline with a unique step in AML. I've publish this pipeline and I have and id and REST endpoint for it.
When I try to create a schedule on this pipeline, I get no error, but it will never launch.
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline
datastore = ws.get_default_datastore()
minimal_run_config = RunConfiguration()
minimal_run_config.environment = myenv # Custom Env with Dockerfile from mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest + openSDK 11 + pip/conda packages
step_name = experiment_name
script_step_1 = PythonScriptStep(
name=step_name,
script_name="main.py",
arguments=args,
compute_target=cpu_cluster,
source_directory=str(source_path),
runconfig=minimal_run_config,
)
pipeline = Pipeline(
workspace=ws,
steps=[
script_step_1,
],
)
pipeline.validate()
pipeline.publish(name=experiment_name + "_pipeline")
I can trigger this pipeline with REST python
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.pipeline.core import PublishedPipeline
import requests
auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()
pipelines = PublishedPipeline.list(ws)
rest_endpoint1 = [p for p in pipelines if p.name == experiment_name + "_pipeline"][0]
response = requests.post(rest_endpoint1.endpoint,
headers=aad_token,
json={"ExperimentName": experiment_name,
"RunSource": "SDK",
"ParameterAssignments": {"KEY": "value"}})
But when I use the Schedule, I have no warning, no error and nothing is triggered if I use start_time from ScheduleRecurrence. If I don't user start_time, my pipeline is triggered and launch immediately. And I don't want this. For example I'm running the Schedule setter today, but I want it's first trigger to run only the second of each month at 4pm.
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule
import datetime
first_run = datetime.datetime(2022, 10, 2, 16, 00)
schedule_name = f"Recocpc monthly run PP {first_run.day:02} {first_run.hour:02}:{first_run.minute:02}"
recurrence = ScheduleRecurrence(
frequency="Month",
interval=1,
start_time=first_run,
)
recurrence.validate()
recurring_schedule = Schedule.create_for_pipeline_endpoint(
ws,
name=schedule_name,
description="Recocpc monthly run PP",
pipeline_endpoint_id=pipeline_endpoint.id,
experiment_name=experiment_name,
recurrence=recurrence,
pipeline_parameters={"KEY": "value"}
)
If I comment start_time, It will work, but the first run is now, and not when I want.
So I was not aware on how start_time was working. It is using DAGs logic like in airflow.
Here is an example:
Today is 10-01-2022 (dd-mm-yyy)
You want your pipeline to run every month, once the on the 10th of each month at 14:00.
Then your start_time is not 2022-01-10T14:00:00, but should be 2021-12-10T14:00:00.
Your scheduler will only trigger if it has made a full revolution of what you are asking him (here one month).
Maybe official documentation should be more explicit on this mecanism for neewbies like me that never used DAGs in their lives.
Related
I'm using a parameter that is the timestamp in a set of tasks:
default_dag_args = {'arg1': 'arg1-value',
'arg2': 'arg2-value',
'now': datetime.now()}
I would like that the now parameter would have the same value for all the tasks. But what happens is that it's re-executed for each function
Is there a way of doing it (executing once and using the same value through the dag)? I'm using the TaskFlow API for Airflow 2.0:
#task
def python_task()
context = get_current_context()
context_dag = context['dag']
now = context_dag.default_args['now']
print now
I tried to set the time constant, at the start of the dag file, like:
TIME = datetime.now()
and got the context inside of the tasks with get_current_context() just like you did.
Sadly, I think because of running the DAG file from start, every time a task got defined in the script, time was recalculated.
One idea I have is to use XCOM's in order to save the datetime to a variable and pull it to other tasks:
My sample code is below, I think you'll get the idea.
from airflow.decorators import task, dag
from datetime import datetime
import time
default_arguments = {
'owner': 'admin',
# This is the beginning, for more see: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
'start_date': datetime(2022, 5, 2)
}
#dag(
schedule_interval=None,
dag_id = "Time_Example_Dag",
default_args = default_arguments,
catchup=False,
)
def the_global_time_checker_dag():
#task
def time_set():
# To use XCOM to pass the value between tasks,
# we have to parse the datetime to a string.
now = str(datetime.now())
return now
#task
def starting_task(datetime_string):
important_number = 23
# We can use this datetime object in whatever way we like.
date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
print(date_time_obj)
return important_number
#task
def important_task(datetime_string, number):
# Passing some time
time.sleep(10)
# Again, we are free to do whatever we want with this object.
date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
print(date_time_obj)
print("The important number is: {}".format(number))
time_right_now = time_set()
start = starting_task(datetime_string = time_right_now)
important = important_task(datetime_string = time_right_now, number = start)
time_checker = the_global_time_checker_dag()
Through the logs, you can see all the datetime values are the same.
For more information about XCOM in Taskflow API, you can check here.
When a worker gets a task instance to run, it rebuilds the whole DagBag from the Python files to get the DAG and task definition. So every time a task instance is ran, your DAG file is sourced, rerunning your DAG definition code. And that resulting DAG object is the one that the particular task instance will be defined by.
It's critical to understand that the DAG definition is not simply built once for every execution date and then persisted/reused for all TIs within that DagRun. The DAG definition is constantly being recomputed from your Python code, each TI is ran in a separate process independently and without state from other tasks. Thus, if your DAG definition includes non-deterministic results at DagBag build time - such as datetime.now() - every instantiation of your DAG even for the same execution date will have different values. You need to build your DAGs in a deterministic and idempotent manner.
The only way to share non-deterministic results is to store them in the DB and have your tasks fetch them, as #sezai-burak-kantarcı has pointed out. Best practice is to use task context-specific variables, like {{ ds }}, {{ execution_date }}, {{ data_interval_start }}, etc. These are the same for all tasks within a DAG run. You can see the template variables available here: Airflow emplates reference
The situation
I have a celery task I am running at different timezone for each customer.
Basically, for each customer in my database, I get the timezone, and then I set up the celery task this way.
'schedule': crontab(minute=30, hour=14, nowfun=self.now_function)
Basically, what I want is the task to run at 14:30, at the customer timezone. Hence the now_function.
My now_function is just getting the current time with the customer timezone.
def now_function(self):
"""
return the now function for the task
this is used to compute the time a task should be scheduled for a given customer
"""
return datetime.now(timezone(self.customer.c_timezone))
What is happening
I am getting inconsistencies in the time the task run, sometimes they run at the expected time, so let's say 14:30 in the customer time zone, if the timezone is America/Chicago it runs at 20:30 and that is my expected behavior.
Some other days, it runs at 14:30, which is just the time in UTC.
I am tracking to see if there is a pattern in the day the task run at the correct time and the day the cards run at the incorrect time.
Additional Information
I have tried this on celery 4.4.2 and 5.xx but it is still has the same behavior.
Here is my celery config.
CELERY_REDIS_SCHEDULER_URL = redis_instance_url
logger.debug("****** CELERY_REDIS_SCHEDULER_URL: ", CELERY_REDIS_SCHEDULER_URL)
logger.debug("****** environment: ", environment)
redbeat_redis_url = CELERY_REDIS_SCHEDULER_URL
broker_url = CELERY_REDIS_SCHEDULER_URL
result_backend = CELERY_REDIS_SCHEDULER_URL
task_serializer = 'pickle'
result_serializer = 'pickle'
accept_content = ['pickle']
enable_utc = False
task_track_started = True
task_send_sent_event = True
You can notice enable_utc is set to False.
I am using Redis instance from AWS to run my task.
I am using the RedBeatScheduler scheduler from this package to schedule my tasks.
If anyone has experienced this issue or can help me to reproduce it, I will be very thankful.
Other edits:
I have another cron for the same job at the same time but running weekly and monthly but they are working perfectly.
weekly_schedule : crontab(minute=30, hour=14, nowfun=self.now_function, day_of_week=1)
monthly_schedule : crontab(minute=30, hour=14, nowfun=self.now_function, day_of_month=1)
Sample Project
Here is a sample project on GitHub if you want to run and reproduce the issue.
RedBeat's encoder and decoder don't support nowfun.
Source code: https://github.com/sibson/redbeat/blob/e6d72e2/redbeat/decoder.py#L94-L102
The behaviour you see was described previously: sibson/redbeat#192 (comment 756397651)
You can subclass and replace RedBeatJSONDecoder and RedBeatJSONEncoder.
Since nowfun has to be JSON serializable, we can only support some special cases,
e.g. nowfun=partial(datetime.now, tz=pytz.timezone(self.customer.c_timezone))
from datetime import datetime
from functools import partial
from celery.schedules import crontab
import pytz
from pytz.tzinfo import DstTzInfo
from redbeat.decoder import RedBeatJSONDecoder, RedBeatJSONEncoder
class CustomJSONDecoder(RedBeatJSONDecoder):
def dict_to_object(self, d):
if '__type__' not in d:
return d
objtype = d.pop('__type__')
if objtype == 'crontab':
if d.get('nowfun', {}).get('keywords', {}).get('zone'):
d['nowfun'] = partial(datetime.now, tz=pytz.timezone(d.pop('nowfun')['keywords']['zone']))
return crontab(**d)
d['__type__'] = objtype
return super().dict_to_object(d)
class CustomJSONEncoder(RedBeatJSONEncoder):
def default(self, obj):
if isinstance(obj, crontab):
d = super().default(obj)
if 'nowfun' not in d and isinstance(obj.nowfun, partial) and obj.nowfun.func == datetime.now:
zone = None
if obj.nowfun.args and isinstance(obj.nowfun.args[0], DstTzInfo):
zone = obj.nowfun.args[0].zone
elif isinstance(obj.nowfun.keywords.get('tz'), DstTzInfo):
zone = obj.nowfun.keywords['tz'].zone
if zone:
d['nowfun'] = {'keywords': {'zone': zone}}
return d
return super().default(obj)
Replace the classes in redbeat.schedulers:
from redbeat import schedulers
schedulers.RedBeatJSONDecoder = CustomJSONDecoder
schedulers.RedBeatJSONEncoder = CustomJSONEncoder
I am trying to get how long every ec2 instances are running for 'running' instances. For this I need to get last restart time for all instances and compare that with today. Found that I can get restart or start event from cloudtrail but can't figure out how to get only 'start' time from there. Is there a way to find that information for a couple of regions?
import boto3
import datetime
from datetime import date
import subprocess
regions = ['us-west-2', 'us-west-1', 'us-east-1','us-east-2','ap-south-1', 'ap-southeast-1','ca-central-1','eu-west-1','eu-west-3']
for region in regions:
session = boto3.session.Session(region_name=region)
ec2 = session.resource('ec2')
cloudtrail = boto3.client('cloudtrail')
for i in ec2.instances.all():
Id = i.id
State = i.state['Name']
Launchtime = i.launch_time
InstanceType = i.instance_type
Platform = str(i.platform)
currenttime = datetime.datetime.now(Launchtime.tzinfo)
time_diff = currenttime - Launchtime
uptime = str(time_diff)
Here uptime is giving time difference between launch time and current time, which is not correct as most of the instances been restarted lot of time already. So I need to find last start time for all running instances.
I tweaked my code little bit and made like this:
import boto3
import datetime
from datetime import date
import subprocess
regions = ['us-west-2', 'us-west-1', 'us-east-1']
for region in regions:
session = boto3.session.Session(region_name=region)
ec2 = session.resource('ec2')
for i in ec2.instances.all():
Id = i.id
State = i.state['Name']
Launchtime = i.launch_time
InstanceType = i.instance_type
Platform = str(i.platform)
currenttime = datetime.datetime.now(Launchtime.tzinfo)
time_diff = currenttime - Launchtime
uptime = str(time_diff)
if i.state['Name'] == 'stopped':
uptime = ' '
Here it's showing uptime for only running instances. Found out Launch time is showing as last time these instances started (Launched). Cross checked in console and those were right. Then checked up time for the Linux instances (sudo uptime) and those matched with output.
My Dataflow job (Job ID: 2020-08-18_07_55_15-14428306650890914471) is not scaling past 1 worker, despite Dataflow setting the target workers to 1000.
The job is configured to query the Google Patents BigQuery dataset, tokenize the text using a ParDo custom function and the transformers (huggingface) library, serialize the result, and write everything to a giant parquet file.
I had assumed (after running the job yesterday, which mapped a function instead of using a beam.DoFn class) that the issue was some non-parallelizing object eliminating scaling; hence, refactoring the tokenization process as a class.
Here's the script, which is run from the command line with the following command:
python bq_to_parquet_pipeline_w_class.py --extra_package transformers-3.0.2.tar.gz
The script:
import os
import re
import argparse
import google.auth
import apache_beam as beam
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.runners import DataflowRunner
from apache_beam.io.gcp.internal.clients import bigquery
import pyarrow as pa
import pickle
from transformers import AutoTokenizer
print('Defining TokDoFn')
class TokDoFn(beam.DoFn):
def __init__(self, tok_version, block_size=200):
self.tok = AutoTokenizer.from_pretrained(tok_version)
self.block_size = block_size
def process(self, x):
txt = x['abs_text'] + ' ' + x['desc_text'] + ' ' + x['claims_text']
enc = self.tok.encode(txt)
for idx, token in enumerate(enc):
chunk = enc[idx:idx + self.block_size]
serialized = pickle.dumps(chunk)
yield serialized
def run(argv=None, save_main_session=True):
query_big = '''
with data as (
SELECT
(select text from unnest(abstract_localized) limit 1) abs_text,
(select text from unnest(description_localized) limit 1) desc_text,
(select text from unnest(claims_localized) limit 1) claims_text,
publication_date,
filing_date,
grant_date,
application_kind,
ipc
FROM `patents-public-data.patents.publications`
)
select *
FROM data
WHERE
abs_text is not null
AND desc_text is not null
AND claims_text is not null
AND ipc is not null
'''
query_sample = '''
SELECT *
FROM `client_name.patent_data.patent_samples`
LIMIT 2;
'''
print('Start Run()')
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
'''
Configure Options
'''
# Setting up the Apache Beam pipeline options.
# We use the save_main_session option because one or more DoFn's in this
# workflow rely on global context (e.g., a module imported at module level).
options = PipelineOptions(pipeline_args)
options.view_as(SetupOptions).save_main_session = save_main_session
# Sets the project to the default project in your current Google Cloud environment.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()
# Sets the Google Cloud Region in which Cloud Dataflow runs.
options.view_as(GoogleCloudOptions).region = 'us-central1'
# IMPORTANT! Adjust the following to choose a Cloud Storage location.
dataflow_gcs_location = 'gs://client_name/dataset_cleaned_pq_classTok'
# Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = f'{dataflow_gcs_location}/staging'
# Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = f'{dataflow_gcs_location}/temp'
# The directory to store the output files of the job.
output_gcs_location = f'{dataflow_gcs_location}/output'
print('Options configured per GCP Notebook Examples')
print('Configuring BQ Table Schema for Beam')
#Write Schema (to PQ):
schema = pa.schema([
('block', pa.binary())
])
print('Starting pipeline...')
with beam.Pipeline(runner=DataflowRunner(), options=options) as p:
res = (p
| 'QueryTable' >> beam.io.Read(beam.io.BigQuerySource(query=query_big, use_standard_sql=True))
| beam.ParDo(TokDoFn(tok_version='gpt2', block_size=200))
| beam.Map(lambda x: {'block': x})
| beam.io.WriteToParquet(os.path.join(output_gcs_location, f'pq_out'),
schema,
record_batch_size=1000)
)
print('Pipeline built. Running...')
if __name__ == '__main__':
import logging
logging.getLogger().setLevel(logging.INFO)
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)
run()
The solution is twofold:
The following quotas were being exceeded when I ran my job, all under 'Compute Engine API' (view your quotas here: https://console.cloud.google.com/iam-admin/quotas):
CPUs (I requested an increase to 50)
Persistent Disk Standard (GB) (I requested an increase to 12,500)
In_Use_IP_Address (I requested an increase to 50)
Note: If you read the console output while your job is running, any exceeded quotas should print out as an INFO line.
Following Peter Kim's advice above, I passed the flag --max_num_workers as part of my command:
python bq_to_parquet_pipeline_w_class.py --extra_package transformers-3.0.2.tar.gz --max_num_workers 22
And I started scaling!
All in all, it would be nice if there was a way to prompt users via the Dataflow console when a quota is being reached, and provide an easy means to request an increase to that (and recommended complementary) quotas, along with suggestions for what the increased amount to be requested should be.
I am using Gcloud Composer to launch Dataflow jobs.
My DAG consist of two Dataflow jobs that should be run one after the other.
import datetime
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow import models
default_dag_args = {
'start_date': datetime.datetime(2019, 10, 23),
'dataflow_default_options': {
'project': 'myproject',
'region': 'europe-west1',
'zone': 'europe-west1-c',
'tempLocation': 'gs://somebucket/',
}
}
with models.DAG(
'some_name',
schedule_interval=datetime.timedelta(days=1),
default_args=default_dag_args) as dag:
parameters = {'params': "param1"}
t1 = DataflowTemplateOperator(
task_id='dataflow_example_01',
template='gs://path/to/template/template_001',
parameters=parameters,
dag=dag)
parameters2 = {'params':"param2"}
t2 = DataflowTemplateOperator(
task_id='dataflow_example_02',
template='gs://path/to/templates/template_002',
parameters=parameters2,
dag=dag
)
t1 >> t2
When I check in dataflow the job has succeeded, all the files it is supposed to make are created, but it appears it ran in US region, the cloud composer environment is in Europe west.
In airflow I can see that the first job is still running so the second one is not launched
What should I add to the DAG to make it succeed? How do I run in Europe?
Any advice or solution on how to proceed would be most appreciated. Thanks!
I had to solve this issue in the past. In Airflow 1.10.2 (or lower) the code calls to the service.projects().templates().launch() endpoint. This was fixed in 1.10.3 where the regional one is used instead: service.projects().locations().templates().launch().
As of October 2019, the latest Airflow version available for Composer environments is 1.10.2. If you need a solution immediately, the fix can be back-ported into Composer.
For this we can override the DataflowTemplateOperator for our own version called RegionalDataflowTemplateOperator:
class RegionalDataflowTemplateOperator(DataflowTemplateOperator):
def execute(self, context):
hook = RegionalDataFlowHook(gcp_conn_id=self.gcp_conn_id,
delegate_to=self.delegate_to,
poll_sleep=self.poll_sleep)
hook.start_template_dataflow(self.task_id, self.dataflow_default_options,
self.parameters, self.template)
This will now make use of the modified RegionalDataFlowHook which overrides the start_template_dataflow method of the DataFlowHook operator to call the correct endpoint:
class RegionalDataFlowHook(DataFlowHook):
def _start_template_dataflow(self, name, variables, parameters,
dataflow_template):
...
request = service.projects().locations().templates().launch(
projectId=variables['project'],
location=variables['region'],
gcsPath=dataflow_template,
body=body
)
...
return response
Then, we can create a task using our new operator and a Google-provided template (for testing purposes):
task = RegionalDataflowTemplateOperator(
task_id=JOB_NAME,
template=TEMPLATE_PATH,
parameters={
'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
'output': 'gs://{}/europe/output'.format(BUCKET)
},
dag=dag,
)
Full working DAG here. For a cleaner version, the operator can be moved into a separate module.