Reversed upstream/downstream relationships when generating multiple tasks in Airflow - python

The original code related to this question can be found here.
I'm confused by how both the bitshift operators and the set_upstream/set_downstream methods are working within a task loop that I've defined in my DAG. When the main execution loop of the DAG is configured as follows:
for uid in dash_workers.get_id_creds():
    clear_tables.set_downstream(id_worker(uid))
or
for uid in dash_workers.get_id_creds():
    clear_tables >> id_worker(uid)
The graph looks like this (the alpha-numeric sequences are the user IDs, which also define the task IDs):
When I configure the main execution loop of the DAG like this:
for uid in dash_workers.get_id_creds():
    clear_tables.set_upstream(id_worker(uid))
or
for uid in dash_workers.get_id_creds():
    id_worker(uid) >> clear_tables
the graph looks like this:
The second graph is what I want, and what I would have expected the first two snippets of code to produce based on my reading of the docs. If I want clear_tables to execute first, before triggering my batch of data-parsing tasks for the different user IDs, should I indicate this as clear_tables >> id_worker(uid)?
EDIT -- Here's the complete code, which has been updated since I posted the last few questions, for reference:
from datetime import datetime
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')

if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environment variables')
    sys.exit(1)

ENV = os.environ

default_args = {
    'start_date': datetime.now(),
}

DAG = DAG(
    dag_id='dash_preproc',
    default_args=default_args
)

clear_tables = PythonOperator(
    task_id='clear_tables',
    python_callable=dash_workers.clear_db,
    dag=DAG)

def id_worker(uid):
    return PythonOperator(
        task_id=str(uid),  # the user ID doubles as the task ID
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=DAG)

for uid in dash_workers.get_id_creds():
    preproc_task = id_worker(uid)
    clear_tables << preproc_task
After implementing @LadislavIndra's suggestion, I still have to use the reversed bitshift operator to get the correct dependency graph.
UPDATE: @AshBerlin-Taylor's answer is what's going on here. I assumed that Graph View and Tree View were doing the same thing, but they're not. Here's what id_worker(uid) >> clear_tables looks like in Graph View:
I certainly don't want the final step in my data pre-prep routine to be to delete all data tables!

The tree view in Airflow is "backwards" to how you (and I!) first thought about it. In your first screenshot it is showing that "clear_tables" must be run before the "AAAG5608078M2" run task. And the DAG status depends upon each of the id worker tasks. So instead of a task order, it's a tree of the status chain. If that makes any sense at all.
(This might seem strange at first, but it's because a DAG can branch out and branch back in.)
You might have better luck looking at the Graph View for your DAG. This one has arrows and shows the execution order in a more intuitive way. (Though I do now find the Tree View useful; it's just less clear to start with.)

Looking through your other code, it seems get_id_creds is your task and you're trying to loop through it, which is creating some weird interaction.
A pattern that will work is:
clear_tables = MyOperator()

for uid in uid_list:
    my_task = MyOperator(task_id=uid)
    clear_tables >> my_task
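Applied to your DAG, a minimal sketch of that pattern might look like the following. It assumes dash_workers.get_id_creds() is a plain Python function that returns a list of ID strings at DAG-parse time (i.e. it is not itself an Airflow task):

from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import dash_workers

dag = DAG(
    dag_id='dash_preproc',
    default_args={'start_date': datetime.now()},
)

clear_tables = PythonOperator(
    task_id='clear_tables',
    python_callable=dash_workers.clear_db,
    dag=dag)

# One preprocessing task per user ID; clear_tables runs first, then all ID tasks.
for uid in dash_workers.get_id_creds():
    id_task = PythonOperator(
        task_id=str(uid),
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=dag)
    clear_tables >> id_task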

Can we use for loop to create apache beam data flow pipeline dynamically?
My fear is how the for loop will behave in a distributed environment when I am using it with the Dataflow runner. I am sure this will work fine with the direct runner.
For example, can I create pipelines dynamically like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    for p in cdata['tablelist']:
        i_file_path = p['sourcefile']
        schemauri = p['schemauri']
        schema = getschema(schemauri)
        dest_table_id = p['targettable']
        (
            pipeline
            | "Read From Input Datafile " + dest_table_id >> beam.io.ReadFromText(i_file_path)
            | "Convert to Dict " + dest_table_id >> beam.Map(lambda r: data_ingestion.parse_method(r))
            | "Write to BigQuery Table " + dest_table_id >> beam.io.WriteToBigQuery(
                '{0}:{1}'.format(project_name, dest_table_id),
                schema=schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )
Yes, this is totally legal, and lots of pipelines (especially ML ones) are constructed this way. Your looped pipeline construction above should work just fine on all runners.
You can think of a Beam pipeline as having two phases: construction and execution. The first phase, construction, happens entirely in the main program and can have arbitrary loops, control statements, etc. Behind the scenes, this builds up a DAG of deferred operations (such as reads, maps, etc.) to perform. If you have a loop, each iteration will simply append more operations to this graph. The only thing you can't do in this phase is inspect the data (i.e. the contents of a PCollection) itself.
The second phase, execution, starts when pipeline.run() is invoked. (For Python, this is implicitly invoked on exiting the with block). At this point the pipeline graph (as constructed above), its dependencies, pipeline options, etc. are passed to a Runner which will actually execute the fully-specified graph, ideally in parallel.
This is covered a bit in the programming guide, though I agree it could be more clear.
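As a toy illustration of those two phases (the data here is made up; only the timing of what happens matters), the loop below just keeps appending labelled transforms to the graph, and nothing executes until the with block exits:

import apache_beam as beam

# Construction phase: each iteration only appends deferred operations to the graph.
with beam.Pipeline() as pipeline:
    for name in ['a', 'b', 'c']:
        (
            pipeline
            | 'Create ' + name >> beam.Create([1, 2, 3])
            | 'Double ' + name >> beam.Map(lambda x: x * 2)
            | 'Print ' + name >> beam.Map(print)
        )
# Execution phase: pipeline.run() is invoked implicitly when the with block exits,
# and only then does the fully-specified graph run on the chosen runner.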
I think it's not possible.
You have many other solutions to do that.
If you have an orchestrator like Cloud Composer/Airflow or Cloud Workflows, you can put this logic inside the orchestrator and instantiate and launch a Dataflow job per element in the loop:
Solution 1, example with Airflow:
end_task = DummyOperator(task_id='END', dag=dag)

for p in cdata['tablelist']:
    i_file_path = p['sourcefile']
    schemauri = p['schemauri']
    dest_table_id = p['targettable']

    options = {
        'i_file_path': i_file_path,
        'dest_table_id': dest_table_id,
        'schemauri': schemauri,
        ...
    }

    dataflow_task = DataflowCreatePythonJobOperator(
        py_file=beam_main_file_path,
        task_id=f'task_{dest_table_id}',
        dataflow_default_options=your_default_options,
        options=options,
        gcp_conn_id="google_cloud_default"
    )

    # You can execute your Dataflow jobs in parallel
    dataflow_task >> end_task
Solution 2, with a shell script:
for module_path in ${tablelist}; do
    # Options
    i_file_path=...
    schemauri=...
    dest_table_id=...

    # Python command to execute the Dataflow job
    python -m your_module.main \
        --runner=DataflowRunner \
        --staging_location=gs://your_staging_location/ \
        --temp_location=gs://your_temp_location/ \
        --region=europe-west1 \
        --setup_file=./setup.py \
        --i_file_path=$i_file_path \
        --schemauri=$schemauri \
        --dest_table_id=$dest_table_id
done
In this case the Dataflow jobs are executed sequentially.
If you have too many files and too many Dataflow jobs to launch, you can think about another solution.
With a shell script or a Cloud Function you can collect all the needed files, rename them as expected (with the metadata in the filename), and move them to a separate location in GCS.
Then, in a single Dataflow job (a sketch follows below):
read all the previous files via a pattern
parse the metadata from the filename, such as schemauri and dest_table_id
apply the map operation on the current element
write the result to BigQuery
If you don't have a huge number of files, the first two solutions are simpler.
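A rough sketch of that single-job variant, assuming a hypothetical filename convention of <dest_table_id>__<schemauri>.csv and leaving the actual row parsing and BigQuery write as placeholders, could use beam.io.fileio so each record keeps the path it came from:

import apache_beam as beam
from apache_beam.io import fileio

def attach_metadata(readable_file):
    # Hypothetical convention: gs://your_bucket/staged/<dest_table_id>__<schemauri>.csv
    filename = readable_file.metadata.path.rsplit('/', 1)[-1].rsplit('.', 1)[0]
    dest_table_id, schemauri = filename.split('__', 1)
    for line in readable_file.read_utf8().splitlines():
        yield {'dest_table_id': dest_table_id, 'schemauri': schemauri, 'raw': line}

with beam.Pipeline() as pipeline:
    records = (
        pipeline
        | 'Match files' >> fileio.MatchFiles('gs://your_bucket/staged/*.csv')
        | 'Read matches' >> fileio.ReadMatches()
        | 'Attach metadata' >> beam.FlatMap(attach_metadata)
    )
    # The write step could then route rows dynamically, e.g. with
    # beam.io.WriteToBigQuery(table=lambda rec: '{}:{}'.format(project_name, rec['dest_table_id']), ...)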

Prefect workflow: How to persist data of previous/every schedule run?

In a Prefect workflow, I'm trying to persist data from every scheduled run. I need to compare the data of every previous and current result. I tried LocalResult and checkpoint=True, but it's not working. For example:
from prefect import Flow, task
from prefect.engine.results import LocalResult
from prefect.schedules import IntervalSchedule
from datetime import timedelta, datetime
import os
import prefect

@task("func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def file_scan():
    files = os.listdir(test)
    # prefect.context.a = files
    return files

schedule = IntervalSchedule(interval=timedelta(seconds=61))

with Flow("Test persist data", schedule) as flow:
    a = file_scan()

flow.run()
My flow is scheduled to run every 61 seconds (roughly every minute). On the first run I might get an empty result, but on the 2nd scheduled run I should get the previous flow's result to compare against. Can anyone help me achieve this? Thanks!
Update (15 November 2021):
I'm not sure what the reason is, but LocalResult and checkpoint actually worked when I ran the registered flow through the dashboard or the CLI (prefect run -n "your-workflow.py" --watch). It doesn't work when I manually trigger the flow (e.g. flow.run()) in Python code.
Try the following two options:
Option 1: using the target argument:
https://docs.prefect.io/core/concepts/persistence.html#output-caching-based-on-a-file-target
@task(target="func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def func_task():
    return "999"
Option 2: instantiate a LocalResult instance and invoke write manually.
MY_RESULTS = LocalResult(dir="./.prefect")

@task(checkpoint=True, result=LocalResult(dir="./.prefect"))
def func_task():
    MY_RESULTS.write("999")
    return "999"
P.S.: I was having the same problem, as LocalResult doesn't seem to work for me when used in the decorator, e.g.:
@task("func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def file_scan():
    ...

Can't use python variable in jinja template with Airflow

I am trying to use Airflow to run 11 steps on AWS EMR, following this code as reference. Since using EmrAddStepsOperator and EmrStepSensor for 11 steps would be too much repetition, I am trying to loop through them. I have used the below code in my DAG.
step_adder = list()
step_checker = list()
steps = ['step1', 'step2', 'step3', 'step4', 'step5', 'step6'...till step11]

# @evalcontextfilter
# def dangerous_render(context, value):
#     return Markup(Template(value).render(context)).render()

for i in range(0, len(steps)):
    # Add step
    step_adder.append(EmrAddStepsOperator(
        task_id=steps[i],
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
        aws_conn_id='aws_default',
        steps=eval('step_' + str(i + 1)),
    ))
    print(step_adder)

    # Step sensor for checking
    step_checker.append(EmrStepSensor(
        task_id=steps[i] + '_check',
        job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
        # step_id="{{"task_instance.xcom_pull(task_ids={}, key='return_value')[0]",steps[i]}}",
        step_id='(Template("{{ "task_instance.xcom_pull(task_ids=params.step, key='return_value')[0] }}").render({'params': {'step': steps[i]}}))',
        aws_conn_id='aws_default',
    ))
I am facing an error here: EmrStepSensor expects a step_id from EMR, which is fetched from XCom (I guess; I am not 100% sure how this code works). But my steps are stored in the steps list, so I can't give a static task_id inside step_id like the reference code does, and I can't figure out how to use a Jinja template with a Python variable to fill in values from the steps list.
I tried both of the ways below so that step_id can fetch the correct step from EMR according to the step name in steps[i]:
step_id="{{"task_instance.xcom_pull(task_ids={}, key='return_value')[0]",steps[i]}}",
step_id='(Template("{{ "task_instance.xcom_pull(task_ids=params.step, key='return_value')[0] }}")
However, both of these failed with a syntax error in Airflow. If anyone can point me in the right direction, I would really appreciate it. I am using Airflow 1.10.12 (the default version in Amazon Managed Workflows for Apache Airflow).
I'm not sure if this is already solved, so:
Using f-strings:
f"{{{{ task_instance.xcom_pull(task_ids='{steps[i]}', key='return_value')[0] }}}}"
Using .format:
"{{{{ task_instance.xcom_pull(task_ids='{}', key='return_value')[0] }}}}".format(steps[i])
Note that you have to make sure that the value of task_ids is wrapped in single quotes. Also, the return from xcom_pull is a list, hence the index [0] at the end.
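Applied to your loop, a sketch of the sensor using the f-string version (the other arguments stay as in your code) would be:

for i in range(0, len(steps)):
    step_checker.append(EmrStepSensor(
        task_id=steps[i] + '_check',
        job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
        # The f-string only injects the task id at DAG-parse time; the Jinja
        # expression itself is still rendered by Airflow at run time.
        step_id=f"{{{{ task_instance.xcom_pull(task_ids='{steps[i]}', key='return_value')[0] }}}}",
        aws_conn_id='aws_default',
    ))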

How to cache/target tasks with the same name in a Flow with prefect?

I am trying to find a target pattern or cache config to differentiate between tasks with the same name in a flow.
As highlighted in the diagram above, only one of the tasks gets cached and the others get overwritten. I tried using task_slug, but to no avail.
@task(
    name="process_resource-{task_slug}",
    log_stdout=True,
    target=task_target
)
Thanks in advance
It looks like you are attempting to format the task name instead of the target (task names are not templatable strings).
The following snippet is probably what you want:
@task(name="process_resource", log_stdout=True, target="{task_name}-{task_slug}")
After further research, it looks like the documentation directly addresses changing task configuration on the fly without breaking target location templates:
@task
def number_task():
    return 42

with Flow("example-v3") as f:
    result = number_task(task_args={"name": "new-name"})

print(f.tasks)  # {<Task: new-name>}

Pyspark multiple jobs in parallel

I have the following situation with my Pyspark:
In my driver program (driver.py), I call a function from another file (prod.py)
latest_prods = prod.featurize_prods().
Driver code:
from Featurize import Featurize
from LatestProd import LatestProd
from Oldprod import Oldprod

sc = SparkContext()

if __name__ == '__main__':
    print 'Into main'
    featurize_latest = Featurize('param1', 'param2', sc)
    latest_prod = LatestProd(featurize_latest)
    latest_prods = latest_prod.featurize_prods()
    featurize_old = Featurize('param3', 'param3', sc)
    old_prods = Oldprod(featurize_old)
    old_prods = oldprod.featurize_oldprods()
    total_prods = sc.union([latest_prods, old_prods])
Then I do some reduceByKey code here... that generates total_prods_processed.
Finally I call:
total_prods_processed.saveAsTextFile(...)
I would like to generate latest_prods and old_prods in parallel. Both are created in the same SparkContext. Is it possible to do that? If not, how can I achieve that functionality?
Is this something that Spark does automatically? I am not seeing this behavior when I run the code, so please let me know if it is a configuration option.
After searching on the internet, I think your problem can be addressed by threads. It is as simple as creating two threads for your old_prod and latest_prod work.
Check this post for a simplified example. Since Spark is thread-safe, you gain the parallel efficiency without sacrificing anything.
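A rough sketch of that approach (names taken from your snippet; the materialize helper and the count() action are my additions so each branch actually gets computed inside its own thread) could look like this:

import threading

results = {}

def materialize(name, make_rdd):
    # Transformations are lazy, so trigger an action here to force this branch
    # to be computed now, and cache it for the later union/reduceByKey.
    rdd = make_rdd()
    rdd.cache()
    rdd.count()
    results[name] = rdd

threads = [
    threading.Thread(target=materialize, args=('latest', latest_prod.featurize_prods)),
    threading.Thread(target=materialize, args=('old', old_prods.featurize_oldprods)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_prods = sc.union([results['latest'], results['old']])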
The short answer is no, you can't schedule operations on two distinct RDDs at the same time in the same SparkContext. However, there are some workarounds: you could process them in two distinct SparkContexts on the same cluster and call saveAsTextFile, then read both in another job to perform the union. (This is not recommended by the documentation.)
If you want to try this method, it is discussed here using spark-jobserver, since Spark doesn't support multiple contexts by default: https://github.com/spark-jobserver/spark-jobserver/issues/147
However, given the operations you perform, there would be no reason to process both at the same time: since you need the full results to perform the union, Spark will split those operations into two different stages that will be executed one after the other.
