Prefect workflow: How to persist data of previous/every scheduled run? - python

In a Prefect workflow, I'm trying to persist data from every scheduled run. I need to compare the previous result with the current one. I tried LocalResult and checkpoint=True, but it's not working. For example:
from prefect import Flow, task
from prefect.engine.results import LocalResult
from prefect.schedules import IntervalSchedule
from datetime import timedelta, datetime
import os
import prefect
@task(target="func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def file_scan():
    files = os.listdir(test)  # 'test' is the directory path being scanned
    # prefect.context.a = files
    return files
schedule = IntervalSchedule(interval=timedelta(seconds=61))

with Flow("Test persist data", schedule) as flow:
    a = file_scan()

flow.run()
My flow is scheduled to run every 61 seconds (roughly every minute). On the first run I might get an empty result, but on the second scheduled run I should have access to the previous flow's result to compare against. Can anyone help me achieve this? Thanks!

Update (15 November 2021):
I'm not sure what the reason is, but LocalResult and checkpoint actually worked when I ran the registered flow through the dashboard or the CLI (prefect run -n "your-workflow.py" --watch). They don't work when I trigger the flow manually in Python code (e.g. with flow.run()).
Try one of the following two options:
Option 1: use the target argument:
https://docs.prefect.io/core/concepts/persistence.html#output-caching-based-on-a-file-target
@task(target="func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def func_task():
    return "999"
Option 2: instantiate a LocalResult instance and invoke its write method manually.
MY_RESULTS = LocalResult(dir="./.prefect")

@task(checkpoint=True, result=LocalResult(dir="./.prefect"))
def func_task():
    MY_RESULTS.write("999")
    return "999"
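Building on Option 2, here is a minimal sketch (not from the original question) of one way to compare the previous run's data with the current run's, by reading and writing a LocalResult manually. It assumes Prefect 1.x; the fixed file name previous_file_scan.prefect and the test directory are placeholders.
import os
from datetime import timedelta

from prefect import Flow, task
from prefect.engine.results import LocalResult
from prefect.schedules import IntervalSchedule

LOCATION = "previous_file_scan.prefect"  # placeholder file name shared by every run
RESULT = LocalResult(dir="./.prefect", location=LOCATION)

@task
def file_scan():
    current = os.listdir("test")  # placeholder directory to scan

    # Load whatever the previous scheduled run stored, if anything.
    previous = RESULT.read(LOCATION).value if RESULT.exists(LOCATION) else []

    new_files = set(current) - set(previous)  # compare previous run vs current run

    # Persist the current listing for the next scheduled run to pick up.
    RESULT.write(current)
    return new_files

schedule = IntervalSchedule(interval=timedelta(seconds=61))

with Flow("Test persist data", schedule) as flow:
    file_scan()

flow.run()
Because the write happens inside the task body (as in Option 2), this also works when the flow is triggered with flow.run() rather than through the dashboard or CLI.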
PS:
I was having the same problem: LocalResult doesn't seem to work for me when used in the decorator, e.g.:
@task(target="func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def file_scan():
    ...

Related

Python - GCP Cloud Function issue with getting yesterday's date

I have a requirement to get yesterday's date in my GCP Cloud Function, which is written in Python 3.9. This is the snippet I am using:
from datetime import datetime, timedelta
cur_dt = datetime.today()
str_dt = str((datetime.today() - timedelta(1)).strftime('%Y-%m-%d'))
print(cur_dt)
print(str_dt)
This works fine in a Jupyter notebook, but if I place the same code in my Cloud Function, it fails to load. This is the error message I am getting: Function failed on loading user code. This is likely due to a bug in the user code.
It would be of great help if anyone could help me fix this error. It's strange, and I don't understand why the Cloud Function isn't accepting something that works fine in a Jupyter notebook.
Many thanks in advance.
As mentioned, you need to follow the Cloud Functions schema, which requires an entrypoint function that takes in a request parameter. Here is a codelab that walks through setting up a Cloud Function.
Your code should be updated to the following:
# imports
from datetime import datetime, timedelta

# entrypoint function with request param
def get_time(request):
    cur_dt = datetime.today()
    str_dt = str((datetime.today() - timedelta(1)).strftime('%Y-%m-%d'))
    # can still print here if you want
    return str_dt
If you are going through the UI make sure to update the entrypoint field to be get_time or whatever you name your entrypoint function.
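If you want to sanity-check the entrypoint locally before deploying, a rough smoke test could look like the sketch below (purely illustrative; the bare placeholder object stands in for the Flask request that Cloud Functions would actually pass, which get_time never uses).
# Minimal local smoke test for the entrypoint above.
class FakeRequest:
    pass

print(get_time(FakeRequest()))  # prints yesterday's date, e.g. '2021-11-14'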

Cloud Composer / Airflow start new task only when Cloud DataFusion task is really finished

I have the following task in Airflow (Cloud Composer) that triggers a Cloud DataFusion pipeline.
The problem is:
Airflow considers this task already a success when (within DataFusion) the DataProc cluster has been provisioned and the actual job has entered the RUNNING state.
But I only want it to be considered a success when it is COMPLETED.
from airflow.providers.google.cloud.operators.datafusion import \
    CloudDataFusionStartPipelineOperator

my_task = CloudDataFusionStartPipelineOperator(
    location='europe-west1',
    pipeline_name="my_datafusion_pipeline_name",
    instance_name="my_datafusion_instance_name",
    task_id="my_task_name",
)
I had to look in the source code, but the following states are the default success_states:
[PipelineStates.COMPLETED] + [PipelineStates.RUNNING]
So you have to limit success_states to only [PipelineStates.COMPLETED], by using the success_states keyword like so:
from airflow.providers.google.cloud.operators.datafusion import \
    CloudDataFusionStartPipelineOperator
from airflow.providers.google.cloud.hooks.datafusion import PipelineStates

my_task = CloudDataFusionStartPipelineOperator(
    location='europe-west1',
    pipeline_name="my_datafusion_pipeline_name",
    instance_name="my_datafusion_instance_name",
    task_id="my_task_name",
    success_states=[PipelineStates.COMPLETED],  # overwrite default success_states
    pipeline_timeout=3600,  # in seconds, default is currently 300 seconds
)
See also:
Airflow documentation on the DataFusionStartPipelineOperator
Airflow source code used for success states of DataFusionStartPipelineOperator

concurrent.futures.map initializes code from the beginning

I am a fairly beginner programmer with Python and not much experience in general, and I'm currently trying to parallelize a heavily CPU-bound part of my code. I'm using Anaconda to create environments and Visual Studio Code to debug.
A summary of the code is as follows:
from tkinter import filedialog
import concurrent.futures

import myfuncs as mf

file_path = filedialog.askopenfilename(title='Ask for a file containing data')
# import data from file_path
a = input('Ask the user for input')
Next, calculations are made from these, and I reach a stage where I need to iterate over a list of lists. These lists may contain up to two values, and calls are made to a separate file. For example, the inputs are:
sub_data1 = [test1]
sub_data2 = [test1, test2]
dataset = [sub_data1, sub_data2]
This is the stage where I use a concurrent.futures.ProcessPoolExecutor() instance and its .map() method:
with concurrent.futures.ProcessPoolExecutor() as executor:
    sm_res = executor.map(mf.process_distr, dataset)
Meanwhile, inside myfuncs.py, the mf.process_distr() function works like this:
def process_distr(tests):
    sm_reg = []
    for i in range(len(tests)):
        if i == 0:
            # do stuff
            sm_reg.append(result1)
        else:
            # do stuff
            sm_reg.append(result2)
    return sm_reg
The problem is that when I execute this code from main.py, main.py appears to start running multiple times: the user-input prompts and the file dialog pop up repeatedly (as many times as there are cores). How can I resolve this?
Edit: After reading more into it, wrapping the whole main.py code in:
if __name__ == '__main__':
did the trick. Thank you to anyone who gave time to help with my rookie problem.
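For reference, a minimal sketch of what the guarded main.py could look like (names follow the question; the data import and calculations are elided, and the two placeholder strings merely stand in for the real test inputs):
from tkinter import filedialog
import concurrent.futures

import myfuncs as mf

def main():
    # Interactive setup now only runs in the parent process.
    file_path = filedialog.askopenfilename(title='Ask for a file containing data')
    a = input('Ask the user for input: ')

    # ... import data from file_path and build the dataset ...
    sub_data1 = ['test1']
    sub_data2 = ['test1', 'test2']
    dataset = [sub_data1, sub_data2]

    with concurrent.futures.ProcessPoolExecutor() as executor:
        sm_res = list(executor.map(mf.process_distr, dataset))
    return sm_res

if __name__ == '__main__':
    # With the spawn start method (Windows/macOS) each worker re-imports this
    # module, so the guard keeps the dialog and input() from running again.
    results = main()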

Python profiling: imports (and especially __init__) seem to take the most time

I have a script that seemed to run slowly, so I profiled it using cProfile (and the visualisation tool KCachegrind).
It seems that almost 90% of the runtime is spent in the import sequence, and especially in running the __init__.py files...
Here is a screenshot of the KCachegrind output (sorry for attaching an image...).
I am not very familiar with how the import sequence works in Python, so maybe I got something confused... I also placed __init__.py files in every one of my custom-made packages; I'm not sure if that was what I should have done.
Anyway, if anyone has any hints, they would be greatly appreciated!
EDIT: additional picture when functions are sorted by self:
EDIT 2:
Here is the code, attached for more clarity for the answerers:
from strategy.strategies.gradient_stop_and_target import make_one_trade
from datetime import timedelta, datetime
import pandas as pd
from data.db import get_df, mongo_read_only, save_one, mongo_read_write, save_many
from data.get import get_symbols
from strategy.trades import make_trade, make_mae, get_prices, get_signals, \
    get_prices_subset
#from profilehooks import profile
mongo = mongo_read_only()

dollar_stop = 200
dollar_target = 400
period_change = 3

signal = get_df(mongo.signals.signals, strategy={'$regex': '^indicators_group'}).iloc[0]
symbol = get_symbols(mongo, description=signal['symbol'])[0]

prices = get_prices(
    signal['datetime'],
    signal['datetime'].replace(hour=23, minute=59),
    symbol,
    mongo)

make_one_trade(
    signal,
    prices,
    symbol,
    dollar_stop,
    dollar_target,
    period_change)
The function get_prices simply gets data from a MongoDB database, and make_one_trade does simple calculations with pandas. This has never posed a problem anywhere else in my project.
EDIT 3:
Here is the KCachegrind screen when I select the 'detect cycle' option in the View tab:
Could that actually mean that there are indeed circular imports in my self-created packages that take all that time to resolve?
No. You are conflating cumulative time with time spent in the top-level code of the __init__.py file itself. The top-level code calls other methods, and those together take a lot of time.
Look at the self column instead to find where all that time is being spent. Also see What is the difference between tottime and cumtime in a Python script profiled with cProfile?; the incl. column is the cumulative time, self is the total time.
I'd just filter out all the <frozen importlib.*> entries; the Python project has already made sure those paths are optimised.
However, your second screenshot does show that in your profiling run, all that your Python code busied itself with was loading bytecode for modules to import (the marshal module provides the Python bytecode serialisation implementation). Either the Python program did nothing but import modules and no other work was done, or it is using some form of dynamic import that is loading a large number of modules or is otherwise ignoring the normal module caches and reloading the same module(s) repeatedly.
You can profile import times using Python 3.7's new -X importtime command-line switch, or you could use a dedicated import-profiler to find out why imports take such a long time.
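To act on the tottime vs. cumtime distinction above, one option is to re-profile and sort by tottime programmatically. A small sketch follows (main() is a hypothetical entry point standing in for whatever your script actually runs):
import cProfile
import pstats

# Profile the hypothetical entry point and dump the stats to a file.
cProfile.run('main()', 'profile.out')

# Sort by tottime ("self" time) so the real hot spots appear first,
# instead of the importlib wrappers that dominate cumulative time.
stats = pstats.Stats('profile.out')
stats.strip_dirs().sort_stats('tottime').print_stats(20)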

Reversed upstream/downstream relationships when generating multiple tasks in Airflow

The original code related to this question can be found here.
I'm confused by how both the bitshift operators and the set_upstream/set_downstream methods behave within a task loop that I've defined in my DAG. When the main execution loop of the DAG is configured as follows:
for uid in dash_workers.get_id_creds():
    clear_tables.set_downstream(id_worker(uid))
or
for uid in dash_workers.get_id_creds():
    clear_tables >> id_worker(uid)
The graph looks like this (the alphanumeric sequences are the user IDs, which also define the task IDs):
When I configure the main execution loop of the DAG like this:
for uid in dash_workers.get_id_creds():
    clear_tables.set_upstream(id_worker(uid))
or
for uid in dash_workers.get_id_creds():
    id_worker(uid) >> clear_tables
the graph looks like this:
The second graph is what I want, and what I would have expected the first two snippets of code to produce based on my reading of the docs. If I want clear_tables to execute first, before triggering my batch of data-parsing tasks for the different user IDs, should I indicate this as clear_tables >> id_worker(uid)?
EDIT -- Here's the complete code, which has been updated since I posted the last few questions, for reference:
from datetime import datetime
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')

if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)

ENV = os.environ

default_args = {
    'start_date': datetime.now(),
}

DAG = DAG(
    dag_id='dash_preproc',
    default_args=default_args
)

clear_tables = PythonOperator(
    task_id='clear_tables',
    python_callable=dash_workers.clear_db,
    dag=DAG)

def id_worker(uid):
    return PythonOperator(
        task_id=uid,
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=DAG)

for uid in dash_workers.get_id_creds():
    preproc_task = id_worker(uid)
    clear_tables << preproc_task
After implementing @LadislavIndra's suggestion, I still have to use the reversed bitshift operator in order to get the correct dependency graph.
UPDATE: @AshBerlin-Taylor's answer explains what's going on here. I assumed that Graph View and Tree View were doing the same thing, but they're not. Here's what id_worker(uid) >> clear_tables looks like in Graph View:
I certainly don't want the final step in my data pre-prep routine to be deleting all the data tables!
The tree view in Airflow is "backwards" to how you (and I!) first thought about it. In your first screenshot it is showing that "clear_tables" must be run before the "AAAG5608078M2" run task. And the DAG status depends upon each of the id worker tasks. So instead of a task order, it's a tree of the status chain. If that makes any sense at all.
(This might seem strange at first, but it's because a DAG can branch out and branch back in.)
You might have better luck looking at the Graph view for your dag. This one has arrows and shows the execution order in a more intuitive way. (Though I do now find the tree view useful. It's just less clear to start with)
Looking through your other code, it seems get_id_creds is your task and you're trying to loop through it, which is creating some weird interaction.
A pattern that will work is:
clear_tables = MyOperator()

for uid in uid_list:
    my_task = MyOperator(task_id=uid)
    clear_tables >> my_task
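For completeness, here is a hedged sketch of that pattern filled in with the names from the question above. It assumes get_id_creds() returns a plain Python list of IDs at DAG-parse time (i.e. it is not itself an Airflow task), and it uses a static start_date; the task-ID prefix is just an illustrative choice.
from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import dash_workers

dag = DAG(
    dag_id='dash_preproc',
    default_args={'start_date': datetime(2021, 1, 1)},
)

clear_tables = PythonOperator(
    task_id='clear_tables',
    python_callable=dash_workers.clear_db,
    dag=dag)

# One task per user ID; each runs only after clear_tables has finished.
for uid in dash_workers.get_id_creds():
    id_task = PythonOperator(
        task_id='preproc_{}'.format(uid),
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=dag)
    clear_tables >> id_task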
