The situation
I have a Celery task that I run in a different timezone for each customer.
Basically, for each customer in my database, I get the timezone and then set up the Celery task this way:
'schedule': crontab(minute=30, hour=14, nowfun=self.now_function)
Basically, what I want is for the task to run at 14:30 in the customer's timezone, hence the now_function.
My now_function just returns the current time in the customer's timezone:
from datetime import datetime
from pytz import timezone  # the timezone() used below is assumed to be pytz's

def now_function(self):
    """
    Return the current time in the customer's timezone.
    This is used to compute when a task should run for a given customer.
    """
    return datetime.now(timezone(self.customer.c_timezone))
What is happening
I am getting inconsistencies in the time the task runs. Some days it runs at the expected time, i.e. 14:30 in the customer's timezone: if the timezone is America/Chicago, it runs at 20:30 UTC, which is my expected behavior.
On other days, it runs at 14:30 UTC, i.e. the schedule is just being interpreted in UTC.
I am tracking whether there is a pattern between the days the task runs at the correct time and the days it runs at the incorrect time.
Additional Information
I have tried this on Celery 4.4.2 and 5.x, but it still has the same behavior.
Here is my Celery config:
CELERY_REDIS_SCHEDULER_URL = redis_instance_url
logger.debug("****** CELERY_REDIS_SCHEDULER_URL: ", CELERY_REDIS_SCHEDULER_URL)
logger.debug("****** environment: ", environment)
redbeat_redis_url = CELERY_REDIS_SCHEDULER_URL
broker_url = CELERY_REDIS_SCHEDULER_URL
result_backend = CELERY_REDIS_SCHEDULER_URL
task_serializer = 'pickle'
result_serializer = 'pickle'
accept_content = ['pickle']
enable_utc = False
task_track_started = True
task_send_sent_event = True
Note that enable_utc is set to False.
I am using a Redis instance from AWS to run my tasks.
I am using the RedBeatScheduler scheduler from this package to schedule my tasks.
If anyone has experienced this issue or can help me reproduce it, I will be very thankful.
Other edits:
I have other crontabs for the same job at the same time, but running weekly and monthly, and they work perfectly:
weekly_schedule : crontab(minute=30, hour=14, nowfun=self.now_function, day_of_week=1)
monthly_schedule : crontab(minute=30, hour=14, nowfun=self.now_function, day_of_month=1)
Sample Project
Here is a sample project on GitHub if you want to run and reproduce the issue.
RedBeat's encoder and decoder don't support nowfun.
Source code: https://github.com/sibson/redbeat/blob/e6d72e2/redbeat/decoder.py#L94-L102
The behaviour you see was described previously: sibson/redbeat#192 (comment 756397651)
You can subclass and replace RedBeatJSONDecoder and RedBeatJSONEncoder.
Since nowfun has to be JSON-serializable, only some special cases can be supported, e.g.:
nowfun=partial(datetime.now, tz=pytz.timezone(self.customer.c_timezone))
from datetime import datetime
from functools import partial
from celery.schedules import crontab
import pytz
from pytz.tzinfo import DstTzInfo
from redbeat.decoder import RedBeatJSONDecoder, RedBeatJSONEncoder
class CustomJSONDecoder(RedBeatJSONDecoder):
    def dict_to_object(self, d):
        if '__type__' not in d:
            return d

        objtype = d.pop('__type__')

        if objtype == 'crontab':
            if d.get('nowfun', {}).get('keywords', {}).get('zone'):
                d['nowfun'] = partial(datetime.now, tz=pytz.timezone(d.pop('nowfun')['keywords']['zone']))
            return crontab(**d)

        d['__type__'] = objtype
        return super().dict_to_object(d)


class CustomJSONEncoder(RedBeatJSONEncoder):
    def default(self, obj):
        if isinstance(obj, crontab):
            d = super().default(obj)
            if 'nowfun' not in d and isinstance(obj.nowfun, partial) and obj.nowfun.func == datetime.now:
                zone = None
                if obj.nowfun.args and isinstance(obj.nowfun.args[0], DstTzInfo):
                    zone = obj.nowfun.args[0].zone
                elif isinstance(obj.nowfun.keywords.get('tz'), DstTzInfo):
                    zone = obj.nowfun.keywords['tz'].zone
                if zone:
                    d['nowfun'] = {'keywords': {'zone': zone}}
            return d
        return super().default(obj)
Replace the classes in redbeat.schedulers:
from redbeat import schedulers
schedulers.RedBeatJSONDecoder = CustomJSONDecoder
schedulers.RedBeatJSONEncoder = CustomJSONEncoder
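For completeness, here is a minimal sketch of how a beat schedule entry might then be declared so that the custom encoder above can serialize it; customer_tz and the task name are hypothetical stand-ins for values from the question.
from datetime import datetime
from functools import partial

import pytz
from celery.schedules import crontab

customer_tz = "America/Chicago"  # hypothetical: would come from customer.c_timezone

entry = {
    'task': 'tasks.send_daily_report',  # hypothetical task name
    'schedule': crontab(
        minute=30,
        hour=14,
        # The partial form is the special case the custom encoder/decoder handle.
        nowfun=partial(datetime.now, tz=pytz.timezone(customer_tz)),
    ),
}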
Related
I made a minimal pipeline with a single step in AML. I've published this pipeline, and I have an ID and a REST endpoint for it.
When I try to create a schedule on this pipeline, I get no error, but it never launches.
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline
datastore = ws.get_default_datastore()
minimal_run_config = RunConfiguration()
minimal_run_config.environment = myenv # Custom Env with Dockerfile from mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest + openSDK 11 + pip/conda packages
step_name = experiment_name
script_step_1 = PythonScriptStep(
    name=step_name,
    script_name="main.py",
    arguments=args,
    compute_target=cpu_cluster,
    source_directory=str(source_path),
    runconfig=minimal_run_config,
)
pipeline = Pipeline(
    workspace=ws,
    steps=[
        script_step_1,
    ],
)
pipeline.validate()
pipeline.publish(name=experiment_name + "_pipeline")
I can trigger this pipeline via REST from Python:
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.pipeline.core import PublishedPipeline
import requests
auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()
pipelines = PublishedPipeline.list(ws)
rest_endpoint1 = [p for p in pipelines if p.name == experiment_name + "_pipeline"][0]
response = requests.post(rest_endpoint1.endpoint,
                         headers=aad_token,
                         json={"ExperimentName": experiment_name,
                               "RunSource": "SDK",
                               "ParameterAssignments": {"KEY": "value"}})
But when I use Schedule, I get no warning and no error, yet nothing is triggered if I use start_time from ScheduleRecurrence. If I don't use start_time, my pipeline is triggered and launches immediately, which I don't want. For example, I'm setting up the schedule today, but I want its first trigger to run only on the second of each month at 4 pm.
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule
import datetime
first_run = datetime.datetime(2022, 10, 2, 16, 00)
schedule_name = f"Recocpc monthly run PP {first_run.day:02} {first_run.hour:02}:{first_run.minute:02}"
recurrence = ScheduleRecurrence(
    frequency="Month",
    interval=1,
    start_time=first_run,
)
recurrence.validate()
recurring_schedule = Schedule.create_for_pipeline_endpoint(
    ws,
    name=schedule_name,
    description="Recocpc monthly run PP",
    pipeline_endpoint_id=pipeline_endpoint.id,
    experiment_name=experiment_name,
    recurrence=recurrence,
    pipeline_parameters={"KEY": "value"},
)
If I comment out start_time, it works, but then the first run is now, not when I want it.
So I was not aware of how start_time works: it uses DAG-like logic, as in Airflow.
Here is an example:
Today is 10-01-2022 (dd-mm-yyyy).
You want your pipeline to run once a month, on the 10th of each month at 14:00.
Then your start_time is not 2022-01-10T14:00:00, but should be 2021-12-10T14:00:00.
The scheduler only triggers once it has completed a full revolution of the interval you asked for (here, one month).
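A minimal sketch of that worked example with a back-dated start_time, reusing ws, pipeline_endpoint, and experiment_name from the snippets above (all assumed to exist):
import datetime
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule

# Desired first trigger: 10th of the month at 14:00. Back-date start_time by
# one full interval (a month) so the scheduler has already made one revolution.
backdated_start = datetime.datetime(2021, 12, 10, 14, 0)

recurrence = ScheduleRecurrence(
    frequency="Month",
    interval=1,
    start_time=backdated_start,
)
recurrence.validate()

recurring_schedule = Schedule.create_for_pipeline_endpoint(
    ws,  # assumed Workspace object from the question
    name="Recocpc monthly run PP 10 14:00",
    description="Recocpc monthly run PP",
    pipeline_endpoint_id=pipeline_endpoint.id,
    experiment_name=experiment_name,
    recurrence=recurrence,
    pipeline_parameters={"KEY": "value"},
)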
Maybe the official documentation should be more explicit about this mechanism for newbies like me who have never used DAGs.
I can't figure out how to use pytest to test a DAG task that waits for an XComArg.
I created the following DAG using the new Airflow API syntax:
@dag(...)
def transfer_files():
    @task()
    def retrieve_existing_files():
        existing = []
        for elem in os.listdir("./backup"):
            existing.append(elem)
        return existing

    @task()
    def get_new_file_to_sync(existing: list[str]):
        new_files = []
        for elem in os.listdir("./prod"):
            if elem not in existing:
                new_files.append(elem)
        return new_files

    r = retrieve_existing_files()
    get_new_file_to_sync(r)
Now I want to perform unit testing on the get_new_file_to_sync task. I wrote the following test:
def test_get_new_elan_list():
    mocked_existing = ["a.out", "b.out"]
    dag_bag = DagBag(include_examples=False)
    dag = dag_bag.get_dag("transfer_files")
    task = dag.get_task("get_new_file_to_sync")
    result = task.execute({}, mocked_existing)
    print(result)
The test fails because task.execute expects 2 parameters but 3 were given.
My issue is that I have no clue how to test tasks that expect arguments by passing in a mocked custom argument.
Thanks for your insights
I managed to find a way to unit test Airflow tasks declared using the new Airflow API.
Here is a test case for the task get_new_file_to_sync contained in the DAG transfer_files declared in the question:
from airflow.models import DagBag

def test_get_new_file_to_sync():
    mocked_existing = ["a.out", "b.out"]
    # Ask Airflow to load the DAGs from its home folder
    dag_bag = DagBag(include_examples=False)
    # Retrieve the DAG to test
    dag = dag_bag.get_dag("transfer_files")
    # Retrieve the task to test
    task = dag.get_task("get_new_file_to_sync")
    # Extract the underlying function from the task
    function_to_unit_test = task.python_callable
    # Call the function directly
    results = function_to_unit_test(mocked_existing)
    assert len(results) == 10
This bypasses all the Airflow mechanics triggered before your task code is called, so you can focus on testing the code you actually wrote for your task.
For testing such a task, I believe you'll need to use mocking from pytest.
Let's take this user-defined operator as an example:
from collections import Counter, defaultdict

from airflow.models import BaseOperator
# MovielensHook is a custom hook from the same example project; it is assumed
# to be importable alongside this operator.

class MovielensPopularityOperator(BaseOperator):
    def __init__(self, conn_id, start_date, end_date, min_ratings=4, top_n=5, **kwargs):
        super().__init__(**kwargs)
        self._conn_id = conn_id
        self._start_date = start_date
        self._end_date = end_date
        self._min_ratings = min_ratings
        self._top_n = top_n

    def execute(self, context):
        with MovielensHook(self._conn_id) as hook:
            ratings = hook.get_ratings(start_date=self._start_date, end_date=self._end_date)

        rating_sums = defaultdict(Counter)
        for rating in ratings:
            rating_sums[rating["movieId"]].update(count=1, rating=rating["rating"])

        averages = {
            movie_id: (rating_counter["rating"] / rating_counter["count"], rating_counter["count"])
            for movie_id, rating_counter in rating_sums.items()
            if rating_counter["count"] >= self._min_ratings
        }
        return sorted(averages.items(), key=lambda x: x[1], reverse=True)[: self._top_n]
And a test written just like the one you did:
def test_movielenspopularityoperator():
    task = MovielensPopularityOperator(
        task_id="test_id",
        start_date="2015-01-01",
        end_date="2015-01-03",
        top_n=5,
    )
    result = task.execute(context={})
    assert len(result) == 5
Running this test fails with:
=============================== FAILURES ===============================
___________________ test_movielenspopularityoperator ___________________

    def test_movielenspopularityoperator():
>       task = MovielensPopularityOperator(
            task_id="test_id", start_date="2015-01-01", end_date="2015-01-03", top_n=5
        )
E       TypeError: __init__() missing 1 required positional argument: 'conn_id'

tests/dags/chapter9/custom/test_operators.py:30: TypeError
========================== 1 failed in 0.10s ==========================
The test failed because we’re missing the required argument conn_id, which points to the connection ID in the metastore. But how do you provide this in a test? Tests should be isolated from each other; they should not be able to influence the results of other tests, so a database shared between tests is not an ideal situation. In this case, mocking comes to the rescue.
Mocking is “faking” certain operations or objects. For example, the call to a database that is expected to exist in a production setting but not while testing could be faked, or mocked, by telling Python to return a certain value instead of making the actual call to the (nonexistent during testing) database. This allows you to develop and run tests without requiring a connection to external systems. It requires insight into the internals of whatever it is you’re testing, and thus sometimes requires you to dive into third-party code.
After installing pytest-mock in your environment:
pip install pytest-mock
Here is the test rewritten to use mocking:
def test_movielenspopularityoperator(mocker):
    mocker.patch.object(
        MovielensHook,
        "get_connection",
        return_value=Connection(conn_id="test", login="airflow", password="airflow"),
    )
    task = MovielensPopularityOperator(
        task_id="test_id",
        conn_id="test",
        start_date="2015-01-01",
        end_date="2015-01-03",
        top_n=5,
    )
    result = task.execute(context=None)
    assert len(result) == 5
Hopefully this gives you an idea of how to write tests for your Airflow tasks.
For more about mocking and unit tests, you can check here and here.
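Bringing it back to the get_new_file_to_sync task from the original question, here is a hedged sketch that combines the python_callable approach from the earlier answer with mocker; the patched directory contents are made up for illustration.
from airflow.models import DagBag

def test_get_new_file_to_sync_with_mock(mocker):
    # Fake the contents of ./prod so the test does not depend on the filesystem.
    mocker.patch("os.listdir", return_value=["a.out", "b.out", "c.out"])
    dag_bag = DagBag(include_examples=False)
    task = dag_bag.get_dag("transfer_files").get_task("get_new_file_to_sync")
    # Call the underlying callable directly, as in the earlier answer.
    result = task.python_callable(["a.out", "b.out"])
    assert result == ["c.out"]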
I'm using a parameter that holds the current timestamp across a set of tasks:
default_dag_args = {'arg1': 'arg1-value',
                    'arg2': 'arg2-value',
                    'now': datetime.now()}
I would like the now parameter to have the same value for all tasks, but what happens is that it's re-evaluated for each task.
Is there a way of doing this (evaluating it once and using the same value throughout the DAG)? I'm using the TaskFlow API on Airflow 2.0:
@task
def python_task():
    context = get_current_context()
    context_dag = context['dag']
    now = context_dag.default_args['now']
    print(now)
I tried setting the time as a constant at the start of the DAG file, like:
TIME = datetime.now()
and got the context inside the tasks with get_current_context(), just like you did.
Sadly, because the DAG file is re-parsed from the start every time, the time was recalculated whenever a task got defined in the script.
One idea is to use XComs to save the datetime in one task and pull it into the other tasks.
My sample code is below; I think you'll get the idea.
from airflow.decorators import task, dag
from datetime import datetime
import time

default_arguments = {
    'owner': 'admin',
    # This is the beginning, for more see: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime(2022, 5, 2)
}

@dag(
    schedule_interval=None,
    dag_id="Time_Example_Dag",
    default_args=default_arguments,
    catchup=False,
)
def the_global_time_checker_dag():

    @task
    def time_set():
        # To use XCom to pass the value between tasks,
        # we have to convert the datetime to a string.
        now = str(datetime.now())
        return now

    @task
    def starting_task(datetime_string):
        important_number = 23
        # We can use this datetime object in whatever way we like.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        return important_number

    @task
    def important_task(datetime_string, number):
        # Let some time pass
        time.sleep(10)
        # Again, we are free to do whatever we want with this object.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        print("The important number is: {}".format(number))

    time_right_now = time_set()
    start = starting_task(datetime_string=time_right_now)
    important = important_task(datetime_string=time_right_now, number=start)

time_checker = the_global_time_checker_dag()
Through the logs, you can see all the datetime values are the same.
For more information about XComs in the TaskFlow API, you can check here.
When a worker gets a task instance to run, it rebuilds the whole DagBag from the Python files to get the DAG and task definitions. So every time a task instance is run, your DAG file is sourced, re-running your DAG definition code, and the resulting DAG object is the one that particular task instance is defined by.
It's critical to understand that the DAG definition is not simply built once per execution date and then persisted/reused for all TIs within that DagRun. The DAG definition is constantly being recomputed from your Python code, and each TI runs in a separate process, independently and without state from other tasks. Thus, if your DAG definition includes non-deterministic values at DagBag build time, such as datetime.now(), every instantiation of your DAG, even for the same execution date, will have different values. You need to build your DAGs in a deterministic and idempotent manner.
The only way to share non-deterministic results is to store them in the DB and have your tasks fetch them, as @sezai-burak-kantarcı has pointed out. Best practice is to use task context-specific variables, like {{ ds }}, {{ execution_date }}, {{ data_interval_start }}, etc. These are the same for all tasks within a DAG run. You can see the template variables available here: Airflow templates reference.
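As a hedged sketch of that recommendation (the DAG and task names below are made up), a TaskFlow DAG can read data_interval_start from the task context so that every task in a run sees the same timestamp; data_interval_start assumes Airflow 2.2+.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(schedule_interval="@daily", start_date=datetime(2022, 5, 2), catchup=False)
def context_time_demo():

    @task
    def first_task():
        context = get_current_context()
        # Same value for every task in this DagRun, unlike datetime.now()
        # evaluated at parse time.
        print(context["data_interval_start"])

    @task
    def second_task():
        context = get_current_context()
        print(context["data_interval_start"])

    first_task()
    second_task()


demo = context_time_demo()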
I am trying to find out how long each EC2 instance in the 'running' state has been running. For this I need to get the last start time for all instances and compare it with today. I found that I can get restart or start events from CloudTrail, but I can't figure out how to get only the 'start' time from there. Is there a way to find that information for a couple of regions?
import boto3
import datetime
from datetime import date
import subprocess

regions = ['us-west-2', 'us-west-1', 'us-east-1', 'us-east-2', 'ap-south-1', 'ap-southeast-1', 'ca-central-1', 'eu-west-1', 'eu-west-3']

for region in regions:
    session = boto3.session.Session(region_name=region)
    ec2 = session.resource('ec2')
    cloudtrail = boto3.client('cloudtrail')
    for i in ec2.instances.all():
        Id = i.id
        State = i.state['Name']
        Launchtime = i.launch_time
        InstanceType = i.instance_type
        Platform = str(i.platform)
        currenttime = datetime.datetime.now(Launchtime.tzinfo)
        time_diff = currenttime - Launchtime
        uptime = str(time_diff)
Here uptime gives the difference between launch time and the current time, which is not correct, as most of the instances have been restarted many times already. So I need to find the last start time for all running instances.
I tweaked my code a little bit and made it like this:
import boto3
import datetime
from datetime import date
import subprocess

regions = ['us-west-2', 'us-west-1', 'us-east-1']

for region in regions:
    session = boto3.session.Session(region_name=region)
    ec2 = session.resource('ec2')
    for i in ec2.instances.all():
        Id = i.id
        State = i.state['Name']
        Launchtime = i.launch_time
        InstanceType = i.instance_type
        Platform = str(i.platform)
        currenttime = datetime.datetime.now(Launchtime.tzinfo)
        time_diff = currenttime - Launchtime
        uptime = str(time_diff)
        if i.state['Name'] == 'stopped':
            uptime = ' '
Here it shows uptime only for running instances. It turns out launch_time reflects the last time these instances were started (launched). I cross-checked in the console and those values were right, then checked uptime on the Linux instances (sudo uptime) and it matched the output.
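For completeness, if launch_time were not enough (for example, to see every stop/start cycle rather than just the latest one), a hedged sketch of the CloudTrail route from the question could look like this; note that lookup_events only covers roughly the last 90 days of management events, and pagination via NextToken is omitted here.
import boto3
import datetime

regions = ['us-west-2', 'us-west-1', 'us-east-1']

for region in regions:
    cloudtrail = boto3.client('cloudtrail', region_name=region)
    events = cloudtrail.lookup_events(
        LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': 'StartInstances'}],
        StartTime=datetime.datetime.now() - datetime.timedelta(days=90),
        EndTime=datetime.datetime.now(),
    )
    for event in events['Events']:
        # EventTime is when the StartInstances call was made; Resources lists
        # the instance IDs it applied to.
        instance_ids = [r['ResourceName'] for r in event.get('Resources', [])
                        if r.get('ResourceType') == 'AWS::EC2::Instance']
        print(region, event['EventTime'], instance_ids)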
I have a strange problem with a website running on Django 1.8 and Apache.
I am trying to display a list of next week's upcoming events. After making modifications to the model, making and applying my migrations on the website, and restarting Apache, everything shows up fine. I have a Python script that queries and displays the next week's worth of events on the website. This week-to-week rollover of events is scheduled to happen shortly after midnight on the Saturday before the events take place.
Unfortunately, the events are not showing up on Saturday until I manually reload the Apache service. Also, by the time Monday rolls around, the site is again displaying old information until I again reload the Apache service. After the Monday reload, everything works as it should until the next Saturday.
I've checked all the crons, and there's nothing running that should make the site behave this way. I've also checked the ctime and mtime on all of the files and it does not appear that they've been changed or modified. I've also checked through the script and can find nothing that would be causing this behaviour.
Does anyone know what might be causing this or have another approach to solving this problem?
Thank you.
Here's the relevant code...
views.py
def index(request, starttime = datetime.now(), version = False):
    # starttime: Datetime object for when to begin looking for events. Index pages are always a week.
    starttime = datetime.now()
    endtime = starttime + timedelta(7)
    event_list = getEventsDB(EventList(), start = starttime, end = endtime)
    event_calendar = getEventsDB(
        EventSeries.objects.get(event_slug = 'competition'),
        start = starttime, end = starttime + timedelta(7))
    highlight_title = 'Competitions'

    t = loader.get_template('event/index.html')
    c = RequestContext(request, {
        'page_title': 'Events',
        'highlight_title': highlight_title,
        'event_list': event_list,
        'highlight': event_calendar,
        'event_series_list': EventSeries.objects.filter(active=True),
    })
    return HttpResponse(t.render(c))
Your problem is here:
def index(request, starttime = datetime.now(), version = False):
There, starttime is evaluated when the function is defined, which happens when the module is first imported, in other words when the server is restarted. Don't put dynamic values there; instead, leave the default as None and set it in the function itself:
def index(request, starttime=None, version=False):
    if starttime is None:
        starttime = datetime.now()
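A tiny standalone illustration of why this happens (nothing Django-specific): the default is evaluated once, when the function is defined, not on every call.
from datetime import datetime
import time

def stamped(ts=datetime.now()):
    return ts

first = stamped()
time.sleep(1)
second = stamped()
# Both calls return the same value: the default was frozen at definition time.
assert first == second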