luigi tasks have conflicting pip dependencies - python

I have a luigi pipeline where some luigi.Tasks have conflicting pip dependencies. This causes issues because those tasks are part of the same pipeline (i.e. one task requires the other). I would rather not create separate pipelines, as I would then no longer be able to inspect the full pipeline in the scheduler. What are the best practices in this case?
Example: You have two python packages each defining a luigi.Task.
However packageA needs a different version of a library than packageB:
packageA/task1.py requires mypackage==1.0.0
packageB/task2.py requires mypackage==0.9.0
Let's say the pipeline is:
task1 -> task2 -> wrappertask
This is an issue as in task2 I have to import task1 in order to define the requires method:
# packageB/task2.py, needs mypackage==0.9.0
from task1 import Task1  # cannot do this, as it would pull in mypackage==1.0.0

class Task2(luigi.Task):
    id = luigi.Parameter()

    def requires(self):
        return Task1(id=self.id)
    ...

If you run luigi with the local scheduler, all scheduling and task execution occurs in a single Python process. So any Python package imported into the global namespace stays present, and subsequent import statements for that package are ignored. You also have to know that luigi instantiates all task classes once, but calls the requires() and output() methods of each task multiple times during scheduling and task execution.
So you only want to import the conflicting package into the local namespace of the task method where you use it. Make sure that both versions of the package are available on the PYTHONPATH, e.g. as pack1 and pack2, so that you can use an import statement like:
...
...
import pack1 as pack
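As a rough illustration of that local-import approach (a sketch only; the pack2 name and the do_something() call are assumptions, not from the question), Task2 would keep its own dependency out of the module-level namespace:

import luigi
from task1 import Task1  # safe once task1 also defers its mypackage import

class Task2(luigi.Task):
    id = luigi.Parameter()

    def requires(self):
        return Task1(id=self.id)

    def run(self):
        # Import the 0.9.0 copy only inside the method that needs it.
        import pack2 as mypackage  # assumed name of the vendored 0.9.0 copy
        mypackage.do_something()   # hypothetical call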

Related

Running Django celery on load

Hi, I am working on a project where I need Celery beat to run long-term periodic tasks. The problem is that after starting Celery beat, the first run only happens after the specified interval has elapsed.
I want to fire the task on load for the first time and then run it periodically.
I have seen this question on Stack Overflow and this issue on GitHub, but didn't find a reliable solution.
Any suggestions on this one?
Since this does not seem possible, I suggest a different approach: call the task explicitly when you need it and let the scheduler continue scheduling the tasks as usual. You can call the task on startup by using one of the following methods (you probably need to take care of multiple calls of the ready method if the task is not idempotent). Alternatively, call the task from the command line by using celery call after your Django server startup command.
The best place to call it will most of the time be in the ready() function of the current app's AppConfig class:
from django.apps import AppConfig
from myapp.tasks import my_task

class RockNRollConfig(AppConfig):
    # ...
    def ready(self):
        my_task.delay(1, 2, 3)
Notice the use of .delay(), which puts the invocation on the Celery queue and doesn't slow down starting the server.
See: https://docs.djangoproject.com/en/3.2/ref/applications/#django.apps.AppConfig and https://docs.celeryproject.org/en/stable/userguide/calling.html.
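If my_task is not idempotent, one rough way to guard against ready() firing it more than once (e.g. under management commands or the autoreloader) is to check how the process was started; this is only a sketch and the sys.argv checks are assumptions about your deployment:

import sys
from django.apps import AppConfig

class RockNRollConfig(AppConfig):
    name = 'myapp'

    def ready(self):
        # Only fire the task for an actual server process, not for
        # migrations, shell sessions or other management commands.
        if 'runserver' in sys.argv or 'gunicorn' in sys.argv[0]:
            from myapp.tasks import my_task
            my_task.delay(1, 2, 3)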

Skipping/excluding test module if running Pytest in parallel

I have a number of test files, such as
test_func1.py
test_func2.py
test_func3.py
I know in advance that test_func3.py won't pass if I run pytest in parallel, e.g. pytest -n8. The reason is that test_func3.py contains a number of parametrized tests that handle file I/O processes. Parallel writing to the same file leads to failures. In serial testing mode, all tests in this module pass.
I'm wondering how I can skip the whole module in case pytest is started with the option -n. My thinking is to apply the skipif marker. I need to check in my code whether the -n argument has been passed to pytest.
...>pytest # run all modules
...>pytest -n8 # skip module test_func3.py automatically
The pytest-xdist package supports four scheduling algorithms:
each
load
loadscope
loadfile
Calling pytest -n is a shortcut for load scheduling, i.e. the scheduler will load balance the tests across all workers.
Using loadfile scheduling, all test cases in a test file will be executed sequentially by the same worker.
pytest -n8 --dist=loadfile will do the trick. The drawback is that the whole test suite may run slower than with load scheduling. The advantage is that all tests will be performed and no test will be skipped.
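If you do want the skip-on--n behaviour the question asks for instead, a conftest.py hook can check whether pytest-xdist's worker-count option was set. A minimal sketch, assuming the option is registered under the name numprocesses and that the module to exclude is test_func3.py:

# conftest.py
import pytest

def pytest_collection_modifyitems(config, items):
    # pytest-xdist stores -n / --numprocesses under "numprocesses"
    if config.getoption("numprocesses", default=None):
        skip_parallel = pytest.mark.skip(reason="skipped when running with -n")
        for item in items:
            if item.fspath.basename == "test_func3.py":
                item.add_marker(skip_parallel)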
There may be a case of a test that affects some service settings.
Such a test cannot run in parallel with any other test.
There is a way to skip an individual test like this (worker_id is a fixture provided by pytest-xdist; it returns "master" when the tests are not distributed):
@pytest.mark.unparalleled
def test_to_skip(worker_id, a_fixture):
    if worker_id != "master":
        pytest.skip("Can't run in parallel with anything")
The first drawback here is that the test will be skipped, so you will need to run those tests separately. For that matter, you can put them in a separate folder, or mark them with some tag.
The second drawback is that any fixtures used in such a test will still be initialized.
Old question, but xdist has a newer feature that may address the OP's question.
Per the docs:
--dist loadgroup: Tests are grouped by the xdist_group mark. Groups are distributed to available workers as whole units. This guarantees that all tests with same xdist_group name run in the same worker.
@pytest.mark.xdist_group(name="group1")
def test1():
    pass

class TestA:
    @pytest.mark.xdist_group("group1")
    def test2(self):
        pass
This will make sure test1 and TestA::test2 will run in the same worker. Tests without the xdist_group mark are distributed normally as in the --dist=load mode.
Specifically, you could put the test functions in test_func3.py into the same xdist_group.
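For a whole module, that can be done with a module-level pytestmark; a short sketch (the group name is arbitrary), assuming a pytest-xdist version that supports --dist=loadgroup:

# test_func3.py
import pytest

# every test in this module ends up in the same group, hence the same worker
pytestmark = pytest.mark.xdist_group(name="file_io_tests")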

Python concurrent.futures imports libraries multiple times (executes code in top scope multiple times)

For the following script (Python 3.6, Windows, Anaconda), I noticed that the libraries are imported as many times as the number of processes invoked, and print('Hello') is also executed the same number of times.
I thought the processes would only be invoked for the func1 call rather than for the whole program. The actual func1 is a heavy CPU-bound task which will be executed millions of times.
Is this the right choice of framework for such task?
import datetime
import pandas as pd
import numpy as np
from concurrent.futures import ProcessPoolExecutor

print("Hello")

def func1(x):
    return x

if __name__ == '__main__':
    print(datetime.datetime.now())
    print('test start')
    with ProcessPoolExecutor() as executor:
        results = executor.map(func1, np.arange(1, 1000))
        for r in results:
            print(r)
    print('test end')
    print(datetime.datetime.now())
concurrent.futures.ProcessPoolExecutor uses the multiprocessing module to do its multiprocessing.
And, as explained in the Programming guidelines, this means you have to protect any top-level code you don't want to run in every process in your __main__ block:
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
... one should protect the “entry point” of the program by using if __name__ == '__main__':…
Notice that this is only necessary if using the spawn or forkserver start methods. But if you're on Windows, spawn is the default. And, at any rate, it never hurts to do this, and usually makes the code clearer, so it's worth doing anyway.
You probably don't want to protect your imports this way. After all, the cost of calling import pandas as pd once per core may seem nontrivial, but that only happens at startup, and the cost of running a heavy CPU-bound function millions of times will completely swamp it. (If not, you probably didn't want to use multiprocessing in the first place…) And usually, the same goes for your def and class statements (especially if they're not capturing any closure variables or anything). It's only setup code that's incorrect to run multiple times (like that print('hello') in your example) that needs to be protected.
The examples in the concurrent.futures doc (and in PEP 3148) all handle this by using the "main function" idiom:
def main():
    ...  # all of your top-level code goes here

if __name__ == '__main__':
    main()
This has the added benefit of turning your top-level globals into locals, to make sure you don't accidentally share them (which can especially be a problem with multiprocessing, where they get actually shared with fork, but copied with spawn, so the same code may work when testing on one platform, but then fail when deployed on the other).
If you want to know why this happens:
With the fork start method, multiprocessing creates each new child process by cloning the parent Python interpreter and then just starting the pool-servicing function up right where you (or concurrent.futures) created the pool. So, top-level code doesn't get re-run.
With the spawn start method, multiprocessing creates each new child process by starting a clean new Python interpreter, importing your code, and then starting the pool-servicing function. So, top-level code gets re-run as part of the import.
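A small sketch that makes the difference visible (the start-method choice is just for illustration; "fork" is only available on POSIX):

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

print("top-level code")  # printed again by each child under spawn, not under fork

def square(x):
    return x * x

if __name__ == '__main__':
    mp.set_start_method("spawn")  # try "fork" on Linux to compare
    with ProcessPoolExecutor(max_workers=2) as ex:
        print(list(ex.map(square, range(4))))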

Airflow dynamic tasks at runtime

Other questions about 'dynamic tasks' seem to address dynamic construction of a DAG at schedule or design time. I'm interested in dynamically adding tasks to a DAG during execution.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

dag = DAG('test_dag', description='a test',
          schedule_interval='0 0 * * *',
          start_date=datetime(2018, 1, 1),
          catchup=False)

def make_tasks():
    du1 = DummyOperator(task_id='dummy1', dag=dag)
    du2 = DummyOperator(task_id='dummy2', dag=dag)
    du3 = DummyOperator(task_id='dummy3', dag=dag)
    du1 >> du2 >> du3

p = PythonOperator(
    task_id='python_operator',
    dag=dag,
    python_callable=make_tasks)
This naive implementation doesn't seem to work - the dummy tasks never show up in the UI.
What's the correct way to add new operators to the DAG during execution? Is it possible?
It is not possible to modify the DAG during its execution (without a lot more work).
The dag = DAG(... is picked up in a loop by the scheduler. It will have the task instance 'python_operator' in it. That task instance gets scheduled in a dag run and executed by a worker or executor. Since DAG models in the Airflow DB are only updated by the scheduler, these added dummy tasks will not be persisted to the DAG nor scheduled to run. They will be forgotten when the worker exits. Unless you copy all the code from the scheduler regarding persisting and updating the model, but that will be undone the next time the scheduler visits the DAG file for parsing, which could happen once a minute, once a second, or faster, depending on how many other DAG files there are to parse.
Airflow actually wants each DAG to keep approximately the same layout between runs. It also wants to reload/parse DAG files constantly. So although you could write a DAG file that on each run determines the tasks dynamically based on some external data (preferably cached in a file or pyc module, not network I/O like a DB lookup, since you'd slow down the whole scheduling loop for all the DAGs), it's not a good plan: your graph and tree views will get confusing, and your scheduler parsing will be more taxed by that lookup.
You could make the callable run each task…
def make_tasks(**context):
    du1 = DummyOperator(task_id='dummy1', dag=dag)
    du2 = DummyOperator(task_id='dummy2', dag=dag)
    du3 = DummyOperator(task_id='dummy3', dag=dag)
    du1.execute(context)
    du2.execute(context)
    du3.execute(context)

p = PythonOperator(
    task_id='python_operator',
    dag=dag,
    provide_context=True,
    python_callable=make_tasks)
But that's sequential, and you have to work out how to run them in parallel in Python (use futures?), and if any raise an exception the whole task fails. It is also bound to one executor or worker, so it doesn't use Airflow's task distribution (Kubernetes, Mesos, Celery).
The other way to work with this is to add a fixed number of tasks (the maximal number), and use the callable(s) to short-circuit the unneeded tasks or push arguments with XCom for each of them, changing their behavior at run time but not changing the DAG.
Regarding your code sample: you never call the function that registers your tasks in your DAG.
To have a kind of dynamic tasks, you can have a single operator that does something different depending on some state, or you can have a handful of operators that can be skipped depending on the state, with a ShortCircuitOperator.
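A rough sketch of that fixed-maximal-number-of-tasks idea with ShortCircuitOperator (the DAG name, the slot count of 5 and the decide_if_needed() helper are assumptions, not from the answers above):

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import ShortCircuitOperator
from datetime import datetime

dag = DAG('fixed_max_dag', schedule_interval='0 0 * * *',
          start_date=datetime(2018, 1, 1), catchup=False)

def decide_if_needed(slot, **context):
    # Hypothetical: consult some (cached) external state and return False
    # to skip everything downstream of this slot.
    return slot < 2

for i in range(5):  # 5 = assumed upper bound on parallel branches
    gate = ShortCircuitOperator(
        task_id='gate_%d' % i,
        python_callable=decide_if_needed,
        op_kwargs={'slot': i},
        provide_context=True,
        dag=dag)
    work = DummyOperator(task_id='work_%d' % i, dag=dag)
    gate >> work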
I appreciate all the work everybody has done here, as I have the same challenge of creating dynamically structured DAGs. I have made enough mistakes not to use software against its design. If I can't inspect the whole run in the UI and zoom in and out, basically using Airflow's features, which are the main reason I use it anyway, then I might as well just write multiprocessing code inside a function and be done with it.
That all being said, my solution is to use a resource manager such as Redis locking: have one DAG that writes to this resource manager with data about what to run and how to run it, and have another DAG (or DAGs) run at certain intervals, polling the resource manager, locking entries before running them and removing them when finished. This way at least I use Airflow as expected, even though its specifications don't exactly meet my needs, and I break the problem down into more definable chunks. The other solutions are creative, but they go against the design and are not tested by the developers, who specifically say to have fixed, structured workflows. I cannot put in a workaround that is untested and against the design unless I rewrite the core Airflow code and test it myself. I understand my solution brings complexity with locking and all that, but at least I know the boundaries of that.

Module importing multiple times with django + celery

I have a module which is expensive to import (it involves downloading a ~20MB index file), which is used by a celery worker. Unfortunately I can't figure out how to have the module imported only once, and only by the celery worker.
Version 1 tasks.py file:
import expensive_module

@shared_task
def f():
    expensive_module.do_stuff()
When I organize the file this way, the expensive module is imported both by the web server and the celery instance, which is what I'd expect since the tasks module is imported in both and they're different processes.
Version 2 tasks.py file:
@shared_task
def f():
    import expensive_module
    expensive_module.do_stuff()
In this version the web server never imports the module (which is good), but the module gets re-imported by the celery worker every time f.delay() is called. This is what really confuses me: in this scenario, why is the module re-imported every time this function is run by the celery worker? How can I re-organize this code so that only the celery worker imports the expensive module, and the module is imported only once?
As a follow-on, less important question: in Version 1 of the tasks.py file, why does the web instance import the expensive module twice? Both times it's imported from urls.py when Django runs self._urlconf_module = import_module(self.urlconf_name).
Make a duplicate tasks.py file for the web server which has empty tasks and no unneeded imports.
For Celery, use version 1, where you import only once instead of every time you call that task.
Been there, and it works.
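One rough way to get the import to happen lazily, once per worker process and never in the web server, is to cache the module in a module-level variable. A sketch only; expensive_module is the questioner's name and the helper function is hypothetical:

from celery import shared_task

_expensive_module = None

def _get_expensive_module():
    global _expensive_module
    if _expensive_module is None:
        import expensive_module  # paid once per worker process
        _expensive_module = expensive_module
    return _expensive_module

@shared_task
def f():
    _get_expensive_module().do_stuff()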