Celery - assign session dynamically / reuse connections - Python

I've been breaking my head for a few days over a seemingly simple task (I thought it was simple... not anymore):
The main program sends hundreds of SQL queries to fetch data from multiple databases.
I thought Celery would be the right choice, as it can scale and also simplify the threading/async orchestration.
The "clean" solution would be one generic task that looks something like:
@app.task(bind=True, name='fetch_data')
def fetch_data(self, *args, **kwargs):
    db = kwargs['db']
    sql = kwargs['sql']
    session = DBContext().get_session(db)
    result = session.query(sql).all()
    ...
But I'm having trouble implementing such a DBContext class: one that is instantiated once per DB, reuses the DB session across requests, and closes it once the requests are done.
(Or any other approach you'd recommend.)
I was thinking about using a base Task class for the function and keeping all the available connections there, but the problem is that such a class isn't initialized dynamically per call, only once... Maybe there's a way to make it work, but I'm not sure how:
class DatBaseFactory(Task):
    def __call__(self, *args, **kwargs):
        print("In class", self.db)
        self.engine = DBContext.get_db(self.db)
        return super().__call__(*args, **kwargs)

@app.task(bind=True, base=DatBaseFactory, name='test_db', db=db, engine='')
def test_db(self, *args, **kwargs):
    print("From Task", self.engine)
The other alternative would be to duplicate the function once per DB and "preserve" the sessions that way, but that's quite an ugly solution.
Hope someone can help with this.

Related

How to test operations in a context manager using pytest

I have a database handler that utilizes SQLAlchemy ORM to communicate with a database. As part of SQLAlchemy's recommended practices, I interact with the session by using it as a context manager. How can I test what a function that uses that context manager has done inside it?
EDIT: I realized the file structure mattered due to the complexity it introduced. I re-structured the code below to more closely mirror what the end file structure will be like, and what a common production repo in my environment would look like, with code being defined in one file and tests in a completely separate file.
For example:
Code File (delete_things_from_table.py):
from db_handler import delete, SomeTable

def delete_stuff(handler):
    stmt = delete(SomeTable)
    with handler.Session.begin() as session:
        session.execute(stmt)
        session.commit()
Test File:
import pytest
import delete_things_from_table as dlt
from db_handler import Handler

def test_delete_stuff():
    handler = Handler()
    dlt.delete_stuff(handler)
    # Test that session.execute was called
    # Test the value of 'stmt'
    # Test that session.commit was called
I am not looking for a solution specific to SQLAlchemy; I am only utilizing this to highlight what I want to test within a context manager, and any strategies for testing context managers are welcome.
After sleeping on it, I came up with a solution. I'd love additional/less complex solutions if there are any available, but this works:
import pytest
import delete_things_from_table as dlt
from db_handler import Handler

class MockSession:
    def __init__(self):
        self.execute_params = []
        self.commit_called = False

    def execute(self, *args, **kwargs):
        self.execute_params.append(["call", args, kwargs])
        return self

    def commit(self):
        self.commit_called = True
        return self

    def begin(self):
        return self

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        pass

def test_delete_stuff(monkeypatch):
    handler = Handler()
    # Parens in 'MockSession' below are important: pass an instance, not the class
    monkeypatch.setattr(handler, "Session", MockSession())
    dlt.delete_stuff(handler)
    # Test that session.execute was called
    assert len(handler.Session.execute_params)
    # Test the value of 'stmt'
    assert str(handler.Session.execute_params[0][1][0]) == "DELETE FROM some_table"
    # Test that session.commit was called
    assert handler.Session.commit_called
Some key things to note:
I created a static mock instead of a MagicMock as it's easier to control the methods/data flow with a custom mock class
Since the SQLAlchemy session context manager requires a begin() to start the context, my mock class needed a begin. Returning self in begin allows us to test the values later.
Context managers rely on the magic methods __enter__ and __exit__ with the argument signatures you see above.
The mocked class contains mocked methods which alter instance variables, allowing us to assert on them later.
This relies on monkeypatch (there are other ways I'm sure), but what's important to note is that when you pass your mock class you want to patch in an instance of the class and not the class itself. The parentheses make a world of difference.
I don't think it's an elegant solution, but it's working. I'll happily take any suggestions for improvement.
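For comparison, a version built on unittest.mock.MagicMock could look roughly like this (an untested sketch, assuming the same delete_things_from_table module and that db_handler exposes real SQLAlchemy constructs, so str(stmt) renders the DELETE):

from unittest.mock import MagicMock

import delete_things_from_table as dlt

def test_delete_stuff_magicmock():
    handler = MagicMock()
    # delete_stuff uses handler.Session.begin() as a context manager;
    # grab the mock that __enter__ will return so we can inspect it afterwards.
    session = handler.Session.begin.return_value.__enter__.return_value

    dlt.delete_stuff(handler)

    session.execute.assert_called_once()
    assert str(session.execute.call_args[0][0]) == "DELETE FROM some_table"
    session.commit.assert_called_once()

MagicMock auto-creates the begin()/__enter__ chain, so no hand-rolled class is needed, at the cost of being less explicit about what the mock allows.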

Trigger a Celery job via Django signals

I would like to use Django signals to trigger a celery task like so:
def delete_content(sender, instance, **kwargs):
    task_id = uuid()
    task = delete_libera_contents.apply_async(kwargs={"instance": instance}, task_id=task_id)
    task.wait(timeout=300, interval=2)
But I'm always running into kombu.exceptions.EncodeError: Object of type MusicTracks is not JSON serializable
Now I'm not sure how to treat the MusicTracks instance, as it's a model class instance. How can I properly pass such instances to my task?
In my tasks.py I have the following:
#app.task(name="Delete Libera Contents", queue='high_priority_tasks')
def delete_libera_contents(instance, **kwargs):
libera_backend = instance.file.libera_backend
...
Never send a model instance to a Celery task. You should only send plain values, for example the instance's primary key, and then fetch the instance by that pk inside the Celery task before doing your logic.
Your code should look like this:
views.py
def delete_content(sender, instance, **kwargs):
    task_id = uuid()
    task = delete_libera_contents.apply_async(kwargs={"instance_pk": instance.pk}, task_id=task_id)
    task.wait(timeout=300, interval=2)
task.py
#app.task(name="Delete Libera Contents", queue='high_priority_tasks')
def delete_libera_contents(instance_pk, **kwargs):
instance = Instance.ojbects.get(pk = instance_pk)
libera_backend = instance.file.libera_backend
...
You can find this rule in the Celery documentation (I can't find the link). One of the reasons, imagine this situation:
You send your instance to the Celery task (and it gets delayed for 5 minutes for whatever reason).
Your project then runs its logic with this instance before the task has finished.
When the task's turn finally comes, it works with the old version of the instance, and the data ends up stale or corrupted.
(This is the reason as I understand it; it is not from the documentation.)
First off, sorry for making the question a bit confusing, especially for the people who have already written an answer.
In my case, the delete_content signal can be triggered by three different models, so it actually looks like this:
@receiver(pre_delete, sender=MusicTracks)
@receiver(pre_delete, sender=Movies)
@receiver(pre_delete, sender=TvShowEpisodes)
def delete_content(sender, instance, **kwargs):
    delete_libera_contents.delay(instance_pk=instance.pk)
So every time one of these models triggers a delete action, this signal will also trigger a celery task to actually delete the stuff in the background (all stored on S3).
As I cannot and should not pass instances around directly, as pointed out by @oruchkin, I pass instance.pk to the celery task, which I then have to look up in the task, since inside the task I don't know which model triggered the delete action:
#app.task(name="Delete Libera Contents", queue='high_priority_tasks')
def delete_libera_contents(instance_pk, **kwargs):
if Movies.objects.filter(pk=instance_pk).exists():
instance = Movies.objects.get(pk=instance_pk)
elif MusicTracks.objects.filter(pk=instance_pk).exists():
instance = MusicTracks.objects.get(pk=instance_pk)
elif TvShowEpisodes.objects.filter(pk=instance_pk).exists():
instance = TvShowEpisodes.objects.get(pk=instance_pk)
else:
raise logger.exception("Task: 'Delete Libera Contents', reports: No instance found (code: JFN4LK) - Warning")
libera_backend = instance.file.libera_backend
You might ask why I do not simply pass the sender from the signal to the celery task. I tried this as well and, as already pointed out, I cannot pass model classes either; it fails with:
kombu.exceptions.EncodeError: Object of type ModelBase is not JSON serializable
So it really seems I have to obtain the instance the hard way, with the if-elif-else clauses in the celery task.
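For what it's worth, one alternative that avoids the if-elif-else chain (a sketch, not what I ended up using) would be to also pass the model's label as a plain string and resolve it with Django's app registry:

from django.apps import apps

@receiver(pre_delete, sender=MusicTracks)
@receiver(pre_delete, sender=Movies)
@receiver(pre_delete, sender=TvShowEpisodes)
def delete_content(sender, instance, **kwargs):
    # Both values are JSON-serializable (a string and a primary key).
    delete_libera_contents.delay(
        model_label=instance._meta.label,  # e.g. "myapp.Movies"
        instance_pk=instance.pk,
    )

@app.task(name="Delete Libera Contents", queue='high_priority_tasks')
def delete_libera_contents(model_label, instance_pk, **kwargs):
    model = apps.get_model(model_label)
    instance = model.objects.get(pk=instance_pk)
    libera_backend = instance.file.libera_backend
    ...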

SQLAlchemy session cleared in celery job and on_success function

I am building a tool that fetches data from a different database, transforms it, and stores it in my own database. I'm migrating from APScheduler to Celery, but I ran into the following problem:
I use a class I call JobRecords to store when a job ran, whether it was successful, and which errors it encountered. I use this so I know not to look too far back for updated entries, especially since some tables have many millions of rows.
Since the system is the same for all jobs, I created a subclass of the Celery Task object. I make sure the job is executed within the Flask app context, and I fetch the latest time this job finished successfully. I also make sure I register a value for now, to avoid timing issues between querying the database and adding the job record.
class RecordedTask(Task):
    """
    Task subclass that uses JobRecords to get the last run date
    and add new JobRecords on completion
    """
    now: datetime = None
    ignore_result = True
    _session: scoped_session = None
    success: bool = True
    info: dict = None

    @property
    def session(self) -> Session:
        """Making sure we have one global session instance"""
        if self._session is None:
            from app.extensions import db
            self._session = db.session
        return self._session

    def __call__(self, *args, **kwargs):
        from app.models import JobRecord
        kwargs['last_run'] = (
            self.session.query(func.max(JobRecord.run_at_))
            .filter(JobRecord.job_id == self.name, JobRecord.success)
            .first()
        )[0] or datetime.min
        self.now = kwargs['now'] = datetime.utcnow()
        with app.app_context():
            super(RecordedTask, self).__call__(*args, **kwargs)

    def on_failure(self, exc, task_id, args: list, kwargs: dict, einfo):
        self.session.rollback()
        self.success = False
        self.info = dict(
            args=args,
            kwargs=kwargs,
            error=exc.args,
            exc=format_exception(exc.__class__, exc, exc.__traceback__),
        )
        app.logger.error(f"Error executing job '{self.name}': {exc}")

    def on_success(self, retval, task_id, args: list, kwargs: dict):
        app.logger.info(f"Executed job '{self.name}' successfully, adding JobRecord")
        for entry in self.to_trigger:
            if len(entry) == 2:
                job, kwargs = entry
            else:
                job, = entry
                kwargs = {}
            app.logger.info(f"Scheduling job '{job}'")
            current_celery_app.signature(job, **kwargs).delay()

    def after_return(self, *args, **kwargs):
        from app.models import JobRecord
        record = JobRecord(
            job_id=self.name,
            run_at_=self.now,
            info=self.info,
            success=self.success
        )
        self.session.add(record)
        self.session.commit()
        self.session.remove()
I added an example of a job to update a model called Location, but there are a lot of jobs just like this one.
@celery.task(bind=True, name="update_locations")
def update_locations(self, last_run: datetime = datetime.min, **_):
    """Get the locations from the external database and check for updates"""
    locations: List[ExternalLocation] = ExternalLocation.query.filter(
        ExternalLocation.updated_at_ >= last_run
    ).order_by(ExternalLocation.id).all()
    app.logger.info(f"ExternalLocation: collected {len(locations)} updated locations")
    for update_location in locations:
        existing_location: Location = Location.query.filter(
            Location.external_id == update_location.id
        ).first()
        if existing_location is None:
            self.session.add(Location.from_worker(update_location))
        else:
            existing_location.update_from_worker(update_location)
The problem is that when I run this job, the Location objects are not committed with the JobRecord, so only the latter is created. If I track it with the debugger, Location.query.count() returns the correct value inside the function, but as soon as it enters the on_success callback, it's back to 0, and self._session.new returns an empty dict.
I already tried adding the session as a property to make sure it's the same instance everywhere, but the problem still persists. Maybe it has something to do with it being a scoped_session because of Flask-SQLAlchemy?
Sorry about the large amount of code, I did try to strip as much away as possible. Any help is welcome!
I found out that the culprit was the combination of scoped_session and the Flask app context. Like any context manager, running the code inside with app.app_context() triggered the __exit__ function on leaving, which in turn caused the ScopedRegistry, where the scoped_session was stored, to be cleared. Then a new session was created, the JobRecords were added to that, and that session was committed. Therefore, the locations would not be written to the database.
There are two possible solutions. If you don't use sessions in files other than your task, you can add a session property to the task. This way, you avoid the scoped_session altogether, and can clean up in your after_return function.
@property
def session(self):
    if self._session is None:
        from dashboard.extensions import db
        self._session = db.create_session(options={})()
    return self._session
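With that variant, after_return is also the natural place to clean up; a rough sketch of just the cleanup part (the JobRecord bookkeeping from the original after_return would still come before it):

def after_return(self, *args, **kwargs):
    # With a plain (non-scoped) session, close it explicitly when the task ends
    # so the next run builds a fresh one.
    if self._session is not None:
        self._session.close()
        self._session = None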
However, I was accessing the session in my model definition files as well, through from extensions import db, and was therefore using two different sessions. I ended up using app.app_context().push() instead of the context manager, thus avoiding the __exit__ function:
app.app_context().push()
super(RecordedTask, self).__call__(*args, **kwargs)

Non-lazy instance creation with Pyro4 and instance_mode='single'

My aim is to provide a web framework with access to a Pyro daemon that performs time-consuming tasks at its first loading. So far, I have managed to keep in memory (outside of the web app) a single instance of a class that takes care of the time-consuming loading at its initialization. I can also query it with my web app. The code for the daemon is:
@Pyro4.expose
@Pyro4.behavior(instance_mode='single')
class Thing(object):
    def __init__(self):
        self._store = ...  # the expensive loading

    def query_store(self, query):
        return ...  # Useful query tool to expose to the web framework.
                    # Not time consuming, provided self._store is loaded.

with Pyro4.Daemon() as daemon:
    uri = daemon.register(Thing)
    with Pyro4.locateNS() as ns:
        ns.register('thing', uri)
    daemon.requestLoop()
The issue I am having is that although a single instance is created, it is only created at the first proxy query from the web app. This is normal behavior according to the doc, but not what I want, as the first query is still slow because of the initialization of Thing.
How can I make sure the instance is already created as soon as the daemon is started?
I was thinking of creating a proxy instance of Thing in the code of the daemon, but this is tricky because the event loop must be running.
EDIT
It turns out that daemon.register() can accept either a class or an object, which could be a solution. This is however not recommended in the doc (link above) and that feature apparently only exists for backwards compatibility.
Do whatever initialization you need outside of your Pyro code. Cache it somewhere. Use the instance_creator parameter of the @Pyro4.behavior decorator for maximum control over how and when an instance is created. You can even consider pre-creating server instances yourself and retrieving one from a pool, if you so desire. Anyway, one possible way to do this is like so:
import Pyro4

def slow_initialization():
    print("initializing stuff...")
    import time
    time.sleep(4)
    print("stuff is initialized!")
    return {"initialized stuff": 42}

cached_initialized_stuff = slow_initialization()

def instance_creator(cls):
    print("(Pyro is asking for a server instance! Creating one!)")
    return cls(cached_initialized_stuff)

@Pyro4.behavior(instance_mode="percall", instance_creator=instance_creator)
class Server:
    def __init__(self, init_stuff):
        self.init_stuff = init_stuff

    @Pyro4.expose
    def work(self):
        print("server: init stuff is:", self.init_stuff)
        return self.init_stuff

Pyro4.Daemon.serveSimple({
    Server: "test.server"
})
But this complexity is not needed for your scenario: just initialize the thing (that takes a long time) and cache it somewhere. Instead of re-initializing it every time a new server object is created, just refer to the cached pre-initialized result. Something like this:
import Pyro4

def slow_initialization():
    print("initializing stuff...")
    import time
    time.sleep(4)
    print("stuff is initialized!")
    return {"initialized stuff": 42}

cached_initialized_stuff = slow_initialization()

@Pyro4.behavior(instance_mode="percall")
class Server:
    def __init__(self):
        self.init_stuff = cached_initialized_stuff

    @Pyro4.expose
    def work(self):
        print("server: init stuff is:", self.init_stuff)
        return self.init_stuff

Pyro4.Daemon.serveSimple({
    Server: "test.server"
})
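For completeness, a client would then look roughly like this (a sketch; it assumes a Pyro name server is running, which serveSimple registers with by default):

import Pyro4

# Look up the object registered as "test.server" and call it.
with Pyro4.Proxy("PYRONAME:test.server") as server:
    print(server.work())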

CherryPy MVC with MySQL issue

I have a problem setting up an MVC design with CherryPy/MySQL. Here is the setup (assume all the imports are correct):
## controller.py
class User(object):
    def __init__(self):
        self.model = model.User()

    @cherrypy.expose
    def index(self):
        return 'some HTML to display user home'

## model.py
class Model(object):
    _db = None

    def __init__(self):
        self._db = cherrypy.thread_data.db

class User(Model):
    def getuser(self, email):
        # get the user with _db and return result

## service.py
class UserService(object):
    def __init__(self):
        self._model = model.User()

    def GET(self, email):
        return self._model.getuser(email)

## starting the server
user = controller.User()
user.service = service.UserService()
cherrypy.tree.mount(user, '/user', self.config)
# app.merge(self.config)
cherrypy.engine.subscribe("start_thread", self._onThreadStart)
self._onThreadStart(-1)

def _onThreadStart(self, threadIndex):
    cherrypy.thread_data.db = mysql.connect(**self.config["database"])

if __name__ == '__main__':
    cherrypy.engine.start()
    cherrypy.engine.block()
The above code raises an error in model.py at the line cherrypy.thread_data.db.
I got:
AttributeError: '_ThreadData' object has no attribute 'db'
I'm not sure why. Could you please point me in the right direction? I can get the connection and pull info from controller.py in User.index, but not in model.py.
Please help, thanks.
CherryPy doesn't decide for you what tools to use. It is up to you to pick the tools that fit you and your tasks best. Thus CherryPy doesn't set up any database connection; your cherrypy.thread_data.db is your job to create.
Personally I use the same kind of responsibility separation, a sort of MVC, for my CherryPy apps, so here follow two possible ways to achieve what you want.
Design note
I would like to note that the simple solution of thread-mapped database connections, at least in the case of MySQL, works pretty well in practice, and the additional complexity of more old-fashioned connection pools may not be necessary.
There are, however, points that shouldn't be overlooked. Your database connection may be killed, lost, or end up in some other state that won't allow you to make queries on it. In this case a reconnection must be performed.
Also take care to avoid sharing a connection between threads, as that will result in hard-to-debug errors and Python crashes. In your example code, this may relate to the service dispatcher and its cache.
Bootstrapping phase
In your code that sets configuration, mounts CherryPy apps, etc.
bootstrap.py
# ...
import MySQLdb as mysql

def _onThreadStart(threadIndex):
    cherrypy.thread_data.db = mysql.connect(**config['database'])

cherrypy.engine.subscribe('start_thread', _onThreadStart)
# useful for tests to have db connection on current thread
_onThreadStart(-1)
model.py
import cherrypy
import MySQLdb as mysql

class Model(object):
    '''Your abstract model'''
    _db = None

    def __init__(self):
        self._db = cherrypy.thread_data.db
        try:
            # reconnect if needed
            self._db.ping(True)
        except mysql.OperationalError:
            pass
I wrote a complete CherryPy deployment tutorial, cherrypy-webapp-skeleton, a couple of years ago. You can take a look at the code, as the demo application uses exactly this approach.
Model property
To reduce code coupling and avoid import cycles, it could be a good idea to move all database-related code to the model module. That may include initial connection queries, like setting the operation timezone, making MySQLdb converters timezone-aware, etc.
model.py
class Model(object):
    def __init__(self):
        try:
            # reconnect if needed
            self._db.ping(True)
        except mysql.OperationalError:
            pass

    @property
    def _db(self):
        '''Thread-mapped connection accessor'''
        if not hasattr(cherrypy.thread_data, 'db'):
            cherrypy.thread_data.db = mysql.connect(**config['database'])
        return cherrypy.thread_data.db
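On top of that base class, the question's User model could then look roughly like this (a sketch; the table and column names are made up):

class User(Model):
    def getuser(self, email):
        # Plain cursor usage over the thread-mapped connection.
        cursor = self._db.cursor()
        try:
            cursor.execute(
                'SELECT id, name, email FROM user WHERE email = %s', (email,))
            return cursor.fetchone()
        finally:
            cursor.close()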
