I have a Flask application that uses Nose to discover and run a series of tests in a particular directory. The tests take a long time to run, so I want to report the progress to the user as things are happening.
I use Celery to create a task that runs the tests so I can return immediately and start displaying a results page. Now I need to start reporting results. I'm thinking that, inside a test, I can just put a message on the queue saying 'I've completed step N'.
I know that Celery has a task context I could use to determine which queue to write to, but the test isn't part of the task; it's a function called from the task. I also can't use a Flask session, because that context is gone once the test run is moved to a task.
I have seen several ways to do data-driven Nose tests, such as test generators or nose-testconfig, but neither meets the requirement that the message queue name be dynamic, and there may be several threads running the same test.
So, my question is: How do I tell the test that it corresponds to a particular Celery task, i.e. the one that started the test, so I can report its status on the correct message queue?
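One common workaround for this kind of context-passing is a thread-local that the task populates before it starts the test run; code deep inside the tests can then look up the right queue name without the queue name being threaded through every call. This is only a sketch and all names in it are hypothetical, not part of Nose or Celery:

```python
# Sketch (hypothetical names): the Celery task stashes its own task id in a
# thread-local before invoking the test run, so code called from inside the
# tests can derive the per-task progress queue name.
import threading

_context = threading.local()

def set_current_task_id(task_id):
    """Called by the Celery task before it starts the test run."""
    _context.task_id = task_id

def get_progress_queue():
    """Called from inside a test to pick the per-task queue name."""
    task_id = getattr(_context, "task_id", None)
    if task_id is None:
        raise RuntimeError("no task id set; not running under a task?")
    return "progress.%s" % task_id

# Inside the Celery task (with bind=True, self.request.id is the task id):
#     set_current_task_id(self.request.id)
#     nose.run(...)
# Inside a test:
#     publish(get_progress_queue(), "I've completed step N")
```

Because the storage is a `threading.local`, several worker threads can run the same test concurrently, each seeing its own task id.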
Without changing the code itself, Is there a way to ignore tasks in Celery?
For example, when using Django mails, there is a Dummy Backend setting. This is perfect since it allows me, from a .env file to deactivate mail sending in some environments (like testing, or staging). The code itself that handles mail sending is not changed with if statements or decorators.
For Celery tasks, I know I could do it in code using mocks or decorators, but I'd like to do it in a clean way that is 12-factor compliant, like with Django mails. Any idea?
EDIT to explain why I want to do this:
One of the main motivations behind this is that it creates coupling between the Django web server and Celery tasks.
For example, when running unit tests, if the broker server (Redis for me) is not running, then any call to delay() freezes forever, because Celery applies no timeout when sending a task to Redis.
From an architecture point of view, this is very bad. I'd like my unit tests to run properly without requiring a running Celery broker!
Thanks!
As far as the coupling is concerned, your Django application would still be tied to Celery if you used a dummy backend; just your tasks wouldn't execute. Maybe that is acceptable in your case, but in my opinion it can cause problems. For example, if the piece of code you are trying to test submits a task to Celery and later tries to retrieve the result of that task, it will fail, because the dummy backend would never execute the task.
For unit testing, as you mentioned in your question, you can use the task_always_eager setting. If you turn it on, your Django app will no longer depend upon a running worker. It will execute tasks in the same thread in a synchronous fashion and return the result.
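The eager switch can be driven from the environment in the same 12-factor spirit as Django's `EMAIL_BACKEND`. A minimal sketch, assuming you name the environment variable `CELERY_TASK_ALWAYS_EAGER` (the variable name is a convention of this example, not something Celery reads itself):

```python
# Sketch: derive Celery config overrides from a 12-factor style env var,
# so tests/staging can run tasks in-process without a broker or worker.
import os

def celery_settings_from_env(environ=os.environ):
    """Return Celery config overrides based on the environment."""
    eager = environ.get("CELERY_TASK_ALWAYS_EAGER", "false").lower() in ("1", "true", "yes")
    return {
        # Run tasks synchronously in the calling process (no broker needed).
        "task_always_eager": eager,
        # Re-raise task exceptions so eager test runs fail loudly.
        "task_eager_propagates": eager,
    }
```

You would then apply the returned dict with `app.conf.update(...)` (or the equivalent Django settings) at startup, and set the variable in the `.env` file of each environment.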
I currently have a typical Django structure set up for a project and one web application.
The web application is set up so that a user inputs some information, and this information is taken as the input to run a Python program.
This Python program can take quite a while to finish (it grabs things from the web and does some text-mining scoring), sometimes multiple minutes.
On the command line, this program would periodically report its progress (first how many items it had found to score against, then how far through those items the scoring had got), which was very useful. However, after moving this to a Django setup, I no longer have that capability (at least not in the same way, since that output now goes to log files).
The way I set it up is that there is an input view and then a results view. The results view takes the input and runs the Python program, and it won't display the results until the entire program has run. So on the user side, the browser just sits there, sometimes for minutes, before the results are displayed. Obviously, this is not ideal.
Does anyone know of the best way to bring status information on a task to Django?
I've looked into Celery a little bit, but since I'm still a beginner in Django I think I'm confusing myself with some of the documentation. For instance: even if the task is sent off asynchronously to a worker, how does the browser grab the current state of the program? Also, consistent documentation for Celery on Django seems to be lacking (I've seen people set up Celery many different ways in their Django projects).
I would appreciate any input here, I've been stuck on this for a while now.
My first suggestion is to psychologically separate celery from django when you start to think of the two. They can run in the same environment, but celery is to asynchronous processes what django is to http requests.
Also remember that celery is unlike django in that it requires another service to function: a message broker. So by using celery you will increase your architectural requirements.
To address your specific use case, you'll need a system to publish messages from each celery task to a message broker, and your web client will need to subscribe to those messages.
There's a lot involved here, but the short version is that you can use Redis as your celery message broker as well as your pub/sub service to get messages back to the browser. You can then use e.g. django-redis-websockets to subscribe the browser to the task state messages in Redis.
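The publish/subscribe flow can be sketched with standard-library stand-ins for the broker. In the real setup the Celery task would publish to a Redis channel (e.g. via `self.update_state(...)` or a pub/sub message) and the websocket layer would subscribe; here a `queue.Queue` plays the channel, purely to show the shape of the data flow:

```python
# Sketch of the progress publish/subscribe flow. A queue.Queue stands in
# for the Redis pub/sub channel; all names are illustrative.
import queue
import threading

progress_channel = queue.Queue()  # stand-in for a Redis channel

def long_task(n_items, channel):
    """Pretend Celery task: publish a progress message after each item."""
    for i in range(1, n_items + 1):
        # ... do one unit of work here ...
        channel.put({"state": "PROGRESS", "current": i, "total": n_items})
    channel.put({"state": "SUCCESS"})

def collect_updates(channel):
    """Pretend subscriber (the browser side): drain messages until done."""
    updates = []
    while True:
        msg = channel.get()
        updates.append(msg)
        if msg["state"] == "SUCCESS":
            return updates

worker = threading.Thread(target=long_task, args=(3, progress_channel))
worker.start()
updates = collect_updates(progress_channel)
worker.join()
```

The important architectural point survives the stand-in: the task only ever writes progress messages to the channel, and the web side only ever reads them, so the two processes stay decoupled.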
I am using an email validator to check whether an email address exists or not. The process is very time consuming. I have tried using interruptingcow to limit the time each address spends waiting for a timeout response. This method worked outside Django, but inside Django I couldn't call interruptingcow because it must be called from the main thread; I have tried many ways to solve this but failed.
Secondly, I tried multi-threading the process. The thread runs just the way I wanted, but I can't get a return value from it. I tried implementing a Queue for that, which wasn't much help either.
I would like to ask for any alternatives to validate_email, or for a way to make the process called by
validate_email("emailaddress#email.com",verify=True)
run faster, as I have to process about 20 emails at a time.
Any suggestions or help is most welcome.
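On the "can't get a return value from the thread" point, the standard library's `concurrent.futures` hands return values back from worker threads directly, with no hand-rolled `Queue`. A sketch, where the `validator` argument stands in for `validate_email(addr, verify=True)` (whose exact behaviour I'm not assuming here):

```python
# Sketch: run slow, I/O-bound validations concurrently and collect the
# return values, which plain threading.Thread does not provide.
from concurrent.futures import ThreadPoolExecutor

def validate_many(addresses, validator, max_workers=20):
    """Return {address: validator(address)} using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(validator, addresses)  # preserves input order
        return dict(zip(addresses, results))
```

For ~20 addresses at a time, `max_workers=20` lets every SMTP check wait concurrently, so the total wall-clock time approaches that of the slowest single check instead of the sum of all of them.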
I usually relegate external processes to Celery; it is an industry standard for such things:
http://www.celeryproject.org/
If I need to run one background task for a project, django-tasks is usually sufficient for longer processing tasks; it is easier to set up and doesn't need an external queue like RabbitMQ or Redis, which Celery requires:
https://code.google.com/p/django-tasks/
BACKGROUND: I'm working on a project that uses Celery to schedule tasks that will run at a certain time in the future. These tasks push the state of a finite state machine (FSM) forward. Here's an example:
A future reminder is scheduled to be sent to the user in 2 days.
When that scheduled task runs, an email is sent, and the FSM is advanced to the next state
The next state is to schedule a reminder to run in another two days
When this task runs, it will send another email, advance state
Etc...
I'm currently using CELERY_ALWAYS_EAGER as suggested by this SO answer
The problem with using that technique in tests is that the task code, which is meant to run in a separate thread, runs in the same one that schedules it. This causes the FSM state not to be saved properly, making it hard to test. I haven't been able to determine exactly what causes it, but it seems that at the bottom of the call stack the current state is saved, and then, as you return up the call stack, a previous state is saved over it. I could spend more time working out what goes wrong when the code isn't running the way it should, but it seems more logical to get the code running the way it should and verify that it does what it should.
QUESTION: I would therefore like to know if there is a way to run a full-on Celery setup that Django can use during a test run. If it could happen automagically, that would be ideal, but even some manual intervention would be better than having to test the behavior by hand. I'm thinking something could work if I set a breakpoint in the tests, ran a Celery worker connected to the test DB, then continued the Django tests. Has anyone tried something like this before?
What you are trying to do is not unit testing but rather functional / integration testing.
I would recommend using a BDD framework (Behave, Lettuce) and running the BDD tests from a CI server (Travis CI or Jenkins) against an external server (a staging environment, for example).
So, the process could be:
Push changes to GitHub
GitHub launches build on CI server
CI server runs unit tests
CI server deploys to integration environment (or staging, if you don't have integration)
CI server runs end-to-end integration tests against the newly deployed code
If all succeed, the build is promoted to "can be deployed to production" or something like that
This seems like a simple question, but I am having trouble finding the answer.
I am making a web app which would require the constant running of a task.
I'll use sites like Pingdom or Twitterfeed as an analogy. As you may know, Pingdom checks uptime, so it is constantly polling websites to see if they are up, and Twitterfeed checks RSS feeds to see if they've changed and then tweets the changes. I too need to run a simple script that cycles through URLs in a database and performs an action on them.
My question is: how should I implement this? I am familiar with cron, currently using it to do my server backups. Would this be the way to go?
I know how to make a Python script that runs indefinitely, starting over at the beginning of the database when it reaches the end. Should I just run that on the server? How will I know it is always running and hasn't crashed or something?
I hope this question makes sense and I hope I am not repeating someone else or anything.
Thank you,
Sam
Edit: To be clear, I need the task to run constantly. As in, check URL 1 in the database, check URL 2 in the database, check URL 3 and, when it reaches the last one, go right back to the beginning. Thanks!
If you need a repeatable task that can be run from the command line, that's what cron is ideal for.
I don't see any drawbacks to this approach.
Update:
Okay, I saw the issue somewhat differently. Now I see several solutions:
a) run the cron task at set intervals, letting each run process one batch of data and the next run pick up where it left off; use PIDs/the database/semaphores to avoid parallel processes;
b) update the processes that insert/update data in the database, so the information is processed as it is inserted/updated;
c) write a daemon process which will reside in memory and check the data in real time.
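The "avoid parallel processes" part of option a) is commonly done with a non-blocking lock on a lockfile: each cron-launched run tries to take the lock and exits immediately if a previous run still holds it. A Unix-only sketch using `fcntl.flock` (the lockfile path is an arbitrary example):

```python
# Sketch: prevent overlapping cron runs with a non-blocking exclusive
# lock on a lockfile. Unix-only (fcntl). Path is an arbitrary example.
import fcntl

def try_lock(path="/tmp/urlchecker.lock"):
    """Return the open lockfile on success, or None if another run holds it."""
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except OSError:
        f.close()
        return None

# At the top of the cron script:
#     lock = try_lock()
#     if lock is None:
#         raise SystemExit("previous run still in progress")
#     ... process the batch of URLs ...
```

The kernel releases the lock automatically when the process exits, so even a crashed run cannot leave the lock stuck.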
cron would definitely be a way to go with this, as would any other task scheduler you prefer.
The main point is found in the title to your question:
Run a repeating task for a web app
The background task and the web application should be kept separate. They can share code, they can share access to a database, but they should be separate and discrete application contexts. (Consider them as separate UIs accessing the same back-end logic.)
The main reason for this is because web applications and background processes are architecturally very different and aren't meant to be mixed. Consider the structure of a web application being held within a web server (Apache, IIS, etc.). When is the application "running"? When it is "on"? It's not really a running task. It's a service waiting for input (requests) to handle and generate output (responses) and then go back to waiting.
Web applications are for responding to requests. Scheduled tasks or daemon jobs are for running repeated processes in the background. Keeping the two separate will make your management of the two a lot easier.