I am using Open Semantic Search (OSS) and I would like to monitor its processes using the Flower tool. The workers that Celery needs should be given as OSS states on its website
The workers will do tasks like analysis and indexing of the queued files. The workers are implemented by etl/tasks.py and will be started automatically on boot by the service opensemanticsearch.
This tasks.py file looks as follows:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
#
# Queue tasks for batch processing and parallel processing
#
# Queue handler
from celery import Celery
# ETL connectors
from etl import ETL
from etl_delete import Delete
from etl_file import Connector_File
from etl_web import Connector_Web
from etl_rss import Connector_RSS
verbose = True
quiet = False
app = Celery('etl.tasks')
app.conf.CELERYD_MAX_TASKS_PER_CHILD = 1
etl_delete = Delete()
etl_web = Connector_Web()
etl_rss = Connector_RSS()
#
# Delete document with URI from index
#
#app.task(name='etl.delete')
def delete(uri):
etl_delete.delete(uri=uri)
#
# Index a file
#
#app.task(name='etl.index_file')
def index_file(filename, wait=0, config=None):
if wait:
time.sleep(wait)
etl_file = Connector_File()
if config:
etl_file.config = config
etl_file.index(filename=filename)
#
# Index file directory
#
#app.task(name='etl.index_filedirectory')
def index_filedirectory(filename):
from etl_filedirectory import Connector_Filedirectory
connector_filedirectory = Connector_Filedirectory()
result = connector_filedirectory.index(filename)
return result
#
# Index a webpage
#
#app.task(name='etl.index_web')
def index_web(uri, wait=0, downloaded_file=False, downloaded_headers=[]):
if wait:
time.sleep(wait)
result = etl_web.index(uri, downloaded_file=downloaded_file, downloaded_headers=downloaded_headers)
return result
#
# Index full website
#
#app.task(name='etl.index_web_crawl')
def index_web_crawl(uri, crawler_type="PATH"):
import etl_web_crawl
result = etl_web_crawl.index(uri, crawler_type)
return result
#
# Index webpages from sitemap
#
#app.task(name='etl.index_sitemap')
def index_sitemap(uri):
from etl_sitemap import Connector_Sitemap
connector_sitemap = Connector_Sitemap()
result = connector_sitemap.index(uri)
return result
#
# Index RSS Feed
#
#app.task(name='etl.index_rss')
def index_rss(uri):
result = etl_rss.index(uri)
return result
#
# Enrich with / run plugins
#
#app.task(name='etl.enrich')
def enrich(plugins, uri, wait=0):
if wait:
time.sleep(wait)
etl = ETL()
etl.read_configfile('/etc/opensemanticsearch/etl')
etl.read_configfile('/etc/opensemanticsearch/enhancer-rdf')
etl.config['plugins'] = plugins.split(',')
filename = uri
# if exist delete protocoll prefix file://
if filename.startswith("file://"):
filename = filename.replace("file://", '', 1)
parameters = etl.config.copy()
parameters['id'] = uri
parameters['filename'] = filename
parameters, data = etl.process (parameters=parameters, data={})
return data
#
# Read command line arguments and start
#
#if running (not imported to use its functions), run main function
if __name__ == "__main__":
from optparse import OptionParser
parser = OptionParser("etl-tasks [options]")
parser.add_option("-q", "--quiet", dest="quiet", action="store_true", default=False, help="Don\'t print status (filenames) while indexing")
parser.add_option("-v", "--verbose", dest="verbose", action="store_true", default=False, help="Print debug messages")
(options, args) = parser.parse_args()
if options.verbose == False or options.verbose==True:
verbose = options.verbose
etl_delete.verbose = options.verbose
etl_web.verbose = options.verbose
etl_rss.verbose = options.verbose
if options.quiet == False or options.quiet==True:
quiet = options.quiet
app.worker_main()
I read multiple tutorials about Celery and from my understanding, this line should do the job
celery -A etl.tasks flower
but it doesnt. The result is the statement
Error: Unable to load celery application. The module etl was not found.
Same for
celery -A etl.tasks worker --loglevel=debug
so Celery itself seems to be causing the trouble, not flower. I also tried e.g. celery -A etl.index_filedirectory worker --loglevel=debug but with the same result.
What am I missing? Do I have to somehow tell Celery where to find etl.tasks? Online research doesn't really show a similar case, most of the "Module not found" errors seem to occur while importing stuff. So possibly it's a silly question but I couldn't find a solution anywhere. I hope you guys can help me. Unfortunately, I won't be able to respond until Monday though, sorry in advance.
I got same issue, I installed and configured my queue as follows, and it works.
Install RabbitMQ
MacOS
brew install rabbitmq
sudo vim ~/.bash_profile
In bash_profile add the following line:
PATH=$PATH:/usr/local/sbin
Then update bash_profile:
sudo source ~/.bash_profile
Linux
sudo apt-get install rabbitmq-server
Configure RabbitMQ
Launch the queue:
sudo rabbitmq-server
In another Terminal, configure the queue:
sudo rabbitmqctl add_user myuser mypassword
sudo rabbitmqctl add_vhost myvhost
sudo rabbitmqctl set_user_tags myuser mytag
sudo rabbitmqctl set_permissions -p myvhost myuser ".*" ".*" ".*"
Launch Celery
I would suggest to go in the folder that contains task.py and use the following command:
celery -A task worker -l info -Q celery --concurrency 5
Beware that this error means two things:
The module is missing
The module exists but cannot be loaded. If it has errors in it, such as a SyntaxError for instance.
To check that it's not the latter, run:
python -c "import <myModuleContainingTasksDotPyFile>"
In the context of this question:
python -c "import etl"
If it crashes, fix this first (Unlike with celery, you'll get a detailed error message).
Solutions above did not work for me.
I had the same issue and my problem was that in main celery.py (that was in SmartCalend folder) I had:
app = Celery('proj')
but instead I must type there:
app = Celery('SmartCalend')
where SmartCalend is the actual app name where celery.py belongs (!). not any random word, but precisely app name. Thats nowhere mentioned, only in official docs here:
Try export PYTHONPATH=<parent directory> where parent directory is the folder where the etl is. Run the Celery worker, and see it if fixes your problem. This is probably one of the most common Celery "issues" (not really Celery, but Python in general). Alternatively, run the Celery worker from that folder.
Answer for MacOS Catalina:
When you install celery with pip (pip install celery), python can import celery, but you are not able to launch celery from the terminal because the terminal does not know of the celery executable.
Add celery to the path to fix:
nano ~/.bash_profile
In the file add: export PATH="/Users/gavinbelson/Library/Python/2.7/bin:$PATH"
To save the file in the nano editor: ctrl+o, then enter, then ctrl+x
To update the terminal with your change type: source ~/.bash_profile
Now you should be able to type celery in the terminal window
---- Note this is for the default python terminal command which runs version 2.7. If you are using python3 to run python, you would need to change alter the path variable accordingly
Related
I know crons run in a different environment than command lines, but I'm using absolute paths everywhere and I don't understand why my script behaves differently. I believe it is somehow related to my cron_supervisor which runs the django "manage.py" within a sub process.
Cron:
0 * * * * /home/p1/.virtualenvs/prod/bin/python /home/p1/p1/manage.py cron_supervisor --command="/home/p1/.virtualenvs/prod/bin/python /home/p1/p1/manage.py envoyer_argent"
This will call the cron_supervisor, and it's call the script, but the script won't be executed as it would if I would run:
/home/p1/.virtualenvs/prod/bin/python /home/p1/p1/manage.py envoyer_argent
Is there something particular to be done for the script to be called properly when running it through another script?
Here is the supervisor, which basically is for error handling and making sure we get warned if something goes wrong within the cron scripts themselves.
import logging
import os
from subprocess import PIPE, Popen
from django.core.management.base import BaseCommand
from command_utils import email_admin_error, isomorphic_logging
from utils.send_slack_message import send_slack_message
CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
PROJECT_DIR = CURRENT_DIR + '/../../../'
logging.basicConfig(
level=logging.INFO,
filename=PROJECT_DIR + 'cron-supervisor.log',
format='%(asctime)s %(levelname)s: %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
class Command(BaseCommand):
help = "Control a subprocess"
def add_arguments(self, parser):
parser.add_argument(
'--command',
dest='command',
help="Command to execute",
)
parser.add_argument(
'--mute_on_success',
dest='mute_on_success',
action='store_true',
help="Don't post any massage on success",
)
def handle(self, *args, **options):
try:
isomorphic_logging(logging, "Starting cron supervisor with command \"" + options['command'] + "\"")
if options['command']:
self.command = options['command']
else:
error_message = "Empty required parameter --command"
# log error
isomorphic_logging(logging, error_message, "error")
# send slack message
send_slack_message("Cron Supervisor Error: " + error_message)
# send email to admin
email_admin_error("Cron Supervisor Error", error_message)
raise ValueError(error_message)
if options['mute_on_success']:
self.mute_on_success = True
else:
self.mute_on_success = False
# running process
process = Popen([self.command], stdout=PIPE, stderr=PIPE, shell=True)
output, error = process.communicate()
if output:
isomorphic_logging(logging, "Output from cron:" + output)
# check for any subprocess error
if process.returncode != 0:
error_message = 'Command \"{command}\" - Error \nReturn code: {code}\n```{error}```'.format(
code=process.returncode,
error=error,
command=self.command,
)
self.handle_error(error_message)
else:
message = "Command \"{command}\" ended without error".format(command=self.command)
isomorphic_logging(logging, message)
# post message on slack if process isn't muted_on_success
if not self.mute_on_success:
send_slack_message(message)
except Exception as e:
error_message = 'Command \"{command}\" - Error \n```{error}```'.format(
error=e,
command=self.command,
)
self.handle_error(error_message)
def handle_error(self, error_message):
# log the error in local file
isomorphic_logging(logging, error_message)
# post message in slack
send_slack_message(error_message)
# email admin
email_admin_error("Cron Supervisor Error", error_message)
Example of script not executed properly when being called by the cron, through the cron_supervisor:
# -*- coding: utf-8 -*-
import json
import logging
import os
from django.conf import settings
from django.core.management.base import BaseCommand
from utils.lock import handle_lock
logging.basicConfig(
level=logging.INFO,
filename=os.path.join(settings.BASE_DIR, 'crons.log'),
format='%(asctime)s %(levelname)s: %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
class Command(BaseCommand):
help = "Envoi de l'argent en attente"
#handle_lock
def handle(self, *args, **options):
logging.info("some logs that won't be log (not called)")
logging.info("Those logs will be correcly logged")
Additionally, I have another issue with the logging which I don't quite understand either, I specify to store logs within cron-supervisor.log but they don't get stored there, I couldn't figure out why. (but that's not related to my main issue, just doesn't help with debug)
Your cron job can't just run the Python interpreter in the virtualenv; this is completely insufficient. You need to activate the env just like in an interactive environment.
0 * * * * . /home/p1/.virtualenvs/prod/bin/activate; python /home/p1/p1/manage.py cron_supervisor --command="python /home/p1/p1/manage.py envoyer_argent"
This is already complex enough that you might want to create a separate wrapper script containing these commands.
Without proper diagnostics of how your current script doesn't work, it's entirely possible that this fix alone is insufficient. Cron jobs do not only (or particularly) need absoute paths; the main differences compared to interactive shells is that cron jobs run with a different and more spare environment, where e.g. the shell's PATH, various library paths, environment variables etc can be different or missing altogether; and of course, no interactive facilities are available.
The system variables will hopefully be taken care of by your virtualenv; if it's correctly done, activating it will set up all the variables (PATH, PYTHONPATH, etc) your script needs. There could still be things like locale settings which are set up by your shell only when you log in interactively; but again, without details, let's just hope this isn't an issue for you.
The reason some people recommend absolute paths is that this will work regardless of your working directory. But a correctly written script should work fine in any directory; if it matters, the cron job will start in the owner's home directory. If you wanted to point to a relative path from there, this will work fine inside a cron job just as it does outside.
As an aside, you probably should not use subprocess.Popen() if one of the higher-level wrappers from the subprocess module do what you want. Unless compatibility with legacy Python versions is important, you should probably use subprocess.run() ... though running Python as a subprocess of Python is also often a useless oomplication. See also my answer to this related question.
I'm writing a daemon in python, using the python-daemon package. the daemon is started at boot-time (init.d) and needs to access various devices.
the daemon is to run on an embedded system (beaglebone) running ubuntu.
now my problem is that I want to run the daemon as an unprivileged user rather (e.g. mydaemon) than root.
in order to allow the daemon to access the devices I added that user to the required groups.
in the python code I use daemon.DaemonContext(uid=uidofmydamon).
the process started by root daemonizes nicely and is owned by the correct user, but I get permission denied errors when trying to access the devices.
I wrote a small test application, and it seems that the process does not inherit the group-memberships of the user.
#!/usr/bin/python
import logging, daemon, os
if __name__ == '__main__':
lh=logging.StreamHandler()
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(lh)
uid=1001 ## UID of the daemon user
with daemon.DaemonContext(uid=uid,
files_preserve=[lh.stream],
stderr=lh.stream):
logger.warn("UID : %s" % str(os.getuid()))
logger.warn("groups: %s" % str(os.getgroups()))
when i run the above code as the user with uid=1001 i get something like
$ ./testdaemon.py
UID: 1001
groups: [29,107,1001]
whereas when I run the above code as root (or su), I get:
$ sudo ./testdaemon.py
UID: 1001
groups: [0]
How can I create a daemon-process started by root but with a different effective uid and intact group memberships?
my current solution involves dropping root priviliges before starting the actual daemon, using the chuid argument for start-stop-daemon:
start-stop-daemon \
--start \
--chuid daemonuser \
--name testdaemon \
--pidfile /var/run/testdaemon/test.pid \
--startas /tmp/testdaemon.py \
-- \
--pidfile /var/run/testdaemon/test.pid \
--logfile=/var/log/testdaemon/testdaemon.log
the drawback of this solution is, that i need to create all directories, where the daemon ought to write to (noteably /var/run/testdaemon and /var/log/testdaemon), before starting the actual daemon (with the proper file permissions).
i would have preferred to write that logic in python rather than bash.
for now that works, but me thinketh that this should be solveable in a more elegant fashion.
This can be fixed by monkey patching the daemon module, the code is as follows:
import os, grp, pwd
class DaemonError(Exception):
pass
class DaemonOSEnvironmentError(DaemonError, OSError):
pass
def change_process_owner(uid, gid):
try:
# This line adds all the groups the user is member of
# to keep the expected permissions
os.setgroups(
[g.gr_gid for g in grp.getgrall()
if pwd.getpwuid(uid).pw_name in g.gr_mem
]
)
os.setgid(gid)
os.setuid(uid)
except Exception, exc:
error = DaemonOSEnvironmentError(u"Unable to change process
owner (%(exc)s)" % vars())
raise error
And then the monkey patch:
import daemon
daemon.daemon.change_process_owner = change_process_owner
I have an application that needs to initialize Celery and other things (e.g. database). I would like to have a .ini file that would contain the applications configuration. This should be passed to the application at runtime.
development.init:
[celery]
broker=amqp://localhost/
backend=amqp://localhost/
task.result.expires=3600
[database]
# database config
# ...
celeryconfig.py:
from celery import Celery
import ConfigParser
config = ConfigParser.RawConfigParser()
config.read(...) # Pass this from the command line somehow
celery = Celery('myproject.celery',
broker=config.get('celery', 'broker'),
backend=config.get('celery', 'backend'),
include=['myproject.tasks'])
# Optional configuration, see the application user guide.
celery.conf.update(
CELERY_TASK_RESULT_EXPIRES=config.getint('celery', 'task.result.expires')
)
# Initialize database, etc.
if __name__ == '__main__':
celery.start()
To start Celery, I call:
celery worker --app=myproject.celeryconfig -l info
Is there anyway to pass in the config file without doing something ugly like setting a environment variable?
Alright, I took Jordan's advice and used a env variable. This is what I get in celeryconfig.py:
celery import Celery
import os
import sys
import ConfigParser
CELERY_CONFIG = 'CELERY_CONFIG'
if not CELERY_CONFIG in os.environ:
sys.stderr.write('Missing env variable "%s"\n\n' % CELERY_CONFIG)
sys.exit(2)
configfile = os.environ['CELERY_CONFIG']
if not os.path.isfile(configfile):
sys.stderr.write('Can\'t read file: "%s"\n\n' % configfile)
sys.exit(2)
config = ConfigParser.RawConfigParser()
config.read(configfile)
celery = Celery('myproject.celery',
broker=config.get('celery', 'broker'),
backend=config.get('celery', 'backend'),
include=['myproject.tasks'])
# Optional configuration, see the application user guide.
celery.conf.update(
CELERY_TASK_RESULT_EXPIRES=config.getint('celery', 'task.result.expires'),
)
if __name__ == '__main__':
celery.start()
To start Celery:
$ export CELERY_CONFIG=development.ini
$ celery worker --app=myproject.celeryconfig -l info
How is setting an environment variable ugly? You either set an environment variable with the current version of your application, or you can derive it based on hostname, or you can make your build/deployment process overwrite the file, and on development you let development.ini copy over to settings.ini in a general location, and on production you let production.ini copy over to settings.ini.
Any of these options are quite common. Using a configuration management tool such as Chef or Puppet to put the file in place is a good option.
Original Question
I've got some python scripts which have been using Amazon S3 to upload screenshots taken following Selenium tests within the script.
Now we're moving from S3 to use GitHub so I've found GitPython but can't see how you use it to actually commit to the local repo and push to the server.
My script builds a directory structure similar to \images\228M\View_Use_Case\1.png in the workspace and when uploading to S3 it was a simple process;
for root, dirs, files in os.walk(imagesPath):
for name in files:
filename = os.path.join(root, name)
k = bucket.new_key('{0}/{1}/{2}'.format(revisionNumber, images_process, name)) # returns a new key object
k.set_contents_from_filename(filename, policy='public-read') # opens local file buffers to key on S3
k.set_metadata('Content-Type', 'image/png')
Is there something similar for this or is there something as simple as a bash type git add images command in GitPython that I've completely missed?
Updated with Fabric
So I've installed Fabric on kracekumar's recommendation but I can't find docs on how to define the (GitHub) hosts.
My script is pretty simple to just try and get the upload to work;
from __future__ import with_statement
from fabric.api import *
from fabric.contrib.console import confirm
import os
def git_server():
env.hosts = ['github.com']
env.user = 'git'
env.passowrd = 'password'
def test():
process = 'View Employee'
os.chdir('\Work\BPTRTI\main\employer_toolkit')
with cd('\Work\BPTRTI\main\employer_toolkit'):
result = local('ant viewEmployee_git')
if result.failed and not confirm("Tests failed. Continue anyway?"):
abort("Aborting at user request.")
def deploy():
process = "View Employee"
os.chdir('\Documents and Settings\markw\GitTest')
with cd('\Documents and Settings\markw\GitTest'):
local('git add images')
local('git commit -m "Latest Selenium screenshots for %s"' % (process))
local('git push -u origin master')
def viewEmployee():
#test()
deploy()
It Works \o/ Hurrah.
You should look into Fabric. http://docs.fabfile.org/en/1.4.1/index.html. Automated server deployment tool. I have been using this quite some time, it works pretty fine.
Here is my one of the application which uses it, https://github.com/kracekumar/sachintweets/blob/master/fabfile.py
It looks like you can do this:
index = repo.index
index.add(['images'])
new_commit = index.commit("my commit message")
and then, assuming you have origin as the default remote:
origin = repo.remotes.origin
origin.push()
I installed Celery (latest stable version.)
I have a directory called /home/myuser/fable/jobs. Inside this directory, I have a file called tasks.py:
from celery.decorators import task
from celery.task import Task
class Submitter(Task):
def run(self, post, **kwargs):
return "Yes, it works!!!!!!"
Inside this directory, I also have a file called celeryconfig.py:
BROKER_HOST = "localhost"
BROKER_PORT = 5672
BROKER_USER = "abc"
BROKER_PASSWORD = "xyz"
BROKER_VHOST = "fablemq"
CELERY_RESULT_BACKEND = "amqp"
CELERY_IMPORTS = ("tasks", )
In my /etc/profile, I have these set as my PYTHONPATH:
PYTHONPATH=/home/myuser/fable:/home/myuser/fable/jobs
So I run my Celery worker using the console ($ celeryd --loglevel=INFO), and I try it out.
I open the Python console and import the tasks. Then, I run the Submitter.
>>> import fable.jobs.tasks as tasks
>>> s = tasks.Submitter()
>>> s.delay("abc")
<AsyncResult: d70d9732-fb07-4cca-82be-d7912124a987>
Everything works, as you can see in my console
[2011-01-09 17:30:05,766: INFO/MainProcess] Task tasks.Submitter[d70d9732-fb07-4cca-82be-d7912124a987] succeeded in 0.0398268699646s:
But when I go into my Django's views.py and run the exact 3 lines of code as above, I get this:
[2011-01-09 17:25:20,298: ERROR/MainProcess] Unknown task ignored: "Task of kind 'fable.jobs.tasks.Submitter' is not registered, please make sure it's imported.": {'retries': 0, 'task': 'fable.jobs.tasks.Submitter', 'args': ('abc',), 'expires': None, 'eta': None, 'kwargs': {}, 'id': 'eb5c65b4-f352-45c6-96f1-05d3a5329d53'}
Traceback (most recent call last):
File "/home/myuser/mysite-env/lib/python2.6/site-packages/celery/worker/listener.py", line 321, in receive_message
eventer=self.event_dispatcher)
File "/home/myuser/mysite-env/lib/python2.6/site-packages/celery/worker/job.py", line 299, in from_message
eta=eta, expires=expires)
File "/home/myuser/mysite-env/lib/python2.6/site-packages/celery/worker/job.py", line 243, in __init__
self.task = tasks[self.task_name]
File "/home/myuser/mysite-env/lib/python2.6/site-packages/celery/registry.py", line 63, in __getitem__
raise self.NotRegistered(str(exc))
NotRegistered: "Task of kind 'fable.jobs.tasks.Submitter' is not registered, please make sure it's imported."
It's weird, because the celeryd client does show that it's registered, when I launch it.
[2011-01-09 17:38:27,446: WARNING/MainProcess]
Configuration ->
. broker -> amqp://GOGOme#localhost:5672/fablemq
. queues ->
. celery -> exchange:celery (direct) binding:celery
. concurrency -> 1
. loader -> celery.loaders.default.Loader
. logfile -> [stderr]#INFO
. events -> OFF
. beat -> OFF
. tasks ->
. tasks.Decayer
. tasks.Submitter
Can someone help?
This is what I did which finally worked
in Settings.py I added
CELERY_IMPORTS = ("myapp.jobs", )
under myapp folder I created a file called jobs.py
from celery.decorators import task
#task(name="jobs.add")
def add(x, y):
return x * y
Then ran from commandline: python manage.py celeryd -l info
in another shell i ran python manage.py shell, then
>>> from myapp.jobs import add
>>> result = add.delay(4, 4)
>>> result.result
and the i get:
16
The important point is that you have to rerun both command shells when you add a new function. You have to register the name both on the client and and on the server.
:-)
I believe your tasks.py file needs to be in a django app (that's registered in settings.py) in order to be imported. Alternatively, you might try importing the tasks from an __init__.py file in your main project or one of the apps.
Also try starting celeryd from manage.py:
$ python manage.py celeryd -E -B -lDEBUG
(-E and -B may or may not be necessary, but that's what I use).
See Automatic Naming and Relative Imports, in the docs:
http://celeryq.org/docs/userguide/tasks.html#automatic-naming-and-relative-imports
The tasks name is "tasks.Submitter" (as listed in the celeryd output),
but you import the task as "fable.jobs.tasks.Submitter"
I guess the best solution here is if the worker also sees it as "fable.jobs.tasks.Submitter",
it makes more sense from an app perspective.
CELERY_IMPORTS = ("fable.jobs.tasks", )