I have a process goodreads-user-scraper that runs fine within a cron scheduler script that I run from my Ubuntu terminal.
From my Ubuntu server terminal, I navigate to the directory containing scheduler.py and write:
python scheduler.py
This runs fine. It scrapes the site and saves files to the output_dir I have assigned inside the script.
Now, I want to run this script using a service file (socialAggregator.service).
When I set up a service file in my Ubuntu server to run scheduler.py, goodreads-user-scraper is not recognized. It's the exact same file I just ran from the terminal.
Why is goodreads-user-scraper not found when the service file calls the script?
Any ideas?
Error message from the syslog file:
Jan 12 22:13:15 speedypersonal2 python[2668]: --user_id: 1: goodreads-user-scraper: not found
socialAggregator.service
[Unit]
Description=Run Social Aggregator scheduler - collect data from API's and store in socialAggregator Db --- DEVELOPMENT ---
After=network.target
[Service]
User=nick
ExecStart= /home/nick/environments/social_agg/bin/python /home/nick/applications/socialAggregator/scheduler.py --serve-in-foreground
[Install]
WantedBy=multi-user.target
scheduler.py
from apscheduler.schedulers.background import BackgroundScheduler
import json
import requests
from datetime import datetime, timedelta
import os
from sa_config import ConfigLocal, ConfigDev, ConfigProd
import logging
from logging.handlers import RotatingFileHandler
import subprocess
if os.environ.get('CONFIG_TYPE')=='local':
config = ConfigLocal()
elif os.environ.get('CONFIG_TYPE')=='dev':
config = ConfigDev()
elif os.environ.get('CONFIG_TYPE')=='prod':
config = ConfigProd()
#Setting up Logger
formatter = logging.Formatter('%(asctime)s:%(name)s:%(message)s')
formatter_terminal = logging.Formatter('%(asctime)s:%(filename)s:%(name)s:%(message)s')
#initialize a logger
logger_init = logging.getLogger(__name__)
logger_init.setLevel(logging.DEBUG)
#where do we store logging information
file_handler = RotatingFileHandler(os.path.join(config.PROJ_ROOT_PATH,'social_agg_schduler.log'), mode='a', maxBytes=5*1024*1024,backupCount=2)
file_handler.setFormatter(formatter)
#where the stream_handler will print
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter_terminal)
logger_init.addHandler(file_handler)
logger_init.addHandler(stream_handler)
def scheduler_funct():
logger_init.info(f"- Started Scheduler on {datetime.today().strftime('%Y-%m-%d %H:%M')}-")
scheduler = BackgroundScheduler()
job_collect_socials = scheduler.add_job(run_goodreads,'cron', hour='*', minute='13', second='15')#Testing
scheduler.start()
while True:
pass
def run_goodreads():
logger_init.info(f"- START run_goodreads() -")
output_dir = os.path.join(config.PROJ_DB_PATH)
goodreads_process = subprocess.Popen(['goodreads-user-scraper', '--user_id', config.GOODREADS_ID,'--output_dir', output_dir], shell=True, stdout=subprocess.PIPE)
logger_init.info(f"- send subprocess now on::: goodreads_process.communicate() -")
_, _ = goodreads_process.communicate()
logger_init.info(f"- FINISH run_goodreads() -")
if __name__ == '__main__':
scheduler_funct()
The problem was that the environment the service file was using was not the same as the environment used when I run the script in the terminal.
Below is the service file that now works.
[Unit]
Description=Run Social Aggregator scheduler - collect data from API's and store in socialAggregator Db --- DEVELOPMENT ---
After=network.target
[Service]
User=nick
ExecStart= /home/nick/environments/social_agg/bin/python /home/nick/applications/socialAggregator/scheduler.py --serve-in-foreground
Environment=PATH=/home/nick/environments/social_agg/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
[Install]
WantedBy=multi-user.target
I added Environment=PATH=<path_from_terminal_and_venv_activated>
For additional clarity, path_from_terminal_and_venv_activated is obtained by:
Activating my Python venv in the terminal
Copying the result of echo $PATH
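As an alternative (a minimal sketch, assuming goodreads-user-scraper was installed into the same venv at /home/nick/environments/social_agg), the scheduler could call the scraper by its absolute path, which removes the dependency on the service's PATH entirely:

import os
import subprocess

# Assumed venv location; console scripts installed with pip live in the venv's bin/.
VENV_BIN = "/home/nick/environments/social_agg/bin"

def run_goodreads_abs(user_id, output_dir):
    scraper = os.path.join(VENV_BIN, "goodreads-user-scraper")
    # An argument list without shell=True: the absolute path means no PATH lookup,
    # and each option is passed to the scraper exactly as written.
    result = subprocess.run(
        [scraper, "--user_id", user_id, "--output_dir", output_dir],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    return result.returncode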
Related
I have written a Python script to copy files from local to a GCP bucket and capture log info.
The gsutil rsync command is working fine and files are getting copied to corresponding target folders.
However, the log info is not appearing in the GCP log viewer. The sample script is given below. Please suggest.
## python3 /home/sant/multiprocessing_gs.py
from multiprocessing import Pool
from subprocess import Popen, PIPE, TimeoutExpired, run, CalledProcessError
import os
import sys
import logging as lg
import google.cloud.logging as gcl
from google.cloud.logging.handlers import CloudLoggingHandler
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/sant/key.json"
ftp_path1 = "/home/sant"
GCS_DATA_INGEST_BUCKET_URL = "dev2-ingest-manual"
class GcsMover:
def __init__(self):
self.folder_list = ["raw_amr", "osr_data"]
self.logger = self.create_logger()
def create_logger(self, log_name="Root_Logger", log_level=lg.INFO):
try:
log_format = lg.Formatter("%(levelname)s %(asctime)s - %(message)s")
client = gcl.Client()
log_handler = CloudLoggingHandler(client)
log_handler.setFormatter(log_format)
logger = lg.getLogger(log_name)
logger.setLevel(log_level)
logger.addHandler(log_handler)
return logger
except Exception as e:
sys.exit("WARNING - Invalid cloud logging")
def execute_jobs(self, cmd):
try:
gs_sp = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=True)
print(f"starting process with Pid {str(gs_sp.pid)} for command {cmd}")
self.logger.info(f"starting process with Pid {str(gs_sp.pid)} for command {cmd}")
sp_out, sp_err = gs_sp.communicate(timeout=int(3600))
except OSError:
self.logger.error(f"Processing aborted for Pid {str(gs_sp.pid)}")
except TimeoutExpired:
gs_sp.kill()
self.logger.error(f"Processing aborted for Pid {str(gs_sp.pid)}")
else:
if gs_sp.returncode:
self.logger.error(f"Failure due to {sp_err} for Pid {str(gs_sp.pid)} and command {cmd}")
else:
print(f"Loading successful for Pid {str(gs_sp.pid)}")
self.logger.info(f"Loading successful for Pid {str(gs_sp.pid)}")
def move_files(self):
command_list = []
for folder in self.folder_list:
gs_command = f"gsutil -m rsync -r {ftp_path1}/{folder} gs://{GCS_DATA_INGEST_BUCKET_URL}/{folder}"
command_list.append(gs_command)
pool = Pool(processes=2, maxtasksperchild=1)
pool.map(self.execute_jobs, iterable=command_list)
pool.close()
pool.join()
def main():
gsu = GcsMover()
gsu.move_files()
if __name__ == "__main__":
main()
There's documentation explaining how to log activity in GCS buckets with Cloud Functions by using the Storage trigger.
I have tested it and it worked for me; I used the same code as offered in the documentation:
def hello_gcs(event, context):
"""Background Cloud Function to be triggered by Cloud Storage.
This generic function logs relevant data when a file is changed.
Args:
event (dict): The dictionary with data specific to this type of event.
The `data` field contains a description of the event in
the Cloud Storage `object` format described here:
https://cloud.google.com/storage/docs/json_api/v1/objects#resource
context (google.cloud.functions.Context): Metadata of triggering event.
Returns:
None; the output is written to Stackdriver Logging
"""
print('Event ID: {}'.format(context.event_id))
print('Event type: {}'.format(context.event_type))
print('Bucket: {}'.format(event['bucket']))
print('File: {}'.format(event['name']))
print('Metageneration: {}'.format(event['metageneration']))
print('Created: {}'.format(event['timeCreated']))
print('Updated: {}'.format(event['updated']))
And for deploying I used the command:
gcloud functions deploy hello_gcs \
--runtime python37 \
--trigger-resource YOUR_TRIGGER_BUCKET_NAME \
--trigger-event google.storage.object.finalize
Google Cloud Storage can log the actions taken on objects, as described in the documentation. You might need to activate audit logs in your project.
Since your script uses rsync, this takes a few actions on GCS (details in the code of the command). As an overview, it will check whether the object exists in the bucket (by listing the bucket); if it exists, it will compare the hash of the local file with the remote one, and it will upload the file if it has changed or if it didn't exist previously.
All of those actions will be logged in the data access logs, which you can access from the console.
If you want to also keep the local logs (in case there's a local error not logged in the cloud), you can change the command executed by appending a redirect to a log file:
gsutil -m rsync -r /source/path gs://bucket/folder &> /path/to/log
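If you embed that redirect in the questioner's script, each command built in move_files() needs it appended. A small sketch (the log directory is an assumption; since Popen(shell=True) runs /bin/sh, the POSIX form > file 2>&1 is used instead of bash's &>):

# Sketch: build per-folder rsync commands that also keep a local log file.
log_dir = "/home/sant/logs"  # assumed directory, create it beforehand
command_list = []
for folder in ["raw_amr", "osr_data"]:
    gs_command = (
        f"gsutil -m rsync -r /home/sant/{folder} "
        f"gs://dev2-ingest-manual/{folder} "
        f"> {log_dir}/{folder}_rsync.log 2>&1"
    )
    command_list.append(gs_command)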
I am using Open Semantic Search (OSS) and I would like to monitor its processes using the Flower tool. The workers that Celery needs should be set up as OSS states on its website:
The workers will do tasks like analysis and indexing of the queued files. The workers are implemented by etl/tasks.py and will be started automatically on boot by the service opensemanticsearch.
This tasks.py file looks as follows:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
#
# Queue tasks for batch processing and parallel processing
#
# Queue handler
from celery import Celery
import time  # used for the optional wait/sleep delays below
# ETL connectors
from etl import ETL
from etl_delete import Delete
from etl_file import Connector_File
from etl_web import Connector_Web
from etl_rss import Connector_RSS
verbose = True
quiet = False
app = Celery('etl.tasks')
app.conf.CELERYD_MAX_TASKS_PER_CHILD = 1
etl_delete = Delete()
etl_web = Connector_Web()
etl_rss = Connector_RSS()
#
# Delete document with URI from index
#
@app.task(name='etl.delete')
def delete(uri):
etl_delete.delete(uri=uri)
#
# Index a file
#
@app.task(name='etl.index_file')
def index_file(filename, wait=0, config=None):
if wait:
time.sleep(wait)
etl_file = Connector_File()
if config:
etl_file.config = config
etl_file.index(filename=filename)
#
# Index file directory
#
@app.task(name='etl.index_filedirectory')
def index_filedirectory(filename):
from etl_filedirectory import Connector_Filedirectory
connector_filedirectory = Connector_Filedirectory()
result = connector_filedirectory.index(filename)
return result
#
# Index a webpage
#
@app.task(name='etl.index_web')
def index_web(uri, wait=0, downloaded_file=False, downloaded_headers=[]):
if wait:
time.sleep(wait)
result = etl_web.index(uri, downloaded_file=downloaded_file, downloaded_headers=downloaded_headers)
return result
#
# Index full website
#
@app.task(name='etl.index_web_crawl')
def index_web_crawl(uri, crawler_type="PATH"):
import etl_web_crawl
result = etl_web_crawl.index(uri, crawler_type)
return result
#
# Index webpages from sitemap
#
@app.task(name='etl.index_sitemap')
def index_sitemap(uri):
from etl_sitemap import Connector_Sitemap
connector_sitemap = Connector_Sitemap()
result = connector_sitemap.index(uri)
return result
#
# Index RSS Feed
#
@app.task(name='etl.index_rss')
def index_rss(uri):
result = etl_rss.index(uri)
return result
#
# Enrich with / run plugins
#
@app.task(name='etl.enrich')
def enrich(plugins, uri, wait=0):
if wait:
time.sleep(wait)
etl = ETL()
etl.read_configfile('/etc/opensemanticsearch/etl')
etl.read_configfile('/etc/opensemanticsearch/enhancer-rdf')
etl.config['plugins'] = plugins.split(',')
filename = uri
    # if it exists, delete the protocol prefix file://
if filename.startswith("file://"):
filename = filename.replace("file://", '', 1)
parameters = etl.config.copy()
parameters['id'] = uri
parameters['filename'] = filename
parameters, data = etl.process (parameters=parameters, data={})
return data
#
# Read command line arguments and start
#
#if running (not imported to use its functions), run main function
if __name__ == "__main__":
from optparse import OptionParser
parser = OptionParser("etl-tasks [options]")
parser.add_option("-q", "--quiet", dest="quiet", action="store_true", default=False, help="Don\'t print status (filenames) while indexing")
parser.add_option("-v", "--verbose", dest="verbose", action="store_true", default=False, help="Print debug messages")
(options, args) = parser.parse_args()
if options.verbose == False or options.verbose==True:
verbose = options.verbose
etl_delete.verbose = options.verbose
etl_web.verbose = options.verbose
etl_rss.verbose = options.verbose
if options.quiet == False or options.quiet==True:
quiet = options.quiet
app.worker_main()
I read multiple tutorials about Celery and from my understanding, this line should do the job
celery -A etl.tasks flower
but it doesn't. The result is the statement:
Error: Unable to load celery application. The module etl was not found.
Same for
celery -A etl.tasks worker --loglevel=debug
so Celery itself seems to be causing the trouble, not Flower. I also tried e.g. celery -A etl.index_filedirectory worker --loglevel=debug but with the same result.
What am I missing? Do I have to somehow tell Celery where to find etl.tasks? Online research doesn't really show a similar case; most of the "Module not found" errors seem to occur while importing stuff. So possibly it's a silly question, but I couldn't find a solution anywhere. I hope you guys can help me. Unfortunately, I won't be able to respond until Monday though, sorry in advance.
I got the same issue. I installed and configured my queue as follows, and it works.
Install RabbitMQ
MacOS
brew install rabbitmq
sudo vim ~/.bash_profile
In bash_profile add the following line:
PATH=$PATH:/usr/local/sbin
Then reload bash_profile:
source ~/.bash_profile
Linux
sudo apt-get install rabbitmq-server
Configure RabbitMQ
Launch the queue:
sudo rabbitmq-server
In another Terminal, configure the queue:
sudo rabbitmqctl add_user myuser mypassword
sudo rabbitmqctl add_vhost myvhost
sudo rabbitmqctl set_user_tags myuser mytag
sudo rabbitmqctl set_permissions -p myvhost myuser ".*" ".*" ".*"
Launch Celery
I would suggest going to the folder that contains task.py and using the following command:
celery -A task worker -l info -Q celery --concurrency 5
Beware that this error can mean one of two things:
The module is missing
The module exists but cannot be loaded, for instance because it contains errors such as a SyntaxError.
To check that it's not the latter, run:
python -c "import <myModuleContainingTasksDotPyFile>"
In the context of this question:
python -c "import etl"
If it crashes, fix this first (unlike with Celery, you'll get a detailed error message).
Solutions above did not work for me.
I had the same issue, and my problem was that in the main celery.py (which was in the SmartCalend folder) I had:
app = Celery('proj')
but instead I must type there:
app = Celery('SmartCalend')
where SmartCalend is the actual app name that celery.py belongs to (!), not any random word but precisely the app name. That's nowhere mentioned except in the official docs.
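For context, a minimal sketch of what that celery.py could look like (only the app name is taken from the answer; everything else is an assumption about the project layout):

from celery import Celery

# Hypothetical SmartCalend/celery.py: the name passed to Celery() matches the
# package directory that contains this file.
app = Celery('SmartCalend')

# The worker would then be started with, e.g.:
#   celery -A SmartCalend worker -l info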
Try export PYTHONPATH=<parent directory> where parent directory is the folder where etl is. Run the Celery worker, and see if it fixes your problem. This is probably one of the most common Celery "issues" (not really Celery, but Python in general). Alternatively, run the Celery worker from that folder.
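A quick way to check what the worker will actually be able to import (a standard-library-only sketch; run it with the same interpreter, PYTHONPATH, and working directory you use to start Celery):

import importlib.util
import sys

# Report whether the etl package is importable and, if not, what sys.path holds.
spec = importlib.util.find_spec("etl")
if spec is None:
    print("etl is NOT importable; sys.path contains:")
    for entry in sys.path:
        print("  ", entry)
else:
    print("etl found at", spec.origin)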
Answer for MacOS Catalina:
When you install celery with pip (pip install celery), python can import celery, but you are not able to launch celery from the terminal because the terminal does not know of the celery executable.
Add celery to the path to fix:
nano ~/.bash_profile
In the file add: export PATH="/Users/gavinbelson/Library/Python/2.7/bin:$PATH"
To save the file in the nano editor: ctrl+o, then enter, then ctrl+x
To update the terminal with your change type: source ~/.bash_profile
Now you should be able to type celery in the terminal window
---- Note: this is for the default python terminal command, which runs version 2.7. If you are using python3 to run Python, you would need to alter the path variable accordingly.
I know crons run in a different environment than command lines, but I'm using absolute paths everywhere and I don't understand why my script behaves differently. I believe it is somehow related to my cron_supervisor, which runs the Django "manage.py" within a subprocess.
Cron:
0 * * * * /home/p1/.virtualenvs/prod/bin/python /home/p1/p1/manage.py cron_supervisor --command="/home/p1/.virtualenvs/prod/bin/python /home/p1/p1/manage.py envoyer_argent"
This calls the cron_supervisor, which in turn calls the script, but the script is not executed as it would be if I ran:
/home/p1/.virtualenvs/prod/bin/python /home/p1/p1/manage.py envoyer_argent
Is there something particular to be done for the script to be called properly when running it through another script?
Here is the supervisor, which basically is for error handling and making sure we get warned if something goes wrong within the cron scripts themselves.
import logging
import os
from subprocess import PIPE, Popen
from django.core.management.base import BaseCommand
from command_utils import email_admin_error, isomorphic_logging
from utils.send_slack_message import send_slack_message
CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
PROJECT_DIR = CURRENT_DIR + '/../../../'
logging.basicConfig(
level=logging.INFO,
filename=PROJECT_DIR + 'cron-supervisor.log',
format='%(asctime)s %(levelname)s: %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
class Command(BaseCommand):
help = "Control a subprocess"
def add_arguments(self, parser):
parser.add_argument(
'--command',
dest='command',
help="Command to execute",
)
parser.add_argument(
'--mute_on_success',
dest='mute_on_success',
action='store_true',
help="Don't post any massage on success",
)
def handle(self, *args, **options):
try:
isomorphic_logging(logging, "Starting cron supervisor with command \"" + options['command'] + "\"")
if options['command']:
self.command = options['command']
else:
error_message = "Empty required parameter --command"
# log error
isomorphic_logging(logging, error_message, "error")
# send slack message
send_slack_message("Cron Supervisor Error: " + error_message)
# send email to admin
email_admin_error("Cron Supervisor Error", error_message)
raise ValueError(error_message)
if options['mute_on_success']:
self.mute_on_success = True
else:
self.mute_on_success = False
# running process
process = Popen([self.command], stdout=PIPE, stderr=PIPE, shell=True)
output, error = process.communicate()
if output:
isomorphic_logging(logging, "Output from cron:" + output)
# check for any subprocess error
if process.returncode != 0:
error_message = 'Command \"{command}\" - Error \nReturn code: {code}\n```{error}```'.format(
code=process.returncode,
error=error,
command=self.command,
)
self.handle_error(error_message)
else:
message = "Command \"{command}\" ended without error".format(command=self.command)
isomorphic_logging(logging, message)
# post message on slack if process isn't muted_on_success
if not self.mute_on_success:
send_slack_message(message)
except Exception as e:
error_message = 'Command \"{command}\" - Error \n```{error}```'.format(
error=e,
command=self.command,
)
self.handle_error(error_message)
def handle_error(self, error_message):
# log the error in local file
isomorphic_logging(logging, error_message)
# post message in slack
send_slack_message(error_message)
# email admin
email_admin_error("Cron Supervisor Error", error_message)
Example of a script that is not executed properly when called by cron through the cron_supervisor:
# -*- coding: utf-8 -*-
import json
import logging
import os
from django.conf import settings
from django.core.management.base import BaseCommand
from utils.lock import handle_lock
logging.basicConfig(
level=logging.INFO,
filename=os.path.join(settings.BASE_DIR, 'crons.log'),
format='%(asctime)s %(levelname)s: %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
class Command(BaseCommand):
help = "Envoi de l'argent en attente"
    @handle_lock
def handle(self, *args, **options):
logging.info("some logs that won't be log (not called)")
logging.info("Those logs will be correcly logged")
Additionally, I have another issue with the logging which I don't quite understand either: I specify that logs should be stored in cron-supervisor.log, but they don't get stored there, and I couldn't figure out why. (That's not related to my main issue, it just doesn't help with debugging.)
Your cron job can't just run the Python interpreter in the virtualenv; this is completely insufficient. You need to activate the env just like in an interactive environment.
0 * * * * . /home/p1/.virtualenvs/prod/bin/activate; python /home/p1/p1/manage.py cron_supervisor --command="python /home/p1/p1/manage.py envoyer_argent"
This is already complex enough that you might want to create a separate wrapper script containing these commands.
Without proper diagnostics of how your current script doesn't work, it's entirely possible that this fix alone is insufficient. Cron jobs do not only (or particularly) need absolute paths; the main difference compared to interactive shells is that cron jobs run with a different and more sparse environment, where e.g. the shell's PATH, various library paths, environment variables, etc. can be different or missing altogether; and of course, no interactive facilities are available.
The system variables will hopefully be taken care of by your virtualenv; if it's correctly done, activating it will set up all the variables (PATH, PYTHONPATH, etc) your script needs. There could still be things like locale settings which are set up by your shell only when you log in interactively; but again, without details, let's just hope this isn't an issue for you.
The reason some people recommend absolute paths is that this will work regardless of your working directory. But a correctly written script should work fine in any directory; if it matters, the cron job will start in the owner's home directory. If you wanted to point to a relative path from there, this will work fine inside a cron job just as it does outside.
As an aside, you probably should not use subprocess.Popen() if one of the higher-level wrappers from the subprocess module does what you want. Unless compatibility with legacy Python versions is important, you should probably use subprocess.run() ... though running Python as a subprocess of Python is also often a useless complication. See also my answer to this related question.
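For illustration, a minimal sketch of the supervisor's Popen/communicate block rewritten with subprocess.run() (shell=True is kept because the supervisor receives a single command string; text=True assumes Python 3.7+):

from subprocess import PIPE, run

def run_command(command):
    # run() starts the process, waits for it, and collects the output in one call.
    completed = run(command, stdout=PIPE, stderr=PIPE, shell=True, text=True)
    # returncode, stdout and stderr mirror what Popen + communicate() provided.
    return completed.returncode, completed.stdout, completed.stderr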
I am a beginner with PyEZ. Can I write a cron job with PyEZ that will connect to 8 routers, fetch the running config on each device, and save it to 8 different files at a particular timestamp? Could you help me achieve this?
I have already written PyEZ code that writes the base config to a local file.
Loading the config to a local file:
from jnpr.junos import Device
from lxml import etree
dev = Device(host='hostname',port='22',user='root', password='sitlab123!' )
dev.open()

class Create_Config():
def __init__(self):
cnf = dev.rpc.get_config() ####Get Config as Str
with open('myfile.txt', "w") as text_file:
text_file.write(etree.tostring(cnf))
text_file.close()
#####Return Configuration
def get_conf(self):
return dev.cli("show configuration")
You can use the python-crontab module along with the PyEZ module.
Python-crontab
Creating a new cron job is done as follows:
from crontab import CronTab
#init cron
cron = CronTab()
#add new cron job
job = cron.new(command='/usr/bin/echo')
#job settings
job.hour.every(4)
#write the job to the crontab
cron.write()
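To cover the other half of the question (8 routers, one timestamped file each), here is a minimal sketch built on the questioner's own PyEZ snippet; the hostnames, credentials, and output directory are placeholders:

from datetime import datetime
from jnpr.junos import Device
from lxml import etree

ROUTERS = ["router1", "router2"]  # extend to all 8 routers

def save_running_configs(user="root", password="password", out_dir="."):
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    for host in ROUTERS:
        # Device supports the context-manager protocol, so open/close is handled here.
        with Device(host=host, port="22", user=user, password=password) as dev:
            cnf = dev.rpc.get_config()  # running config as an XML element
            with open(f"{out_dir}/{host}_{stamp}.txt", "wb") as text_file:
                text_file.write(etree.tostring(cnf))

if __name__ == "__main__":
    save_running_configs()

The script itself can then be scheduled with the python-crontab job above or a plain crontab entry.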
I have an application that needs to initialize Celery and other things (e.g. a database). I would like to have a .ini file that would contain the application's configuration. This should be passed to the application at runtime.
development.ini:
[celery]
broker=amqp://localhost/
backend=amqp://localhost/
task.result.expires=3600
[database]
# database config
# ...
celeryconfig.py:
from celery import Celery
import ConfigParser
config = ConfigParser.RawConfigParser()
config.read(...) # Pass this from the command line somehow
celery = Celery('myproject.celery',
broker=config.get('celery', 'broker'),
backend=config.get('celery', 'backend'),
include=['myproject.tasks'])
# Optional configuration, see the application user guide.
celery.conf.update(
CELERY_TASK_RESULT_EXPIRES=config.getint('celery', 'task.result.expires')
)
# Initialize database, etc.
if __name__ == '__main__':
celery.start()
To start Celery, I call:
celery worker --app=myproject.celeryconfig -l info
Is there any way to pass in the config file without doing something ugly like setting an environment variable?
Alright, I took Jordan's advice and used an env variable. This is what I have in celeryconfig.py:
from celery import Celery
import os
import sys
import ConfigParser
CELERY_CONFIG = 'CELERY_CONFIG'
if not CELERY_CONFIG in os.environ:
sys.stderr.write('Missing env variable "%s"\n\n' % CELERY_CONFIG)
sys.exit(2)
configfile = os.environ['CELERY_CONFIG']
if not os.path.isfile(configfile):
sys.stderr.write('Can\'t read file: "%s"\n\n' % configfile)
sys.exit(2)
config = ConfigParser.RawConfigParser()
config.read(configfile)
celery = Celery('myproject.celery',
broker=config.get('celery', 'broker'),
backend=config.get('celery', 'backend'),
include=['myproject.tasks'])
# Optional configuration, see the application user guide.
celery.conf.update(
CELERY_TASK_RESULT_EXPIRES=config.getint('celery', 'task.result.expires'),
)
if __name__ == '__main__':
celery.start()
To start Celery:
$ export CELERY_CONFIG=development.ini
$ celery worker --app=myproject.celeryconfig -l info
How is setting an environment variable ugly? You either set an environment variable with the current version of your application, or you can derive it based on hostname, or you can have your build/deployment process overwrite the file: in development you let development.ini copy over to settings.ini in a general location, and in production you let production.ini copy over to settings.ini.
Any of these options are quite common. Using a configuration management tool such as Chef or Puppet to put the file in place is a good option.
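As a sketch of the hostname-based option mentioned above (the hostname-to-file mapping is an assumption), celeryconfig.py could pick the .ini file like this instead of reading an environment variable; the rest of the ConfigParser code stays unchanged:

import socket

# Assumed mapping from machine hostname to the config file to load.
CONFIG_BY_HOST = {
    "dev-box": "development.ini",
    "prod-box": "production.ini",
}

configfile = CONFIG_BY_HOST.get(socket.gethostname(), "development.ini")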