We're experiencing some strange issues with Nagios. We wrote a script in Python which uses paramiko, httplib and re. When we comment out the code that is written to use paramiko the script returns OK in Nagios. When we uncomment the script the status just returns (null).
This is our code:
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
#ssh.connect(self.get('hostname'), int(self.get('port')), self.get('username'), allow_agent=True)
ssh.connect('192.168.56.102', 22, 'oracle', allow_agent=True)
link = '127.0.0.1:4848'
# Raw string so the \| alternation reaches grep intact
stdin, stdout, stderr = ssh.exec_command(r'wget --no-proxy ' + link + r' 2>&1 | grep -i "failed\|error"')
result = stdout.readlines()
result = " ".join(result)
if result == "":
    return MonitoringResult(MonitoringResult.OK, 'Webservice up')
else:
    return MonitoringResult(MonitoringResult.CRITICAL, 'Webservice down %s' % result)
So when we comment out the part above the
if result == "":
and add result = "" above the if, it just returns 'Webservice up' in Nagios. When we re-enable the code above the if, it just returns (null) again. Is there a conflict with paramiko or something?
When we run the script in a terminal it returns the correct status, but the status doesn't show up when the script is run by Nagios.
I found that although Nagios runs as the effective user "nagios", it inherits the environment settings of user "root" and is not able to read root's private key to make the connection.
Adding key_filename='/nagiosuserhomedir/.ssh/id_dsa' to the options for ssh.connect() solved this same problem for me (pass the nagios user's private key explicitly in the code).
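For illustration, here is one way the extra option could be wired in. The helper function and every value in it (host, port, user, key path) are hypothetical placeholders, not part of the original plugin; the real fix is simply passing key_filename to ssh.connect():

```python
# Hypothetical helper that assembles the keyword arguments for
# paramiko's SSHClient.connect(), including the explicit private key.
# All concrete values below are placeholders for illustration only.
def build_connect_kwargs(hostname, port, username, key_filename):
    return {
        "hostname": hostname,
        "port": port,
        "username": username,
        "key_filename": key_filename,  # nagios user's key, readable by nagios
        "allow_agent": True,
    }

kwargs = build_connect_kwargs("192.168.56.102", 22, "oracle",
                              "/home/nagios/.ssh/id_dsa")
# A configured SSHClient would then be called as: ssh.connect(**kwargs)
```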
Related
I don't write scripts very often, but I am trying to write a Nagios plugin to check the status of a RAID controller on a remote host. The issue is that the command to get the output requires elevated privileges. What would be the correct and most effective way to pull this off? The goal is to run:
'/opt/MegaRAID/MegaCli/MegaCli64 -ShowSummary -a0'
on a remote host from the monitoring server,
and then follow the basic idea of this logic:
#Nagios Plugin for Testing LSI Raid Status
import argparse
import subprocess
import sys

# Nagios exit codes -- do not change (UNKNOWN is 3, not 4)
OK = 0
WARNING = 1
CRITICAL = 2
UNKNOWN = 3
DEPENDENT = 4

# Patterns to be searched for in the MegaCli output
active = "Active"
online = "Online"
k = "OK"
degrade = "Degraded"
fail = "Failed"

parser = argparse.ArgumentParser(description='Py3 script for monitoring RAID status.')
#arguments
parser.add_argument("-U", "--user",
                    help="username for remote connection")
parser.add_argument("-H", "--hostname",
                    help="hostname of the remote host")
args = parser.parse_args()
print(args)

#turning args into variables
hostname = args.hostname
user = args.user

# shell=False requires the command as a list; also note user@hostname, not user#hostname
ssh = subprocess.Popen(
    ["ssh", f"{user}@{hostname}",
     "/opt/MegaRAID/MegaCli/MegaCli64", "-ShowSummary", "-a0"],
    shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    universal_newlines=True)
check = ssh.stdout.read()

OK_STR = "RAID is OK!"
WARN_STR = "Warning! Something is wrong with the RAID!"
CRIT_STR = "CRITICAL! THE RAID IS BROKEN"
UNK_STR = "Uh oh! Something ain't right?"

# print() returns None, so "print(x) and exit(y)" would never exit;
# the calls have to be made separately
if degrade in check:
    print(WARN_STR)
    sys.exit(WARNING)
elif fail in check:
    print(CRIT_STR)
    sys.exit(CRITICAL)
elif active in check or online in check or k in check:
    print(OK_STR)
    sys.exit(OK)
else:
    print(UNK_STR)
    sys.exit(UNKNOWN)
Any thoughts? This is far from my forte (and also an unfinished script) so I apologize for the layman format and any confusion in my phrasing.
I would recommend running the script remotely over NRPE on the system in question, and then give the user the NRPE daemon is running as (probably nagios or similar) sudo permissions to run that script with some very exact parameters.
The nrpe.cfg file mentions this example:
# Usage scenario:
# Execute restricted commands using sudo. For this to work, you need to add
# the nagios user to your /etc/sudoers. An example entry for allowing
# execution of the plugins might be:
#
# nagios ALL=(ALL) NOPASSWD: /usr/lib/nagios/plugins/
...but there's no reason to be so forgiving, you can make it a lot safer by only allowing an exact command:
nagios ALL = NOPASSWD: /usr/sbin/megacli
Note that this still allows any parameters with that command. Pinning the exact arguments is safer yet, as it will not allow any other variants (example):
nagios ALL = NOPASSWD: /usr/sbin/megacli -a foo -b bar -c5 -w1
Then configure the NRPE command to run the above with sudo in front of it, and it should work. You can verify by su-ing to the nagios user and trying the sudo command yourself.
Also, note that there are very likely some available modules you can import for python nagios plugins that makes it easier for you, to get built-in support for things like thresholds and their syntax.
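Putting the pieces together, the remote side could look something like the following. The command name check_raid and the sudoers file name are illustrative, and the MegaCli path is taken from the question:

```
# /etc/sudoers.d/nagios -- allow exactly one command, no password
nagios ALL = NOPASSWD: /opt/MegaRAID/MegaCli/MegaCli64 -ShowSummary -a0

# nrpe.cfg -- expose that command via NRPE, run through sudo
command[check_raid]=sudo /opt/MegaRAID/MegaCli/MegaCli64 -ShowSummary -a0
```

The monitoring server would then call check_nrpe with -c check_raid against the remote host.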
I have a custom python plugin that I am using to pull data into Telegraf. It prints out line protocol output, as expected.
In my Ubuntu 18.04 environment, when this plugin is run I see a single line in my logs:
2020-12-28T21:55:00Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command '/my_company/plugins-enabled/plugin-mysystem/poll_mysystem.py': Traceback (most recent call last):...
That is it. I can't figure out how to get the actual traceback.
If I run sudo -u telegraf /usr/bin/telegraf -config /etc/telegraf/telegraf.conf, the plugin works as expected. It polls and loads data exactly as it should.
I'm not sure how to move forward with troubleshooting this error when telegraf is executing the plugin on its own.
I have restarted the telegraf service. I have verified permissions (and I think that the execution above shows that it should work).
A few additional details based on the comments and answers received:
The plugin lives in a directory where the entire structure is owned by telegraf:telegraf. The error does not seem to indicate that it can't see the file that is being executed, but rather something within the file is failing when Telegraf executes the plugin.
The code for the plug in is below.
Plugin code (/my_company/plugins-enabled/plugin-mysystem/poll_mysystem.py):
from google.auth.transport.requests import Request
from google.oauth2 import id_token
import requests
import os

RUNTIME_URL = INTERNAL_URL
MEASUREMENT = "MY_MEASUREMENT"
CREDENTIALS = "GOOGLE_SERVICE_FILE.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = CREDENTIALS  # env var required by Google code below
CLIENT_ID = VALUE_FROM_GOOGLE

exclude_fields = ["name", "version"]  # Don't try to put these into influxdb from json response

def make_iap_request(url, client_id, method="GET", **kwargs):
    # Code provided by Google docs
    # Set the default timeout, if missing
    if "timeout" not in kwargs:
        kwargs["timeout"] = 90
    # Obtain an OpenID Connect (OIDC) token from metadata server or using service
    # account.
    open_id_connect_token = id_token.fetch_id_token(Request(), client_id)
    # Fetch the Identity-Aware Proxy-protected URL, including an
    # Authorization header containing "Bearer " followed by a
    # Google-issued OpenID Connect token for the service account.
    resp = requests.request(method, url, headers={"Authorization": "Bearer {}".format(open_id_connect_token)}, **kwargs)
    if resp.status_code == 403:
        raise Exception("Service account does not have permission to access the IAP-protected application.")
    elif resp.status_code != 200:
        raise Exception(
            "Bad response from application: {!r} / {!r} / {!r}".format(resp.status_code, resp.headers, resp.text)
        )
    else:
        return resp.json()

def print_results(results):
    """
    Take the results of a Dolores call and print influx line protocol results
    """
    for item in results["workflow"]:
        line_protocol_line_base = f"{MEASUREMENT},name={item['name']}"
        values = ""
        for key, value in item.items():
            if key not in exclude_fields:
                values = values + f",{key}={value}"
        values = values[1:]
        line_protocol_line = f"{line_protocol_line_base} {values}"
        print(line_protocol_line)

def main():
    current_runtime = make_iap_request(URL, CLIENT_ID, timeout=30)
    print_results(current_runtime)

if __name__ == "__main__":
    main()
Relevant portion of the telegraf.conf file:
[[inputs.exec]]
## Commands array
commands = [
"/my_company/plugins-enabled/plugin-*/poll_*.py",
]
Agent section of config file
[agent]
interval = "60s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = true
What do I do next?
The exec plugin is truncating your exception message at the newline. If you wrap your call to make_iap_request in a try/except block and then print(e, file=sys.stderr) rather than letting the exception bubble all the way up, that should tell you more.
import sys

def main():
    """
    Query URL and print line protocol
    """
    try:
        current_runtime = make_iap_request(URL, CLIENT_ID, timeout=30)
        print_results(current_runtime)
    except Exception as e:
        print(e, file=sys.stderr)
Alternately, your script could log error messages to its own log file rather than passing them back to Telegraf. This would give you more control over what's logged.
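A minimal sketch of that second approach, assuming the telegraf user can write to the chosen log path (the file name and logger name here are placeholders; run_plugin would wrap the script's real main() logic):

```python
import logging

# Hypothetical log destination; adjust so the telegraf user can write to it.
LOG_FILE = "poll_mysystem_debug.log"

logger = logging.getLogger("poll_mysystem")
handler = logging.FileHandler(LOG_FILE)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def run_plugin(poll):
    """Run the real polling function; log the full traceback on failure."""
    try:
        poll()
        return 0
    except Exception:
        # logging.exception records the complete traceback in LOG_FILE,
        # so nothing is lost to the exec plugin's one-line truncation.
        logger.exception("plugin failed")
        return 1
```

The script's entry point would then end with sys.exit(run_plugin(main)), keeping stdout reserved for line protocol output.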
I suspect you're running into an environment issue, where there's something different about how you're running it. If not permissions, it could be environment variable differences.
Please do check the permissions.
It seems like a permission error. Since telegraf has the necessary permissions, running sudo -u telegraf works, but the user the service actually executes the plugin as doesn't have the necessary permissions for the files in /my_company/plugins-enabled/.
So I recommend looking into them and changing the permissions so that either others can read and execute the files, or so that they belong to the user Telegraf runs as.
In order to fix this run the command to go to the directory:
cd /my_company/plugins-enabled/
Then change ownership to you and only you:
sudo chown -R $(whoami) .
Then add read/write permissions for yourself on all files and folders:
sudo chmod -R u+rw .
And if you want everyone, literally everyone on the system, to have read/write access to those files and folders:
sudo chmod -R 777 .
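Before reaching for chmod, it can help to confirm what is actually wrong. This small check is not part of Telegraf; it is just a sketch that reports whether a plugin file exists and is readable and executable by the current user:

```python
import os

def check_plugin(path):
    """Return a list of human-readable problems with a plugin file."""
    problems = []
    if not os.path.exists(path):
        problems.append("does not exist")
    else:
        # os.access checks against the *current* user's effective uid/gid,
        # so run this as the same user the Telegraf service uses.
        if not os.access(path, os.R_OK):
            problems.append("not readable")
        if not os.access(path, os.X_OK):
            problems.append("not executable")
    return problems
```

Running it as the telegraf user (e.g. via sudo -u telegraf) against each file matched by the commands glob narrows the failure down to a specific permission bit.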
I am getting the error "sudo: no tty present and no askpass program specified" when I run this code. It happens on the line with the second subprocess call, where I try to change the password. The user gets created perfectly, but the password is not set; it just gives the above error. What I am trying to do is assign the written password to the user right after the user is created.
import subprocess

def new(username, password):
    subprocess.call("echo passd23 | sudo -S adduser '{username}'".format(username=username), shell=True)
    subprocess.call("echo passd23 | sudo '{username}':'{password}' | chpasswd".format(username=username, password=password), shell=True)

new("username", "password")
This is bad practice, but it's safer than what it replaces:
import subprocess

def new(username, password, sudopass):
    p1 = subprocess.Popen(['sudo', '-S', 'adduser', username],
                          stdin=subprocess.PIPE, universal_newlines=True)
    p1.communicate(sudopass + '\n')
    if p1.returncode != 0:
        return False
    p2 = subprocess.Popen(['sudo', '-S', 'chpasswd'],
                          stdin=subprocess.PIPE, universal_newlines=True)
    p2.communicate('%s\n%s:%s\n' % (sudopass, username, password))
    return p2.returncode == 0

new('username', 'password', 'passd23')
Notably:
No user-provided strings whatsoever are parsed by a shell as code. The sudo password is passed as stdin, not on a generated command line, and the user-provided username and password in particular are never evaluated by a shell.
With chpasswd, both the sudo password and then the user's username:password are provided on stdin, one after the other.
That said -- in a real-world example, your script shouldn't have a plaintext password in memory at all; instead, it should be configured in /etc/sudoers to have passwordless privilege escalation.
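An illustrative sudoers entry for that passwordless approach might look like this (the account name svcaccount and the command paths are placeholders; pin them to your actual service user and binaries):

```
# /etc/sudoers.d/user-provisioning
# Allow the service account to run exactly these commands with no password
svcaccount ALL = NOPASSWD: /usr/sbin/adduser, /usr/sbin/chpasswd
```

With that in place, the script can drop the -S flag and the sudo password entirely, so no plaintext password ever sits in memory.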
Using the following code, I'm unable to get Fabric to detect an expired password prompt at login. The session doesn't timeout, and the abort_on_prompts parameter doesn't appear to be triggering. How I can configure Fabric to detect this state?
from fabric.api import env, run, execute
from fabric import network
from fabric.context_managers import settings
def host_type():
    with settings(abort_on_prompts=True):
        print("Using abort mode %(abort_on_prompts)s" % env)
        result = run('uname -s')
    return result

if __name__ == '__main__':
    print("Fabric v%(version)s" % env)
    env.user = 'myuser'
    env.password = 'user67user'
    env.hosts = ['10.254.254.143']
    host_types = execute(host_type)
Executing this script results in a hung script, as depicted below:
Fabric v1.11.1.post1
[10.254.254.143] Executing task 'host_type'
Using abort mode True
[10.254.254.143] run: uname -s
[10.254.254.143] out: WARNING: Your password has expired.
[10.254.254.143] out: You must change your password now and login again!
[10.254.254.143] out: Changing password for myuser.
[10.254.254.143] out: (current) UNIX password:
Fabric includes a way to answer prompt questions.
You can use the prompts dictionary with settings. In the dictionary, every key is the question on standard output you want to answer, and the value is your answer.
So in your example:
from fabric.api import env, run, execute
from fabric import network
from fabric.context_managers import settings

def host_type():
    with settings(prompts={"(current) UNIX password": "new_password"}):
        result = run('uname -s')
    return result

if __name__ == '__main__':
    print("Fabric v%(version)s" % env)
    env.user = 'myuser'
    env.password = 'user67user'
    env.hosts = ['10.254.254.143']
    host_types = execute(host_type)
I have a Python script which runs during the start of a Linux VM:
I have added it to chkconfig at runlevels 345.
The script is supposed to check the hostname, and if it is localhost.localdom it should exit.
#!/usr/bin/python
import subprocess, platform, os, sys, logging, shlex

system_name = os.getenv('HOSTNAME')
if system_name == 'localhost.localdom':
    logging.info('Please correct host name for proxies, it is showing localhost')
    sys.exit()
If I run it manually it works fine, but during the startup process it does not exit, even though the hostname is localhost.localdom.
So it looks like during the boot process
os.getenv('HOSTNAME')
is not returning the localhost.localdom that I have set in the condition.
Please help me get this working during reboot.
Thanks,
Jitendra Singh
Posting an answer using the info in Isedev's comment...
You could try getting the hostname by:
import logging
import os
import sys

system_name = os.popen('/bin/hostname')
if system_name.read().rstrip() == 'localhost.localdom':
    logging.info('Please correct host name for proxies, it is showing localhost')
    sys.exit()
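Another option, not mentioned in the comment but worth considering, is to skip spawning a process entirely and ask the socket module. This reads the kernel's notion of the host name rather than the HOSTNAME environment variable, so it works early in boot when the environment is not yet populated:

```python
import socket

# getfqdn() returns the fully qualified name where resolvable,
# so it can be compared against "localhost.localdom" directly.
system_name = socket.getfqdn()
if system_name == 'localhost.localdom':
    print('Please correct host name for proxies, it is showing localhost')
```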