Import text files to pig through python UDF

Import text files to pig through python UDF - python

I'm trying to load files to pig while use python udf, i've tried two ways:
• (myudf1, sample1.pig): try to read the file from python, the file is located on my client server.
• (myudf2, sample2.pig): load file from hdfs to grunt shell first, then pass it as a parameter to python udf.
myudf1.py
from __future__ import with_statement
def get_words(dir):
stopwords=set()
with open(dir) as f1:
for line1 in f1:
stopwords.update([line1.decode('ascii','ignore').split("\n")[0]])
return stopwords
stopwords=get_words("/home/zhge/uwc/mappings/english_stop.txt")
#outputSchema("findit: int")
def findit(stp):
stp=str(stp)
if stp in stopwords:
return 1
else:
return 0
sample1.pig:
REGISTER '/home/zhge/uwc/scripts/myudf1.py' USING jython as pyudf;
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(title);
DUMP S
I get: IOError: (2, 'No such file or directory', '/home/zhge/uwc/mappings/english_stop.txt')
For solution 2:
myudf2:
def get_wordlists(wordbag):
stopwords=set()
for t in wordbag:
stopwords.update(t.decode('ascii','ignore'))
return stopwords
#outputSchema("findit: int")
def findit(stopwordbag, stp):
stopwords=get_wordlists(stopwordbag)
stp=str(stp)
if stp in stopwords:
return 1
else:
return 0
Sample2.pig
REGISTER '/home/zhge/uwc/scripts/myudf2.py' USING jython as pyudf;
stops = load '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
-- this step works fine and i can see the "stops" obejct is loaded to pig
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(stops.stop_w, title);
DUMP S;
Then I got:
ERROR org.apache.pig.tools.grunt.Grunt -ERROR 1066: Unable to open iterator for alias S. Backend error : Scalar has more than one row in the output. 1st : (a), 2nd :(as

Your second example should work. Though you LIMITed the wrong expression -- it should be on the stops relationship. Therefore it should be:
stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = LIMIT stops 1;
S = FOREACH item_title GENERATE pyudf.findit(T.stop_w, title);
However, since it looks like you need to process all of the stop words first this will not be enough. You'll need to do a GROUP ALL and then pass the results to your get_wordlist function instead:
stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = FOREACH (GROUP stops ALL) GENERATE pyudf.get_wordlists(stops) AS ready;
S = FOREACH item_title GENERATE pyudf.findit(T.ready, title);
You'll have to update your UDF to accept a list of dicts though for this method to work.

Related

Python scripting with ete3 to query NCBI's Taxonomy: "sqlite3 Warning (can only execute one statement at a time)"

I am using this script:
import csv
import time
import sys
from ete3 import NCBITaxa
ncbi = NCBITaxa()
def get_desired_ranks(taxid, desired_ranks):
lineage = ncbi.get_lineage(taxid)
names = ncbi.get_taxid_translator(lineage)
lineage2ranks = ncbi.get_rank(names)
ranks2lineage = dict((rank,taxid) for (taxid, rank) in lineage2ranks.items())
return{'{}_id'.format(rank): ranks2lineage.get(rank, '<not present>') for rank in desired_ranks}
if __name__ == '__main__':
file = open(sys.argv[1], "r")
taxids = []
contigs = []
for line in file:
line = line.split("\n")[0]
taxids.append(line.split(",")[0])
contigs.append(line.split(",")[1])
desired_ranks = ['superkingdom', 'phylum']
results = list()
for taxid in taxids:
results.append(list())
results[-1].append(str(taxid))
ranks = get_desired_ranks(taxid, desired_ranks)
for key, rank in ranks.items():
if rank != '<not present>':
results[-1].append(list(ncbi.get_taxid_translator([rank]).values())[0])
else:
results[-1].append(rank)
i = 0
for result in results:
print(contigs[i] + ','),
print(','.join(result))
i += 1
file.close()
The script takes taxids from a file and fetches their respective lineages from a local copy of NCBI's Taxonomy database. Strangely, this script works fine when I run it on small sets of taxids (~70, ~100), but most of my datasets are upwards of 280k taxids and these break the script.
I get this complete error:
Traceback (most recent call last):
File "/data1/lstout/blast/scripts/getLineageByETE3.py", line 31, in <module>
ranks = get_desired_ranks(taxid, desired_ranks)
File "/data1/lstout/blast/scripts/getLineageByETE3.py", line 11, in get_desired_ranks
lineage = ncbi.get_lineage(taxid)
File "/data1/lstout/.local/lib/python2.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 227, in get_lineage
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %taxid)
sqlite3.Warning: You can only execute one statement at a time.
The first two files from the traceback are simply the script I referenced above, the third file is one of ete3's. And as I stated, the script works fine with small datasets.
What I have tried:
Importing the time module and sleeping for a few milliseconds/hundredths of a second before/after my offending lines of code on lines 11 and 31. No effect.
Went to line 227 in ete3's code...
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %merged_conversion[taxid])
and changed the "execute" function to "executescript" in order to be able to handle multiple queries at once (as that seems to be the problem). This produced a new error and led to a rabbit hole of me changing minor things in their script trying to fudge this to work. No result. This is the complete offending function:
def get_lineage(self, taxid):
"""Given a valid taxid number, return its corresponding lineage track as a
hierarchically sorted list of parent taxids.
"""
if not taxid:
return None
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %taxid)
raw_track = result.fetchone()
if not raw_track:
#perhaps is an obsolete taxid
_, merged_conversion = self._translate_merged([taxid])
if taxid in merged_conversion:
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %merged_conversion[taxid])
raw_track = result.fetchone()
# if not raise error
if not raw_track:
#raw_track = ["1"]
raise ValueError("%s taxid not found" %taxid)
else:
warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
track = list(map(int, raw_track[0].split(",")))
return list(reversed(track))
What bothers me so much is that this works on small amounts of data! I'm running these scripts from my school's high performance computer and have tried running on their head node and in an interactive moab scheduler. Nothing has helped.

Python on Visual Studio Error

Can somebody help me identify the error here? I am using Python on Visual Studio, and I have copied the code from another source. When I paste the code into a playground like Ideone.com, the program runs without issue. Only when I paste into visual studio do I get an error. The error is in the last two lines of code that begin with "print". This is probably a huge noob question, but I am a beginner. Help!
import hashlib as hasher
import datetime as date
# Define what a Snakecoin block is
class Block:
def __init__(self, index, timestamp, data, previous_hash):
self.index = index
self.timestamp = timestamp
self.data = data
self.previous_hash = previous_hash
self.hash = self.hash_block()
def hash_block(self):
sha = hasher.sha256()
sha.update(str(self.index) + str(self.timestamp) + str(self.data) + str(self.previous_hash))
return sha.hexdigest()
# Generate genesis block
def create_genesis_block():
# Manually construct a block with
# index zero and arbitrary previous hash
return Block(0, date.datetime.now(), "Genesis Block", "0")
# Generate all later blocks in the blockchain
def next_block(last_block):
this_index = last_block.index + 1
this_timestamp = date.datetime.now()
this_data = "Hey! I'm block " + str(this_index)
this_hash = last_block.hash
return Block(this_index, this_timestamp, this_data, this_hash)
# Create the blockchain and add the genesis block
blockchain = [create_genesis_block()]
previous_block = blockchain[0]
# How many blocks should we add to the chain
# after the genesis block
num_of_blocks_to_add = 20
# Add blocks to the chain
for i in range(0, num_of_blocks_to_add):
block_to_add = next_block(previous_block)
blockchain.append(block_to_add)
previous_block = block_to_add
# Tell everyone about it!
print "Block #{} has been added to the blockchain!".format(block_to_add.index)
print "Hash: {}\n".format(block_to_add.hash)

In visual studio and most other editors, they are updated to the most recent python, also meaning that you have to place parentheses after the print statement.
For example:
print("hello")

Visual Studio runs Python 3.x.x, so, you have to write the print sentences on this way:
print('String to print')
Because in Python 3 print is a function, not a reserved word like in Python 2.

Moving numpy arrays from VBA to Python and back

I have a VBA script in Microsoft Access. The VBA script is part of a large project with multiple people, and so it is not possible to leave the VBA environment.
In a section of my script, I need to do complicated linear algebra on a table quickly. So, I move the VBA tables written as recordsets) into Python to do linear algebra, and back into VBA. The matrices in python are represented as numpy arrays.
Some of the linear algebra is proprietary and so we are compiling the proprietary scripts with pyinstaller.
The details of the process are as follows:
The VBA script creates a csv file representing the table input.csv.
The VBA script runs the python script through the command line
The python script loads the csv file input.csv as a numpy matrix, does linear algebra on it, and creates an output csv file output.csv.
VBA waits until python is done, then loads output.csv.
VBA deletes the no-longer-needed input.csv file and output.csv file.
This process is inefficient.
Is there a way to load VBA matrices into Python (and back) without the csv clutter? Do these methods work with compiled python code through pyinstaller?
I have found the following examples on stackoverflow that are relevant. However, they do not address my problem specifically.
Return result from Python to Vba
How to pass Variable from Python to VBA Sub

Solution 1
Either retrieve the COM running instance of Access and get/set the data directly with the python script via the COM API:
VBA:
Private Cache
Public Function GetData()
GetData = Cache
Cache = Empty
End Function
Public Sub SetData(data)
Cache = data
End Sub
Sub Usage()
Dim wshell
Set wshell = VBA.CreateObject("WScript.Shell")
' Make the data available via GetData()'
Cache = Array(4, 6, 8, 9)
' Launch the python script compiled with pylauncher '
Debug.Assert 0 = wshell.Run("C:\dev\myapp.exe", 0, True)
' Handle the returned data '
Debug.Assert Cache(3) = 2
End Sub
Python (myapp.exe):
import win32com.client
if __name__ == "__main__":
# get the running instance of Access
app = win32com.client.GetObject(Class="Access.Application")
# get some data from Access
data = app.run("GetData")
# return some data to Access
app.run("SetData", [1, 2, 3, 4])
Solution 2
Or create a COM server to expose some functions to Access :
VBA:
Sub Usage()
Dim Py As Object
Set Py = CreateObject("Python.MyModule")
Dim result
result = Py.MyFunction(Array(5, 6, 7, 8))
End Sub
Python (myserver.exe or myserver.py):
import sys, os, win32api, win32com.server.localserver, win32com.server.register
class MyModule(object):
_reg_clsid_ = "{5B4A4174-EE23-4B70-99F9-E57958CFE3DF}"
_reg_desc_ = "My Python COM Server"
_reg_progid_ = "Python.MyModule"
_public_methods_ = ['MyFunction']
def MyFunction(self, data) :
return [(1,2), (3, 4)]
def register(*classes) :
regsz = lambda key, val: win32api.RegSetValue(-2147483647, key, 1, val)
isPy = not sys.argv[0].lower().endswith('.exe')
python_path = isPy and win32com.server.register._find_localserver_exe(1)
server_path = isPy and win32com.server.register._find_localserver_module()
for cls in classes :
if isPy :
file_path = sys.modules[cls.__module__].__file__
class_name = '%s.%s' % (os.path.splitext(os.path.basename(file_path))[0], cls.__name__)
command = '"%s" "%s" %s' % (python_path, server_path, cls._reg_clsid_)
else :
file_path = sys.argv[0]
class_name = '%s.%s' % (cls.__module__, cls.__name__)
command = '"%s" %s' % (file_path, cls._reg_clsid_)
regsz("SOFTWARE\\Classes\\" + cls._reg_progid_ + '\\CLSID', cls._reg_clsid_)
regsz("SOFTWARE\\Classes\\AppID\\" + cls._reg_clsid_, cls._reg_progid_)
regsz("SOFTWARE\\Classes\\CLSID\\" + cls._reg_clsid_, cls._reg_desc_)
regsz("SOFTWARE\\Classes\\CLSID\\" + cls._reg_clsid_ + '\\LocalServer32', command)
regsz("SOFTWARE\\Classes\\CLSID\\" + cls._reg_clsid_ + '\\ProgID', cls._reg_progid_)
regsz("SOFTWARE\\Classes\\CLSID\\" + cls._reg_clsid_ + '\\PythonCOM', class_name)
regsz("SOFTWARE\\Classes\\CLSID\\" + cls._reg_clsid_ + '\\PythonCOMPath', os.path.dirname(file_path))
regsz("SOFTWARE\\Classes\\CLSID\\" + cls._reg_clsid_ + '\\Debugging', "0")
print('Registered ' + cls._reg_progid_)
if __name__ == "__main__":
if len(sys.argv) > 1 :
win32com.server.localserver.serve(set([v for v in sys.argv if v[0] == '{']))
else :
register(MyModule)
Note that you'll have to run the script once without any argument to register the class and to make it available to VBA.CreateObject.
Both solutions work with pylauncher and the array received in python can be converted with numpy.array(data).
Dependency :
https://pypi.python.org/pypi/pywin32

You can try loading your record set into an array, dim'ed as Double
Dim arr(1 to 100, 1 to 100) as Double
by looping, then pass the pointer to the first element ptr = VarPtr(arr(1, 1)) to Python, where
arr = numpy.ctypeslib.as_array(ptr, (100 * 100,)) ?
But VBA will still own the array memory

There is a very simple way of doing this with xlwings. See xlwings.org and make sure to follow the instructions to enable macro settings, tick xlwings in VBA references, etc. etc.
The code would then look as simple as the following (a slightly silly block of code that just returns the same dataframe back, but you get the picture):
import xlwings as xw
import numpy as np
import pandas as pd
# the #xw.decorator is to tell xlwings to create an Excel VBA wrapper for this function.
# It has no effect on how the function behaves in python
#xw.func
#xw.arg('pensioner_data', pd.DataFrame, index=False, header=True)
#xw.ret(expand='table', index=False)
def pensioner_CF(pensioner_data, mortality_table = "PA(90)", male_age_adj = 0, male_improv = 0, female_age_adj = 0, female_improv = 0,
years_improv = 0, arrears_advance = 0, discount_rate = 0, qxy_tables=0):
pensioner_data = pensioner_data.replace(np.nan, '', regex=True)
cashflows_df = pd.DataFrame()
return cashflows_df
I'd be interested to hear if this answers the question. It certainly made my VBA / python experience a lot easier.

Cannot grok python multiprocessing

I need to run a function for the each of the elements of my database.
When I try the following:
from multiprocessing import Pool
from pymongo import Connection
def foo():
...
connection1 = Connection('127.0.0.1', 27017)
db1 = connection1.data
my_pool = Pool(6)
my_pool.map(foo, db1.index.find())
I'm getting the following error:
Job 1, 'python myscript.py ' terminated by signal SIGKILL (Forced quit)
Which is, I think, caused by db1.index.find() eating all the available ram while trying to return millions of database elements...
How should I modify my code for it to work?
Some logs are here:
dmesg | tail -500 | grep memory
[177886.768927] Out of memory: Kill process 3063 (python) score 683 or sacrifice child
[177891.001379] [<ffffffff8110e51a>] out_of_memory+0xfa/0x250
[177891.021362] Out of memory: Kill process 3063 (python) score 684 or sacrifice child
[177891.025399] [<ffffffff8110e51a>] out_of_memory+0xfa/0x250
The actual function below:
def create_barrel(item):
connection = Connection('127.0.0.1', 27017)
db = connection.data
print db.index.count()
barrel = []
fls = []
if 'name' in item.keys():
barrel.append(WhitespaceTokenizer().tokenize(item['name']))
name = item['name']
elif 'name.utf-8' in item.keys():
barrel.append(WhitespaceTokenizer().tokenize(item['name.utf-8']))
name = item['name.utf-8']
else:
print item.keys()
if 'files' in item.keys():
for file in item['files']:
if 'path' in file.keys():
barrel.append(WhitespaceTokenizer().tokenize(" ".join(file['path'])))
fls.append(("\\".join(file['path']),file['length']))
elif 'path.utf-8' in file.keys():
barrel.append(WhitespaceTokenizer().tokenize(" ".join(file['path.utf-8'])))
fls.append(("\\".join(file['path.utf-8']),file['length']))
else:
print file
barrel.append(WhitespaceTokenizer().tokenize(file))
if len(fls) < 1:
fls.append((name,item['length']))
barrel = sum(barrel,[])
for s in barrel:
vs = re.findall("\d[\d|\.]*\d", s) #versions i.e. numbes such as 4.2.7500
b0 = []
for s in barrel:
b0.append(re.split("[" + string.punctuation + "]", s))
b1 = filter(lambda x: x not in string.punctuation, sum(b0,[]))
flag = True
while flag:
bb = []
flag = False
for bt in b1:
if bt[0] in string.punctuation:
bb.append(bt[1:])
flag = True
elif bt[-1] in string.punctuation:
bb.append(bt[:-1])
flag = True
else:
bb.append(bt)
b1 = bb
b2 = b1 + barrel + vs
b3 = list(set(b2))
b4 = map(lambda x: x.lower(), b3)
b_final = {}
b_final['_id'] = item['_id']
b_final['tags'] = b4
b_final['name'] = name
b_final['files'] = fls
print db.barrels.insert(b_final)
I've noticed interesting thing. Then I press ctrl+c to stop process I'm getting the following:
python index2barrel.py
Traceback (most recent call last):
File "index2barrel.py", line 83, in <module>
my_pool.map(create_barrel, db1.index.find, 6)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 280, in map_async
iterable = list(iterable)
TypeError: 'instancemethod' object is not iterable
I mean, why multiprocessing is trying to convert somethin to the list? Isn't it the source of the problem?
from the stack trace:
brk(0x231ccf000) = 0x231ccf000
futex(0x1abb150, FUTEX_WAKE_PRIVATE, 1) = 1
sendto(3, "+\0\0\0\260\263\355\356\0\0\0\0\325\7\0\0\0\0\0\0data.index\0\0"..., 43, 0, NULL, 0) = 43
recvfrom(3, "Some text from my database."..., 491663, 0, NULL, NULL) = 491663
... [manymany times]
brk(0x2320d5000) = 0x2320d5000
.... manymany times
The above sample goes and goes in strace output and for some reason strace -o logfile python myscript.py
does not halt. It just eats all the available ram and writes in log file.
UPDATE. Using imap instead of map solved my problem.

Since the find() operation is returning the cursor the the map function and since you say that this runs without a problem when you do
for item in db1.index.find(): create_barrel(item)
it looks like the create_barrel function is OK.
Can you try to limit the number of results returned in the cursor and see if this helps? I think the syntax would be:
db1.index.find().limit(100)
If you could try this and see if it helps it might help to get the cause of the problem.
EDIT1: I think you are going about this the wrong way by using the map function - I think you should be using map_reduce in the mongo python driver - that way the map function will be executed by the mongod process.

map() function gives the items in chunks to the given function. By default this chunksize is calculated like this (link to source):
chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
This probably results in too big chunk size in your case and lets the process run out of memory. Try setting the chunk size manually like this:
my_pool.map(foo, db1.index.find(), 100)
EDIT: You should also consider reusing the db connection and closing them after usage. Now you create new db connection for each item, and you don't call close() to them.
EDIT2: Also check if the while loop gets into an infinite loop (would explain the symptoms).
EDIT3: Based on the traceback you added the map function tries to convert the cursor to a list, causing all the items to be fetched at once. This happens because it want's to find how many items there are in the set. This is part of map() code from pool.py:
if not hasattr(iterable, '__len__'):
iterable = list(iterable)
You could try this to avoid conversion to list:
cursor = db1.index.find()
cursor.__len__ = cursor.count()
my_pool.map(foo, cursor)

How to parse nagios status.dat file?

I'd like to parse status.dat file for nagios3 and output as xml with a python script.
The xml part is the easy one but how do I go about parsing the file? Use multi line regex?
It's possible the file will be large as many hosts and services are monitored, will loading the whole file in memory be wise?
I only need to extract services that have critical state and host they belong to.
Any help and pointing in the right direction will be highly appreciated.
LE Here's how the file looks:
########################################
# NAGIOS STATUS FILE
#
# THIS FILE IS AUTOMATICALLY GENERATED
# BY NAGIOS. DO NOT MODIFY THIS FILE!
########################################
info {
created=1233491098
version=2.11
}
program {
modified_host_attributes=0
modified_service_attributes=0
nagios_pid=15015
daemon_mode=1
program_start=1233490393
last_command_check=0
last_log_rotation=0
enable_notifications=1
active_service_checks_enabled=1
passive_service_checks_enabled=1
active_host_checks_enabled=1
passive_host_checks_enabled=1
enable_event_handlers=1
obsess_over_services=0
obsess_over_hosts=0
check_service_freshness=1
check_host_freshness=0
enable_flap_detection=0
enable_failure_prediction=1
process_performance_data=0
global_host_event_handler=
global_service_event_handler=
total_external_command_buffer_slots=4096
used_external_command_buffer_slots=0
high_external_command_buffer_slots=0
total_check_result_buffer_slots=4096
used_check_result_buffer_slots=0
high_check_result_buffer_slots=2
}
host {
host_name=localhost
modified_attributes=0
check_command=check-host-alive
event_handler=
has_been_checked=1
should_be_scheduled=0
check_execution_time=0.019
check_latency=0.000
check_type=0
current_state=0
last_hard_state=0
plugin_output=PING OK - Packet loss = 0%, RTA = 3.57 ms
performance_data=
last_check=1233490883
next_check=0
current_attempt=1
max_attempts=10
state_type=1
last_state_change=1233489475
last_hard_state_change=1233489475
last_time_up=1233490883
last_time_down=0
last_time_unreachable=0
last_notification=0
next_notification=0
no_more_notifications=0
current_notification_number=0
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_host=1
last_update=1233491098
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
service {
host_name=gateway
service_description=PING
modified_attributes=0
check_command=check_ping!100.0,20%!500.0,60%
event_handler=
has_been_checked=1
should_be_scheduled=1
check_execution_time=4.017
check_latency=0.210
check_type=0
current_state=0
last_hard_state=0
current_attempt=1
max_attempts=4
state_type=1
last_state_change=1233489432
last_hard_state_change=1233489432
last_time_ok=1233491078
last_time_warning=0
last_time_unknown=0
last_time_critical=0
plugin_output=PING OK - Packet loss = 0%, RTA = 2.98 ms
performance_data=
last_check=1233491078
next_check=1233491378
current_notification_number=0
last_notification=0
next_notification=0
no_more_notifications=0
notifications_enabled=1
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_service=1
last_update=1233491098
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
It can have any number of hosts and a host can have any number of services.

Pfft, get yerself mk_livestatus. http://mathias-kettner.de/checkmk_livestatus.html

Nagiosity does exactly what you want:
http://code.google.com/p/nagiosity/

Having shamelessly stolen from the above examples,
Here's a version build for Python 2.4 that returns a dict containing arrays of nagios sections.
def parseConf(source):
conf = {}
patID=re.compile(r"(?:\s*define)?\s*(\w+)\s+{")
patAttr=re.compile(r"\s*(\w+)(?:=|\s+)(.*)")
patEndID=re.compile(r"\s*}")
for line in source.splitlines():
line=line.strip()
matchID = patID.match(line)
matchAttr = patAttr.match(line)
matchEndID = patEndID.match( line)
if len(line) == 0 or line[0]=='#':
pass
elif matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2).strip()
cur[1][attribute] = value
elif matchEndID and cur:
conf.setdefault(cur[0],[]).append(cur[1])
del cur
return conf
To get all Names your Host which have contactgroups beginning with 'devops':
nagcfg=parseConf(stringcontaingcompleteconfig)
hostlist=[host['host_name'] for host in nagcfg['host']
if host['contact_groups'].startswith('devops')]

Don't know nagios and its config file, but the structure seems pretty simple:
# comment
identifier {
attribute=
attribute=value
}
which can simply be translated to
<identifier>
<attribute name="attribute-name">attribute-value</attribute>
</identifier>
all contained inside a root-level <nagios> tag.
I don't see line breaks in the values. Does nagios have multi-line values?
You need to take care of equal signs within attribute values, so set your regex to non-greedy.

You can do something like this:
def parseConf(filename):
conf = []
with open(filename, 'r') as f:
for i in f.readlines():
if i[0] == '#': continue
matchID = re.search(r"([\w]+) {", i)
matchAttr = re.search(r"[ ]*([\w]+)=([\w\d]*)", i)
matchEndID = re.search(r"[ ]*}", i)
if matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2)
cur[1][attribute] = value
elif matchEndID:
conf.append(cur)
return conf
def conf2xml(filename):
conf = parseConf(filename)
xml = ''
for ID in conf:
xml += '<%s>\n' % ID[0]
for attr in ID[1]:
xml += '\t<attribute name="%s">%s</attribute>\n' % \
(attr, ID[1][attr])
xml += '</%s>\n' % ID[0]
return xml
Then try to do:
print conf2xml('conf.dat')

If you slightly tweak Andrea's solution you can use that code to parse both the status.dat as well as the objects.cache
def parseConf(source):
conf = []
for line in source.splitlines():
line=line.strip()
matchID = re.match(r"(?:\s*define)?\s*(\w+)\s+{", line)
matchAttr = re.match(r"\s*(\w+)(?:=|\s+)(.*)", line)
matchEndID = re.match(r"\s*}", line)
if len(line) == 0 or line[0]=='#':
pass
elif matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2).strip()
cur[1][attribute] = value
elif matchEndID and cur:
conf.append(cur)
del cur
return conf
It is a little puzzling why nagios chose to use two different formats for these files, but once you've parsed them both into some usable python objects you can do quite a bit of magic through the external command file.
If anybody has a solution for getting this into a a real xml dom that'd be awesome.

For the last several months I've written and released a tool that that parses the Nagios status.dat and objects.cache and builds a model that allows for some really useful manipulation of Nagios data. We use it to drive an internal operations dashboard that is a simplified 'mini' Nagios. Its under continual development and I've neglected testing and documentation but the code isn't too crazy and I feel fairly easy to follow.
Let me know what you think...
https://github.com/zebpalmer/NagParser

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Import text files to pig through python UDF - python

Related

Python scripting with ete3 to query NCBI's Taxonomy: "sqlite3 Warning (can only execute one statement at a time)"

Python on Visual Studio Error

Moving numpy arrays from VBA to Python and back

Cannot grok python multiprocessing

How to parse nagios status.dat file?

Categories

Resources