Python Converting tab limited file into csv

Python Converting tab limited file into csv - python

I basically want to convert tab delimited text file http://www.linux-usb.org/usb.ids into a csv file.
I tried importing using Excel, but it is not optimal, it turns out like:
8087 Intel Corp.
0020 Integrated Rate Matching Hub
0024 Integrated Rate Matching Hub
How I want it so for easy searching is:
8087 Intel Corp. 0020 Integrated Rate Matching Hub
8087 Intel Corp. 0024 Integrated Rate Matching Hub
Is there any ways I can do this in python?

$ListDirectory = "C:\USB_List.csv"
Invoke-WebRequest 'http://www.linux-usb.org/usb.ids' -OutFile $ListDirectory
$pageContents = Get-Content $ListDirectory | Select-Object -Skip 22
"vendor`tvendor_name`tproduct`tproduct_name`r" > $ListDirectory
#Variables and Flags
$currentVid
$currentVName
$currentPid
$currentPName
$vendorDone = $TRUE
$interfaceFlag = $FALSE
$nextline
$tab = "`t"
foreach($line in $pageContents){
if($line.StartsWith("`#")){
continue
}
elseif($line.length -eq 0){
exit
}
if(!($line.StartsWith($tab)) -and ($vendorDone -eq $TRUE)){
$vendorDone = $FALSE
}
if(!($line.StartsWith($tab)) -and ($vendorDone -eq $FALSE)){
$pos = $line.IndexOf(" ")
$currentVid = $line.Substring(0, $pos)
$currentVName = $line.Substring($pos+2)
"$currentVid`t$currentVName`t`t`r" >> $ListDirectory
$vendorDone = $TRUE
}
elseif ($line.StartsWith($tab)){
if ($interfaceFlag -eq $TRUE){
$interfaceFlag = $FALSE
}
$nextline = $line.TrimStart()
if ($nextline.StartsWith($tab)){
$interfaceFlag = $TRUE
}
if ($interfaceFlag -eq $FALSE){
$pos = $nextline.IndexOf(" ")
$currentPid = $nextline.Substring(0, $pos)
$currentPName = $nextline.Substring($pos+2)
"$currentVid`t$currentVName`t$currentPid`t$currentPName`r" >> $ListDirectory
Write-Host "$currentVid`t$currentVName`t$currentPid`t$currentPName`r"
$interfaceFlag = $FALSE
}
}
}
I know the ask is for python, but I built this PowerShell script to do the job. It takes no parameters. Just run as admin from the directory where you want to store the script. The script collects everything from the http://www.linux-usb.org/usb.ids page, parses the data and writes it to a tab delimited file. You can then open the file in excel as a tab delimited file. Ensure the columns are read as "text" and not "general" and you're go to go. :)
Parsing this page is tricky because the script has to be contextually aware of every VID-Vendor line proceeding a series of PID-Product lines. I also forced the script to ignore the commented description section, the interface-interface_name lines, the random comments that he inserted throughout the USB list (sigh) and everything after and including "#List of known device classes, subclasses and protocols" which is out of scope for this request.
I hope this helps!

You just need to write a little program that scans in the data a line at a time. Then it should check to see if the first character is a tab ('\t'). If not then that value should be stored. If it does start with tab then print out the value that was previously stored followed by the current line. The result will be the list in the format you want.

Something like this would work:
import csv
lines = []
with open("usb.ids.txt") as f:
reader = csv.reader(f, delimiter="\t")
device = ""
for line in reader:
# Ignore empty lines and comments
if len(line) == 0 or (len(line[0]) > 0 and line[0][0] == "#"):
continue
if line[0] != "":
device = line[0]
elif line[1] != "":
lines.append((device, line[1]))
print(lines)
You basically need to loop through each line, and if it's a device line, remember that for the following lines. This will only work for two columns, and you would then need to write them all to a csv file but that's easy enough

Related

How to escape only delimiter and not the newline character in CSV

I am receiving normal comma delimited CSV files with data having new line character.
Input data
I want to convert the input data to:
Pipe (|) delimited
Without any quotes to escape (" or ')
Pipe (|) within data escaped with a caret (^) character
My file may also contain multiple lines on data (or data in newline in a single row).
Expected output data
Output file I was able to generate.
As you can see in the image that caret (^) perfectly escaped all pipes (|) in data, but also escaping the newline character in 5th and 6th line, which I don't want.
NOTE: All the carriage returns (\r, or CR) and newline (\n, LF) characters should be as it is just like shown in images.
import csv
import sys
inputPath = sys.argv[1]
outputPath = sys.argv[2]
with open(inputPath, encoding="utf-8") as inputFile:
with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
reader = csv.DictReader(inputFile, delimiter=',')
writer = csv.DictWriter(
outputFile, reader.fieldnames, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='^', doublequote=False, quotechar="")
writer.writeheader()
writer.writerows(reader)
print("Formationg complete.")
The above code has been written in Python, it would be great if I can get help in Python.
Answers in other programming languages also accepted.
There is more than 8 million records
Please find below some sample data:
"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","test#test.com","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","test#test.com","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","test#test.com","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e
data in new line.","","","","","","test#test.com","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""
NOTE: If you copy above data, please make sure that 5th and 6th line should end with only LF (i.e New Line, \n) just like shown in images, or else please try to replicate those 2 line as that's what this question is all about not escaping those 2 lines specificaly, as highlighted in the image below.
The above code is the final out come of all my findings on internet. I've even tried pandas library and it's final output is same as well.

The code below is just an alternate way to get my expected output, but still the issue exists as this script takes forever (more than 12 hours) to complete (and still not finishes, ultimately I have to kill the process) when ran on 9 Millions of records.
Batch wrapper for VBS code:
0</* :
#echo off
cscript /nologo /E:jscript "%~f0" %*
exit /b %errorlevel% */0;
var ARGS = WScript.Arguments;
if (ARGS.Length < 3 ) {
WScript.Echo("Wrong arguments");
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(1);
}
if (ARGS.Item(0).toLowerCase() == "-help" || ARGS.Item(0).toLowerCase() == "-h") {
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(0);
}
if (ARGS.Length % 2 !== 1 ) {
WScript.Echo("Wrong arguments");
WScript.Quit(2);
}
var jsEscapes = {
'n': '\n',
'r': '\r',
't': '\t',
'f': '\f',
'v': '\v',
'b': '\b'
};
//string evaluation
//http://stackoverflow.com/questions/24294265/how-to-re-enable-special-character-sequneces-in-javascript
function decodeJsEscape(_, hex0, hex1, octal, other) {
var hex = hex0 || hex1;
if (hex) { return String.fromCharCode(parseInt(hex, 16)); }
if (octal) { return String.fromCharCode(parseInt(octal, 8)); }
return jsEscapes[other] || other;
}
function decodeJsString(s) {
return s.replace(
// Matches an escape sequence with UTF-16 in group 1, single byte hex in group 2,
// octal in group 3, and arbitrary other single-character escapes in group 4.
/\\(?:u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|([0-3][0-7]{0,2}|[4-7][0-7]?)|(.))/g,
decodeJsEscape);
}
function convertToPipe(find, replace, str) {
return str.replace(new RegExp('\\|','g'),"^|");
}
function removeStartingQuote(find, replace, str) {
return str.replace(new RegExp('^"', 'g'), '');
}
function removeEndQuote(find, replace, str) {
return str.replace(new RegExp('"\r\n$', 'g'), '\r\n');
}
function removeLeadingAndTrailingQuotes(find, replace, str) {
return str.replace(new RegExp('"\r\n"', 'g'), '\r\n');
}
function replaceDelimiter(find, replace, str) {
return str.replace(new RegExp('","', 'g'), '|');
}
function convertSFDCDoubleQuotes(find, replace, str) {
return str.replace(new RegExp('""', 'g'), '"');
}
function getContent(file) {
// :: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855&start=15&p=28898 ::
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // code page with minimum adjustments for input
ado.Open();
ado.LoadFromFile(file);
var adjustment = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
"\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
"\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
"\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178" ;
var fs = new ActiveXObject("Scripting.FileSystemObject");
var size = (fs.getFile(file)).size;
var lnkBytes = ado.ReadText(size);
ado.Close();
var chars=lnkBytes.split('');
for (var indx=0;indx<size;indx++) {
if ( chars[indx].charCodeAt(0) > 255 ) {
chars[indx] = String.fromCharCode(128 + adjustment.indexOf(chars[indx]));
}
}
return chars.join("");
}
function writeContent(file,content) {
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // right code page for output (no adjustments)
//ado.Mode=2;
ado.Open();
ado.WriteText(content);
ado.SaveToFile(file, 2);
ado.Close();
}
if (typeof String.prototype.startsWith != 'function') {
// see below for better implementation!
String.prototype.startsWith = function (str){
return this.indexOf(str) === 0;
};
}
var evaluate=false;
var filename=ARGS.Item(0);
if(filename.toLowerCase().startsWith("e?")) {
filename=filename.substring(2,filename.length);
evaluate=true;
}
var content=getContent(filename);
var newContent=content;
var find="";
var replace="";
for (var i=1;i<ARGS.Length-1;i=i+2){
find=ARGS.Item(i);
replace=ARGS.Item(i+1);
if(evaluate){
find=decodeJsString(find);
replace=decodeJsString(replace);
}
newContent=convertToPipe(find,replace,newContent);
newContent=removeStartingQuote(find,replace,newContent);
newContent=removeEndQuote(find,replace,newContent);
newContent=removeLeadingAndTrailingQuotes(find,replace,newContent);
newContent=replaceDelimiter(find,replace,newContent);
newContent=convertSFDCDoubleQuotes(find,replace,newContent);
}
writeContent(filename,newContent);
Execution Steps:
> replace.bat <file_name or full_path_to_file> "." "."
This batch file is made for the purpose of any file's manipulation according to our requirement.
I've compiled and made this from lot of google searches. It's still in process as I've hardcoded my regular expressions in the file. You can make changes according to your need in the functions i've made, or even make your own functions by replicating other functions, and calling them at the end.

Another alternateive to what I want to achive I've done using Wondows Powershell script.
((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]
Execution Ways:
Using Powershell
replace.ps1 '< path_to_file >'
Using a Batch Script
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"
NOTE: Powershell V5.0 or greater required
This can process 1 Million of records in a minute or so.
What I've figured out is that we have to split bulky csv files to multiplve file with 1 Million records each and then process them all seperately.
Please correct me if I'm wrong, or there's any other alternate to it.

Python 2.7 path replace string

I need to replace some string by others. I'm using function pathlib to do that it's working fine but I have a problem when there is two same string in my file and I need to change just one
My file (wireless.temp) is like this example
config 'conf'
option disabled '0'
option hidden '1'
option ssid 'confSSID'
option encryption 'psk2'
option key 'qwerty'
config 'client'
option disabled '0'
option hidden '0'
option ssid 'clientSSID'
option encryption 'psk2'
option key 'qwerty'
For example, I need to change string like 'disabled', 'hidden', 'ssid', 'key', in config station and/or config device. Right now I'm using this code
f1=open('wireless.temp', 'r').read()
f2=open('wireless.temp','w')
#checkbox from QT interface
if self.chkWifiEnable.isChecked():
newWifiEnable = "0"
else:
newWifiEnable = "1"
start = f1.find("config 'client'")
print start
end = f1.find("config", start + 1)
print end
if end < 0:
end = len(f1)
station = f1[start:end]
print station
print f1.find("disabled")
print f1.find("'")
actualValue = f1[(station.find("disabled")+10):station.find("'")]
print actualValue
station = station.replace("disabled '" + actualValue, "disabled '" + newWifiEnable)
print station
m = f1[:start] + station + f1[end:]
f2.write(m)
I have a problem with this code, first when I execute my output is
config 'conf'
option device 'radio0'
option ifname 'conf'
option network 'conf'
option mode 'ap'
option disabled '0'
option hidden '1'
option isolate '1'
option ssid 'Conf-2640'
option encryption 'psk2'
option key '12345678'
config 'client'
option device 'radio0'
option ifname 'ra0'
option network 'lan'
option mode 'ap'
option disabled '00' <---- problem
option hidden '0'
option ssid 'FW-2640'
option encryption 'psk2'
option key '12345678'
option disabled line in config 'client' section, my program add another 0 all time also I want to lighten my code because I need to do that for many others string.
Does anyone have an idea?
Thanks

The Path and pathlib2 is a red-herring. You are using that to find and read a file into a string. The issue is replacing text in a narrow section of the whole text. Specifically, between 'config device' and the next 'config ...' item
You can use .find() to find the beginning of the correct config section, and again to find the start of the next config section. Ensure you treat -1, for not found, as the end of the section. The text within that range can be modified, and then combine the resulting modification with the unmodified parts that came before and after it.
wirelessF = """
config device
.....
option key 'qwerty'
.....
config station
.....
option key 'qwerty'
.....
config otherthing
.....
option key 'qwerty'
.....
"""
actualWifiPass = 'qwerty'
newWifiPass = 's3cr3t'
start = wirelessF.find("config station")
end = wirelessF.find("config ", start+1)
if end < 0:
end = len(wirelessF)
station = wirelessF[start:end]
station = station.replace("key '"+actualWifiPass, "key '"+newWifiPass)
wirelessF = wirelessF[:start] + station + wirelessF[end:]
print(wirelessF)
Outputs:
config device
.....
option key 'qwerty'
.....
config station
.....
option key 's3cr3t'
.....
config otherthing
.....
option key 'qwerty'
.....

Import text files to pig through python UDF

I'm trying to load files to pig while use python udf, i've tried two ways:
• (myudf1, sample1.pig): try to read the file from python, the file is located on my client server.
• (myudf2, sample2.pig): load file from hdfs to grunt shell first, then pass it as a parameter to python udf.
myudf1.py
from __future__ import with_statement
def get_words(dir):
stopwords=set()
with open(dir) as f1:
for line1 in f1:
stopwords.update([line1.decode('ascii','ignore').split("\n")[0]])
return stopwords
stopwords=get_words("/home/zhge/uwc/mappings/english_stop.txt")
#outputSchema("findit: int")
def findit(stp):
stp=str(stp)
if stp in stopwords:
return 1
else:
return 0
sample1.pig:
REGISTER '/home/zhge/uwc/scripts/myudf1.py' USING jython as pyudf;
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(title);
DUMP S
I get: IOError: (2, 'No such file or directory', '/home/zhge/uwc/mappings/english_stop.txt')
For solution 2:
myudf2:
def get_wordlists(wordbag):
stopwords=set()
for t in wordbag:
stopwords.update(t.decode('ascii','ignore'))
return stopwords
#outputSchema("findit: int")
def findit(stopwordbag, stp):
stopwords=get_wordlists(stopwordbag)
stp=str(stp)
if stp in stopwords:
return 1
else:
return 0
Sample2.pig
REGISTER '/home/zhge/uwc/scripts/myudf2.py' USING jython as pyudf;
stops = load '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
-- this step works fine and i can see the "stops" obejct is loaded to pig
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(stops.stop_w, title);
DUMP S;
Then I got:
ERROR org.apache.pig.tools.grunt.Grunt -ERROR 1066: Unable to open iterator for alias S. Backend error : Scalar has more than one row in the output. 1st : (a), 2nd :(as

Your second example should work. Though you LIMITed the wrong expression -- it should be on the stops relationship. Therefore it should be:
stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = LIMIT stops 1;
S = FOREACH item_title GENERATE pyudf.findit(T.stop_w, title);
However, since it looks like you need to process all of the stop words first this will not be enough. You'll need to do a GROUP ALL and then pass the results to your get_wordlist function instead:
stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = FOREACH (GROUP stops ALL) GENERATE pyudf.get_wordlists(stops) AS ready;
S = FOREACH item_title GENERATE pyudf.findit(T.ready, title);
You'll have to update your UDF to accept a list of dicts though for this method to work.

Grep reliably all C #defines

I need to analyse some C files and print out all the #define found.
It's not that hard with a regexp (for example)
def with_regexp(fname):
print("{0}:".format(fname))
for line in open(fname):
match = macro_regexp.match(line)
if match is not None:
print(match.groups())
But for example it doesn't handle multiline defines for example.
There is a nice way to do it in C for example with
gcc -E -dM file.c
the problem is that it returns all the #defines, not just the one from the given file, and I don't find any option to only use the given file..
Any hint?
Thanks
EDIT:
This is a first solution to filter out the unwanted defines, simply checking that the name of the define is actually part of the original file, not perfect but seems to work nicely..
def with_gcc(fname):
cmd = "gcc -dM -E {0}".format(fname)
proc = Popen(cmd, shell=True, stdout=PIPE)
out, err = proc.communicate()
source = open(fname).read()
res = set()
for define in out.splitlines():
name = define.split(' ')[1]
if re.search(name, source):
res.add(define)
return res

Sounds like a job for a shell one-liner!
What I want to do is remove the all #includes from the C file (so we don't get junk from other files), pass that off to gcc -E -dM, then remove all the built in #defines - those start with _, and apparently linux and unix.
If you have #defines that start with an underscore this won't work exactly as promised.
It goes like this:
sed -e '/#include/d' foo.c | gcc -E -dM - | sed -e '/#define \(linux\|unix\|_\)/d'
You could probably do it in a few lines of Python too.

In PowerShell you could do something like the following:
function Get-Defines {
param([string] $Path)
"$Path`:"
switch -regex -file $Path {
'\\$' {
if ($multiline) { $_ }
}
'^\s*#define(.*)$' {
$multiline = $_.EndsWith('\');
$_
}
default {
if ($multiline) { $_ }
$multiline = $false
}
}
}
Using the following sample file
#define foo "bar"
blah
#define FOO \
do { \
do_stuff_here \
do_more_stuff \
} while (0)
blah
blah
#define X
it prints
\x.c:
#define foo "bar"
#define FOO \
do { \
do_stuff_here \
do_more_stuff \
} while (0)
#define X
Not ideal, at least how idiomatic PowerShell functions should work, but should work well enough for your needs.

Doing this in pure python I'd use a small state machine:
def getdefines(fname):
""" return a list of all define statements in the file """
lines = open(fname).read().split("\n") #read in the file as a list of lines
result = [] #the result list
current = []#a temp list that holds all lines belonging to a define
lineContinuation = False #was the last line break escaped with a '\'?
for line in lines:
#is the current line the start or continuation of a define statement?
isdefine = line.startswith("#define") or lineContinuation
if isdefine:
current.append(line) #append to current result
lineContinuation = line.endswith("\\") #is the line break escaped?
if not lineContinuation:
#we reached the define statements end - append it to result list
result.append('\n'.join(current))
current = [] #empty the temp list
return result

How to parse nagios status.dat file?

I'd like to parse status.dat file for nagios3 and output as xml with a python script.
The xml part is the easy one but how do I go about parsing the file? Use multi line regex?
It's possible the file will be large as many hosts and services are monitored, will loading the whole file in memory be wise?
I only need to extract services that have critical state and host they belong to.
Any help and pointing in the right direction will be highly appreciated.
LE Here's how the file looks:
########################################
# NAGIOS STATUS FILE
#
# THIS FILE IS AUTOMATICALLY GENERATED
# BY NAGIOS. DO NOT MODIFY THIS FILE!
########################################
info {
created=1233491098
version=2.11
}
program {
modified_host_attributes=0
modified_service_attributes=0
nagios_pid=15015
daemon_mode=1
program_start=1233490393
last_command_check=0
last_log_rotation=0
enable_notifications=1
active_service_checks_enabled=1
passive_service_checks_enabled=1
active_host_checks_enabled=1
passive_host_checks_enabled=1
enable_event_handlers=1
obsess_over_services=0
obsess_over_hosts=0
check_service_freshness=1
check_host_freshness=0
enable_flap_detection=0
enable_failure_prediction=1
process_performance_data=0
global_host_event_handler=
global_service_event_handler=
total_external_command_buffer_slots=4096
used_external_command_buffer_slots=0
high_external_command_buffer_slots=0
total_check_result_buffer_slots=4096
used_check_result_buffer_slots=0
high_check_result_buffer_slots=2
}
host {
host_name=localhost
modified_attributes=0
check_command=check-host-alive
event_handler=
has_been_checked=1
should_be_scheduled=0
check_execution_time=0.019
check_latency=0.000
check_type=0
current_state=0
last_hard_state=0
plugin_output=PING OK - Packet loss = 0%, RTA = 3.57 ms
performance_data=
last_check=1233490883
next_check=0
current_attempt=1
max_attempts=10
state_type=1
last_state_change=1233489475
last_hard_state_change=1233489475
last_time_up=1233490883
last_time_down=0
last_time_unreachable=0
last_notification=0
next_notification=0
no_more_notifications=0
current_notification_number=0
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_host=1
last_update=1233491098
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
service {
host_name=gateway
service_description=PING
modified_attributes=0
check_command=check_ping!100.0,20%!500.0,60%
event_handler=
has_been_checked=1
should_be_scheduled=1
check_execution_time=4.017
check_latency=0.210
check_type=0
current_state=0
last_hard_state=0
current_attempt=1
max_attempts=4
state_type=1
last_state_change=1233489432
last_hard_state_change=1233489432
last_time_ok=1233491078
last_time_warning=0
last_time_unknown=0
last_time_critical=0
plugin_output=PING OK - Packet loss = 0%, RTA = 2.98 ms
performance_data=
last_check=1233491078
next_check=1233491378
current_notification_number=0
last_notification=0
next_notification=0
no_more_notifications=0
notifications_enabled=1
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_service=1
last_update=1233491098
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
It can have any number of hosts and a host can have any number of services.

Pfft, get yerself mk_livestatus. http://mathias-kettner.de/checkmk_livestatus.html

Nagiosity does exactly what you want:
http://code.google.com/p/nagiosity/

Having shamelessly stolen from the above examples,
Here's a version build for Python 2.4 that returns a dict containing arrays of nagios sections.
def parseConf(source):
conf = {}
patID=re.compile(r"(?:\s*define)?\s*(\w+)\s+{")
patAttr=re.compile(r"\s*(\w+)(?:=|\s+)(.*)")
patEndID=re.compile(r"\s*}")
for line in source.splitlines():
line=line.strip()
matchID = patID.match(line)
matchAttr = patAttr.match(line)
matchEndID = patEndID.match( line)
if len(line) == 0 or line[0]=='#':
pass
elif matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2).strip()
cur[1][attribute] = value
elif matchEndID and cur:
conf.setdefault(cur[0],[]).append(cur[1])
del cur
return conf
To get all Names your Host which have contactgroups beginning with 'devops':
nagcfg=parseConf(stringcontaingcompleteconfig)
hostlist=[host['host_name'] for host in nagcfg['host']
if host['contact_groups'].startswith('devops')]

Don't know nagios and its config file, but the structure seems pretty simple:
# comment
identifier {
attribute=
attribute=value
}
which can simply be translated to
<identifier>
<attribute name="attribute-name">attribute-value</attribute>
</identifier>
all contained inside a root-level <nagios> tag.
I don't see line breaks in the values. Does nagios have multi-line values?
You need to take care of equal signs within attribute values, so set your regex to non-greedy.

You can do something like this:
def parseConf(filename):
conf = []
with open(filename, 'r') as f:
for i in f.readlines():
if i[0] == '#': continue
matchID = re.search(r"([\w]+) {", i)
matchAttr = re.search(r"[ ]*([\w]+)=([\w\d]*)", i)
matchEndID = re.search(r"[ ]*}", i)
if matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2)
cur[1][attribute] = value
elif matchEndID:
conf.append(cur)
return conf
def conf2xml(filename):
conf = parseConf(filename)
xml = ''
for ID in conf:
xml += '<%s>\n' % ID[0]
for attr in ID[1]:
xml += '\t<attribute name="%s">%s</attribute>\n' % \
(attr, ID[1][attr])
xml += '</%s>\n' % ID[0]
return xml
Then try to do:
print conf2xml('conf.dat')

If you slightly tweak Andrea's solution you can use that code to parse both the status.dat as well as the objects.cache
def parseConf(source):
conf = []
for line in source.splitlines():
line=line.strip()
matchID = re.match(r"(?:\s*define)?\s*(\w+)\s+{", line)
matchAttr = re.match(r"\s*(\w+)(?:=|\s+)(.*)", line)
matchEndID = re.match(r"\s*}", line)
if len(line) == 0 or line[0]=='#':
pass
elif matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2).strip()
cur[1][attribute] = value
elif matchEndID and cur:
conf.append(cur)
del cur
return conf
It is a little puzzling why nagios chose to use two different formats for these files, but once you've parsed them both into some usable python objects you can do quite a bit of magic through the external command file.
If anybody has a solution for getting this into a a real xml dom that'd be awesome.

For the last several months I've written and released a tool that that parses the Nagios status.dat and objects.cache and builds a model that allows for some really useful manipulation of Nagios data. We use it to drive an internal operations dashboard that is a simplified 'mini' Nagios. Its under continual development and I've neglected testing and documentation but the code isn't too crazy and I feel fairly easy to follow.
Let me know what you think...
https://github.com/zebpalmer/NagParser

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Converting tab limited file into csv - python

Related

How to escape only delimiter and not the newline character in CSV

Python 2.7 path replace string

Import text files to pig through python UDF

Grep reliably all C #defines

How to parse nagios status.dat file?

Categories

Resources