I'm trying to parse a db file with Python that is over 4 GB.
Example from the db file:
% Tags relating to '217.89.104.48 - 217.89.104.63'
% RIPE-USER-RESOURCE
inetnum: 194.243.227.240 - 194.243.227.255
netname: PRINCESINDUSTRIEALIMENTARI
remarks: INFRA-AW
descr: PRINCES INDUSTRIE ALIMENTARI
descr: Provider Local Registry
descr: BB IBS
country: IT
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: order.manager2#telecomitalia.it
mnt-by: INTERB-MNT
changed: unread#ripe.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
% Tags relating to '194.243.227.240 - 194.243.227.255'
% RIPE-USER-RESOURCE
inetnum: 194.16.216.176 - 194.16.216.183
netname: SE-CARLSTEINS
descr: CARLSTEINS TRAFIK AB
org: ORG-CTA17-RIPE
country: SE
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: mntripe#telia.net
mnt-by: TELIANET-LIR
changed: unread#ripe.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
I want to parse each block starting with "% Tags relating to", and from each block I want to extract the inetnum and the first descr.
This is what I have so far (updated):
import re

with open('test.db', "r") as f:
    content = f.read()

r = re.compile(r'descr:\s+(.*?)\n',
               re.IGNORECASE)
res = r.findall(content)
print res
As it's an over-4 GB file, you don't want to read the whole file in one go with f.read(); use the file object as an iterator instead (when you iterate over a file you get one line after the other). The following generator should do the job:
def parse(filename):
    current = None
    with open(filename) as f:
        for l in f:
            if l.startswith("% Tags relating to"):
                if current is not None:
                    yield current
                current = {}
            elif l.startswith("inetnum:"):
                current["inetnum"] = l.split(":", 1)[1].strip()
            elif l.startswith("descr") and "descr" not in current:
                current["descr"] = l.split(":", 1)[1].strip()
    if current is not None:
        yield current
and you can use it as follows:
for record in parse("test.db"):
    print(record)
result on the test file:
{'inetnum': '194.243.227.240 - 194.243.227.255', 'descr': 'PRINCES INDUSTRIE ALIMENTARI'}
{'inetnum': '194.16.216.176 - 194.16.216.183', 'descr': 'CARLSTEINS TRAFIK AB'}
If you only want to get the first descr:
r = re.compile(r'descr:\s+(.*?)\n(?:descr:.*\n)*',
               re.IGNORECASE)
If you want inetnum and the first descr:
r = re.compile(r'(?:descr:\s+(.*?)\n(?:descr:.*\n)*)|(?:inetnum:\s+(.*?)\n)',
               re.IGNORECASE)
res = [a + b for (a, b) in r.findall(content)]
I must admit I make no use of % Tags relating to, and I assume that all descr lines are consecutive.
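For completeness, here is a small hedged sketch of pairing the results up, assuming every block contains exactly one inetnum followed by at least one descr (otherwise the alternation breaks down):
# under that assumption, res alternates inetnum, descr, inetnum, descr, ...
pairs = list(zip(res[::2], res[1::2]))
for inetnum, descr in pairs:
    print(inetnum, descr)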
Question1: I want to extract the data between "Target Information" and the line before "Group Information" and store it as a variable or appropriately.
Question2: Next, I want to extract the data from "Group Information" till the end of the file and store it in a variable or something appropriate.
Question3: With this information in both the above cases, I want to extract the line just after the line which starts with "Name"
Using the code below I was able to get the information between "Target Information" and "Group Information" and capture it in the "required_lines" variable.
Next, I am trying to get the line after the line starting with "Name", but this fails. Also, can this logic be implemented using a regex?
# Extract the lines between
with open('showrcopy.txt', 'r') as f:
    file = f.readlines()

required_lines1 = []
required_lines = []
inRecordingMode = False
for line in file:
    if not inRecordingMode:
        if line.startswith('Target Information'):
            inRecordingMode = True
    elif line.startswith('Group Information'):
        inRecordingMode = False
    else:
        required_lines.append(line.strip())
print(required_lines)
# Extract the line after the line "Name"
def gen():
    for x in required_lines:
        yield x

for line in gen():
    if "Name" in line:
        print(next(gen()))
showrcopy.txt
root#gnodee184119:/home/usr/redsuren# date; showrcopy -qw
Tue Aug 24 00:20:38 PDT 2021
Remote Copy System Information
Status: Started, Normal
Target Information
Name ID Type Status Policy QW-Server QW-Ver Q-Status Q-Status-Qual ATF-Timeout
s2976 4 IP ready mirror_config https://10.157.35.148:8443 4.0.007 Re-starting Quorum not stable 10
Link Information
Target Node Address Status Options
s2976 0:9:1 192.168.20.21 Up -
s2976 1:9:1 192.168.20.22 Up -
receive 0:9:1 192.168.10.21 Up -
receive 1:9:1 192.168.10.22 Up -
Group Information
Name Target Status Role Mode Options
SG_hpux_vgcgloack.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active
LocalVV ID RemoteVV ID SyncStatus LastSyncTime
vgcglock_SG_cluster 13496 vgcglock_SG_cluster 28505 Synced NA
Name Target Status Role Mode Options
aix_rcg1_AA.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active
LocalVV ID RemoteVV ID SyncStatus LastSyncTime
tpvvA_aix_r.2 20149 tpvvA_aix.2 41097 Synced NA
tpvvA_aix_r.3 20150 tpvvA_aix.3 41098 Synced NA
tpvvA_aix_r.4 20151 tpvvA_aix.4 41099 Synced NA
tpvvA_aix_r.5 20152 tpvvA_aix.5 41100 Synced NA
tpvvA_aix_r.6 20153 tpvvA_aix.6 41101 Synced NA
tpvvA_aix_r.7 20154 tpvvA_aix.7 41102 Synced NA
tpvvA_aix_r.8 20155 tpvvA_aix.8 41103 Synced NA
tpvvA_aix_r.9 20156 tpvvA_aix.9 41104 Synced NA
tpvvA_aix_r.10 20157 tpvvA_aix.10 41105 Synced NA
Here's a regex solution to pull the target info and group info:
import re

with open("./showrcopy.txt", "r") as f:
    text = f.read()

target_info_pattern = re.compile(r"Target Information([.\s\S]*)Group Information")
group_info_pattern = re.compile(r"Group Information([.\s\S]*)")
target_info = target_info_pattern.findall(text)[0].strip().split("\n")
group_info = group_info_pattern.findall(text)[0].strip().split("\n")
target_info_line_after_name = target_info[1]
group_info_line_after_name = group_info[1]
And the lines you're interested in:
>>> target_info_line_after_name
's2976 4 IP ready mirror_config https://10.157.35.148:8443 4.0.007 Re-starting Quorum not stable 10'
>>> group_info_line_after_name
'SG_hpux_vgcgloack.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active'
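The group section repeats its "Name ..." header for every group, so if you want the line after each "Name" header rather than just the first one, a small sketch along these lines should work; it simply indexes the group_info list built above (and works the same way for target_info):
# collect the data line that follows each "Name ..." header row
lines_after_name = [group_info[i + 1]
                    for i, line in enumerate(group_info[:-1])
                    if line.startswith("Name")]
for line in lines_after_name:
    print(line)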
I can open this file directly from the net, and I want to add row numbers to each line based on a rule: if the header row should be numbered, start from number 1 at the header; if not, start numbering from the next line. This is my code; I have tried a lot but it doesn't work. Does anyone know how to solve this problem? Thanks in advance!
import sys


class Main:
    def task1(self):
        print('*' * 30, 'Task')
        import urllib.request
        # url
        url = 'http://www.born.nhely.hu/group_list.txt'
        # Initiate a request to get a response
        while True:
            try:
                response = urllib.request.urlopen(url)
            except Exception as e:
                print('An error has occurred, the request is being made again, the error message is as follows:', e)
            else:
                break
        # Print all student information
        content = response.read().decode('utf-8')
        # add row number
        header_row = input("Do you want to know header_row numbers? Y OR N?")
        if header_row == 'Y':
            for i, line in enumerate(content, start=1):
                print(f'{i},{line}')
        else:
            for i, line in enumerate(content, start=0):
                print('{},{}'.format(i, line.strip()))

    def start(self):
        self.task1()


Main().start()
Have a look at the data you are downloading:
Name;Short name;Email;Country;Other spoken languages
ABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?
AGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English
...
Now look at the results you are getting:
1,N
2,a
3,m
4,e
5,;
6,S
7,h
8,o
...
It should be apparent that you are looping character by character, not line by line.
When you have:
for i, line in enumerate(content, start=1):
    print(f'{i},{line}')
content is a string -- not a list of lines -- so you will loop over the string character by character with the for loop.
So to fix, do:
for i, line in enumerate(content.splitlines(), start=1):
    print(f'{i},{line}')
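To also honor the Y/N choice from the question (number the header as row 1, or skip it and start numbering at the first data line), something like the sketch below should work on top of the splitlines() fix; the exact behaviour for 'N' is my reading of the stated rule, so treat it as an assumption:
lines = content.splitlines()
if header_row == 'Y':
    # header included: the header itself becomes row 1
    for i, line in enumerate(lines, start=1):
        print(f'{i},{line}')
else:
    # header excluded: numbering starts at the first data line
    for i, line in enumerate(lines[1:], start=1):
        print(f'{i},{line}')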
Or, you can change the method of reading from the server to reading lines instead of characters:
content = response.readlines()
You're absorbing the .txt content in one big string... if you use .readlines() instead of .read(), you can achieve what you want.
You should modify this:
# Print all student information
content = response.read().decode('utf-8')
To this:
# Print all student information
content = response.readlines()
You can use the repr() function to take a look at your data:
print(repr(content))
'Name;Short name;Email;Country;Other spoken languages\r\nABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?\r\nAGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English\r\nAMIN Asjad;?;;?;?\r\nATILA Arda Burak;Arda;arda_atila#hotmail.com;Turkey;English\r\nBELTRAN CASTRO Carlos Ricardo;Ricardo;crbeltrancas#gmail.com;Colombia;English, Chinese\r\nBhatti Muhammad Hasan;?;;?;?\r\nCAKIR Alp Hazar;Alp;alphazarc#gmail.com;Turkey;English\r\nDENG Zhihui;Deng;dzhfalcon0727#gmail.com;China;English\r\nDURUER Ahmet Enes;Ahmet / kahverengi;hello#ahmetduruer.com;Turkey;English\r\nENKHZAYA Jagar;Jager;japman2400#gmail.com;Mongolia;English\r\nGHAIBAH Sanaa;Sanaa;sanaagheibeh12#gmail.com;Syria;English\r\nGUO Ruizheng;?;ruizhengguo#gmail.com;China;English\r\nGURBANZADE Gurban;Qurban;gurbanzade01#gmail.com;Azeribaijan;English, Russian, Turkish\r\nHASNAIN Syed Muhammad;Hasnain;syedhasnainhijazy313#gmail.com;Pakistan;?\r\nISMAYILOV Firdovsi;Firi;firiisi#gmail.com;Azeribaijan ?;English,Russian,Turkish\r\nKINGRANI Muskan;Muskan;muskankingrani4#gmail.com;India;English\r\nKOKO Susan Kekeli Ruth;Susan;susankoko3#gmail.com;Ghana;N/A\r\nKOLA-OLALEYE Adeola Damilola;Adeola;inboxadeola#gmail.com;Nigeria;French\r\nLEWIS Madison Buse;?;madisonbuse#yahoo.com;Turkey;Turkish\r\nLI Ting;Ting;514053044#qq.com;China;English\r\nMARUSENKO Svetlana;Svetlana;svetlana.maru#gmail.com;Russia;English, German\r\nMOHANTY Cyrus;cyrus;cyrusmohanty5261#gmail.com;India;English\r\nMOTHOBI Thabo Emmanuel;thabo;thabomothobi#icloud.com;South Africa;English\r\nNayudu Yashmit Vinay;?;;?;?\r\nPurevsuren Davaadorj;?;Purevsuren.davaadorj99#gmail.com;Mongolia ?;English\r\nSAJID Anoosha;Anoosha;anooshasajid12#gmail.com;Pakistan;English\r\nSHANG Rongxiang;Xiang;1074482757#qq.com;China;English\r\nSU Haobo;Su;2483851740#qq.com;China;English\r\nTAKEUCHI ROSSMAN Elly;Elly;elliebanana10th#gmail.com;Japan;English\r\nULUSOY Nedim Can;Nedim;nedimcanulusoy#gmail.com;Turkey;English, Hungarian\r\nXuan Qijian;Xuan;xjwjadon#gmail.com;China ?;?\r\nYUAN Gaopeng;Yuan;1277237374#qq.com;China;English\r\n'
vs
print(repr(content))
[b'Name;Short name;Email;Country;Other spoken languages\r\n', b'ABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?\r\n', b'AGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English\r\n', b'AMIN Asjad;?;;?;?\r\n', b'ATILA Arda Burak;Arda;arda_atila#hotmail.com;Turkey;English\r\n', b'BELTRAN CASTRO Carlos Ricardo;Ricardo;crbeltrancas#gmail.com;Colombia;English, Chinese\r\n', b'Bhatti Muhammad Hasan;?;;?;?\r\n', b'CAKIR Alp Hazar;Alp;alphazarc#gmail.com;Turkey;English\r\n', b'DENG Zhihui;Deng;dzhfalcon0727#gmail.com;China;English\r\n', b'DURUER Ahmet Enes;Ahmet / kahverengi;hello#ahmetduruer.com;Turkey;English\r\n', b'ENKHZAYA Jagar;Jager;japman2400#gmail.com;Mongolia;English\r\n', b'GHAIBAH Sanaa;Sanaa;sanaagheibeh12#gmail.com;Syria;English\r\n', b'GUO Ruizheng;?;ruizhengguo#gmail.com;China;English\r\n', b'GURBANZADE Gurban;Qurban;gurbanzade01#gmail.com;Azeribaijan;English, Russian, Turkish\r\n', b'HASNAIN Syed Muhammad;Hasnain;syedhasnainhijazy313#gmail.com;Pakistan;?\r\n', b'ISMAYILOV Firdovsi;Firi;firiisi#gmail.com;Azeribaijan ?;English,Russian,Turkish\r\n', b'KINGRANI Muskan;Muskan;muskankingrani4#gmail.com;India;English\r\n', b'KOKO Susan Kekeli Ruth;Susan;susankoko3#gmail.com;Ghana;N/A\r\n', b'KOLA-OLALEYE Adeola Damilola;Adeola;inboxadeola#gmail.com;Nigeria;French\r\n', b'LEWIS Madison Buse;?;madisonbuse#yahoo.com;Turkey;Turkish\r\n', b'LI Ting;Ting;514053044#qq.com;China;English\r\n', b'MARUSENKO Svetlana;Svetlana;svetlana.maru#gmail.com;Russia;English, German\r\n', b'MOHANTY Cyrus;cyrus;cyrusmohanty5261#gmail.com;India;English\r\n', b'MOTHOBI Thabo Emmanuel;thabo;thabomothobi#icloud.com;South Africa;English\r\n', b'Nayudu Yashmit Vinay;?;;?;?\r\n', b'Purevsuren Davaadorj;?;Purevsuren.davaadorj99#gmail.com;Mongolia ?;English\r\n', b'SAJID Anoosha;Anoosha;anooshasajid12#gmail.com;Pakistan;English\r\n', b'SHANG Rongxiang;Xiang;1074482757#qq.com;China;English\r\n', b'SU Haobo;Su;2483851740#qq.com;China;English\r\n', b'TAKEUCHI ROSSMAN Elly;Elly;elliebanana10th#gmail.com;Japan;English\r\n', b'ULUSOY Nedim Can;Nedim;nedimcanulusoy#gmail.com;Turkey;English, Hungarian\r\n', b'Xuan Qijian;Xuan;xjwjadon#gmail.com;China ?;?\r\n', b'YUAN Gaopeng;Yuan;1277237374#qq.com;China;English\r\n']
Also, instead of hard-coding the charset as utf-8, you can use response.headers.get_content_charset()
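Putting those two points together, the decode step might look like the sketch below; get_content_charset() can return None, so the fallback to utf-8 is an assumption:
charset = response.headers.get_content_charset() or 'utf-8'
content = response.read().decode(charset)
for i, line in enumerate(content.splitlines(), start=1):
    print(f'{i},{line}')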
I'm using bibtexparser to parse a bibtex file.
import bibtexparser

with open('MetaGJK12842.bib', 'r') as bibfile:
    bibdata = bibtexparser.load(bibfile)
While parsing I get the error message:
Could not parse properly, starting at
@article{Frenn:EvidenceBasedNursing:1999,
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pyparsing.py", line 3183, in parseImpl
    raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected end of text (at char 5773750), (line:47478, col:1)
The line refers to the following bibtex entry:
@article{Frenn:EvidenceBasedNursing:1999,
author = {Frenn, M.},
title = {A Mediterranean type diet reduced all cause and cardiac mortality after a first myocardial infarction [commentary on de Lorgeril M, Salen P, Martin JL, et al. Mediterranean dietary pattern in a randomized trial: prolonged survival and possible reduced cancer rate. ARCH INTERN MED 1998;158:1181-7]},
journal = {Evidence Based Nursing},
uuid = {15A66A61-0343-475A-8700-F311B08BB2BC},
volume = {2},
number = {2},
pages = {48-48},
address = {College of Nursing, Marquette University, Milwaukee, WI},
year = {1999},
ISSN = {1367-6539},
url = {},
keywords = {Treatment Outcomes;Mediterranean Diet;Mortality;France;Neoplasms -- Prevention and Control;Phase One Excluded - No Assessment of Vegetable as DV;Female;Phase One - Reviewed by Hao;Myocardial Infarction -- Diet Therapy;Diet, Fat-Restricted;Phase One Excluded - No Fruit or Vegetable Study;Phase One Excluded - No Assessment of Fruit as DV;Male;Clinical Trials},
tags = {Phase One Excluded - No Assessment of Vegetable as DV;Phase One Excluded - No Fruit or Vegetable Study;Phase One - Reviewed by Hao;Phase One Excluded - No Assessment of Fruit as DV},
accession_num = {2000008864. Language: English. Entry Date: 20000201. Revision Date: 20130524. Publication Type: journal article},
remote_database_name = {rzh},
source_app = {EndNote},
EndNote_reference_number = {4413},
Secondary_title = {Evidence Based Nursing},
Citation_identifier = {Frenn 1999a},
remote_database_provider = {EBSCOhost},
publicationStatus = {Unknown},
abstract = {Question: text.},
notes = {(0) abstract; commentary. Journal Subset: Core Nursing; Europe; Nursing; Peer Reviewed; UK \& Ireland. No. of Refs: 1 ref. NLM UID: 9815947.}
}
What is wrong with this entry?
It seems that the issue has been addressed and resolved in the project repository (see Issue 147)
Until the next release, installing the library from the git repository can serve as a temporary fix.
pip install --upgrade git+https://github.com/sciunto-org/python-bibtexparser.git#master
I had this same error and found an entry near the line mentioned in the error that had a line like this:
...
    year = {1959},
    month =
}
When I removed the null month item it parsed for me.
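If editing the .bib file by hand isn't practical, a pre-processing step can strip such dangling fields before parsing. This is only a sketch, and it assumes the offending lines consist of a field name and '=' with no value at all:
import re
import bibtexparser

with open('MetaGJK12842.bib', 'r') as bibfile:
    raw = bibfile.read()

# drop fields that have no value at all, e.g. a "month =" line right before "}"
cleaned = re.sub(r'^\s*\w+\s*=\s*,?\s*$\n', '', raw, flags=re.MULTILINE)
bibdata = bibtexparser.loads(cleaned)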
I'm working on appengine-mapreduce function and have modified the demo to fit my purpose.
Basically I have over a million lines in the following format: userid, time1, time2. My purpose is to find the difference between time1 and time2 for each userid.
However, as I run this on Google App Engine, I encountered this error message in the logs section:
Exceeded soft private memory limit with 180.56 MB after servicing 130 requests total
While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
def time_count_map(data):
    """Time count map function."""
    (entry, text_fn) = data
    text = text_fn()
    try:
        q = text.split('\n')
        for m in q:
            reader = csv.reader([m.replace('\0', '')], skipinitialspace=True)
            for s in reader:
                # Calculate time elapsed
                sdw = s[1]
                start_date = time.strptime(sdw, "%m/%d/%y %I:%M:%S%p")
                edw = s[2]
                end_date = time.strptime(edw, "%m/%d/%y %I:%M:%S%p")
                time_difference = time.mktime(end_date) - time.mktime(start_date)
                yield (s[0], time_difference)
    except IndexError, e:
        logging.debug(e)


def time_count_reduce(key, values):
    """Time count reduce function."""
    time = 0.0
    for subtime in values:
        time += float(subtime)
    realtime = int(time)
    yield "%s: %d\n" % (key, realtime)
Can anyone suggest how I can optimize my code further? Thanks!!
Edited:
Here's the pipeline handler:
class TimeCountPipeline(base_handler.PipelineBase):
    """A pipeline to run Time count demo.

    Args:
        blobkey: blobkey to process as string. Should be a zip archive with
            text files inside.
    """

    def run(self, filekey, blobkey):
        logging.debug("filename is %s" % filekey)
        output = yield mapreduce_pipeline.MapreducePipeline(
            "time_count",
            "main.time_count_map",
            "main.time_count_reduce",
            "mapreduce.input_readers.BlobstoreZipInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={
                "blob_key": blobkey,
            },
            reducer_params={
                "mime_type": "text/plain",
            },
            shards=32)
        yield StoreOutput("TimeCount", filekey, output)
Mapreduce.yaml:
mapreduce:
- name: Make messages lowercase
  params:
  - name: done_callback
    value: /done
  mapper:
    handler: main.lower_case_posts
    input_reader: mapreduce.input_readers.DatastoreInputReader
    params:
    - name: entity_kind
      default: main.Post
    - name: processing_rate
      default: 100
    - name: shard_count
      default: 4
- name: Make messages upper case
  params:
  - name: done_callback
    value: /done
  mapper:
    handler: main.upper_case_posts
    input_reader: mapreduce.input_readers.DatastoreInputReader
    params:
    - name: entity_kind
      default: main.Post
    - name: processing_rate
      default: 100
    - name: shard_count
      default: 4
The rest of the files are exactly the same as the demo.
I've uploaded a copy of my codes on dropbox: http://dl.dropbox.com/u/4288806/demo%20compressed%20fail%20memory.zip
Also consider calling gc.collect() at regular points during your code. I've seen several SO questions about exceeding soft memory limits that were alleviated by calling gc.collect(), most having to do with blobstore.
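For example, a hedged rewrite of the map function that collects garbage every few hundred rows could look like this; the interval of 500 is arbitrary, and the rest mirrors the original function:
import csv
import gc
import logging
import time


def time_count_map(data):
    """Time count map function that triggers garbage collection periodically."""
    (entry, text_fn) = data
    text = text_fn()
    try:
        for n, line in enumerate(text.split('\n')):
            if n and n % 500 == 0:    # 500 is an arbitrary interval
                gc.collect()
            reader = csv.reader([line.replace('\0', '')], skipinitialspace=True)
            for s in reader:
                start_date = time.strptime(s[1], "%m/%d/%y %I:%M:%S%p")
                end_date = time.strptime(s[2], "%m/%d/%y %I:%M:%S%p")
                yield (s[0], time.mktime(end_date) - time.mktime(start_date))
    except IndexError as e:
        logging.debug(e)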
It is likely your input file exceeds the soft memory limit in size. For big files use either BlobstoreLineInputReader or BlobstoreZipLineInputReader.
These input readers pass something different to the map function: they pass the start_position in the file and the line of text.
Your map function might look something like:
def time_count_map(data):
    """Time count map function."""
    text = data[1]
    try:
        reader = csv.reader([text.replace('\0', '')], skipinitialspace=True)
        for s in reader:
            # Calculate time elapsed
            sdw = s[1]
            start_date = time.strptime(sdw, "%m/%d/%y %I:%M:%S%p")
            edw = s[2]
            end_date = time.strptime(edw, "%m/%d/%y %I:%M:%S%p")
            time_difference = time.mktime(end_date) - time.mktime(start_date)
            yield (s[0], time_difference)
    except IndexError, e:
        logging.debug(e)
Using BlobstoreLineInputReader will allow the job to run much faster, as it can use more than one shard, up to 256, but it means you need to upload your files uncompressed, which can be a pain. I handle it by uploading the compressed files to an EC2 Windows server, then decompressing and uploading from there, since upstream bandwidth is so big.
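For reference, the run method of the TimeCountPipeline above might then change roughly as sketched below. I'm not certain of the exact mapper_params key BlobstoreLineInputReader expects (it may be "blob_keys" rather than "blob_key"), so treat that as an assumption and check the input reader's documentation:
def run(self, filekey, blobkey):
    logging.debug("filename is %s" % filekey)
    output = yield mapreduce_pipeline.MapreducePipeline(
        "time_count",
        "main.time_count_map",
        "main.time_count_reduce",
        "mapreduce.input_readers.BlobstoreLineInputReader",
        "mapreduce.output_writers.BlobstoreOutputWriter",
        mapper_params={
            # assumption: the line reader may expect "blob_keys" instead
            "blob_keys": blobkey,
        },
        reducer_params={
            "mime_type": "text/plain",
        },
        shards=256)   # the line reader can use far more shards than the zip reader
    yield StoreOutput("TimeCount", filekey, output)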
I'd like to parse status.dat file for nagios3 and output as xml with a python script.
The XML part is the easy one, but how do I go about parsing the file? Use a multi-line regex?
It's possible the file will be large, as many hosts and services are monitored, so will loading the whole file into memory be wise?
I only need to extract services that have critical state and host they belong to.
Any help and pointing in the right direction will be highly appreciated.
Later edit: here's how the file looks:
########################################
# NAGIOS STATUS FILE
#
# THIS FILE IS AUTOMATICALLY GENERATED
# BY NAGIOS. DO NOT MODIFY THIS FILE!
########################################
info {
created=1233491098
version=2.11
}
program {
modified_host_attributes=0
modified_service_attributes=0
nagios_pid=15015
daemon_mode=1
program_start=1233490393
last_command_check=0
last_log_rotation=0
enable_notifications=1
active_service_checks_enabled=1
passive_service_checks_enabled=1
active_host_checks_enabled=1
passive_host_checks_enabled=1
enable_event_handlers=1
obsess_over_services=0
obsess_over_hosts=0
check_service_freshness=1
check_host_freshness=0
enable_flap_detection=0
enable_failure_prediction=1
process_performance_data=0
global_host_event_handler=
global_service_event_handler=
total_external_command_buffer_slots=4096
used_external_command_buffer_slots=0
high_external_command_buffer_slots=0
total_check_result_buffer_slots=4096
used_check_result_buffer_slots=0
high_check_result_buffer_slots=2
}
host {
host_name=localhost
modified_attributes=0
check_command=check-host-alive
event_handler=
has_been_checked=1
should_be_scheduled=0
check_execution_time=0.019
check_latency=0.000
check_type=0
current_state=0
last_hard_state=0
plugin_output=PING OK - Packet loss = 0%, RTA = 3.57 ms
performance_data=
last_check=1233490883
next_check=0
current_attempt=1
max_attempts=10
state_type=1
last_state_change=1233489475
last_hard_state_change=1233489475
last_time_up=1233490883
last_time_down=0
last_time_unreachable=0
last_notification=0
next_notification=0
no_more_notifications=0
current_notification_number=0
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_host=1
last_update=1233491098
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
service {
host_name=gateway
service_description=PING
modified_attributes=0
check_command=check_ping!100.0,20%!500.0,60%
event_handler=
has_been_checked=1
should_be_scheduled=1
check_execution_time=4.017
check_latency=0.210
check_type=0
current_state=0
last_hard_state=0
current_attempt=1
max_attempts=4
state_type=1
last_state_change=1233489432
last_hard_state_change=1233489432
last_time_ok=1233491078
last_time_warning=0
last_time_unknown=0
last_time_critical=0
plugin_output=PING OK - Packet loss = 0%, RTA = 2.98 ms
performance_data=
last_check=1233491078
next_check=1233491378
current_notification_number=0
last_notification=0
next_notification=0
no_more_notifications=0
notifications_enabled=1
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_service=1
last_update=1233491098
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
It can have any number of hosts and a host can have any number of services.
Pfft, get yerself mk_livestatus. http://mathias-kettner.de/checkmk_livestatus.html
Nagiosity does exactly what you want:
http://code.google.com/p/nagiosity/
Having shamelessly stolen from the above examples, here's a version built for Python 2.4 that returns a dict containing arrays of Nagios sections.
import re

def parseConf(source):
    conf = {}
    patID = re.compile(r"(?:\s*define)?\s*(\w+)\s+{")
    patAttr = re.compile(r"\s*(\w+)(?:=|\s+)(.*)")
    patEndID = re.compile(r"\s*}")
    for line in source.splitlines():
        line = line.strip()
        matchID = patID.match(line)
        matchAttr = patAttr.match(line)
        matchEndID = patEndID.match(line)
        if len(line) == 0 or line[0] == '#':
            pass
        elif matchID:
            identifier = matchID.group(1)
            cur = [identifier, {}]
        elif matchAttr:
            attribute = matchAttr.group(1)
            value = matchAttr.group(2).strip()
            cur[1][attribute] = value
        elif matchEndID and cur:
            conf.setdefault(cur[0], []).append(cur[1])
            del cur
    return conf
To get the names of all hosts whose contact_groups begin with 'devops':
nagcfg = parseConf(string_containing_complete_config)
hostlist = [host['host_name'] for host in nagcfg['host']
            if host['contact_groups'].startswith('devops')]
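The original question asked for services in a critical state together with the host they belong to; under the usual Nagios convention that current_state 2 means CRITICAL for a service (an assumption worth verifying against your status.dat), a similar comprehension works:
status = parseConf(open('status.dat').read())
critical = [(svc['host_name'], svc['service_description'])
            for svc in status.get('service', [])
            if svc.get('current_state') == '2']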
I don't know Nagios and its config file, but the structure seems pretty simple:
# comment
identifier {
    attribute=
    attribute=value
}
which can simply be translated to
<identifier>
    <attribute name="attribute-name">attribute-value</attribute>
</identifier>
all contained inside a root-level <nagios> tag.
I don't see line breaks in the values. Does nagios have multi-line values?
You need to take care of equal signs within attribute values, so set your regex to non-greedy.
You can do something like this:
import re

def parseConf(filename):
    conf = []
    with open(filename, 'r') as f:
        for i in f.readlines():
            if i[0] == '#':
                continue
            matchID = re.search(r"([\w]+) {", i)
            matchAttr = re.search(r"[ ]*([\w]+)=([\w\d]*)", i)
            matchEndID = re.search(r"[ ]*}", i)
            if matchID:
                identifier = matchID.group(1)
                cur = [identifier, {}]
            elif matchAttr:
                attribute = matchAttr.group(1)
                value = matchAttr.group(2)
                cur[1][attribute] = value
            elif matchEndID:
                conf.append(cur)
    return conf
def conf2xml(filename):
    conf = parseConf(filename)
    xml = ''
    for ID in conf:
        xml += '<%s>\n' % ID[0]
        for attr in ID[1]:
            xml += '\t<attribute name="%s">%s</attribute>\n' % \
                   (attr, ID[1][attr])
        xml += '</%s>\n' % ID[0]
    return xml
Then try to do:
print conf2xml('conf.dat')
If you slightly tweak Andrea's solution, you can use that code to parse both the status.dat and the objects.cache:
import re

def parseConf(source):
    conf = []
    for line in source.splitlines():
        line = line.strip()
        matchID = re.match(r"(?:\s*define)?\s*(\w+)\s+{", line)
        matchAttr = re.match(r"\s*(\w+)(?:=|\s+)(.*)", line)
        matchEndID = re.match(r"\s*}", line)
        if len(line) == 0 or line[0] == '#':
            pass
        elif matchID:
            identifier = matchID.group(1)
            cur = [identifier, {}]
        elif matchAttr:
            attribute = matchAttr.group(1)
            value = matchAttr.group(2).strip()
            cur[1][attribute] = value
        elif matchEndID and cur:
            conf.append(cur)
            del cur
    return conf
It is a little puzzling why Nagios chose to use two different formats for these files, but once you've parsed them both into usable Python objects you can do quite a bit of magic through the external command file.
If anybody has a solution for getting this into a real XML DOM, that'd be awesome.
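For a real XML DOM, one option is to feed the parsed list into xml.etree.ElementTree instead of concatenating strings. A hedged sketch, based on the list of [identifier, attributes] pairs returned by the parseConf above and the root-level <nagios> tag suggested earlier (conf2dom is just an illustrative name):
import xml.etree.ElementTree as ET

def conf2dom(conf):
    """Build an ElementTree document with a root-level <nagios> tag."""
    root = ET.Element('nagios')
    for identifier, attributes in conf:
        section = ET.SubElement(root, identifier)
        for name, value in attributes.items():
            attr = ET.SubElement(section, 'attribute', name=name)
            attr.text = value
    return ET.ElementTree(root)

# usage: conf2dom(parseConf(open('status.dat').read())).write('status.xml')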
Over the last several months I've written and released a tool that parses the Nagios status.dat and objects.cache and builds a model that allows for some really useful manipulation of Nagios data. We use it to drive an internal operations dashboard that is a simplified 'mini' Nagios. It's under continual development and I've neglected testing and documentation, but the code isn't too crazy and I feel it's fairly easy to follow.
Let me know what you think...
https://github.com/zebpalmer/NagParser