Python: Parsing a colon delimited file with various counts of fields

I'm trying to parse a few files with the following format in 'clientname'.txt:
hostname:comp1
time: Fri Jan 28 20:00:02 GMT 2011
ip:xxx.xxx.xx.xx
fs:good:45
memory:bad:78
swap:good:34
Mail:good
Each line is delimited by a ':', but lines 0, 2, and 6 have 2 fields while lines 1 and 3-5 have 3 or more fields. (The big issue I've had trouble with is the time: line, since 20:00:02 is really a time and not 3 separate fields.)
I have several files like this that I need to parse. There are many more lines in some of these files with multiple fields.
...
for i in clients:
    if os.path.isfile(rpt_path + i + rpt_ext):  # if the rpt exists then do this
        rpt = rpt_path + i + rpt_ext
        l_count = 0
        for line in open(rpt, "r"):
            s_line = line.rstrip()
            part = s_line.split(':')
            print part
            l_count = l_count + 1
    else:  # else break
        break
First I check whether the file exists; if it does, I open the file and parse it (eventually). As of now I'm just printing the output (print part) to make sure it's parsing right.
Honestly, the only trouble I'm having at this point is the time: field. How can I treat that line differently from all the others? The time field is ALWAYS the 2nd line in all of my report files.

The split method has the signature split([sep[, maxsplit]]), and if maxsplit is given, it produces at most maxsplit+1 parts. In your case, you just have to give maxsplit as 1: split(':', 1) solves your problem.
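For instance, a quick interactive check using two lines from the question:
>>> 'time: Fri Jan 28 20:00:02 GMT 2011'.split(':', 1)
['time', ' Fri Jan 28 20:00:02 GMT 2011']
>>> 'fs:good:45'.split(':')
['fs', 'good', '45']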

If time is a special case, you could do:
[...]
s_line = line.rstrip()
if line.startswith('time:'):
    part = s_line.split(':', 1)
else:
    part = s_line.split(':')
print part
[...]
This would give you:
['hostname', 'comp1']
['time', ' Fri Jan 28 20:00:02 GMT 2011']
['ip', 'xxx.xxx.xx.xx']
['fs', 'good', '45']
['memory', 'bad', '78']
['swap', 'good', '34']
['Mail', 'good']
And doesn't rely on the position of time in the file.

Design considerations:
Robustly handle extraneous whitespace, including blank lines, and missing colons.
Extract a record_type, which is then used to decide how to parse the remainder of the line.
>>> def munched(s, n=None):
...     if n is None:
...         n = 99999999 # this kludge should not be necessary
...     return [x.strip() for x in s.split(':', n)]
...
>>> def parse_line(line):
...     if ':' not in line:
...         return [line.strip(), '']
...     record_type, remainder = munched(line, 1)
...     if record_type == 'time':
...         data = [remainder]
...     else:
...         data = munched(remainder)
...     return record_type, data
...
>>> for guff in """
... hostname:comp1
... time: Fri Jan 28 20:00:02 GMT 2011
... ip:xxx.xxx.xx.xx
... fs:good:45
...  memory : bad : 78 
... missing colon
... Mail:good""".splitlines(True):
...     print repr(guff), parse_line(guff)
...
'\n' ['', '']
'hostname:comp1\n' ('hostname', ['comp1'])
'time: Fri Jan 28 20:00:02 GMT 2011\n' ('time', ['Fri Jan 28 20:00:02 GMT 2011'])
'ip:xxx.xxx.xx.xx\n' ('ip', ['xxx.xxx.xx.xx'])
'fs:good:45\n' ('fs', ['good', '45'])
' memory : bad : 78 \n' ('memory', ['bad', '78'])
'missing colon\n' ['missing colon', '']
'Mail:good' ('Mail', ['good'])
>>>
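As an aside, the n = 99999999 kludge really isn't necessary: str.split treats a maxsplit of -1 as "no limit" (in both Python 2 and 3), so munched could simply default n to -1:
>>> def munched(s, n=-1):
...     # maxsplit=-1 means "split on every separator"
...     return [x.strip() for x in s.split(':', n)]
...
>>> munched('fs:good:45')
['fs', 'good', '45']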

If the time field is always the 2nd line, why can't you skip it and parse it separately?
Something like:
for i, line in enumerate(open(rpt, "r").read().splitlines()):
    if i == 1:  # special parsing for the time: line
        data = line[5:]
    else:
        pass  # your normal parsing logic

Related

How to parse log file by regex grouping

I'm trying to parse based on grouping; below is the input file to parse.
I can't aggregate multiple groups from my regex into the expected output, and need some recommendations for printing the data in the expected format. (Note: group 2 can contain various other strings in the actual log file.)
# Parse out the time-date stamp, e.g. Jan 20 03:25:08, capturing two groups
Example groups:
1.) Jan 20 03:25 2.) logrotate
1.) Jan 20 05:03 2.) ntpd
logfile= """Jan 20 03:25:08 fakehost logrotate: ALERT exited abnormally with [1]
Jan 20 03:25:08 fakehost run-parts(/etc/cron.daily)[20447]: finished logrotate
Jan 20 03:26:21 fakehost anacron[28969]: Job 'cron.daily' terminated
Jan 20 03:26:21 fakehost anacron[28969]: Normal exit (1 job run)
Jan 20 03:30:01 fakehost CROND[31462]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 20 03:30:01 fakehost CROND[31461]: (root) CMD (/var/system/bin/sys-cmd -F
Jan 20 05:03:03 fakehost ntpd[3705]: synchronized to time.faux.biz, stratum 2
"""
Expected output:
minute,total_count,logrotate,CROND,ntpd,anacron,run-parts
Jan 20 03:25,2,1,0,0,0,1
Jan 20 03:26,2,0,2,0,1,1
Jan 20 03:30,2,0,2,0,0,0
Jan 20 05:03,1,0,0,1,0,0
This is my code:
import re

output = {}
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile", "r+") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            if match.group(1) and match.group(2):
                print(match.groups())
                # Stuck here trying to arrange the data
                output[match.group(1)]['total_count'] += 1
                output[match.group(1)][match.group(2)] += 1
for k, v in output.items():
    print('{0} {1}'.format(k, v))
import re

output = []
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile.txt", "r+") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            if match.group(1) and match.group(2):
                dataDict = {'minute': match.group(1), 'total_count': 1}
                dataDict[match.group(2)] = 1
                lastInsertedIndex = len(output) - 1
                if len(output) > 0:  # data exists, so check whether this minute was seen before
                    if output[lastInsertedIndex]['minute'] == match.group(1):  # same minute: update existing data
                        lastInsertedIndexDict = output[lastInsertedIndex]
                        if match.group(2) in lastInsertedIndexDict:
                            lastInsertedIndexDict[match.group(2)] = lastInsertedIndexDict[match.group(2)] + 1  # updating group(2)
                        else:
                            lastInsertedIndexDict[match.group(2)] = 1
                        lastInsertedIndexDict['total_count'] = lastInsertedIndexDict['total_count'] + 1  # updating total count
                        output[lastInsertedIndex] = lastInsertedIndexDict
                    else:  # new minute, simply append
                        output.append(dataDict)
                else:  # output list is empty
                    output.append(dataDict)
for data in output:
    print(data)
Here the idea is: after we have match.groups(), create a dictionary with minute as a key and total_count as 1, then set the value 1 for the newly found match.group(2) key.
As the data is in increasing order of time, check whether the previously inserted entry is for the same minute or a different one.
If it is the same minute, increase the dictionary's total_count and match.group(2) values by 1.
If it is a different minute, simply append the dictionary to the output list.
Currently the output list prints keys and values. In case you want to print only the values, change print(data) in the last line to print(data.values()).
Just to mention, I have assumed that you are not facing any issue with the regex and that the regex you provided fulfils your requirement.
In case you face any issue with the regex, or need help with it, do let me know in a comment.
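For comparison, a minimal sketch (not the asker's code) that aggregates with collections.Counter and prints rows in the expected CSV layout; the column order is hard-coded from the expected output above, and the regex is the one from the question, so it inherits whatever that regex captures:

import re
from collections import Counter

regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+)')
per_minute = {}  # minute -> Counter of program names
minutes = []     # first-seen order of the minutes

with open("logfile.txt") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            minute, program = match.groups()
            if minute not in per_minute:
                per_minute[minute] = Counter()
                minutes.append(minute)
            per_minute[minute][program] += 1  # Counter defaults missing keys to 0

columns = ['logrotate', 'CROND', 'ntpd', 'anacron', 'run-parts']  # from the expected output
print('minute,total_count,' + ','.join(columns))
for minute in minutes:
    counts = per_minute[minute]
    row = ','.join(str(counts[c]) for c in columns)
    print('{0},{1},{2}'.format(minute, sum(counts.values()), row))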

Dig out information with Python re

I want to dig out information from the log files and wrote the script below:
import re
file = '''Date,Time,Type,User,Message
Thu Jul 18, 2019 14:18:41.945,EFM,201202 ,Robot picked
Thu Jul 18, 2019 14:18:51.486,DS ,201202 ,Module 1
Thu Jul 18, 2019 14:19:07.747,DS ,201202 ,Door opened
Thu Jul 18, 2019 14:20:08.231,EFM,203204205206,Robot picked
Thu Jul 18, 2019 14:20:08.231,DS ,203204 ,Module 2
Thu Jul 18, 2019 14:20:10.282,DS ,203204 ,Door opened
...
'''
p1 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),EFM,(\d+?\s*?),Robot picked')
p2 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),DS ,(\d+?\s*?),Module 1')
p3 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),DS ,(\d+?\s*?),Door opened')
w_file = r'D:\sample.txt'
lines = file.readlines()
t_file = open(w_file, 'w')
info = ['User', 'Time1', 'Time2', 'Time3']
t_file.write('{}\n'.format(','.join(item for item in info)))
for line in lines:
    p1_line = re.findall(p1, line.strip())
    p2_line = re.findall(p2, line.strip())
    p3_line = re.findall(p3, line.strip())
    if p1_line and p2_line and p3_line:
        if p1_line[0][1][:3] == p2_line[0][1][:3] and p1_line[0][1][:3] == p5_line[0][1][:3]:
            t_file.write('{},{},{},{}\n'.format(p1_line[0][1].strip(), p1_line[0][0], p2_line[0][0], p3_line[0][0]))
t_file.close()
When I open the sample.txt file, there is only the 'User,Time1,Time2,Time3' row. Can anyone find what's wrong with my script?
What I want is like below:
User,Time1,Time2,Time3
201202,14:18:41.945,14:18:51.486,14:19:07.747
203204205206,14:20:08.231,14:20:08.231,14:20:10.282
The issue with your script is that you are trying to match all three regular expressions against the same line and then and-ing the results together, which of course fails.
Each regular expression works, but only for specific lines, so 2 out of the 3 will return [], which evaluates to False.
For example, given:
line = 'Thu Jul 18, 2019 14:18:41.945,EFM,201202 ,Robot picked'
You will have:
p1_line = [('14:18:41.945', '201202 ')] # match
p2_line = [] # no match
p3_line = [] # no match
Once you combine these three values with and, the condition evaluates to False, and for this reason nothing is written to the file:
if p1_line and p2_line and p3_line: # this evaluates to False
So, depending on the exact logic you want to implement, you may have to store and remember past matches and build on that.
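For instance, a minimal sketch of that idea: collect the times per user as the lines stream by, then emit a row once a user's events have been seen. It works under assumptions the question leaves open: the EFM line comes first for each user, the related DS lines follow, and users are matched on their first three digits (as in the question's [:3] comparison); the file name log.txt is hypothetical:

import re

row_re = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d\.\d{3}),(\w+)\s*,(\d+)\s*,(.+)')
times = {}  # user[:3] -> (full user from the EFM line, [time1, time2, ...])
order = []  # first-seen order of the users

with open('log.txt') as f:  # hypothetical input file
    for line in f:
        m = row_re.match(line.strip())
        if not m:
            continue  # e.g. the header line
        time, typ, user, message = m.groups()
        key = user[:3]  # the question compares users on their first three digits
        if typ == 'EFM':
            times[key] = (user, [time])
            order.append(key)
        elif key in times:
            times[key][1].append(time)

print('User,Time1,Time2,Time3')
for key in order:
    full_user, ts = times[key]
    if len(ts) >= 3:
        print('{},{},{},{}'.format(full_user, ts[0], ts[1], ts[2]))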

Splitting parts of a DHCP log and storing into a variable using python

How do you split part of a simple line from DHCP Log files using python
E.g
Dec 15 09:57:17 6con-dhcp-01 dhcpd: DHCPREQUEST for 103.26.222.234 from 14:91:82:ab:4d:32 via eth1
I want to split the above line into different parts and store it in a variable
Date = Dec 15
Time = 09:57:17
IP Address = 103.26.222.234
Mac Address = 14:91:82:ab:4d:32
I already tried using .split() but to no avail.
var = 'Dec 15 09:57:17 6con-dhcp-01 dhcpd: DHCPREQUEST for 103.26.222.234 from 14:91:82:ab:4d:32 via eth1'
datas = var.split()
for data in datas:
    print(data)
For extracting data from any log file, you should first consider the text's pattern, then segment out the required data, and finally use a regex to extract the information.
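For example, a minimal regex sketch along those lines, assuming the DHCPREQUEST line layout shown in the question:

import re

line = 'Dec 15 09:57:17 6con-dhcp-01 dhcpd: DHCPREQUEST for 103.26.222.234 from 14:91:82:ab:4d:32 via eth1'
m = re.search(r'^(\w{3} \d{1,2}) '            # date, e.g. "Dec 15"
              r'(\d\d:\d\d:\d\d) '            # time
              r'.*DHCPREQUEST for ([\d.]+) '  # IP address
              r'from ([0-9a-f:]{17})', line)  # MAC address
if m:
    date, time, ip_address, mac_address = m.groups()
    print('{} / {} / {} / {}'.format(date, time, ip_address, mac_address))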
I found a way to split them and store them in variables:
import re
splitted = re.split(' ', var)
print splitted
The output will be in this format:
['Dec', '15', '09:57:27', '6con-dhcp-01', 'dhcpd:', 'DHCPREQUEST', 'for', '132.147.83.212', 'from', '44:d9:e7:41:ee:77', 'via', 'eth1\n']
With the code below I can store the separated parts in variables:
Monthdatetime = splitted[0] + ' ' + splitted[1] + ' ' + splitted[2]
print Monthdatetime
The output would be:
Dec 15 09:57:27
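Following the same indexing, the other fields the question asks for can be picked out of the split list (indices taken from the output shown above, valid only for this DHCPREQUEST line layout):
ip_address = splitted[7]   # '132.147.83.212'
mac_address = splitted[9]  # '44:d9:e7:41:ee:77'
print ip_address
print mac_address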

how to ignore a field data while comparing 2 files in python

Input files are as below, with the field schema Mode|Date|Count|timestamp|status|insertTimeStamp.
test1.txt:
HR|06/08/2016|3000|Thu Jun 09 2016|Complete|20160627020300
HR|06/08/2016|2000|Thu Jun 09 2016|Complete|20160627020400
HR|06/08/2016|1000|Thu Jun 09 2016|Complete|20160627020500
test2.txt:
HR|06/08/2016|3010|Thu Jun 09 2016|Complete|20160627070300
HR|06/08/2016|2000|Fri Jun 09 2016|Complete|20160627080300
HR|06/08/2016|1500|Thu Jun 09 2016|Complete|20160627090300
Now my requirement is to compare the lines that differ between the two files, but the comparison should ignore the insertTimeStamp field (the last column's data).
I tried the code below. It works fine, but it compares complete lines. Could someone please suggest how my code can skip the insertTimeStamp field during the comparison?
Thanks in advance for helping me.
import difflib
import sys

with open('/tmp/test1.txt', 'r') as hosts0:
    with open('/tmp/test2.txt', 'r') as hosts1:
        diff = difflib.unified_diff(
            hosts0.readlines(),
            hosts1.readlines(),
            fromfile='hosts0',
            tofile='hosts1',
            n=0,
        )
        for line in diff:
            for prefix in ('---', '+++', '@@'):
                if line.startswith(prefix):
                    break
            else:
                sys.stdout.write(line[1:])
You could potentially just slice off the last element in each line before passing them into the diff function
diff = difflib.unified_diff(
    ['|'.join(x.split('|')[:-1]) for x in hosts0.readlines()],
    ['|'.join(x.split('|')[:-1]) for x in hosts1.readlines()],
    fromfile='hosts0',
    tofile='hosts1',
    n=0,
)
Line-by-line comparison w/o using difflib:
with open('/tmp/test1.txt', 'r') as fh:
    hosts1 = fh.readlines()
with open('/tmp/test2.txt', 'r') as fh:
    hosts2 = fh.readlines()
for h1, h2 in zip(hosts1, hosts2):
    if h1.split('|')[:-1] != h2.split('|')[:-1]:
        print 'Lines are not the same!'
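One caveat with zip: it stops at the end of the shorter file, so extra trailing lines in either file go unreported. A sketch with itertools covers that case (izip_longest is named zip_longest on Python 3); the empty fill value splits to an empty field list, so any extra line shows up as a difference:
from itertools import izip_longest  # zip_longest on Python 3

for h1, h2 in izip_longest(hosts1, hosts2, fillvalue=''):
    if h1.split('|')[:-1] != h2.split('|')[:-1]:
        print 'Lines are not the same!'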

Read and select specific rows from text file regex Python

I have a large number of text files to read from in Python. Each file is structured as the following sample:
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is a multiline abstract of the paper)
blablabla
blablabla
\\
I would like to automatically extract and store (e.g., as a list) the Title, Authors, and abstract (the text between the second and third \\ - note that it starts with an indent) from each text file. Also note that the white line between Date (revised) and Title is really there (it is not a typo that I introduced).
My attempts so far have involved (I am showing the steps for a single text file, say the first file in the list):
filename = os.listdir(path)[0]
test = pd.read_csv(filename, header=None, delimiter="\t")
Which gives me:
0
0 ----------------------------------------------...
1 \\
2 Paper: some_integer
3 From: <some_email_address>
4 Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
5 Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6 Title: some_title...
7 Authors: name_1, name_2
8 Comments: 28 pages, JHEP latex
9 Report-no: DUKE-CGTP-00-01
10 \\
11 blabla...
12 blabla...
13 blabla...
14 \\
I can then select a given row (e.g., the one featuring the title) with:
test[test[0].str.contains("Title")].to_string()
But it is truncated, it is not a clean string (some attributes show up) and I find this entire pandas-based approach quite tedious actually... There must be an easier way to directly select the rows of interest from the text file using regex. At least I hope so...
You could process the file line by line:
import re

data = {}
temp_s = match = ''
with open('myfile.txt', 'r') as infile:
    for line in infile:
        if ":" in line:
            line = line.split(':')
            data[line[0]] = line[1]
        elif re.search(r'.*\w+', line):
            match = re.search(r'(\w.*)', line)
            match = match.group(1)
            temp_s += match
            while 1:
                line = infile.next()
                if re.search(r'.*\w+', line):
                    match = re.search(r'(\w.*)', line)
                    temp_s += match.group(1)
                else:
                    break
            data['abstract'] = temp_s
How about iterating over each line in the file and splitting on the first ': ' when it is present, collecting the results of the splits in a dictionary:
with open("input.txt") as f:
data = dict(line.strip().split(": ", 1) for line in f if ": " in line)
As a result, the data would contain:
{
    'Comments': '28 pages, JHEP latex',
    'Paper': 'some_integer',
    'From': '<some_email_address>',
    'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT (27kb)',
    'Title': 'some_title',
    'Date': 'Wed, 4 Apr 2001 12:08:13 GMT (27kb)',
    'Authors': 'name_1, name_2'
}
If your files really always have the same structure, you could come up with:
# -*- coding: utf-8 -*-
import re
string = """
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is the abstract of the paper)
\\
"""
rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]       # Title at the beginning of a line
    Authors:\s(?P<authors>.+)[\n\r]    # Authors: ...
    Comments:\s(?P<comments>.+)[\n\r]  # ... and so on ...
    .*[\n\r]
    (?P<abstract>.+)""",
    re.MULTILINE | re.VERBOSE)  # MULTILINE so that the caret matches at any line start,
                                # VERBOSE to allow this commented layout

for match in rx.finditer(string):
    print match.group('title'), match.group('authors'), match.group('abstract')
    # some_title name_1, name_2 blablabla (this is the abstract of the paper)
This approach takes Title as the anchor (beginning of a line) and skims the text afterwards. The named groups may not really be necessary but make the code easier to understand. The pattern [\n\r] looks for newline characters.
See a demo on regex101.com.
This pattern will get you started:
\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\
Assume 'txtfile.txt' has the format shown at the top. If using Python 2.7.x:
import re

with open('txtfile.txt', 'r') as f:
    input_string = f.read()

p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print re.findall(p, input_string)
Output:
[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n blablabla\n blablabla')]
