How would I go about parsing the following log in Python?

I need to parse a log in the following format:
===== Item 5483/14800 =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800 =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800 =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8
I only need to extract each item's title (the next line after ===== Item 5484/14800 =====) and the result.
So I need to keep only the line with the item title and the result for that title, and discard everything else.
The issue is that sometimes an item has notes (maximum 3) and sometimes the result is displayed without additional notes, which makes this tricky.
Any help would be appreciated. I'm writing the parser in Python, but I don't need the actual code, just some pointers on how I could achieve this.
Later edit: the result I'm looking for is to discard everything else and get something like:
('This is the item title','Foo')
then
('This is this items title','Bar')

1) Loop through every line in the log.
   a) If the line matches the appropriate regex:
      Display/store the next line as the item title.
      Look for the next line containing "Result XXXX." and parse out that result for inclusion in the result set.
EDIT: added a bit more now that I see the result you're looking for.
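For concreteness, here is a minimal sketch of that outline, assuming the log text is in a string named data (the regexes and variable names are my own, not from the original answer):

import re

header_re = re.compile(r"^===== Item \d+/\d+")
results = []
title = None
expect_title = False
for line in data.splitlines():
    if header_re.match(line):
        expect_title = True      # the very next line holds the title
    elif expect_title:
        title = line
        expect_title = False
    else:
        # only the "Test finished." lines contain "Result <word>."
        m = re.search(r"Result (\w+)\.", line)
        if m:
            results.append((title, m.group(1)))
# results == [('This is the item title', 'Foo'), ...]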

I know you didn't ask for real code but this is too great an opportunity for a generator-based text muncher to pass up:
# data is a multiline string containing your log, but this
# function could be easily rewritten to accept a file handle.
def get_stats(data):
    title = ""
    grab_title = False
    for line in data.split('\n'):
        if line.startswith("====="):
            grab_title = True
        elif grab_title:
            grab_title = False
            title = line
        elif line.startswith("Test finished."):
            start = line.index("Result") + 7
            end = line.index("Time") - 2
            yield (title, line[start:end])

for d in get_stats(data):
    print d
# Returns:
# ('This is the item title', 'Foo')
# ('This is this items title', 'Bar')
# ('This is the title of this item', 'FooBar')
Hopefully this is straightforward enough. Do ask if you have questions on how exactly the above works.
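As the comment above notes, accepting a file handle instead is a small change; here is a sketch of that variant (the filename log.log is just an example):

def get_stats_from_file(handle):
    title = ""
    grab_title = False
    for raw in handle:
        line = raw.rstrip('\n')
        if line.startswith("====="):
            grab_title = True
        elif grab_title:
            grab_title = False
            title = line
        elif line.startswith("Test finished."):
            start = line.index("Result") + 7
            end = line.index("Time") - 2
            yield (title, line[start:end])

with open('log.log') as f:
    for d in get_stats_from_file(f):
        print(d)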

Maybe something like (log.log is your file):
def doOutput(s): # process or store data
    print s

s = ''
for line in open('log.log').readlines():
    if line.startswith('====='):
        if len(s):
            doOutput(s)
        s = ''
    else:
        s += line
if len(s):
    doOutput(s)
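doOutput above is left as a stub; one way it might pull the (title, result) pair out of each chunk is sketched below. This body is my own guess, not part of the original answer:

import re

def doOutput(s):  # process or store data: here, extract (title, result)
    lines = s.splitlines()
    title = lines[0] if lines else ''
    m = re.search(r'Result ([^.]*)\.', s)
    if m:  # update chunks without a result line print nothing
        print (title, m.group(1))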

I would recommend starting a loop that looks for the "=====" in the line. Let that key you off to the title, which is the next line. Set a flag that looks for the result; if you don't find the result before you hit the next "=====", record no result. Otherwise, log the result with the title, reset your flag, and repeat. You could store the results with the title in a dictionary as well, just storing "No Results" whenever you don't find a result between the title and the next "=====" line.
This looks pretty simple to do based on the output.
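A hedged sketch of that approach, assuming the raw log text is in a string named log; items whose blocks never report a result end up stored as 'No Results':

results = {}
title = None
expecting_title = False
for line in log.splitlines():
    if line.startswith('====='):
        if title is not None:
            results.setdefault(title, 'No Results')  # block ended with no result
        expecting_title = True
    elif expecting_title:
        title = line
        expecting_title = False
    elif line.startswith('Test finished.') and title is not None:
        # the word between 'Result ' and the next '.'
        results[title] = line.split('Result ', 1)[1].split('.', 1)[0]
if title is not None:
    results.setdefault(title, 'No Results')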

A regular expression with group matching seems to do the job in Python:
import re
data = """===== Item 5483/14800 =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800 =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800 =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8"""
p = re.compile("^=====[^=]*=====\n(.*)$\nInfo: .*\n.*Result ([^\.]*)\.",
               re.MULTILINE)
for m in re.finditer(p, data):
    print "title:", m.group(1), "result:", m.group(2)
If you need more info about regular expressions, check the Python docs.

This is sort of a continuation of maciejka's solution (see the comments there). If the data is in the file daniels.log, then we could go through it item by item with itertools.groupby, and apply a multi-line regexp to each item. This should scale fine.
import itertools, re

p = re.compile("Result ([^.]*)\.", re.MULTILINE)
for sep, item in itertools.groupby(file('daniels.log'),
                                   lambda x: x.startswith('===== Item ')):
    if not sep:
        title = item.next().strip()
        m = p.search(''.join(item))
        if m:
            print (title, m.group(1))

You could try something like this (in C-like pseudocode, since I don't know Python):
string line=getline();
regex boundary="^===== [^=]+ =====$";
regex info="^Info: (.*)$";
regex test_data="Test ([^.]*)\. Result ([^.]*)\. Time ([^.]*)\.$";
regex stats="Stats: (.*)$";
while(!eof())
{
    // sanity check
    test line against boundary, if they don't match, throw exception
    string title=getline();
    while(1)
    {
        // end the loop if we finished the data
        if(eof()) break;
        line=getline();
        test line against boundary, if they match, break
        test line against info, if they match, load the first matched group into "info"
        test line against test_data, if they match, load the first matched group into "test_result", load the 2nd matched group into "result", load the 3rd matched group into "time"
        test line against stats, if they match, load the first matched group into "statistics"
    }
    // at this point you can use the variables set above to do whatever with a block
    // for example, you want to use title and, if set, test_result/result/time.
}
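Since the question is about Python, here is a rough Python rendering of the pseudocode above; for brevity it only keeps the title and the result group, and the names parse/boundary/test_data come from the pseudocode, not from any real API:

import re

boundary = re.compile(r'^===== [^=]+ =====$')
test_data = re.compile(r'Test ([^.]*)\. Result ([^.]*)\. Time ([^.]*)\.$')

def parse(handle):
    lines = [l.rstrip('\n') for l in handle]
    i = 0
    while i < len(lines):
        # sanity check: every block must start at a boundary line
        if not boundary.match(lines[i]):
            raise ValueError('expected a boundary line, got: %r' % lines[i])
        title = lines[i + 1] if i + 1 < len(lines) else ''
        result = None
        i += 2
        while i < len(lines) and not boundary.match(lines[i]):
            m = test_data.search(lines[i])
            if m:
                result = m.group(2)
            i += 1
        yield (title, result)  # result stays None for update-only blocks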

Parsing is usually not done with regexes. If you have reasonably well-structured text (which it looks like you do), you can use faster tests (e.g. line.startswith() or similar).
A list of dictionaries seems like a suitable data type for such key-value pairs. Not sure what else to tell you; this seems pretty trivial.
OK, so the regexp way proved to be more suitable in this case:
import re
re.findall("=\n(.*)\n", s)
is faster than the list comprehension
[item.split('\n', 1)[0] for item in s.split('=\n')]
Here's what I got:
>>> len(s)
337000000
>>> test(get1, s) #list comprehensions
0:00:04.923529
>>> test(get2, s) #re.findall()
0:00:02.737103
Lesson learned.
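The get1/get2/test helpers aren't shown above; judging by the timedelta-style output, they presumably looked something like this (a reconstruction, not the original benchmark code):

import re
from datetime import datetime

def get1(s):  # list comprehension version
    return [item.split('\n', 1)[0] for item in s.split('=\n')]

def get2(s):  # re.findall() version
    return re.findall("=\n(.*)\n", s)

def test(fn, s):
    start = datetime.now()
    fn(s)
    print(datetime.now() - start)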

Here's some not-so-good-looking Perl code that does the job. Perhaps you can find it useful in some way. It's a quick hack; there are other ways of doing it (I feel that this code needs defending).
#!/usr/bin/perl -w
#
# $Id$
#
use strict;
use warnings;

my @ITEMS;
my $item;
my $state = 0;

open(FD, "< data.txt") or die "Failed to open file.";
while (my $line = <FD>) {
    $line =~ s/(\r|\n)//g;
    if ($line =~ /^===== Item (\d+)\/\d+/) {
        my $item_number = $1;
        if ($item) {
            # Just to make sure we don't have two lines that seem to be a headline in a row.
            # If we have an item but haven't set the title it means that there are two in a row that match.
            die "Something seems to be wrong, better safe than sorry. Line $. : $line\n" if (not $item->{title});
            # If we have a new item number, push the previous item and create a new one.
            if ($item_number != $item->{item_number}) {
                push(@ITEMS, $item);
                $item = {};
                $item->{item_number} = $item_number;
            }
        } else {
            # First entry, don't have an item.
            $item = {}; # Create new item.
            $item->{item_number} = $item_number;
        }
        $state = 1;
    } elsif ($state == 1) {
        die "Data must start with a headline." if (not $item);
        # If we already have a title make sure it matches.
        if ($item->{title}) {
            if ($item->{title} ne $line) {
                die "Title doesn't match for item " . $item->{item_number} . ", line $. : $line\n";
            }
        } else {
            $item->{title} = $line;
        }
        $state++;
    } elsif (($state == 2) && ($line =~ /^Info:/)) {
        # Just make sure that for state 2 we have a line that matches Info.
        $state++;
    } elsif (($state == 3) && ($line =~ /^Test finished\. Result ([^.]+)\. Time \d+ secunds{0,1}\.$/)) {
        $item->{status} = $1;
        $state++;
    } elsif (($state == 4) && ($line =~ /^Stats:/)) {
        $state++; # After Stats we must have a new item or we should fail.
    } else {
        die "Invalid data, line $.: $line\n";
    }
}
# Need to take care of the last item too.
push(@ITEMS, $item) if ($item);
close FD;

# Loop over our items and print the info we stored.
for $item (@ITEMS) {
    print $item->{item_number} . " (" . $item->{status} . ") " . $item->{title} . "\n";
}

Related

How can I extract only the initial description of a javadoc comment and ignore the javadoc tags using python?

I am trying to extract the text in a Javadoc that comes before the Javadoc tags, in Python. So far I have been able to avoid the parameter tag, but there are other Javadoc tags that could be mentioned as well. Is there a better way to do this?
parameterTag = "@param"
if (parameterTag in comments):
    splitComments = subsentence.split(my_string[my_string.find(start) + 1: my_string.find(parameterTag)])
Input:
/**
 * Checks if the given node is inside the graph and
 * throws exception if the given node is null
 * @param a single node to be check
 * @return true if given node is contained in graph,
 * return false otherwise
 * @requires given node != null
 */
public boolean containsNode(E node){
    if(node==null){
        throw new IllegalArgumentException();
    }
    if(graph.containsKey(node)){
        return true;
    }
    return false;
}
Output:
/**
 * Checks if the given node is inside the graph and
 * throws exception if the given node is null
 */
public boolean containsNode(E node){
    if(node==null){
        throw new IllegalArgumentException();
    }
    if(graph.containsKey(node)){
        return true;
    }
    return false;
}
Following your logic, there is a description part followed by a "tags" part, and then a closing comment mark.
If you check line by line for an occurrence of a tag keyword, you won't be able to deal with this line:
 * return false otherwise
Hence, you need to detect whether you have entered into or exited from a tags part. Below is a working example:
import re

# Java source loading
JSF = open("your_js_file.js", 'r')
js_content = JSF.read()

# We split the source into lines, and then we'll loop through the array
lineiterator = iter(js_content.splitlines())

# This boolean marks whether or not we are in a "tags" part
in_tags = False

for line in lineiterator:
    # If we matched a line with a word starting with "@",
    # then we entered a "tags" section
    if re.search(r"@\w+", line) is not None:
        in_tags = True
    # If we matched a closing comment mark, then we left
    # the "tags" section (or never entered into it)
    if re.search(r"\*/", line) is not None:
        in_tags = False
    if not in_tags:
        print(line)

Finding a pattern multiple times between start and end patterns python regex

I am trying to find a certain pattern between start and end patterns across multiple lines. Here is what I mean:
I read a file and saved it in the variable File; this is what the original file looks like:
File:
...
...
...
Keyword some_header_file {
XYZ g1234567S7894561_some_other_trash_underscored_text;
XYZ g1122334S9315919_different_other_trash_underscored_text;
}
...
...
...
I am trying to grab the 1234567 between the g and S, and also the 1122334. The some_header_file block can be any number of lines, but it always ends with }.
So I am trying to grab exactly 7 digits between the g and the S for all the lines from the "Keyword" till the "}" for that specific header.
This is what I used:
FirstSevenDigitPart = str(re.findall(r"Keyword\s%s.*\n.*XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}.*\}"%variable , str(File) , flags=re.MULTILINE))
but unfortunately it does not return anything, just a blank [].
What am I doing wrong? How can I accomplish this?
Thanks in advance.
You may read your file into a contents variable and use
import re
contents = "...\n...\n...\nKeyword some_header_file {\n XYZ \ng1234567S7894561_some_other_trash_underscored_text;\n XYZ \n1122334S9315919_different_other_trash_underscored_text;\n}\n...\n...\n..."
results = []
variable = 'some_header_file'
block_rx = r'Keyword\s+{}\s*{{([^{{}}]*)}}'.format(re.escape(variable))
value_rx = r'XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}'
for block in re.findall(block_rx, contents):
    results.extend(re.findall(value_rx, block))
print(results)
# => ['1234567', '1122334']
The first regex (block_rx) will look like Keyword\s+some_header_file\s*{([^{}]*)} and will match all the blocks you need to search for values in. The second regex, XYZ\s[gd]([0-9]{7})[A-Z][0-9]{7}, matches what you need and returns the list of captures.
I think the simplest way here is to use two expressions and run them in two steps. Here is a little example; of course, you should adapt it to your needs:
import re
text = """Keyword some_header_file {
XYZ g1234567S7894561_some_other_trash_underscored_text;
XYZ g1122334S9315919_different_other_trash_underscored_text;
}"""
all_lines_pattern = 'Keyword\s*%s\s*\{\n(?P<all_lines>(.|\s)*)\}'
first_match = re.match(all_lines_pattern % 'some_header_file', text)
if first_match is None:
    # some break logic here
    pass

found_lines = first_match.group(1)
print(found_lines)  # '    XYZ g1234567S7894561_some_other_trash_underscored_text;\n    XYZ g1122334S9315919_different_other_trash_underscored_text;\n'

sub_pattern = '(XYZ\s*[gd](?P<your_pattern>[0-9]{7})[A-Z]).*;'
found_groups = re.findall(sub_pattern, found_lines)
print(found_groups)  # [('XYZ g1234567S', '1234567'), ('XYZ g1122334S', '1122334')]

How to extract a block of lines from given file using python

I have a file like this
grouping data-rate-parameters {
  description
    "Data rate configuration parameters.";
  reference
    "ITU-T G.997.2 clause 7.2.1.";
  leaf maximum-net-data-rate {
    type bbf-yang:data-rate32;
    default "4294967295";
    description
      "Defines the value of the maximum net data rate (see clause
       11.4.2.2/G.9701).";
    reference
      "ITU-T G.997.2 clause 7.2.1.1 (MAXNDR).";
  }
  leaf psd-level {
    type psd-level;
    description
      "The PSD level of the referenced sub-carrier.";
  }
}
}
grouping line-spectrum-profile {
  description
    "Defines the parameters contained in a line spectrum
     profile.";
  leaf profiles {
    type union {
      type enumeration {
        enum "all" {
          description
            "Used to indicate that all profiles are allowed.";
        }
      }
      type profiles;
    }
Here I want to extract every leaf block. For example, the leaf maximum-net-data-rate block is:
leaf maximum-net-data-rate {
  type bbf-yang:data-rate32;
  default "4294967295";
  description
    "Defines the value of the maximum net data rate (see clause
     11.4.2.2/G.9701).";
  reference
    "ITU-T G.997.2 clause 7.2.1.1 (MAXNDR).";
}
This is how I want to extract each leaf block.
I tried this code, which reads the block by counting braces ('{'):
with open(r'file.txt','r') as f:
    leaf_part = []
    count = 0
    c = 'psd-level'
    for line in f:
        if 'leaf %s {'%c in line:
            cur_line=line
            for line in f:
                pre_line=cur_line
                cur_line=line
                if '{' in pre_line:
                    leaf_part.append(pre_line)
                    count+=1
                elif '}' in pre_line:
                    leaf_part.append(pre_line)
                    count-=1
                elif count==0:
                    break
                else:
                    leaf_part.append(pre_line)
It worked for leaf maximum-net-data-rate, but it's not working for leaf psd-level:
for leaf psd-level it also includes lines from outside the block.
Help me to achieve this task.
Your break condition just needs a simple edit: because of the multiple closing brackets '}', your count has already gone negative, so change that line to
elif count<=0:
    break
But it still appends extra braces to your list, so you can handle that by keeping a record of the opening brackets. I changed the code as below:
with open(r'file.txt','r') as f:
    leaf_part = []
    braces_record = []
    count = 0
    c = 'psd-level'
    for line in f:
        if 'leaf %s {'%c in line:
            braces_record.append('{')
            cur_line=line
            for line in f:
                pre_line=cur_line
                cur_line=line
                if '{' in pre_line:
                    braces_record.append('{')
                    leaf_part.append(pre_line)
                    count+=1
                elif '}' in pre_line:
                    try:
                        braces_record.pop()
                        if len(braces_record)>0:
                            leaf_part.append(pre_line)
                    except:
                        pass
                    count-=1
                elif count<=0:
                    break
                elif '}' not in pre_line:
                    leaf_part.append(pre_line)
Result of above code:
leaf psd-level {
  type psd-level;
  description
    "The PSD level of the referenced sub-carrier.";
}
You can use regex:
import re
reg = re.compile(r"leaf.+?\{.+?\}", re.DOTALL)
reg.findall(file)
It returns a list of all matched blocks.
If you want to search for a specific leaf name, you can use format (remember to double the curly brackets):
leafname = "maximum-net-data-rate"
reg = re.compile(r"leaf\s{0}.+?\{{.+?\}}".format(leafname), re.DOTALL)
EDIT: for Python 2.7:
reg = re.compile(r"leaf\s%s.+?\{.+?\}" % leafname, re.DOTALL)
EDIT2: I totally missed that you have nested brackets in your last example.
This solution will be much more involved than a simple regex, so you might consider another approach. Still, it is possible to do.
First, you will need to install the regex module, since the built-in re does not support recursive patterns:
pip install regex
Second, here is your pattern:
import regex
reg = regex.compile(r"(leaf.*?)({(?>[^\{\}]|(?2))*})", regex.DOTALL)
reg.findall(file)
Now, this pattern will return a list of tuples, so you may want to do something like this
res = [el[0]+el[1] for el in reg.findall(file)]
This should give you the list of full results.

Parsing named nested expressions with pyparsing

I'm trying to parse some data using pyparsing that looks (more or less) like this:
User.Name = Dave
User.Age = 42
Date = 2015/01/01
Begin Component List
Begin Component 2
1 some data = a value
2 another key = 999
End Component 2
Begin Another Component
a.key = 42
End Another Component
End Component List
Begin MoreData
Another = KeyPair
End MoreData
I've found some similar examples, but I've not done very well for myself.
parsing file with curley brakets
Parse line data until keyword with pyparsing
Here's what I have so far, but I keep hitting an error similar to: pyparsing.ParseException: Expected "End" (at char 26), (line:5, col:1)
from pyparsing import *

data = '''Begin A
hello
world
End A
'''

opener = Literal('Begin') + Word(alphas)
closer = Literal('End') + Word(alphas)
content = Combine(OneOrMore(~opener
                            + ~closer
                            + CharsNotIn('\n', exact=1)))
expr = nestedExpr(opener=opener, closer=closer, content=content)
parser = expr
res = parser.parseString(data)
print(res)
It's important that the words after "Begin" are captured, as these are the names of the dictionaries, as well as the key-value pairs. Where there is a number after the opener, e.g. "Begin Component 2", the "2" is the number of pairs, which I don't need (presumably this is used by the original software?). Similarly, I don't need the numbers in the list (the "1" and "2").
Is nestedExpr the correct approach to this?
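Not an answer on the pyparsing side, but as a cross-check, the Begin/End nesting is also easy to walk by hand. A minimal sketch, assuming one key = value pair per line; it keeps the numeric counts in block names and keys rather than stripping them, and parse_lines and config.txt are invented names:

def parse_lines(lines, i=0):
    result = {}
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith('End'):
            # note: the name on the End line isn't verified against the opener
            return result, i + 1
        if line.startswith('Begin'):
            name = line[len('Begin'):].strip()
            child, i = parse_lines(lines, i + 1)
            result[name] = child
        else:
            key, _, value = line.partition('=')
            result[key.strip()] = value.strip()
            i += 1
    return result, i

text = open('config.txt').read()
tree, _ = parse_lines([l for l in text.splitlines() if l.strip()])
# tree['Component List']['Component 2']['2 another key'] == '999'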

How can I elegantly combine/concat files by section with python?

Like many an unfortunate programmer soul before me, I am currently dealing with an archaic file format that refuses to die. I'm talking ~1970 format-specification archaic. If it were solely up to me, we would throw out both the file format and any tool that ever knew how to handle it, and start from scratch. I can dream, but unfortunately that won't resolve my issue.
The format: pretty loosely defined, as years of nonsensical revisions have destroyed almost all the backward compatibility it once had. Basically, the only constant is that there are section headings, with few rules about what comes before or after these lines. The headings are sequential (e.g. HEADING1, HEADING2, HEADING3, ...) but not numbered, and they are not required (e.g. HEADING1, HEADING3, HEADING7). Thankfully, all possible heading permutations are known. Here's a fake example:
# Bunch of comments
SHOES # First heading
# bunch text and numbers here
HATS # Second heading
# bunch of text here
SUNGLASSES # Third heading
...
My problem: I need to concatenate multiple of these files by these section headings. I have a perl script that does this quite nicely:
while(my $l=<>) {
    if($l=~/^SHOES/i) { $r=\$shoes; name($r);}
    elsif($l=~/^HATS/i) { $r=\$hats; name($r);}
    elsif($l=~/^SUNGLASSES/i) { $r=\$sung; name($r);}
    elsif($l=~/^DRESS/i || $l=~/^SKIRT/i ) { $r=\$dress; name($r);}
    ...
    ...
    elsif($l=~/^END/i) { $r=\$end; name($r);}
    else {
        $$r .= $l;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
As you can see, with the Perl script I basically just change where a reference points when I get to a certain pattern match, and concatenate each line of the file onto its respective string until I get to the next pattern match. These are then printed out later as one big concatenated file.
I would and could stick with Perl, but my needs are becoming more complex every day, and I would really like to see how this problem can be solved elegantly with Python (can it?). As of right now, my method in Python is basically to load the entire file as a string, search for the heading locations, then split up the string based on the heading indices and concatenate the strings. This requires a lot of regexes, if-statements and variables for something that seems so simple in another language.
It seems that this really boils down to a fundamental language issue. I found a very nice SO discussion about python's "call-by-object" style as compared with that of other languages that are call-by-reference.
How do I pass a variable by reference?
Yet, I still can't think of an elegant way to do this in python. If anyone can help kick my brain in the right direction, it would be greatly appreciated.
That's not even elegant Perl.
my @headers = qw( shoes hats sunglasses dress );
my $header_pat = join "|", map quotemeta, @headers;
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if (/($header_re)/) { name( $section = \$sections{$1} ); }
    elsif (/skirt/i) { name( $section = \$sections{'dress'} ); }
    else { $$section .= $_; }
    print STDERR "Finished processing $ARGV\n" if eof;
}
Or if you have many exceptions:
my @headers = qw( shoes hats sunglasses dress );
my %aliases = ( 'skirt' => 'dress' );
my $header_pat = join "|", map quotemeta, @headers, keys(%aliases);
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if (/($header_re)/) {
        name( $section = \$sections{ $aliases{$1} // $1 } );
    } else {
        $$section .= $_;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
Using a hash saves the countless my declarations you didn't show.
You could also do $header_name = $1; name(\$sections{$header_name}); and $sections{$header_name} .= $_ for a bit more readability.
I'm not sure if I understand your whole problem, but this seems to do everything you need:
import sys

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = [[] for header in headers]
for arg in sys.argv[1:]:
    section_index = 0
    with open(arg) as f:
        for line in f:
            # guard the lookahead so the last section doesn't run off the list
            if (section_index + 1 < len(headers)
                    and line.startswith(headers[section_index + 1])):
                section_index = section_index + 1
            else:
                sections[section_index].append(line)
Obviously you could change this to read or mmap the whole file, then re.search or just buf.find for the next header. Something like this (untested pseudocode):
import sys
from collections import defaultdict

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = defaultdict(list)
for arg in sys.argv[1:]:
    with open(arg) as f:
        buf = f.read()
    section = None
    start = 0
    for header in headers[1:]:
        idx = buf.find('\n'+header, start)
        if idx != -1:
            sections[section].append(buf[start:idx])
            section = header
            start = buf.find('\n', idx+1)
            if start == -1:
                break
    else:
        sections[section].append(buf[start:])
And there are plenty of other alternatives, too.
But the point is, I can't see anywhere where you'd need to pass a variable by reference in any of those solutions, so I'm not sure where you're stumbling on whichever one you've chosen.
So, what if you want to treat two different headings as the same section?
Easy: create a dict mapping headers to sections. For example, for the second version:
headers_to_sections = {None: None, 'SHOES': 'SHOES', 'HATS': 'HATS',
                       'DRESSES': 'DRESSES', 'SKIRTS': 'DRESSES'}
Now, in the code that does sections[section], just do sections[headers_to_sections[section]].
For the first version, just make this a mapping from strings to indices instead of strings to strings, or replace sections with a dict. Or just flatten the two collections by using a collections.OrderedDict.
My deepest sympathies!
Here's some code (please excuse minor syntax errors)
def foundSectionHeader(l, secHdrs):
    # return the matching header, or None if this line isn't a header
    for s in secHdrs:
        if s in l:
            return s
    return None

def main():
    fileList = ['file1.txt', 'file2.txt', ...]
    sectionHeaders = ['SHOES', 'HATS', ...]
    sectionContents = dict()
    for section in sectionHeaders:
        sectionContents[section] = []
    for file in fileList:
        fp = open(file)
        lines = fp.readlines()
        idx = 0
        while idx < len(lines):
            sec = foundSectionHeader(lines[idx], sectionHeaders)
            if sec:
                idx += 1
                while idx < len(lines) and not foundSectionHeader(lines[idx], sectionHeaders):
                    sectionContents[sec].append(lines[idx])
                    idx += 1
            else:
                idx += 1
This assumes that you don't have content lines which look like "SHOES"/"HATS" etc.
Assuming you're reading from stdin, as in the perl script, this should do it:
import sys
import collections

headings = {'SHOES':'SHOES','HATS':'HATS','DRESS':'DRESS','SKIRT':'DRESS'} # etc...
sections = collections.defaultdict(str)

key = None
for line in sys.stdin:
    sline = line.strip()
    if sline not in headings:
        # concatenate onto the current section's string
        sections[headings.get(key)] += line
    else:
        key = sline
You'll end up with a dictionary like this:
{
    None: <all lines as a single string before any heading>,
    'HATS': <all lines as a single string below the HATS heading and before the next heading>,
    etc...
}
The headings dict does not have to be defined in the same order as the headings appear in the input.
