How to extract a block of lines from a given file using Python

I have a file like this
grouping data-rate-parameters {
description
"Data rate configuration parameters.";
reference
"ITU-T G.997.2 clause 7.2.1.";
leaf maximum-net-data-rate {
type bbf-yang:data-rate32;
default "4294967295";
description
"Defines the value of the maximum net data rate (see clause
11.4.2.2/G.9701).";
reference
"ITU-T G.997.2 clause 7.2.1.1 (MAXNDR).";
}
leaf psd-level {
type psd-level;
description
"The PSD level of the referenced sub-carrier.";
}
}
}
grouping line-spectrum-profile {
description
"Defines the parameters contained in a line spectrum
profile.";
leaf profiles {
type union {
type enumeration {
enum "all" {
description
"Used to indicate that all profiles are allowed.";
}
}
type profiles;
}
Here I want to extract every leaf block. For example, the leaf maximum-net-data-rate block is:
leaf maximum-net-data-rate {
type bbf-yang:data-rate32;
default "4294967295";
description
"Defines the value of the maximum net data rate (see clause
11.4.2.2/G.9701).";
reference
"ITU-T G.997.2 clause 7.2.1.1 (MAXNDR).";
}
That is the kind of block I want to extract.
I tried the code below, which counts braces ('{' and '}') to decide where the block ends:
with open(r'file.txt','r') as f:
    leaf_part = []
    count = 0
    c = 'psd-level'
    for line in f:
        if 'leaf %s {'%c in line:
            cur_line=line
            for line in f:
                pre_line=cur_line
                cur_line=line
                if '{' in pre_line:
                    leaf_part.append(pre_line)
                    count+=1
                elif '}' in pre_line:
                    leaf_part.append(pre_line)
                    count-=1
                elif count==0:
                    break
                else:
                    leaf_part.append(pre_line)
It works for leaf maximum-net-data-rate, but not for leaf psd-level.
For leaf psd-level, it also returns lines from outside the block.
How can I fix this?

It just needs a small edit to your break condition. Because of the multiple closing braces '}', your count has already gone negative, so change that line to:
elif count<=0:
    break
However, it still appends the extra closing braces to your list. You can handle that by keeping a record of the opening braces, so I changed the code as below:
with open(r'file.txt','r') as f:
    leaf_part = []
    braces_record = []
    count = 0
    c = 'psd-level'
    for line in f:
        if 'leaf %s {'%c in line:
            braces_record.append('{')
            cur_line=line
            for line in f:
                pre_line=cur_line
                cur_line=line
                if '{' in pre_line:
                    braces_record.append('{')
                    leaf_part.append(pre_line)
                    count+=1
                elif '}' in pre_line:
                    try:
                        braces_record.pop()
                        if len(braces_record)>0:
                            leaf_part.append(pre_line)
                    except:
                        pass
                    count-=1
                elif count<=0:
                    break
                elif '}' not in pre_line:
                    leaf_part.append(pre_line)
Result of above code:
leaf psd-level {
type psd-level;
description
"The PSD level of the referenced sub-carrier.";
}
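The brace-counting idea can also be sketched as a small reusable function. This is only a sketch, not the code from the question or the answer: the name extract_leaf_blocks and its parameters are made up for illustration, and it assumes that every leaf header sits on its own line ending in '{' and that no braces occur inside quoted strings.

```python
def extract_leaf_blocks(lines, name=None):
    """Return each 'leaf <name> { ... }' block as a list of lines."""
    blocks = []
    current = None  # lines of the block being collected, if any
    depth = 0       # number of braces currently open inside that block
    for line in lines:
        if current is None:
            # Not inside a block: look for a leaf header line.
            parts = line.split()
            if len(parts) >= 3 and parts[0] == "leaf" and parts[-1] == "{":
                if name is None or parts[1] == name:
                    current, depth = [line], 1
            continue
        current.append(line)
        depth += line.count("{") - line.count("}")
        if depth == 0:  # the brace that opened the leaf has been closed
            blocks.append(current)
            current = None
    return blocks


# Hypothetical sample input, shortened from the file in the question.
sample = """grouping g {
leaf maximum-net-data-rate {
type bbf-yang:data-rate32;
}
leaf psd-level {
type psd-level;
description
"The PSD level of the referenced sub-carrier.";
}
}
""".splitlines()

for block in extract_leaf_blocks(sample, "psd-level"):
    print("\n".join(block))
```

Calling it without a name returns every top-level leaf block; a leaf nested inside another leaf would come back as part of its parent rather than as a separate block.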

You can use a regex:
import re
reg = re.compile(r"leaf.+?\{.+?\}", re.DOTALL)
reg.findall(file)  # file holds the file's contents as a string
It returns a list of all matched blocks.
If you want to search for a specific leaf name, you can use format (remember to double the curly brackets):
leafname = "maximum-net-data-rate"
reg = re.compile(r"leaf\s{0}.+?\{{.+?\}}".format(leafname), re.DOTALL)
EDIT: for Python 2.7
reg = re.compile(r"leaf\s%s.+?\{.+?\}" % leafname, re.DOTALL)
EDIT2: I totally missed that you have nested brackets in your last example.
Handling them is much more involved than a simple regex, so you might consider another approach. Still, it is possible.
First, you will need to install the regex module, since the built-in re does not support recursive patterns:
pip install regex
Second, here is the pattern:
import regex
reg = regex.compile(r"(leaf.*?)({(?>[^\{\}]|(?2))*})", regex.DOTALL)
reg.findall(file)
This pattern returns a list of tuples, so you may want to do something like this:
res = [el[0]+el[1] for el in reg.findall(file)]
This should give you the list of full results.
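To see why the simple non-recursive pattern falls short on nested braces, here is a small check using only the built-in re module. The sample text is invented for illustration and is not the asker's file.

```python
import re

# Sample text with one level of nesting inside the leaf block.
text = "leaf profiles {\ntype union {\ntype profiles;\n}\n}\n"

simple = re.compile(r"leaf.+?\{.+?\}", re.DOTALL)
match = simple.search(text).group(0)

print(match)
# The match contains two opening braces but only one closing brace,
# so the outer block is cut short at the first '}'.
print(match.count("{"), match.count("}"))
```

The non-greedy `.+?\}` stops at the first closing brace it finds, which is why a brace counter or a recursive pattern is needed for nested blocks.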

Related

How can I extract only the initial description of a javadoc comment and ignore the javadoc tags using python?

I am trying to extract the text in a Javadoc comment that comes before the Javadoc tags, using Python. So far I have been able to cut at the parameter tag, but other Javadoc tags may also appear. Is there a better way to do this?
parameterTag = "@param"
if (parameterTag in comments):
    splitComments = subsentence.split(my_string[my_string.find(start) + 1: my_string.find(parameterTag)])
Input:
/**
 * Checks if the given node is inside the graph and
 * throws exception if the given node is null
 * @param a single node to be check
 * @return true if given node is contained in graph,
 * return false otherwise
 * @requires given node != null
 */
public boolean containsNode(E node){
    if(node==null){
        throw new IllegalArgumentException();
    }
    if(graph.containsKey(node)){
        return true;
    }
    return false;
}
Output:
/**
 * Checks if the given node is inside the graph and
 * throws exception if the given node is null
 */
public boolean containsNode(E node){
    if(node==null){
        throw new IllegalArgumentException();
    }
    if(graph.containsKey(node)){
        return true;
    }
    return false;
}

Following your logic, there is a description part followed by a "tags" part, and then a closing comment mark.
If you only check each line for a tag keyword, you won't be able to deal with this line:
 * return false otherwise
Hence, you need to detect whether you have entered or exited a tags part. Below is a working example:
import re
# Load the source file
with open("your_js_file.js", 'r') as JSF:
    js_content = JSF.read()
# We split the source into lines, and then we loop through them
lineiterator = iter(js_content.splitlines())
# This boolean marks whether or not we are in a "tags" part
in_tags = False
for line in lineiterator:
    # If we matched a line with a word starting with "@",
    # then we entered a "tags" section
    if re.search(r"@\w+", line) is not None:
        in_tags = True
    # If we matched a closing comment mark, then we left
    # the "tags" section (or never entered it)
    if re.search(r"\*/", line) is not None:
        in_tags = False
    if not in_tags:
        print(line)
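The same state-tracking idea can be wrapped in a function that works on a string. The sketch below is a variation rather than the answerer's code: the name strip_javadoc_tags is hypothetical, and it adds one assumption, namely that a tag only counts when it starts a line inside a /** ... */ comment, so that annotations such as @Override in the code itself are left alone.

```python
import re

def strip_javadoc_tags(source):
    """Drop Javadoc tag sections, keeping descriptions and code."""
    kept = []
    in_comment = False  # inside a /** ... */ block
    in_tags = False     # inside the tags part of that block
    for line in source.splitlines():
        if "/**" in line:
            in_comment = True
        # A tag line looks like " * @param ..." inside a comment.
        if in_comment and re.match(r"\s*\*\s*@\w+", line):
            in_tags = True
        # The closing mark ends both the comment and its tags part.
        if "*/" in line:
            in_comment = False
            in_tags = False
        if not in_tags:
            kept.append(line)
    return "\n".join(kept)


example = """/**
 * Checks node.
 * @param a node
 * more text
 */
@Override
public void f() {}"""

print(strip_javadoc_tags(example))
```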

In Python, how to match a string to a dictionary item (like 'Bra*')

I'm a complete novice at Python, so please excuse me if this is a basic question.
A dictionary is built from a text file and used as a pass/block filter.
The text file contains addresses, each marked either block or allow, like "002029568,allow" or "0011*,allow" (without the quotes).
The search input is a string with a complete code like "001180000".
How can I check whether the search item is in the dictionary and make it match the "0011*,allow" line?
Thank you very much for your effort!
The filter-dictionary is made with:
def loadFilterDict(filename):
    global filterDict
    try:
        with open(filename, "r") as text_file:
            lines = text_file.readlines()
            for s in lines:
                fields = s.split(',')
                if len(fields) == 2:
                    filterDict[fields[0]] = fields[1].strip()
            text_file.close()
    except:
        pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
    if filterDict[ccode] in ['block']:
        continue
else:
    if filterstat in ['block']:
        continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow

If you can use re, you don't have to worry about the wildcard but let re.match do the hard work for you:
# Rules input (this could also be read from file)
lines = """002029568,allow
0011*,allow
001180001,block
"""
# Parse rules from string
rules = []
for line in lines.split("\n"):
    line = line.strip()
    if not line:
        continue
    identifier, ruling = line.split(",")
    rules += [(identifier, ruling)]
# Get rulings for specific number
def rule(number):
    from re import match
    rulings = []
    for identifier, ruling in rules:
        # Replace wildcard with regex .*
        identifier = identifier.replace("*", ".*")
        if match(identifier, number):
            rulings += [ruling]
    return rulings
print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings. Their order is the same order as they appear in your config lines. So you could easily just pick the last or first ruling whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
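As an alternative to translating the wildcard into a regex by hand, the standard library's fnmatch module understands shell-style patterns such as 0011* directly and matches against the whole string. The sketch below assumes the same comma-separated rule format; load_rules and rulings_for are hypothetical helper names. Note that fnmatch also treats ? and [ ] as special characters, which is harmless for purely numeric codes.

```python
from fnmatch import fnmatchcase

def load_rules(text):
    """Parse 'pattern,ruling' lines into a list of tuples."""
    rules = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2:
            rules.append((fields[0], fields[1]))
    return rules

def rulings_for(number, rules):
    """Return the rulings of every rule whose pattern matches number."""
    return [ruling for pattern, ruling in rules
            if fnmatchcase(number, pattern)]


rules = load_rules("002029568,allow\n0011*, allow\n001180001,block\n")
print(rulings_for("001180000", rules))
print(rulings_for("001180001", rules))
```

Because the whole string must match, a longer code that merely begins with an exact entry (for example 0020295689) is not matched by 002029568.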

If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
if prefix in d:
    # there is a "match", then you can deal with all the entries that match,
    # in this case the items in the inner dictionary
    # {'001180000': 'value', '001180001': 'value'}
    print('match')
else:
    print('no match')
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (which defeats the point of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
for k, v in d.items():
    if k.startswith(prefix):
        # found matching key-value pair
        print(k, v)
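If the lookup goes the other way round (a complete code is checked against stored wildcard prefixes, as in the question), dictionary lookups can still be used for prefixes of any length by trying every prefix of the code, longest first. This is a sketch under that assumption; lookup, exact and prefixes are hypothetical names, and letting exact entries win over wildcard entries is a design choice, not something the question specifies.

```python
def lookup(code, exact, prefixes):
    """Return the ruling for code, or None if nothing matches."""
    # Exact entries take priority over wildcard entries.
    if code in exact:
        return exact[code]
    # Try the longest prefix first, down to the empty prefix ("*").
    for end in range(len(code), -1, -1):
        ruling = prefixes.get(code[:end])
        if ruling is not None:
            return ruling
    return None


exact = {"002029568": "allow", "001180001": "block"}
prefixes = {"0011": "allow"}  # built from the line "0011*,allow"

print(lookup("001180000", exact, prefixes))
print(lookup("001180001", exact, prefixes))
print(lookup("009", exact, prefixes))
```

Each lookup costs at most one dictionary access per character of the code, regardless of how many rules are stored.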

Regex Python find everything between four characters

I have a string that holds data. And I want everything in between ({ and })
"({Simple Data})"
Should return "Simple Data"

Or with a regex:
import re
s = '({Simple Data})'
print(re.search(r'\(\{([^})]+)', s).group(1))
Output:
Simple Data

You could try the following:
^\({(.*)}\)$
Group 1 will contain Simple Data.
See an example on regexr.

If the brackets are always positioned at the beginning and the end of the string, then you can do this:
l = "({Simple Data})"
print(l[2:-2])
Which results in:
Simple Data
In Python you can slice a string with the [] operator. Here the slice starts at the third character (index 2) and stops before the second-to-last one (index -2, which is not included).

You could try this regex: (?s)\(\{(.*?)\}\)
which simply captures the contents between the delimiters.
Beware though, this doesn't account for nesting.
If nesting is a concern, the best you can do with the standard Python re engine
is to get the innermost nest only, using this regex:
\(\{((?:(?!\(\{|\}\)).)*)\}\)
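A quick way to check the innermost-only behaviour is to run such a pattern with the built-in re module. The pattern below is written with the negative lookahead closed before the dot, and the sample strings are made up for illustration.

```python
import re

# Match "({", then any characters that do not start another "({" or "})",
# then the closing "})". Only innermost groups can satisfy this.
INNER = re.compile(r"\(\{((?:(?!\(\{|\}\)).)*)\}\)", re.DOTALL)

print(INNER.findall("({Simple Data})"))
print(INNER.findall("({Parent ({Nested Data}) Suffix})"))
```

For the nested input only the inner group is returned, because the outer "({" can never reach a "})" without first running into the inner "({".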

Here is a tokenizer I designed to handle nested data; the OP may want to try it.
import collections
import re
Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])
def tokenize(code):
    token_specification = [
        ('DATA', r'[ \t]*[\w]+[\w \t]*'),
        ('SKIP', r'[ \t\f\v]+'),
        ('NEWLINE', r'\n|\r\n'),
        ('BOUND_L', r'\(\{'),
        ('BOUND_R', r'\}\)'),
        ('MISMATCH', r'.'),
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    lines = code.splitlines()
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        else:
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)
statements = '''
({Simple Data})
({
Parent Data Prefix
({Nested Data (Yes)})
Parent Data Suffix
})
'''
queue = collections.deque()
for token in tokenize(statements):
    if token.typ == 'DATA' or token.typ == 'MISMATCH':
        queue.append(token.value)
    elif token.typ == 'BOUND_L' or token.typ == 'BOUND_R':
        print(''.join(queue))
        queue.clear()
Output of this code should be:
Simple Data
Parent Data Prefix
Nested Data (Yes)
Parent Data Suffix

How can I elegantly combine/concat files by section with python?

Like many an unfortunate programmer soul before me, I am currently dealing with an archaic file format that refuses to die. I'm talking ~1970 format specification archaic. If it were solely up to me, we would throw out both the file format and any tool that ever knew how to handle it, and start from scratch. I can dream, but unfortunately that won't resolve my issue.
The format: Pretty loosely defined, as years of nonsensical revisions have destroyed almost all the backward compatibility it once had. Basically, the only constant is that there are section headings, with few rules about what comes before or after these lines. The headings are sequential (e.g. HEADING1, HEADING2, HEADING3,...), but not numbered and are not required (e.g. HEADING1, HEADING3, HEADING7). Thankfully, all possible heading permutations are known. Here's a fake example:
# Bunch of comments
SHOES # First heading
# bunch text and numbers here
HATS # Second heading
# bunch of text here
SUNGLASSES # Third heading
...
My problem: I need to concatenate multiple of these files by these section headings. I have a perl script that does this quite nicely:
while(my $l=<>) {
    if($l=~/^SHOES/i) { $r=\$shoes; name($r);}
    elsif($l=~/^HATS/i) { $r=\$hats; name($r);}
    elsif($l=~/^SUNGLASSES/i) { $r=\$sung; name($r);}
    elsif($l=~/^DRESS/i || $l=~/^SKIRT/i ) { $r=\$dress; name($r);}
    ...
    ...
    elsif($l=~/^END/i) { $r=\$end; name($r);}
    else {
        $$r .= $l;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
As you can see, with the perl script I basically just change where a reference points to when I get to a certain pattern match, and concatenate each line of the file to its respective string until I get to the next pattern match. These are then printed out later as one big concated file.
I would and could stick with perl, but my needs are becoming more complex every day and I would really like to see how this problem can be solved elegantly with python (can it?). As of right now my method in python is basically to load the entire file as a string, search for the heading locations, then split up the string based on the heading indices and concat the strings. This requires a lot of regex, if-statements and variables for something that seems so simple in another language.
It seems that this really boils down to a fundamental language issue. I found a very nice SO discussion about python's "call-by-object" style as compared with that of other languages that are call-by-reference.
How do I pass a variable by reference?
Yet, I still can't think of an elegant way to do this in python. If anyone can help kick my brain in the right direction, it would be greatly appreciated.

That's not even elegant Perl.
my @headers = qw( shoes hats sunglasses dress );
my $header_pat = join "|", map quotemeta, @headers;
my $header_re = qr/$header_pat/i;
my ( $section, %sections );
while (<>) {
    if (/($header_re)/) { name( $section = \$sections{$1 } ); }
    elsif (/skirt/i) { name( $section = \$sections{'dress'} ); }
    else { $$section .= $_; }
    print STDERR "Finished processing $ARGV\n" if eof;
}
Or if you have many exceptions:
my @headers = qw( shoes hats sunglasses dress );
my %aliases = ( 'skirt' => 'dress' );
my $header_pat = join "|", map quotemeta, @headers, keys(%aliases);
my $header_re = qr/$header_pat/i;
my ( $section, %sections );
while (<>) {
    if (/($header_re)/) {
        name( $section = \$sections{ $aliases{$1} // $1 } );
    } else {
        $$section .= $_;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}
Using a hash saves the countless my declarations you didn't show.
You could also do $header_name = $1; name(\$sections{$header_name}); and $sections{$header_name} .= $_ for a bit more readability.

I'm not sure if I understand your whole problem, but this seems to do everything you need:
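import sys
headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = [[] for header in headers]
for arg in sys.argv[1:]:
    section_index = 0
    with open(arg) as f:
        for line in f:
            # Headings can be skipped, so check every later heading,
            # not just the next one (this also avoids an IndexError
            # once the last heading has been reached)
            for i in range(section_index + 1, len(headers)):
                if line.startswith(headers[i]):
                    section_index = i
                    break
            else:
                sections[section_index].append(line)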
Obviously you could change this to read or mmap the whole file, then re.search or just buf.find for the next header. Something like this (untested pseudocode):
import sys
from collections import defaultdict
headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = defaultdict(list)
for arg in sys.argv[1:]:
    with open(arg) as f:
        buf = f.read()
    section = None
    start = 0
    for header in headers[1:]:
        idx = buf.find('\n'+header, start)
        if idx != -1:
            sections[section].append(buf[start:idx])
            section = header
            start = buf.find('\n', idx+1)
            if start == -1:
                break
    else:
        # no break: keep whatever follows the last heading found
        sections[section].append(buf[start:])
And there are plenty of other alternatives, too.
But the point is, I can't see anywhere where you'd need to pass a variable by reference in any of those solutions, so I'm not sure where you're stumbling on whichever one you've chosen.
So, what if you want to treat two different headings as the same section?
Easy: create a dict mapping headers to sections. For example, for the second version:
headers_to_sections = {None: None, 'SHOES': 'SHOES', 'HATS': 'HATS',
'DRESSES': 'DRESSES', 'SKIRTS': 'DRESSES'}
Now, in the code that does sections[section], just do sections[headers_to_sections[section]].
For the first, just make this a mapping from strings to indices instead of strings to strings, or replace sections with a dict. Or just flatten the two collections by using a collections.OrderedDict.

My deepest sympathies!
Here's some code (please excuse minor syntax errors):
def foundSectionHeader(l, secHdrs):
    # Return the matching header, or None if the line is not a header
    for s in secHdrs:
        if s in l:
            return s
    return None
def main():
    fileList = ['file1.txt', 'file2.txt', ...]
    sectionHeaders = ['SHOES', 'HATS', ...]
    sectionContents = dict()
    for section in sectionHeaders:
        sectionContents[section] = []
    for file in fileList:
        with open(file) as fp:
            lines = fp.readlines()
        idx = 0
        while idx < len(lines):
            sec = foundSectionHeader(lines[idx], sectionHeaders)
            idx += 1
            if sec:
                # Collect lines until the next header or the end of the file
                while idx < len(lines) and not foundSectionHeader(lines[idx], sectionHeaders):
                    sectionContents[sec].append(lines[idx])
                    idx += 1
This assumes that you don't have content lines which look like "SHOES"/"HATS" etc.

Assuming you're reading from stdin, as in the perl script, this should do it:
import sys
import collections
headings = {'SHOES':'SHOES','HATS':'HATS','DRESS':'DRESS','SKIRT':'DRESS'} # etc...
sections = collections.defaultdict(str)
key = None
for line in sys.stdin:
    sline = line.strip()
    if sline not in headings:
        # a str has no append(), so concatenate the line instead
        sections[headings.get(key)] += line
    else:
        key = sline
You'll end up with a dictionary like this:
{
    None: <all lines as a single string before any heading>,
    'HATS' : <all lines as a single string below HATS heading and before next heading>,
    etc...
}
The headings dict does not have to be defined in the same order as the headings appear in the input.
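Putting the pieces from these answers together (a heading-to-section mapping, case-insensitive matching at the start of a line, and several inputs merged by section), a compact sketch could look like the following. HEADINGS, HEADING_RE and merge_sections are hypothetical names, the heading table only covers the headings named in the question, and the function takes file contents as strings so it can be tried without files on disk.

```python
import re
from collections import defaultdict

# Hypothetical heading table: heading keyword -> section it belongs to.
HEADINGS = {"SHOES": "SHOES", "HATS": "HATS", "SUNGLASSES": "SUNGLASSES",
            "DRESS": "DRESS", "SKIRT": "DRESS"}
HEADING_RE = re.compile(
    r"^(%s)\b" % "|".join(map(re.escape, HEADINGS)), re.IGNORECASE)

def merge_sections(texts):
    """Concatenate the body of each section across all input texts."""
    sections = defaultdict(list)
    for text in texts:
        current = None  # lines before the first heading go under None
        for line in text.splitlines(keepends=True):
            match = HEADING_RE.match(line)
            if match:
                current = HEADINGS[match.group(1).upper()]
            else:
                sections[current].append(line)
    return {name: "".join(lines) for name, lines in sections.items()}


first = "# comments\nSHOES # First heading\nboots\nHATS\ncap\n"
second = "SKIRT\nmini\nShoes\nsandals\n"
merged = merge_sections([first, second])
print(merged["SHOES"])
```

No pass-by-reference is needed: the loop only rebinds the name current, which plays the role of the $r reference in the Perl script.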

Pyparsing: How can I parse data and then edit a specific value in a .txt file?

My data is located in a .txt file (no, I can't change it to a different format) and it looks like this:
variablename = value
something = thisvalue
youget = the_idea
Here is my code so far (taken from the examples in Pyparsing):
from pyparsing import Word, alphas, alphanums, Literal, restOfLine, OneOrMore, \
    empty, Suppress, replaceWith
input = open("text.txt", "r")
src = input.read()
# simple grammar to match #define's
ident = Word(alphas + alphanums + "_")
macroDef = ident.setResultsName("name") + "= " + ident.setResultsName("value") + Literal("#") + restOfLine.setResultsName("desc")
for t,s,e in macroDef.scanString(src):
    print(t.name, "=", t.value)
So how can I tell my script to edit a specific value for a specific variable?
Example:
I want to change the value of variablename, from value to new_value.
So essentially variable = (the data we want to edit).
I probably should make it clear that I don't want to go directly into the file and change the value by changing value to new_value but I want to parse the data, find the variable and then give it a new value.

Even though you have already selected another answer, let me answer your original question, which was how to do this using pyparsing.
If you are trying to make selective changes in some body of text, then transformString is a better choice than scanString (although scanString or searchString are fine for validating your grammar expression by looking for matching text). transformString will apply token suppression or parse action modifications to your input string as it scans through the text looking for matches.
from pyparsing import Word, alphanums, ParseException
# src is the file content read in the question's code
# alphas + alphanums is unnecessary, since alphanums includes all alphas
ident = Word(alphanums + "_")
# I find this shorthand form of setResultsName is a little more readable
macroDef = ident("name") + "=" + ident("value")
# define values to be updated, and their new values
valuesToUpdate = {
    "variablename" : "new_value"
    }
# define a parse action to apply value updates, and attach to macroDef
def updateSelectedDefinitions(tokens):
    if tokens.name in valuesToUpdate:
        newval = valuesToUpdate[tokens.name]
        return "%s = %s" % (tokens.name, newval)
    else:
        raise ParseException("no update defined for this definition")
macroDef.setParseAction(updateSelectedDefinitions)
# now let transformString do all the work!
print(macroDef.transformString(src))
Gives:
variablename = new_value
something = thisvalue
youget = the_idea

For this task you do not need any special utility or module.
All you need is to read the lines and split each one into a list, so that the first element is the left side and the second is the right side.
If you need these values later, you might want to store them in a dictionary.
Here is a simple way, for somebody new to Python. Uncomment the print lines to use them for debugging.
f=open("conf.txt","r")
txt=f.read() #all text is in txt
f.close()
fwrite=open("modified.txt","w")
splitedlines = txt.splitlines()
#print(splitedlines)
for line in splitedlines:
    #print(line)
    conf = line.split('=')
    #conf[0] is what is on the left and conf[1] is what is on the right
    #print(conf)
    if conf[0].strip() == "youget":
        #we get this
        conf[1] = " the_super_idea" #the_idea is now the_super_idea
    #join conf with '=' and write
    newline = '='.join(conf)
    #print(newline)
    fwrite.write(newline+"\n")
fwrite.close()
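The same edit can be done on the text in memory with a single regular-expression substitution, which leaves every other line (and the spacing around '=') untouched. This is a sketch using only the standard library; set_value is a hypothetical helper name, and it assumes one 'name = value' pair per line.

```python
import re

def set_value(text, name, new_value):
    """Replace the value assigned to name, keeping everything else."""
    pattern = re.compile(
        r"^([ \t]*%s[ \t]*=[ \t]*).*$" % re.escape(name), re.MULTILINE)
    # A function is used as the replacement so that backslashes in
    # new_value are inserted literally.
    return pattern.sub(lambda m: m.group(1) + new_value, text)


text = "variablename = value\nsomething = thisvalue\nyouget = the_idea\n"
print(set_value(text, "variablename", "new_value"))
```

The result can then be written back with an ordinary open(..., "w") call.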

Actually, you should have a look at the ConfigParser module,
which parses exactly your syntax (you only need to add a [section] header at the beginning).
If you insist on your implementation, you can create a dictionary:
dictt = {}
for t,s,e in macroDef.scanString(src):
    dictt[t.name]= t.value
dictt[variable]=new_value

ConfigParser
# Python 2 module name; in Python 3 the module is called configparser
import ConfigParser
config = ConfigParser.RawConfigParser()
config.read('example.txt')
variablename = config.get('variablename', 'float')
It will complain if the file has no [section] header, but that's OK: you can fake one.
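One way to fake the missing header, sketched here for Python 3's configparser, is to prepend a made-up section name to the text before parsing it. The function name parse_flat_config and the section name "root" are arbitrary choices for illustration.

```python
import configparser

def parse_flat_config(text, section="root"):
    """Parse 'name = value' lines that have no [section] header."""
    parser = configparser.RawConfigParser()
    # Prepend a made-up section header so the parser accepts the text.
    parser.read_string("[%s]\n%s" % (section, text))
    return dict(parser[section])


text = "variablename = value\nsomething = thisvalue\nyouget = the_idea\n"
settings = parse_flat_config(text)
settings["variablename"] = "new_value"
print(settings)
```

Note that the parser lowercases option names by default, so keys come back in lower case.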
