substring extract in a file using Python Regex

A file has n lines, grouped into blocks of logically related strings. I'm parsing each line and capturing the required data based on some matching conditions.
I read through each line and find the blocks with this code:
for lines in file.readlines():
    if re.match(r'block.+', lines) is not None:
        block_name = re.match(r'block.+', lines).group(0)
        # string matching code to be added here
Input File:
line1 select KT_TT=$TMTL/$SYSNAME.P1
line2 . $dhe/ISFUNC sprfl tm/tm1032 int 231
line3 select IT_TT=$TMTL/$SYSNAME.P2
line4 . $DHE/ISFUNC ptoic ca/ca256 tli 551
.....
.....
line89 CALLING IK02=$TMTL/$SYSNAME.P2
line90 CALLING KK01=$TMTL/$SYSNAME.P1
Matching conditions and expected output of each step:
1. While reading the lines, match the word "/ISFUNC" and capture everything after the last "/" on that line. Expected output: tm1032 int 231 and ca256 tli 551 (matches found in line 2, line 4, etc.).
2. Once ISFUNC is found, go back to the immediately preceding line and capture everything after the last "/" on that line. Expected output: $SYSNAME.P1 and $SYSNAME.P2 (line 1, line 3, etc.).
3. Continue reading down and look for lines starting with "CALLING" whose text after the last "/" matches the output of step 2 ($SYSNAME.P1 and $SYSNAME.P2). Capture the data after the word CALLING and save it. Expected output: KK01 (line 90) and IK02 (line 89).
The final output should look like this:
FUNC SYS CALL
tm1032 int 231 $SYSNAME.P1 KK01
ca256 tli 551 $SYSNAME.P2 IK02

If all you need is the text after the last slash, you don't need a regex at all.
Simply call .split("/") on each line and take the last part:
sample = "$dhe/ISFUNC sprfl tm/tm1032 int 231"
sample.split("/")
will result in
['$dhe', 'ISFUNC sprfl tm', 'tm1032 int 231']
and then you access the last element of the list with index -1 to get the value.
PS: Call split only once you have found the corresponding line.
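As a small illustration of the split approach above, the sketch below takes the last element directly. The rsplit variant is an alternative I am adding, not part of the original suggestion; it splits only once from the right, and it returns the whole string when there is no slash at all.

```python
sample = "$dhe/ISFUNC sprfl tm/tm1032 int 231"

# Split on every slash and take the last piece.
last_part = sample.split("/")[-1]

# Alternative: split once, starting from the right-hand side.
same_part = sample.rsplit("/", 1)[-1]

# With no slash in the string, the whole string comes back unchanged.
no_slash = "no slash here".rsplit("/", 1)[-1]

print(last_part)
```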

While reading the lines, match the word "/ISFUNC" and fetch the characters from the last till it matches a "/" and save it to a variable. Expected o/p->tm1032 int 231 (matching string found in line2)
char_list = re.findall(r'/ISFUNC.*/(.*)$', line)
if char_list:
    chars = char_list[0]
Once ISFUNC is found, read the immediate previous line and fetch the data from that line, start from the last character till it matches a "/" and save it to a variable. Expected o/p->$SYSNAME.P1 (line 1)
The ideal approach here is to either (a) read the lines into a list once and iterate over the indices rather than the lines themselves (i.e. lines = file.readlines(), then for i in range(len(lines)): ... lines[i - 1]), or (b) keep a copy of the previous line (say, put last_line = line at the end of your for loop). Then reference that previous line in this expression:
data_list = re.findall(r'/([^/]*)$', last_line)
if data_list:
    data = data_list[0]
Continue reading the lines down and look for the line starting with "CALLING" and the last string after "/" should match with o/p of step 2($SYSNAME.P1). Just capture the data after CALLING word and save it. expected o/p -> KK01 (line 90)
Assuming, from your example, that you mean just the data immediately after CALLING (i.e. up until the equals sign):
calling_list = re.findall(r'CALLING\s*(.*)=.*/' + re.escape(data) + '$', line)
if calling_list:
    calling = calling_list[0]
You can move the parentheses around to change exactly what you capture from that line. re.findall() returns a list of matches, containing only the parts that were matched inside the parentheses.
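Putting the three steps together, a minimal sketch could look like this. It assumes the "line1", "line2" labels in the question are only annotations and not part of the file, and that each CALLING line has the form CALLING NAME=.../SYS. The function name extract_rows and the sample list are made up for illustration.

```python
import re

def extract_rows(lines):
    """Return (func, sys, call) tuples from an iterable of text lines."""
    pending = []    # (func, sys) pairs in the order the ISFUNC lines appear
    calls = {}      # sys -> name captured from the matching CALLING line
    previous = ""
    for raw in lines:
        line = raw.rstrip()
        if "/ISFUNC" in line:
            func = line.rsplit("/", 1)[-1]        # step 1: text after the last "/"
            system = previous.rsplit("/", 1)[-1]  # step 2: same, on the previous line
            pending.append((func, system))
        else:
            # step 3: CALLING <name>=<anything>/<sys>
            m = re.match(r"CALLING\s+(\w+)=.*/([^/]+)$", line)
            if m:
                calls.setdefault(m.group(2), m.group(1))
        previous = line
    # Join the CALLING names to the pairs once the whole input has been read.
    return [(func, system, calls.get(system)) for func, system in pending]

sample = [
    "select KT_TT=$TMTL/$SYSNAME.P1",
    ". $dhe/ISFUNC sprfl tm/tm1032 int 231",
    "select IT_TT=$TMTL/$SYSNAME.P2",
    ". $DHE/ISFUNC ptoic ca/ca256 tli 551",
    "CALLING IK02=$TMTL/$SYSNAME.P2",
    "CALLING KK01=$TMTL/$SYSNAME.P1",
]
rows = extract_rows(sample)
for row in rows:
    print(*row)
```

The CALLING names are collected in a dictionary because those lines come after the ISFUNC lines and in a different order; a pair with no matching CALLING line gets None in the third position.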

Related

Search and Replace a string in text file

I would like to search a text file for a line that contains the string "SECTION=C-BEAM" and replace the first 13 characters of the next line with a value derived from the first line. In the example below, 1.558 is read from the first line (where it is written as 1_558) and 1.558/2 = 0.779 is written into the second line. The number to read from the first line is always between the strings "H_" and "H_0".
Example Input:
SECTION, ELSET=DIORH_1_558H_0_76W_241_1, SECTION=C-BEAM, MAT=XYZ;
0., 1, 2, 3, 4, 5
Output as follows:
SECTION, ELSET=DIORH_1_558H_0_76W_241_1, SECTION=C-BEAM, MAT=XYZ;
0.779, 1, 2, 3, 4, 5
This is what I have tried so far.
file_in = open(test_input, 'rb')
file_out = open(test_output, 'wb')
lines = file_in.readlines()
print ("Total no. of lines to process: ", len(lines))
for i in range(len(lines)):
    if lines.startswith("SECTION") and "SECTION=C-BEAM" in lines:
        start_index = lines.find("H_")+1
        end_index = lines.find("H_0")
        x = lines[start_index:end_index]/2.0
        print (x)
        lines[i+1]= lines[i+1].replace(" 0.",x)+lines[i+1][13:]
    file_out.write(lines[i])
file_in.close()
file_out.close()

Since you mentioned that the content lives in a file, I put some unrelated lines into the test string alongside the pattern you are looking for.
I tested the code below and it works. I assume there is only one such occurrence in the file; multiple occurrences can be handled with a loop.
import re
st = '''These are some different lines - you need not worry about.
SECTION, ELSET=DIORH_1_558H_0_76W_241_1, SECTION=C-BEAM, MAT=XYZ;
0., 1, 2, 3, 4, 5
These are more different lines - you need not worry about.
0.,2 numbers'''
num = str(float(re.findall(r'.*H_(.+)H_0.*SECTION=C-BEAM.*\n.*', st)[0].replace("_", ".")) / 2)
print(re.sub(r'(.*SECTION=C-BEAM.*\n)(0\.)(,.*)', r'\g<1>' + num + r'\g<3>', st))
# re.findall(r'.*H_(.+)H_0.*SECTION=C-BEAM.*\n.*', st) --> returns ['1_558']. Extract 1_558 by indexing it with [0].
# Then replace "_" with ".", convert to a float, divide by 2 and convert the result back to a string.
# .* means 0 or more non-newline characters, .+ means 1 or more non-newline characters, and "\n" stands for a newline.
# (.+) means the characters matched inside the parentheses are extracted from the overall pattern.
# Second line of the code: I replaced the "0." at the start of the second line with the computed number.
# The pattern is divided into 3 groups: 1) everything before "0.", 2) "0." itself, 3) everything after "0.".
# "0." is replaced by writing "group 1 + num + group 3".
Output as shown below:
These are some different lines - you need not worry about.
SECTION, ELSET=DIORH_1_558H_0_76W_241_1, SECTION=C-BEAM, MAT=XYZ;
0.779, 1, 2, 3, 4, 5
These are more different lines - you need not worry about.
0.,2 numbers
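For the multiple-occurrence case mentioned above, a line-by-line loop might look like the following sketch. It assumes the value to overwrite is the first comma-separated field of the line after each header (the question's example replaces "0." rather than a fixed 13 characters), and the function name halve_first_field is hypothetical.

```python
import re

def halve_first_field(lines):
    """For every SECTION=C-BEAM header, rewrite the first field of the next line."""
    out = list(lines)
    for i in range(len(out) - 1):
        header = out[i]
        if header.startswith("SECTION") and "SECTION=C-BEAM" in header:
            # The number sits between "H_" and "H_0", with "_" as decimal point.
            m = re.search(r"H_(\d+(?:_\d+)?)H_0", header)
            if m:
                half = float(m.group(1).replace("_", ".")) / 2
                first, sep, rest = out[i + 1].partition(",")
                out[i + 1] = "%g" % half + sep + rest
    return out

sample = [
    "SECTION, ELSET=DIORH_1_558H_0_76W_241_1, SECTION=C-BEAM, MAT=XYZ;",
    "0., 1, 2, 3, 4, 5",
    "some other line",
]
result = halve_first_field(sample)
print("\n".join(result))
```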

Basic Python string methods and a regex should do it:
import re
my_text = """SECTION, ELSET=DIORH_1_558H_0_76W_241_1, SECTION=C-BEAM, MAT=XYZ;\n0., 1, 2, 3, 4, 5"""
# Find the index of the first occurrence of the marker in my_text
index = my_text.find('SECTION=C-BEAM')
# Take everything before that occurrence
# and count the number of lines (\n is the newline character)
nb_line = my_text[:index].count('\n')
# Now you want to find the index where line n + 1 begins.
# You can do this with the finditer function:
# it yields every match of the regex, so collecting m.start() gives the index of each newline;
# entry nb_line is the newline that ends the matching line (Python indexing starts at 0)
index = [m.start() for m in re.finditer(r"\n", my_text)][nb_line]
# Then you rebuild the wanted string from:
# the beginning of your string up to the start of line n + 1,
# the text you want (0.779),
# the text after the substring you removed (you need to know the length of the string you want to remove, here 2)
string_to_remove = "0."
my_text = my_text[:index+1] + '0.779' + my_text[index + 1 + len(string_to_remove):]
print(my_text)

Loop the remaining elements within a loop

I have the following text:
ERROR: <C:\Includes\Library1.inc:123> This is the Error
Call Trace:
<C:\Includes\Library2.inc:456>
<C:\Includes\Library2.inc:789>
<C:\Code\Main.ext:12>
<Line:1>
ERROR: <C:\Includes\Library2.inc:2282> Another Error
Call Trace:
<C:\Code\Main.ext:34>
<C:\Code\Main.ext:56>
<C:\Code\Main.ext:78>
<Line:1>
ERROR: <C:\Code\Main.ext:90> Error Three
I would like to extract the following information:
line, Error = 12, This is the Error
line, Error = 34, Another Error
line, Error = 90, Error Three
Here is how far I got:
theText = 'ERROR: ...'
ERROR_RE = re.compile(r'^ERROR: <(?P<path>.*):(?P<line>[0-9]+)> (?P<error>.*)$')
mainName = r'\Main.ext'
# Go through each line
for fullline in theText.splitlines():
    match = self.ERROR_RE.match(fullline)
    if match:
        path, line, error = match.group('path'), match.group('line'), match.group('error')
        if path.endswith(mainName):
            callSomething(line, error)
        # else check next line for 'Call Trace:'
        # check next lines for mainName and get the linenumber
        # callSomething(linenumber, error)
What is the pythonic way to loop the remaining elements within a loop?
Solution:
http://codepad.org/BcYmybin

The direct answer to your question about how to loop over the remaining lines is to change the first line of the loop to
lines = theText.splitlines()
for (linenum, fullline) in enumerate(lines):
Then, after a match, you can get at the remaining lines by looking at lines[j] in an inner loop where j starts at linenum+1 and runs until the next match.
However, a slicker way to solve the problem is to first split the text into blocks. There are many ways to do this; being a former Perl user, my impulse is to use regular expressions.
# Split into blocks that start with /^ERROR/ and run until either the next
# /^ERROR/ or until the end of the string.
#
# (?m) - lets '^' and '$' match the beginning/end of each line
# (?s) - lets '.' match newlines
# ^ERROR - triggers the beginning of the match
# .*? - grab characters in a non-greedy way, stopping when the following
#       expression matches
# (?=^ERROR|$(?!\n)) - match until the next /^ERROR/ or the end of string
# $(?!\n) - match end of string. Normally '$' suffices but since we turned
#           on multiline mode with '(?m)' we have to use '$(?!\n)' to prevent
#           this from matching end-of-line.
blocks = re.findall(r'(?ms)^ERROR.*?(?=^ERROR|$(?!\n))', theText)
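To show how the blocks could then be used, here is a sketch that reports, for each block, the first location whose path ends in \Main.ext together with the error text. It is one possible continuation rather than the answerer's code: the names LOCATION_RE and main_errors are hypothetical, and \Z is used as an equivalent way to say "end of string".

```python
import re

the_text = r"""ERROR: <C:\Includes\Library1.inc:123> This is the Error
Call Trace:
<C:\Includes\Library2.inc:456>
<C:\Includes\Library2.inc:789>
<C:\Code\Main.ext:12>
<Line:1>
ERROR: <C:\Includes\Library2.inc:2282> Another Error
Call Trace:
<C:\Code\Main.ext:34>
<C:\Code\Main.ext:56>
<C:\Code\Main.ext:78>
<Line:1>
ERROR: <C:\Code\Main.ext:90> Error Three
"""

ERROR_RE = re.compile(r"^ERROR: <(?P<path>.*):(?P<line>[0-9]+)> (?P<error>.*)$", re.M)
LOCATION_RE = re.compile(r"<(?P<path>[^<>]*):(?P<line>[0-9]+)>")

def main_errors(text, main_name=r"\Main.ext"):
    """Return (line, error) for the first main_name location in each ERROR block."""
    results = []
    for block in re.findall(r"(?ms)^ERROR.*?(?=^ERROR|\Z)", text):
        header = ERROR_RE.match(block)
        if header is None:
            continue
        # The header location comes first, followed by the call trace entries.
        for loc in LOCATION_RE.finditer(block):
            if loc.group("path").endswith(main_name):
                results.append((int(loc.group("line")), header.group("error")))
                break
    return results

found = main_errors(the_text)
for line, error in found:
    print("line, Error = %d, %s" % (line, error))
```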

Replace this:
# else check next line for 'Call Trace:'
# check next lines for mainName and get the linenumber
# callSomething(linenumber, error)
With this:
match = stackframe_re.match(fullline)
if match and error:  # if error is defined from earlier when you matched ERROR_RE
    path, line = match.group('path'), match.group('line')
    if path.endswith(mainName):
        callSomething(line, error)
        error = None  # don't report this error again if you see main again
Note the indentation. Also initialize error = None before the loop begins and set error = None after the first call to callSomething. In general, the code I've suggested should work for properly formatted data, but you might want to improve it so it doesn't give misleading results if the data does not match the format you expect.
You will have to write stackframe_re, but it should be an RE that matches, for example,
<C:\Includes\Library2.inc:789>
I don't really understand what you mean when you say "loop the remaining elements within a loop". A loop continues to the remaining elements by default.
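For the stackframe_re mentioned above, one possible pattern is sketched below. It reuses the path and line group names from ERROR_RE and assumes a call trace entry is the only thing on its line; the variable names frame and header are just for the demonstration.

```python
import re

# A call trace entry such as <C:\Includes\Library2.inc:789>, alone on its line.
stackframe_re = re.compile(r'^\s*<(?P<path>.*):(?P<line>[0-9]+)>\s*$')

frame = stackframe_re.match(r'<C:\Includes\Library2.inc:789>')
header = stackframe_re.match(r'ERROR: <C:\Includes\Library1.inc:123> This is the Error')

print(frame.group('path'), frame.group('line'))
print(header)  # ERROR lines are not stack frames, so this is None
```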

finding and replacing a string within a line with an if statement

I am trying to parse a particular text file. I want to open the text file and, line by line, check whether a particular string is present (in the following example, the number 01 in curly brackets), then reverse a particular substring or keep it the same. Here is an example with one line, arbitrarily named "go" (other lines in the full file have a similar format but contain {01}, {00}, etc.):
go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
go = go.replace(go[22:24],go[23:21:-1])
>>> go
'USC_45774-1111-0 <khxkh> {10} ; 78'
I am trying to manipulate the first "hk" (go[22:24]) by replacing it with the same letters reversed (go[23:21:-1]). What I want to see is khxhk but, as you can see, both occurrences are reversed, giving khxkh.
I am also having a problem getting the right if branch to run for each line. Many lines that don't contain {01} are being manipulated as if they did.
with open('c:/LG 1A.txt', 'r') as rfp:
    with open('C:/output5.txt', 'w') as wfp:
        for line in rfp.readlines():
            if "{01}" or "{-1}" in line:
                line = line.replace(line[25:27],line[26:24:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
            elif "{10}" or "{1-}" in line:
                line = line.replace(line[22:24],line[23:21:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
            elif "{11}" in line:
                line = line.replace(line[22:27],line[26:21:-1])
                line = line.replace("<"," ")
                line = line.replace(">"," ")
                line = line.replace("x"," ")
                wfp.write(line)
        wfp.close()
Am I missing something simple?

The string replace method does not replace characters by position, it replaces them by what characters they are.
>>> 'apple aardvark'.replace('a', '!')
'!pple !!rdv!rk'
So in your first case, you are telling it to replace "hk" with "kh". It doesn't "know" that you want to replace only one of the occurrences; it just knows you want to replace "hk" with "kh", so it replaces all occurrences.
You can use the count argument of replace to specify that you only want to replace the first occurrence:
>>> go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
>>> go.replace(go[22:24],go[23:21:-1],1)
'USC_45774-1111-0 <khxhk> {10} ; 78'
Note, though, that this will always replace the first occurrence, not necessarily the occurrence at the position in the string you specified. In this case I guess that's what you want, but it may not work directly for other similar tasks. (That is, there is no way to use this method as-is to replace only the second or only the third occurrence; you can only replace the first, or the first two, or the first three, etc. To replace the second or third occurrence you'd need to do a bit more.)
As for the second part of your question, you are misunderstanding what if "{01}" or "{-1}" in line means. It means, in layman's terms, if "{01}", or if "{-1}" in line. Since if "{01}" is always true (i.e., the string "{01}" is not a false value), the whole condition is always true. What you want is if "{01}" in line or "{-1}" in line.
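If the goal really is to change the characters at a given position, whatever they happen to be, slicing does that directly. The sketch below is an addition to the answer above; reverse_span is a hypothetical helper, and the start position is looked up from the "<" character instead of being hard-coded.

```python
def reverse_span(text, start, stop):
    """Return text with text[start:stop] reversed and everything else untouched."""
    return text[:start] + text[start:stop][::-1] + text[stop:]

go = 'USC_45774-1111-0 <hkxhk> {10} ; 78'
start = go.index('<') + 1            # first character after the "<"
result = reverse_span(go, start, start + 2)
print(result)
```

Because the new string is built from slices, later occurrences of the same letters are never touched.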

I don't know what it is about Python, but your problem is one that gets posted here at least a couple times every day.
if "{01}" or "{-1}" in line:
This doesn't do what you think it does. It asks, "is "{01}" true"? Because it's a non-zero-length string, it is. Because or short-circuits, the rest of the condition is not tested because the first argument is true. Therefore the body of your if statement is always executed.
In other words, Python evaluates as if you'd written this:
if ("{01}") or ("{-1}" in line):
You want something like:
if "{01}" in line or "{-1}" in line:
Or if you have a lot of similar conditions:
if any(x in line for x in ("{01}", "{-1}")):
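The difference between the three forms is easy to check on a line that contains neither tag. This is a small demonstration I am adding; the sample line and the variable names are only illustrative.

```python
line = 'USC_45774-1111-0 <hkxhk> {10} ; 78'

# Parsed as ("{01}") or ("{-1}" in line): the non-empty string is truthy,
# so the membership test is never even evaluated.
original_condition = bool("{01}" or "{-1}" in line)

# Each tag gets its own membership test.
fixed_condition = "{01}" in line or "{-1}" in line

# Same test, written for an arbitrary number of tags.
any_condition = any(tag in line for tag in ("{01}", "{-1}"))

print(original_condition, fixed_condition, any_condition)
```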

You can use the count argument of replace():
'USC_45774-1111-0 <hkxhk> {10} ; 78'.replace("hk","kh",1)
For your second question, you need to change the condition to:
if "{01}" in line or "{-1}" in line:
...

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z

This "simple to follow" code example will get you started; you can tweak it as needed. Note that it processes the file line by line, so it will work with a file of any size.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print(line, end='')
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print(line, end='')
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
The code has been modified based on a request in the comments; it now prints the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
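The same flag-based idea, applied to the markers from the question, can be wrapped in a function that works on any iterable of lines. This is a sketch; drop_between and skipping are names I made up, and it assumes the start marker always appears before its end marker.

```python
def drop_between(lines, start_marker, end_marker):
    """Keep the marker lines themselves but drop every line between them."""
    kept = []
    skipping = False
    for line in lines:
        if skipping:
            if end_marker in line:
                skipping = False
                kept.append(line)   # keep the closing marker line
        else:
            kept.append(line)
            if start_marker in line:
                skipping = True     # start dropping after the opening marker
    return kept

data = ["a", "b", "color [", "0 0 0,", "1 1 1,", "3 3 3,", "] #color", "y", "z"]
cleaned = drop_between(data, "color [", "] #color")
print("\n".join(cleaned))
```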

You can do this with a single regex by using the non-greedy *?. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> my_regex = re.compile("(look for this line)"+
...                       ".*?"+ # match as few chars as possible
...                       "(until this line is found)",
...                       re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines; by default it matches any character except a newline.
The parentheses define "numbered match groups", which are then used later when I say r"\1\n\2" to make sure that we don't discard the first and last line. The replacement must be a raw string, otherwise "\1" and "\2" are read as control characters rather than group references; the \n puts the two kept lines back on separate lines. If you did want to discard either or both of those, then just get rid of the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str)
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
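Applied to the data from the question, the non-greedy approach could be sketched as follows. The markers contain regex metacharacters, so they are escaped by hand here, and the newline after the opening marker is placed inside the first group so the two kept lines stay on separate lines. The variable names are illustrative.

```python
import re

text = "a\nb\ncolor [\n0 0 0,\n1 1 1,\n3 3 3,\n] #color\ny\nz\n"

# Group 1: the opening marker line, including its newline.
# .*? : as few characters as possible (DOTALL lets "." cross newlines).
# Group 2: the closing marker.
pattern = re.compile(r"(color \[\n).*?(\] #color)", re.DOTALL)

cleaned = pattern.sub(r"\1\2", text)
print(cleaned)
```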

This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print(re.sub(r'(^look for this line$).*?(^until this line is found$)',
             r'\1\n\2', s, count=1, flags=re.DOTALL | re.MULTILINE))
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print('\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):]))
Same output.
I like re for this (being a Perl guy at heart...)

Conversion of Multiple Strings To ASCII

This seems fairly trivial but I can't seem to work it out
I have a text file with the contents:
B>F
I am reading this with the code below, stripping the '>' and trying to convert the strings into their corresponding ASCII values, minus 65, to give me a value that will correspond to another list index.
def readRoute():
    routeFile = open('route.txt', 'r')
    for line in routeFile.readlines():
        route = line.strip('\n' '\r')
        route = line.split('>')
        #startNode, endNode = route
        startNode = ord(route[0])-65
        endNode = ord(route[1])-65
        # Debug (this comment was for my use to explain below the print values)
        print('Route Entered:')
        print(line)
        print(startNode, ',', endNode, '\n')
    return [startNode, endNode]
However I am having slight trouble doing the conversion nicely, because the text file only contains one line at the moment but ideally I need it to be able to support more than one line and run an amount of code for each line.
For example it could contain:
B>F
A>D
C>F
E>D
So I would want to run the same code outside this function 4 times with the different inputs.
Is anyone able to give me a hand?
Edit:
Not sure I made my issue that clear, sorry.
What I need it to do is parse the text file (possibly containing one line or multiple lines like above). I am able to do it for one line with the lines
startNode = ord(route[0])-65
endNode = ord(route[1])-65
But I get errors when trying to do more than one line because the ord() is expecting different inputs
If I have (below) in the route.txt
B>F
A>D
This is the error it gives me:
line 43, in readRoute endNode = ord(route[1])-65
TypeError: ord() expected a character, but string of length 2 found
My code above should read the route.txt file and see that B>F is the first route, strip the '>' - convert the B & F to ASCII, so 66 & 70 respectively then minus 65 from both to give 1 & 5 (in this example)
The 1 & 5 are corresponding indexes for another "array" (list of lists) to do computations and other things on
Once the other code has completed it can then go to the next line in route.txt which could be A>D and perform the above again

Perhaps this will work for you. I turned the file read into a generator so you can do as you please with the parsed results in the for loop.
def readRoute(file_name):
    with open(file_name, 'r') as r:
        for line in r:
            yield (ord(line[0])-65, ord(line[2])-65)
filename = 'route.txt'
for startnode, endnode in readRoute(filename):
    print(startnode, endnode)
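The TypeError in the question comes from the trailing newline: splitting "B>F\n" on ">" leaves "F\n", which is two characters long. A variant of the generator above that strips each line before splitting might look like this. It is a sketch: parse_routes is a hypothetical name, it takes any iterable of lines so it can be tried without a file, and a line that is not of the form X>Y would still raise an error.

```python
def parse_routes(lines):
    """Yield (start, end) index pairs for lines of the form 'B>F'."""
    for line in lines:
        line = line.strip()          # drop the newline and surrounding spaces
        if not line:
            continue                 # ignore blank lines
        start, end = line.split(">")
        yield ord(start) - ord("A"), ord(end) - ord("A")

routes = list(parse_routes(["B>F\n", "A>D\n", "C>F\n", "E>D"]))
for start_node, end_node in routes:
    print(start_node, end_node)
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.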

If you can't change readRoute, change the contents of the file before each call. Better yet, make readRoute take the filename as a parameter (default it to 'route.txt' to preserve the current behavior) so you can have it process other files.

What about something like this? It takes the routes defined in your file and turns them into path objects with start and end member variables. As an added bonus PathManager.readFile() allows you to load multiple route files without overwriting the existing paths.
import re
class Path:
    def __init__(self, start, end):
        self.start = ord(start) - 65  # Scale the values as desired
        self.end = ord(end) - 65  # Scale the values as desired
class PathManager:
    def __init__(self):
        # looks for the string "C>C", where C is a letter
        self.expr = re.compile("^([A-Za-z])[>]([A-Za-z])$")
        self.paths = []
    def do_logic_routine(self, start, end):
        # Do custom logic here that will execute before the next line is read
        # Return True for 'continue reading' or False to stop parsing file
        return True
    def readFile(self, path):
        with open(path, "r") as file:
            for line in file:
                item = self.expr.match(line.strip())  # strip whitespace before parsing
                if item:
                    # item.group(0) is *not* used here; it is the whole match
                    # item.group(1) is the first parenthesized group in the regular expression
                    # item.group(2) is the second
                    self.paths.append(Path(item.group(1), item.group(2)))
                    if not self.do_logic_routine(self.paths[-1].start, self.paths[-1].end):
                        break
# Running the example
MyManager = PathManager()
MyManager.readFile('route.txt')
for path in MyManager.paths:
    print("Start: %s End: %s" % (path.start, path.end))
Output is:
Start: 1 End: 5
Start: 0 End: 3
Start: 2 End: 5
Start: 4 End: 3
