Match lines between two patterns in Python with regular expressions [duplicate] - python

This question already has answers here:
Python: consecutive lines between matches similar to awk
(3 answers)
Closed 4 years ago.
I am parsing log files that include lines regarding events by many jobs, identified by a job id. I am trying to get all lines in a log file between two patterns in Python.
I have read this very useful post How to select lines between two patterns? and had already solved the problem with awk like so:
awk '/pattern1/,/pattern2/' file
Since I am processing the log information in a Python script, I am using subprocess.Popen() to execute that awk command. My program works, but I would like to solve this using Python alone.
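Roughly, the current call looks like this (just a sketch; the file path and patterns are placeholders):
import subprocess

# current approach: let awk extract the block of lines
cmd = ['awk', '/pattern1/,/pattern2/', '/some/log/file']
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
matching_lines, _ = proc.communicate()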
I know of the re module, but don't quite understand how to use it. The log files have already been compressed to bz2, so this is my code to open the .bz2 files and find the lines between the two patterns:
import bz2
import re
logfile = '/some/log/file.bz2'
PATTERN = r"/{0}/,/{1}/".format('pattern1', 'pattern2')
# example: PATTERN = r"/0001.server;Considering job to run/,/0040;pbs_sched;Job;0001.server/"
re.compile(PATTERN)
with bz2.BZ2File(logfile) as fh:
    match = re.findall(PATTERN, fh.read())
However, match is empty (fh.read() is not!). Using re.findall(PATTERN, fh.read(), re.MULTILINE) has no effect.
Using re.DEBUG with re.compile() shows many lines with
literal 47
literal 50
literal 48
literal 49
literal 57
and two lines that say
any None
I could solve the problem with loops as in python print between two patterns, including lines containing patterns, but I avoid nested for/if loops as much as I can. I believe the re module can yield the result I want, but I am no expert in how to use it.
I am using Python 2.7.9.

It's usually a bad idea to read a whole log file into memory, so I'll give you a line-by-line solution. I'll assume that the dots you have in your example are the only varying part of the pattern. I'll also assume that you want to collect line groups in a list of lists.
import bz2
import re
with_delimiting_lines = True
logfile = '/some/log/file.bz2'
group_start_regex = re.compile(r'/0001.server;Considering job to run/')
group_stop_regex = re.compile(r'/0040;pbs_sched;Job;0001.server/')
group_list = []
with bz2.BZ2File(logfile) if logfile.endswith('.bz2') else open(logfile) as fh:
    inside_group = False
    for line_with_nl in fh:
        line = line_with_nl.rstrip()
        if inside_group:
            if group_stop_regex.match(line):
                inside_group = False
                if with_delimiting_lines:
                    group.append(line)
                group_list.append(group)
            else:
                group.append(line)
        elif group_start_regex.match(line):
            inside_group = True
            group = []
            if with_delimiting_lines:
                group.append(line)
Please note that match() matches from the beginning of the line (as if the pattern started with ^, when re.MULTILINE mode is off).
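A quick illustration of that anchoring behaviour (a sketch with made-up strings):
import re

rx = re.compile(r'pbs_sched')
print(rx.match('pbs_sched;Job;0001.server'))   # match object: pattern found at position 0
print(rx.match('04/09 pbs_sched;Job'))         # None: match() only looks at the start
print(rx.search('04/09 pbs_sched;Job'))        # match object: search() scans the whole line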

/pattern1/,/pattern2/ isn't a regex; it's an awk-specific construct composed of two regexes.
With a pure regex you could use pattern1.*?pattern2 with the DOTALL flag (which makes . match newlines when it usually wouldn't):
re.findall("pattern1.*?pattern2", input, re.DOTALL)
This differs from the awk command, which matches the full lines containing the start and end patterns; that can be achieved as follows:
re.findall("[^\n]*pattern1.*?pattern2[^\n]*", input, re.DOTALL)
Note that I answered your question as it was asked for the sake of pedagogy, but Walter Tross' solution should be preferred.

Related

How do you use the sub() method with a regular expression that has an optional group? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I've been working on this file rename program for a few days now. I've learned a lot thanks to all of the "silly" questions those before me have asked on this site and the quality answers they have received. Well, on to my problem.
My filenames are in the following format:
ACP001.jpg, ACP002.jpg,... ACP010.jpg, ACP011.jpg, ACP012_x.jpg, ACP013.jpg, ACP014_x.jpg
pattern = r'(ACP0)(0*)(\d+)(\.jpg)'
replace = r'\3\4'
So that was working fine for most of them... but then there were some that had the "_x" just before the file extension. I amended the pattern and replacement string as follows:
pattern = r'(ACP0)(0*)(\d+)(_w)*(\.jpg)'
replace = r'\3.jpg'
I think I cheated by hardcoding the ".jpg" in the replace string. How would I handle these situations where the match object groups may be of varying sizes? I essentially want the last group and the third group in this example.
Make the _x term optional:
pattern = r'(ACP0)(0*)(\d+)(_x)?(\.jpg)'
I don't actually know why you have so many capture groups in your pattern. I would have written it this way:
pattern = r'ACP(\d{3})(_x)?\.jpg'
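A small usage sketch (note that group 1 keeps the zero padding with this pattern; add a 0* group, as in your original attempt, if you want the bare number):
import re

pattern = r'ACP(\d{3})(_x)?\.jpg'
names = ['ACP001.jpg', 'ACP012_x.jpg', 'ACP014_x.jpg']
print([re.sub(pattern, r'\1.jpg', name) for name in names])
# ['001.jpg', '012.jpg', '014.jpg']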
You can use . to match any character except a newline. Since the OP wants to rename all files to numbers only (ACP001.jpg -> 1.jpg), you can use the following pattern and replace strings for that:
li=['ACP001.txt', 'ACP012.txt', 'ACP013_x.jpg'] # list of filenames
import re # built-in package for regular expressions
pattern = r'(ACP)(0*)(\d+)(.*)(\.\w+)'
replace = r'\3\5'
res = [re.sub(pattern, replace, st) for st in li]
print(res)
OUTPUT
['1.txt', '12.txt', '13.jpg']
This code works on all file extensions and removes the problem of multiple groups altogether.

Find information in text file given setpoints are known

So I know the setpoints <start point> and <end point> in the text file, and I need to use these to find certain information between them, which will then be used and printed. I currently have .readlines() within a different function, which is used within the new function to find the information.
You can try something like this:
flag = False
info = []  # your desired information will be appended as strings to this list
with open(your_file, 'r') as file:
    for line in file.readlines():
        if '<start point>' in line:  # pointer reached the start point
            flag = True
            continue                 # don't keep the start-point line itself
        if '<end point>' in line:    # pointer reached the end point
            flag = False
        if flag:  # this line is between the start point and the end point
            info.append(line.strip())
>>> info
['Number=12', 'Word=Hello']
This seems like a job for regular expressions. If you have not yet encountered regular expressions, they are an extremely powerful tool that can basically be used to search for a specific pattern in a text string.
For example, the regular expression (or regex for short) Number=\d+ would find any line in the text document that has Number= followed by one or more digits. The regex Word=\w+ would match any string starting with Word= followed by one or more word characters.
In Python you can use regular expressions through the re module. For a great introduction to using regular expressions in Python, check out the relevant chapter of the book Automate the Boring Stuff with Python. To test them out, an online regex tester is great.
In this particular instance you would do something like:
import re
your_file = "test.txt"
with open(your_file,'r') as file:
    file_contents = file.read()
number_regex = re.compile(r'Number=\d+')
number_matches = re.findall(number_regex, file_contents)
print(number_matches)
>>> ['Number=12']
This would return a list with all matches to the number regex. You could then do the same thing for the word match.
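For example, a sketch of the same idea for the word pattern, assuming the file also contains a line like Word=Hello as in the first answer:
word_regex = re.compile(r'Word=\w+')
word_matches = re.findall(word_regex, file_contents)  # file_contents as read above
print(word_matches)  # e.g. ['Word=Hello']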

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair on each line, with the pair separated by a comma, e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term,replace_term]]
This is not producing the right output. If I print detergent I get
[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter', 'Peter']]
It seems that it is escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a call to re.sub(search_term, replace_term, file_content) further down in the script replaces it with
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them as r'...' raw strings, but I am not sure what the issues are when reading them from a file.
There are no issues or special requirements for reading regexes from a file. The escaping of backslashes is simply how Python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Echo it at the interactive prompt and you'll see it is displayed the same way ...
>>> r"\?"
>>> '\\?'
The reason your $1 is not being replaced is that this is not Python's syntax for group references in the replacement string; the correct syntax is \1.
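A minimal sketch of the fix, using the first pair from the file (the input string here is invented):
import re

search_term = r'<\? xml([^>]*?)>'
replace_term = r'<? XML\1>'   # \1 instead of $1
print(re.sub(search_term, replace_term, 'header <? xml version="1.0"?> trailer'))
# header <? XML version="1.0"?> trailer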

Python Script using Fileinput Module and with Regex substitution containing \n (multiple lines)

All,
I am relatively new to Python but have used other scripting languages with REGEX extensively. I need a script that will open a file, look for a REGEX pattern, replace the pattern and close the file. I have found that the below script works great; however, I don't know if the "for line in fileinput.input" command can accommodate a regex pattern that exceeds a single line (i.e. the regex includes a carriage return). In my instance, it covers 2 lines. My test file read_it.txt looks like this:
read_it.txt (contains just 3 lines)
ABA
CDC
EFE
The script is designed to open the file, recognize the pattern ABA\nCDC that is seen over 2 lines, then replace it with the word TEST.
If the pattern replace is successful, then the file should read as follows and now contain only 2 lines:
TEST
EFE
Knowing the answer to this will help greatly in using Python scripts to parse text files and modify them on the fly. I believe, but am not sure, that there may be a better Python construct that still allows for REGEX searches. So the question is:
1) Do I need to change something in the existing script that would change the behavior of the "for line" command to match a multi-line REGEX pattern?
2) Or do I need a different Python script that is better suited to a multi-line search?
Some things that may help, but that I currently don't know how to write, are:
1) the fileinput "readline" option
2) adding (?m) to the expression for multiline
Please help!
Brent
SCRIPT
import sys
import fileinput
import re
for line in fileinput.input('C:\\Python34\\read_it.txt', inplace=1):
    line = re.sub(r'A(B)A$\nCDC', r'TEST', line.rstrip())
    print(line)
2) adding (?m) to the expression for multiline
You can do this by passing re.M (or re.MULTILINE) via the flags keyword argument of re.sub. It has to be passed by keyword, because the fourth positional argument of re.sub is count, not flags.
Example:
re.sub(r'A(B)A$\nCDC', r'TEST', line.rstrip(), flags=re.M)
or
re.sub(r'A(B)A$\nCDC', r'TEST', line.rstrip(), flags=re.MULTILINE)
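For the two-line pattern to match, though, the string handed to re.sub has to contain both lines; in the line-by-line loop each call only ever sees one line. A rough sketch of the whole-file alternative (file path taken from the question, everything else assumed):
import re

path = 'C:\\Python34\\read_it.txt'
with open(path) as f:
    text = f.read()

# apply the two-line substitution across the whole file contents
text = re.sub(r'A(B)A$\nCDC', r'TEST', text, flags=re.M)

with open(path, 'w') as f:
    f.write(text)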

Matching regex to set

I am looking for a way to match the beginning of a line against a regex and have the line returned afterwards. The set is quite extensive, which is why I cannot simply use the method given in Python regular expressions matching within set. I was also wondering if regex is the best solution. I have read http://docs.python.org/3.3/library/re.html but, alas, it does not seem to hold the answer. Here is what I have tried so far...
import re
import os
import itertools
f2 = open(file_path)
unilist = []
bases=['A','G','C','N','U']
patterns= set(''.join(per) for per in itertools.product(bases, repeat=5))
#stuff
if re.match(r'.*?(?:patterns)', line):
    print(line)
    unilist.append(next(f2).strip())
print(unilist)
You see, the problem is that I do not know how to refer to my set...
The file I am trying to match it to looks like:
#SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+
hhhhhhhhhhghhghhhhhfhhhhhfffffeee[X]b[d[ed`[Y[^Y
You are going about it the wrong way.
You simply leave the set of characters to the regular expression:
re.search('[AGCNU]{5}', line)
matches any 5-character pattern built from those 5 characters; that matches the same 3125 different combinations you generated with your set line, but doesn't need to build all possible combinations up front.
Otherwise, your regular expression attempt had no correlation to your patterns variable; the pattern r'.*?(?:patterns)' would match 0 or more arbitrary characters followed by the literal text 'patterns'.
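For example (a sketch with a made-up line):
import re

line = 'xx12GAUNAANNUGGAyy'          # made-up example line
m = re.search('[AGCNU]{5}', line)
if m:
    print(m.group())                 # 'GAUNA', the first run of 5 such bases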
According to what I've understood from your question, it seems to me that this could fit your need:
import re
sss = '''dfgsdfAUGNA321354354
!=**$=)"nNNUUG54788
=AkjhhUUNGffdffAAGjhff1245GGAUjkjdUU
.....cv GAUNAANNUGGA'''
print re.findall('^(.+?[AGCNU]{5})',sss,re.MULTILINE)
