Regex - matching all text between two strings - python

I'm currently parsing a log file that has the following structure:
1) timestamp, preceded by # character and followed by \n
2) arbitrary # of events that happened after that timestamp and all followed by \n
3) repeat..
Here is an example:
#100
04!
03!
02!
#1299
0L
0K
0J
0E
#1335
06!
0X#
0[#
b1010 Z$
b1x [$
...
Please forgive the seemingly cryptic values, they are encodings representing certain "events".
Note: Event encodings may also use the # character.
What I am trying to do is to count the number of events that happen at a certain time.
In other words, at time 100, 3 events happened.
I am trying to match all text between two timestamps - and count the number of events by simply counting the number of newlines enclosed in the matched text.
I'm using Python's regex engine, and I'm using the following expression:
pattern = re.compile('(#[0-9]{2,}.*)(?!#[0-9]+)')
Note: The {2,} is because I want timestamps with at least two digits.
I match a timestamp, continue matching any other characters until hitting another timestamp - ending the matching.
What this returns is:
#100
#1299
#1335
So, I get the timestamps - but none of the events data - what I really care about!
I'm thinking the reason for this is that the negative lookahead is "greedy" - but I'm not completely sure.
There may be an entirely different regex that makes this much simpler - open to any suggestions!
Any help is much appreciated!
-k

I think a regex is not a good tool for the job here. You can just use a loop..
>>> import collections
>>> d = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
...     t = 'initial'
...     for line in f:
...         if line.startswith('#'):
...             t = line.strip()
...         else:
...             d[t].append(line.strip())
...
>>> for k, v in d.items():
...     print(k, len(v))
...
#100 3
#1299 4
#1335 6

If you insist on a regex-based solution, I propose this:
>>> pat = re.compile(r'(^#[0-9]{2,})\s*\n((?:[^#].*\n)*)', re.MULTILINE)
>>> for t, e in pat.findall(s):
...     print(t, e.count('\n'))
...
#100 3
#1299 4
#1335 6
Explanation:
(
  ^                 anchor to start of line in multiline mode
  #[0-9]{2,}        line starting with # followed by numbers
)
\s*                 skip whitespace just in case (eg. Windows line separator)
\n                  new line
(
  (?:               repeat non-capturing group inside capturing group to capture
                    all repetitions
    [^#].*\n        line not starting with #
  )*
)
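You seem to have misunderstood what negative lookahead does. When it follows .*, the regex engine first consumes as many characters as possible (here, the rest of the line, since . does not match a newline) and only then checks the lookahead. A negative lookahead succeeds when its pattern does not match at that position, which is already true at the end of the timestamp line, so the match simply stops there. Had the assertion failed, the engine would have backtracked character by character until it succeeded.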
You could, however, use positive lookahead together with the non-greedy .*?. Here the .*? will consume characters until the lookahead sees either a # at start of a line, or the end of the whole string:
re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)', re.DOTALL | re.MULTILINE)
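For illustration, here is a minimal sketch of how that lookahead pattern could be used to count events. The variable name log_text is an assumption, and the sample is the log from the question with the elided "..." line left out:

```python
import re

# Sample log from the question, without the elided "..." line.
log_text = (
    "#100\n04!\n03!\n02!\n"
    "#1299\n0L\n0K\n0J\n0E\n"
    "#1335\n06!\n0X#\n0[#\nb1010 Z$\nb1x [$\n"
)

pat = re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)', re.DOTALL | re.MULTILINE)

# Each match yields (timestamp, block of event lines); count the newlines.
counts = [(t, events.count('\n')) for t, events in pat.findall(log_text)]
print(counts)  # [('#100', 3), ('#1299', 4), ('#1335', 5)]
```

Note that the lookahead stops at any line that starts with #, so an event line beginning with # would still end the block early.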

The reason is that the dot doesn't match newlines, so your expression will only match the lines containing the timestamp; the match won't go across multiple lines. You could pass the "dotall" flag to re.compile so that your expression will match across multiple lines. Since you say the "event encodings" might also contain a # character, you might also want to use the multiline flag and anchor your match with ^ at the beginning so it only matches the # at the beginning of a line.
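A small sketch of the difference, using a shortened sample (the variable names are only illustrative). It also shows that DOTALL on its own makes the greedy .* run to the end of the string, so the pattern still needs a way to stop at the next timestamp:

```python
import re

log_text = "#100\n04!\n03!\n#1299\n0L\n"

# Without DOTALL, '.' stops at the newline, so each match is one line.
per_line = re.findall(r'#[0-9]{2,}.*', log_text)
print(per_line)  # ['#100', '#1299']

# With DOTALL, the greedy '.*' runs to the end of the string instead.
whole = re.findall(r'#[0-9]{2,}.*', log_text, re.DOTALL)
print(whole == [log_text])  # True

# Anchoring with ^ in MULTILINE mode only accepts '#' at the start of a line.
anchored = re.findall(r'^#[0-9]{2,}', "0X#12\n#100\n", re.MULTILINE)
print(anchored)  # ['#100']
```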

You could just loop through the data line by line and have a dictionary that just stores the number of events associated with each timestamp; no regex required. For example:
with open('exampleData') as example:
    eventCountsDict = {}
    currEvent = None
    for line in example:
        if line[0] == '#':  # replace this line with more specific timestamp details if event encodings can start with a '#'
            eventCountsDict[line] = 0
            currEvent = line
        else:
            eventCountsDict[currEvent] += 1
print(eventCountsDict)
That code prints {'#100\n': 3, '#1299\n': 4, '#1335\n': 5} for your example data (not counting the ...).

Related

Regex extract group inside optional group

I have strings of the form "identifier STEP=10" where the "STEP=10" part is optional. The goal is to detect lines both with and without the STEP part and to extract the numerical value of STEP in cases where it is part of the line. Now matching both cases is easy enough,
import re
pattern = ".*(STEP=[0-9]+)?"
re.match(pattern, "identifier STEP=10")
re.match(pattern, "identifier")
This detects both cases without problem. But I fail to extract the numerical value in one go. I tried the following,
import re
pattern = ".*(STEP=([0-9]+))?"
group0 = re.search(pattern, "identifier STEP=10").groups()
group1 = re.search(pattern, "identifier").groups()
And while it still does detect the lines, I only get
group0 = (None, None)
group1 = (None, None)
While I hoped to get something like
group0 = (None, "10")
group1 = (None, None)
Is regex not suited to do this in one go or am I simply using it wrong ? I am curious if there is a single regex call that returns what I want without doing a second pass after I have matched the line.
A possible solution will look like
import re
pattern = "^.*?(?:STEP=([0-9]+))?$"
group0 = re.search(pattern, "identifier STEP=10").groups()
group1 = re.search(pattern, "identifier").groups()
print(*group0)
print(*group1)
See the Python demo.
The ^.*?(?:STEP=([0-9]+))?$ regex matches
^ - start of string
.*? - zero or more chars other than line break chars as few as possible (i.e. the regex engine skips this pattern first and tries the subsequent patterns, and only comes back to use this when the subsequent patterns fail to match)
(?:STEP=([0-9]+))? - an optional non-capturing group: STEP= and then Group 1 capturing one or more ASCII digits
$ - end of string.
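As a quick check of the breakdown above, here is a small sketch. The helper name step_value is hypothetical and only there for illustration:

```python
import re

pattern = re.compile(r"^.*?(?:STEP=([0-9]+))?$")

def step_value(line):
    """Return the STEP number as a string, or None if the line has no STEP part."""
    match = pattern.search(line)
    return match.group(1) if match else None

print(step_value("identifier STEP=10"))  # 10
print(step_value("identifier"))          # None
```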
The .*(STEP=[0-9]+)? regex matches like this:
.* - grabs the whole line, from start to end
(STEP=[0-9]+)? - the group is quantified with ? (meaning zero or one occurrence of the quantified pattern), so the regex engine, with its index being at the end of the line now, finds a match: an empty string at the string end, and the match is returned with Group 1 not participating in the match (None in Python).
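The greedy behaviour described here can be observed directly. This is only a sketch that reuses the pattern from the question:

```python
import re

greedy = re.search(r".*(STEP=([0-9]+))?", "identifier STEP=10")

# .* has already consumed the whole line, so the optional group matches nothing.
print(greedy.group(0))  # identifier STEP=10
print(greedy.groups())  # (None, None)
```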
To be able to resolve such issues you must understand backtracking in regex (for example, see this YT video of mine to learn more about it).

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile('\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.
If you want to match the whole string with a regex, you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name). It tries to match just the whole string.
Or put $ anchor at the end of your regex and call re.match(pattern, string). It tries to find a match from the start of the string.
Actually, you could also put ^ at the start of the regex (together with $ at the end) and call re.search(pattern, string), but it would be a very strange combination.
I also have a remark about how you specified your conditions, which may be incomplete: you gave the string $40 million as an example and stated that the only reason to reject it is the space and letters after $40.
So what you actually want is to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
And one more remark concerning Python literals: Apparently you have forgotten to prepend the pattern with r.
If you use r-string literal, you do not have to double backslashes inside.
So I think the most natural solution is to call a function devoted just to
match the whole string (i.e. fullmatch), without adding start / end
anchors and the whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And an optional whitespace character.
)? - End of the non-capturing group and ? stating that the whole group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [ and ] the dot represents itself, so no backslash quotation is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.]?[\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.]? - An optional single comma or dot.
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, which may occur zero or more times.
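To see the stricter variant in action, here is a small sketch that combines it with the optional $ prefix from above. The sample strings and variable names are only illustrative:

```python
import re

# Optional "$" prefix, then digit runs separated by at most one comma or dot.
strict = re.compile(r'(?:\$\s?)?\d+(?:[,.]?\d+)*')

samples = ["416", "15,704", "2...5", "3.,44", "$40 million"]
accepted = [s for s in samples if strict.fullmatch(s)]
print(accepted)  # ['416', '15,704']
```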
With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$')  # Groups of three digits separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704
You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches any run of digits, commas and dots, followed by zero or more whitespace characters. If you want to match only numbers with at most one , or . use the following pattern: '^\d+[,.]?\d+\s*$' (note that it requires at least two digits), code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^\d+[,.]?\d+\s*$ matches everything that starts with a group of digits (\d+) followed by an optional , or . ([,.]?) followed by a group of digits, with an optional group of spaces \s*.

Python regex extract string between two braces, including new lines [duplicate]

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that come two lines below it in one capture (I can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
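A short sketch of the first pattern applied to made-up sample text (the titles and sequences are invented for illustration):

```python
import re

text = (
    "Some varying text1\n\nAAABBB\nCCCDDD\n\n"
    "Some varying text 2\n\nEEEFFF\nGGGHHH\n"
)

pattern = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

# Group 2 keeps the newlines between the lines, so strip them afterwards.
records = [(title, body.replace("\n", "")) for title, body in pattern.findall(text)]
print(records)
```

Each tuple holds the title line and the joined uppercase lines that follow the blank line.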
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("", sequence)
...     print("Title:", title)
...     print("Sequence:", sequence)
...     print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (any character except a newline) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline()  # read 1st line
        aminoacid_sequence = sequence_file.read()  # read the rest
    # some cleanup, if necessary
    title = title.strip()  # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
It can sometimes be comfortable to specify the flag directly inside the string, as an inline-flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
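A minimal sketch of the inline flag, with invented sample text, comparing it against the same pattern without the flag:

```python
import re

text = "first line\nA complete line\nlast line"

# (?m) at the start of the pattern has the same effect as passing re.MULTILINE.
inline = re.search("(?m)^A complete line$", text)
plain = re.search("^A complete line$", text)

print(inline is not None)  # True
print(plain is None)       # True
```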
My preference.
lineIter= iter(aFile)
for line in lineIter:
    if line.startswith( ">" ):
        someVaryingText= line
        break
assert len( next(lineIter).strip() ) == 0
acids= []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append( line )
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.

Referencing previous group possible within the same regex?

I am trying to perform a regex in Python. I want to match on a file path that does not have a domain extension and additionally, I only want to get those file paths that have 20 characters max after the last '\' of the file path. For example, given the data:
c:\users\docs\cmd.exe
c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
c:\users\docs\files\target
I would want to match on 'target', and not the other two lines. It should be noted that in my current situation, using the re module or python operations is not an option, as this regex is fed into the program (which uses re.match() ), so I have to do this within a regex string.
I have two regexes:
^([^.]+)$ will match the last 2 lines
([^\\]{,20}$) will match 'cmd.exe' and 'target'
How can I combine these two into one regex? I tried backreferencing (?P=, etc), but couldn't get it to work. Is this even possible?
How about \\([^\\.]{1,20})(?:$|\n)? It seems to work for me.
\\ is escaped literal backslash.
( start of capture group.
[^\\.] match anything except literal backslash or literal dot character
{1,20} match class 1-20 times, as many times as possible (greedy).
) end the capture group.
(?: starts a non-capturing group
$ match the end of the string.
| is the 'or' operator for this group
\n matches a line-feed or newline character (ASCII 10)
) end of non-capturing group
To create this, I used https://regex101.com/#python which is a very good resource in my opinion because it explains every part of the regex and neatly shows the captured groups in real time.
>>> s = r"""c:\users\docs\cmd.exe
... c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
... c:\users\docs\files\target""".split('\n')
>>> [re.match(r'.*\\([^.]{,20})$', x) for x in s]
[None, None, <_sre.SRE_Match object at 0x7f6ad9631558>]
also
>>> [re.findall(r'.*\\([^.]{,20})$', x) for x in s]
[[], [], ['target']]
This means:
.*\\ - grab everything up to and including the last \
([^.]{,20}) - make sure there are no . in the remaining (up to 20) characters
$ - end of line
The () around the middle group indicate that it should be the group returned as the match
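As a sketch only, the two ideas can be combined into one pattern that works with re.match() by also excluding backslashes from the captured part. This particular combination is an assumption and not taken from either answer as written:

```python
import re

paths = [
    r"c:\users\docs\cmd.exe",
    r"c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf",
    r"c:\users\docs\files\target",
]

# Everything up to the last backslash, then 1-20 chars that are neither "\" nor ".".
combined = re.compile(r'.*\\([^\\.]{1,20})$')

names = []
for p in paths:
    m = combined.match(p)
    names.append(m.group(1) if m else None)

print(names)  # [None, None, 'target']
```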

Slicing by start and stop string values in Python

I have a string in which there are certain values that I need to extract from it. For example: "FEFEWFSTARTFFFPENDDCDC". How could I make an expression that would take a slice from "START" all the way to "END"?
I tried doing this previously by creating functions which used a for loop and string.find("START") to locate the beginning and ends, but this didn't appear to work effectively and seemed overly complex. Is there an easier way to do this without using complex loops?
EDIT:
Forgot this part. What if there were different end values? In other words, instead of just ending with "END", the values "DONE" and "NOMORE" would also end it? And in addition to that, there were multiple starts and ends throughout the string. For example: "STARTFFEFFDONEFEWFSTARTFEFFENDDDW".
EDIT2: Sample run: Start value: ATG. End values: TAG,TAA,TGA
"Enter a string": TTATGTTTTAAGGATGGGGCGTTAGTT
TTT
GGGCGT
And
"Enter a string": TGTGTGTATAT
"No string found"
That's a perfect fit for a regular expression:
>>> import re
>>> s = "FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER"
>>> re.findall("START.*?(?:END|DONE|NOMORE)", s)
['STARTFFFPEND', 'STARTDOINVOIJHSDFDONE']
.* matches any number of characters (except newlines), the additional ? makes the quantifier lazy, telling it to match as few characters as possible. Otherwise, there would be only one match, namely STARTFFFPENDDCDCSTARTDOINVOIJHSDFDONE.
As @BurhanKhalid noted, if you add a capturing group, only the substring matched by that part of the regex will be captured:
>>> re.findall("START(.*?)(?:END|DONE|NOMORE)", s)
['FFFP', 'DOINVOIJHSDF']
Explanation:
START       # Match "START"
(           # Match and capture in group number 1:
  .*?       # Any character, any number of times, as few as possible
)           # End of capturing group 1
(?:         # Start a non-capturing group that matches...
  END       # "END"
  |         # or
  DONE      # "DONE"
  |         # or
  NOMORE    # "NOMORE"
)           # End of non-capturing group
And if your real goal is to match gene sequences, you need to make sure that you always match triplets:
re.findall("ATG(?:.{3})*?(?:TA[AG]|TGA)", s)
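Applied to the sample run from the question, a sketch with a capturing group around the triplets could look like this (the name orf is only illustrative):

```python
import re

# ATG, then as few whole triplets as possible, then one of the stop codons.
orf = re.compile(r"ATG((?:.{3})*?)(?:TA[AG]|TGA)")

print(orf.findall("TTATGTTTTAAGGATGGGGCGTTAGTT"))  # ['TTT', 'GGGCGT']
print(orf.findall("TGTGTGTATAT"))                  # []
```

An empty list corresponds to the "No string found" case.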
a="FEFEWFSTARTFFFPENDDCDC"
a[a.find('START'):]
'STARTFFFPENDDCDC'
The simple way (no loop, no regex):
s = "FEFEWFSTARTFFFPENDDCDC"
tmp = s[s.find("START") + len("START"):]
result = tmp[:tmp.find("END")]
yourString = 'FEFEWFSTARTFFFPENDDCDC'
substring = yourString[yourString.find("START") + len("START") : yourString.find("END")]
Not that efficient but does work.
>>> s = "FEFEWFSTARTFFFPENDDCDC"
>>> s[s.index('START'):s.index('END')+len('END')]
'STARTFFFPEND'
