Problems matching string between quotes but not if lines starts with % - python

I'm trying to re(encode) a decoded LC_TIME (locale) file. I want to match all strings that are between quotes, but not if they are part of a comment. Comment lines start with a %.
r'"([^"]*)"' works fine to match all strings between quotes, but it doesn't check if it's part of a comment.
To clarify:
abday "Sun";"Mon";/
"Tue";"Wed";/
"Thu";"Fri";/
"Sat"
should result in seven matches
d_fmt "%m/%d/%Y"
should result in one match
% Appropriate time representation (%X)
% "%r"
should not result in a match
Note:
re.findall('^(?!%).*?"([^"]*)"', text, flags=re.M) almost does the trick, but it only matches ['Sun', 'Tue', 'Thu', 'Sat'] in the abday example.
See this link for tests on multiple cases.

Use negative lookahead assertion:
>>> text = '''
a "quoted" element
% "Comment"
"something else"
'''
>>> re.findall('^(?!%).*?"([^"]*)"', text, flags=re.M)
['quoted', 'something else']
>>> re.findall('^(?!%)[^"]*"([^"]*)"', text, flags=re.M)
['quoted', 'something else']

Your hang-up will be matching a line that starts with a ", because that prevents you from simply doing: "^\s*[^%] ... "
You can solve that with this regex:
(?:^\s*[^%].*?\"|^\s*\")([^\"]*)\"
However that will only capture the first string in quotes. You can just add another capture on to this to grab the second string in quotes:
(?:^\s*[^%].*?\"|^\s*\")([^\"]*)\"(?:[^\"]*\"([^\"]*)\")?
But if you need to support capturing an indeterminate number of strings in quotes you'll need to use regex or something similar with a regex like this:
(?:^\s*[^%].*?\"|^\s*\")([^\"]*)\"(?:[^\"]*\"([^\"]*)\")*

It would probably be simpler and easier to read to apply something like this:
import re
with open('sample.txt') as file:
    # drop comment lines before searching for quoted strings
    text = ''.join(line for line in file if line[0] != '%')
print(re.findall('"(.*?)"', text, flags=re.M | re.S))
UPDATE #1:
with open('sample.txt') as file:
    text = ''.join(line.lstrip('+').replace('/\n', '') for line in file)
for line in text.splitlines():
    if line and line[0] != '%':
        for item in re.findall('"(.*?)"', line):
            print(item)

A rather simple regex can capture both the comments and the quoted strings, and then you just filter out the comments:
[quoted for comment, quoted in re.findall(r'(^%.*)|"([^"]*)"',t, re.M) if not comment]
If it is OK to ignore empty quoted strings, a slightly simpler version will do:
[quoted for quoted in re.findall(r'^%.*|"([^"]*)"',t, re.M) if quoted]
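As a rough check of the two-group version above, here is a small self-contained sketch. The sample text is assembled from the examples in the question, and the names text, pattern and quoted_strings are only illustrative.

```python
import re

# Sample LC_TIME-style text, assembled from the examples in the question.
text = '''abday "Sun";"Mon";/
"Tue";"Wed";/
"Thu";"Fri";/
"Sat"
d_fmt "%m/%d/%Y"
% Appropriate time representation (%X)
% "%r"
'''

# Either a whole comment line (group 1) or a quoted string (group 2).
pattern = re.compile(r'(^%.*)|"([^"]*)"', re.M)

# Comment matches fill group 1, so they are dropped by the filter.
quoted_strings = [quoted for comment, quoted in pattern.findall(text) if not comment]
print(quoted_strings)
```

Because a comment line is consumed in one match, the quotes inside it are never seen by the second alternative.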

Couldn't get it to work with a regex. Tried a different approach. Marking so called "unsafe spans" (comment lines that contain double quotes (that should not be (en|de)coded)). See this changeset for the implementation.

Related

Find something that does not match a pattern at the beginning of line

I am using regex in Python to find something at the beginning of line that does not match pattern "SCENE" and before colon. The text looks like this
SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n.
I need to find AQW, CER, DYU, EOI in this case.
I have tried
findall(r"^(?!SCENE)[^:]*", text, re.M)
I get AQW and EOI, but I get ddd\nDYU instead of DYU, ddd\nd\nEOI instead of EOI.
How could I get exactly AQW,CER,DYU,EOI?
You can try this to break your string into substrings and search in those:
import re
line = "SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n."
lines = re.split("\\n([A-Z])", line)
lines = [a+b for a,b in zip(lines[1::2], lines[2::2])]
for line in lines:
    if re.match(r"^(?!SCENE)[^:]*", line):
        print(line.split(":")[0])
result is:
AQW
CER
DYU
EOI
This answer is not the best in terms of performance I assume
This could probably be simplified more, and I'm assuming that \n in your example string is a literal newline character.
This should match all of your use cases. It starts by looking for a run of capital letters that isn't SCENE preceding a :, then it matches any characters after the colon that are not followed by a newline and another run of capitals ending in :. There is probably a way around the final ., but without it the last character wasn't being matched, because it is directly followed by the text that the negative lookahead rejects.
findall( r"([A-Z]+(?<!SCENE):(?:[\s\S](?!\n[A-Z]+:))+.)", text )
https://regex101.com/r/NwdUcR/2
EDIT: I realize the above may not be exactly what you're looking for. If you're looking to match just the letters before the colon you can use this:
findall( r"([A-Z]+(?<!SCENE)):", text )
I use
findall(r"\n(?!SCENE)(.+?):", text)
which works. The point is I did not realize that I can use parentheses to select what I would like to display in the result.
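For comparison, a minimal sketch of a variant that stays anchored at line starts: the original attempt returned ddd\nDYU because [^:] also matches newlines, so excluding \n from the class keeps each match on one line. The variable name labels is only illustrative.

```python
import re

text = "SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n."

# At each line start, reject lines beginning with SCENE, then capture
# everything up to the first colon without crossing a newline.
labels = re.findall(r"^(?!SCENE)([^:\n]+):", text, re.M)
print(labels)
```

Lines without a colon (ddd, d and the final .) simply fail to match, so no post-filtering is needed.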
You don't really need regex in this case.
Here is a solution using plain and simple str.split().
s = 'SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n.'
matches = [m.split('\n')[-1] for m in s.split(':') if 'SCENE' not in m]
>>> matches
['AQW', 'CER', 'DYU', 'EOI', '.']
If you want to exclude the last '.', you can use matches = [m.split('\n')[-1] for m in s.split(':') if (('SCENE' not in m) and (m[-1] != '.'))]
or simply matches = matches[:-1]

Python regex extract string between two braces, including new lines [duplicate]

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that come two lines below it, in one capture (I can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("",sequence)
...     print("Title:", title)
...     print("Sequence:", sequence)
...     print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
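A minimal sketch of that substitution, assuming a title line, one blank line, and then lines of capital letters. The names NL, SEQUENCE_RE and parse_sequences are made up for this illustration, and only \n and \r\n input is exercised here.

```python
import re

# Newline alternatives named above: \n, \r\n, or a bare \r.
NL = r"(?:\n|\r\n?)"

# Title line, one blank line, then one or more lines of capital letters.
SEQUENCE_RE = re.compile(
    r"^(.+?)" + NL + NL + r"((?:[A-Z]+" + NL + r")+)",
    re.MULTILINE,
)

def parse_sequences(text):
    """Return (title, sequence) pairs with line breaks removed from the sequence."""
    return [
        (title.strip(), re.sub(r"[\r\n]", "", block))
        for title, block in SEQUENCE_RE.findall(text)
    ]

unix_text = "Title one\n\nAAAA\nBBBB\n\nTitle two\n\nCCCC\n"
windows_text = unix_text.replace("\n", "\r\n")

print(parse_sequences(unix_text))
print(parse_sequences(windows_text))
```

One caveat: in Python, ^ in multiline mode only anchors after \n, so text that uses bare \r line endings throughout would still need a different anchor for every record after the first.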
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest
    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: this can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
It can sometimes be comfortable to specify the flag directly inside the string, as an inline-flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
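A small sketch of the inline flag next to the equivalent flags argument. The sample text and variable names are invented for the illustration.

```python
import re

text = "title\nAAA\nbbb\nCCC"

# Without a flag, ^ only anchors at the very start of the string.
without_flag = re.findall(r"^[A-Z]+$", text)

# The inline (?m) at the start of the pattern behaves like re.MULTILINE.
with_inline_flag = re.findall(r"(?m)^[A-Z]+$", text)
with_argument = re.findall(r"^[A-Z]+$", text, re.MULTILINE)

print(without_flag)
print(with_inline_flag)
print(with_argument)
```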
My preference.
lineIter= iter(aFile)
for line in lineIter:
    if line.startswith( ">" ):
        someVaryingText= line
        break
# the line after the header is expected to be blank
assert len( next(lineIter).strip() ) == 0
acids= []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append( line )
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.

Find and extract two substrings from string

I have some strings (in fact they are lines read from a file). The lines are just copied to some other file, but some of them are "special" and need a different treatment.
These lines have the following syntax:
someText[SUBSTRING1=SUBSTRING2]someMoreText
So, what I want is: When I have a line on which this "mask" can be applied, I want to store SUBSTRING1 and SUBSTRING2 into variables. The braces and the = shall be stripped.
I guess this consists of several tasks:
Decide if a line contains this mask
If yes, get the positions of the substrings
Extract the substrings
I'm sure this is a easy task with regex, however, I'm not used to it. I can write a huge monster function using string manipulation, but I guess this is not the "Python Way" to do this.
Any suggestions on this?
re.search() returns None if it doesn't find a match. \w matches an alphanumeric character (or underscore), + means 1 or more. Parentheses indicate the capturing groups.
import re
s = """
bla bla
someText[SUBSTRING1=SUBSTRING2]someMoreText"""
results = {}
for line_num, line in enumerate(s.split('\n')):
    m = re.search(r'\[(\w+)=(\w+)\]', line)
    if m:
        # group(0) is the whole match; the two substrings are groups 1 and 2
        results.update({line_num: {'first': m.group(1), 'second': m.group(2)}})
print(results)
^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$
You can try this. Group 1 and Group 2 hold the two strings you want. See demo.
https://regex101.com/r/pT4tM5/26
import re
p = re.compile(r'^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
test_str = "someText[SUBSTRING1=SUBSTRING2]someMoreText\nsomeText[SUBSTRING1=SUBSTRING2someMoreText\nsomeText[SUBSTRING1=SUBSTRING2]someMoreText"
re.findall(p, test_str)
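The three tasks from the question (detect the mask, locate the substrings, extract them) can be folded into one search. This is only a sketch: MASK_RE and extract_pair are hypothetical names, and the pattern is the bracket part of the regex above without the line anchors.

```python
import re

# "[", text without brackets or "=", "=", more such text, "]".
MASK_RE = re.compile(r'\[([^\]\[=]*)=([^\]\[=]*)\]')

def extract_pair(line):
    """Return (SUBSTRING1, SUBSTRING2) if the line contains the mask, else None."""
    m = MASK_RE.search(line)
    return m.groups() if m else None

print(extract_pair("someText[SUBSTRING1=SUBSTRING2]someMoreText"))
print(extract_pair("someText[SUBSTRING1=SUBSTRING2someMoreText"))
print(extract_pair("bla bla"))
```

Returning None for non-matching lines lets the caller copy those lines unchanged and only treat the "special" ones differently.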

Removing lines from a text file using python and regular expressions

I have some text files, and I want to remove all lines that begin with the asterisk (“*”).
Made-up example:
words
*remove me
words
words
*remove me
My current code fails. It follows below:
import re
program = open(program_path, "r")
program_contents = program.readlines()
program.close()
new_contents = []
pattern = r"[^*.]"
for line in program_contents:
    match = re.findall(pattern, line, re.DOTALL)
    if match.group(0):
        new_contents.append(re.sub(pattern, "", line, re.DOTALL))
    else:
        new_contents.append(line)
print new_contents
This produces ['', '', '', '', '', '', '', '', '', '', '*', ''], which is no good.
I’m very much a python novice, but I’m eager to learn. And I’ll eventually bundle this into a function (right now I’m just trying to figure it out in an ipython notebook).
Thanks for the help!
Your regular expression seems to be incorrect:
[^*.]
Means match any character that isn't a * or a . character. When inside a bracket expression, everything after the leading ^ is treated as a literal character. This means that in the expression you have, . matches the . character; it is not a wildcard.
This is why you get "*" for lines starting with *, you're replacing every character but *! You would also keep any . present in the original string. Since the other lines do not contain * and ., all of their characters will be replaced.
If you want to match lines beginning with *:
^\*.*
What might be easier is something like this:
pat = re.compile("^[^*]")
for line in contents:
    if re.search(pat, line):
        new_contents.append(line)
This code just keeps any line that does not start with *.
In the pattern ^[^*], the first ^ matches the start of the string. The expression [^*] matches any character but *. So together this pattern matches any starting character of a string that isn't *.
It is a good habit to think carefully about what you actually need when using regular expressions. Do you simply need to assert something about a string, do you need to change or remove characters in a string, or do you need to match substrings?
In terms of python, you need to think about what each function is giving you and what you need to do with it. Sometimes, as in my example, you only need to know that a match was found. Sometimes you might need to do something with the match.
Sometimes re.sub isn't the fastest or the best approach. Why bother going through each line and replacing all of the characters, when you can just skip that line in total? There's no sense in making an empty string when you're filtering.
Most importantly: Do I really need a regex? (Here you don't!)
You don't really need a regular expression here. Since you know the size and position of your delimiter you can simply check like this:
if line[0] != "*":
This will be faster than a regex. They're very powerful tools and can be neat puzzles to figure out, but for delimiters with fixed width and position, you don't really need them. A regex is much more expensive than an approach making use of this information.
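A minimal sketch of that regex-free filter, written with str.startswith so that an empty string does not raise IndexError the way line[0] would. The function name drop_starred_lines is made up for the example.

```python
def drop_starred_lines(lines):
    """Keep every line that does not begin with an asterisk."""
    return [line for line in lines if not line.startswith("*")]

# Same made-up sample as in the question, as readlines() would return it.
program_contents = ["words\n", "*remove me\n", "words\n", "words\n", "*remove me\n"]
new_contents = drop_starred_lines(program_contents)
print(new_contents)
```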
You don't want to use a [^...] negative character class; you are matching all characters except for the * or . characters now.
* is a meta character, you want to escape that to \*. The . 'match any character' syntax needs a multiplier to match more than one. Don't use re.DOTALL here; you are operating on a line-by-line basis but don't want to erase the newline.
There is no need to test first; if there is nothing to replace the original line is returned.
pattern = r"^\*.*"
for line in program_contents:
    new_contents.append(re.sub(pattern, "", line))
Demo:
>>> import re
>>> program_contents = '''\
... words
... *remove me
... words
... words
... *remove me
... '''.splitlines(True)
>>> new_contents = []
>>> pattern = r"^\*.*"
>>> for line in program_contents:
...     new_contents.append(re.sub(pattern, "", line))
...
>>> new_contents
['words\n', '\n', 'words\n', 'words\n', '\n']
You can do:
print('\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M)))
Example:
txt='''\
words
*remove me
words
words
*remove me '''
import io
f=io.StringIO(txt)
import re
print('\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M)))

REGEX (python) match or return a string after '?', but in a new line, til the end of that line

Here is what I'm trying to do... I have a string structured like this:
stringparts.bst? (carriage return)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99 (carriage return)
SPAM /198975/
I need it to match or return this:
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
What RegEx will do the trick?
I have tried this, but to no avail :(
bst\?(.*)\n
Thanks in advance
I tried this. Assuming the newline is only one character.
>>> s
'stringparts.bst?\n765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\nSPAM /198975/'
>>> m = re.match(r'.*bst\?\s(.+)\s', s)
>>> print(m.group(1))
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
Your regex will match everything between the bst? and the first newline which is nothing. I think you want to match everything between the first two newlines.
bst\?\n(.*)\n
will work, but you could also use
\n(.*)\n
although it may not work for some other more specific cases
This is more robust against different kinds of line breaks, and works if you have a whole list of such strings. The ^ and $ represent the beginning and end of a line, but not the actual line break character (hence the \s+ sequence).
import re
BST_RE = re.compile(
    r"bst\?.*$\s+^(.*)$",
    re.MULTILINE
)
INPUT_STR = r"""
stringparts.bst?
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
SPAM /198975/
stringparts.bst?
another
SPAM /.../
"""
occurrences = BST_RE.findall(INPUT_STR)
for occurrence in occurrences:
    print(occurrence)
This pattern allows additional whitespace before the \n:
r'bst\?\s*\n(.*?)\s*\n'
If you don't expect any whitespace within the string to be captured, you could use a simpler one, where \s+ consumes whitespace, including the \n, and (\S+) captures all the consecutive non-whitespace:
r'bst\?\s+(\S+)'
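A quick sketch of the simpler pattern applied to the string from the question, with both \n and \r\n line breaks. The names BST_TOKEN_RE and token_after_bst are hypothetical.

```python
import re

# \s+ swallows the line break (any kind), (\S+) captures the next token.
BST_TOKEN_RE = re.compile(r"bst\?\s+(\S+)")

def token_after_bst(text):
    """Return the first whitespace-free token after 'bst?', or None."""
    m = BST_TOKEN_RE.search(text)
    return m.group(1) if m else None

sample = (
    "stringparts.bst?\n"
    "765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\n"
    "SPAM /198975/"
)

print(token_after_bst(sample))
print(token_after_bst(sample.replace("\n", "\r\n")))
```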
