Cannot get my regular expression to capture target [duplicate] - python

I need some help on declaring a regex. My inputs are like the following:
this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>
The required output is:
this is a paragraph with in between and then there are cases ... where the number ranges from 1-100.
and there are many other lines in the txt files
with such tags
I've tried this:
#!/usr/bin/python
import os, sys, re, glob
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    for line in reader:
        line2 = line.replace('<[1> ', '')
        line = line2.replace('</[1> ', '')
        line2 = line.replace('<[1>', '')
        line = line2.replace('</[1>', '')
        print line
I've also tried this (but it seems like I'm using the wrong regex syntax):
line2 = line.replace('<[*> ', '')
line = line2.replace('</[*> ', '')
line2 = line.replace('<[*>', '')
line = line2.replace('</[*>', '')
I don't want to hard-code the replacement from 1 to 99.

This tested snippet should do it:
import re
line = re.sub(r"</?\[\d+>", "", line)
Edit: Here's a commented version explaining how it works:
line = re.sub(r"""
(?x) # Use free-spacing mode.
< # Match a literal '<'
/? # Optionally match a '/'
\[ # Match a literal '['
\d+ # Match one or more digits
> # Match a literal '>'
""", "", line)
Regexes are fun! But I would strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: "metacharacters" that need to be escaped (i.e. have a backslash placed in front), and note that the rules are different inside and outside character classes. There is an excellent online tutorial at www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!
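To make the escaping point concrete, here is a tiny sketch (the sample text is taken from the question); '[' opens a character class, so it needs a backslash when you want a literal bracket:
import re
# '[' is a metacharacter, so it must be escaped to match a literal bracket.
print(re.findall(r"<\[\d+>", "with<[1> in between</[1>"))  # ['<[1>']
# re.escape() can add the required backslashes for you when building a
# pattern from a literal string (the '[' comes back escaped).
print(re.escape("<[1>"))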

str.replace() does fixed replacements. Use re.sub() instead.
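A minimal illustration of the difference, using one line from the question:
import re
line = "the<[99> number ranges from 1-100</[99>."
# str.replace() looks for the literal text "<[*>", which never occurs,
# so the line comes back unchanged.
print(line.replace("<[*>", ""))
# re.sub() treats the first argument as a pattern, so \d+ matches any tag number.
print(re.sub(r"</?\[\d+>", "", line))  # "the number ranges from 1-100."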

I would go like this (regex explained in comments):
import re
# If you need to use the regex more than once, it is recommended to compile it.
pattern = re.compile(r"</{0,}\[\d+>")
# </{0,}\[\d+>
#
# Match the character “<” literally «<»
# Match the character “/” literally «/»
#   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
# Match the character “[” literally «\[»
# Match a single digit 0..9 «\d»
#   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “>” literally «>»
subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>"""
result = pattern.sub("", subject)
print(result)
If you want to learn more about regexes, I recommend reading Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.

The easiest way
import re
txt='this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>'
out = re.sub("(<[^>]+>)", '', txt)
print out
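One thing worth noting (small made-up example below): <[^>]+> removes anything between angle brackets, not just the <[N>/</[N> markers, so the narrower pattern from the other answers is safer if your text can contain other kinds of tags:
import re
line = "keep <b>this</b> but drop the<[7> marker</[7>"
print(re.sub(r"<[^>]+>", "", line))    # "keep this but drop the marker"
print(re.sub(r"</?\[\d+>", "", line))  # "keep <b>this</b> but drop the marker"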

The replace method of string objects does not accept regular expressions, only fixed strings (see the documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).
You have to use the re module:
import re
newline = re.sub(r"</?\[[0-9]+>", "", line)
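Applied to one of the sample lines from the question, that would look like this (a quick sketch):
import re
line = "with<[3> such tags </[3>"
print(re.sub(r"</?\[[0-9]+>", "", line))  # "with such tags "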

You don't have to use a regular expression (for your sample string):
>>> s
'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'
>>> for w in s.split(">"):
...     if "<" in w:
...         print w.split("<")[0]
...
this is a paragraph with
in between
and then there are cases ... where the
number ranges from 1-100
.
and there are many other lines in the txt files
with
such tags
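If you want the cleaned-up text back as a single string rather than printed fragments, here is a rough sketch of the same split-based idea (shortened sample string, still no regex):
s = "this is a paragraph with<[1> in between</[1> and the<[99> end</[99>."
pieces = [w.split("<")[0] for w in s.split(">")]
print("".join(pieces))  # "this is a paragraph with in between and the end."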

import os, sys, re, glob
# Match the opening or closing tag with one or more digits.
pattern = re.compile(r"</?\[\d+>")
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    with open(infile) as reader:
        for line in reader:
            retline = pattern.sub("", line)
            sys.stdout.write(retline)

Related

regex capture info in text file after multiple blank lines

I open a complex text file in Python and match everything else I need with regex, but I am stuck on one search.
I want to capture the numbers after the 'start after here' line. The blank line between the two rows is important, and I plan to split on it later.
start after here: test
5.7,-9.0,6.2
1.6,3.79,3.3
Code:
import re
text = open(r"file.txt", "r")
for line in text:
    find = re.findall(r"start after here:[\s]\D+.+", line)
I tried this here https://regexr.com/ and it seems to work, but it is for Java.
It doesn't find anything. I assume this is because I need to incorporate multiline matching, but I'm unsure how to read the file in differently or how to incorporate that. I have been trying many adjustments to the regex but have not been successful.
import re
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
m = re.search(r'start after here:([^\n])+\n+(.*)', test_str)
new_str = m[2]
m = re.search(r'(-?\d*\.\d*,?\s*)+', new_str)
print(m[0])
The pattern start after here:[\s]\D+.+ matches the literal words and then a whitespace char using [\s] (you can omit the brackets).
Then 1+ times not a digit is matched, which will match until just before 5.7. Then 1+ times any character except a newline is matched, which will match 5.7,-9.0,6.2. It will not match the following empty line or the next line.
One option could be to match your string and then capture, in a capturing group, all the following lines that start with a decimal.
\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*
The values including the empty line are in the first capturing group.
For example
import re
regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
matches = re.findall(regex, test_str)
print(matches)
Result
['5.7,-9.0,6.2\n\n1.6,3.79,3.3']
If you want to match the decimals (or just one or more digits) before the comma you might split on 1 or more newlines and use:
[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)
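A rough sketch of that suggestion (the block and row names here are just for illustration): split the captured block on runs of newlines, then pull the decimals out of each row:
import re
block = "5.7,-9.0,6.2\n\n1.6,3.79,3.3"
for row in re.split(r"[\r\n]+", block):
    print(re.findall(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)", row))
# ['5.7', '-9.0', '6.2']
# ['1.6', '3.79', '3.3']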

delete whitespace in regular expression

I'm learning Python and also English. I have a problem that might be easy, but I can't solve it. I have a folder of .txt files, and I was able to extract with a regular expression a sequence of 17 digits from each one. I need to rename each file with the sequence I extracted from its .txt.
import os
import re
path_txt = (r'C:\Users\usuario\Desktop\files')
name_files = os.listdir(path_txt)
for TXT in name_files:
    with open(path_txt + '\\' + TXT, "r") as content:
        search = re.search(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)', content.read())
        if search is not None:
            print(search.group(0))
            f = open(os.path.join("Processes", search.group(0) + ".txt"), "w")
            for line in content:
                print(line)
                f.write(line)
            f.close()
There are .txt files where the sequences appear with spaces between characters, and my regular expression cannot find them (example: 00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5).
Edit: They are serial numbers that were typed, so sometimes they appear with "." and "-" and other times without them. Sometimes spaces appear because of typos.
You want this regex:
search = re.search(r'(\d{5}.*\d{4}.*\d{3}.*\d{2}.*\d{2}-.*\d)', content.read())
Dot (.) matches any character. By putting \ in front of the dot you escaped it, so you were searching for literal dots rather than any character.
You can use \D in your regular expression to match any non-numeric character (including white space) and + to match one or more (or * to match zero or more), so you could rewrite your expression as:
pattern = r'(\d{5}\D+\d{4}\D+\d{3}\D+\d{2}\D+\d{2}\D+\d)'
re.findall(pattern, '00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5')
# ['00372.2004 .442.02.00-1', '00572.2008.872.02.00- 5']
Note I am using re.findall to find every match in the string and return them in a list.
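Since the edit mentions that the dots and dashes are sometimes missing entirely, a variant worth trying (a sketch, not tested against your real files) is \D* instead of \D+, which also matches when there is no separator at all:
import re
pattern = r'\d{5}\D*\d{4}\D*\d{3}\D*\d{2}\D*\d{2}\D*\d'
samples = '00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5, 00372200444202001'
print(re.findall(pattern, samples))
# ['00372.2004 .442.02.00-1', '00572.2008.872.02.00- 5', '00372200444202001']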

Python regex extract string between two braces, including new lines [duplicate]

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that come two lines below it, in one capture (I can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, its supposed to be a sequence of aminoacids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
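For completeness, here is a small sketch applying the first pattern above; the sample text is made up but has the same shape as the question's:
import re
text = "some Varying TEXT\n\nAAABBBCCC\nDDDEEEFFF\n\nanother header\n\nGGGHHH\n"
pattern = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
for m in pattern.finditer(text):
    print(m.group(1))                    # the header line
    print(m.group(2).replace("\n", ""))  # the uppercase block, newlines stripped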
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("",sequence)
...     print "Title:",title
...     print "Sequence:",sequence
...     print
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline()           # read 1st line
        aminoacid_sequence = sequence_file.read()  # read the rest
    # some cleanup, if necessary
    title = title.strip()  # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
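Hypothetical usage of the function above, assuming a file named protein.txt (the filename is made up) holding one header line followed by the sequence:
title, sequence = read_amino_acid_sequence("protein.txt")
print(title)
print(sequence)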
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print 'Name: %s\nSequence:%s' % (m[0], m[1])
It can sometimes be convenient to specify the flag directly inside the pattern string, as an inline flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re or compile your regex before calling the assert.
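A minimal sketch of the inline-flag idea (plain re.search here rather than a unit test, just to keep it short):
import re
text = "first line\nA complete line\nlast line"
print(re.search(r"(?m)^A complete line$", text) is not None)  # True
print(re.search(r"^A complete line$", text) is not None)      # False without the flag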
My preference.
lineIter = iter(aFile)
for line in lineIter:
    if line.startswith(">"):
        someVaryingText = line
        break
assert len(lineIter.next().strip()) == 0
acids = []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append(line)
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.

Removing lines from a text file using python and regular expressions

I have some text files, and I want to remove all lines that begin with the asterisk (“*”).
Made-up example:
words
*remove me
words
words
*remove me
My current code fails. It follows below:
import re
program = open(program_path, "r")
program_contents = program.readlines()
program.close()
new_contents = []
pattern = r"[^*.]"
for line in program_contents:
    match = re.findall(pattern, line, re.DOTALL)
    if match.group(0):
        new_contents.append(re.sub(pattern, "", line, re.DOTALL))
    else:
        new_contents.append(line)
print new_contents
This produces ['', '', '', '', '', '', '', '', '', '', '*', ''], which is no good.
I’m very much a python novice, but I’m eager to learn. And I’ll eventually bundle this into a function (right now I’m just trying to figure it out in an ipython notebook).
Thanks for the help!
Your regular expression seems to be incorrect:
[^*.]
This means: match any single character that is not * or . (the leading ^ inside the brackets negates the set). Inside a bracket expression most metacharacters lose their special meaning, so the . here matches a literal dot, not a wildcard.
This is why you get "*" for lines starting with *: you're replacing every character except * and .! You would also keep any . present in the original string. Since the other lines contain neither * nor ., all of their characters are replaced.
If you want to match lines beginning with *:
^\*.*
What might be easier is something like this:
pat = re.compile("^[^*]")
for line in contents:
    if re.search(pat, line):
        new_contents.append(line)
This code just keeps any line that does not start with *.
In the pattern ^[^*], the first ^ matches the start of the string. The expression [^*] matches any character but *. So together this pattern matches any starting character of a string that isn't *.
It is a good habit to really think about what you need when using regular expressions. Do you simply need to assert something about a string, do you need to change or remove characters in a string, or do you need to match substrings?
In terms of Python, you need to think about what each function is giving you and what you need to do with it. Sometimes, as in my example, you only need to know that a match was found. Sometimes you might need to do something with the match.
Sometimes re.sub isn't the fastest or the best approach. Why bother going through each line and replacing all of the characters when you can just skip that line entirely? There's no sense in producing an empty string when you're filtering.
Most importantly: Do I really need a regex? (Here you don't!)
You don't really need a regular expression here. Since you know the size and position of your delimiter you can simply check like this:
if line[0] != "*":
This will be faster than a regex. They're very powerful tools and can be neat puzzles to figure out, but for delimiters with fixed width and position, you don't really need them. A regex is much more expensive than an approach making use of this information.
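A minimal sketch of that check as a complete filter, using a made-up list (startswith also sidesteps the IndexError that line[0] would raise on a completely empty line):
program_contents = ["words\n", "*remove me\n", "words\n"]
new_contents = [line for line in program_contents if not line.startswith("*")]
print(new_contents)  # ['words\n', 'words\n']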
You don't want to use a [^...] negative character class; you are matching all characters except for the * or . characters now.
* is a meta character, you want to escape that to \*. The . 'match any character' syntax needs a multiplier to match more than one. Don't use re.DOTALL here; you are operating on a line-by-line basis but don't want to erase the newline.
There is no need to test first; if there is nothing to replace the original line is returned.
pattern = r"^\*.*"
for line in program_contents:
    new_contents.append(re.sub(pattern, "", line))
Demo:
>>> import re
>>> program_contents = '''\
... words
... *remove me
... words
... words
... *remove me
... '''.splitlines(True)
>>> new_contents = []
>>> pattern = r"^\*.*"
>>> for line in program_contents:
...     new_contents.append(re.sub(pattern, "", line))
...
>>> new_contents
['words\n', '\n', 'words\n', 'words\n', '\n']
You can do:
print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))
Example:
txt='''\
words
*remove me
words
words
*remove me '''
import StringIO
f=StringIO.StringIO(txt)
import re
print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))
