Define a paragraph as a multi-line string delimited on both sides by double newlines ('\n\n'). If there exists a paragraph which contains a certain string ('BAD'), I want to replace that paragraph (i.e. any text containing BAD up to the closest preceding and following double newlines) with some other token ('GOOD'). This should be done with a Python 3 regex.
I have text such as:
dfsdf\n
sdfdf\n
\n
blablabla\n
blaBAD\n
bla\n
\n
dsfsdf\n
sdfdf
should be:
dfsdf\n
sdfdf\n
\n
GOOD\n
\n
dsfsdf\n
sdfdf
Here you are:
/\n\n(?:[^\n]|\n(?!\n))*BAD(?:[^\n]|\n(?!\n))*/g
OK, to break it down a little (because it's nasty looking):
\n\n matches two literal line breaks.
(?:[^\n]|\n(?!\n))* is a non-capturing group that matches either a single non-line break character, or a line break character that isn't followed by another. We repeat the entire group 0 or more times (in case BAD appears at the beginning of the paragraph).
BAD will match the literal text you want. Simple enough.
Then we use the same construction as above, to match the rest of the paragraph.
Then, you just replace it with \n\nGOOD, and you're off to the races.
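Since the question asks for Python 3, here is a minimal sketch of the same pattern passed to re.sub, without the /.../g delimiters. The names PARAGRAPH_WITH_BAD and text are made up for illustration, and the sample is the text from the question. Note that the pattern requires a preceding '\n\n', so a BAD paragraph at the very start of the text would not be replaced.

```python
import re

# Same pattern as above, written as a Python raw string (no /.../g delimiters).
PARAGRAPH_WITH_BAD = re.compile(r'\n\n(?:[^\n]|\n(?!\n))*BAD(?:[^\n]|\n(?!\n))*')

text = 'dfsdf\nsdfdf\n\nblablabla\nblaBAD\nbla\n\ndsfsdf\nsdfdf'

# sub() replaces every match, which is what the g flag does elsewhere.
result = PARAGRAPH_WITH_BAD.sub('\n\nGOOD', text)
print(repr(result))
```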
Firstly, you're mixing actual newlines and '\n' characters in your example; I assume you meant only one of the two. Secondly, let me challenge your assumption that you need regex for this:
inp = '''dfsdf
sdadf

blablabla
blaBAD
bla

dsfsdf
sdfdf'''
replaced = '\n\n'.join(['GOOD' if 'BAD' in k else k for k in inp.split('\n\n')])
The result is
print(repr(replaced))
'dfsdf\nsdadf\n\nGOOD\n\ndsfsdf\nsdfdf'
I am trying to use a regex to match lines that meet the following conditions:
do not contain a "//" string
contain Chinese characters
pick up those Chinese characters
I read line by line from a file:
f = open("test.js", 'r')
lines = f.readlines()
for line in lines:
    matches = regex.findall(line)
    if matches:
        print(matches)
First I tried to match Chinese characters using the following pattern:
re.compile(r"[\u4e00-\u9fff]+")
It works and gives me the output:
['下载失成功']
['下载失败']
['绑定监听']
['该功能暂未开放']
Then I tried to exclude the "//" with the following pattern, combined with the pattern above:
re.compile(r"^(?=^(?:(?!//).)*$)(?=.*[\u4e00-\u9fff]+).*$")
It gives me the output:
[' showToastByText("该功能暂未开放");']
which is almost right but what I want is only the Chinese characters part.
I tried to add "()" but just cannot pick out the part that I want.
Any advice will be appreciated, thanks :)
You don't need such a complex regex just to reject lines containing // and capture the Chinese characters that appear together in sequence. To discard the lines containing //, the negative lookahead (?!.*//) is enough, and to capture the Chinese text you can use [^\u4e00-\u9fff]*([\u4e00-\u9fff]+), so the overall regex becomes:
^(?!.*//)[^\u4e00-\u9fff]*([\u4e00-\u9fff]+)
You can then extract the Chinese characters from the first capture group.
Explanation of above regex:
^ - Start of string
(?!.*//) - Negative look ahead that will discard the match if // is present in the line anywhere ahead
[^\u4e00-\u9fff]* - Optionally matches zero or more non-Chinese characters
([\u4e00-\u9fff]+) - Captures one or more Chinese characters and puts them in the first capture group.
Edit: Here is sample code showing how to capture the text from group 1
import re
s = ' showToastByText("该功能暂未开放");'
m = re.search(r'^(?!.*//)[^\u4e00-\u9fff]*([\u4e00-\u9fff]+)', s)
if m:
    print(m.group(1))
Prints,
该功能暂未开放
Edit: Extracting multiple occurrences of Chinese characters, as mentioned in the comments
As you want to extract multiple occurrences of Chinese characters, you can first check that the string does not contain // and then use findall to extract all the Chinese text. Here is sample code demonstrating this:
import re
arr = ['showToastByText("该功能暂未开放");','//showToastByText("该功能暂未开放");','showToastByText("未开放");','showToastByText("该功能暂xxxxxx未开放");']
for s in arr:
    # re.search finds // anywhere in the string; re.match would only look at the start
    if re.search(r'//', s):
        print(s, ' --> contains // hence not finding')
    else:
        print(s, ' --> ', re.findall(r'[\u4e00-\u9fff]+', s))
Prints,
showToastByText("该功能暂未开放"); --> ['该功能暂未开放']
//showToastByText("该功能暂未开放"); --> contains // hence not finding
showToastByText("未开放"); --> ['未开放']
showToastByText("该功能暂xxxxxx未开放"); --> ['该功能暂', '未开放']
You don't need a positive lookahead to get the Chinese characters: a lookahead only checks that they are present and consumes nothing, so it does not give you the text. Instead, we can rewrite that part as a lazy .*? that runs until it reaches the desired characters.
As such, using:
^(?=^(?:(?!//).)*$).*?([\u4e00-\u9fff]+).*$
Your first capture group will be the Chinese characters.
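A short sketch of how that pattern could be used from Python. The name chinese_no_comment is made up for illustration, and the two sample lines are assumptions modelled on the output shown in the question.

```python
import re

# Hypothetical name; the pattern itself is the one given above.
chinese_no_comment = re.compile(r'^(?=^(?:(?!//).)*$).*?([\u4e00-\u9fff]+).*$')

samples = [
    ' showToastByText("该功能暂未开放");',     # no //, should yield the Chinese run
    '// showToastByText("该功能暂未开放");',   # contains //, should be rejected
]

for line in samples:
    m = chinese_no_comment.search(line)
    # group(1) holds the first run of Chinese characters on the line
    print(m.group(1) if m else None)
```

Like the original pattern, this only captures the first run of Chinese characters on each line.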
I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some Varying TEXT' part, and all of the lines of uppercase text that come two lines below it, in one capture (I can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
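To see what the two groups contain, here is a minimal sketch using the first (linefeed-only) pattern. The sample text and the names block_re and records are assumptions for illustration; the second group keeps its embedded newlines, so they are stripped afterwards.

```python
import re

block_re = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

# Made-up sample in the shape described in the question.
text = (
    "some Varying TEXT\n"
    "\n"
    "DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n"
    "GATACAACATAGGATACA\n"
    "\n"
    "other TEXT\n"
    "\n"
    "CCCCAAAA\n"
    "\n"
)

# findall returns one (title, body) tuple per match; body still contains "\n".
records = [(title, body.replace("\n", "")) for title, body in block_re.findall(text)]
for title, sequence in records:
    print(title, sequence)
```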
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("", sequence)
...     print("Title:", title)
...     print("Sequence:", sequence)
...     print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
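A minimal sketch of that substitution, building the pattern from a shared newline fragment. The name NL and the \r\n sample text are assumptions for illustration. One caveat: in Python, ^ under re.MULTILINE only matches after \n, so a file that uses lone \r line endings would still only anchor at the very start of the text.

```python
import re

NL = r"(?:\n|\r\n?)"  # matches \n, \r\n, or a lone \r

# Same regex as above with every \n replaced by NL.
rx_sequence = re.compile(r"^(.+?)" + NL + NL + r"((?:[A-Z]+" + NL + r")+)", re.MULTILINE)
rx_blanks = re.compile(r"\W+")  # to remove blanks and newlines

# Made-up sample with Windows-style line endings.
text = "Some varying text1\r\n\r\nAAABBB\r\nCCCDDD\r\n\r\n"

match = rx_sequence.search(text)
title = match.group(1).strip()
sequence = rx_blanks.sub("", match.group(2))
print("Title:", title)
print("Sequence:", sequence)
```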
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline()  # read 1st line
        aminoacid_sequence = sequence_file.read()  # read the rest

    # some cleanup, if necessary
    title = title.strip()  # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ", "").replace("\n", "")
    return title, aminoacid_sequence
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
It can sometimes be convenient to specify the flag directly inside the string, as an inline flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
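A small sketch of the equivalence, assuming a made-up three-line text: the inline (?m) behaves like passing re.MULTILINE, and without either the anchors only apply to the whole string.

```python
import re

text = "first\nA complete line\nlast"

# (?m) at the start of the pattern has the same effect as re.MULTILINE.
inline = re.search("(?m)^A complete line$", text)
flagged = re.search("^A complete line$", text, re.MULTILINE)

# Without the flag, ^ and $ only anchor at the start and end of the string.
without_flag = re.search("^A complete line$", text)

print(inline.group(0), flagged.group(0), without_flag)
```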
My preference.
lineIter = iter(aFile)
for line in lineIter:
    if line.startswith(">"):
        someVaryingText = line
        break
# the line after the title is expected to be blank
assert len(next(lineIter).strip()) == 0
acids = []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append(line)
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.
Here is what I'm trying to do... I have a string structured like this:
stringparts.bst? (carriage return)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99 (carriage return)
SPAM /198975/
I need it to match or return this:
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
What RegEx will do the trick?
I have tried this, but to no avail :(
bst\?(.*)\n
Thanks in advance
I tried this. Assuming the newline is only one character.
>>> s
'stringparts.bst?\n765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\nSPAM /198975/'
>>> m = re.match(r'.*bst\?\s(.+)\s', s)
>>> print(m.group(1))
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
Your regex will match everything between the bst? and the first newline, which is nothing. I think you want to match everything between the first two newlines.
bst\?\n(.*)\n
will work, but you could also use
\n(.*)\n
although it may not work for some other more specific cases
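A quick sketch of the first suggestion in Python, assuming the line breaks in the string are plain \n characters (the variable names are made up for illustration):

```python
import re

s = ('stringparts.bst?\n'
     '765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\n'
     'SPAM /198975/')

# Capture whatever sits between the newline after bst? and the next newline.
m = re.search(r'bst\?\n(.*)\n', s)
token = m.group(1) if m else None
print(token)
```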
This is more robust against different kinds of line breaks, and works if you have a whole list of such strings. The ^ and $ represent the beginning and end of a line, but not the actual line break character (hence the \s+ sequence).
import re
BST_RE = re.compile(
r"bst\?.*$\s+^(.*)$",
re.MULTILINE
)
INPUT_STR = r"""
stringparts.bst?
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
SPAM /198975/
stringparts.bst?
another
SPAM /.../
"""
occurrences = BST_RE.findall(INPUT_STR)
for occurrence in occurrences:
    print(occurrence)
This pattern allows additional whitespace before the \n:
r'bst\?\s*\n(.*?)\s*\n'
If you don't expect any whitespace within the string to be captured, you could use a simpler one, where \s+ consumes whitespace, including the \n, and (\S+) captures all the consecutive non-whitespace:
r'bst\?\s+(\S+)'
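Both patterns can be tried side by side. This is a minimal sketch on an assumed input that has trailing spaces after bst? and Windows-style \r\n line breaks; the names loose and simple are made up for illustration.

```python
import re

# Assumed sample: trailing spaces after "bst?" and \r\n line endings.
s = ('stringparts.bst?  \r\n'
     '765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\r\n'
     'SPAM /198975/')

# Tolerates whitespace (including \r) before each \n.
loose = re.search(r'bst\?\s*\n(.*?)\s*\n', s).group(1)

# Simpler: skip any whitespace, then take the next run of non-whitespace.
simple = re.search(r'bst\?\s+(\S+)', s).group(1)

print(loose)
print(simple)
```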
I want to print the lines between specific strings. My string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string I want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can i do this?
I tried to find the lines between two ##start and search for main.
Try something like:
def get_mains(my_string):
    section = ''
    for line in my_string.split('\n'):
        # remember the most recent ##start header
        if line[0:7] == "##start":
            section = line
            continue
        if 'main' in line:
            yield '/'.join([section, line])

for main in get_mains(my_string):
    print(main)
There is a way to do this with Python's regular expressions (regex for short), available through the re module.
Basically, regex is this whole language for searching through a string for certain patterns. If I have the string 'Hello, World', it would match the regex pattern 'llo, Wor', because it contains an ell followed by an ell followed by an o followed by a comma and a space and a capital double-you and so on. On the surface it just looks like a substring test. The real power of regex comes with special characters. If I have the string 'Hello, World' again, it also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special character that stands for any letter in the alphabet (plus a few extras). So 'Hello, Bobby', 'Hello, World', 'Hello, kitty' all match the pattern 'Hello, \w\w\w\w\w', because \w can stand in for any letter. There are many more of these 'special characters' and they are all very useful. To actually answer your question,
I constructed a pattern that matches
##start\textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about\main
which is
r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string (so we don't have to double the backslashes). Then, everything in parentheses becomes a group. Groups are pieces of text that we want to be able to recall later. There are two groups. The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is {line}*?, which means 'match anything that is a complete line, and match as few of them as necessary'. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines, and then we match any line that ends in main.
import re
# define my_string here
pattern = re.compile(r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
    string = match[0][:-1]  # don't want the trailing \n
    string += '/'
    string += match[1]
    print(string)
For your example, it outputs
##start/file1/file/images/graphs/main
##start/new/version/info/main
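Note that the second line of that output pairs the ##start/new header with a main line from the following section, because the skipped lines are allowed to run across the next ##start header. One possible adjustment, offered as a sketch rather than the original answer's method, is a negative lookahead that stops the skipped lines at the next header. The names pattern and results, and the trimmed copy of my_string, are assumptions for illustration.

```python
import re

# Copy of the question's string without the trailing "..." lines.
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
'''

# (?!##start) keeps the skipped lines from running into the next section header.
pattern = re.compile(r'(##start.*\n)(?:(?!##start).*\n)*?(.*main)')

results = [header.rstrip('\n') + '/' + line
           for header, line in pattern.findall(my_string)]
for r in results:
    print(r)
```

With this change, a section that has no main line (such as ##start/new) simply produces no match.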
So regex is pretty cool, and other languages have it too. It is a very powerful tool and well worth learning.
Also just a side note, I use the .format function, because I think it looks much cleaner and easier to read, so
'hello{line}world'.format(line=r'(?:.*\n)') just becomes evaluated to 'hello(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world
I have a pattern which is looking for word1 followed by word2 followed by word3 with any number of characters in between.
My file however contains many random newline and other white space characters - which means that between word 1 and 2 or word 2 and 3 there could be 0 or more words and/or 0 or more newlines randomly
Why isn't this code working? (It's not matching anything.)
strings = re.findall('word1[.\s]*word2[.\s]*word3', f.read())
[.\s]* - What I mean by this - find either '.'(any char) or '\s'(newline char) multiple times(*)
The reason your regex is not working is that, by default, . does not match a newline character (\n), so a pattern built on . cannot run across line breaks. (Inside square brackets, . is just a literal dot, so [.\s]* only matches dots and whitespace.)
To make . match newlines as well, pass re.DOTALL as the third argument to findall:
strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
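A minimal sketch of the difference, using a made-up string in place of f.read():

```python
import re

# Stand-in for the file contents: words and newlines between the targets.
text = "word1 foo\nbar word2\n\nbaz word3"

# Without DOTALL, . stops at every newline, so nothing matches.
no_flag = re.findall('word1.*?word2.*?word3', text)

# With DOTALL, . also matches \n and the pattern can span lines.
with_dotall = re.findall('word1.*?word2.*?word3', text, re.DOTALL)

print(no_flag)
print(with_dotall)
```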
You have two problems:
1) . doesn't mean anything special inside brackets [].
Change your [] to use () instead, like this: (.|\s)
2) \ doesn't mean what you think it does inside regular strings.
Try using raw strings:
re.findall(r'word1 ..blah..')
Notice the r prefix of the string.
Putting them together:
strings = re.findall(r'word1(.|\s)*word2(.|\s)*word3', f.read())
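However, do note that this changes the returned list: because the pattern now contains capturing groups, findall returns a tuple of the groups for each match instead of the whole matched text.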
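If the whole match is what you want back, one option is to make the groups non-capturing with (?:...). A small sketch on a made-up string; the names capturing and non_capturing are for illustration only. On large inputs the (.|\s)* construction can backtrack heavily, so this is meant for short texts.

```python
import re

# Stand-in for the file contents.
text = "word1 foo\nbar word2\n\nbaz word3"

# Capturing groups: findall returns tuples holding each group's last repetition.
capturing = re.findall(r'word1(.|\s)*word2(.|\s)*word3', text)

# Non-capturing groups: findall returns the full matched text.
non_capturing = re.findall(r'word1(?:.|\s)*word2(?:.|\s)*word3', text)

print(capturing)
print(non_capturing)
```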