I have text with excessive line breaks that I want to remove. The goal is to remove the single line breaks (\n) but leave the double line breaks (\n\n), which indicate a new paragraph.
I created this regex to isolate the single breaks and tried substituting nothing, a space, and even a backspace ('\b'), but nothing works. The goal is for a sentence NOT to break on a single \n, so that it continues naturally on screen (or word-wraps on its own) instead of being forced onto a new line. The consecutive line breaks \n\n (see the end of the sentence) are OK.
I added the * * around the line breaks so you can see them more easily. The regex is supposed to capture a single \n (\\n) only when it is preceded by 2 consecutive letters: (?<=[a-z][a-z])
text = "more information*\n*on options concepts and strategies.*\n* Also,*\n* George Fontanills publishes*\n*several options *\n*learning tools that deal*\n*primarily with the Delta Neutral approach.*\n\n*Page 14 shows and example of the tools"
text1= re.sub( r"(?<=[a-z][a-z])(\\n)" , " ", text)
import re
text = "more information*\n*on options concepts and strategies. Also, George Fontanills publishes*\n*several options learning tools that deal*\n*primarily with the Delta Neutral approach.*\n\n*Page 14 shows and example of the tools a\n"
text = text.replace("*", "")
text1= re.sub(r'(?<=[a-z., ]{2})\n(?!\n)', '', text)
print(text1)
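Explanation:
(?<=[a-z., ]{2}) is a positive lookbehind: the newline must be preceded by exactly 2 characters from the set [a-z., ] (a lowercase letter, period, comma, or space). {2} is a quantifier that matches exactly 2 times.
\n matches a line-feed (newline) character (ASCII 10).
(?!\n) is a negative lookahead: it asserts that the newline is not followed by another \n, so the first newline of a \n\n pair is left alone.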
You could use {amount} to indicate how many \n characters you want to remove:
text = text.replace('*', '')
re.sub(r"(?<=\w{2})(\n{1})", " ", text)
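A minimal sketch of another route, assuming a newline counts as "single" whenever no other newline sits directly before or after it, whatever the surrounding letters are. The sample string and the name joined are made up for illustration.

```python
import re

# Hypothetical sample: single breaks inside a paragraph, a double break between paragraphs.
text = "more information\non options concepts\nand strategies.\n\nPage 14 shows an example"

# (?<!\n) : not preceded by a newline
# (?!\n)  : not followed by a newline
# Together they select only isolated newlines and leave \n\n pairs untouched.
joined = re.sub(r"(?<!\n)\n(?!\n)", " ", text)

print(joined)
```

Replacing with a space rather than an empty string keeps the last word of one line from being glued to the first word of the next.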
Related
I want to replace every dot with \n, except the ones between digits or followed by specific text, using Python.
Input: I have a meeting at 8.30. I will be at meet.com. Bye.
Output: I have a meeting at 8.30 \n I will be at meet.com \n Bye \n
Here is the code I tried:
def replace_dot_for_original_sentence(text):
    dot = "."
    for char in text:
        if char in dot:
            if not re.match(r'(?<=\d)\.(?=\d)', text) or not re.match(r'\.(com|org|net|co|id)', text):
                text = text.replace(char, "\n")
    return text
What it does is replace all the dots with \n. I have also tried using re.search. I think there's a problem in the if conditions? Any ideas?
We can try using re.sub here:
inp = "I have a meeting at 8.30. I will be at meet.com. Bye."
output = re.sub(r'\.(?:\s+|$)', ' \n ', inp)
print(output) # I have a meeting at 8.30 \n I will be at meet.com \n Bye \n
Try this pattern:
import re
def replace_dot_for_original_sentence(text):
    text = re.sub(r'\.\s+|\.$', '\n', text)
    return text
print(replace_dot_for_original_sentence('I have a meeting at 8.30. I will be at meet.com. Bye.'))
Output
I have a meeting at 8.30
I will be at meet.com
Bye
re.match only tries the pattern at the very beginning of text. The lookbehind (?<=\d) needs a digit before that position, and nothing precedes the start of the string, so re.match(r'(?<=\d)\.(?=\d)', text) is always None and not re.match(r'(?<=\d)\.(?=\d)', text) is always True.
Similarly, not re.match(r'\.(com|org|net|co|id)', text) is always True, unless the text starts with something like .com.
You then proceed to text = text.replace(char, "\n"), which acts on the entire text. So even if your condition worked, the first period that needed replacing would still cause every period in the text to be replaced.
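A small sketch of the difference between re.match and re.search for this pattern, using the sentence from the question. The variable names are only for illustration.

```python
import re

text = "I have a meeting at 8.30. I will be at meet.com. Bye."
pattern = r"(?<=\d)\.(?=\d)"

# re.match tries the pattern only at index 0, where the lookbehind cannot succeed.
at_start = re.match(pattern, text)

# re.search scans the whole string and finds the period inside 8.30.
anywhere = re.search(pattern, text)

print(at_start)                                    # None
print(anywhere.start() == text.index("8.30") + 1)  # True
```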
If you want every period that's not followed by any of com|org|net|co|id and also not followed by \d (because you do want to replace the period after 8.30, and probably also don't want to replace the . in something like '$.30', for example), this works:
def replace_dot_for_original_sentence(text):
    return re.sub(r"(?s)\.(?!\d)(?!com|org|net|co|id)", "\n", text)
The whole for loop thing doesn't do anything, it just ensures your code only runs if there's a period in the string in a very roundabout way.
Note that the expression still needs some work. For example, because you have co in there, .coupons, .cooking and .courses (to name but a few) are now matched and skipped as well, while stuff like .co.uk still gets cut off in the middle.
If it works for your data set, great. But don't view it as even a halfway decent way to detect the end of URLs.
You can put the two conditions inside a single negative lookahead like this:
\.(?!(?:com?|org|net|id)\b|(?<=\d\.)\d)
Nothing prevents you from putting a lookbehind inside it to check the condition with digits.
def replace_dot_for_original_sentence(text):
    return re.sub(r'\.(?!(?:com?|org|net|id)\b|(?<=\d\.)\d)', "\n", text)
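A quick check of how the combined lookahead behaves, as a sketch. The helper name replace_dots and the two extra sample strings are made up for illustration; only the first sentence comes from the question.

```python
import re

def replace_dots(text):
    # Same pattern as above: skip a period followed by a listed suffix (as a whole word),
    # or a period that sits between two digits.
    return re.sub(r"\.(?!(?:com?|org|net|id)\b|(?<=\d\.)\d)", "\n", text)

samples = [
    "I have a meeting at 8.30. I will be at meet.com. Bye.",
    "see site.coupons",   # \b stops 'co' from matching inside 'coupons'
    "example.co.uk",      # '.co' is kept, but '.uk' is not in the list
]

for sample in samples:
    print(repr(replace_dots(sample)))
```

The \b keeps .coupons from being mistaken for .co, but a two-part suffix such as .co.uk is still split at its second period, so the list of suffixes would need extending for data like that.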
I would like to delete whatever text follows a hash sign (#) on a line. This should happen on every line that contains a hash sign, for example:
abcde#efg hijk
aaaabbbcc
ghij#kloa.bcd
It will look like this
abcde#
aaaabbbcc
ghij#
I wrote the code below with re.findall, but when it finds whitespace, it does not delete the rest. Look:
text = 'abcde#efg hijk \n\n ghij#kloa.bcd'
result=re.findall(r'#(\w+.\w+\s+)', text)
>>['efg hijk \n\n ']
Does anyone have any ideas?
I'd use
re.findall(r'^.*?(?:$|#)', text, re.M)
to match all of the substrings you want to keep and
re.findall(r'(?<=#).*$', text, re.M)
to match all of the substrings you want to reject.
Both use the MULTILINE flag and end-of-line $ or # characters as boundaries.
Use caution when there are multiple #s in a line.
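If the goal is to produce the cleaned text directly, re.sub may be simpler than collecting pieces with re.findall. A minimal sketch, assuming everything from the first # on a line should go and the # itself should stay:

```python
import re

text = "abcde#efg hijk\naaaabbbcc\nghij#kloa.bcd"

# '.' does not match a newline by default, so '#.*' stops at the end of each line.
# Everything from the first '#' on a line is replaced by a bare '#'.
result = re.sub(r"#.*", "#", text)

print(result)
```

With several # characters on one line, this keeps only the text up to the first one.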
I have this string: "!#€f#$#"
I want to use regex to remove all special characters at the beginning and the end, stopping at the first character that is excluded from removal. Let's say the characters ["$", "€"] are excluded; the result should then be "€f#$". Also, I have different lists of excluded characters for the beginning and for the end.
text = "!#€f#$#"
newtext = re.sub("\W*$", "", text)
This only affects the ending characters, and it removes ALL special characters without exceptions.
You may use
import re
text = "!#€f#$#"
newtext = re.sub(r"^[^\w$€]+|[^\w$€]+$", "", text)
print(newtext)
Details
^[^\w$€]+ - start of string (^) and 1 or more chars other than word chars, $ and € ([^\w$€]+)
| - or
[^\w$€]+$ - 1 or more chars other than word chars, $ and € ([^\w$€]+), and end of string ($).
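The question also mentions separate exclusion lists for the start and the end. A sketch of one way to build the pattern from two lists; the function name strip_specials and its parameters are hypothetical, and re.escape is used so that characters such as $, ^, ] or - are safe inside the character class.

```python
import re

def strip_specials(text, keep_at_start, keep_at_end):
    """Strip non-word characters from both ends, stopping at a kept character."""
    # Escape each kept character so it is treated literally inside [...].
    start_keep = "".join(re.escape(c) for c in keep_at_start)
    end_keep = "".join(re.escape(c) for c in keep_at_end)
    pattern = rf"^[^\w{start_keep}]+|[^\w{end_keep}]+$"
    return re.sub(pattern, "", text)

print(strip_specials("!#€f#$#", ["$", "€"], ["$", "€"]))
print(strip_specials("!#€f#$#", ["€"], ["#"]))
```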
I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that come two lines below it, in one capture (I can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
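The point about anchors can be checked with a tiny sketch. The two-line string is made up for illustration; ^ and $ are zero-width, so the newline between the lines has to be matched explicitly.

```python
import re

text = "ab\ncd"

# $ and ^ only assert positions; nothing here consumes the '\n' between the lines.
without_newline = re.search(r"^ab$^cd$", text, re.MULTILINE)

# Matching the newline explicitly lets the pattern cross the line boundary.
with_newline = re.search(r"^ab$\n^cd$", text, re.MULTILINE)

print(without_newline)               # None
print(with_newline.group() == text)  # True
```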
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("", sequence)
...     print("Title:", title)
...     print("Sequence:", sequence)
...     print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (any character except a newline) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
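A sketch of that substitution, tried here only on text with \r\n line endings. The name NL and the sample text are made up for illustration. Keep in mind that in MULTILINE mode Python's ^ only recognizes \n as the start of a new line, so text with bare \r endings may need different handling.

```python
import re

NL = r"(?:\n|\r\n?)"  # matches \n, \r or \r\n

# Same expression as above with every \n replaced by NL.
rx_sequence = re.compile(rf"^(.+?){NL}{NL}((?:[A-Z]+{NL})+)", re.MULTILINE)
rx_blanks = re.compile(r"\W+")  # to remove blanks and newlines

text = "Title one\r\n\r\nAAA\r\nBBB\r\n\r\nTitle two\r\n\r\nCCC\r\n"

records = [
    (title.strip(), rx_blanks.sub("", sequence))
    for title, sequence in rx_sequence.findall(text)
]
print(records)
```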
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline()  # read 1st line
        aminoacid_sequence = sequence_file.read()  # read the rest
    # some cleanup, if necessary
    title = title.strip()  # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ", "").replace("\n", "")
    return title, aminoacid_sequence
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
#NOTE can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
It can sometimes be convenient to specify the flag directly inside the string, as an inline flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
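A short sketch showing that the inline form behaves like passing re.MULTILINE. The sample text is made up for illustration. Note that from Python 3.11 on, a global inline flag such as (?m) has to appear at the very start of the pattern.

```python
import re

text = "first\nA complete line\nlast"

# Flag passed as an argument.
with_flag = re.search(r"^A complete line$", text, re.MULTILINE)

# Same flag written inside the pattern itself.
inline = re.search(r"(?m)^A complete line$", text)

# Without the flag, ^ and $ only match at the ends of the whole string.
no_flag = re.search(r"^A complete line$", text)

print(with_flag.span() == inline.span())  # True
print(no_flag)                            # None
```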
My preference.
lineIter = iter(aFile)
for line in lineIter:
    if line.startswith(">"):
        someVaryingText = line
        break
# the line right after the header is expected to be blank
assert len(next(lineIter).strip()) == 0
acids = []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append(line)
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.
Define a paragraph as a multi-line string delimited on both sides by double newlines ('\n\n'). If there exists a paragraph that contains a certain string ('BAD'), I want to replace that paragraph (i.e. any text containing BAD up to the closest preceding and following double newlines) with some other token ('GOOD'). This should be done with a Python 3 regex.
I have text such as:
dfsdf\n
sdfdf\n
\n
blablabla\n
blaBAD\n
bla\n
\n
dsfsdf\n
sdfdf
should be:
dfsdf\n
sdfdf\n
\n
GOOD\n
\n
dsfsdf\n
sdfdf
Here you are:
/\n\n(?:[^\n]|\n(?!\n))*BAD(?:[^\n]|\n(?!\n))*/g
OK, to break it down a little (because it's nasty looking):
\n\n matches two literal line breaks.
(?:[^\n]|\n(?!\n))* is a non-capturing group that matches either a single non-line break character, or a line break character that isn't followed by another. We repeat the entire group 0 or more times (in case BAD appears at the beginning of the paragraph).
BAD will match the literal text you want. Simple enough.
Then we use the same construction as above, to match the rest of the paragraph.
Then, you just replace it with \n\nGOOD, and you're off to the races.
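The breakdown above, translated into a Python 3 sketch. The variable names are made up for illustration. Because the pattern starts with \n\n, a BAD paragraph at the very beginning of the text would not be matched by it.

```python
import re

text = "dfsdf\nsdfdf\n\nblablabla\nblaBAD\nbla\n\ndsfsdf\nsdfdf"

# Any run of characters that never crosses a double newline.
inside = r"(?:[^\n]|\n(?!\n))*"

# Two newlines, paragraph text up to BAD, then the rest of the paragraph.
pattern = re.compile(r"\n\n" + inside + "BAD" + inside)

# Put the paragraph break back in front of the replacement token.
replaced = pattern.sub("\n\nGOOD", text)

print(repr(replaced))
```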
Firstly, you're mixing actual newlines and '\n' characters in your example; I assume you meant only one of the two. Secondly, let me challenge your assumption that you need regex for this:
inp = '''dfsdf
sdadf

blablabla
blaBAD
bla

dsfsdf
sdfdf'''
replaced = '\n\n'.join(['GOOD' if 'BAD' in k else k for k in inp.split('\n\n')])
The result is
print(repr(replaced))
'dfsdf\nsdadf\n\nGOOD\n\ndsfsdf\nsdfdf'