Replace all text between 2 strings python - python

Let's say I have:
a = r''' Example
This is a very annoying string
that takes up multiple lines
and h#s a// kind{s} of stupid symbols in it
ok String'''
I need a way to replace (or just delete) any text between "This" and "ok" so that, after I call it, a equals:
a = "Example String"
I can't find any wildcards that seem to work. Any help is much appreciated.

You need Regular Expression:
>>> import re
>>> re.sub('\nThis.*?ok','',a, flags=re.DOTALL)
' Example String'

Another method is to use string splits:
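def replaceTextBetween(originalText, delimiterA, delimiterB, replacementText):
    # Everything before the first occurrence of delimiterA
    leadingText = originalText.split(delimiterA)[0]
    # Everything between the first and second occurrence of delimiterB
    trailingText = originalText.split(delimiterB)[1]
    return leadingText + delimiterA + replacementText + delimiterB + trailingText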
Limitations:
Does not check if the delimiters exist
Assumes that there are no duplicate delimiters
Assumes that delimiters are in correct order
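A possible way around these limitations is str.partition, which reports whether the delimiter was found at all. The following is only a sketch: the name replace_between and the choice to raise ValueError are assumptions of mine, not part of the answer above. It uses the first start delimiter and the first end delimiter that follows it.

```python
def replace_between(text, start, end, replacement):
    """Replace whatever lies between the first `start` and the next `end`."""
    # str.partition returns (before, separator, after); the separator is ''
    # when the delimiter does not occur in the string.
    before, found_start, rest = text.partition(start)
    if not found_start:
        raise ValueError(f"start delimiter {start!r} not found")
    # Search for the end delimiter only in the text after the start delimiter,
    # so the two are guaranteed to be in the right order.
    _, found_end, after = rest.partition(end)
    if not found_end:
        raise ValueError(f"end delimiter {end!r} not found after start")
    return before + start + replacement + end + after


a = " Example\nThis is a very annoying string\nok String"
print(repr(replace_between(a, "This", "ok", " ... ")))
# ' Example\nThis ... ok String'
```

Like the split version, this keeps both delimiters in the result; drop `start` and `end` from the return value if you want them removed as well.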

The DOTALL flag is the key. Ordinarily, the '.' character doesn't match newlines, so you don't match across lines in a string. If you set the DOTALL flag, re will match '.*' across as many lines as it needs to.
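To see the effect of the flag, here is a small comparison using a shortened version of the string from the question (the variable names are mine):

```python
import re

a = " Example\nThis is a very annoying string\nthat takes up multiple lines\nok String"

# Without DOTALL, '.' stops at the first newline, so the pattern never
# reaches "ok" and the string comes back unchanged.
without_flag = re.sub(r"\nThis.*?ok", "", a)

# With DOTALL, '.' also matches '\n', so the match can span several lines.
with_flag = re.sub(r"\nThis.*?ok", "", a, flags=re.DOTALL)

print(without_flag == a)  # True
print(repr(with_flag))    # ' Example String'
```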

Use re.sub: it replaces two delimiters (characters, symbols, or strings) and everything between them with a replacement of your choice.
Format: re.sub('A(.*?)B', P, Q, flags=re.DOTALL)
where
A : the starting character, symbol, or string
B : the ending character, symbol, or string
P : the character, symbol, or string that replaces A, B, and the text between them
Q : the input string
re.DOTALL : lets '.' match newlines, so the match can span lines
If A or B contains regex metacharacters, pass it through re.escape() first.
import re
re.sub('\nThis(.*?)ok', '', a, flags=re.DOTALL)
output : ' Example String'
Let's see an example with HTML as input:
input_string = '''<body> <h1>Heading</h1> <p>Paragraph</p><b>bold text</b></body>'''
Target : remove the <p> element
re.sub('<p>(.*?)</p>', '', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> <b>bold text</b></body>'
Target : replace the <p> element with the word "test"
re.sub('<p>(.*?)</p>', 'test', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> test<b>bold text</b></body>'

a=re.sub('This.*ok','',a,flags=re.DOTALL)

If you want first and last words:
re.sub(r'^\s*(\w+).*?(\w+)$', r'\1 \2', a, flags=re.DOTALL)


find and replace the word on a string based on the condition

Hello, helpful people of Stack Overflow,
I have a couple of questions about manipulating a string in Python.
First question:
If I have a string like:
"What's the use?"
and I want to locate the first letter after 'the'
(in "What's the use?" that letter is u),
what is the best way to do it?
Second question:
If I want to change something in this string based on the letter I found in the first question,
how could I do it?
Thanks for helping!
You could use a regex replacement to remove all content up to and including the first 'the' (along with any following whitespace). Then, just access the first character of the result.
import re
inp = "What's the use?"
inp = re.sub(r'^.*?\bthe\b\s*', '', inp)
print("First character after first 'the' is: " + inp[0])
This prints:
First character after first 'the' is: u
Another re take:
import re
sample = "What is the use?"
pattern = r"""
    (?<=\bthe\b)  # look-behind to ensure 'the' is there; it is not part of the match
    \s+           # one or more whitespace characters
    (\w)          # exactly one alphanumeric or underscore character
"""
# re.X is verbose mode, which allows whitespace and comments inside the pattern
m = re.search(pattern, sample, flags=re.X)
if m is not None:
    print(f"First character after first 'the' is: {m.group(1)}")
You can find the index of 'u' with the str.index() method, then extract the parts of the string before and after it by slicing.
s = "What's the use?"
character_index = s.lower().index('the ') + 4
print(character_index)
# 11
print(s[:character_index] + '*' + s[character_index+1:])
# What's the *se?
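For the second question, one option is to let re.sub() call a function on the match, so the change can depend on the letter that was found. This is a rough sketch: the name change_letter_after_the and the default transform (uppercasing) are assumptions chosen for illustration.

```python
import re


def change_letter_after_the(text, transform=str.upper):
    # Group 1 keeps 'the' plus the whitespace after it; group 2 is the
    # first word character that follows.
    pattern = re.compile(r"(\bthe\s+)(\w)", flags=re.IGNORECASE)
    # count=1 limits the change to the first occurrence of 'the'.
    return pattern.sub(lambda m: m.group(1) + transform(m.group(2)), text, count=1)


print(change_letter_after_the("What's the use?"))
# What's the Use?
```

Any function that takes the letter and returns a string can be passed as transform, for example lambda ch: '*' to reproduce the masked output shown above.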

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove the quotes that don't follow a letter, then remove the quotes that don't precede a letter (thus keeping only the ones that both follow and precede a letter):
import re
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two approaches produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\'  match any quote not preceded by a word character
# |          or
# \'(?!\w)   match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s))  # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
Here, ' is matched unless it sits between two word characters (letters, digits or _); a quote that is missing a word character on either side is removed.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
Here, ' is matched unless it sits between two letters (ASCII letters in the first pattern, any Unicode letters in the second); see the last item of the demo text below for the difference.
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    return any(ch.isalnum() for ch in text)
def stripit(text, ofwhat):
    # Strip every unwanted character from both ends of the word in one go,
    # regardless of the order in which the characters are listed
    return text.strip(ofwhat)
def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    lines = text.splitlines()
    lines = [" ".join(stripit(word, notwanted) for word in line.split() if istext(word)) for line in lines]
    return "\n".join(lines)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also collapses every run of whitespace within a line into a single space, and drops words that contain no alphanumeric characters at all.

Replace substrings with items from list

Basically, I have a string that has multiple double-whitespaces like this:
"Some text\s\sWhy is there no punctuation\s\s"
I also have a list of punctuation marks that should replace the double-whitespaces, so that the output would be this:
puncts = ['.', '?']
# applying some function
# output:
>>> "Some text. Why is there no punctuation?"
I have tried re.sub(' +', puncts[i], text) but my problem here is that I don't know how to properly iterate through the list and replace the 1st double-whitespace with the 1st element in puncts, the 2nd double-whitespace with the 2nd element in puncts and so on.
If we're still using re.sub(), here's one possible solution that follows this basic pattern:
Get the next punctuation character.
Replace only the first remaining double-whitespace in text with it.
import re
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s(?=\s)', i, text, count=1)
The call to re.sub() returns a string, and basically says "find all series of two whitespace characters, but only replace the first whitespace character with a punctuation character." The final argument "1" makes it so that we only replace the first instance of the double whitespace, and not all of them (default behavior).
If the positive lookahead (the part of the regex that we want to match but not replace) confuses you, you can also do without it:
import re
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s\s', i + " ", text, count=1)
This yields the same output.
There will be a leftover whitespace character at the end of the sentence, but if that bothers you, a simple text.rstrip() should take care of it.
Further explanation
Your first try, the regex ' +', doesn't work because it matches every run of one or more spaces, including the single spaces between words, so all of them get replaced with a punctuation character. The above solutions require two whitespace characters in their respective regexes.
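You can do it simply using the replace method!
text = "Some text  Why is there no punctuation  "
puncts = ['.', '?']
for i in puncts:
    text = text.replace("  ", i, 1)  # notice the 1 here: only the first double space is replaced
print(text)
Output : Some text.Why is there no punctuation?
Replace with i + " " instead of i if you want to keep a space after each punctuation mark.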
You can use re.split() to break the string into substrings between the double spaces and intersperse the punctuation marks using join:
import re
string = "Some text  Why is there no punctuation  "
iPunct = iter([". ","? "])
result = "".join(x+next(iPunct,"") for x in re.split(r"\s\s",string))
print(result)
# Some text. Why is there no punctuation?
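Yet another variant, in case a single pass is preferred: re.sub() also accepts a function as the replacement, and that function can pull the next mark from an iterator. Treat this as a sketch; the names punctuate and replacement are made up here, and leaving any extra double spaces untouched once the list runs out is my own assumption about the desired behaviour.

```python
import re


def punctuate(text, puncts):
    marks = iter(puncts)

    def replacement(match):
        # Take the next punctuation mark; once the list is exhausted,
        # leave the double whitespace as it was.
        mark = next(marks, None)
        return match.group(0) if mark is None else mark + " "

    # rstrip() drops the trailing space left after the last mark.
    return re.sub(r"\s\s", replacement, text).rstrip()


print(punctuate("Some text  Why is there no punctuation  ", [".", "?"]))
# Some text. Why is there no punctuation?
```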

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
This treats ' .,()_ as a single pattern and looks for that whole sequence to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, so we don't have to write '| |.|,|(|..., there's a nice form where you can use []s to state that everything inside should be treated as "match one of these".
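The difference is easy to check on the string from the question. In this short sketch the variable names are mine, and the empty strings are filtered out because adjacent delimiters such as ", " produce them:

```python
import re

row = "feature.append(freq_and_feature(text, freq))"

# Without brackets the pattern has to match as one sequence, which never
# occurs in `row`, so the whole string comes back as a single item.
as_sequence = re.split('\' .,()_', row)

# With brackets, any single listed character acts as a delimiter.
as_class = [part for part in re.split('[\' .,()_]', row) if part]

print(as_sequence)
# ['feature.append(freq_and_feature(text, freq))']
print(as_class)
# ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```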
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches one or more characters that are either non-word characters (\W, which for ASCII text is [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
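Put together, that variant might look like the following sketch (the variable names trimmed and words are mine):

```python
import re

s = "feature.append(freq_and_feature(text, freq))"

# Remove leading and trailing delimiter runs first, so that re.split
# produces no empty strings at either end.
trimmed = re.sub(r"^[\W_]+|[\W_]+$", "", s)
words = re.split(r"[\W_]+", trimmed)

print(words)
# ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```

One caveat: if the input contains no word characters at all, trimmed is empty and re.split returns [''], so the filter may still be the safer choice for arbitrary input.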
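You can try this:
parts = re.split('[.(_,)]+', row)
parts.pop()  # drop the trailing piece left after the closing parentheses
print(parts)
This results in the following (note the leading space in ' freq', because the space is not in the character class):
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']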
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
For ASCII text, [^A-Za-z0-9] is equivalent to [\W_].
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))

Python regex substituting in only the first character when compiling from a list

I'm creating a django filter for inserting 'a' tags into a given string from a list.
This is what I have so far:
def tag_me(text):
    tags = ['abc', 'def', ...]
    tag_join = "|".join(tags)
    regex = re.compile(r'(?=(.))(?:'+ tag_join + ')', flags=re.IGNORECASE)
    return regex.sub(r'\1', text)
Example:
tag_me('some text def')
Returns:
'some text d'
Expected:
'some text def'
The issue lies in the regex.sub as it matches but returns only the first character. Is there a problem with the way I'm capturing/using \1 on the last line ?
Note that the sequence (?: ...) in the question specifically turns off capture. See the re documentation (about a fifth of the way through the page), which (with emphasis added) says:
(?:...) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
As noted in previous answer, '('+ tag_join + ')' works, or use the suggested "|".join(re.escape(tag) for tag in tags) version if escapes are used in the target text.
You're capturing the (.) part, which is only one character.
I'm not sure I follow your regular expression - the simplified version r'('+ tag_join + ')' works fine for your example.
Note that if there's a chance of anything other than alphanumeric characters in your tag names, you'll want to do this:
tag_join = "|".join(re.escape(tag) for tag in tags)
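To illustrate why the escaping matters, here is a small comparison with tag names that contain a dot. The tag names and the bracketed replacement are hypothetical, chosen only to make the matches visible:

```python
import re

tags = ["node.js", "asp.net"]  # hypothetical tag names containing '.'

# Unescaped, '.' is a wildcard that matches any character.
naive = re.compile("(" + "|".join(tags) + ")", flags=re.IGNORECASE)
# Escaped, '.' only matches a literal dot.
safe = re.compile("(" + "|".join(re.escape(tag) for tag in tags) + ")", flags=re.IGNORECASE)

text = "nodexjs is not Node.js"
naive_result = naive.sub(r"[\1]", text)
safe_result = safe.sub(r"[\1]", text)

print(naive_result)  # [nodexjs] is not [Node.js]
print(safe_result)   # nodexjs is not [Node.js]
```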
Simply do
import re
def tag_me(text):
    tags = ['abc', 'def']
    # "|".join(tags).join('()') wraps the alternation in parentheses: '(abc|def)'
    reg = re.compile("|".join(tags).join('()'),
                     flags=re.IGNORECASE)
    return reg.sub(r'\1', text)
print(' %s' % tag_me('some text def'))
print('wanted: some text def')
The problem arose because you wrote a non-capturing group (?:...), which then forced you to put that awkward (?=(.)) in front of it just to have something to capture.
This should do it
def tag_me(text):
    tags = ['abc', 'def', ]
    tag_join = "|".join(tags)
    pattern = r'('+tag_join+')'
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return regex.sub(r'\1', text)
