Python regex replace each match with itself plus a new line - python

I have a long regex with many alternations and I want to be able to replace each match from the regex with itself followed by a new line ('\n').
What is the most efficient way to do so with re.sub()?
Here is a simple example:
s = 'I want to be able to replace many words, especially in this sentence, since it will help me solve by problem. That makes sense right?'
pattern = re.compile(r'words[,]|sentence[,]|problem[.]')
for match in matches:
    re.sub(pattern, match + '\n', match)
I know this for loop will not work, I am just hoping to clarify what I am trying to solve here. Thanks in advance for any help. I may be missing something very straightforward.

To replace a whole match with itself, you may use the replacement backreference \g<0>. However, if you also want to store the matches in a variable, pass a callback function as the replacement argument to re.sub and return the whole match value (match.group()) with a newline appended:
import re
matches = []  # Variable to hold the matches
def repl(m):  # m is a match data object
    matches.append(m.group())  # Add a whole match value
    return "{}\n".format(m.group())  # Return the match with a newline appended to it
s = 'I want to be able to replace many words, especially in this sentence, since it will help me solve by problem. That makes sense right?'
pattern = re.compile(r'words[,]|sentence[,]|problem[.]')
s = re.sub(pattern, repl, s)
print(s)
print(matches)
See the Python demo
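If the list of matches is not needed, the \g<0> backreference mentioned above is enough on its own. A minimal sketch, reusing the question's sample string and pattern (the name result is only illustrative):

```python
import re

s = ('I want to be able to replace many words, especially in this sentence, '
     'since it will help me solve by problem. That makes sense right?')
pattern = re.compile(r'words[,]|sentence[,]|problem[.]')

# \g<0> stands for the whole match; the \n in the replacement template
# is turned into a real newline by re.sub.
result = pattern.sub(r'\g<0>\n', s)
print(result)
```

This keeps everything in a single re.sub call, with no callback.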

Just like this?
text ='I want to be able to replace many words, especially in this sentence, since it will help me solve by problem. That makes sense right?'
text_list = text.replace('.', ',').strip(',|.|?').split(',')
# Turn periods into commas, strip punctuation from both ends, then split on ','
print(text_list)
for i in text_list:
    ii = i.split(',')
    print(ii)
Result
['I want to be able to replace many words', ' especially in this sentence', ' since it will help me solve by problem', ' That makes sense right']
['I want to be able to replace many words']
[' especially in this sentence']
[' since it will help me solve by problem']
[' That makes sense right']

The second parameter of re.sub can be either a string or a callable that takes the match object and returns a string. So do this:
def break_line(match):
    return match.group() + "\n"
text = re.sub(pattern, break_line, text)

Related

Python - Remove word only from within a sentence

I am trying to remove a specific word from within a sentence, which is 'you'. The code is as listed below:
out1.text_condition = out1.text_condition.replace('you','')
This works, however, it also removes it from within a word that contains it, so when 'your' appears, it removes the 'you' from within it, leaving 'r' standing. Can anyone help me figure out what I can do to just remove the word, not the letters from within another string?
Thanks!
In order to replace whole words and not substrings, you should use a regular expression (regex).
Here is how to replace a whole word with the module re:
import re
def replace_whole_word_from_string(word, string, replacement=""):
    regular_expression = rf"\b{word}\b"
    return re.sub(regular_expression, replacement, string)
string = "you you ,you your"
result = replace_whole_word_from_string("you", string)
print(result)
Output:
, your
Explanation:
The two \b are what we call "word boundaries". The advantage over str.replace is that it will take into account the punctuation too.
In order to create the regular expression, here we use Literal String Interpolation (also called "f-strings", https://www.python.org/dev/peps/pep-0498/).
To create a "f-string", we add the prefix f.
We also use the prefix r, in order to create a "raw string". We use a raw string in order to avoid escaping the backslash in \b.
Without the prefix r, we would have written regular_expression = f"\\b{word}\\b".
If you had used string.replace(' you ', ' '), you would have received this (wrong) output:
you ,you your
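If the word to remove can contain characters that are special in a regex (a dot, for instance), it is probably safer to escape it first. A small sketch along the same lines; the function name remove_whole_word is made up for this example, and it assumes the word starts and ends with a letter, digit, or underscore so that \b can anchor it:

```python
import re

def remove_whole_word(word, text):
    # re.escape neutralises regex metacharacters in `word` (hypothetical helper)
    pattern = rf"\b{re.escape(word)}\b"
    return re.sub(pattern, "", text)

print(remove_whole_word("you", "Thank you, your help was great"))
```

The surrounding spaces are left untouched, so you may still want to collapse double spaces afterwards.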
A very simple solution is to replace the word with spaces around it with one space:
out1.text_condition = out1.text_condition.replace(' you ', ' ')
But note that it wouldn't remove, for example, "you." (at the end of a sentence) or "you,", etc.
Easiest way is probably just to assume there are spaces before and after the word:
out1.text_condition = out1.text_condition.replace(' you ','')

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since ^ only negates inside brackets [], and a bracket expression matches a single character, this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no simple "not" for whole words in REs: ^ inside [ ] negates only a single character (as you found). Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.'  # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain: the RE itself (I always use raw strings) matches one or more of any character (.+), but non-greedily (?). This is captured in the first group (the first pair of parentheses). That is followed by either "going" or "you" or the end of the line ($), captured in the second group.
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that most regex engines (most of them use a traditional NFA) analyze strings character by character, which matters when you want to exclude whole words. Since you want to exclude two words with a negative lookahead, you need to define the allowed strings as words (using word boundaries). And since sub replaces each matched pattern with its replacement string, you can't just pass '_': a part like "I am" would then be replaced with 3 underscores (one each for 'I', ' ' and 'am'). Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.
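Another possible variant, not taken from the answers above, is to put the words to keep first in an alternation and let every other single character fall through to a catch-all branch. A minimal sketch for Python 3; the names keep_or_any and mask are made up for this example:

```python
import re

s = 'I am going home now, thank you.'

# Try the words to keep first; any other single character (including
# newlines, thanks to re.DOTALL) falls through to the '.' branch.
keep_or_any = re.compile(r'(going|you)|.', re.DOTALL)

def mask(m):
    # group(1) is set only when one of the kept words matched
    return m.group(1) if m.group(1) else '_'

result = keep_or_any.sub(mask, s)
print(result)
```

As written, this would also keep "you" inside a longer word such as "your"; wrapping the alternatives in \b...\b should restrict it to whole words.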

Print the line between specific pattern

I want to print the lines between specific string, my string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string i want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can i do this?
I tried to find the lines between two ##start and search for main.
Try something like:
def get_mains(my_string):
    section = ''
    for line in my_string.split('\n'):
        if line.startswith("##start"):
            section = line  # remember the most recent ##start header
            continue
        if 'main' in line:
            yield '/'.join([section, line])
for main in get_mains(my_string):
    print(main)
There is a way to do this with Python's regular expression module, re (regular expressions are called regex for short).
Basically, regex is a whole language for searching through a string for certain patterns. If I have the string 'Hello, World', it would match the regex pattern 'llo, Wor', because it contains an ell followed by an ell followed by an o followed by a comma and a space and a capital double-you and so on. On the surface it just looks like a substring test. The real power of regex comes with special characters. If I have the string 'Hello, World' again, it also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special sequence that stands for any letter in the alphabet (plus digits and the underscore). So 'Hello, Bobby', 'Hello, World', 'Hello, kitty' all match the pattern 'Hello, \w\w\w\w\w', because \w can stand in for any letter. There are many more of these special sequences and they are all very useful. To actually answer your question,
I constructed a pattern that matches
##start\textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about\main
which is
r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string (so we don't have to double the backslash in \n). Then, everything in parentheses becomes a group (except (?:...), which groups without capturing). Groups are pieces of text that we want to be able to recall later. There are two groups. The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is {line}*?, which means 'match anything that is a complete line, and match as few of them as needed'. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines, and then we match any line that ends in main.
import re
# define my_string here
pattern = re.compile(r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
    string = match[0][:-1]  # don't want the trailing \n
    string += '/'
    string += match[1]
    print(string)
For your example, it outputs
##start/file1/file/images/graphs/main
##start/new/version/info/main
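Note that the second output line joins ##start/new to a main line from a later section, because the lines matched in between can themselves be ##start headers. One way to keep each match inside its own section, sketched here under the assumption that every section begins with a line starting with ##start, is to forbid the in-between lines from starting with ##start (the names body_line and results are made up for this example):

```python
import re

my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
'''

# A body line is any complete line that does not open a new ##start section.
body_line = r'(?:(?!##start).*\n)'
pattern = re.compile(
    r'^(##start.*)\n{body}*?((?!##start).*main)$'.format(body=body_line),
    re.MULTILINE,
)

# Join the header and the matching line with '/'
results = ['/'.join(m.groups()) for m in pattern.finditer(my_string)]
for line in results:
    print(line)
```

With the sample string this should pair each main line with the header of its own section, which is what the question asked for.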
So regex is pretty cool, and other languages have it too. It is a very powerful tool, and you should learn how to use it.
Also, just a side note: I use the .format function because I think it looks much cleaner and easier to read, so
'hello{line}world'.format(line=r'(?:.*\n)') simply evaluates to 'hello(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world

Match single quotes from python re

How can I match the following? I want all the names within the single quotes:
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How do I extract only the names within the single quotes? I tried:
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
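To see what the \B guards buy you, here is a small comparison on a made-up sentence (the sample text and variable names are assumptions for illustration only):

```python
import re

text = "Rock'n'Roll fans like 'Tom' and 'Harry'."

# Without the guards, the 'n' inside Rock'n'Roll is picked up as well.
plain = re.findall(r"'\w+'", text)

# With \B on both sides, a quote glued to a word character is rejected.
guarded = re.findall(r"\B'\w+'\B", text)

print(plain)
print(guarded)
```

The guarded pattern only accepts quotes that have a non-word character (or the string edge) on their outer side.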
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve; this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"

python regular expression replace

I'm trying to change a string that contains substrings such as
the</span></p>
<p><span class=font7>currency
to
the currency
The line break is CRLF.
The words before and after the code change. I only want to replace if the second word starts with a lower case letter. The only thing that changes in the code is the digit after 'font'
I tried:
p = re.compile('</span></p>\r\n<p><span class=font\d>([a-z])')
res = p.sub(' \1', data)
but this isn't working
How should I fix this?
Use a lookahead assertion.
p = re.compile(r'</span></p>\r\n<p><span class=font\d>(?=[a-z])')
res = p.sub(' ', data)
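For comparison, the capturing-group attempt from the question most likely fails because ' \1' in a normal (non-raw) string is the control character chr(1), not a backreference. A short sketch showing both routes on a sample string modelled on the question (the data value and variable names are assumptions for illustration):

```python
import re

data = "the</span></p>\r\n<p><span class=font7>currency"

# Capturing-group route: the replacement must be a raw string (or '\\1'),
# otherwise Python turns '\1' into chr(1) before re ever sees it.
p_group = re.compile(r'</span></p>\r\n<p><span class=font\d>([a-z])')
res_group = p_group.sub(r' \1', data)

# Lookahead route: the lowercase letter is checked but not consumed,
# so the replacement is just a space.
p_look = re.compile(r'</span></p>\r\n<p><span class=font\d>(?=[a-z])')
res_look = p_look.sub(' ', data)

# A capitalised second word is left alone by both patterns.
capitalised = "the</span></p>\r\n<p><span class=font7>Currency"
res_capitalised = p_look.sub(' ', capitalised)

print(res_group)
print(res_look)
```

Both routes give the same result here; the lookahead simply avoids having to put the letter back.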
I think you should use the flag re.DOTALL, which makes the dot (.) match any character at all, including a newline, which it does not match by default.
So, the first line of your code would become:
p = re.compile(r'</span></p>..<p><span class=font\d>([a-z])', re.DOTALL)
(note the two unescaped dots instead of the line break).
Actually, there is also re.MULTILINE; every time I have a problem like this, one of those ends up solving it.
Hope it helps.
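A quick way to check the effect of re.DOTALL on a CRLF line break, using a tiny made-up string rather than the real markup:

```python
import re

data = "a\r\nb"  # 'a', carriage return, newline, 'b'

# By default '.' matches '\r' but not '\n', so two dots cannot span CRLF.
without_flag = re.search(r'a..b', data)

# With re.DOTALL the dot matches the newline too.
with_flag = re.search(r'a..b', data, re.DOTALL)

print(without_flag)
print(with_flag is not None)
```

So the two-dot trick depends on the flag; spelling out \r\n in the pattern works without it.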
This :
result = re.sub("(?si)(.*?)</?[A-Z][A-Z0-9]*[^>]*>.*</?[A-Z][A-Z0-9]*[^>]*>(.*)", r"\1 \2", subject)
Applied to :
the</span></p>
<p><span class=font7>currency
Produces :
the currency
That said, I would strongly advise against using regex with XML/HTML/XHTML. This generic regex will remove all elements and capture any text before and after them in groups 1 and 2.
