I want to replace only specific word in one string. However, some other words have that word inside but I don't want them to be changed.
For example, for the below string I only want to replace x with y in z string. how to do that?
x = "112-224"
y = "hello"
z = "This is the number 112-224 not #112-224"
When I do re.sub(r'\b' + x + r'\b', y, z) I am getting 'This is the number hello not #hello'. So basically doesn't work with this regex. I am really not good with this regex stuff. What's the right way to do that? so, i can get This is the number hello not #112-224.
How about this:
pattern = r'(?<=[\w\s\n\r\^])'+x+r'(?=[\w\s\n\r$])'
With the complete code being:
x = "112-234"
y = "hello"
z = "112-234this is 112-234 not #112-234"
pattern = r'(?<=[\w\s\n\r\^])'+x+r'(?=[\w\s\n\r$])'
Here, I'm using a positive lookbehind and a positive lookahead in regex, which you can learn more about here
The regex states the match should be preceded by a word character, space, newline or the start of the line, and should be followed by a space, word character newline or the end of the line.
Note: Don't forget to escape out the carat ^ in the lookbehind, otherwise you'll end up negating everything in the square brackets.
Using a lookahead:
re.sub("\d{3}-\d{3}(?=\s)",y,z)
'This is the number hello not #112-224'
The above assumes that the digits will always be at most three.
Alternatively:
re.sub("\d.*\d(?=\s)","hello",z)
'This is the number hello not #112-224'
Related
New to regex.
Aim- To match a whole word which might have either '.' or '-' with it at the end. I want to keep it for the .start() and .end() position calculation.
txt = "The indian in. Spain."
pattern = "in."
x = re.search(r"\b" + pattern + r"\b" , txt)
print(x.start(), x.end())
I want the position for 'in.' word, as highlighted "The indian in. Spain.". The expression I have used gives error for a Nonetype object.
What would be the expression to match the '.' in the above code? Same if '-' is present instead of '.'
There are two issues here.
In regex . is special. It means "match one of any character". However, you are trying to use it to match a regular period. (It will indeed match that, but it will also match everything else.) Instead, to match a period, you need to use the pattern \.. And to change that to match either a period or a hyphen, you can use a class, like [-.].
You are using \b at the end of your pattern to match the word boundary, but \b is defined as being the boundary between a word character and a non-word character, and periods and spaces are both non-word characters. This means that Python won't find a match. Instead, you could use a lookahead assertion, which will match whatever character you want, but won't consume the string.
Now, to match a whole word - any word - you can do something like \w+, which matches one or more word characters.
Also, it is quite possible that there won't be a match anyway, so you should check whether a match occurred using an if statement or a try statement. Putting it all together:
txt = "The indian in. Spain."
pattern = r"\w+[-.]"
x = re.search(r"\b" + pattern + r"(?=\W)", txt)
if x:
print(x.start(), x.end())
Edit
There is one problem with the lookahead assertion above - it won't match the end of the string. This means that if your text is The rain in Spain. then it won't match Spain., as there is no non-word character following the final period.
To fix this, you can use a negative lookahead assertion, which matches when the following text does not include the pattern, and also does not consume the string.
x = re.search(r"\b" + pattern + r"(?!\w)", txt)
This will match when the character after the word is anything other than a word character, including the end of the string.
I want a regex search to end when it reaches ". ", but not when it reaches "."; I'm aware of using [^...] to exclude single characters, and have been using this to stop my search when it reaches a certain character. This does not work with strings though, as [^. ] stops when it reaches either character. Say I've got the code
import re
def main():
my_string = "The value of the float is 2.5. The int's value is 2.\n"
re.search("[^.]*", my_string)
main()
Which gives a match object with the string
"The value of the float is 2"
How can I change this so that it only stops after the string ". "?
Bonus question, is there any way to tell regex to stop whenever it reaches one of multiple strings? Using the above code as an example, if I wanted the search to end when it found the string ". " or the string ".\n", how would I go about it? Thanks!
To match from the start of a string till the . followed with whitespace, use
^(.*?)\.\s
If you want to only require a space or newline after a dot, use either of (the second is best if you have single chars only, use alternation if there are multicharacter alternatives)
^(.*?)\.(?: |\n)
^(.*?)\.[ \n]
See the regex demo.
Details
^ - start of a string
(.*?) - Capturing group 1: any 0+ chars other than linebreak chars, as few as possible
\. - a literal . char
\s - a whitespace char
(?: |\n) / [ \n] - a non-capturing group matching either a space or (|) a newline.
Python demo:
import re
my_string = "The value of the float is 2.5. The int's value is 2.\n"
m = re.search("^(.*?)\.\s", my_string) # Try to find a match
if m: # If there is a match
print(m.group(1)) # Show Group 1 value
NOTE If there can be line breaks in the input, pass re.S or re.DOTALL flag:
m = re.search("^(.*?)\.\s", my_string, re.DOTALL)
Besides classic approach explained by Wiktor, also splitting may be interesting solution in this case.
>>> my_string
"The value of the float is 2.5. The int's value is 2.\n"
>>> re.split('\. |\.\n', my_string)
['The value of the float is 2.5', "The int's value is 2", '']
If you want to include periods at the end of the sentence, you can do something like this:
['{}.'.format(sentence) for sentence in re.split('\. |\.\n', my_string) if sentence]
To handle multiple empty spaces between the sentences:
>>> str2 = "The value of the float is 2.5. The int's value is 2.\n\n "
>>> ['{}.'.format(sentence)
for sentence in re.split('\. \s*|\.\n\s*', str2)
if sentence
]
['The value of the float is 2.5.', "The int's value is 2."]
I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
What this means is find a alphanumeric character at least once as indicated by the utilization of '+', then find a '-', following by another alphanumeric character at least once, again, as indicated by the '+' again.
str is a built in name, better not to use it for naming
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match from words with hyphens at the beggining or end. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by either a hyphen nor a word character.
After: (?![-\w]) not followed by either a hyphen nor a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
regex101 demo
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
ideone demo
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print re.findall(r"\S+\-\S+", st)
['semi-column']
You can try something like this: Centering on the hyphen, I match until there is a white space in either direction from the hyphen I also make check to see if the words are surrounded by hyphens (e.g -test-cats-) and if they are I make sure not to include them. The regular expression should also work with findall.
st = 'word semi-column peace'
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
print m.group(1)
So I want to write a regex that matches with a word that is one character less than the word. So for example:
wordList = ['inherit', 'inherent']
for word in wordList:
if re.match('^inhe....', word):
print(word)
And in theory, it would print both inherit and inherent, but I can only get it to print inherent. So how can I match with a word one letter short without just erasing one of the dots (.)
(Edited)
For matching only inherent, you could use .{4}:
re.match('^inhe.{4}', word)
Or ....$:
re.match('^inhe....$')
A regex may not be the best tool here, if you just want to know if word Y starts with the first N-1 letters of word X, do this:
if Y.startswith( X[:-1] ):
# Do whatever you were trying to do.
X[:-1] gets all but the last character of X (or the empty string if X is the empty string).
Y.startswith( 'blah' ) returns true if Y starts with 'blah'.
I would like to replace strings like 'HDMWhoSomeThing' to 'HDM Who Some Thing' with regex.
So I would like to extract words which starts with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in the fact the first letter of the word Who - and should not be included in the word HDM.
What is the correct regex to achieve this goal? I have tried many regex' similar to [A-Z][a-z]+ but without success. The [A-Z][a-z]+ gives me 'Who Some Thing' - without 'HDM' of course.
Any ideas?
Thanks,
Rukki
#! /usr/bin/env python
import re
from collections import deque
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))
result = []
while len(chunks):
buf = chunks.popleft()
if len(buf) == 0:
continue
if re.match(r'^[A-Z]$', buf) and len(chunks):
buf += chunks.popleft()
result.append(buf)
print ' '.join(result)
Output:
HDM Who Some MONKEY Thing XYZ
Judging by lines of code, this task is a much more natural fit with re.findall:
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print ' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX'))
Output:
HDM Who Some MONKEY Thing X
Try to split with this regular expression:
/(?=[A-Z][a-z])/
And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:
/([A-Z])(?![A-Z])/
Replace it with " $1" (space plus match of the first group). Then you can split at the space.
one liner :
' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))
using regexp
([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))
So 'words' in this case are:
Any number of uppercase letters - unless the last uppercase letter is followed by a lowercase letter.
One uppercase letter followed by any number of lowercase letters.
so try:
([A-Z]+(?![a-z])|[A-Z][a-z]*)
The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
May be '[A-Z]*?[A-Z][a-z]+'?
Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+
import re
def find_stuff(str):
p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
m = p.findall(str)
result = ''
for x in m:
result += x + ' '
print result
find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')
Prints out:
HDM Who Some Thing
Some HDM Who Thing