I want to write a regex which will match a string only if the string consists of two capital letters.
I tried [A-Z]{2}, [A-Z]{2, 2} and [A-Z][A-Z], but these also match the string 'CAS', while I want a match only if the whole string is two capital letters, like 'CA'.
You could use anchors:
^[A-Z]{2}$
^ matches the beginning of the string, while $ matches its end.
Note that in your attempts you used [A-Z]{2, 2}, which should actually be [A-Z]{2,2} (without the space) to mean the same thing as the others.
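As a minimal sketch of the difference the anchors make in Python (the helper name is_two_capitals is made up for illustration), note that in Python $ also matches just before a trailing newline, so re.fullmatch is the stricter check:

```python
import re

UNANCHORED = re.compile(r"[A-Z]{2}")
ANCHORED = re.compile(r"^[A-Z]{2}$")


def is_two_capitals(s):
    """Hypothetical helper: True only if s is exactly two capital letters."""
    return re.fullmatch(r"[A-Z]{2}", s) is not None


# Without anchors the pattern finds "CA" inside "CAS".
unanchored_cas = bool(UNANCHORED.search("CAS"))
# With anchors the whole string has to be two capitals.
anchored_cas = bool(ANCHORED.search("CAS"))
anchored_ca = bool(ANCHORED.search("CA"))
# In Python, $ also matches right before a trailing newline.
anchored_newline = bool(ANCHORED.search("CA\n"))

print(unanchored_cas, anchored_cas, anchored_ca, anchored_newline)
print(is_two_capitals("CA"), is_two_capitals("CA\n"))
```

If a trailing newline must be rejected as well, re.fullmatch (or \Z in place of $) is one way to do that.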
You need to add word boundaries,
\b[A-Z]{2}\b
Explanation:
\b Matches between a word character and a non-word character.
[A-Z]{2} Matches exactly two capital letters.
\b Matches between a word character and a non-word character.
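A short Python sketch of how the word-boundary version behaves (the sample strings are made up for illustration). Unlike the anchored pattern, it finds two-letter capital words inside a longer string:

```python
import re

pattern = re.compile(r"\b[A-Z]{2}\b")

# Hypothetical sample text, not from the question.
sample = "Ship to CA or NY, not CAS or ca."
found = pattern.findall(sample)
print(found)

# A word boundary also exists next to punctuation, so "CA" is
# found inside "CA-123" even though the whole string is longer.
found_in_code = pattern.findall("CA-123")
print(found_in_code)
```

Whether this or the anchored version is the right one depends on whether the whole input or a word inside it should be two capital letters.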
You could try:
\b[A-Z]{2}\b
\b matches a word boundary.
Try:
^[A-Z][A-Z]$
Just added start and end points for the string.
Related
I am trying to find the words in a string that do not start or end with the letters 'aıoueəiöü'. But the regex fails to find the words when I use this code:
txt = "Nasa has fixed a problem with malfunctioning equipment on a new rocket designed to take astronauts to the Moon."
re.findall(r"\b[^aıoueəiöü]\w+[^aıoueəiöü]\b",txt)
Instead, it works fine when whitespace character \s is added in negation part:
re.findall(r"\b[^aıoueəiöü\s]\w+[^aıoueəiöü\s]\b",txt)
I cannot understand the issue with the first example: why do I need to exclude whitespace characters too?
Note that [^aıoueəiöü] matches any char other than a, ı, o, u, e, ə, i, ö and ü. It can match a whitespace, a digit, punctuation, etc.
Also, your regex only matches strings of at least three chars; you need to adjust it to match one- and two-char strings, too.
You do not have to rely on excluding whitespace from the pattern. Since you only want to match word chars other than vowels, add \W rather than \s:
\b[^\Waıoueəiöü](?:\w*[^\Waıoueəiöü])?\b
Details:
\b - a word boundary
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
(?:\w*[^\Waıoueəiöü])? - an optional occurrence of
\w* - any zero or more word chars
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
\b - a word boundary
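The pattern above can be tried on the sentence from the question. This is only a sketch; the short extra sample "x by me" is made up to show that one- and two-character words are handled:

```python
import re

txt = ("Nasa has fixed a problem with malfunctioning equipment "
       "on a new rocket designed to take astronauts to the Moon.")

pattern = re.compile(r"\b[^\Waıoueəiöü](?:\w*[^\Waıoueəiöü])?\b")

# A negated class without \W or \s happily matches a space.
negated_only = re.findall(r"[^aıoueəiöü]", "a b")
print(negated_only)

words = pattern.findall(txt)
print(words)

# One- and two-character words are matched as well.
short_words = pattern.findall("x by me")
print(short_words)
```

Note that the class is case-sensitive as written, so "Nasa" is rejected because of its final "a", not because of its first letter.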
I need to match 'words' (string of characters with no spaces) that might have the word near at the beginning and/or the end and have only digits in the middle.
Examples: near3 4near near2near
It should not match words like nearing3 4nearsighted near3ness nearsighted
I tried this: x = re.match(r"((\bnear)|(near\b))(\d)", txt)
It works for this word: near3 and this word: near4near but not for this word 2near
You can use an alternation (the pipe |) to match either an optional near followed by digits and near, or near followed by digits.
Surround the alternation with a non-capturing group and add word boundaries \b on both sides of the pattern to prevent partial word matches.
If you want to match only a single digit, use \d instead of \d+.
\b(?:(?:near)?\d+near|near\d+)\b
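A quick Python check of this pattern against the examples from the question, as a sketch (the words are simply joined into one test string):

```python
import re

pattern = re.compile(r"\b(?:(?:near)?\d+near|near\d+)\b")

# Words that should match, followed by words that should not.
txt = "near3 4near near2near nearing3 4nearsighted near3ness nearsighted"

matches = pattern.findall(txt)
print(matches)
```

re.findall is used here rather than re.match because re.match only tries the pattern at the start of the string.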
I am trying to remove all special characters and numbers in python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether attached to words or not. How can I do it in such a way that numbers attached to words are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed by a digit, or 1+ digits followed by a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE: It is equivalent to the \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, which matches any chunk of 1+ alphanumeric characters with at least 1 letter in it.
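As a sketch, both patterns can be run with the standard re module on the string from the question; on this sample they give the same result:

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

long_pattern = r"(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+"
short_pattern = r"\d*[^\W\d_][^\W_]*"

tokens = re.findall(long_pattern, text.lower())
tokens_short = re.findall(short_pattern, text.lower())

print(tokens)
print(tokens == tokens_short)
```

The digits-only chunk 64664646 is skipped because both patterns require at least one letter.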
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct: it matches any character that is not a non-word character, i.e. any alphanumeric character or _ (\W is the negation of \w, which matches any alphanumeric character plus _ - common equivalent [a-zA-Z0-9_]).
It proves useful here for composing:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
The default greediness of the star operator ensures that all relevant characters are actually matched.
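The effect of dropping the word boundaries can be seen on a made-up sample containing an underscore (a sketch; the sample string is an assumption, not from the question):

```python
import re

with_boundaries = r"\b\d*[^\W\d_][^\W_]*\b"
without_boundaries = r"\d*[^\W\d_][^\W_]*"

# Hypothetical sample: "_" is a word character, so \b does not
# fire between "snake" and "_" or between "_" and "case".
sample = "snake_case win32 01ex"

bounded = re.findall(with_boundaries, sample)
unbounded = re.findall(without_boundaries, sample)

print(bounded)
print(unbounded)
```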
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example flipping the * and + on the first and last sets to capture string like "win32" and "01ex" equally.
I have a regex that matches all three characters words in a string:
\b[^\s]{3}\b
When I use it with the string:
And the tiger attacked you.
this is the result:
regex = re.compile(r"\b[^\s]{3}\b")
regex.findall(string)
[u'And', u'the', u'you']
As you can see, it matches you as a word of three characters, but I want the expression to treat "you." (with the ".") as a 4-character word.
I have the same problem with ",", ";", ":", etc.
I'm pretty new with regex but I guess it happens because those characters are treated like word boundaries.
Is there a way of doing this?
Thanks in advance,
EDIT
Thanks to the answers of @BrenBarn and @Kendall Frey I managed to get to the regex I was looking for:
(?<!\w)[^\s]{3}(?=$|\s)
If you want to make sure the word is preceded and followed by a space (and not a period like is happening in your case), then use lookaround.
(?<=\s)\w{3}(?=\s)
If you need it to match punctuation as part of words (such as 'in.') then \w won't be adequate, and you can use \S (anything but a space)
(?<=\s)\S{3}(?=\s)
As described in the documentation:
A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
So if you want a period to count as a word character and not a word boundary, you can't use \b to indicate a word boundary. You'll have to use your own character class. For instance, you can use a regex like \s[^\s]{3}\s if you want to match 3 non-space characters surrounded by spaces. If you still want the boundary to be zero-width (i.e., restrict the match but not be included in it), you could use lookaround, something like (?<=\s)[^\s]{3}(?=\s).
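A small sketch comparing the lookaround patterns mentioned above on a made-up sentence. The third pattern, (?<!\S)\S{3}(?!\S), is not from the answers; it is an assumed variant that also accepts words at the very start or end of the string:

```python
import re

# Hypothetical sample sentence for illustration.
s = "And the tiger attacked you. It bit me; ow!"

# Whitespace required on both sides, word characters only.
word_chars = re.findall(r"(?<=\s)\w{3}(?=\s)", s)
# Whitespace required on both sides, punctuation counts as part of the word.
non_space = re.findall(r"(?<=\s)\S{3}(?=\s)", s)
# Assumed variant: "no non-space neighbour", which also holds at string edges.
edges_ok = re.findall(r"(?<!\S)\S{3}(?!\S)", s)

print(word_chars)
print(non_space)
print(edges_ok)
```

In all three cases "you." is not reported as a three-character word, because the period is now treated as part of the word.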
This would be my approach. It also matches words that come right after punctuation.
import re
r = r'''
\b # word boundary
( # capturing parentheses
[^\s]{3} # anything but whitespace 3 times
\b # word boundary
(?=[^\.,;:]|$) # dont allow . or , or ; or : after word boundary but allow end of string
| # OR
[^\s]{2} # anything but whitespace 2 times
[\.,;:] # a . or , or ; or :
)
'''
s = 'And the tiger attacked you. on,bla tw; th: fo.tes'
print(re.findall(r, s, re.X))
output:
['And', 'the', 'on,', 'bla', 'tw;', 'th:', 'fo.', 'tes']
Example:
X=This
Y=That
Not matching:
ThisWordShouldNotMatchThat
ThisWordShouldNotMatch
WordShouldNotMatch
Matching:
AWordShouldMatchThat
I tried (?<!...) but it does not seem to be easy :)
^(?!This).*That$
As a free-spacing regex:
^ # Start of string
(?!This) # Assert that "This" can't be matched here
.* # Match the rest of the string
That # making sure we match "That"
$ # right at the end of the string
This will match a single word that fulfills your criteria, but only if this word is the only input to the regex. If you need to find words inside a string of many other words, then use
\b(?!This)\w*That\b
\b is the word boundary anchor, so it matches at the start and at the end of a word. \w means "alphanumeric character" (including the underscore). If you also want to allow non-alphanumerics as part of your "word", then use \S instead - this will match anything that's not a space.
In Python, you could do words = re.findall(r"\b(?!This)\w*That\b", text).
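Both variants can be checked against the examples from the question; this is a sketch in which the four words are tested one by one and then joined into a single string:

```python
import re

candidates = [
    "ThisWordShouldNotMatchThat",
    "ThisWordShouldNotMatch",
    "WordShouldNotMatch",
    "AWordShouldMatchThat",
]

# Whole-string version: each candidate is the entire input.
whole = [w for w in candidates if re.search(r"^(?!This).*That$", w)]

# Word version: find qualifying words inside a longer text.
text = " ".join(candidates)
words = re.findall(r"\b(?!This)\w*That\b", text)

print(whole)
print(words)
```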