regex match exact pattern within string - python

If I have the following string 'some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888' and I want to find 15-digit numbers (so only 151283917503423), how do I make sure it doesn't match the longer number? The string can also be just '151283917503423', so I cannot rely on the number having spaces on both sides.
serial = re.compile('[0-9]{15}')
serial.findall('some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888')
This returns both '666666666666666' (the first 15 digits of 66666666666666666667867866) and '151283917503423', but I only want the latter.

Use word boundaries:
serial = re.compile(r'\b[0-9]{15}\b')
\b Matches the empty string, but only at the beginning or end of a
word. A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, so the
precise set of characters deemed to be alphanumeric depends on the
values of the UNICODE and LOCALE flags. For example, r'\bfoo\b'
matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or
'foo3'. Inside a character range, \b represents the backspace
character, for compatibility with Python’s string literals.

You need to use word boundaries to ensure you don't match unwanted text on either side of your match:
>>> serial = re.compile(r'\b\d{15}\b')
>>> serial.findall('some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888')
['151283917503423']

Include word boundaries. Let s be your string. You can use
>>> re.findall(r'\b\d{15}\b' ,s)
['151283917503423']
where \b asserts a word boundary (^\w|\w$|\W\w|\w\W)

Since each word boundary \b involves 2 assertions, I would use a single assertion on each side instead:
(?<![0-9])[0-9]{15}(?![0-9])
should be quicker?
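A quick sketch comparing the two approaches on the string from the question (the variable names and the 'glued' test string are made up for illustration). It does not measure speed; it only shows where the results differ. Both patterns handle a bare 15-digit string, but \b treats letters and underscores as word characters, so it rejects digits glued to a letter, while the digit-only lookarounds accept them.

```python
import re

text = 'some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888'

# Word-boundary version from the answers above
word_boundary = re.compile(r'\b[0-9]{15}\b')
# Lookaround version: no digit immediately before or after the 15 digits
digit_lookaround = re.compile(r'(?<![0-9])[0-9]{15}(?![0-9])')

on_text_wb = word_boundary.findall(text)
on_text_la = digit_lookaround.findall(text)
print(on_text_wb)   # ['151283917503423']
print(on_text_la)   # ['151283917503423']

# The string may consist of the number alone
bare_wb = word_boundary.findall('151283917503423')
bare_la = digit_lookaround.findall('151283917503423')
print(bare_wb, bare_la)

# Hypothetical input where the digits touch letters: the two patterns disagree
glued = 'serial151283917503423x'
glued_wb = word_boundary.findall(glued)
glued_la = digit_lookaround.findall(glued)
print(glued_wb)     # []
print(glued_la)     # ['151283917503423']
```

Which behaviour is right depends on whether a serial attached to letters should count as a match.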

Related

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether or not they are attached to words. How can I do it in such a way that numbers attached to words are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*- either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct: it matches any character that is not a non-word character, that is, any alphanumeric character or _ (\W is the negation of \w, which matches any alphanumeric character plus _; a common ASCII equivalent of \w is [a-zA-Z0-9_]).
It proves useful here for composing other classes:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Some further reading here.
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
Demo.
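As a rough check, here is the pattern applied to the string from the question with the standard re module. The second input, "snake_case_1 _x9", is a made-up example to show the difference that the Edit describes when _ should act as a delimiter.

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# Optional leading digits, one letter, then any letters/digits
no_boundaries = re.compile(r'\d*[^\W\d_][^\W_]*')
with_boundaries = re.compile(r'\b\d*[^\W\d_][^\W_]*\b')

tokens = no_boundaries.findall(text.lower())
print(tokens)  # ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']

# Hypothetical input containing underscores: \b does not fire next to _,
# so the bounded version finds nothing, while the unbounded one splits on _
underscored = "snake_case_1 _x9"
split_on_underscore = no_boundaries.findall(underscored)
bounded = with_boundaries.findall(underscored)
print(split_on_underscore)  # ['snake', 'case', 'x9']
print(bounded)              # []
```

The standalone number 64664646 is skipped in both versions because the pattern requires at least one letter.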
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by flipping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.

Regex to get non-alphanumeric strings between alphanumeric strings

Let say I have this string:
Alpha+*&Numeric%$^String%%$
I want to get the non-alphanumeric characters that are between alphanumeric characters:
+*& %$^
I have this regex: [^0-9a-zA-Z]+ but it's giving me
+*& %$^ %%$
which includes the trailing non-alphanumeric characters, which I do not want. I have also tried [0-9a-zA-Z]([^0-9a-zA-Z])+[0-9a-zA-Z] but it's giving me
a+*&N c%$^S
which include the characters a, N, c and S
If you don't mind including the _ character as alpha-numeric data, you can extract all your non-alpha-numeric-data with this:
some_string = "A+*&N%$^S%%$"
import re
result = re.findall(r'\b\W+\b', some_string) # sets result to: ['+*&', '%$^']
Note my use of \b instead of something like \w or [^\W].
\w and [^\W] each match one character, so if your alpha-numeric string (between the text you want) is exactly one character, then what you think should be the next match won't match.
But since \b is a zero-width "word boundary," it doesn't care how many alpha-numeric characters there are, as long as there is at least one.
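A small sketch of that difference, using the shortened string from above where each alphanumeric run is a single character. The consuming pattern \w(\W+)\w is shown only for comparison; the variable names are illustrative.

```python
import re

some_string = "A+*&N%$^S%%$"

# \w consumes the 'N' that ends the first match, so it is no longer
# available to start the second match
consuming = re.findall(r'\w(\W+)\w', some_string)
print(consuming)   # ['+*&']

# \b is zero-width, so 'N' can close one match and open the next
zero_width = re.findall(r'\b\W+\b', some_string)
print(zero_width)  # ['+*&', '%$^']
```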
The only problem with your second attempt is the location of the + quantifier: it should be inside the parentheses. You can also use the word character class \w and its inverse \W to pull out these items, which is the same as your second regex but includes underscores _ as parts of words:
import re
s = "Alpha+*&Numeric%$^String%%$"
print(re.findall(r"\w(\W+)\w", s)) # adds _ character
print(re.findall(r"[0-9a-zA-Z]([^0-9a-zA-Z]+)[0-9a-zA-Z]", s)) # your version fixed
print(re.findall(r"(?i)[0-9A-Z]([^0-9A-Z]+)[0-9A-Z]", s)) # same as above
Output:
['+*&', '%$^']
['+*&', '%$^']
['+*&', '%$^']

Python regex number or whitespace before and after string

I am learning regular expressions and have the question below.
I referred to the page and got the information below:
\b Matches the empty string, but only at the beginning or end of a
word. A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, so the
precise set of characters deemed to be alphanumeric depends on the
values of the UNICODE and LOCALE flags. For example, r'\bfoo\b'
matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or
'foo3'. Inside a character range, \b represents the backspace
character, for compatibility with Python’s string literals.
Code:
import re
abc="A \ncat and a rat"+ "\ncan't be friends."
print(abc)
if re.search(r'\bcat\b', abc):
    print("Found")
else:
    print("not found")
I would like to find all cases where
there has to be either number or white space before and after my string.
So '1cat4', 'cat', '1cat ', ' cat ', '(cat)' should return positive when I search for 'cat'.
How should I update my code?
Looks like you want to find any cat surrounded by non-alphabetic characters or at the beginning or end of the text:
abc="cat. A \ncat and a rat\ncan't be friends, how about 1cat23 and concatenate?"
re.findall(r'(?:[^a-zA-Z]|^)(cat)(?:[^a-zA-Z]|$)',abc)
#['cat', 'cat', 'cat']
Here are the contexts of the found cats:
re.findall(r'(?:[^a-zA-Z]|^)cat(?:[^a-zA-Z]|$)',abc)
#['cat.', '\ncat ', '1cat2']
Unfortunately, this regex does not recognize herds of cats ("catcat", "cat cat", and the like). If this is an issue, you can add more clauses to the regex.
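One possible way around the "cat cat" case, offered as a sketch rather than as the answer's own method, is to replace the consuming groups with zero-width lookarounds so that the separator between two cats is not used up by the first match. "catcat" still does not match, since each cat there is directly attached to a letter.

```python
import re

abc = "cat. A \ncat and a rat\ncan't be friends, how about 1cat23 and concatenate?"

# No ASCII letter immediately before or after 'cat'; nothing is consumed
lookaround = re.compile(r'(?<![a-zA-Z])cat(?![a-zA-Z])')
# Pattern from the answer above, for comparison
consuming = re.compile(r'(?:[^a-zA-Z]|^)(cat)(?:[^a-zA-Z]|$)')

found = lookaround.findall(abc)
print(found)            # ['cat', 'cat', 'cat']

pair_lookaround = lookaround.findall("cat cat")
pair_consuming = consuming.findall("cat cat")
print(pair_lookaround)  # ['cat', 'cat']
print(pair_consuming)   # ['cat']

glued = lookaround.findall("catcat")
print(glued)            # []
```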

Regex to replace characters unless they're inside of a word?

How do I replace a set of characters in a string unless they're part of a word? For example, if I have the text "ur the wurst person ur", I want to replace "ur" with "youre". So the final text would be "youre the wurst person youre". I don't want the "ur" inside of wurst to be changed because it's inside of a word. Is there a generic regex way to do this in python? I don't want to have to worry if "ur" has a space before or after, etc., only if it's part of another word. Thanks!
What I've tried so far is a simple
result = re.sub("ur", "youare", text)
but this also replaces the "ur" inside of "wurst". If I use the word boundaries as in
result = re.sub(r"\bur\b", "youare", text)
it will miss the last occurrence of "ur" in the string.
Without using regular expressions...
You could split the string at each space with string.split() and then, in a list comprehension, replace words 'ur' with 'youre'. This may look something like:
s = "ur the wurst person ur"
result = " ".join(['youre' if w == 'ur' else w for w in s.split()])
Hope this helps!
result = re.sub(r'\bur\b', r'youare', "ur the wurst person ur")
from the python documentation:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
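A short check that the raw-string pattern does replace the last "ur" as well. The second call is an assumption about what may have gone wrong in the question: without the r prefix, Python turns \b into a backspace character before the regex engine ever sees it, so nothing matches.

```python
import re

text = "ur the wurst person ur"

# Raw string: \b reaches the regex engine as a word boundary
with_raw = re.sub(r"\bur\b", "youre", text)
print(with_raw)     # youre the wurst person youre

# Plain string: "\b" is the backspace character (\x08), not a word boundary
without_raw = re.sub("\bur\b", "youre", text)
print(without_raw)  # ur the wurst person ur
```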

Why inconsistent regular expression "\bpattern\b" behavior in Python?

I am using Python 3 to demonstrate. There is an example string:
a = "learning is learn and elearn"
s = "#wen is # and wen#"
I want to do exact match of "learn" and "#", i.e., not extracting learning (or #wen) or elearn (or wen#). Therefore, I should get 'learn' and '#'.
re.findall(r'\blearn\b', a) # works
['learn']
or
re.sub(r'\blearn\b', 'z', a) # works
'learning is z and elearn'
re.findall(r'\b#\b', s) # not working
[]
or
re.sub(r'\b#\b', 'z', s) # not working
'#wen is # and wen#'
From the docs:
\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string
In your example, # is a nonalphanumeric (and non-underscore) character surrounded by other nonalphanumeric characters. Because there are no word characters, there is no word boundary, so \b will not match.
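If "exact match" here means "delimited by whitespace or the ends of the string" (an assumption about the intent), one possible sketch is to use lookarounds on \S instead of \b:

```python
import re

s = "#wen is # and wen#"

# '#' not preceded and not followed by a non-whitespace character
standalone_hash = re.compile(r'(?<!\S)#(?!\S)')

matches = standalone_hash.findall(s)
replaced = standalone_hash.sub('z', s)
print(matches)   # ['#']
print(replaced)  # #wen is z and wen#
```

Unlike \b, these assertions do not depend on whether the token itself is made of word characters.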
