Check in python if self designed pattern matches - python

I have a pattern which looks like:
abc*_def(##)
and i want to look if this matches for some strings.
E.x. it matches for:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # through regex expression and then look if they match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?

Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding after def the same digit you have after abc.
This can be done with a negative backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b captures word boundaries; if you enclose your regex between two \b
you will restrict your search to words, i.e. text delimited by space,
punctuations etc
abc captures the string abc
\d+ captures one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; anyway \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the use of parenthesis defines \d+ as something you want to "remember" inside your regex; it's somehow similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def captures the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def. \1 represents the first group, while (?!whatever) is a check that results positive is what follows the current position is NOT (the negation is given by !) whatever you want to negate.
Live demo here.

I had the hardest time getting this to work. The trick was the $
#!python2
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
print item, 'is a match'

You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def - Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could for example use search to check for a match and use .group(1) to get the digits between parenthesis.
Demo Python
You could also add word boundaries:
\babc\d+_def(\d{2})\b

Related

Wny it does not give all positive numbers in the string? Regex in Python

I don't understand why it only gives 125, the first number only, why it does not give all positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
Demo
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
The following pattern will only match non negative numbers:
pattern = re.compile("(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness another idea by use of \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitspace boundary to the left, and match the optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word bounadry
Regex demo
Use , to sperate the numbers in the string.

Regex to find N characters between underscore and period

I have a filename having numerals like test_20200331_2020041612345678.csv.
So I just want to read only first 8 characters from the number between last underscore and .csv using a regex.
For e.g: From the file name test_20200331_2020041612345678.csv --> i want to read only 20200416 using regex.
Regex tried: (?<=_)(\d+)(?=\.)
But it is returning the full number between underscore and period i.e 2020041612345678
Also, when tried quantifier like (?<=_)(\d{8})(?=\.) its not matching with any string
The (?<=_)(\d{8})(?=\.) does not work because the (?=\.) positive lookahead requires the presence of a . char immediately to the right of the current location, i.e. right after the eigth digit, but there are more digits in between.
You may add \d* before \. to match any amount of digits after the required 8 digits, use
(?<=_)\d{8}(?=\d*\.)
Or, with a capturing group, you do not even need lookarounds (just make sure you access Group 1 when a match is obtained):
_(\d{8})\d*\.
See the regex demo
Python demo:
import re
s = "test_20200331_2020041612345678.csv"
m = re.search(r"(?<=_)\d{8}(?=\d*\.)", s)
# m = re.search(r"_(\d{8})\d*\.", s) # capturing group approach
if m:
print(m.group()) # => 20200416
# print(m.group(1)) # capturing group approach

Python regex with one operator or the other, but not both

I want to use a regular expression to match a lowercase letter followed by either a + and a digit or a - and a digit, or both, but not 2 times the same operator.
To be clear, these are acceptable
a
a+1
a-2
a+3-4
a-5+6
while these are not acceptable
a+1+2
a-3-4
My current expression is
r = re.compile(r"[a-z]{1}([+-]\d){0,2}?$")
which allows both the non-acceptable strings. How can I specify that if one operator has already been used, it cannot appear twice?
You can use a backreference within a negative lookahead (the overall regex will have to change a little bit though):
[a-z](?:([+-])\d(?:(?!\1)[+-]\d)?)?$
regex101 demo
Instead of ([+-]\d){0,2}?, I have made the possible two repeats like this: ([+-])\d(?:(?!\1)[+-]\d)?, the first occurrence of operator and number being ([+-])\d and the second (?:(?!\1)[+-]\d)?.
In the first occurrence, the regex is storing the matched value (either + or -) and in the second, it is making sure this matched value is not matched (?!\1)[+-] ((?! ... ) is the syntax for negative lookahead so that [+-] cannot be something that this negative lookahead matches)
Try this:
[a-z](?!(\+\d\+\d)|(\-\d\-\d))((\+|\-)\d)*
And verbose version (which is better, use it):
[a-z] # find this
(?! # not followed by:
(\+\d\+\d) | (\-\d\-\d) # (this or that)
)
(
(\+|\-)\d # followed by this
)* # 0 or more times
You can branch these in two scenario's so:
r = re.compile(r'^[a-z]([+]\d([-]\d)?|[-]\d([+]\d)?)?$')
(regex101)
So we basically have two branches here:
[+]\d([-]\d)?: we start with a +, a digit and optionally a - and a digit; and
[-]\d([+]\d)?: we start with a -, a digit and optionally a + and a digit.
We then make a union between the the two, and make this optional as well.
Try this one:
(?!(.+?\+){2,}|(.*?\-){2,})[a-z][\d+-]*
Demo at regex 101
Explanation:
(?!(.+?\+){2,}|(.*?\-){2,}) negative look ahead asserts that there are not more than two occurrence of + or -
[a-z] matches lower case character
[\d+-]* matches zero or more digits, + or -

Matching a number in a file with Python

I have about 15,000 files I need to parse which could contain one or more strings/numbers from a list I have. I need to separate the files with matching strings.
Given a string: 3423423987, it could appear independently as "3423423987", or as "3423423987_1" or "3423423987_1a", "3423423987-1a", but it could also be "2133423423987". However, I only want to detect the matching sequence where it is not a part of another number, only when it has a suffix of some sort.
So 3423423987_1 is acceptable, but 13423423987 is not.
I'm having trouble with regex, haven't used it much to be honest.
Simply speaking, if I simulate this with a list of possible positives and negatives, I should get 7 hits, for the given list. I would like to extract the text till the end of the word, so that I can record that later.
Here's my code:
def check_text_for_string(text_to_parse, string_to_find):
import re
matches = []
pattern = r"%s_?[^0-9,a-z,A-Z]\W"%string_to_find
return re.findall(pattern, text_to_parse)
if __name__ =="__main__":
import re
word_to_match = "3423423987"
possible_word_list = [
"3423423987_1 the cake is a lie", #Match
"3423423987sdgg call me Ishmael", #Not a match
"3423423987 please sir, can I have some more?", #Match
"3423423987", #Match
"3423423987 ", #Match
"3423423987\t", #Match
"adsgsdzgxdzg adsgsdag\t3423423987\t", #Match
"1233423423987", #Not a match
"A3423423987", #Not a match
"3423423987-1a\t", #Match
"3423423987.0", #Not a match
"342342398743635645" #Not a match
]
print("%d words in sample list."%len(possible_word_list))
print("Only 7 should match.")
matches = check_text_for_string("\n".join(possible_word_list), word_to_match)
print("%d matched."%len(matches))
print(matches)
But clearly, this is wrong. Could someone help me out here?
It seems you just want to make sure the number is not matched as part of a, say, float number. You then need to use lookarounds, a lookbehind and a lookahead to disallow dots with digits before and after.
(?<!\d\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
See the regex demo
To also match the "prefixes" (or, better call them "suffixes" here), you need to add something like \S* (zero or more non-whitespaces) or (?:[_-]\w+)? (an optional sequence of a - or _ followed with 1+ word chars) at the end of the pattern.
Details:
(?<!\d\.) - fail the match if we have a digit and a dot before the current position
(?:\b|_) - either a word boundary or a _ (we need it as _ is a word char)
3423423987 - the search string
(?:\b|_) - ibid
(?!\.\d) - fail the match if a dot + digit is right after the current position.
So, use
pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)"%string_to_find
See the Python demo
If there can be floats like Text with .3423423987 float value, you will need to also add another lookbehind (?<!\.) after the first one: (?<!\d\.)(?<!\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
You can use this pattern:
(?:\b|^)3423423987(?!\.)(?=\b|_|$)
(?:\b|^) asserts that there are no other numbers to the left
(?!\.) asserts the number isn't followed by a dot
(?=\b|_|$) asserts the number is followed by a non word character, an underscore or nothing

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
It (?:...) means a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outer most pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a digit (\d) or the hexadecimal digits A-Fa-f], in a pair {2}, followed by a non-capturing grouping where the matched material is a colon or dash (: or -), followed by another pair of hex digits, with the whole repeated exactly 5 times.
p = re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}))
m = p.match("00:07:32:12:ac:de:ef")
if m:
m.group(1)
The last line should print the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant and if omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, then you change the 5 into a 6. If you need to reject them, then you'd put anchors into the regex:
p = re.compile(^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$)
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
Using ?: as in (?:...) makes the group non-capturing during replace. During find it does'nt make any sense.
Your RegEx means
r"""
( # Match the regular expression below and capture its match into backreference number 1
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
(?: # Match the regular expression below
[:-] # Match a single character present in the list below
# The character “:”
# The character “-”
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
){5} # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
print(i)
print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
(?:...) means a non cature group. The group will not be captured.

Categories