How to python regex match the following? - python

1<assume tab here>Algebra I<assume tab here>START
1.1 What are the Basic Numbers? 1-1
For each of the two lines above, how do I use a regex to match the leading number and then the text that follows it, up to and including the "?"? In essence, I want the following groups:
["1", "Algebra I"]
["1.1", "What are the Basic Numbers?"]
That is, match everything up to and including a question mark, or up to a tab character.
How can I do this with a single regex?

Here's an easy regex:
^([\d.]+)\s*([^\t?]+\??)
Group 1 is the numbers, Group 2 contains the text.
To retrieve a single match:
import re
# s is the input string
match = re.search(r"^([\d.]+)\s*([^\t?]+\??)", s)
if match:
    mynumbers = match.group(1)
    myline = match.group(2)
To iterate over all matches, compile with re.MULTILINE and read groups 1 and 2 from each match:
reobj = re.compile(r"^([\d.]+)\s*([^\t?]+\??)", re.MULTILINE)
for match in reobj.finditer(s):
    # matched text: match.group()
    print(match.group(1), match.group(2))
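As a quick check of the pattern above, the following sketch runs it over both sample lines. It assumes the first line really contains tab characters where the question says "assume tab here"; the names sample and groups are only illustrative.

```python
import re

# Sample input; the tabs in the first line are an assumption based on the question.
sample = "1\tAlgebra I\tSTART\n1.1 What are the Basic Numbers? 1-1"

pattern = re.compile(r"^([\d.]+)\s*([^\t?]+\??)", re.MULTILINE)

# Each match yields [number, text] from groups 1 and 2.
groups = [list(m.groups()) for m in pattern.finditer(sample)]
print(groups)
# [['1', 'Algebra I'], ['1.1', 'What are the Basic Numbers?']]
```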

Here you go:
(\d(?:\.\d)*)\s+(?:(.*?\?|.*?)\t)
Explanation: (\d(?:\.\d)*) captures a digit followed by zero or more .\d pairs (so 1 and 1.1, but not multi-digit parts such as 10.12). \s+ then consumes one or more whitespace characters. Group 2, (.*?\?|.*?), lazily captures either text ending in a ? or any text, and the enclosing non-capturing group requires a tab character right after it.
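Output:
import re
pattern = r"(\d(?:\.\d)*)\s+(?:(.*?\?|.*?)\t)"
# The pattern requires a tab after the captured text, so a tab is assumed before "1-1".
string1 = "1.1 What are the Basic Numbers?\t1-1"
string2 = '1\tAlgebra I\tSTART'
m = re.match(pattern, string2)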
m.group(1)
#'1'
m.group(2)
#'Algebra I'
m = re.match(pattern, string1)
m.group(1)
#'1.1'
m.group(2)
#'What are the Basic Numbers?'
EDIT: added non-capturing groups.
EDIT#2: fixed it to include question mark
EDIT#3 fixed no of groups.

Related

Repeat entire group 0 or more times (one or more words separated by +'s)

I am trying to match words separated with the + character as input from a user in python and check if each of the words is in a predetermined list. I am having trouble creating a regular expression to match these words (words are comprised of more than one A-z characters). For example, an input string foo should match as well as foo+bar and foo+bar+baz with each of the words (not +'s) being captured.
So far, I have tried a few regular expressions but the closest I have got is this:
/^([A-z+]+)\+([A-z+]+)$/
However, this only matches the case in which there are two words separated with a +, I need there to be one or more words. My method above would have worked if I could somehow repeat the second group (\+([A-z+]+)) zero or more times. So hence my question is: How can I repeat a capturing group zero or more times?
If there is a better way to do what I am doing, please let me know.
You could write the pattern as:
(?i)[A-Z]+(?:\+[A-Z]+)*$
Explanation
(?i) Inline modifier for case insensitive
[A-Z]+ Match 1+ chars A-Z
(?:\+[A-Z]+)* Optionally repeat matching + and again 1+ chars A-Z
$ End of string
For example
import re
predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
pattern = r"(?i)[A-Z]+(?:\+[A-Z]+)*$"
for s in strings:
    m = re.match(pattern, s)
    if m:
        words = m.group().split("+")
        intersect = bool(set(words) & set(predeterminedList))
        fmt = ','.join(predeterminedList)
        if intersect:
            print(f"'{s}' contains at least one of '{fmt}'")
        else:
            print(f"'{s}' contains none of '{fmt}'")
Another option is to build a dynamic pattern that lists the alternatives:
(?i)^(?:[A-Z]+\+)*(?:foo|bar)(?:\+[A-Z]+)*$
Example
import re
predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
pattern = rf"(?i)^(?:[A-Z]+\+)*(?:{'|'.join(predeterminedList)})(?:\+[A-Z]+)*$"
for s in strings:
    m = re.match(pattern, s)
    fmt = ','.join(predeterminedList)
    if m:
        print(f"'{s}' contains at least one of '{fmt}'")
    else:
        print(f"'{s}' contains none of '{fmt}'")
Both will output:
'foo' contains at least one of 'foo,bar'
'foo+bar' contains at least one of 'foo,bar'
'foo+bar+baz' contains at least one of 'foo,bar'
'test+abc' contains none of 'foo,bar'
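On the literal question of repeating a capturing group: in Python's re module a repeated group only keeps its last repetition, which is why the answers above validate the whole string and then split on +. The sketch below shows that behaviour and a stricter check that every word is in the list. The names ALLOWED, split_words and all_allowed are hypothetical, and the "every word" rule is an assumption about what the asker wants.

```python
import re

ALLOWED = {"foo", "bar", "baz"}  # hypothetical predetermined list
WORDS = re.compile(r"[A-Za-z]+(?:\+[A-Za-z]+)*")

def split_words(text):
    """Return the '+'-separated words, or None if text is malformed."""
    if WORDS.fullmatch(text) is None:
        return None
    return text.split("+")

def all_allowed(text):
    """True only if text is well formed and every word is in ALLOWED."""
    words = split_words(text)
    return words is not None and all(w.lower() in ALLOWED for w in words)

# A repeated capturing group only remembers its last repetition:
m = re.fullmatch(r"([A-Za-z]+)(?:\+([A-Za-z]+))*", "foo+bar+baz")
print(m.groups())                    # ('foo', 'baz')

print(split_words("foo+bar+baz"))    # ['foo', 'bar', 'baz']
print(all_allowed("foo+bar"))        # True
print(all_allowed("foo+qux"))        # False
print(all_allowed("foo++bar"))       # False
```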
I would recommend a slightly different approach using lookarounds:
Pattern: (?<=^|\+)(?=foo|baz)[^+]+
Pattern explanation:
(?<=^|\+) - positive lookbehind - assert that the preceding position is either ^ (beginning of string) or + (our 'word delimiter'). Note that Python's built-in re module rejects this lookbehind because its alternatives have different widths; it only works as written in engines that accept such lookbehinds.
(?=foo|baz) - positive lookahead - assert that the following text matches one of the words (from the predefined list)
[^+]+ - match one or more characters other than +
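Python's built-in re module requires fixed-width lookbehinds, so (?<=^|\+) raises an error there. One possible rewrite, offered as a sketch and not as the answerer's own pattern, replaces it with a negative lookbehind that is always one character wide:

```python
import re

# (?<![^+]) means "not preceded by a non-plus character", which holds at the
# start of the string and right after a "+". It is one character wide, so
# Python's re module accepts it.
pattern = re.compile(r"(?<![^+])(?=foo|baz)[^+]+")

print(pattern.findall("foo+bar+baz"))  # ['foo', 'baz']
print(pattern.findall("bar+qux"))      # []
```

As in the original pattern, the lookahead only checks how each word starts, so a word such as foobar would also be returned.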

What would be the regex pattern for the following?

I have multiple strings in the following formats:
Example:
A='AB.224-QW-2018'
B='AB.876-5-LS-2018'
C='AB.26-LS-18'
D='AB-123-6-LS-2017'
E='IA-Mb-22L-AB.224-QW-2018-IA-Mb-22L'
F='ZX-ss-12L-AB-123-6-LS-2017-BC-22'
G='AB.224-2018'
H='AB.224/QW/2018'
I='AB/224/2018'
J='AB-10-HDB-231-NCLT-1-2017 AD-42-HH-2019'
K='AB-1-HDB-NCLT-1-2016 AD-42-HH-2020'
L='AB-1-HDB-NCLT-1-2016/(AD-42-HH-2020)'
I want a regex pattern that extracts the leading letters (AB), the number that follows them, and the year that comes last.
Some strings contain an extra single digit after the number, such as 876-5 in B and 123-6 in D.
I don't want that single digit after the -.
My code:
d = re.search(r"\D*\d*\D*(AB)\D*(\d+)\D*(20)?(\d{2})\D*\d*\D*", s)  # s is one of the strings above
Another attempt:
d = re.search(r"\D*\d*\D*(AB)\D*(\d+)\D*\d?\D*(20)?(\d{2})\D*\d*\D*", s)
Neither attempt works for all of the strings.
Any pattern to match all strings?
I have created groups in regex pattern and extracted them as
d.group(1)+"/"+d.group(2)+"/"+d.group(4). So output is expected as following if a regex pattern matches for all of them.
Expected Output
A='AB/224/18'
B='AB/876/18'
C='AB/26/18'
D='AB/123/17'
E='AB/224/18'
F='AB/123/17'
G='AB/224/18'
H='AB/224/18'
I='AB/224/18'
J='AB/10/17'
K='AB/1/16'
L='AB/1/16'
You could use 3 capture groups:
\b(AB)\D*(\d+)\S*?(?:20)?(\d\d)\b
\b A word boundary to prevent a partial word match
(AB) Capture AB in group 1
\D* Match optional non digits
(\d+) Capture 1+ digits in group 2
\S*? Optionally match non-whitespace characters, as few as possible
(?:20)? Optionally match 20
(\d\d) Capture 2 digits in group 3
\b A word boundary
For example using re.finditer which returns Match objects that each hold the group values.
Using enumerate you can loop the matches. Every item in the iteration returns a tuple, where the first value is the count (that you don't need here) and the second value contains the Match object.
import re
pattern = r"\b(AB)\D*(\d+)\S*?(?:20)?(\d\d)\b"
s = ("A='AB.224-QW-2018'\n"
"B='AB.876-5-LS-2018'\n"
"C='AB.26-LS-18'\n"
"D='AB-123-6-LS-2017'\n"
"IA-Mb-22L-AB.224-QW-2018-IA-Mb-22L' F='ZX-ss-12L-AB-123-6-LS-2017-BC-22\n"
"A='AB.224-QW-2018'\n"
"B='AB.876-5-LS-2018'\n"
"C='AB.26-LS-18'\n"
"D='AB-123-6-LS-2017'\n"
"E='IA-Mb-22L-AB.224-QW-2018-IA-Mb-22L'\n"
"F='ZX-ss-12L-AB-123-6-LS-2017-BC-22'\n"
"G='AB.224-2018'\n"
"H='AB.224/QW/2018'\n"
"I='AB/224/2018'")
matches = re.finditer(pattern, s)
for _, m in enumerate(matches, start=1):
    print(m.group(1) + "/" + m.group(2) + "/" + m.group(3))
Output
AB/224/18
AB/876/18
AB/26/18
AB/123/17
AB/224/18
AB/123/17
AB/224/18
AB/876/18
AB/26/18
AB/123/17
AB/224/18
AB/123/17
AB/224/18
AB/224/18
AB/224/18
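The example above does not include strings J, K and L from the question. With the pattern as given, J yields AB/10/31, because the lazy \S*? stops at the last two digits of 231. A possible variant, sketched here under the assumption that the year is always a stand-alone run of two digits or four digits starting with 20, uses digit lookarounds instead of the trailing \b. The names samples, expected and results are illustrative only.

```python
import re

# Variant: the year must not touch any other digit on either side.
pattern = re.compile(r"\b(AB)\D*(\d+)\S*?(?<!\d)(?:20)?(\d\d)(?!\d)")
# The pattern from the answer above, kept for comparison.
original = re.compile(r"\b(AB)\D*(\d+)\S*?(?:20)?(\d\d)\b")

samples = {
    "A": "AB.224-QW-2018",
    "B": "AB.876-5-LS-2018",
    "C": "AB.26-LS-18",
    "D": "AB-123-6-LS-2017",
    "E": "IA-Mb-22L-AB.224-QW-2018-IA-Mb-22L",
    "F": "ZX-ss-12L-AB-123-6-LS-2017-BC-22",
    "G": "AB.224-2018",
    "H": "AB.224/QW/2018",
    "I": "AB/224/2018",
    "J": "AB-10-HDB-231-NCLT-1-2017 AD-42-HH-2019",
    "K": "AB-1-HDB-NCLT-1-2016 AD-42-HH-2020",
    "L": "AB-1-HDB-NCLT-1-2016/(AD-42-HH-2020)",
}

# Expected values copied from the question.
expected = {
    "A": "AB/224/18", "B": "AB/876/18", "C": "AB/26/18", "D": "AB/123/17",
    "E": "AB/224/18", "F": "AB/123/17", "G": "AB/224/18", "H": "AB/224/18",
    "I": "AB/224/18", "J": "AB/10/17", "K": "AB/1/16", "L": "AB/1/16",
}

results = {}
for key, text in samples.items():
    m = pattern.search(text)
    results[key] = "/".join(m.groups()) if m else None
    print(key, results[key])

print("original on J:", "/".join(original.search(samples["J"]).groups()))
```

Only the first AB group in each string is read here, since re.search stops at the first match.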
Can't you just look for the last two digits, irrespective of dashes and "20" prefix? Like
(AB)[.-](\d+).*(\d\d)
I've tested in Sublime Text - works for me, it returns the same output you mentioned as desired.

Python Regex - get words around match

I want to get the words before and after my match. I could use string.split(' ') - but as I already use regex, isn't there a much better way using only regex?
Using a match object, I can get the exact location. However, this location is character indexed.
import re
myString = "this. is 12my90\nExample string"
pattern = re.compile(r"(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
m = pattern.search(myString)
print("Hit: "+m.group())
print("Index range: "+str(m.span()))
print("Words around match: "+myString[m.start()-1:m.end()+1]) # should be +/-1 in _words_, not characters
Output:
Hit: 12my90
Index range: (9, 15)
Words around match: 12my90
For getting the matching word and the word before, I tried:
pattern = re.compile(r"(\b(w+)\b)\s(\b12(\w+)90\b)",re.IGNORECASE |
re.UNICODE)
Which yields no matches.
In the second pattern you have to escape the w+ like \w+.
Apart from that, there is a newline in your example which you can match using another following \s
Your pattern with 3 capturing groups might look like
(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)
You could use the capturing groups to get the values
print("Words around match: " + m.group(1) + " " + m.group(3))
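Put together with the string from the question, the three-group pattern can be used as in the sketch below. The variable names before, hit and after are illustrative, not part of the original answer.

```python
import re

my_string = "this. is 12my90\nExample string"
pattern = re.compile(r"(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)")

m = pattern.search(my_string)
if m:
    # Group 1 is the word before, group 2 the hit, group 3 the word after.
    before, hit, after = m.groups()
    print("Hit:", hit)                            # Hit: 12my90
    print("Words around match:", before, after)   # Words around match: is Example
```

Because all three groups are mandatory, this finds nothing when the hit is the first or last word of the string.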
The newline character is missing from your pattern. Try:
regx = r"(\w+)\s12(\w+)90\n(\w+)"

python regexp 3 the same pairs of [0-9A-Fa-f]

I need a regex for some color which can be described like this:
starts with #
then 3 the same pairs of hex characters (0-9, a-f, A-F). aA and Aa are also the same pairs
Now I have #(([0-9A-Fa-f]){2}){3}
How can I make regexp for the SAME pairs of hex characters?
Some examples of the matching strings:
"#FFFFFF",
"#000000",
"#aAAaaA",
"#050505",
"###93#0b0B0b1B34"
Strings like "#000100" shouldn't match
With re.search() function:
import re
s = '#aAAaaA'
match = re.search(r'#([0-9a-z]{2})\1\1', s, re.I)
result = match if not match else match.group()
print(result)
\1 - points to the 1st parenthesized group (...)
re.I - IGNORECASE regex flag
You may use the following regex with a capturing group and a backreference:
#([0-9A-Fa-f]{2})\1{2}
Details
# - a #
([0-9A-Fa-f]{2}) - Group 1: 2 hex chars
\1{2} - 2 consecutive occurrences of the same value as captured in Group 1.
NOTE: the case insensitive flag is required to make the \1 backreference match Group 1 contents in a case insensitive way. Bear in mind we need to use a raw string literal to define the regex to avoid overescaping the backreferences.
See the Python demo:
import re
strs = ["#FFFFFF","#000000","#aAAaaA","#050505","###93#0b0B0b1B34", "#000100"]
for s in strs:
    m = re.search(r'#([0-9A-Fa-f]{2})\1{2}', s, flags=re.I)
    if m:
        print("{} MATCHED".format(s))
    else:
        print("{} DID NOT MATCH".format(s))
Results:
#FFFFFF MATCHED
#000000 MATCHED
#aAAaaA MATCHED
#050505 MATCHED
###93#0b0B0b1B34 MATCHED
#000100 DID NOT MATCH
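To see why the NOTE about the case-insensitive flag matters, the short sketch below runs the same pattern on "#aAAaaA" with and without re.I. The variable names are illustrative.

```python
import re

pattern = r'#([0-9A-Fa-f]{2})\1{2}'

# Without re.I the backreference must repeat "aA" exactly, so "Aa" breaks the match.
strict = re.search(pattern, '#aAAaaA')
print(strict)           # None

# With re.I the backreference is compared case-insensitively.
relaxed = re.search(pattern, '#aAAaaA', flags=re.I)
print(relaxed.group())  # #aAAaaA
```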

Python Regex: Symbol + in every letter in the same word

I am using Python.
I want to make a regex that allows the following examples:
Day
Dday
Daay
Dayy
Ddaay
Ddayy
...
So, each letter of a word, one or more times.
How can I write it easily? Is there an expression that makes this easy?
I have a lot of words.
Thanks
We can try using the following regex pattern:
^([A-Za-z])\1*([A-Za-z])\2*([A-Za-z])\3*$
This matches and captures a single letter, followed by any number of occurrences of that same letter, three times in a row. The \1 in the pattern is a backreference to the letter captured by the first group (and likewise \2 and \3 for the second and third groups). Note that the pattern accepts any three letters, not only d, a and y.
Code:
import re
word = "DdddddAaaaYyyyy"
matchObj = re.match(r'^([A-Za-z])\1*([A-Za-z])\2*([A-Za-z])\3*$', word, re.M|re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
    print("matchObj.group(3) : ", matchObj.group(3))
else:
    print("No match!!")
To match a character one or more times you can use the + quantifier. To build the full pattern dynamically you would need to split the word to characters and add a + after each of them:
pattern = "".join(char + "+" for char in word)
Then just match the pattern case insensitively.
Demo:
>>> import re
>>> word = "Day"
>>> pattern = "".join(char + "+" for char in word)
>>> pattern
'D+a+y+'
>>> words = ["Dday", "Daay", "Dayy", "Ddaay", "Ddayy"]
>>> all(re.match(pattern, word, re.I) for word in words)
True
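Two details are worth hedging against when this is applied to many words: characters that are special in a regex, and the fact that re.match only anchors the start, so "Dayz" would also pass. A minimal sketch, with hypothetical helper names stretched_pattern and is_stretched, might look like this:

```python
import re

def stretched_pattern(word):
    """Build a regex where each character of word may repeat one or more times."""
    # re.escape guards characters such as "." or "+" that are special in a regex.
    return "".join(re.escape(char) + "+" for char in word)

def is_stretched(word, candidate):
    """True if candidate is word with each letter repeated one or more times."""
    # fullmatch anchors both ends, so trailing extra letters are rejected.
    return re.fullmatch(stretched_pattern(word), candidate, re.I) is not None

print(stretched_pattern("Day"))        # D+a+y+
print(is_stretched("Day", "Ddaayy"))   # True
print(is_stretched("Day", "Dayz"))     # False
print(is_stretched("Day", "Dy"))       # False
```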
Try /d+a+y+/gi:
d+ Matches d one or more times.
a+ Matches a one or more times.
y+ Matches y one or more times.
As per my original comment, the below does exactly what I explain.
Since you want to be able to use this on many words, I think this is what you're looking for.
import re
word = "day"
regex = r"^"+("+".join(list(word)))+"+$"
test_str = ("Day\n"
"Dday\n"
"Daay\n"
"Dayy\n"
"Ddaay\n"
"Ddayy")
matches = re.finditer(regex, test_str, re.IGNORECASE | re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
This works by converting the string into a list of characters, joining that list back into a string with + between the characters, and appending a final + after the last one. The resulting regex will be ^d+a+y+$. Since the input you presented is separated by newline characters, I've added re.MULTILINE.
