Regex having optional groups with non-capturing groups - python

I have a regex with multiple optional, non-capturing groups. Any of these groups can occur, but none of them has to. The regex uses non-capturing groups so that the whole matched string is returned.
When I make the last group optional as well, the regex returns several results, most of them empty. When I make the first group (or the last group) mandatory, the regex matches as expected. Why is that?
The input will be something like input_text = "xyz T1 VX N1 ", expected output T1 VX N1.
import re
regexs = {
    "allOptional": 'p?(?:T[X0-4]?)?\\s?(?:V[X0-2])?\\s?(?:N[X0-3])?',
    "lastNotOptional": 'p?(?:T[X0-4]?)?\\s?(?:V[X0-2])?\\s?(?:N[X0-3])',
    "firstNotOptional": 'p?(?:T[X0-4]?)\\s?(?:V[X0-2])?\\s?(?:N[X0-3])?',
}
for key, regex in regexs.items():
    matches = re.findall(regex, input_text)
# Results
allOptional = ['', '', '', ' ', 'T1 VX N1', '']
lastNotOptional = ['T1 VX N1']
firstNotOptional = ['T1 VX N1']
Thanks in advance!

I suggest
\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)
See the regex demo.
An alternative is a combination of lookarounds that makes sure the match is immediately preceded by a whitespace char or the start of the string and that the first char of the match is a non-whitespace char, plus another lookaround combination at the end of the pattern that makes sure the match ends with a non-whitespace char followed by a whitespace char or the end of the string:
(?<!\S)(?=\S)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?(?!\S)(?<=\S)
See this regex demo.
The main point here is the two kinds of boundary checks, word-based and whitespace-based:
\b(?=\w) at the start matches a word boundary position that is immediately followed by a word char
\b(?<=\w) at the end asserts a word boundary position with a word char immediately on the left
(?<!\S)(?=\S) - a position at the start of the string or immediately after a whitespace char, and immediately followed by a non-whitespace char
(?!\S)(?<=\S) - a position at the end of the string or immediately before a whitespace char, and immediately preceded by a non-whitespace char.
See a Python demo:
import re
input_text = "xyz T1 VX N1 G1"
pattern = r'\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)'
print(re.findall(pattern, input_text))
# => ['T1 VX N1']
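As for the "why": when every part of the pattern is optional, the pattern can succeed without consuming anything, so re.findall reports an empty string (or a lone space) at each position where the real token sequence does not start. The sketch below illustrates this under some assumptions: the sample input has no trailing space, the variable names are made up for the demo, and the exact list of empty matches may differ between Python versions.

```python
import re

# Sample input assumed for illustration (no trailing space).
input_text = "xyz T1 VX N1"

all_optional = r'p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?'
bounded = r'\b(?=\w)' + all_optional + r'\b(?<=\w)'

# Every element is optional, so the pattern also matches the empty string
# (or a single space) wherever the token sequence does not start.
raw_matches = re.findall(all_optional, input_text)

# Filtering blank results afterwards is a workaround ...
non_blank = [m for m in raw_matches if m.strip()]

# ... while the boundary checks reject the blank matches up front.
bounded_matches = re.findall(bounded, input_text)

print(raw_matches)
print(non_blank)        # ['T1 VX N1']
print(bounded_matches)  # ['T1 VX N1']
```

Making the first or the last group mandatory has a similar effect, because the pattern can then no longer match an empty string.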

Related

Regex for creating an acronym

I'm trying to build a function that will collect an acronym using only regular expressions.
Example:
Data Science = DS
I'm trying to do 3 steps:
Find the first letter of each word
Translate every single letter to uppercase.
Group
Unfortunately I get errors.
I repeat that I need to use the regular expression functionality.
Regular expression for creating an acronym.
some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)
result:
DATA SCIENCEata DATA SCIENCEcience
Why is the text duplicated?
I plan to get: DATA SCIENCE
You don't need regex for the problem you have stated. You can just split the string on spaces, take the first character of each word, convert it to upper case, and finally join the characters.
>>> ''.join(w[0].upper() for w in some_words.split(' '))
'DS'
You need to handle special cases, such as a word starting with a non-alphabetic character, with something like if w[0].isalpha()
Another approach uses re.sub with a negative lookbehind:
>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
Use
import re
some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)
See Python proof.
EXPLANATION
Match a letter at the word beginning => capture (\b(?![\d_])(\w))
Else, match any character (|.)
Whenever capture is not empty replace with a capital variant (z.group(1).upper())
Else, remove the match ('').
Pattern:
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
[\d_] any character of: digits (0-9), '_'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
. any character (including \n here, because re.DOTALL is passed; without that flag, any character except \n)
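The same idea can also be expressed with re.findall instead of re.sub: collect the first letter of every word, then join and upper-case the result. This is only a sketch of a variation, not the code from the answers above; the function name acronym is hypothetical.

```python
import re

def acronym(phrase):
    """Return the upper-cased first letters of the words in phrase."""
    # \b[^\W\d_] : a letter (word char that is not a digit or underscore)
    # located right after a word boundary, i.e. at the start of a word.
    initials = re.findall(r'\b[^\W\d_]', phrase)
    return ''.join(initials).upper()

print(acronym('Data Science'))  # DS
```

Words that start with a digit or an underscore contribute nothing, which mirrors the (?![\d_]) check in the re.sub version.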

A word starting with t but ends with other than e

I am trying to create a regex that matches words that start with t or T and don't end with the letter e. I tried the code below, but it doesn't give me the desired result. Could anyone show me what exactly is missing here?
my_str = my_file.read()
word = re.findall("[tT].*[^e]$", my_str)
print(word)
You can use
\bt(?:[a-z]*[a-df-z])?\b
\bt[a-z]*\b(?<!e)
Just for completeness, here is a regex to match any word starting with a Cyrillic т and not ending with a Cyrillic е:
\bт[^\W\d_]*\b(?<!е)
See the regex demo #1, regex demo #2 and a Cyrillic regex demo.
If you need a case insensitive matching, add re.I:
re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I)
And a note on word boundaries: if the words can be glued to _ or digits, use letter boundaries rather than word boundaries:
r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])'
r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)' # Unicode letter boundaries
Regex details
\b - word boundary (start of string or a position immediately after a char other than a digit, letter, underscore)
(?<![a-z]) ((?<![^\W\d_]) is a Unicode aware equivalent) - a negative lookbehind that matches a location that is not immediately preceded with a letter
t - a t letter
(?:[a-z]*[a-df-z])? - an optional non-capturing group matching 0 or more letters and then a letter other than e
\b - word boundary
(?![a-z]) ((?![^\W\d_]) is a Unicode aware equivalent) - a negative lookahead that matches a location that is not immediately followed with a letter.
Also,
\bt[a-z]*\b(?<!e) matches a word boundary, t, any zero or more lowercase ASCII letters (any ASCII letters with re.I), then a word boundary marks the end of a word and the negative lookbehind (?<!e) fails the match if there is e at the end of the word
[^\W\d_]* - matches zero or more Unicode letters.
See a Python demo:
import re
text = r't, train => main,teene!'
cyr_text = r'таня тане работе'
print( re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bt[a-z]*\b(?<!e)', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bт[^\W\d_]*\b(?<!е)', cyr_text, re.I) )
# => ['таня']
print( re.findall(r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)', cyr_text, re.I) )
# => ['таня']
There is also another way of doing it:
re.findall(r"\b[Tt]+[a-zA-Z]*[^Ee\s]\b", my_str)
Maybe:
[\W]([Tt]\w*[^e])[\W]
Any non word character followed by (capture: Tt, some optional word characters, not e) followed by first non word character
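To see why the attempt from the question behaves differently from the word-level patterns, it can help to run both on one line of text. The sample sentence below is an assumption made for illustration; the patterns are the ones quoted above.

```python
import re

# Sample sentence assumed for illustration.
text = "The train took time today"

# Attempt from the question: the greedy .* runs to the end of the string,
# so the whole line comes back as a single match instead of single words.
line_match = re.findall(r"[tT].*[^e]$", text)

# Word-level pattern: \b limits each match to one word, and the lookbehind
# rejects words whose last letter is e.
word_matches = re.findall(r'\bt[a-z]*\b(?<!e)', text, re.I)

print(line_match)    # ['The train took time today']
print(word_matches)  # ['train', 'took', 'today']
```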

Starts with anything but not space and ends with extensions like (.png, .jpg, .mp4, .avi, .flv)

I need to get all files with media extensions (.png, .jpg, .mp4, .avi, .flv) into a list using regex. What I have tried is below:
import re
st = '''
/mnt/data/Content:
ManifestFile.txt kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4 tmp_content
default_55a655f340908dce55d10a191b6a0140 price-tags_b3c756dda783ad0691163a900fb5fe15
/mnt/data/Content/default_55a655f340908dce55d10a191b6a0140:
LayoutFile_34450b33c8b44af409abb057ddedfdfe.txt blank_decommissioned.jpeg tmp_content
ManifestFile.txt blank_unregistered.png
/mnt/data/Content/default_55a655f340908dce55d10a191b6a0140/tmp_content:
/mnt/data/Content/kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4:
0001111084948-kompass-LARGE.avi 0076738703404-kompass-LARGE.png LayoutFile_7c1b3793e49204982e0e41923303c17b.txt
0001111087321-kompass-LARGE.jpg 0076738703419-kompass-LARGE.mp4 ManifestFile.txt
0001111087325-kompass-LARGE.png 0076738703420-kompass-LARGE.png tmp_content
/mnt/data/Content/kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4/tmp_content:
/mnt/data/Content/price-tags_b3c756dda783ad0691163a900fb5fe15:
0001111084948-consumer-large.png 0076738703404-consumer-large.png LayoutFile_a694b1e05d08705aaf4dd589ac61d493.txt
0001111087321-consumer-large.png 0076738703419-consumer-large.avi ManifestFile.txt
0001111087325-consumer-large.mp4 0076738703420-consumer-large.png tmp_content
/mnt/data/Content/price-tags_b3c756dda783ad0691163a900fb5fe15/tmp_content:
/mnt/data/Content/tmp_content:
'''
patt = '^.*(.png|.jpg|.gif|.bmp|.jpeg|.mp4|.avi|.flv)'
patt = '^.*$.png'
fList = re.findall(patt, st)
print(fList)
I know very little about regex, please help.
The ^.*(.png|.jpg|.gif|.bmp|.jpeg|.mp4|.avi|.flv) pattern matches the start of a string, then any 0+ chars other than line break chars as many as possible and then the extensions with any single char before them (an unescaped . matches any char but a line break char). So, this can't work for you since . matches too much here and ^ only yields a match at the start of the string.
The ^.*$.png pattern only matches the start of the string, any 0+ chars other than line break chars then the end of string and any char + png - this is a pattern that will never match any string.
Judging by your description you need
patt = r'\S+\.(?:png|jpe?g|gif|bmp|mp4|avi|flv)\b'
See the regex demo.
Details
\S+ - 1+ non-whitespace chars
\. - a literal dot
(?:png|jpe?g|gif|bmp|mp4|avi|flv) - a non-capturing group matching any of the listed extensions (since it captures nothing, re.findall returns the whole match rather than just the group contents)
\b - a word boundary (actually, it is optional, but it will make sure you match an extension above as a whole word).
See the Python demo:
import re
st = '<YOUR_STRING_HERE>'
patt = r'\S+\.(?:png|jpe?g|gif|bmp|mp4|avi|flv)\b'
fList = re.findall(patt, st)
for s in fList:
    print(s)
yielding
blank_decommissioned.jpeg
blank_unregistered.png
0001111084948-kompass-LARGE.avi
0076738703404-kompass-LARGE.png
0001111087321-kompass-LARGE.jpg
0076738703419-kompass-LARGE.mp4
0001111087325-kompass-LARGE.png
0076738703420-kompass-LARGE.png
0001111084948-consumer-large.png
0076738703404-consumer-large.png
0001111087321-consumer-large.png
0076738703419-consumer-large.avi
0001111087325-consumer-large.mp4
0076738703420-consumer-large.png
You can use the RegEx \S+\.(?:png|jpg|gif|bmp|jpeg|mp4|avi|flv)
\S+ matches any non white-space char at least one time
\. matches a dot
(?: ... ) is a non capturing group
png|jpg|gif|bmp|jpeg|mp4|avi|flv matches your defined extensions
Demo.
Try this:
patt = r'[^ \n]+?\.(?:png|jpg|gif|bmp|jpeg|mp4|avi|flv)'
[^ \n] is a negated character class, allowing no spaces or newlines.
The dot (.) is a special character and needs to be escaped with a backslash.
Try it online here.
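A short check of the two points made in these answers, the escaped dot and the \S+ prefix, might look like the sketch below. The listing is a shortened, made-up sample rather than the full output from the question, and the re.IGNORECASE flag is an added assumption to cover upper-case extensions.

```python
import re

# Shortened sample listing assumed for illustration.
listing = """/mnt/data/Content:
ManifestFile.txt  blank_decommissioned.jpeg  tmp_content
clip.mp4  photo.PNG
"""

media = re.compile(r'\S+\.(?:png|jpe?g|gif|bmp|mp4|avi|flv)\b', re.IGNORECASE)
found = media.findall(listing)
print(found)  # ['blank_decommissioned.jpeg', 'clip.mp4', 'photo.PNG']

# An unescaped dot matches any character, so a name without a real
# ".png" extension slips through; the escaped dot rejects it.
unescaped = bool(re.fullmatch(r'\S+.png', 'backup_png'))
escaped = bool(re.fullmatch(r'\S+\.png', 'backup_png'))
print(unescaped, escaped)  # True False
```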

Regex capture all words in a line before the last 2

I have a bunch of line data I need to capture like so:
Level production data TD Index
Total Agriculture\Production data TS Index
I need to capture everything before the last two words, for example in this case my regex output should be Level production data for the first match. How can I do this while also assuming varying number of words before the TD Index. Thanks!
Try this regex:
^.*(?=(?:\s+\S+){2}$)
Click for Demo
Explanation:
^ - asserts the start of the string
.* - matches 0+ occurrences of any character except a newline character
(?=(?:\s+\S+){2}$) - positive lookahead that checks the current position is followed by exactly two words (1+ whitespace chars followed by 1+ non-whitespace chars, repeated twice) right before the end of the string
You can try this:
import re
s = ["Level production data TD Index", "Total Agriculture\\Production data TS Index"]
new_s = [re.findall(r'[\w\s\W]{1,}(?=\s\w+\s\w+$)', i)[0] for i in s]
print(new_s)
Output:
['Level production data', 'Total Agriculture\\Production data']
Code
See regex in use here
.*(?= \S+ \S+)
Alternatively: .*(?= [\w\/]+ [\w\/]+) replacing \S with what you define as your valid word character set.
You can also add + after the spaces if there is a possibility of more than 1 space being present as such: .*(?= +\S+ +\S+)
Usage
See code in use here
import re
r = r".*(?= \S+ \S+)"
l = [
    "Level production data TD Index",
    "Total Agriculture\\Production data TS Index"
]
for s in l:
    m = re.match(r, s)
    if m:
        print(m.group(0))
Explanation
.* Match any character any number of times
(?= \S+ \S+) Positive lookahead ensuring what follows matches:
    " " Match a literal space
    \S+ Match any non-whitespace character one or more times
    " " Match a literal space
    \S+ Match any non-whitespace character one or more times
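The first pattern in this thread, ^.*(?=(?:\s+\S+){2}$), was given without Python code. A minimal sketch of how it might be used follows; the helper name leading_words and the compiled constant LEADING are hypothetical.

```python
import re

# Greedy .* backtracks until exactly two whitespace-separated words
# remain between the current position and the end of the string.
LEADING = re.compile(r'^.*(?=(?:\s+\S+){2}$)')

def leading_words(line):
    """Return everything before the last two words, or None if there is no match."""
    m = LEADING.match(line)
    return m.group(0) if m else None

print(leading_words("Level production data TD Index"))
# Level production data
print(leading_words(r"Total Agriculture\Production data TS Index"))
# Total Agriculture\Production data
```

A line that consists of only two words returns None here, because the lookahead needs whitespace before the first of the two trailing words.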

Matching newline and any character with Python regex

I have a text like
var12.1
a
a
dsa
88
123!!!
secondVar12.1
The string between var and secondVar may be different (and there may be different count of them).
How can I dump it with regexp?
I'm trying something something like this to no avail:
re.findall(r"^var[0-9]+\.[0-9]+[\n.]+^secondVar[0-9]+\.[0-9]+", str, re.MULTILINE)
You can grab it with:
var\d+(?:(?!var\d).)*?secondVar
See demo. The re.S (or re.DOTALL) modifier must be used with this regex so that . can match a newline. The pattern as written has no capturing group, so the whole match, delimiters included, is returned; wrap the tempered greedy token in a capturing group if you only need the text between the delimiters.
NOTE: The closest match will be matched due to the (?:(?!var\d).)*? tempered greedy token (i.e. if you have another var + a digit after var + 1+ digits, then the match will be between the second var and secondVar).
NOTE2: You might want to use \b word boundaries to match the words beginning with them: \bvar(?:(?!var\d).)*?\bsecondVar.
REGEX EXPLANATION
var - match the starting delimiter
\d+ - 1+ digits
(?:(?!var\d).)*? - a tempered greedy token that matches any char, 0 or more (but as few as possible) repetitions, that does not start a char sequence var and a digit
secondVar - match secondVar literally.
IDEONE DEMO
import re
p = re.compile(r'var\d+(?:(?!var\d).)*?secondVar', re.DOTALL)
test_str = "var12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar12.1\nvar12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar12.1"
print(p.findall(test_str))
Result for the input string (I doubled it for demo purposes):
['var12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar', 'var12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar']
You're looking for the re.DOTALL flag, with a regex like this: var(.*?)secondVar. This regex would capture everything between var and secondVar.
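Two details from this thread can be checked directly: inside a character class the dot is a literal dot, which is why [\n.]+ in the question does not mean "any character or newline", and with re.DOTALL a lazy capturing group returns only the text between the markers. The sample string below is an assumption modelled on the question.

```python
import re

# Sample text assumed for illustration, modelled on the question.
text = "var12.1\na\na\ndsa\n88\n123!!!\nsecondVar12.1"

# Inside [...] the dot loses its special meaning: this class matches
# only newlines and literal '.' characters.
class_matches = re.findall(r'[\n.]+', 'a.\nb')
print(class_matches)  # ['.\n']

# With re.DOTALL the dot also matches newlines; the lazy group captures
# everything between the first "var" and the next "secondVar".
between = re.findall(r'var(.*?)secondVar', text, re.DOTALL)
print(between)  # ['12.1\na\na\ndsa\n88\n123!!!\n']
```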
