Why '.*[.].*$' matches with filename.extension? [duplicate] - python

This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 4 years ago.
While looking into the regular expression, I found the following example:
>>> import re
>>> p = re.compile('.*[.].*$')
>>> m = p.search('foo.bar')
>>> print(m.group())
foo.bar
I don't understand the process in which it recognizes simple filename with extensions like foo.bar, abc.xyz, my_files.txt. I thought this code would work like this:
1. . matches with any character.
2. * causes to match 0 or more repetitions.
3. By 1. and 2., the whole string (foo.bar) matches with .*.
4. [.] tries to find character ., but there are no characters left.
5. .*$ doesn't do anything.
6. No matches found.
I wonder how this code actually works.

The expression * causes the regex engine to match as much as possible, not everything.
Typically, the regex engine will match through the end of line like you describe, but then backtrack to an earlier position until it can proceed with the rest of the match.
Maybe think of it sort of like a labyrinth solver which explores every possible junction of the labyrinth systematically until it finds an exit, or exhausts the search space.
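One way to watch the backtracking is to wrap each .* in a capturing group and inspect what it ends up holding. This is only an illustrative sketch; the capturing groups and the second test string 'a.b.c' are additions that are not in the original question.

```python
import re

# Same pattern as in the question, with capturing groups added
# so we can see what each greedy .* holds after backtracking.
pattern = re.compile(r'(.*)[.](.*)$')

# The first .* initially grabs the whole string, then gives characters
# back one at a time until [.] can match a literal dot.
single_dot = pattern.search('foo.bar').groups()

# With several dots, the greedy .* backs off only as far as the LAST dot.
multi_dot = pattern.search('a.b.c').groups()

print(single_dot)  # ('foo', 'bar')
print(multi_dot)   # ('a.b', 'c')
```

The second result shows that the engine stops backtracking at the first position where the rest of the pattern succeeds, which for a greedy .* is the rightmost dot.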

Related

How does this regex remove punctuation pattern work? [duplicate]

This question already has answers here:
Carets in Regular Expressions
(2 answers)
Closed 11 months ago.
I'm currently learning a bit of regex in python in a course I'm doing online and I'm struggling to understand a particular expression - I've been searching the python re docs and not sure why I'm returning the non-punctuation elements rather than the punctuation.
The code is:
import re
test_phrase = "This is a sentence, with! unnecessary: punctuation."
punc_remove = re.findall(r'[^,!:]+',test_phrase)
punc_remove
OUTPUT: ['This is a sentence',' with',' unnecessary',' punctuation.']
I think I understand what each character does. I.e. [] is a character set, and ^ means starts with. So anything starting with ,!: will be returned? (or at least that's how I'm probably mistakenly interpreting it) And the + will return one or more of the pattern. But why is the output not returning something like:
OUTPUT: [', with','! unnecessary',': punctuation.']
Any explanation really appreciated!
Inside a character class, a leading ^ does not mean ‘start with’: it means ‘not’. So the RegEx matches sequences of one or more characters that are not ,, ! or :.
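A short sketch contrasting the negated class with the plain one may help. The variable names below are made up for illustration, and re.sub is shown only as one possible way to actually strip the punctuation.

```python
import re

test_phrase = "This is a sentence, with! unnecessary: punctuation."

# [^,!:]+ : runs of characters that are NOT ',', '!' or ':'
non_punct_runs = re.findall(r'[^,!:]+', test_phrase)

# [,!:] : without the ^, the class matches the punctuation itself
punct_only = re.findall(r'[,!:]', test_phrase)

# To remove the listed punctuation, replace each occurrence with nothing
cleaned = re.sub(r'[,!:]', '', test_phrase)

print(non_punct_runs)
print(punct_only)
print(cleaned)
```

Note that the final full stop survives in every case because . is not listed in the character class.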

Part of string matches any element of a list using regex [duplicate]

This question already has answers here:
How to match any string from a list of strings in regular expressions in python?
(5 answers)
Closed 2 years ago.
I want to find out if some part of a string is contained in a list of strings I allow, using regex.
For example, I would like to check if
"12_SUMMER_3456"
matches:
"\d*_(SUMMER|FALL|WINTER|SPRING)_\d*"
The thing is, I want to replace "(SUMMER|FALL|WINTER|SPRING)" with a list:
lst = ["SUMMER", "FALL", "WINTER", "SPRING"]
and then use lst instead of explicitly giving its elements inside the regex.
Is there something like that in Python?
You can join and interpolate the list.
regex = r"\d*_(" + '|'.join(lst) + r")_\d*"
As an aside, that's equivalent to
regex = r"_(" + '|'.join(lst) + r")_"
If something may or may not be there at the boundary of the match, you can simplify by taking it out; the regex will still match whether or not it's there. (If you are capturing or anchoring the match, then of course these are necessary for other reasons.)
The third-party regex library lets you say
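regex = r"\d*_\L<lst>_\d*"
where again of course the \d* are redundant.
You need to pip install regex and then obviously import regex instead of import re; the list itself is passed as a keyword argument when compiling, e.g. regex.compile(r"\L<lst>", lst=lst).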
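Here is a minimal, self-contained sketch of the join approach using only the standard re module. The re.escape call is an addition not present in the answer; it is unnecessary for these four words but guards against list entries that contain regex metacharacters. The test string "12_MONSOON_3456" is invented for illustration.

```python
import re

lst = ["SUMMER", "FALL", "WINTER", "SPRING"]

# Escape each entry so characters like '.' or '+' are treated literally,
# then join the entries into an alternation: SUMMER|FALL|WINTER|SPRING
alternation = "|".join(re.escape(item) for item in lst)
pattern = re.compile(r"\d*_(" + alternation + r")_\d*")

match = pattern.search("12_SUMMER_3456")
season = match.group(1) if match else None

# A word that is not in the list does not match at all
no_match = pattern.search("12_MONSOON_3456")

print(season)    # SUMMER
print(no_match)  # None
```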

Python: Overlapping regex search [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 4 years ago.
So if I create a program in python (3.7) that looks like this:
import re
regx = re.compile("test")
print(regx.findall("testest"))
and run it, then I will get:
["test"]
Even though there are two instances of "test" it's only showing me one which I think is because a letter from the first "test" is being used in the second "test". How can I make a program that will give me ["test", "test"] as a result instead?
You will want to use a capturing group with a lookahead (?=(regex_here)):
import re
regx = re.compile("(?=(test))")
print(regx.findall("testest"))
>>> ['test', 'test']
findall returns non-overlapping matches. Once a character has been consumed by one match, it is not examined again, so overlapping occurrences are not found.
To do this you need to use a feature of Python regular expressions called a lookahead assertion. You will look for instances of the character t where it is followed by est. The lookahead does not consume parts of the string.
import re
regx = re.compile('t(?=est)')
print([m.start() for m in regx.finditer('testest')])
[0, 3]
More details on this page: https://docs.python.org/3/howto/regex.html
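The two answers can be combined into one sketch that reports both the start position and the text of each overlapping occurrence. The variable names here are illustrative only.

```python
import re

# The lookahead (?=...) matches at a position without consuming anything,
# so the scan can find another occurrence starting inside the previous one.
# The inner group (test) captures the text seen by the lookahead.
overlapping = re.compile(r'(?=(test))')

hits = [(m.start(), m.group(1)) for m in overlapping.finditer('testest')]

# For comparison: the plain pattern consumes 'test' and skips the overlap.
plain = re.findall(r'test', 'testest')

print(hits)   # [(0, 'test'), (3, 'test')]
print(plain)  # ['test']
```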

How can I find multiple of the same format in Python? [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
For a little idea of what the project is, I'm trying to make a markup language that compiles to HTML/CSS. I plan on formatting links like this: #(link mask)[(link url)], and I want to find all occurrences of this and get both the link mask and the link url.
I tried using this code for it:
re.search("#(.*)\[(.*)\]", string)
But it started at the beginning of the first instance, and ended at the end of the last instance of a link. Any ideas how I can have it find all of them, in a list or something?
The default behavior of a regular expression is "greedy matching". This means each .* will match as many characters as it can.
You want them to instead match the minimal possible number of characters. To do that, change each .* into a .*?. The final question mark will make the pattern match the minimal number of characters. Because you anchor your pattern to a ] character, it will still match/consume the whole link correctly.
* is greedy: it matches as many characters as it can, e.g. up to the last right parenthesis in your document. (After all, . means "any character" and ) is "any character" as much as any other character.)
You need the non-greedy version of *, which is *?. (Probably actually you should use +?, as I don't think zero-length matches would be very useful).
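A small sketch of the difference, keeping the pattern from the question and only swapping the quantifiers. The sample text is invented, and it assumes links are written as #mask[url]; adjust the pattern if the parentheses in the format are meant literally.

```python
import re

# Hypothetical input with two links in the assumed #mask[url] form
text = "#Home[/index.html] and #Docs[/docs.html]"

# Greedy: the first .* runs on to the LAST '[' in the text
greedy = re.search(r"#(.*)\[(.*)\]", text).groups()

# Lazy: each .*? stops at the first place the rest of the pattern can match,
# and findall collects every link as a (mask, url) tuple
links = re.findall(r"#(.*?)\[(.*?)\]", text)

print(greedy)  # ('Home[/index.html] and #Docs', '/docs.html')
print(links)   # [('Home', '/index.html'), ('Docs', '/docs.html')]
```

Using re.findall (or re.finditer) instead of re.search also answers the second half of the question, since re.search only ever returns the first match.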

Using variable in regular expression in Python [duplicate]

This question already has answers here:
How to use a variable inside a regular expression?
(12 answers)
Closed 2 years ago.
I've looked at several posts and other forums to find an answer related to my question, but nothing has come up specific to what I need. As a heads up, I'm new to programming and don't possess the basic foundation that most would.
I know bash and a little Python, and I'm decent with regular expressions.
I'm trying to create a python script, using RE's to parse through data and give me an output that I need/want.
My output will consist of 4 values, all originating from one line. The line being read in is thrown together with no defined delimiter. (hence the reason for my program)
In order to find one of the 4 values, I have to say look for 123- and give me everything after that but stop here df5. The 123- is not constant, but defined by a regular expression that works, same goes for df5. I assigned both RE's to a variable. How can I use those variables to find what I want between the two... Please let me know if this makes sense.
import re
start = '123-'
stop = 'df5'
regex = re.compile('{0}(.*?){1}'.format(re.escape(start), re.escape(stop)))
Note that the re.escape() calls aren't necessary for these example strings, but it is important if your delimiters can ever include characters with a special meaning in regex (., *, +, ? etc.).
How about a pattern "%s(.*?)%s" % (oneTwoThree, dF5)? Then you can do a re.search on that pattern and use the groups function on the result.
Something on the lines of
pattern = "%s(.*?)%s" % (oneTwoThree, dF5)
matches = re.search(pattern, text)
if matches:
    print(matches.groups())
re.findall, if used instead of re.search, can save you the trouble of grouping the matches.
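Putting the pieces together, a sketch with re.findall might look like the following. The sample text is invented for illustration. The delimiters are escaped here, which treats them as literal strings; if start and stop are themselves regular expressions, as the question suggests, the re.escape calls should be dropped.

```python
import re

start = '123-'
stop = 'df5'

# Build the pattern from the two variables; (.*?) lazily captures
# whatever sits between the nearest start/stop pair.
pattern = re.compile(f'{re.escape(start)}(.*?){re.escape(stop)}')

# Hypothetical input line with no defined delimiter between fields
text = 'xx123-firstdf5yy123-seconddf5zz'

# With a single capturing group, findall returns just the captured text
found = pattern.findall(text)

print(found)  # ['first', 'second']
```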
