This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 4 years ago.
So if I create a program in python (3.7) that looks like this:
import re
regx = re.compile("test")
print(regx.findall("testest"))
and run it, then I will get:
['test']
There are two instances of "test", but it only shows me one. I think this is because a letter from the first "test" is also used in the second "test". How can I make a program that gives me ['test', 'test'] as a result instead?
You will want to use a capturing group with a lookahead (?=(regex_here)):
import re
regx = re.compile("(?=(test))")
print(regx.findall("testest"))
['test', 'test']
re.findall returns non-overlapping matches. The scan moves from left to right, and once a match has consumed a character, that character is not examined again, so overlapping occurrences are not found.
To find them, use a feature of Python regular expressions called a lookahead assertion. Look for each character t that is followed by est. The lookahead does not consume any part of the string, so the scan resumes right after the t.
import re
regx = re.compile('t(?=est)')
print([m.start() for m in regx.finditer('testest')])
[0, 3]
More details on this page: https://docs.python.org/3/howto/regex.html
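As a rough sketch, the two answers above can be combined into one small helper that reports both the position and the text of each overlapping match. The name find_overlapping is made up for this illustration and is not part of the re module.

```python
import re

def find_overlapping(pattern, text):
    """Return (start, matched_text) pairs, including overlapping matches.

    Hypothetical helper: it wraps the pattern in a lookahead with one
    capturing group, so every match is zero-width and the scan can
    restart one character further on.
    """
    wrapped = re.compile("(?=(%s))" % pattern)
    return [(m.start(), m.group(1)) for m in wrapped.finditer(text)]

plain = re.findall("test", "testest")
overlapping = find_overlapping("test", "testest")

print(plain)        # ['test']
print(overlapping)  # [(0, 'test'), (3, 'test')]
```

Because the outer group is always group 1, this still works if the pattern passed in contains groups of its own.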
This question already has answers here:
Carets in Regular Expressions
(2 answers)
Closed 11 months ago.
I'm currently learning a bit of regex in Python in an online course, and I'm struggling to understand a particular expression. I've been searching the Python re docs and I'm not sure why I'm getting the non-punctuation elements back rather than the punctuation.
The code is:
import re
test_phrase = "This is a sentence, with! unnecessary: punctuation."
punc_remove = re.findall(r'[^,!:]+',test_phrase)
punc_remove
OUTPUT: ['This is a sentence',' with',' unnecessary',' punctuation.']
I think I understand what each character does, i.e. [] is a character set and ^ means starts with. So anything starting with ,!: will be returned? (Or at least that's how I'm probably mistakenly interpreting it.) And the + will return one or more of the pattern. But why is the output not something like:
OUTPUT: [', with','! unnecessary',': punctuation.']
Any explanation really appreciated!
Inside a character class, a ^ does not mean 'starts with': it means 'not'. So the regex matches sequences of one or more characters that are not , ! or :.
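To make the difference concrete, here is a small sketch that runs the negated class, the plain class, and a ^ outside any class on the phrase from the question. The variable names other than test_phrase are made up for the illustration.

```python
import re

test_phrase = "This is a sentence, with! unnecessary: punctuation."

# [^,!:]+ : runs of characters that are NOT ',', '!' or ':'
kept = re.findall(r'[^,!:]+', test_phrase)

# [,!:] : the punctuation characters themselves (no negation)
punct = re.findall(r'[,!:]', test_phrase)

# Outside a character class, ^ anchors the match at the start of the string
starts_with_this = re.findall(r'^This', test_phrase)

# One way to actually remove the punctuation
cleaned = re.sub(r'[,!:]', '', test_phrase)

print(kept)              # ['This is a sentence', ' with', ' unnecessary', ' punctuation.']
print(punct)             # [',', '!', ':']
print(starts_with_this)  # ['This']
print(cleaned)           # This is a sentence with unnecessary punctuation.
```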
This question already has answers here:
How to match any string from a list of strings in regular expressions in python?
(5 answers)
Closed 2 years ago.
I want to find out if some part of a string is contained in a list of strings I allow, using regex.
For example, I would like to check if
"12_SUMMER_3456"
matches:
"\d*_(SUMMER|FALL|WINTER|SPRING)_\d*"
The thing is, I want to replace "(SUMMER|FALL|WINTER|SPRING)" with a list:
lst = ["SUMMER", "FALL", "WINTER", "SPRING"]
and then use lst instead of explicitly giving its elements inside the regex.
Is there something like that in Python?
You can join and interpolate the list.
regex = r"\d*_(" + '|'.join(lst) + r")_\d*"
As an aside, that's equivalent to
regex = r"_(" + '|'.join(lst) + r")_"
If something may or may not be there at the boundary of the match, you can simplify by taking it out; the regex will still match whether or not it's there. (If you are capturing or anchoring the match, then of course these are necessary for other reasons.)
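A minimal sketch of the join approach with the standard re module follows. The re.escape call and the longest-first sort are extra precautions added for this sketch, not something the question requires; they only matter if list items contain regex metacharacters or share prefixes. The name season_re is made up.

```python
import re

lst = ["SUMMER", "FALL", "WINTER", "SPRING"]

# re.escape guards against items that contain regex metacharacters.
# Sorting longest-first (an assumption of this sketch) keeps a shorter
# item from shadowing a longer one that starts with the same characters.
alternatives = "|".join(re.escape(s) for s in sorted(lst, key=len, reverse=True))
season_re = re.compile(r"\d*_(" + alternatives + r")_\d*")

m = season_re.search("12_SUMMER_3456")
print(m.group(0))  # 12_SUMMER_3456
print(m.group(1))  # SUMMER
print(season_re.search("12_MONSOON_3456"))  # None
```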
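The third-party regex library has named lists, which let you say
regex = r"\d*_\L<lst>_\d*"
where again of course the \d* are redundant. The list itself is passed as a keyword argument (here lst=lst) when you compile or search with the pattern.
You need to pip install regex and then obviously import regex instead of import re.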
This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 4 years ago.
While looking into the regular expression, I found the following example:
>>> import re
>>> p = re.compile('.*[.].*$')
>>> m = p.search('foo.bar')
>>> print(m.group())
foo.bar
I don't understand the process in which it recognizes simple filename with extensions like foo.bar, abc.xyz, my_files.txt. I thought this code would work like this:
1. . matches with any character.
2. * causes to match 0 or more repetitions.
3. By 1. and 2., the whole string (foo.bar) matches with .*.
4. [.] tries to find the character ., but there are no characters left.
5. .*$ doesn't do anything.
6. No matches found.
I wonder how this code actually works.
The quantifier * causes the regex engine to match as much as possible, not everything.
Typically, the regex engine will match through the end of the line like you describe, but then backtrack to an earlier position until it can proceed with the rest of the match.
Maybe think of it sort of like a labyrinth solver which explores every possible junction of the labyrinth systematically until it finds an exit, or exhausts the search space.
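One way to watch the backtracking is to add capturing groups around each .* and look at what they end up holding. This is only an illustration; the groups and the lazy variant are additions that are not in the original example.

```python
import re

# Capturing groups added to show which part each .* ends up with.
greedy = re.compile(r'(.*)[.](.*)$')
lazy = re.compile(r'(.*?)[.](.*)$')

# The first .* grabs everything, then gives characters back until [.] can match.
print(greedy.search('foo.bar').groups())  # ('foo', 'bar')

# With several dots, greedy backtracking stops at the last one...
print(greedy.search('a.b.c').groups())    # ('a.b', 'c')

# ...while the lazy version grows from nothing and stops at the first one.
print(lazy.search('a.b.c').groups())      # ('a', 'b.c')

# Without any dot the engine exhausts every option and reports no match.
print(greedy.search('foobar'))            # None
```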
This question already has an answer here:
Python regular expressions, how to search for a word starting with uppercase?
(1 answer)
Closed 7 years ago.
I'm trying to get the following to work. I've looked at the Python documentation, and I still don't know how to fix it. I'm getting an AttributeError, what am I doing wrong?
import re
text = '>:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)('
m = re.match(r'(?P<name1>[A-Z][A-Za-z]*) (?P<name2>[A-Z][A-Za-z]*)', text)
m.group('name1')
If the above is incorrect, how do I get it to output
>>> m.group('name1') = 'Mitch'
You're forgetting to check that the regex actually matched anything. If it doesn't then both the .match() and .search() functions will return None.
It may be that the named group you are trying to reference was not actually matched in that string for that pattern.
Try to call groups on the returned value and you will see a tuple of all matched groups.
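A short sketch of that check, using the text and pattern from the question. The final findall pattern is only a guess at what is wanted (every capitalised word) and is not taken from the question.

```python
import re

text = '>:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)('
pattern = r'(?P<name1>[A-Z][A-Za-z]*) (?P<name2>[A-Z][A-Za-z]*)'

m = re.match(pattern, text)
print(m)  # None: calling m.group(...) here would raise AttributeError

# Guard against None before touching the match object.
if m is not None:
    print(m.group('name1'))
else:
    print('no match')

# A looser pattern (an assumption about the goal): every capitalised word.
names = re.findall(r'[A-Z][a-z]+', text)
print(names)  # ['Mitch', 'Pamela']
```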
The pattern fails for two reasons. First, re.match only tries to match at the very start of the string, so use re.search to find names further in. Second, the text has _, not a space, after the name, so I suggest using a [\s_] character class to match either. If the second name may be missing, give its group an explicit empty alternative so that it is really optional.
So, in your case, you can do it like this:
(?P<name1>[A-Z][A-Za-z]*)[\s_](?P<name2>[A-Z][A-Za-z]*|)
                         ^^^^^                        ^^
Sample IDEONE demo:
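import re
# A plain raw string works in Python 3 (the ur'' prefix is Python 2 only)
p = re.compile(r'(?P<name1>[A-Z][A-Za-z]*)[\s_](?P<name2>[A-Z][A-Za-z]*|)')
test_str = ">:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)("
match = p.search(test_str)  # search scans the whole string, not just the start
if match:  # guard against None before calling .group()
    print(match.group("name1"))  # => Mitch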
This question already has answers here:
How to use a variable inside a regular expression?
(12 answers)
Closed 2 years ago.
I've looked at several posts and other forums to find an answer related to my question, but nothing has come up specific to what I need. As a heads up, I'm new to programming and don't possess the basic foundation that most would.
I know bash, a little Python, and I'm decent with regular expressions (RE).
I'm trying to create a Python script that uses REs to parse through data and give me the output I need.
My output will consist of 4 values, all originating from one line. The line being read in is thrown together with no defined delimiter. (hence the reason for my program)
In order to find one of the 4 values, I have to say look for 123- and give me everything after that but stop here df5. The 123- is not constant, but defined by a regular expression that works, same goes for df5. I assigned both RE's to a variable. How can I use those variables to find what I want between the two... Please let me know if this makes sense.
import re
start = '123-'
stop = 'df5'
regex = re.compile('{0}(.*?){1}'.format(re.escape(start), re.escape(stop)))
Note that the re.escape() calls aren't necessary for these example strings, but it is important if your delimiters can ever include characters with a special meaning in regex (., *, +, ? etc.).
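To see why the escaping matters, here is a sketch with a made-up delimiter that contains a dot. The delimiter 'a.b-' and the input line are invented for this illustration.

```python
import re

start = 'a.b-'     # made-up delimiter containing a metacharacter
stop = 'df5'
line = 'axb-WRONGdf5 a.b-RIGHTdf5'   # made-up input

escaped = re.compile('{0}(.*?){1}'.format(re.escape(start), re.escape(stop)))
unescaped = re.compile('{0}(.*?){1}'.format(start, stop))

# Escaped: the dot is literal, so only 'a.b-' counts as a start marker.
print(escaped.search(line).group(1))    # RIGHT

# Unescaped: the dot matches any character, so 'axb-' is accepted too.
print(unescaped.search(line).group(1))  # WRONG
```

If the delimiters are themselves regular expressions, as in the question, they should be left unescaped on purpose.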
How about a pattern "%s(.*?)%s" % (oneTwoThree, dF5)? Then you can do a re.search on that pattern and use the groups function on the result.
Something along the lines of
import re
# oneTwoThree and dF5 hold the two regex strings; text is the input line
pattern = "%s(.*?)%s" % (oneTwoThree, dF5)
matches = re.search(pattern, text)
if matches:
    print(matches.groups())
re.findall, if used instead of re.search, can save you the trouble of grouping the matches.
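A small sketch of the difference, assuming the two delimiters are regex strings as the question describes. The patterns, variable names, and sample text below are invented for the illustration.

```python
import re

one_two_three = r'\d{3}-'   # hypothetical start pattern
d_f5 = r'df\d'              # hypothetical stop pattern
text = '123-alphadf5 456-betadf7'   # made-up input line

pattern = "%s(.*?)%s" % (one_two_three, d_f5)

# re.search stops at the first match and returns a match object.
print(re.search(pattern, text).groups())  # ('alpha',)

# With exactly one group, re.findall returns the captured strings directly.
print(re.findall(pattern, text))          # ['alpha', 'beta']
```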