How does this regex remove punctuation pattern work? [duplicate] - python

This question already has answers here:
Carets in Regular Expressions
(2 answers)
Closed 11 months ago.
I'm currently learning a bit of regex in python in a course I'm doing online and I'm struggling to understand a particular expression - I've been searching the python re docs and not sure why I'm returning the non-punctuation elements rather than the punctuation.
The code is:
import re
test_phrase = "This is a sentence, with! unnecessary: punctuation."
punc_remove = re.findall(r'[^,!:]+',test_phrase)
punc_remove
OUTPUT: ['This is a sentence',' with',' unnecessary',' punctuation.']
I think I understand what each character does. I.e. [] is a character set, and ^ means starts with. So anything starting with ,!: will be returned? (Or at least that's how I'm probably mistakenly interpreting it.) And the + will return one or more of the pattern. But why is the output not returning something like:
OUTPUT: [', with','! unnecessary',': punctuation.']
Any explanation really appreciated!

Inside a character class, a leading ^ does not mean ‘starts with’: it means ‘not’. So the regex matches sequences of one or more characters that are not , ! or :.
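To make the difference concrete, here is a minimal sketch that reuses the test phrase from the question and compares the negated class with the plain class. The re.sub call at the end is an addition, shown as one possible way to actually delete the punctuation.

```python
import re

test_phrase = "This is a sentence, with! unnecessary: punctuation."

# [^,!:]+ : one or more characters that are NOT ',', '!' or ':'
kept = re.findall(r'[^,!:]+', test_phrase)

# [,!:] : exactly one character that IS ',', '!' or ':'
punctuation = re.findall(r'[,!:]', test_phrase)

# Outside a character class, ^ anchors the match to the start of the string
starts_with_this = re.match(r'^This', test_phrase) is not None

# Deleting the listed characters is a job for re.sub rather than re.findall
cleaned = re.sub(r'[,!:]', '', test_phrase)

print(kept)              # ['This is a sentence', ' with', ' unnecessary', ' punctuation.']
print(punctuation)       # [',', '!', ':']
print(starts_with_this)  # True
print(cleaned)           # This is a sentence with unnecessary punctuation.
```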

Related

Part of string matches any element of a list using regex [duplicate]

This question already has answers here:
How to match any string from a list of strings in regular expressions in python?
(5 answers)
Closed 2 years ago.
I want to find out if some part of a string is contained in a list of strings I allow, using regex.
For example, I would like to check if
"12_SUMMER_3456"
matches:
"\d*_(SUMMER|FALL|WINTER|SPRING)_\d*"
The thing is, I want to replace "(SUMMER|FALL|WINTER|SPRING)" with a list:
lst = ["SUMMER", "FALL", "WINTER", "SPRING"]
and then use lst instead of explicitly giving its elements inside the regex.
Is there something like that in Python?
You can join and interpolate the list.
regex = r"\d*_(" + '|'.join(lst) + r")_\d*"
As an aside, that's equivalent to
regex = r"_(" + '|'.join(lst) + r")_"
If something may or may not be there at the boundary of the match, you can simplify by taking it out; the regex will still match whether or not it's there. (If you are capturing or anchoring the match, then of course these are necessary for other reasons.)
The third-party regex library lets you say
regex = r"\d*_\L<lst>_\d*"
where again of course the \d* are redundant.
You need to pip install regex and then obviously import regex instead of import re. The list itself is passed as a keyword argument, e.g. regex.search(pattern, text, lst=lst).
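The join-and-interpolate approach with the standard re module could look like the sketch below. Wrapping each item in re.escape is an extra precaution that is not part of the answer above; it only matters if a list entry could contain regex metacharacters.

```python
import re

lst = ["SUMMER", "FALL", "WINTER", "SPRING"]

# re.escape guards against list items that contain regex metacharacters
alternation = "|".join(re.escape(item) for item in lst)
pattern = re.compile(r"\d*_(" + alternation + r")_\d*")

match = pattern.search("12_SUMMER_3456")
print(match.group(0))  # 12_SUMMER_3456
print(match.group(1))  # SUMMER

# A season that is not in the list does not match at all
no_match = pattern.search("12_AUTUMN_3456")
print(no_match)        # None
```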

Why '.*[.].*$' matches with filename.extension? [duplicate]

This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 4 years ago.
While looking into the regular expression, I found the following example:
>>> import re
>>> p = re.compile('.*[.].*$')
>>> m = p.search('foo.bar')
>>> print(m.group())
foo.bar
I don't understand the process in which it recognizes simple filename with extensions like foo.bar, abc.xyz, my_files.txt. I thought this code would work like this:
1. . matches with any character.
2. * causes to match 0 or more repetitions.
3. By 1. and 2., the whole string (foo.bar) matches with .*.
4. [.] tries to find character ., but there are no characters left.
5. .*$ doesn't do anything.
6. No matches found.
I wonder how this code actually works.
The * quantifier causes the regex engine to match as much as possible, not necessarily everything.
Typically, the regex engine will match through the end of line like you describe, but then backtrack to an earlier position until it can proceed with the rest of the match.
Maybe think of it sort of like a labyrinth solver which explores every possible junction of the labyrinth systematically until it finds an exit, or exhausts the search space.
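One way to see where the backtracking ends up is to put capture groups around each .* and inspect them. This is only an illustrative sketch: the groups and the second file name are additions, not part of the original example.

```python
import re

# Same pattern as in the question, with a group around each .* so the
# final split between the two halves can be inspected
p = re.compile(r'(.*)[.](.*)$')

m = p.search('foo.bar')
print(m.group(1), m.group(2))    # foo bar

# With two dots, the greedy first .* gives back only as much as it has to,
# so it stops just before the LAST dot
m2 = p.search('archive.tar.gz')
print(m2.group(1), m2.group(2))  # archive.tar gz

# Without any dot, every backtracking position fails and there is no match
m3 = p.search('foobar')
print(m3)                        # None
```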

How can I find multiple of the same format in Python? [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
For a little idea of what the project is, I'm trying to make a markup language that compiles to HTML/CSS. I plan on formatting links like this: #(link mask)[(link url)], and I want to find all occurrences of this and get both the link mask and the link url.
I tried using this code for it:
re.search("#(.*)\[(.*)\]", string)
But it started at the beginning of the first instance, and ended at the end of the last instance of a link. Any ideas how I can have it find all of them, in a list or something?
The default behavior of a regular expression is "greedy matching". This means each .* will match as many characters as it can.
You want them to instead match the minimal possible number of characters. To do that, change each .* into a .*?. The final question mark will make the pattern match the minimal number of characters. Because you anchor your pattern to a ] character, it will still match/consume the whole link correctly.
* is greedy: it matches as many characters as it can, e.g. up to the last right parenthesis in your document. (After all, . means "any character" and ) is "any character" as much as any other character.)
You need the non-greedy version of *, which is *?. (Probably actually you should use +?, as I don't think zero-length matches would be very useful).
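A minimal sketch of the difference, assuming the links are written as #mask[url] (which is what the regex in the question implies; the sample text and URLs here are made up for illustration). re.findall is used so that every link is returned, not just the first.

```python
import re

# Hypothetical sample text with two links in the assumed #mask[url] format
text = "See #Google[https://google.com] and #Docs[https://docs.python.org]."

# Greedy: the first .* runs on to the last '[' in the text
greedy = re.search(r"#(.*)\[(.*)\]", text)
print(greedy.group(1))  # Google[https://google.com] and #Docs
print(greedy.group(2))  # https://docs.python.org

# Non-greedy: each +? stops at the first place the rest of the pattern fits;
# findall returns one (mask, url) tuple per link
links = re.findall(r"#(.+?)\[(.+?)\]", text)
print(links)
# [('Google', 'https://google.com'), ('Docs', 'https://docs.python.org')]
```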

Regex not working to get string between 2 strings. Python 27 [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
From this URL view-source:https://www.amazon.com/dp/073532753X?smid=A3P5ROKL5A1OLE
I want to get string between var iframeContent = and obj.onloadCallback = onloadCallback;
I have this regex iframeContent(.*?)obj.onloadCallback = onloadCallback;
But it does not work. I am not good at regex so please pardon my lack of knowledge.
I even tried iframeContent(.*?)obj.onloadCallback but it does not work.
It looks like you just want that giant encoded string. I believe yours is failing for two reasons. You're not running in DOTALL mode, which means your . won't match across multiple lines, and your regex is failing because of catastrophic backtracking, which can happen when you have a very long variable length match that matches the same characters as the ones following it.
This should get what you want
m = re.search(r'var iframeContent = \"([^"]+)\"', html_source)
print m.group(1)
The regex is just looking for any characters except double quotes [^"] in between two double quotes. Because the variable length match and the match immediately after it don't match any of the same characters, you don't run into the catastrophic backtracking issue.
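The effect of DOTALL can be checked on a small stand-in string. The sketch below does not use the real page source; html_source is a made-up two-line value, and the patterns are simplified versions of the ones discussed above.

```python
import re

# Hypothetical stand-in for the page source: the quoted value spans two lines
html_source = 'var iframeContent = "abc\ndef";\nobj.onloadCallback = onloadCallback;'

pattern = r'iframeContent = "(.*?)";'

# By default . does not match a newline, so the search fails
without_flag = re.search(pattern, html_source)
print(without_flag)                # None

# With re.DOTALL, . matches newlines too
with_flag = re.search(pattern, html_source, re.DOTALL)
print(repr(with_flag.group(1)))    # 'abc\ndef'

# A negated character class matches newlines without needing any flag
negated = re.search(r'iframeContent = "([^"]+)"', html_source)
print(repr(negated.group(1)))      # 'abc\ndef'
```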
I suspect that input string lies across multiple lines. Try adding re.M in search line (i.e. re.findall('someString', text_Holder, re.M)).
You could try this regex too
(?<=iframeContent =)(.*)(?=obj.onloadCallback = onloadCallback)
You can test it at this site.
It is very important that you use DOTALL mode, which means that you will have single-line

Using variable in regular expression in Python [duplicate]

This question already has answers here:
How to use a variable inside a regular expression?
(12 answers)
Closed 2 years ago.
I've looked at several posts and other forums to find an answer related to my question, but nothing has come up specific to what I need. As a heads up, I'm new to programming and don't possess the basic foundation that most would.
I know bash, a little Python, and am decent with RE.
I'm trying to create a python script, using RE's to parse through data and give me an output that I need/want.
My output will consist of 4 values, all originating from one line. The line being read in is thrown together with no defined delimiter. (hence the reason for my program)
In order to find one of the 4 values, I have to say look for 123- and give me everything after that but stop here df5. The 123- is not constant, but defined by a regular expression that works, same goes for df5. I assigned both RE's to a variable. How can I use those variables to find what I want between the two... Please let me know if this makes sense.
import re
start = '123-'
stop = 'df5'
regex = re.compile('{0}(.*?){1}'.format(re.escape(start), re.escape(stop)))
Note that the re.escape() calls aren't necessary for these example strings, but it is important if your delimiters can ever include characters with a special meaning in regex (., *, +, ? etc.).
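To illustrate why the escaping matters, here is a small sketch with made-up delimiters that do contain special characters. The values of start, stop and text are hypothetical, chosen so that the unescaped pattern still compiles but matches the wrong part of the string.

```python
import re

# Hypothetical delimiters containing regex metacharacters (., +, parentheses)
start = '1.5+'
stop = '(end)'
text = '1x5 oops end 1.5+ value (end)'

# Unescaped: '.' matches any character, '5+' means "one or more 5s" and
# '(end)' becomes a capture group, so the wrong part of the text matches
unescaped = re.compile('{0}(.*?){1}'.format(start, stop))
print(repr(unescaped.search(text).group(1)))  # ' oops '

# Escaped: the delimiters are matched literally
escaped = re.compile('{0}(.*?){1}'.format(re.escape(start), re.escape(stop)))
print(repr(escaped.search(text).group(1)))    # ' value '
```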
How about a pattern "%s(.*?)%s" % (oneTwoThree, dF5)? Then you can do a re.search on that pattern and use the groups function on the result.
Something along the lines of
pattern = "%s(.*?)%s" % (oneTwoThree, dF5)
matches = re.search(pattern, text)
if matches:
    print(matches.groups())
re.findall, if used instead of re.search, can save you the trouble of grouping the matches.
