`(?P<name>...) ` and `\g<quote>` in re module - python

While reading the documentation for Python's re module, (?P<name>...) often confused me.
From the answer to Named regular expression group "(?P<group_name>regexp)": what does "P" stand for? on Stack Overflow, I gathered that the P here stands for nothing in particular, something as arbitrary as foo, bar, or zoo.
(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.
Named groups can be referenced in three contexts. If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e. matching a string quoted with either single or double quotes):
I often make mistakes when using \g<quote> in the repl argument of re.sub.
I try to be Pythonic and to explain the underlying reasons to others.
Why is it g rather than P in \g<quote>, or why not G in (?P<name>...)?
That way there would be some consistency rather than chaos.
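For reference, here is a small sketch of the three contexts the documentation mentions, using the quote pattern quoted above. The sample string and variable names are made up for illustration.

```python
import re

# 1. Inside the pattern itself the group is referenced as (?P=quote).
pattern = re.compile(r'(?P<quote>[\'"]).*?(?P=quote)')
text = 'say "hi" or \'bye\''

# 2. On a match object the group is read by name (or by number).
m = pattern.search(text)
first_quote = m.group("quote")

# 3. In the repl argument of re.sub the group is written \g<quote>.
#    Here every quoted string is replaced by an empty quoted string.
emptied = pattern.sub(r"\g<quote>\g<quote>", text)

print(first_quote)
print(emptied)
```

Because a named group is also a numbered group, \1 in the pattern, m.group(1) on the match, and \g<1> in repl refer to the same group.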

Related

How to use regular expression to remove all math expression in latex file

Suppose I have a string containing part of a LaTeX file. How can I use Python's re module to remove every math expression in it?
e.g:
text="This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
I would like my function to return ans="This is an example . How to remove it? Another random math expression ...".
Thank you!
Try this Regex:
(\$+)(?:(?!\1)[\s\S])*\1
Explanation:
(\$+) - matches 1+ occurrences of $ and captures it in Group 1
(?:(?!\1)[\s\S])* - matches 0+ occurrences of any character (including newlines), as long as the text at that position does not start with what was captured in Group 1
\1 - matches the contents of Group 1 again
Replace each match with a blank string.
As suggested by @torek, we should not match 3 or more consecutive $, hence changing the expression to (\${1,2})(?:(?!\1)[\s\S])*\1
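In Python this could look roughly like the following. It is a minimal sketch that applies the expression above to the string from the question. The names MATH and ans are only for illustration.

```python
import re

# (\${1,2})          opening delimiter: $ or $$, captured as group 1
# (?:(?!\1)[\s\S])*  any characters, stopping where the delimiter recurs
# \1                 the same delimiter again
MATH = re.compile(r"(\${1,2})(?:(?!\1)[\s\S])*\1")

text = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."

# Replace each match with an empty string.
ans = MATH.sub("", text)
print(ans)
```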
It's commonly said that regular expressions cannot count, which is kind of a loose way of describing a problem more formally discussed in Count parentheses with regular expression. See that for what this means.
Now, with that in mind, note that LaTeX math expressions can include nested sub-equations, which can include further nested sub-equations, and so on. This is analogous to the problem of detecting whether a closing parenthesis closes an inner parenthesized expression (as in (for instance) this example, where the first one does not) or an outer parenthesis. Therefore, regular expressions are not going to be powerful enough to handle the full general case.
If you're willing to do a less-than-complete job, you can construct a regular expression that finds $...$ and $$...$$. You will need to pay attention to the particular regular expression language available. Python's is essentially the same as Perl's here.
Importantly, these $-matchers will completely miss \begin{equation} ... \end{equation}, \begin{eqnarray} ... \end{eqnarray}, and so on. We've already noted that handling LaTeX expression parsing with a mere regular expression recognizer is inadequate, so if you want to do a good job (while ignoring the complexity of lower-level TeX manipulation of token types, where one can change any individual character's category code), you will want a more general parser. You can then tokenize \begin, {, }, and words, and match up the begin/end pairs. You can also tokenize $ and $$ and match those up. Since parsers can count, in exactly the way that regular expressions can't, you can do a much better job this way.
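To make the tokenize-and-match idea concrete, here is a rough sketch. It is not a real LaTeX parser: the MATH_ENVS set, the token pattern, and the strip_math name are assumptions made for this illustration, and category-code changes, \( ... \), \[ ... \] and comments are ignored.

```python
import re

# Assumed set of environments treated as math; extend as needed.
MATH_ENVS = {"equation", "equation*", "eqnarray", "eqnarray*", "align", "align*"}

# Tokens: \begin{name}, \end{name}, an escaped character (e.g. \$), $$ or $.
TOKEN = re.compile(
    r"\\begin\{(?P<begin>[^}]*)\}"
    r"|\\end\{(?P<end>[^}]*)\}"
    r"|(?P<esc>\\.)"
    r"|(?P<dollar>\$\$|\$)"
)

def strip_math(text):
    kept = []    # plain-text pieces found outside any math
    stack = []   # currently open math delimiters (this is the "counting")
    start = 0    # where the current plain-text run began
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        if kind == "esc":
            continue                      # \$ and friends are not delimiters
        if kind == "dollar":
            key = m.group()
            closing = bool(stack) and stack[-1] == key
        else:
            key = m.group(kind)
            if key not in MATH_ENVS:
                continue                  # non-math environment: leave alone
            closing = kind == "end"
            if closing and (not stack or stack[-1] != key):
                continue                  # stray \end: ignore it
        if closing:
            stack.pop()
            if not stack:
                start = m.end()           # plain text resumes after the math
        else:
            if not stack:
                kept.append(text[start:m.start()])
            stack.append(key)
    if not stack:
        kept.append(text[start:])         # unclosed math at the end is dropped
    return "".join(kept)

sample = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
print(strip_math(sample))
print(strip_math(r"A \begin{equation} x = $y$ \end{equation} B"))
```

The stack is what lets this version handle a $...$ nested inside $$...$$ or inside an environment, which a single regular expression can only approximate.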

Regular Expression: Simple Syntax Highlighting (Python)

I have been struggling with creating a regular expression that will differentiate between an object definition and calling that object in Python. The purpose is syntax highlighting.
This is the situation which needs to be resolved: (numbers denote line)
0 def otherfunc(vars...):
1     pass
2
3 otherfunc(vars...)
I am interested in matching the name of the object, but not if preceded anywhere by def in the same line. The result on the above code should be:
"otherfunc", line: 3
Are regular expressions capable of doing something like this?
EDIT: I am only concerned with scanning/searching a single line at a time.
You could use a negative lookbehind. This matches an atom that is not preceded by another atom. So in your case you're looking for otherfunc that is not preceded by "def".
I'm using PCRE regex here.
(?<!def\s)otherfunc
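The same lookbehind also works in Python's re module, since def\s has a fixed width. Here is a small sketch that scans the snippet from the question one line at a time. The \b word boundaries are my addition, and the variable names are made up.

```python
import re

source = "def otherfunc(vars...):\n    pass\n\notherfunc(vars...)"

# otherfunc as a whole word, not directly preceded by "def" plus one whitespace.
call = re.compile(r"(?<!def\s)\botherfunc\b")

hits = [
    (m.group(), lineno)
    for lineno, line in enumerate(source.splitlines())
    for m in call.finditer(line)
]
print(hits)
```

Note that the lookbehind only inspects the four characters immediately before the name, so a definition written with two spaces after def would still be reported.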
I like Richard's answer, however I would also take into consideration the valid function name characters of Python and indentation. So this is what I came up with:
(?<!(def\s))(?<=^|\s)[a-zA-Z_][\w_]*(?=\()
See this working sample on Regex101
Explanation
Matches valid Python function names if
(?<!(def\s)) they do not follow a def and a whitespace character, and
(?<=^|\s) they are either at the beginning of a line or follow a whitespace character (this is the closest you get, since lookbehinds don't support variable-length quantifiers), and
(?=\() they are followed by an opening parenthesis.
Note that I am not a Python dev, so for the sake of simplicity [a-zA-Z_][\w_]* matches valid Python 2.x function names; you can extend this part of the expression to Python 3.x, which I have no clue about ;)
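If you want to run this with Python's re module rather than PCRE, note that re rejects (?<=^|\s) because the two alternatives have different widths. A possible port, offered as a sketch rather than a drop-in equivalent, replaces it with (?<!\S), which also means "at the start or after whitespace". The names CALL and called_names are hypothetical.

```python
import re

CALL = re.compile(
    r"(?<!def\s)"      # not right after "def" plus one whitespace character
    r"(?<!\S)"         # at the start of the line or after whitespace
    r"[a-zA-Z_]\w*"    # an identifier (ASCII first character, as in the answer)
    r"(?=\()"          # immediately followed by an opening parenthesis
)

def called_names(line):
    """Return the names that look like calls on a single line."""
    return CALL.findall(line)

print(called_names("def otherfunc(vars):"))
print(called_names("otherfunc(vars)"))
print(called_names("    x = foo(1) + bar(2)"))
```

Like the original, this does not report method calls such as obj.method(), because the name there is preceded by a dot rather than whitespace.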

Named non-capturing group in python?

Is it possible to have a named non-capturing group in Python? For example, I want to match strings in this pattern (including the quotes):
"a=b"
'bird=angel'
I can do the following:
import re
s = '"bird=angel"'
myre = re.compile(r'(?P<quote>[\'"])(\w+)=(\w+)(?P=quote)')
m = myre.search(s)
m.groups()
# ('"', 'bird', 'angel')
The result captures the quote group, which is not desirable here.
No, named groups are always capturing groups. From the documentation of the re module:
Extensions usually do not create a new group; (?P<name>...) is the
only exception to this rule.
And regarding the named group extension:
Similar to regular parentheses, but the substring matched by the group
is accessible within the rest of the regular expression via the
symbolic group name name
Where regular parentheses means (...), in contrast with (?:...).
You do need a capturing group in order to match the same quote: there is no other mechanism in re that allows you to do this, short of explicitly distinguishing the two quotes:
myre = re.compile(('"{0}"' "|'{0}'").format(r'(\w+)=(\w+)'))
(which has the downside of giving you four groups, two for each style of quotes).
Note that one does not need to give a name to the quotes, though:
myre = re.compile(r'([\'"])(\w+)=(\w+)\1')
works as well.
In conclusion, you are better off using groups()[1:] in order to get only what you need, if at all possible.
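A small sketch of that advice: keep the capturing quote group, then read only the parts you care about. The key and value group names are my own additions for illustration.

```python
import re

myre = re.compile(r'(?P<quote>[\'"])(?P<key>\w+)=(?P<value>\w+)(?P=quote)')

m = myre.search('"bird=angel"')

# Option 1: drop the first group (the quote) by slicing.
without_quote = m.groups()[1:]

# Option 2: ask only for the groups you want, by name.
key, value = m.group("key", "value")

# Mismatched quotes are still rejected thanks to the backreference.
mismatched = myre.search('"bird=angel\'')

print(without_quote, key, value, mismatched)
```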

Django: An explanation of (\d+)

In the Django book, and on the Django website, \d+ is used to capture data from the URL. The syntax is never explained, nor is the significance of the d or the (), beyond the fact that you can specify the number of characters in that part of the URL. How exactly, and in what order, are these variables passed to the function? How exactly does the syntax work? How do you implement it? Don't forget to explain the ().
Without further information, I'd guess it's a regular expression (or "regex" for short). These are a common string-processing mechanism used in many programming languages. In Python they are handled with the re module. If you want to learn more about them you might want to check out a general regex tutorial like http://www.regular-expressions.info/.
That is a regular expression. \d means digit, and + means "one or more". Putting it in parens specifies that it's a capturing group. The contents of each capturing group are passed to the handler function in the order they appear within the regex.
Python's regex library is re.
As another example, a more complex regex might be (\d+)/(\d+) which would capture two different sets of one or more digits separated by a slash, and pass them in as two arguments (the first digit string as the first argument, the second digit string as the second argument) to the handler function.
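To illustrate the order in which the captured values arrive, here is a plain-Python sketch. It only imitates what a URL dispatcher does and is not Django's actual code. The handler function and the sample path are hypothetical.

```python
import re

def handler(first, second):
    # Captured values always arrive as strings, not integers.
    return f"first={first!r}, second={second!r}"

pattern = re.compile(r"(\d+)/(\d+)")

m = pattern.search("2009/07")
# groups() returns the captures left to right; * passes them positionally.
result = handler(*m.groups())
print(result)
```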
It refers to a digit regular expression used to capture the numeric ID from the URI.
Hi-REST conventions call for the URI for GET/POST/... to be terminated by the ID of the resource, and in this case it is looking for a numerical ID (\d+ being one or more digits).
Enclosing it in parentheses makes it a capturing group, which is how Django knows which part of the match to pass to the view.
Example:
http://www.amazon.com/dp/0486653552

What does this Django regular expression mean? `?P`

I have the following regular expression (regex) in my urls.py and I'd like to know what it means. Specifically the (?P<category_slug> portion of the regex.
r'^category/(?P<category_slug>[-\w]+)/$'
In Django, named capturing groups are passed to your view as keyword arguments.
Unnamed capturing groups (just a pair of parentheses) are passed to your view as positional arguments.
The ?P marks a named capturing group, as opposed to an unnamed capturing group.
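The difference can be imitated in plain Python. This is only a sketch of the idea, not Django's dispatcher: the view function and the fake request value are hypothetical.

```python
import re

def view(request, category_slug):
    return f"category: {category_slug}"

# Named group: the capture is handed over as a keyword argument.
named = re.compile(r"^category/(?P<category_slug>[-\w]+)/$")
m_named = named.match("category/foo_bar-baz/")
by_keyword = view("fake-request", **m_named.groupdict())

# Unnamed group: the capture is handed over positionally.
unnamed = re.compile(r"^category/([-\w]+)/$")
m_unnamed = unnamed.match("category/foo_bar-baz/")
by_position = view("fake-request", *m_unnamed.groups())

print(by_keyword, by_position)
```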
http://docs.python.org/library/re.html
(?P<name>...) Similar to regular parentheses, but the substring
matched by the group is accessible within the rest of the regular
expression via the symbolic group name name. Group names must be valid
Python identifiers, and each group name must be defined only once
within a regular expression. A symbolic group is also a numbered
group, just as if the group were not named. So the group named id in
the example below can also be referenced as the numbered group 1.
(?P<name>regex) - Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the sharp brackets. The name may consist of letters and digits.
Copy paste from: http://www.regular-expressions.info/refext.html
(?P<category_slug>) creates a match group named category_slug.
The regex itself matches a string starting with category/ and then a mix of alphanumeric characters, the dash - and the underscore _, followed by a trailing slash.
Example URLs accepted by the regex:
category/foo/
category/foo_bar-baz/
category/12345/
category/q1e2_asdf/
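You can check these examples yourself with the re module. This is a minimal sketch in which slug_of is a made-up helper name.

```python
import re

pattern = re.compile(r"^category/(?P<category_slug>[-\w]+)/$")

def slug_of(path):
    """Return the captured slug, or None if the path does not match."""
    m = pattern.match(path)
    return m.group("category_slug") if m else None

for path in ["category/foo/", "category/foo_bar-baz/", "category/12345/",
             "category/q1e2_asdf/", "category/", "category/foo", "category/a b/"]:
    print(path, "->", slug_of(path))
```

The last three paths are rejected: the slug is empty, the trailing slash is missing, or the slug contains a space.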
When writing URL patterns, use this pattern to pass a string:
(?P<username2>[-\w]+)
and this one for an integer value:
(?P<user_id>[0-9]+)
(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.
copy paste from Python3Regex
