Python Regex; why does ignorecase change something? [duplicate] - python

$.validator.addMethod('AZ09_', function (value) {
return /^[a-zA-Z0-9.-_]+$/.test(value);
}, 'Only letters, numbers, and _-. are allowed');
When I use something like test-123, it still triggers as if the hyphen is invalid. I tried \- and --

Escaping using \- should be fine, but you can also try putting it at the beginning or the end of the character class. This should work for you:
/^[a-zA-Z0-9._-]+$/

Escaping the hyphen using \- is the correct way.
I have verified that the expression /^[a-zA-Z0-9.\-_]+$/ does allow hyphens. You can also use the \w class to shorten it to /^[\w.\-]+$/.
(Putting the hyphen last in the expression actually causes it to not require escaping, as it then can't be part of a range, however you might still want to get into the habit of always escaping it.)
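To check this outside the browser, here is a minimal Python sketch. Python's re module reads a hyphen inside a character class the same way as the JavaScript patterns above. The names BROKEN, FIXED and SHORT are only illustrative, and note that in Python 3 \w also accepts non-ASCII letters unless re.ASCII is passed.

```python
import re

# Original class: ".-_" is read as a range from "." (46) to "_" (95),
# and "-" (45) falls outside that range.
BROKEN = re.compile(r"^[a-zA-Z0-9.-_]+$")
# Hyphen moved to the end of the class, so it is a literal.
FIXED = re.compile(r"^[a-zA-Z0-9._-]+$")
# Escaped hyphen, with \w standing in for letters, digits and "_".
SHORT = re.compile(r"^[\w.\-]+$")

results = {
    name: bool(rx.match("test-123"))
    for name, rx in [("broken", BROKEN), ("fixed", FIXED), ("short", SHORT)]
}
print(results)  # {'broken': False, 'fixed': True, 'short': True}
```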

The \- may not have been working because you passed the whole pattern from the server as a string. If that's the case, you should first escape the \ so that the server-side program can handle it too.
In a server side string: \\-
On the client side: \-
In regex (covers): -
Or you can simply put the hyphen at the end of the [] brackets.
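The same layering can be seen in Python, used here only as a stand-in for whatever server-side language builds the string. In an ordinary string literal the backslash has to be doubled, while a raw string takes it as written. A small sketch:

```python
import re

plain = "[a-zA-Z0-9.\\-_]+"  # ordinary string literal: backslash doubled
raw = r"[a-zA-Z0-9.\-_]+"    # raw string literal: backslash written once

same = plain == raw                           # both hold the text [a-zA-Z0-9.\-_]+
ok = re.fullmatch(raw, "test-123") is not None
print(same, ok)  # True True
```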

With the hyphen (-) in a regex character class, it's important to note the difference between escaping it (\-) and not escaping it (-), because besides being a character itself, the hyphen is also parsed as the range operator.
In the first case, with an escaped hyphen (\-), the regex matches only the hyphen itself, as in /^[+\-.]+$/
In the second case, without escaping, as in /^[+-.]+$/, the hyphen sits between plus and dot, so the class matches all characters with ASCII values between 43 (plus) and 46 (dot), and therefore includes the comma (ASCII value 44) as a side effect.
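A quick way to see the side effect, sketched in Python (the sample text is made up for illustration):

```python
import re

text = "a+b,c-d.e"
unescaped = re.findall(r"[+-.]", text)  # range "+" (43) through "." (46)
escaped = re.findall(r"[+\-.]", text)   # three literals: "+", "-", "."

print(unescaped)  # ['+', ',', '-', '.']
print(escaped)    # ['+', '-', '.']
```

The unescaped class still matches the hyphen, but only because 45 happens to lie inside the 43 to 46 range.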

\- should work to escape the - in the character range. Can you quote what you tested when it didn't seem to? Because it seems to work: http://jsbin.com/odita3

A more generic way of matching hyphens is by using the character class for hyphens and dashes ("\p{Pd}" without quotes). If you are dealing with text from various cultures and sources, you might find that there are more types of hyphens out there, not just one character. You can add that inside the [] expression
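Python's built-in re module does not understand \p{Pd} (the third-party regex module does). One possible workaround, sketched here under the assumption that scanning the Unicode table once at start-up is acceptable, is to collect the Pd characters with unicodedata and splice them into the class. DASHES and pattern are illustrative names.

```python
import re
import sys
import unicodedata

# Every code point whose Unicode general category is Pd (dash punctuation).
DASHES = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pd"
)

# re.escape turns the ASCII hyphen into \- so it cannot form a range.
pattern = re.compile("[" + re.escape(DASHES) + r"\w.]+")

print(bool(pattern.fullmatch("test-123")))       # True (ASCII hyphen-minus)
print(bool(pattern.fullmatch("test\u2013123")))  # True (en dash)
print(bool(pattern.fullmatch("test 123")))       # False (space is not allowed)
```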

Related

Regex Match on String (DOI)

Hi I'm struggling to understand why my Regex isn't working.
I have URL's that have DOI's on them like so:
https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true
And I'm using for example this Regex, but it always returns empty?
print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
Where have I gone wrong?
It looks like you come from another programming language that has the notion of regex literals that are delimited with forward slashes and have the modifiers following the closing slash (hence /i).
In Python there is no such thing, and these slashes and modifier(s) are taken as literal characters. For flags like i you can use the optional flags parameter of findall.
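A small sketch of that difference, with a toy pattern rather than the DOI one:

```python
import re

# The slashes and the trailing "i" are ordinary pattern characters in Python.
literal_hit = re.findall(r"/abc/i", "x/abc/i")
no_hit = re.findall(r"/abc/i", "ABC")
# Case-insensitive matching goes through the flags argument instead.
flagged = re.findall(r"abc", "ABC", flags=re.IGNORECASE)

print(literal_hit)  # ['/abc/i']
print(no_hit)       # []
print(flagged)      # ['ABC']
```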
Secondly, ^ will match the start of the input string, but evidently the URLs you have as input do not start with 10, so that has to go. Instead you could require that the 10 must follow a word break... i.e. it should not be preceded by an alphanumerical character (or underscore).
Similarly, $ will match the end of the input string, but you have URLs that continue with URL parameters, like ?nols=y, so again the part you are interested in does not go on until the end of the input. So that has to go too.
The dot has a special meaning in regex, but you clearly intended to match a literal dot, so it should be escaped.
Finally, alphanumerical characters can be matched with \w, which also matches both lower case and capital Latin letters, so you can shorten the character class a bit and do without any flags such as i (re.I).
This leaves us with:
print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+',
'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
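Applied to a few of the URLs from the question, the pattern can be tried like this (DOI_RE and dois are just illustrative names). The openurl link carries no DOI, so it yields None.

```python
import re

DOI_RE = re.compile(r"\b10\.\d{4,9}/[-.;()/:\w]+")

urls = [
    "https://link.springer.com/10.1007/s00737-021-01116-5",
    "https://dx.doi.org/10.1108/14664100110397304?nols=y",
    "https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true",
    "https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3",
]

dois = []
for url in urls:
    m = DOI_RE.search(url)  # search, because the DOI sits in the middle of the URL
    dois.append(m.group(0) if m else None)

print(dois)
```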

Django: urlpattern for username?

Up to now a team mate used this code for the url patterns of user names:
# urls.py
urlpatterns = patterns('...',
url(r'^user/(?P<username>[.-_\w]+)/foo', 'myapp.views.foo'),
....
There is a hidden bug: if the username contains a -, reversing fails, because the beginning of the character class, [.-_, means "all chars from . to _".
What pattern can be used to match all valid usernames?
PS: I guess adding the - sign to the regex is not enough, if you want to match all possible user names in django.
Based on what I see in the AbstractUser model, I think a better regex to use to grab the username is (?P<username>[\w.@+-]+).
I don't think you should put any username validation in your URL pattern. Keep your validation in one place -- the place you create your accounts for the first time.
You should match anything the user supplies there, and pass that to a safe database function to look up the username and fail if it doesn't exist.
So, in your url pattern, let the browser send anything that is nonempty, and rely on your very smart database to tell you what you previously decided was valid or not.
url(r'^user/(?P<username>.+)/foo$', 'myapp.views.foo'),
Also, note the "$" on the end.
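To illustrate what the trailing $ buys you, here is a sketch using plain re instead of Django. It assumes the path is matched with a regex search after the leading slash has been stripped, which is how older Django versions handled url() patterns. LOOSE and ANCHORED are illustrative names.

```python
import re

LOOSE = re.compile(r"^user/(?P<username>.+)/foo")      # no end anchor
ANCHORED = re.compile(r"^user/(?P<username>.+)/foo$")  # pattern from the answer

loose_hit = bool(LOOSE.search("user/alice/foobar"))        # True: "/foo" is only a prefix
anchored_hit = bool(ANCHORED.search("user/alice/foobar"))  # False
name = ANCHORED.search("user/a-b.c/foo").group("username")
nested = ANCHORED.search("user/x/y/foo").group("username")

print(loose_hit, anchored_hit, name, nested)  # True False a-b.c x/y
```

Note that .+ also swallows slashes, as the last case shows. If that is not wanted, [^/]+ is one possible alternative.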
You can either move the hyphen to the start of the character class,
[-.\w]
or you can escape it with a backslash
[.\-\w]
Note I have removed the underscore, since it is included in \w. I am also assuming that you only want to accept ., - and \w, and you don't want to accept all the characters from . to _. That range includes characters like @, so you might want to check that all your usernames match the new regex.
You can use any of the following:
[-.\w] (hyphen in the leftmost position)
or [.\-\w] (hyphen escaped with a backslash, in any position)
or [.\w-] (hyphen in the rightmost position)
If you use special characters, it is best to put a \ (backslash) before any character that has a special meaning in regex.
So a safe version of your regex is ^user/(?P<username>[.\-_\w]+)/foo
First of all, it is not a bug but a feature well documented in the docs:
[]
Used to indicate a set of characters. In a set:
Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [a-]), it will match a literal '-'.
So, using - between two literals will evaluate that regex as a character range:
re.compile("[a-0]+")
>> error: bad character range
re.findall("[.-_]+", "asdasd-asdasdad._?asdasd-")
>> ['._?']
As you see, Python will always interpret - as a range indicator when used between characters in character sets.
As it is (also) stated in the docs, avoiding a range declaration is done by escaping the - with \- or placing it as the first or the last literal in the character set []
If you want to capture that character range including -, then try:
re.findall("[.-_\-]+", "asdasd-asdasdad._?asdasd-")
>> ['-', '._?', '-']
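Note: \w is equal to [a-zA-Z0-9_] when the LOCALE and UNICODE flags are not set (Python 2 behaviour; in Python 3, str patterns need the re.ASCII flag for \w to be limited to this set). Either way, \w includes _, so you do not need to declare it again.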
And in your situation:
url(r'^user/(?P<username>[-.\w]+)/foo', 'myapp.views.foo')
url(r'^user/(?P<username>[.\w-]+)/foo', 'myapp.views.foo')
url(r'^user/(?P<username>[.\-\w]+)/foo', 'myapp.views.foo')
Beyond the - usage, if you are using default Django Username styling, then @navneet35371 is right about the valid character set. You may alter your regex character set to include @ and + and use
url(r'^user/(?P<username>[\w.@+-]+)/foo', 'myapp.views.foo')
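The difference between the original class and the fixed one can be checked with plain re, without Django. This sketch assumes Django's default username characters (letters, digits and @ . + - _), and the sample paths are invented.

```python
import re

BUGGY = re.compile(r"^user/(?P<username>[.-_\w]+)/foo")   # ".-_" is a range
FIXED = re.compile(r"^user/(?P<username>[\w.@+-]+)/foo")  # hyphen last: literal

paths = ["user/john-doe/foo", "user/jane@example.com/foo", "user/a+b/foo"]
report = [(p, bool(BUGGY.search(p)), bool(FIXED.search(p))) for p in paths]

for row in report:
    print(row)
# ('user/john-doe/foo', False, True)
# ('user/jane@example.com/foo', True, True)
# ('user/a+b/foo', False, True)
```

The @ case passes with the buggy pattern only because @ (64) happens to fall inside the . (46) to _ (95) range.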

How does the regex "\" character and grouping "()" character work together?

I am trying to see which statements the following pattern matches:
\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+
I am a little confused because the grouping characters () have a \ before it. Does this mean that the statement must have a ( and )? Would that mean the statements without ( or ) be unmatched?
Statements:
'404­678­2347'
'(123)­1247890'
'456­900­900'
'(678)­2001236'
'404123­1234'
'(404123­123'
Context is important:
re.match(r'\(', content) matches a literal parenthesis.
re.match(r'\(*', content) matches 0 or more literal parentheses, thus making the parens optional (and allowing more than one of them, but that's clearly a bug).
Since the intended behavior isn't "0 or more" but rather "0 or 1", this should probably be written r'\(?' instead.
That said, there's a whole lot about this regex that's silly. I'd consider instead:
[(]?\d{3}[)]?-?\d{6,}
Using [(]? avoids backslashes, and consequently is easier to read whether it's rendered by str() or repr() (which escapes backslashes).
Mixing [0-9] and \d is silly; better to pick one and stick with it.
Using * in place of ? is silly, unless you really want to match (((123))-----4567890.
\d{3}\d\d\d+ matches three digits, then three or more additional digits. Why not just match six or more digits in the first place?
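A rough side-by-side of the two patterns, using made-up sample strings (not the ones from the question) and fullmatch so the whole string has to fit:

```python
import re

ORIGINAL = re.compile(r"\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+")
TIGHTER = re.compile(r"[(]?\d{3}[)]?-?\d{6,}")

samples = ["(123)-1247890", "1234567890", "(((123)))-----4567890", "(123"]
table = [
    (s, bool(ORIGINAL.fullmatch(s)), bool(TIGHTER.fullmatch(s)))
    for s in samples
]

for row in table:
    print(row)
# ('(123)-1247890', True, True)
# ('1234567890', True, True)
# ('(((123)))-----4567890', True, False)
# ('(123', False, False)
```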
Normally, the parentheses would act as grouping characters, however regex metacharacters are reduced simply to the raw characters when preceded by a backslash. From the Python docs:
As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
In your case, the statements don't need parentheses in order to match, as each \( and \) in the expression is followed by a *, which means that the previous character can be matched any number of times, including none at all. From the Python docs:
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Thus the statements with or without parentheses around the first 3 digits may match.
Source: https://docs.python.org/2/howto/regex.html

REGEX in Python: what's wrong with (?<!\\)\".+(?<!\\)\"?

trying to parse JSON key names within quotes, including escaped quotes.
my thinking is: take anything between quotes not prefixed with \
(?<!\\)\".+(?<!\\)\"
where (?<!\\)\" should screen for " but not \" but Python complains about unbalanced parenthesis.
if I use (?<!\\\)\" Python is happy , but this doesn't work:
re.findall('(?<!\\\)\".+(?<!\\\)\"','"this is \"the\". key"."and this.is.the.child"')
leads:
['"this is "the". key"."and this.is.the.child"']
when I expect:
['"this is "the". key"', '"and this.is.the.child"']
split at the dot which is enclosed with " without escape.
I feel like i need an 'anything but not escaped double quote ' in the middle, but if
[^"] screens for anything but a double quote, I don't know how to negate the (?<!\\\)\" expression within a [ ] set that takes characters as literals.
i would want something like [^(?<!\\\)\"] but that doesn't work.
I tried things like [[^"]|(\")]+ (anything but a double quote, or a \" ) but that doesn't seem to work either...
Can't seem to find the right way to do this...
Any ideas?
Thanks for help
EDIT:
My real goal is to be able to split full 'text' JSON key names to transform them into alphanum only values. The transform is irrelevant here, but the goal is to split the keys to represent the hierarchy properly. The keys are in text form.
EDIT 2:
even though OmnipotentEntity is most likely right, writing a parser will have to wait..
This solution below doesn't support the "\" or "\\" cases as indicated in his comments.
I settled with
"(?:\\"|[^"])*?"|(?<=\.)[^".]+?(?=\.)|^[^".]+?(?=\.)|(?<=\.)[^".]+?$
inspired by the answer from Avinash Raj
but adding support for keys that are not enclosed in double quotes:
no quotes beginning of line ending with .
.key.
and
.lastkey
when substituting [empty] with the same regex, one should find 1 less element than the number of found strings, or there is an error.
something like .. outside "" will fail that test
Fundamentally, using a regular expression to parse JSON is impossible in the general case. JSON is not a regular language (all regular languages are LL(1), but not all LL(1) languages are regular, and JSON is one of these), so it cannot be matched by a regular expression.
Avinash Raj's regular expression (?<!\\)".*?(?<!\\)", for instance, fails on the case "\\", because the closing quote is preceded by a \ that does not function as an escape. You can't special-case this situation, because then "\\\"" will fail; and if you special-case that one too, the same problem comes back with 4 \, then 5 \, and so on.
Lookbehinds aren't part of standard regular expressions so they can match more grammars than simply regular ones. So you might be able to come up with a regular expression that works in this case. However, I would recommend writing a parser instead, they are very easy to do for LL(1) grammars. It will be easier, more understandable, less brittle, and give you more leverage to deal with non-conformant JSON and give you the ability to write better diagnostic messages in this case.
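As a rough idea of what such a hand-written scanner could look like for the key-splitting task in the question, here is a minimal sketch. The function name split_keys and its exact rules (split on dots outside double quotes, treat a backslash inside quotes as escaping the next character) are assumptions for illustration, not a full JSON parser.

```python
def split_keys(path):
    """Split a dotted key path on dots that sit outside double quotes."""
    keys, current = [], []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\" and in_quotes and i + 1 < len(path):
            # Keep the backslash and the character it escapes together,
            # so an escaped quote or backslash never toggles the state.
            current.append(path[i:i + 2])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "." and not in_quotes:
            keys.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted key")
    keys.append("".join(current))
    return keys


print(split_keys(r'"this is \"the\". key"."and this.is.the.child"'))
print(split_keys(r'"\\"."x"'))
```

Because the scanner consumes a backslash together with the following character, the "\\" and "\\\"" cases do not need special handling.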
Try defining your regex using raw string notation.
>>> s = r'"this is \"the\". key"."and this.is.the.child"'
>>> re.findall(r'"(?:\\"|[^"])*?"', s)
['"this is \\"the\\". key"', '"and this.is.the.child"']
OR
>>> re.findall(r'(?<!\\)".*?(?<!\\)"', s)
['"this is \\"the\\". key"', '"and this.is.the.child"']
(?<!\\) is a negative lookbehind, which asserts that the match is not preceded by a backslash.
" matches a double quote.
.*?(?<!\\)" matches all characters non-greedily up to the next double quote that is not preceded by a backslash.
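One limitation worth knowing: the lookbehind version treats any quote preceded by a backslash as escaped, even when that backslash is itself escaped. A common variant, shown here as a sketch rather than a drop-in replacement, consumes each backslash together with the character after it. LOOKBEHIND and CONSUMING are illustrative names.

```python
import re

LOOKBEHIND = re.compile(r'(?<!\\)".*?(?<!\\)"')
CONSUMING = re.compile(r'"(?:[^"\\]|\\.)*"')

s1 = r'"this is \"the\". key"."and this.is.the.child"'
s2 = r'"\\"."x"'  # first key is a single escaped backslash

print(LOOKBEHIND.findall(s1) == CONSUMING.findall(s1))  # True
print(LOOKBEHIND.findall(s2))  # ['"\\\\"."']  (runs past the first key)
print(CONSUMING.findall(s2))   # ['"\\\\"', '"x"']
```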

Regex whitespace, brackets, and parens

I am trying to match a string in the following form:
require([ "foo/bar", "foo2/bar2" ])
Whitespace should be ignored entirely. I am using the following regex with little success:
require\\(\s*\\[[.\s]*\\]\\)
Any suggestions? I know that regex attempt is horrible...
EDIT: I am using Python!
If you are using Java, PHP with double-quoted strings, or something similar, you need to double-escape the \s as well. If not, you need to replace all double backslashes with single backslashes instead. Also note that [.\s] matches only periods and whitespace (. loses its wildcard meaning within character classes). If you really want to match anything, use [\s\S] instead.
Assuming double escaping is required in the language you use:
require\\(\\s*\\[[\\S\\s]*\\]\\)
Note that this will cause problems if this occurs multiple times in the same string. Then you would get a match from the first require([ to the last ]). To avoid this, disallow ] within the repetition. However, be aware that this in turn can cause problems if your strings within require may contain ] themselves:
require\\(\\s*\\[[^]]*\\]\\)
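Since the question was edited to say Python, here is how the two variants might look there with raw strings, so single backslashes suffice. This is a sketch: GREEDY and NEGATED are illustrative names, and the sample inputs are invented.

```python
import re

GREEDY = re.compile(r"require\(\s*\[[\s\S]*\]\)")   # anything between [ and ]
NEGATED = re.compile(r"require\(\s*\[[^\]]*\]\)")   # anything except ]

one = 'require([ "foo/bar", "foo2/bar2" ])'
two = 'require(["a"]) x require(["b"])'

print(bool(GREEDY.fullmatch(one)), bool(NEGATED.fullmatch(one)))  # True True
print(GREEDY.findall(two))   # ['require(["a"]) x require(["b"])']
print(NEGATED.findall(two))  # ['require(["a"])', 'require(["b"])']
```

If whitespace may also appear between ] and ), adding \s* at that spot would be a natural extension.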
