Django: urlpattern for username? - python

Up to now a team mate used this code for the url patterns of user names:
# urls.py
urlpatterns = patterns('...',
url(r'^user/(?P<username>[.-_\w]+)/foo', 'myapp.views.foo'),
....
There is a hidden bug: if the username contains a -, reversing the URL fails, because the start of the character class, [.-_, means "all characters from . to _".
What pattern can be used to match all valid usernames?
PS: I guess adding the - sign to the regex is not enough, if you want to match all possible user names in django.

Based on what I see in the AbstractUser model, I think a better regex for grabbing the username is (?P<username>[\w.@+-]+).

I don't think you should put any username validation in your URL pattern. Keep your validation in one place -- the place you create your accounts for the first time.
You should match anything the user supplies there, and pass that to a safe database function to look up the username and fail if it doesn't exist.
So, in your url pattern, let the browser send anything that is nonempty, and rely on your very smart database to tell you what you previously decided was valid or not.
url(r'^user/(?P<username>.+)/foo$', 'myapp.views.foo'),
Also, note the "$" on the end.

You can either move the hyphen to the start of the character class,
[-.\w]
or you can escape it with a backslash
[.\-\w]
Note that I have removed the underscore, since it is included in \w. I am also assuming that you only want to accept ., - and \w, and that you don't want to accept all the characters from . to _. That range includes characters like @, so you might want to check that all your usernames match the new regex.
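The difference between the two character classes can be checked directly with Python's re module. This is a minimal sketch; the helper name accepts and the sample usernames are made up for illustration.

```python
import re

old = re.compile(r'[.-_\w]+')   # '.-_' is a range from '.' (46) to '_' (95)
new = re.compile(r'[-.\w]+')    # a leading '-' is a literal hyphen

def accepts(pattern, username):
    """Return True if the whole username consists of allowed characters."""
    return pattern.fullmatch(username) is not None

print(accepts(old, 'john-doe'))   # hyphen (45) lies outside the range
print(accepts(new, 'john-doe'))
print(accepts(old, 'john@doe'))   # '@' (64) lies inside the range
print(accepts(new, 'john@doe'))
```

The old class rejects the hyphen but silently accepts @, which is why existing usernames are worth re-checking against the new class.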

You can write the character class in any of the following ways:
[-.\w] (hyphen in the leftmost position)
or [.\-\w] (hyphen escaped with a backslash, in any position)
or [.\w-] (hyphen in the rightmost position)
If you use characters that have a special meaning in regex, it is safest to put a \ (backslash) before them.
With that, your regex becomes ^user/(?P<username>[.\-_\w]+)/foo

First of all, it is not a bug but a feature well documented in the docs:
[]
Used to indicate a set of characters. In a set:
Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [a-]), it will match a literal '-'.
So, using - between two literals will evaluate that regex as a character range:
re.compile("[a-0]+")
>> error: bad character range
re.findall("[.-_]+", "asdasd-asdasdad._?asdasd-")
>> ['._?']
As you can see, Python always interprets - as a range indicator when it is used between two characters in a character set.
As (also) stated in the docs, you avoid a range declaration by escaping the - as \- or by placing it as the first or last literal in the character set [].
If you want to capture that character range including -, then try:
re.findall(r"[.-_\-]+", "asdasd-asdasdad._?asdasd-")
>> ['-', '._?', '-']
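Note: \w is equal to [a-zA-Z0-9_] when the LOCALE and UNICODE flags are not set (in Python 3, str patterns are Unicode-aware by default, so this holds only with re.ASCII). Either way, you do not need to declare _ again.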
And in your situation:
url(r'^user/(?P<username>[-.\w]+)/foo', 'myapp.views.foo')
url(r'^user/(?P<username>[.\w-]+)/foo', 'myapp.views.foo')
url(r'^user/(?P<username>[.\-\w]+)/foo', 'myapp.views.foo')
Beyond the - usage, if you are using the default Django username rules, then @navneet35371 is right about the valid character set. You may alter your regex character set to include @ and + and use
url(r'^user/(?P<username>[\w.@+-]+)/foo', 'myapp.views.foo')
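As a rough check outside Django, the same pattern can be tried with plain re. This sketch assumes the default Django username rules allow letters, digits and the characters @ . + - _; the function name username_from_path and the sample paths are hypothetical.

```python
import re

# Same regex as the URL pattern, compiled on its own for testing.
USER_URL = re.compile(r'^user/(?P<username>[\w.@+-]+)/foo')

def username_from_path(path):
    """Return the captured username, or None if the path does not match."""
    m = USER_URL.match(path)
    return m.group('username') if m else None

print(username_from_path('user/jane-doe/foo'))
print(username_from_path('user/jane.doe+x@example.com/foo'))
print(username_from_path('user/jane doe/foo'))   # space is not allowed
```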

Related

Regex Match on String (DOI)

Hi, I'm struggling to understand why my regex isn't working.
I have URLs that contain DOIs, like so:
https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true
And I'm using, for example, this regex, but it always returns an empty list:
print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
Where have I gone wrong?
It looks like you come from another programming language that has the notion of regex literals that are delimited with forward slashes and have the modifiers following the closing slash (hence /i).
In Python there is no such thing, and these slashes and modifier(s) are taken as literal characters. For flags like i you can use the optional flags parameter of findall.
Secondly, ^ will match the start of the input string, but evidently the URLs you have as input do not start with 10, so that has to go. Instead you could require that the 10 must follow a word break... i.e. it should not be preceded by an alphanumerical character (or underscore).
Similarly, $ will match the end of the input string, but you have URLs that continue with URL parameters, like ?nols=y, so again the part you are interested in does not go on until the end of the input. So that has to go too.
The dot has a special meaning in regex, but you clearly intended to match a literal dot, so it should be escaped.
Finally, alphanumerical characters can be matched with \w, which also matches both lower case and capital Latin letters, so you can shorten the character class a bit and do without any flags such as i (re.I).
This leaves us with:
print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+',
'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
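To see how the corrected pattern behaves across several of the URLs from the question, here is a small sketch. The helper extract_doi is a made-up name, and the last URL is a shortened version of the openurl link, which contains no DOI.

```python
import re

DOI_RE = re.compile(r'\b10\.\d{4,9}/[-.;()/:\w]+')

urls = [
    'https://link.springer.com/10.1007/s00737-021-01116-5',
    'https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228',
    'https://dx.doi.org/10.1108/14664100110397304?nols=y',
    'https://onlinelibrary.wiley.com/resolve/openurl?genre=article&issn=0165-0203',
]

def extract_doi(url):
    """Return the first DOI-like substring in url, or None."""
    m = DOI_RE.search(url)
    return m.group(0) if m else None

for url in urls:
    print(extract_doi(url))
```

The query string ?nols=y is left out of the match because ? is not in the character class.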

Regular Expression Difficulties

So this is the link I have to extract:
http://www.hrmagazine.co.uk/article-details/finance-sector-dominates-working-families-benchmark
And this is what I have currently
.+\/article-details\/.+\-.+\-.+\-.+\-.+\-.+$
The issue, however, is it extracts any number of words and hyphens after the "/article-details/" part, rather than specifically 6 word titles with hyphens replacing the spaces above. So it would accept a bad result
http://www.hrmagazine.co.uk/article-details/finance-sector-dominates-working-families-benchmark-test
When I need it to only accept links like this format
http://www.hrmagazine.co.uk/article-details/one-two-three-four-five-six
What's the correct regular expression for this type of website? The current extractor I have in Scrapy/Spyder is the following
rules = (Rule(LinkExtractor(allow=['.+\/article-details\/.+\-.+\-.+\-.+\-.+\-.+$']), callback='parse_item', follow=True),)
Each of those .+ in your regex can match any number of ANY character - including hyphens. So your overall regex is just requiring a minimum of 5 hyphens, not an exact count. Use [^-]+ to match only non-hyphen characters.
Note that none of those backslashes in your regex are accomplishing anything - in no case is the following character something requiring escaping. Even if they were, you'd need to double the backslashes, or use a raw string r'whatever', so that the backslashes are being interpreted by the re module, rather than Python's string literal parsing rules.
Try replacing the . with something like [a-z]; . will also match hyphens, which is why it's matching an unlimited number of words:
.+\/article-details\/[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+$
If you need to match things like numbers, add them to the brackets as well ([a-z0-9], etc.).
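Putting both answers together, a pattern that requires exactly six hyphen-separated words could look like the sketch below. It uses [^-/]+ for a word and a repeated group instead of spelling out each word; the names SIX_WORDS and is_six_word_article are illustrative only.

```python
import re

# One word, then exactly five more "-word" groups, then end of URL.
SIX_WORDS = re.compile(r'/article-details/[^-/]+(?:-[^-/]+){5}$')

def is_six_word_article(url):
    """Return True if the slug after /article-details/ has exactly six words."""
    return SIX_WORDS.search(url) is not None

base = 'http://www.hrmagazine.co.uk/article-details/'
print(is_six_word_article(base + 'finance-sector-dominates-working-families-benchmark'))
print(is_six_word_article(base + 'one-two-three-four-five-six'))
print(is_six_word_article(base + 'finance-sector-dominates-working-families-benchmark-test'))
```

Note the raw string (r'...'), which keeps any backslashes you add later intact for the re module.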

Python Regex; why does ignorecase change something? [duplicate]

$.validator.addMethod('AZ09_', function (value) {
return /^[a-zA-Z0-9.-_]+$/.test(value);
}, 'Only letters, numbers, and _-. are allowed');
When I use something like test-123 it still triggers as if the hyphen is invalid. I tried \- and --
Escaping using \- should be fine, but you can also try putting it at the beginning or the end of the character class. This should work for you:
/^[a-zA-Z0-9._-]+$/
Escaping the hyphen using \- is the correct way.
I have verified that the expression /^[a-zA-Z0-9.\-_]+$/ does allow hyphens. You can also use the \w class to shorten it to /^[\w.\-]+$/.
(Putting the hyphen last in the expression actually causes it to not require escaping, as it then can't be part of a range, however you might still want to get into the habit of always escaping it.)
The \- may not have worked because you passed the whole pattern from the server as a string. If that's the case, you first have to escape the \ so that the server-side program handles it too.
In a server-side string: \\-
On the client side: \-
In the regex (what it matches): -
Or you can simply put the hyphen at the end of the [] brackets.
In general, with the hyphen (-) character in regex, it's important to note the difference between escaping it (\-) and not escaping it (-), because besides being a character itself, the hyphen is also parsed as a range specifier.
In the first case, with an escaped hyphen (\-), the regex matches only a literal hyphen (alongside + and .), as in the example /^[+\-.]+$/
In the second case, without escaping, for example /^[+-.]+$/, the hyphen sits between plus and dot, so the class matches all characters with ASCII values between 43 (for plus) and 46 (for dot), and therefore includes the comma (ASCII value 44) as a side effect.
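The comma side effect is easy to reproduce. The question is about JavaScript, but character-class ranges work the same way in Python's re module, so this sketch uses Python; the variable names are arbitrary.

```python
import re

unescaped = re.compile(r'^[+-.]+$')   # range from '+' (43) through '.' (46)
escaped = re.compile(r'^[+\-.]+$')    # literal '+', '-' and '.'

print(bool(unescaped.match(',')))     # comma (44) falls inside the range
print(bool(escaped.match(',')))
print(bool(escaped.match('+-.')))
```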
\- should work to escape the - in the character range. Can you quote what you tested when it didn't seem to? Because it seems to work: http://jsbin.com/odita3
A more generic way of matching hyphens is by using the character class for hyphens and dashes ("\p{Pd}" without quotes). If you are dealing with text from various cultures and sources, you might find that there are more types of hyphens out there, not just one character. You can add that inside the [] expression

Python regex to validate this format of string

I'm having trouble validating this type of input strings in Python.
The weekday have a variable number of characters.
Regular: 16Mar2009(mon), 17Mar2009(tues), 18Mar2009(wed)
Regular: 20Mar2009(fri), 21Mar2009(sat), 22Mar2009(sun)
Rewards: 26Mar2009(thur), 27Mar2009(fri), 28Mar2009(sat)
I want to validate the whole line, every line should have this specific format:
<name>: <date>(<weekday>), <date>(<weekday>), <date>(<weekday>)
Thanks in advance!
Try this: \w+: \d+\w+\(\w+\)(?:,\s*\d+\w+\(\w+\))*
Using programs like kiki-re you can test regexps easily.
Regular: 20Mar2009(fri), 21Mar2009(sat), 22Mar2009(sun)
your_regex = (r'^[A-Za-z]+:\s+(?:\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3,4}\),\s+){2}'
              r'\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3,4}\)$')
To see how this works, see the explanations below:
NB: You could use \w, but the character class corresponding to \w is [a-zA-Z0-9_].
r means interpret as a raw string
^ means begins with
[A-Za-z] means any character that is in the character class A,B,C...Z and a, b, c, ...z
+ means 1 or more of the preceding
: matches the literal colon
\s means whitespace
(?:...) means group, but do not capture (see capturing groups for the alternative, (...))
{x,y} means that there must be between x and y (inclusive) repetitions of whatever precedes this
{x} means that there must exist exactly 'x' of whatever came before this
\( and \) mean ( and ), but need to be preceded by \ since parenthesis are special characters in regular expressions.
$ means ends with
While this may not be exactly what you want, it works for your input and you now hopefully have the tools to change it to fit your needs. You should consider edge cases, though; for example, with the proposed solution above you could easily match: blah: 99zzz0000... Good luck!
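A short sketch of how such a pattern might be used to validate whole lines follows. It assumes weekday abbreviations of any length (so "tues" and "thur" pass), and the names LINE_RE and is_valid are hypothetical.

```python
import re

LINE_RE = re.compile(
    r'^[A-Za-z]+:\s+'                                    # <name>:
    r'(?:\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]+\),\s+){2}'   # first two dates
    r'\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]+\)$'             # last date, no comma
)

def is_valid(line):
    """Return True if the line has the <name>: date(day), date(day), date(day) shape."""
    return LINE_RE.match(line) is not None

lines = [
    'Regular: 16Mar2009(mon), 17Mar2009(tues), 18Mar2009(wed)',
    'Regular: 20Mar2009(fri), 21Mar2009(sat), 22Mar2009(sun)',
    'Rewards: 26Mar2009(thur), 27Mar2009(fri), 28Mar2009(sat)',
]
for line in lines:
    print(is_valid(line))
```

As the answer warns, this only checks the shape: a line such as blah: 99zzz0000(x), 99zzz0000(y), 99zzz0000(z) is accepted too.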

Trying to find all instances of a keyword NOT in comments or literals?

I'm trying to find all instances of the keyword "public" in some Java code (with a Python script) that are not in comments or strings, a.k.a. not found following //, in between a /* and a */, and not in between double or single quotes, and which are not part of variable names-- i.e. they must be preceded by a space, tab, or newline, and must be followed by the same.
So here's what I have at the moment--
//.*\spublic\s.*\n
/\*.*\spublic\s.*\*/
".*\spublic\s.*"
'.*\spublic\s.*'
Am I messing this up at all?
But that finds exactly what I'm NOT looking for. How can I turn it around and search the inverse of the sum of those four expressions, as a single regex?
I've figured out this probably uses negative look-ahead and look-behind, but I still can't quite piece it together. Also, for the /**/ regex, I'm concerned that .* doesn't match newlines, so it would fail to recognize that this public is in a comment:
/*
public
*/
Everything below this point is me thinking on paper and can be disregarded. These thoughts are not fully accurate.
Edit:
I daresay (?<!//).*public.* would match anything not in single line comments, so I'm getting the hang of things. I think. But still unsure how to combine everything.
Edit2:
So then-- following that idea, I |ed them all to get--
(?<!//).*public.*|(?<!/\*).*public.\*/(?!\*/)|(?<!").*public.*(?!")|(?<!').*public.*(?!')
But I'm not sure about that. //public will not be matched by the first alternate, but it will be matched by the second. I need to AND the look-aheads and look-behinds, not OR the whole thing.
I'm sorry, but I'll have to break the news to you, that what you are trying to do is impossible. The reason is mostly because Java is not a regular language. As we all know by now, most regex engines provide non-regular features, but Python in particular is lacking something like recursion (PCRE) or balancing groups (.NET) which could do the trick. But let's look into that in more depth.
First of all, why are your patterns not as good as you think they are? (for the task of matching public inside those literals; similar problems will apply to reversing the logic)
As you have already recognized, you will have problems with line breaks (in the case of /*...*/). This can be solved by either using the modifier/option/flag re.S (which changes the behavior of .) or by using [\s\S] instead of . (because the former matches any character).
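The line-break issue can be seen with a three-line block comment. This is a minimal sketch; the simplified pattern below drops the \s around public so that it applies to the example from the question.

```python
import re

source = "/*\npublic\n*/"

plain = re.search(r'/\*.*public.*\*/', source)              # '.' stops at newlines
dotall = re.search(r'/\*.*public.*\*/', source, re.S)       # '.' matches newlines too
any_char = re.search(r'/\*[\s\S]*public[\s\S]*\*/', source) # no flag needed

print(plain, dotall is not None, any_char is not None)
```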
But there are other problems. You only want to find surrounding occurrences of the string or comment literals. You are not actually making sure that they are specifically wrapped around the public in question. I'm not sure how much you can put onto a single line in Java, but if you had an arbitrary string, then later a public and then another string on a single line, then your regex would match the public because it can find the " before and after it. Even if that is not possible, if you have two block comments in the same input, then any public between those two block comments would cause a match. So you would need to find a way to assert only that your public is really inside "..." or /*...*/ and not just that these literals can be found anywhere to the left or right of it.
Next thing: matches cannot overlap. But your match includes everything from the opening literal until the ending literal. So if you had "public public" that would cause only one match. And capturing cannot help you here. Usually the trick to avoid this is to use lookarounds (which are not included in the match). But (as we will see later) the lookbehind doesn't work as nicely as you would think, because it cannot be of arbitrary length (only in .NET that is possible).
Now the worst of all. What if you have " inside a comment? That shouldn't count, right? What if you have // or /* or */ inside a string? That shouldn't count, right? What about ' inside "-strings and " inside '-strings? Even worse, what about \" inside "-string? So for 100% robustness you would have to do a similar check for your surrounding delimiters as well. And this is usually where regular expressions reach the end of their capabilities and this is why you need a proper parser that walks the input string and builds a whole tree of your code.
But say you never have comment literals inside strings and you never have quotes inside comments (or only matched quotes, because they would constitute a string, and we don't want public inside strings anyway). So we are basically assuming that each of the literals in question is correctly matched, and they are never nested. In that case you can use a lookahead to check whether you are inside or outside one of the literals (in fact, multiple lookaheads). I'll get to that shortly.
But there is one more thing left. Why does (?<!//).*public.* not work? For this to match it is enough for (?<!//) to match at any single position. E.g., if you just had the input // public, the engine would try out the negative lookbehind right at the start of the string (to the left of the start of the string), would find no //, then use .* to consume // and the space and then match public. What you actually want is (?<!//.*)public. This will start the lookbehind from the starting position of public and look all the way to the left through the current line. But... this is a variable-length lookbehind, which is only supported by .NET.
But let's look into how we can make sure we are really outside of a string. We can use a lookahead to look all the way to the end of the input, and check that there is an even number of quotes on the way.
public(?=[^"]*("[^"]*"[^"]*)*$)
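Here is a small sketch of that even-number-of-quotes lookahead in Python. It uses the pattern above with a non-capturing group and no re.M, so $ means the end of the input; the helper name outside_double_quotes and the sample line are made up.

```python
import re

# "public" followed by an even number of double quotes up to the end of input.
OUTSIDE_DQ = re.compile(r'public(?=[^"]*(?:"[^"]*"[^"]*)*$)')

def outside_double_quotes(text):
    """Return start offsets of 'public' that are not inside a "..." string."""
    return [m.start() for m in OUTSIDE_DQ.finditer(text)]

line = 'public String s = "not public here";'
print(outside_double_quotes(line))
```

Only the first public is reported, because the second one has a single (odd) quote to its right.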
Now if we try really hard we can also ignore escaped quotes when inside of a string:
public(?=[^"]*("(?:[^"\\]|\\.)*"[^"]*)*$)
So once we encounter a " we will accept either non-quote, non-backslash characters, or a backslash character and whatever follows it (that allows escaping of backslash-characters as well, so that in "a string\\" we won't treat the closing " as being escaped). We can use this with multi-line mode (re.M) to avoid going all the way to the end of the input (because the end of the line is enough):
public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)
(re.M is implied for all following patterns)
This is what it looks for single-quoted strings:
public(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)
For block comments it's a bit easier, because we only need to look for /* or the end of the string (this time really the end of the entire string), without ever encountering */ on the way. That is done with a negative lookahead at every single position until the end of the search:
public(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
But as I said, we're stumped on the single-line comments for now. But anyway, we can combine the last three regular expressions into one, because lookaheads don't actually advance the position of the regex engine on the target string:
public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
Now what about those single-line comments? The trick to emulate variable-length lookbehinds is usually to reverse the string and the pattern - which makes the lookbehind a lookahead:
cilbup(?!.*//)
Of course, that means we have to reverse all other patterns, too. The good news is, if we don't care about escaping, they look exactly the same (because both quotes and block comments are symmetrical). So you could run this pattern on a reversed input:
cilbup(?=[^"\r\n]*("[^"\r\n]*"[^"\r\n]*)*$)(?=[^'\r\n]*('[^'\r\n]*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
You can then find the match positions in your actual input by using inputLength - foundMatchPosition - foundMatchLength.
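The reversal trick and the position mapping can be sketched as follows. To keep it short, this example only applies the single-line-comment check (?!.*//) and ignores strings and block comments; the function name find_outside_line_comments is hypothetical.

```python
import re

def find_outside_line_comments(text, word='public'):
    """Find `word` not preceded by // on the same line, via a reversed search."""
    reversed_text = text[::-1]
    # On the reversed input, "no // earlier on the line" becomes a lookahead.
    pattern = re.compile(re.escape(word[::-1]) + r'(?!.*//)')
    starts = []
    for m in pattern.finditer(reversed_text):
        # inputLength - foundMatchPosition - foundMatchLength
        starts.append(len(text) - m.start() - len(m.group(0)))
    return sorted(starts)

text = 'public int a; // public note\npublic int b;'
print(find_outside_line_comments(text))
```

Because . does not match newlines, the lookahead only inspects the rest of the current (reversed) line.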
Now what about escaping? That gets quite annoying now, because we have to skip quotes if they are followed by a backslash. Because of some backtracking issues we need to take care of that in five places: three times when consuming non-quote characters (because we need to allow "\ as well now), and twice when consuming quote characters (using a negative lookahead to make sure there is no backslash after them). Let's look at double quotes:
cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)
(It looks horrible, but if you compare it with the pattern that disregards escaping, you will notice the few differences.)
So incorporating that into the above pattern:
cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
So this might actually do it for many cases. But as you can see it's horrible, almost impossible to read, and definitely impossible to maintain.
What were the caveats? No comment literals inside strings, no string literals inside strings of the other type, no string literals inside comments. Plus, we have four independent lookaheads, which will probably take some time (at least I think I have avoided most of the backtracking).
In any case, I believe this is as close as you can get with regular expressions.
EDIT:
I just realised I forgot the condition that public must not be part of a longer literal. You included spaces, but what if it's the first thing in the input? The easiest thing would be to use \b. That matches a position (without including surrounding characters) that is between a word character and a non-word character. However, Java identifiers may contain any Unicode letter or digit, and I'm not sure whether Python's \b is Unicode-aware. Also, Java identifiers may contain $. Which would break that anyway. Lookarounds to the rescue! Instead of asserting that there is a space character on every side, let's assert that there is no non-space character. Because we need negative lookarounds for that, we will get the advantage of not including those characters in the match for free:
(?<!\S)cilbup(?!\S)(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
And because just from scrolling this code snippet to the right one cannot quite grasp how ridiculously huge this regex is, here it is in freespacing mode (re.X) with some annotations:
(?<!\S)      # make sure there is no trailing non-whitespace character
cilbup       # public
(?!\S)       # make sure there is no leading non-whitespace character
(?=          # lookahead (effectively lookbehind!) to ensure we are not inside a
             # string
  (?:[^"\r\n]|"\\)*
             # consume everything except for line breaks and quotes, unless the
             # quote is followed by a backslash (preceded in the actual input)
  (?:        # subpattern that matches two (unescaped) quotes
    "(?!\\)  # a quote that is not followed by a backslash
    (?:[^"\r\n]|"\\)*
             # we've seen that before
    "(?!\\)  # a quote that is not followed by a backslash
    (?:[^"\r\n]|"\\)*
             # we've seen that before
  )*         # end of subpattern - repeat 0 or more times (ensures even no. of ")
  $          # end of line (start of line in actual input)
)            # end of double-quote lookahead
(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)
             # the same horrible bastard again for single quotes
(?=          # lookahead (effectively lookbehind) for block comments
  (?:        # subgroup to consume anything except */
    (?![*]/) # make sure there is no */ coming up
    [\s\S]   # consume an arbitrary character
  )*         # repeat
  (?:/[*]|\Z)# require to find either /* or the end of the string
)            # end of lookahead for block comments
(?!.*//)     # make sure there is no // on this line
Have you considered replacing all comments and single- and double-quoted string literals with null strings using the re sub() method, and then just doing a simple search/match/find on the resulting file for the word you're looking for?
That would at least give you the line numbers where the word is located. You may be able to use that information to edit the original file.
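A sketch of this idea is shown below, with one deliberate change: instead of null strings, the skipped regions are replaced with spaces of the same length, so that match offsets and line numbers still refer to the original file. It assumes well-formed Java without text blocks; SKIP, blank_out and find_keyword are made-up names.

```python
import re

# Regions to blank out. Scanning left to right means whichever region
# starts first wins, so // inside a string is not treated as a comment.
SKIP = re.compile(
    r'"(?:[^"\\\n]|\\.)*"'     # double-quoted string, allowing \" escapes
    r"|'(?:[^'\\\n]|\\.)*'"    # single-quoted character literal
    r'|//[^\n]*'               # line comment
    r'|/\*[\s\S]*?\*/'         # block comment (may span lines)
)

def blank_out(source):
    """Replace skipped regions with spaces, keeping newlines and offsets."""
    return SKIP.sub(lambda m: re.sub(r'[^\n]', ' ', m.group(0)), source)

def find_keyword(source, word='public'):
    """Return start offsets of `word` delimited by whitespace or text edges."""
    pattern = re.compile(r'(?<!\S)' + re.escape(word) + r'(?!\S)')
    return [m.start() for m in pattern.finditer(blank_out(source))]

source = (
    'public class A {\n'
    '  // public comment\n'
    '  /* public\n'
    '     block */\n'
    '  String s = "a public string";\n'
    '  public int x;\n'
    '}\n'
)
print(find_keyword(source))
```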
You could use pyparsing to find public keyword outside a comment or a double quoted string:
from pyparsing import Keyword, javaStyleComment, dblQuotedString
keyword = "public"
expr = Keyword(keyword).ignore(javaStyleComment | dblQuotedString)
Example
for [token], start, end in expr.scanString(r"""{keyword} should match
    /*
    {keyword} should not match "
    */
    // this {keyword} also shouldn't match
    "neither this \" {keyword}"
    but this {keyword} will
    re{keyword} is ignored
    '{keyword}' - also match (only double quoted strings are ignored)
    """.format(keyword=keyword)):
    assert token == keyword and len(keyword) == (end - start)
    print("Found at %d" % start)
Output
Found at 0
Found at 146
Found at 187
To ignore also single quoted string, you could use quotedString instead of dblQuotedString.
To do it with only regexes, see regex-negation tag on SO e.g., Regular expression to match string not containing a word? or using even less regex capabilities Regex: Matching by exclusion, without look-ahead - is it possible?. The simple way would be to use a positive match and skip matched comments, quoted strings. The result is the rest of the matches.
It's finding the opposite because that's just what you're asking for. :)
I don't know a way to match them all in a single regex (though it should be theoretically possible, since the regular languages are closed under complements and intersections). But you could definitely search for all instances of public, and then remove any instances that are matched by one of your "bad" regexes. Try using for example set.difference on the match.start and match.end properties from re.finditer.
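The set.difference suggestion might be sketched like this. Each "bad" regex is applied independently, so the same caveats apply as for the question's patterns (for example, a quote inside a comment can confuse it); BAD and keyword_starts are hypothetical names, and the string patterns are simplified to ignore escapes.

```python
import re

BAD = [
    re.compile(r'//[^\n]*'),         # line comments
    re.compile(r'/\*[\s\S]*?\*/'),   # block comments
    re.compile(r'"[^"\n]*"'),        # double-quoted strings (no escapes)
    re.compile(r"'[^'\n]*'"),        # single-quoted literals (no escapes)
]

def keyword_starts(source, word='public'):
    """Return starts of `word` that do not fall inside any 'bad' span."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    all_starts = {m.start() for m in pattern.finditer(source)}
    bad_positions = set()
    for bad in BAD:
        for m in bad.finditer(source):
            bad_positions.update(range(m.start(), m.end()))
    return sorted(all_starts.difference(bad_positions))

source = (
    'public class A {\n'
    '  // public comment\n'
    '  /* public */\n'
    '  String s = "a public string";\n'
    '  public int x;\n'
    '}\n'
)
print(keyword_starts(source))
```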
