Querying MongoDB (via pymongo) case-insensitively and efficiently - python

I'm currently creating a website in python (pyramid) which requires users to sign up and log in. The system allows for users to choose a username which can be a mixture of capital letters, lowercase letters, and numbers.
The problem arises when making sure that two users don't accidentally share the same username, i.e. in my system 'randomUser' should be the same as 'RandomUser' or 'randomuser'.
Unfortunately (in this case), because MongoDB string comparisons are case-sensitive, there could potentially be a number of users with the 'same' username.
I am aware of the method of querying mongo for case insensitive strings:
db.stuff.find_one({"foo": /bar/i});
However, this does not seem to work in my query method using pymongo:
username = '/' + str(username) + '/i'
response = request.db['user'].find_one({"username":username},{"username":1})
Is this the correct way of structuring the query for pymongo (I'm assuming not)?
This query will be used whenever a user account is created or logged in to (as it has to check if the username exists in the system). I know it's not the most efficient query, so should it matter if it's only used on log ins or account creation? Is it more desirable to instead do something like forcing users to choose only lowercase usernames (negating the need for case-insensitive queries altogether)?

PyMongo uses native Python regular expressions, in the same way that the mongo shell uses native JavaScript regular expressions. To write the equivalent of the query you wrote in the shell above, you would use:
import re
db.stuff.find_one({'name': re.compile(username, re.IGNORECASE)})
Note, however, that this prevents MongoDB from using any index that may exist on the name field. A common pattern for case-insensitive searching or sorting is to have a second field in your document, for instance name_lower, which is always set whenever name changes (to a lower-cased version of name, in this case). You would then query for such a document like:
db.stuff.find_one({'name_lower': username.lower()})
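For example, maintaining such a lowercased field at write time might look like the following minimal sketch (it assumes a modern PyMongo; the username_lower field name, the mydb database, and the unique index are illustrative choices, not part of the original answer):
import pymongo

client = pymongo.MongoClient()            # assumes a locally running MongoDB
users = client['mydb']['user']

# A unique index on the lowercased field rejects usernames that differ only in case.
users.create_index('username_lower', unique=True)

def create_user(username):
    users.insert_one({
        'username': username,                 # original casing, for display
        'username_lower': username.lower(),   # normalized, for lookups
    })

def find_user(username):
    return users.find_one({'username_lower': username.lower()})
Because the lookup is an exact equality match, it can use the index, unlike the regex approach above.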

The accepted answer is dangerous: it will match any string containing the username! The safe option is to match the exact string:
import re
db.stuff.find_one({'name': re.compile('^' + username + '$', re.IGNORECASE)})
Even safer, escape any special characters in the variable which might affect the regex match:
import re
db.stuff.find_one({'name': re.compile('^' + re.escape(username) + '$', re.IGNORECASE)})

Related

Checkmarx Reflected XSS

I have a Django based API. After scanning my application with Checkmarx, it showed that I have a Reflected XSS vulnerability here:
user_input_data = json.loads(request.GET.get('user_input_data'))
What I already tried:
Used django.utils.html.escape
Used django.utils.html.strip_tags
Used html.escape
Used escapejson package
Every time I run the scan, it finds stored XSS at exactly this location.
What you have tried is correct and acceptable, but Checkmarx's support for Django is somewhat limited, which is why it's not recognizing any of the functions you used. You need to argue with your security team about this and have them recognize that this is one of the proper ways of preventing XSS.
A rudimentary way of sanitizing input is to use the replace function to replace the '<' and '>' characters. It is not a robust approach, but it is what Checkmarx recognizes:
def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.'''
    s = s.replace("&", "&amp;")  # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s
Credit for this code snippet goes to another post.

Python/SQLite3 escaping in WHERE-Clause

How should I do real escaping in Python for SQLite3?
If I google for it (or search Stack Overflow) there are tons of questions about this, and every time the response is something like:
dbcursor.execute("SELECT * FROM `foo` WHERE `bar` like ?", ["foobar"])
This helps against SQL injection, and is enough if I were only doing comparisons with "=", but it doesn't strip wildcards, of course.
So if I do
cursor.execute(u"UPDATE `cookies` set `count`=? WHERE `nickname` ilike ?", (cookies, name))
some user could supply "%" as the nickname and would replace all of the cookie entries with one line.
I could filter it myself (ugh… I probably will forget one of those lesser-known wildcards anyway), I could lowercase both nick and nickname and replace "ilike" with "=", but what I would really like to do would be something along the lines of:
foo = sqlescape(nick)+"%"
cursor.execute(u"UPDATE `cookies` set `count`=? WHERE `nickname` ilike ?", (cookies, foo))
? parameters are intended to avoid formatting problems for SQL strings (and other problematic data types like floating-point numbers and blobs).
LIKE/GLOB wildcards work on a different level; they are always part of the string itself.
SQL allows you to escape them, but there is no default escape character; you have to choose one with the ESCAPE clause:
escaped_foo = my_like_escape(foo, "\\")
c.execute("UPDATE cookies SET count = ? WHERE nickname LIKE ? ESCAPE '\\'",
          (cookies, escaped_foo))
(And you have to write your own my_like_escape function for % and _ (LIKE) or * and ? (GLOB).)
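Such a helper is not provided by the sqlite3 module; a minimal sketch of what my_like_escape might look like (the name and the backslash escape character are just the assumptions carried over from the snippet above):
def my_like_escape(s, escape_char="\\"):
    # Double the escape character first, then escape the LIKE wildcards,
    # so the user-supplied value only matches literally when combined with
    # an ESCAPE clause in the query.
    s = s.replace(escape_char, escape_char * 2)
    s = s.replace("%", escape_char + "%")
    s = s.replace("_", escape_char + "_")
    return s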
You've avoided outright code injection by using parametrized queries. Now it seems you're trying to do a pattern match with user-supplied data, but you want the user-supplied portion of the data to be treated as literal data (hence no wildcards). You have several options:
Just filter the input. SQLite's LIKE only understands % and _ as wildcards, so it's pretty hard to get it wrong. Just make sure to always filter inputs. (My preferred method: Filter just before the query is constructed, not when user input is read).
In general, a "whitelist" approach is considered safer and easier than removing specific dangerous characters. That is, instead of deleting % and _ from your string (and any "lesser-known wildcards", as you say), scan your string and keep only the characters you want. E.g., if your "nicknames" can contain ASCII letters, digits, "-" and ".", it can be sanitized like this:
name = re.sub(r"[^A-Za-z\d.-]", "", name)
This solution is specific to the particular field you are matching against, and works well for key fields and other identifiers. I would definitely do it this way if I had to search with RLIKE, which accepts full regular expressions, so there are a lot more characters to watch out for.
If you don't want the user to be able to supply a wildcard, why would you use LIKE in your query anyway? If the inputs to your queries come from many places in the code (or maybe you're even writing a library), you'll make your query safer if you can avoid LIKE altogether:
Here's case insensitive matching:
SELECT * FROM ... WHERE name = 'someone' COLLATE NOCASE
In your example you use prefix matching (sqlescape(nick) + "%"). Here's how to do that prefix match with an exact comparison instead of LIKE:
size = len(nick)
cursor.execute(u"UPDATE `cookies` set `count`=? WHERE substr(`nickname`, 1, ?) = ?",
(cookies, size, nick))
Ummm, normally you'd want to just replace 'ilike' with a normal '=' comparison that doesn't interpret '%' in any special way. Escaping (effectively blacklisting bad patterns) is error-prone; e.g., even if you manage to escape all known patterns in the version of SQLite you use, any future upgrade can put you at risk, etc.
It's not clear to me why you'd want to mass-update cookies based on a fuzzy match on user name.
If you really want to do that, my preferred approach would be to SELECT the list first and decide what to UPDATE at the application level to maintain a maximum level of control.
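A minimal sketch of that approach, assuming the cookies table and the nickname/count columns from the question, and replacing the fuzzy LIKE with a case-insensitive exact match plus an application-level check:
# Fetch candidate rows with a parametrized query, decide in Python,
# then update only the rows you actually meant to touch.
rows = cursor.execute(
    "SELECT rowid, nickname FROM cookies WHERE nickname = ? COLLATE NOCASE",
    (name,)
).fetchall()

for rowid, nickname in rows:
    if nickname.lower() == name.lower():   # the final say stays in the application
        cursor.execute("UPDATE cookies SET count = ? WHERE rowid = ?",
                       (cookies, rowid))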
There are several very fun ways to do this with string formatting.
From Python's Documentation:
The built-in str and unicode classes provide the ability to do complex variable substitutions and value formatting via the str.format() method:
s = "string"
c = "Cool"
print "This is a {0}. {1}, huh?".format(s,c)
#=> This is a string. Cool, huh?
Other nifty tricks you can do with string formatting:
"First, thou shalt count to {0}".format(3) # References first positional argument
"Bring me a {}".format("shrubbery!") # Implicitly references the first positional argument
"From {} to {}".format('Africa','Mercia') # Same as "From {0} to {1}"
"My quest is {name}" # References keyword argument 'name'
"Weight in tons {0.weight}" # 'weight' attribute of first positional arg
"Units destroyed: {players[0]}" # First element of keyword argument 'players'.`

2 similar looking "update" variants work differently

Here is snapshot #1:
strr = "UPDATE fileinfo SET file_name = ? WHERE md5sum = ?"
cr.execute(strr, ( rec[0], rec[1]) )
and #2:
strr = "UPDATE fileinfo SET file_name = {0} WHERE md5sum = {1}".format(rec[0], rec[1])
cr.execute(strr)
The first one works fine while the second one fails. It throws
sqlite3.OperationalError: unrecognized token:
(the token is something from rec[0]; it depends on the input data and might be "#" or "!" or whatever string you pass in)
python 3.2, win7
Thank you.
Warning! Bad code follows!
The issue is that in your second example you're not supplying the strings as SQL strings but rather as literal values. That's very unlikely to work! Instead, you'd have to do this:
strr = "UPDATE fileinfo SET file_name = '{0}' WHERE md5sum = '{1}'".format(rec[0], rec[1])
cr.execute(strr)
Notice the additional single quotes? That's SQL string syntax.
But don't do this!
The problem is if the strings you're substituting in have characters in them that are understood by the SQL parser as anything other than a literal character inside a string context. The most obvious example is ' (the single quote character) itself. While you might be safe enough for the md5sum parameter, odd things do crop up in filenames (especially where non-technical users are involved!) so it's better to be careful at the beginning.
It's possible to handle this by adding extra magical quoting to the values during substitution, but it's easy to get wrong (a problem in vast numbers of PHP programs even to this day) and it's doing it the wrong way given that we have a better, simpler solution in the use of a prepared statement.
It's also slow. Using a prepared statement (i.e., your first example) is faster because the SQL engine can parse the code once instead of each time, and the values to be injected can be handled by effectively placing them in the right slot of the generated bytecode. There's never a need to recompute the query plan (i.e., the small program that SQLite creates inside itself) and the values themselves are never fussed around with; they're just faithfully used in the right way.
The two variants you show are very different. In the first, you are using the database API to fill in the parameters; the values are being properly escaped.
In the second variant, you just use Python's string formatting to add the variables to the SQL string; the values are not escaped, and depending on the contents of rec[0] and rec[1] you will get malformed SQL.
Note also that this is the path to an SQL injection vulnerability!
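For illustration, here is a small self-contained sketch of how a value containing a single quote breaks the formatted variant while the parametrized one handles it (the table and data here are made up for the example):
import sqlite3

conn = sqlite3.connect(":memory:")
cr = conn.cursor()
cr.execute("CREATE TABLE fileinfo (file_name TEXT, md5sum TEXT)")
cr.execute("INSERT INTO fileinfo VALUES ('old.txt', 'abc123')")

rec = ("O'Brien.txt", "abc123")   # note the single quote in the filename

# Parametrized: the driver passes the values separately from the SQL; always safe.
cr.execute("UPDATE fileinfo SET file_name = ? WHERE md5sum = ?", rec)

# String formatting: the quote in rec[0] terminates the SQL string early,
# so this raises sqlite3.OperationalError (and invites SQL injection).
try:
    cr.execute("UPDATE fileinfo SET file_name = '{0}' WHERE md5sum = '{1}'"
               .format(rec[0], rec[1]))
except sqlite3.OperationalError as e:
    print("formatted query failed:", e)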

Python 2.x: how to automate enforcing unicode instead of string?

How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
Eg.
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)
You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' form can be used for everything else: the places where you don't care, and/or where it doesn't matter that Python 3 changes the default string type.
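A small sketch of that convention in Python 2 source (nothing here beyond the literal prefixes discussed above):
# -*- coding: utf-8 -*-

payload = b'\x00\x01binary header'   # really bytes: files, sockets, hashes
greeting = u'Hyvää päivää'           # really text: user-visible strings
log_tag = 'startup'                  # don't-care: the default str type is fine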
It seems to me like you really need to parse the code with an honest to goodness python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name

def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")

print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
Good question!
Edit
Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, composed of various language keywords and pattern tokens that match varying language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python, including "UnicodeStrings", non-Unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't unicode strings, your precise question.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.

avoid regex [python]

I'd like to know if it's a good idea to avoid regex.
Actually I have avoided it in any case, and some people have been giving me advice that I shouldn't avoid it, since if you know what everything means, like:
[] '|' \A \B \d \D \W \w \S \Z $ * ? ...
it would be easy to read, right? But I feel like by avoiding regex I would have more readable code.
It gets more unreadable when it's bigger, for example this from validators.py:
email_re = re.compile(
    r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
    r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-\011\013\014\016-\177])*"'  # quoted-string
    r')@(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$', re.IGNORECASE)  # domain
So, I'd like to know: is there a reason not to avoid regex?
No, don't avoid regular expressions. They're actually quite a nifty little tool and will save you a lot of work if you use them wisely.
What you do need to avoid is trying to use it for everything, a malaise that appears to strike those new to regular expressions before they become a little more tempered and a little less enamoured :-)
For example, don't use it to validate email addresses. The way you validate an email address is to send an email to it with a link that the receiver has to click on to complete the "transaction".
There are billions of valid email addresses (according to the RFCs) that have no physical email receiver behind them. The only way to be certain that there is a receiver is to send an email and wait for proof positive that it was received and acted upon.
If I find myself writing a regular expression that's more than, let's say, 60 characters, I step back to see if there's a more readable way. Similarly, if I write a regular expression and come back a week later and can't instantly recognise what it does, I think about replacing it. This particular paragraph consists of my opinions of course, but they've served me well :-)
Regular expressions are a tool. They are perfectly suited to some tasks and not to others. Like any tool, use them when they are the right tool for the job. Don't just avoid them because somebody said they were bad. Learn how to use them and then you can decide for yourself, rather than depending on someone else's dogma.
If you choose to use a more general parsing approach, like pyparsing or PLY, you will never require regular expressions (which can only match a small subset of the languages matchable with such general parsers). However, lexers such as the one in PLY are typically built around regular expressions (which are a perfect match for a lexer's needs!), so you will probably have to avoid that (as well as powerful tools such as BeautifulSoup when any "normal" user would be able to keep using and enjoying it by simply passing a regular expression object as the selector, since BeautifulSoup fully supports that) and will have to recode a lot of such existing parsers with your chosen general-purpose parsing package.
Performance may suffer greatly, of course, by using extremely general tools in cases where simpler, highly optimized and concise ones would be a perfect solution -- and the size of your code may "blow up" to being very large in many common cases. But if you don't mind having programs twice as big and twice as slow, and are determined to avoid regular expressions at all costs, you can do that.
On the other hand, if your main concern is with readability (quite an understandable and commendable concern, too), then the re.VERBOSE option, by allowing abundant use of whitespace and comments within the RE's pattern, can really do wonders for that goal without removing any of REs' advantages (except by diluting a sometimes-excessive conciseness;-). You WILL want to also keep at least one general-purpose parsing system under your belt, of course (rather than stretch REs to do tasks they're wrong for, as so many people unfortunately do!) -- but a minimal command of REs will serve you well in so many cases (including, for example, full use of BeautifulSoup and many other tools which can accept REs as parameters to apply them appropriately) that I think it's quite to be recommended.
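For instance, a short sketch of what re.VERBOSE buys you in readability (the pattern is a made-up illustration, not something from the question):
import re

# The same pattern, written with re.VERBOSE so whitespace and comments
# inside the pattern are ignored and can document each piece.
phone_re = re.compile(r"""
    (\d{3})     # area code
    [-.\s]?     # optional separator
    (\d{3})     # prefix
    [-.\s]?     # optional separator
    (\d{4})     # line number
""", re.VERBOSE)

print(phone_re.match("555-867-5309").groups())   # ('555', '867', '5309')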
Just for some comparisons, here is my version of an email format check without regexp (with test cases), and one readable regexp offered to me as an alternative (though sending an email after it is accepted is a great idea):
# -*- coding: utf8 -*-
import string
print("Valid letters in this computer are: "+string.letters)
import re

def validateEmail(a):
    sep = [x for x in a if not (x.isalpha() or
                                x.isdigit() or
                                x in r"!#$%&'*+-/=?^_`{|}~]")]
    sepjoined = ''.join(sep)
    ## sep joined must be ..@.... form
    if len(a) > 255 or sepjoined.strip('.') != '@': return False
    end = a
    for i in sep:
        part, i, end = end.partition(i)
        if len(part) < 2: return False
    return len(end) > 1

def emailval(address):
    pattern = "[\.\w]{2,}[@]\w+[.]\w+"
    return re.match(pattern, address)

if __name__ == '__main__':
    emails = ["test.@web.com", "test+john@web.museum", "test+john@web.m",
              "a@n.dk", "and.bun@webben.de", "marjaliisa.hämäläinen@hel.fi",
              "marja-liisa.hämäläinen@hel.fi", "marjaliisah@hel.", 'tony@localhost',
              '1234@23.45', 'me@somewhere']
    print('\n\t'.join(["Valid emails are:"] +
                      filter(validateEmail, emails)))
    print('\n\t'.join(["Regexp gives wrong answer:"] +
                      filter(emailval, emails)))

""" Output:
Valid letters in this computer are: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Valid emails are:
        test+john@web.museum
        and.bun@webben.de
        tony@localhost
        1234@23.45
        me@somewhere
Regexp gives wrong answer:
        test.@web.com
        and.bun@webben.de
        1234@23.45
"""
EDIT: cleaned up the regex filter function from this ancient code, and edited in a more permissive version based on the link from @detly. Good enough for a first form-filling check, for me, before sending the confirmation email. Finally, I put in the 255-character length limit check mentioned in the comments.
This code on purpose does not accept the normal a@b as a valid email address, but does accept me@somewhere. Another thing is that it depends on what isalpha returns. So this output, which is from Ideone.com, has not accepted the Scandinavian öä even though they are valid nowadays. When run on my home computer, those are accepted. This happens even when the coding line is there.
(Deleted a regular expression which purported to be an "official" one but is in fact not found in the RFC it claimed to be from.)
This regex may be amusing as it is an attempt to precisely match the e-mail address grammar provided in an older version of the Internet mail standards.
Regular expressions are likely the right tool for extracting/validating email addresses...
To extract one or more email addresses from raw text:
import re
pat_e = re.compile(r'(?P<email>[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,})')
emails = []
for r in pat_e.finditer(text):
    emails.append(r.group('email'))
return emails
To see if a single piece of text is a valid email:
import re
pat_m = re.compile(r'([\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$)')
if pat_m.match(text):
    return True
return False
