Check if a string is in a list, case-insensitive [duplicate] - python

I love using the expression
if 'MICHAEL89' in USERNAMES:
    ...
where USERNAMES is a list.
Is there any way to match items with case insensitivity or do I need to use a custom method? Just wondering if there is a need to write extra code for this.

username = 'MICHAEL89'
if username.upper() in (name.upper() for name in USERNAMES):
    ...
Alternatively:
if username.upper() in map(str.upper, USERNAMES):
    ...
Or, yes, you can make a custom method.
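If the check runs many times against the same list, a minimal sketch of one common variant is to build the uppercased collection once, for example as a set:
upper_usernames = {name.upper() for name in USERNAMES}  # build once

username = 'MICHAEL89'
if username.upper() in upper_usernames:  # each lookup is then O(1) on average
    ...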

str.casefold is recommended for case-insensitive string matching. @nmichaels's solution can trivially be adapted.
Use either:
if 'MICHAEL89'.casefold() in (name.casefold() for name in USERNAMES):
Or:
if 'MICHAEL89'.casefold() in map(str.casefold, USERNAMES):
As per the docs:
Casefolding is similar to lowercasing but more aggressive because it
is intended to remove all case distinctions in a string. For example,
the German lowercase letter 'ß' is equivalent to "ss". Since it is
already lowercase, lower() would do nothing to 'ß'; casefold()
converts it to "ss".

I would make a wrapper so you can be non-invasive. Minimally, for example...:
class CaseInsensitively(object):
    def __init__(self, s):
        self.__s = s.lower()
    def __hash__(self):
        return hash(self.__s)
    def __eq__(self, other):
        # ensure proper comparison between instances of this class
        try:
            other = other.__s
        except (TypeError, AttributeError):
            try:
                other = other.lower()
            except:
                pass
        return self.__s == other
Now, if CaseInsensitively('MICHAEL89') in whatever: should behave as required (whether the right-hand side is a list, dict, or set). (It may require more effort to achieve similar results for string inclusion, avoid warnings in some cases involving unicode, etc).
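A minimal usage sketch against a list (the names here are made up):
>>> USERNAMES = ['Michael89', 'alex', 'Sandra']
>>> CaseInsensitively('MICHAEL89') in USERNAMES
True
>>> CaseInsensitively('bob') in USERNAMES
False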

Usually (in OOP at least) you shape your object to behave the way you want. name in USERNAMES is not case-insensitive, so USERNAMES needs to change:
class NameList(object):
    def __init__(self, names):
        self.names = names
    def __contains__(self, name):  # implements `in`
        return name.lower() in (n.lower() for n in self.names)
    def add(self, name):
        self.names.append(name)

# now this works
usernames = NameList(USERNAMES)
print someone in usernames
The great thing about this is that it opens the path for many improvements, without having to change any code outside the class. For example, you could change the self.names to a set for faster lookups, or compute the (n.lower() for n in self.names) only once and store it on the class and so on ...
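For example, a minimal sketch of the set-based variant mentioned above (the class name is made up; add() keeps the cached lowercase set in sync):
class NameSet(object):
    def __init__(self, names):
        # lowercase once and store in a set for O(1) membership tests
        self._lower_names = {n.lower() for n in names}
    def __contains__(self, name):  # implements `in`
        return name.lower() in self._lower_names
    def add(self, name):
        self._lower_names.add(name.lower())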

Here's one way:
if string1.lower() in string2.lower():
    ...
For this to work, both string1 and string2 must be str objects.

I think you have to write some extra code. For example:
if 'MICHAEL89' in map(lambda name: name.upper(), USERNAMES):
    ...
In this case we are forming a new list with all entries in USERNAMES converted to upper case and then comparing against this new list.
Update
As @viraptor says, it is even better to use a generator instead of map. See @Nathon's answer.

You could do
matcher = re.compile('MICHAEL89', re.IGNORECASE)
filter(matcher.match, USERNAMES)
Update: I played around a bit and I think you could get a better short-circuiting approach using
matcher = re.compile('MICHAEL89', re.IGNORECASE)
if any( ifilter( matcher.match, USERNAMES ) ):
    # your code here
The ifilter function is from itertools, one of my favorite modules within Python. It's faster than a generator but only creates the next item of the list when called upon.
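Note that ifilter only exists in Python 2; on Python 3 the built-in map and filter are already lazy iterators, so a rough equivalent (same matcher as above) would be:
import re

matcher = re.compile('MICHAEL89', re.IGNORECASE)
if any(map(matcher.match, USERNAMES)):  # any() stops at the first match
    ...  # your code here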

To have it in one line, this is what I did:
if any(([True if 'MICHAEL89' in username.upper() else False for username in USERNAMES])):
    print('username exists in list')
I didn't test it time-wise though. I am not sure how fast/efficient it is.
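For what it's worth, the same check can be written a bit more directly: the True if ... else False is redundant because the in test already yields a boolean, and a generator expression avoids building the intermediate list:
if any('MICHAEL89' in username.upper() for username in USERNAMES):
    print('username exists in list')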

Example from this tutorial:
list1 = ["Apple", "Lenovo", "HP", "Samsung", "ASUS"]
s = "lenovo"
s_lower = s.lower()
res = s_lower in (string.lower() for string in list1)
print(res)

My 5 (wrong) cents
'a' in "".join(['A']).lower()
UPDATE
Ouch, totally agree @jpp, I'll keep it as an example of bad practice :(

I needed this for a dictionary instead of a list; Jochen's solution was the most elegant for that case, so I modified it a bit:
class CaseInsensitiveDict(dict):
    ''' requests special dicts are case insensitive when using the in operator,
    this implements a similar behaviour'''
    def __contains__(self, name):  # implements `in`
        return name.casefold() in (n.casefold() for n in self.keys())
Now you can convert a dictionary like so: USERNAMESDICT = CaseInsensitiveDict(USERNAMESDICT), and use if 'MICHAEL89' in USERNAMESDICT:
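A quick sketch of the resulting behaviour (the dictionary contents are made up):
>>> USERNAMESDICT = CaseInsensitiveDict({'Michael89': 1, 'alex': 2})
>>> 'MICHAEL89' in USERNAMESDICT
True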

Related

Avoid duplicated operations in lambda functions

I'm using a lambda function to extract the number in a string:
text = "some text with a number: 31"
get_number = lambda info,pattern: re.search('{}\s*(\d)'.format(pattern),info.lower()).group(1) if re.search('{}\s*(\d)'.format(pattern),info.lower()) else None
get_number(text,'number:')
How can I avoid doing this operation twice?:
re.search('{}\s*(\d)'.format(pattern), info.lower())
You can use findall() instead; it handles a non-match gracefully. or is the only operator needed to satisfy the return conditions: None is evaluated last, and is thus returned when findall() produces an empty list (an empty list is falsy).
>>> get_number = lambda info,pattern: re.findall('{}\s*(\d)'.format(pattern),info.lower()) or None
>>> print get_number(text, 'number:')
['3']
>>> print get_number(text, 'Hello World!')
None
>>>
That being said, I'd recommend defining a regular named function using def instead. You can extract the more complex parts of this code into variables, leading to an easier-to-follow algorithm. Writing long anonymous functions can lead to code smells. Something similar to below:
def get_number(source_text, pattern):
    regex = '{}\s*(\d)'.format(pattern)
    matches = re.findall(regex, source_text.lower())
    return matches or None
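Usage stays the same as with the lambda version (assuming the text variable from the question):
>>> get_number(text, 'number:')
['3']
>>> get_number(text, 'Hello World!') is None
True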
This is super ugly, not going to lie, but it does work: it returns the captured digit when there is a match and None when there is not:
lambda info,pattern: max(re.findall('{}\s*(\d)'.format(pattern),info.lower()),[None],key=lambda x: x != [])[0]

How would you use the ord function to get ord("a") as 0?

For example, in Python, when I type in ord("a") it returns 97 because that is the character's code in the ASCII table. I want ord("a") to return zero from a string that I created, such as
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 .,?!"
so ord("b") would be 1 and ord("c") would be 2 ect.
How would I go about doing this?
You don't.
You're going about this the wrong way: you're making the mistake
This existing thing doesn't meet my needs. I want to make it meet my needs!
instead, the way to go about the problem is
This existing thing doesn't meet my needs. I need a thing that does meet my needs!
Once you realize that, the problem is now pretty straightforward. e.g.
DEFAULT_ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,?!"

def myord(x, alphabet=DEFAULT_ALPHABET):
    return alphabet.find(x)
Something like this should do the trick:
def my_ord(c):
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 .,?!"
    return alphabet.index(c)
If I've understood correctly, this is what you want:
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 .,?!"

def crypt(c, key=97):
    return ord(c)-key

def decrypt(c, key=97):
    return chr(c+key)

dst = [crypt(c) for c in alphabet]
src = [decrypt(c) for c in dst]
print dst
print ''.join(src)
You can create a dict to map from characters to indices and then do lookups into that. This will avoid repeatedly searching the string as other answers are suggesting (which is O(n)) and instead give O(1) lookup time with respect to the alphabet:
my_ord_dict = {c : i for i, c in enumerate(alphabet)}
my_ord_dict['0'] # 26
At that point you can easily wrap it in a function:
def my_ord(c):
    return my_ord_dict[c]
Or use the bound method directly
my_ord = my_ord_dict.__getitem__
But you don't want to rebind the name of a builtin function; that will confuse everyone else who sees your change and still expects the builtin behaviour. If you are really trying to hurt yourself you can replace my_ord with ord in the above.

Sort strings accompanied by integers in list

I am trying to make a leaderboard.
Here is a list I have:
list=['rami4\n', 'kev13\n', 'demian6\n']
I would like to be able to sort this list from highest number to smallest, or even smallest to highest, giving something like :
list=['kev13\n', 'demian6\n', 'rami4\n']
I tried to use stuff like re.findall('\d+', list[loop])[0] but I only managed to get the best player out of the list. Not wanting to repeat the code for as many players as there are, does anyone have an idea?
You indeed have to use the re module, but also the key parameter of the sort() method.
reg = re.compile('\w*?(\d+)\\n')
lst.sort(key=lambda s: int(reg.match(s).group(1)))
It works fine using findall() as you did too:
reg = re.compile('\d+')
lst.sort(key=lambda s: int(reg.findall(s)[0]))
Note that I compile() the regular expression so it is computed once and for all rather than for each element in the list.
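A quick check with the list from the question (sorted from smallest to highest embedded number):
>>> import re
>>> lst = ['rami4\n', 'kev13\n', 'demian6\n']
>>> reg = re.compile('\d+')
>>> lst.sort(key=lambda s: int(reg.findall(s)[0]))
>>> lst
['rami4\n', 'demian6\n', 'kev13\n']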
I have another solution based on object-oriented programming and on overriding the __lt__ special method of str.
import re
class SpecialString(str):
    def __lt__(self, other):
        pattern = re.compile(r"\d+")
        return int(pattern.search(str(self)).group(0)) < int(pattern.search(str(other)).group(0))

if __name__ == "__main__":
    listing = ['rami4\n', 'kev13\n', 'demian6\n']
    spe_list = [SpecialString(x) for x in listing]
    spe_list.sort()
    print(spe_list)
which prints to the standard output:
['rami4\n', 'demian6\n', 'kev13\n']
This method allows you not to rewrite the sort function and to use the built-in one (which is probably optimized). Moreover, since your strings may be thought of as a specialization of the str class, the inheritance mechanism is very suitable because you keep all of str's properties but rewrite its comparison mechanism.

Determine whether Python object is regex or string

Thought exercise: What is the "best" way to write a Python function that takes a regex pattern or a string to match exactly:
import re
strings = [...]
def do_search(matcher):
    """
    Returns strings matching matcher, which can be either a string
    (for exact match) or a compiled regular expression object
    (for more complex matches).
    """
    if not is_a_regex_pattern(matcher):
        matcher = re.compile('%s$' % re.escape(matcher))
    for s in strings:
        if matcher.match(s):
            yield s
So, ideas for the implementation of is_a_regex_pattern()?
You can access the _sre.SRE_Pattern type via re._pattern_type:
if not isinstance(matcher, re._pattern_type):
    matcher = re.compile('%s$' % re.escape(matcher))
Below is a demonstration:
>>> import re
>>> re._pattern_type
<class '_sre.SRE_Pattern'>
>>> isinstance(re.compile('abc'), re._pattern_type)
True
>>>
Or, make it quack:
try:
    does_match = matcher.match(s)
except AttributeError:
    does_match = re.match(matcher, s)
if does_match:
    yield s
In other words, treat matcher as if it already were a compiled regular expression. And if that breaks, then treat it like a string that needs to be compiled.
This is called Duck Typing. Not everyone agrees that exceptions should be used like this for routine contingencies. This is the ask-permission versus ask-forgiveness debate. Python is more amenable to forgiveness than most languages.
Not a string:
def is_a_regex_pattern(s):
    return not isinstance(s, basestring)
Is a _sre.SRE_Pattern (though that's not importable, so use a gross string match):
def is_a_regex_pattern(s):
    return s.__class__.__name__ == 'SRE_Pattern'
You can re-compile a SRE_Pattern and it seems to evaluate the same.
def is_a_regex_pattern(s):
    return s == re.compile(s)
You could test whether matcher has a match method:
import re
def do_search(matcher, strings):
    """
    Returns strings matching matcher, which can be either a string
    (for exact match) or a compiled regular expression object
    (for more complex matches).
    """
    if hasattr(matcher, 'match'):
        test = matcher.match
    else:
        test = lambda s: matcher == s
    for s in strings:
        if test(s):
            yield s
You should not use global variables; pass the strings as a second parameter instead.
On Python 3.7, re._pattern_type was renamed to re.Pattern.
https://stackoverflow.com/a/27366172/895245 therefore broke at that point, as re._pattern_type is no longer defined.
While re.Pattern looks nicer and will therefore hopefully be more stable, it is not mentioned at all in the docs: https://docs.python.org/3/library/re.html#regular-expression-objects so maybe it is not a good idea to rely on it.
https://stackoverflow.com/a/46779329/895245 does make some sense. But what if someday the str class adds a .match method that does something completely different? :-) Ah, the joys of dynamically typed languages.
So I think I'm going with:
import re
_takes_s_or_re_type = type(re.compile(''))
def takes_s_or_re(s_or_re):
    if isinstance(s_or_re, _takes_s_or_re_type):
        return 0
    else:
        return 1
assert takes_s_or_re(re.compile('a.c')) == 0
assert takes_s_or_re('a.c') == 1
as this can only break when a public API breaks.
Tested on Python 3.8.0.

Python: Lazy String Decoding

I'm writing a parser, and there is LOTS of text to decode but most of my users will only care about a few fields from all the data. So I only want to do the decoding when a user actually uses some of the data. Is this a good way to do it?
class LazyString(str):
    def __init__(self, v):
        self.value = v
    def __str__(self):
        r = ""
        s = self.value
        for i in xrange(0, len(s), 2):
            r += chr(int(s[i:i+2], 16))
        return r

def p_buffer(p):
    """buffer : HASH chars"""
    p[0] = LazyString(p[2])
Is that the only method I need to override?
I'm not sure how implementing a string subclass is of much benefit here. It seems to me that if you're processing a stream containing petabytes of data, whenever you've created an object that you didn't need to, you've already lost the game. Your first priority should be to ignore as much input as you possibly can.
You could certainly build a string-like class that did this:
class mystr(str):
    def __init__(self, value):
        self.value = value
        self._decoded = None
    @property
    def decoded(self):
        if self._decoded is None:
            self._decoded = self.value.decode("hex")
        return self._decoded
    def __repr__(self):
        return self.decoded
    def __len__(self):
        return len(self.decoded)
    def __getitem__(self, i):
        return self.decoded.__getitem__(i)
    def __getslice__(self, i, j):
        return self.decoded.__getslice__(i, j)
and so on. A weird thing about doing this is that if you subclass str, every method that you don't explicitly implement will be called on the value that's passed to the constructor:
>>> s = mystr('a0a1a2')
>>> s
 ¡¢
>>> len(s)
3
>>> s.capitalize()
'A0a1a2'
I don't see any kind of lazy evaluation in your code. The fact that you use xrange only means that the list of integers from 0 to len(s) will be generated on demand. The whole string r will be decoded during string conversion anyway.
The best way to implement lazy sequence in Python is using generators. You could try something like this:
def lazy(v):
    for i in xrange(0, len(v), 2):
        yield int(v[i:i+2], 16)
list(lazy("0a0a0f"))
Out: [10, 10, 15]
What you're doing is built in already:
s = "i am a string!".encode('hex')
# what you do
r = ""
for i in xrange(0, len(s), 2) :
r += chr(int(s[i:i+2], 16))
# but decoding is builtin
print r==s.decode('hex') # => True
As you can see your whole decoding is s.decode('hex').
But "lazy" decoding sounds like premature optimization to me. You'd need gigabytes of data to even notice it. Try profiling, the .decode is 50 times faster that your old code already.
Maybe you want something like this:
class DB(object):  # dunno what data it is ;)
    def __init__(self, data):
        self.data = data
        self.decoded = {}  # maybe cache if the field data is long
    def __getitem__(self, name):
        try:
            return self.decoded[name]
        except KeyError:
            # this copies the fields data
            self.decoded[name] = ret = self.data[self._get_field_slice(name)].decode('hex')
            return ret
    def _get_field_slice(self, name):
        # find out what part to decode, return the index in the data
        return slice( ... )

db = DB(encoded_data)
print db["some_field"]  # find out where the field is, get its data and decode it
The methods you need to override really depend on how you are planning to use your new string type.
However, your str-based type looks a little suspicious to me. Have you looked into the implementation of str to check that it has the value attribute that you are setting in your __init__()? Performing a dir(str) does not indicate that there is any such attribute on str. This being the case, the normal str methods will not be operating on your data at all; I doubt that is the effect you want, otherwise what would be the advantage of sub-classing?
Sub-classing base data types is a little strange anyway unless you have very specific requirements. For the lazy evaluation you want, you are probably better off creating a class that contains a string rather than sub-classing str, and writing your client code to work with that class. You will then be free to add the just-in-time evaluation you want in a number of ways; an example using the descriptor protocol can be found in this presentation: Python's Object Model (search for "class Jit(object)" to get to the relevant section)
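For illustration, a minimal sketch of that descriptor idea applied to the hex decoding above (the class names here are hypothetical; decoding happens on first attribute access and is then cached on the instance):
class LazyHexDecoded(object):
    # descriptor: decode obj.raw on first access, then cache the result
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if not hasattr(obj, '_decoded'):
            obj._decoded = obj.raw.decode('hex')  # Python 2 hex codec, as in the answers above
        return obj._decoded

class Record(object):
    decoded = LazyHexDecoded()
    def __init__(self, raw):
        self.raw = raw  # undecoded hex text; nothing is decoded yet

record = Record('6869')
print record.decoded  # => 'hi', decoded only now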
The question is incomplete, in that the answer will depend on details of the encoding you use.
Say, if you encode a list of strings as pascal strings (i.e. prefixed with string length encoded as a fixed-size integer), and say you want to read the 100th string from the list, you may seek() forward for each of the first 99 strings and not read their contents at all. This will give some performance gain if the strings are large.
If, OTOH, you encode a list of strings as concatenated 0-terminated strings, you would have to read all bytes until the 100th 0.
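A minimal sketch of that length-prefixed approach (the 4-byte little-endian prefix and the helper name are assumptions, not from the question):
import struct

def read_nth_string(f, n):
    # records: 4-byte little-endian length prefix followed by that many payload bytes
    for _ in range(n):
        (length,) = struct.unpack('<I', f.read(4))
        f.seek(length, 1)  # skip payloads we don't care about instead of reading them
    (length,) = struct.unpack('<I', f.read(4))
    return f.read(length)  # only the n-th payload is actually read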
Also, you're speaking about some "fields" but your example looks completely different.
