I am given either a single character or a string, and am using Python.
How do I find out if a specific character has a lowercase equivalent according to the standards (standard and special case mappings) proposed by Unicode?
And how do I find out if a string has one or more characters that have a lowercase equivalent according to the standards (standard and special case mappings) proposed by Unicode?
def haslower(unicodechar):
    return unicodechar != unicodechar.lower()

def anylower(unicodestring):
    return any(haslower(c) for c in unicodestring)
This will only work correctly insofar as the Python version you're using has correctly implemented the .lower() method per the Unicode standards, of course. Also, I'm assuming that you don't consider, e.g., u'a' to "have a lowercase equivalent" (it has an uppercase one, of course). If you mean something different, consider
def changescase(uc):
    return uc != uc.lower() or uc != uc.upper()
(I've renamed the argument to uc to avoid excessive line length;-) -- if that's what you want I recommend not naming the function in terms of "lowercase equivalent" as that would be sure to confuse readers/maintainers of your code!-)
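For illustration, here's how these predicates behave (u'...' literals as in Python 2; in Python 3 every plain string is already Unicode):
>>> haslower(u'A'), haslower(u'a')
(True, False)
>>> changescase(u'a'), changescase(u'1')
(True, False)
>>> anylower(u'ABC'), anylower(u'123')
(True, False)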
#Albert, you appear to be overly concerned with the minutiae of case conversion when you haven't yet sorted out (nor explained to answerers) what you really want to do.
=== Your previous attempt at explanation (in comment on my answer to this question) ===
#John: Well, I'm actually making an API for my web service. My web service accepts a key that maps to a specific record in my database. The key is case-sensitive, and the key can be composed of any Unicode character. So in order to normalize all input, I will convert all key queries into lowercase (if they have uppercase equivalents). A consequence of that is that when I create the record keys (which my users can customize), I cannot accept any uppercase character that can be converted to a lowercase equivalent by the toLower() function. So I'm trying to make a filter for that. Any suggestions?
=== and my replying comment ===
#Albert: If your keys are case sensitive, why are you normalising them??? "record keys which users can customize" means what??? "any unicode char" vs "cannot accept any uppercase char" ??? To answer your question literally: Looks like you can't accept a character c when c.lower() != c which means that you can't accept any key if key.lower() != key. I think that you should start a NEW QUESTION, explaining exactly what you are trying to do, with examples.
... and you've certainly asked a new question (in fact 2 of them) but you haven't explained anything. This "new" question is so new that #Alex Martelli's answer is essentially the same as my comment highlighted above.
I think that you should start a NEW QUESTION, with new content, explaining exactly what you are trying to do, with examples.
I started learning python for the first time in an accelerated course on data science a few weeks ago and we were introduced early on to f-strings.
The simple code:
name = 'Tim'
print(f'There are some who call me {name}...')
outputs the string "There are some who call me Tim..."
Through my browsing of various packages out of curiosity, I came upon pages like this one detailing a function you can call in matplotlib to render $\LaTeX$-like expressions within the generated images. In the example code they use something similar to f-strings but with an r instead of an f.
import matplotlib.pyplot as plt
plt.title(r'$\alpha > \beta$')
plt.show()
The resulting (otherwise empty) graph has a title using text which has been formatted similarly to how one would expect using MathJax or $\LaTeX$ with a greek character alpha and a greek character beta.
My questions are the following:
What precisely is an r-string and how does it compare to an f-string? Are r-strings specifically used for matplotlib's mathtext and usetex?
Apart from f-strings and r-strings, are there any other notable similar string variants or alternates that I should familiarize myself with or be made aware of?
An r-string is a raw string.
In a raw string, escape sequences are not processed. For example, "\n" is a string containing a newline character, while r"\n" is a string containing a backslash and the letter n.
If you wanted to compare it to an f-string, you could think of f-strings as being "batteries-included." They have tons of flexibility in the ability to escape characters and execute nearly arbitrary expressions. The r-string on the other hand is stripped down and minimalist, containing precisely the characters between its quotation marks.
As far as actually using the things, typically you would use an r-string if you're passing the string into something else that uses a bunch of weird characters or does its own escaping so that you don't have to think too hard about how many backslashes you really need to get everything to work correctly. In your example, they at least needed r-strings to get the \a bit working correctly without double escapes. Note that '$\\alpha > \\beta$' is identical to r'$\alpha > \beta$'.
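A quick REPL illustration of the difference:
>>> len('\n')    # one character: a newline
1
>>> len(r'\n')   # two characters: a backslash and the letter n
2
>>> '$\\alpha > \\beta$' == r'$\alpha > \beta$'
True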
Since you're using f-strings, I'll assume you have at least Python 3.6. Not all of these options are supported for older versions but any of the following prefixes are valid in Python 3.6+ in any combination of caps and lowers: r, u, f, rf, fr, b, rb, br
The b-strings are binary literals. In Python 2 they do nothing and only exist so that the source code is compatible with Python 3. In Python 3, they allow you to create a bytes object. Strings can be thought of as a view of the underlying bytes, often restricted as to which combinations are allowed. The distinction in types helps to prevent errors from blindly applying text techniques to raw data. In Python 3, note that 'A'==b'A' is False. These are not the same thing.
The u-strings are unicode literals. Strings are unicode by default in Python 3, but the u prefix is allowed for backward compatibility with Python 2. In Python 2, strings are ASCII by default, and the u prefix allows you to include non-ASCII characters in your strings. For example, note the accented character in the French phrase u"Fichier non trouvé".
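To make both prefixes concrete, a few REPL lines (Python 3.6+; the values here are just examples):
>>> 'A' == b'A'   # text vs. bytes: never equal in Python 3
False
>>> u'Fichier non trouvé' == 'Fichier non trouvé'   # u is redundant in Python 3
True
>>> name = 'Tim'
>>> rf'C:\users\{name}'   # prefixes combine: raw + formatted
'C:\\users\\Tim'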
In the kind of code I write, I rarely need anything beyond r, u, f, and b. Even b is a bit out there. Other people deal with those prefixes every day (presumably). They aren't necessarily anything you need to familiarize yourself with, but knowing they exist and being able to find their documentation is probably a good skill to have.
Just so that it's in an answer instead of buried in a comment, Peter Gibson linked the language specification, and that's the same place I pulled the prefix list from. With your math background, a formal language specification might be especially interesting — depending a little on how much you like algebra and mathematical logic.
Even if it's just for a semantically trivial language like Forth, I think many programmers would enjoy writing a short interpreter and gain valuable insight into how their language of choice works.
I am working on a python project in which I need to filter profane words, and I already have a filter in place. The only problem is that if a user switches a character with a visually similar character (e.g. hello and h311o), the filter does not pick it up. Is there some way that I could detect these words without hard-coding every combination?
What about translating l331sp33ch to leetspeech and applying a simple Levenshtein distance? (You need to pip install editdistance first.)
import editdistance

try:
    from string import maketrans  # Python 2
except ImportError:
    maketrans = str.maketrans     # Python 3

t = maketrans("01345", "oleas")
editdistance.eval("h3110".translate(t), 'hello')
results in 0
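A minimal sketch of wiring this into a filter (Python 3 here; the banned list and the distance threshold of 1 are assumptions, not part of the original code):
import editdistance

BANNED = ['hello']   # stand-in for your real banned-word list
LEET = str.maketrans("01345", "oleas")

def looks_banned(word, max_distance=1):
    # normalize the leet spelling, then fuzzy-match against the list
    normalized = word.lower().translate(LEET)
    return any(editdistance.eval(normalized, bad) <= max_distance
               for bad in BANNED)

print(looks_banned("h3110"))   # True: "h3110" normalizes to "hello"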
Maybe build a mapping between the visually similar characters and the letters they can represent, i.e.
subs = {'3': 'e', '1': 'l', '0': 'o'} # etc....
and then you can use this to test against your database of forbidden words.
e.g. for the input "he11", check whether each character has an entry in subs:
subs.get('h') # no entry; keep 'h'
subs.get('e') # no entry; keep 'e'
subs.get('1') # maps to 'l'
subs.get('1') # maps to 'l'
Put this together to form a word and then search your forbidden list. I don't know if this is the fastest way of doing it, but it is "a" way.
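A minimal runnable sketch of that idea (the substitution table and forbidden list here are placeholders):
subs = {'3': 'e', '1': 'l', '0': 'o', '4': 'a', '5': 's'}
forbidden = {'hell', 'hello'}   # placeholder forbidden-word list

def normalize(word):
    # replace each character by its look-alike letter, if it has one
    return ''.join(subs.get(c, c) for c in word)

print(normalize('he11') in forbidden)   # 'he11' -> 'hell' -> True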
I'm interested to see what others come up with.
*disclaimer: I've done a year or so of Perl and am starting out learning Python right now. When I get the time. Which is very hard to come by.
Linear Replacement
You will want something adaptable to innovative orthographers. For a start, pattern-match the alphabetic characters to your lexicon of banned words, using other characters as wild cards. For instance, your example would get translated to "h...o", which you would match to your proposed taboo word, "hello".
Next, you would compare the non-alpha characters to a dictionary of substitutions, allowing common wild-card characters to stand for anything. For instance, asterisk, hyphen, and period could stand for anything; '4' and '#' could stand for 'A', and so on. However, you'll drive this check from the taboo word, not by generating all possibilities from the input: the translation goes the other way.
You will have a little ambiguity, as some characters stand for multiple letters. "#" can be used in place of 'O' if you're getting crafty. Also note that not all the letters will be in your usual set: you'll want to deal with monetary symbols (the Euro, Yen, and Pound signs are all derived from letters), as well as foreign letters that happen to resemble Latin letters.
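One way to sketch the wild-card matching with the standard re module (the taboo lexicon here is a placeholder, and this only handles same-length substitutions):
import re

TABOO = ['hello']   # placeholder lexicon

def matches_taboo(word):
    # keep the letters, treat every non-letter as a one-character wild card
    pattern = ''.join(c if c.isalpha() else '.' for c in word.lower())
    return any(re.fullmatch(pattern, t) for t in TABOO)

print(matches_taboo('h3ll0'))   # 'h.ll.' matches 'hello' -> True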
Multi-character replacements
That handles only the words that have the same length as the taboo word. Can you also handle abbreviations? There are a lot of combinations of the form "h-bomb", where the banned word appears as the first letter only: the effect is profane, but the match is more difficult, especially when the 'b's are replaced with a scharfes S (German), the 'm' with a Hebrew or Cyrillic character, and the 'o' with anything round from the entire font.
Context
There is also the problem that some words are perfectly legitimate in one context, but profane in a slang context. Are you also planning to match phrases, perhaps parsing a sentence for trigger words?
Training a solution
If you need a comprehensive solution, consider training a neural network with phrases and words you label as "okay" and "taboo", and let it run for a day. This can take a lot of the adaptation work off your shoulders, and enhancing the model isn't a difficult problem: add your new differentiating text and continue the training from the point where you left off.
Thank you to all who posted an answer to this question. More answers are welcome, as they may help others. I ended up going off of David Zemens' comment on the question.
I'd use a dictionary or list of common variants ("sh1t", etc.) which you could persist as a plain text file or JSON and read into memory. This would allow you to add new entries as needed, independently of the code itself. If you're only concerned about profanities, then the list should be reasonably small to maintain, and new variations unlikely. I've used a hard-coded dict to represent a statistical t-table (with 1500 key/value pairs) in the past; it seems like your problem would not require nearly that many keys.
While this still means that all the words will be hard-coded, it will allow me to update the list more easily.
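A minimal sketch of that approach (the file name and JSON format are assumptions):
import json

# Load a maintained list of variants from a plain file, independent of the code.
with open('profanity_variants.json') as f:
    banned = set(json.load(f))   # e.g. ["sh1t", ...]

def is_banned(word):
    return word.lower() in banned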
Ok so basically this is what I know, and it does work, using Python3:
color="Red1 and Blue2!"
color[2]=="d"
True
What I need is to take a character at any position via the brackets color[ ] (which yields a single character, lower- or uppercase) and test whether it matches only lower- or uppercase letters, excluding all numbers and special characters (.*&^%$##!).
In other words, something to the effect of:
color="Red1 and Blue2!"
if color[5]==[a-zA-z]:
doSomething
else:
doSomethingElse
Of course what I just listed above does not work. Perhaps my syntax is wrong, perhaps it just can't be done. If I only use a single letter on the "right" side of the equals, then all is well; but like I said, I need whatever single letter is pulled into the left side to match something on the right.
First off, I want to make sure that what I'm trying to accomplish is possible.
Second, if it is indeed possible, can this be accomplished without importing anything other than sys?
If the only way to accomplish this is by importing something else, then I will take a look at that suggestion; however, I prefer not to import anything if at all possible.
I've searched my books and a whole host of other questions on this site, and I can't seem to find anything that matches. Thanks.
For the case of looking for letters, a simple .isalpha() check:
if color[5].isalpha():
will work.
For the general case where a specific check function doesn't exist, you can use in checks:
if color[5] in '13579': # checks membership in an arbitrary character set
If the "random letter set" is large enough, you may want to preconvert to a frozenset for checking (frozenset membership tests are roughly O(1), vs. O(n) for str, but str tests are optimized enough that you'd need quite a long str before the frozenset makes sense; possibly larger than the one in the example):
CHARSET = frozenset('13579adgjlqetuozcbm')
if color[5] in CHARSET:
Alternatively, you can use a regular expression to express the character class you were trying to match:
import re
# Do this once up front to avoid recompiling, then use repeatedly
islet = re.compile('^[a-zA-Z]$').match
...
if islet(color[5]):
This is where isalpha() is helpful.
color="Red1 and Blue2!"
if color[5].isalpha():
doSomething
else:
doSomethingElse
There's also isnumeric(), if you need numbers.
Not really sure why you'd require not importing anything from the standard library, though.
import string

color = "Red1 and Blue2!"
if color[5] in string.ascii_letters:
    print("do something")
else:
    print("do something else")
A PHP4+MySQL4 based project posts to a Django 1.1 project, and it mixes up some letters.
What is the best (most efficient) way to replace characters in this fashion?
The problem for me is that I cannot get the values for those letters. Is there an online tool to do that?
I have a text field with various letters, and I want to replace them in this fashion:
àèæëáðøûþ => ąčęėįšųūž
ÀÈÆËÁÐØÛÞ => ĄČĘĖĮŠŲŪŽ
I had a similar case where I had to clean up the text, so I used this:
def clean(string):
    return ''.join(c for c in string if ord(c) > 31 or ord(c) in (9, 10, 13))
Update: I succeeded in extracting the Unicode values by looking at Django debug messages (replace_from: replace_to):
{u'\xe0': u'\u0105', u'\xe8': u'\u010d', u'\xe6': u'\u0119', u'\xeb': u'\u0117', u'\xe1': u'\u012f',
 u'\xf0': u'\u0161', u'\xf8': u'\u0179', u'\xfb': u'\u016b', u'\xfe': u'\u017e',
 u'\xc0': u'\u0104', u'\xc8': u'\u010c', u'\xc6': u'\u0118', u'\xcb': u'\u0116', u'\xc1': u'\u012e',
 u'\xd0': u'\u0160', u'\xd8': u'\u0172', u'\xdb': u'\u016a', u'\xde': u'\u017d'}
So the main problem remains: the replacing itself.
Try the str.replace() method - should work with unicode strings.
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
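For example (Python 2 Unicode literals; with eighteen characters to swap, you would loop over the pairs, since each replace() call handles a single substring):
data = u'àbà'
for old, new in zip(u'àèæëáðøûþ', u'ąčęėįšųūž'):
    data = data.replace(old, new)
# data is now u'ąbą'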
Make sure your old and new strings are of type Unicode (that applies to your input data as well). Find out what your input (non-Unicode) string is encoded in; for example, it may be in latin1 encoding. Use the built-in str.decode() method to create a Unicode version of your data, and feed that to str.replace().
>>> unioldchars = oldchars.decode("latin1")
>>> newdata = data.replace(unioldchars, newchars)
I'd do it myself. The built-in replace function is of little use if you want multiple, efficient replacements.
Give this a look: http://code.activestate.com/recipes/81330-single-pass-multiple-replace/
EDIT: WAIT, you wanted to do the replacement client-side, like in the text-box?
string.translate(s, table[, deletechars])
Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.
See also http://docs.python.org/library/string.html#string.maketrans
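For byte strings, the quoted 256-character-table form works like this (Python 2); note that unicode objects instead take a dict mapping ordinals, as sketched in the answer above:
import string

table = string.maketrans('abc', 'xyz')
print string.translate('aabbcc', table)   # -> 'xxyyzz'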
How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
Eg.
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)
You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default '...' form can be used for everything else: places where you don't care, or where Python 3's change of the default string type is what you want.
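A small Python 2 demonstration of how the three literal forms interact with the future import:
from __future__ import unicode_literals

print type('')    # <type 'unicode'> -- plain quotes now give Unicode
print type(u'')   # <type 'unicode'>
print type(b'')   # <type 'str'>     -- b'...' still gives a byte string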
It seems to me like you really need to parse the code with an honest-to-goodness Python parser. Then you will need to dig through the tree your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name

def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in lst[1:]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
Good question!
Edit
Just a follow-up comment. Conveniently for your purposes, parser.suite does not actually evaluate your Python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files, using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, composed of various language keywords and pattern tokens that match varying content language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python, including "UnicodeStrings", non-Unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't Unicode strings, which answers your precise question.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.