Python: read a single unicode character from the user

I am searching for a method to get a single unicode character from the standard input. Recently, I saw this topic in which the solution applies only to ASCII characters, not unicode ones.
Using the getch() function cited in the mentioned topic, when the user types a unicode character, it is represented as more than one byte. In fact, getch() only returns the first byte; the remaining bytes are only accessible by calling getch() again (however, I do not know how to tell how many bytes remain).
Is there a way to actually get a single unicode character from the input?
Thanks!

If you are using UTF-8, the first byte of a multibyte character tells you how many bytes long it is. So something like this can work:
c = getch()
first_byte = ord(c)
bytes_remain = 0
# In UTF-8 a lead byte 110xxxxx starts a 2-byte sequence, 1110xxxx a 3-byte
# sequence, and so on; each extra leading 1 bit means one more byte follows.
while (first_byte >> (6 - bytes_remain)) & 0b11 == 0b11:
    bytes_remain += 1
    c += getch()
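Putting this together, here is a minimal sketch of a complete helper, assuming a getch() like the one from the linked topic that returns one raw byte per call (Python 2):

def read_unicode_char(getch):
    c = getch()                      # first byte of the sequence
    first_byte = ord(c)
    bytes_remain = 0
    while (first_byte >> (6 - bytes_remain)) & 0b11 == 0b11:
        bytes_remain += 1
        c += getch()                 # append the continuation byte
    return c.decode('utf-8')         # decode the complete UTF-8 sequence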


Character count of Unicode string

How would I get the character count of the string below in Python?
s = 'הוא אוסף אתכם מחר בשלוש וחצי.'
Char count: 29
Char length: 52
len(s) = 52
? = 29
Decode your byte string (according to whatever encoding it's in, utf-8 maybe); the len of the resulting Unicode string is what you're after.
In fact, best practice is to decode inputs as soon as possible, deal only with actual text in your code (i.e., unicode in Python 2; it's just the way ordinary strings are in Python 3), and, if need be, encode just as you're outputting again.
Byte strings should be handled in your program only if it's specifically about byte strings (e.g., controlling or monitoring some hardware device, etc.); far more programs are about text, and thus, except where indispensable at some I/O boundaries, they should be exclusively dealing with text strings (spelled unicode in Python 2:-).
But if you do want to keep s as a bytestring nevertheless,
len(s.decode('utf-8'))
(or whatever other encoding you're using to represent text as byte strings) should still do what you request.
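For the string from the question, assuming it is a UTF-8 byte string in Python 2, this gives:

>>> s = 'הוא אוסף אתכם מחר בשלוש וחצי.'
>>> len(s)
52
>>> len(s.decode('utf-8'))
29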
Use a unicode string
s = 'הוא אוסף אתכם מחר בשלוש וחצי.'
len(s) #52
s = u'הוא אוסף אתכם מחר בשלוש וחצי.'
len(s) #29

Large strings and len()

This may be a newbie question, but here goes. I have a large string (167572 bytes) with both ASCII and non-ASCII characters. When I use len() on the string I get the wrong length. It seems that len() doesn't count 0x0A characters. The only way I can get the actual length of the string is with this code:
totalLen = 0
for x in test:
    totalLen += 1
for x in test:
    if x == '\x0a':
        totalLen += 1
print totalLen
What is wrong with len()? Or am I using it wrong?
You are confusing encoded byte strings with unicode text. In UTF-8, for example, up to 4 bytes are used to encode any given character; in UTF-16 each character is encoded using at least 2 bytes.
A Python 2 string is a series of bytes; to get unicode you'd have to decode the string with an appropriate codec. If your text is encoded using UTF-8, for example, you can decode it with:
test = test.decode('utf8')
On the other hand, data written to a file is always encoded, so a unicode string of length 10 could take up 20 bytes in a file, if written using the UTF-16 codec.
Most likely you are getting confused by such 'wider' characters, not by whether or not your \n (ASCII 10) characters are counted correctly.
Please do yourself a favour and read up on Unicode and encodings:
Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Python Unicode HOWTO.
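A quick illustration of the mismatch, using a hypothetical UTF-8 byte string in Python 2:

>>> test = 'Καλημέρα\n'           # 8 Greek letters (2 bytes each) plus a newline
>>> len(test)                     # counts bytes
17
>>> len(test.decode('utf8'))      # counts characters after decoding
9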
Could it be that you're expecting it to contain \r\n, i.e. ASCII 13 (carriage return) followed by ASCII 10 (line feed), or that you look at the string once it's been written out to a text file, which adds these?
It's hard to be specific since you don't give a lot of detail, i.e. where the string's data comes from.

Python: Convert Unicode-Hex-String to Unicode

I have a hex string made from a unicode string with this function:
def toHex(s):
    res = ""
    for c in s:
        res += "%02X" % ord(c)  # at least 2 hex digits, can be more
    return res
hex_str = toHex(u"...")
This returns a string like this one:
"80547CFB4EBA5DF15B585728"
That's a sequence of 6 Chinese symbols.
But
u"Knödel"
converts to
"4B6EF664656C"
What I need now is a function to convert this back to the original unicode. The Chinese symbols seem to have a 2-byte representation, while the second example has 1-byte representations for all characters. So I can't just use unichr() for each 1- or 2-byte block.
I've already tried
binascii.unhexlify(hex_str)
but this seems to convert byte-by-byte and returns a string, not unicode. I've also tried
binascii.unhexlify(hex_str).decode(...)
with different formats. Never got the original unicode string.
Thank you a lot in advance!
This seems to work just fine:
binascii.unhexlify(binascii.hexlify(u"Knödel".encode('utf-8'))).decode('utf-8')
Comes back to the original object. You can do the same for the Chinese text if it's encoded properly; however, ord(x) already destroys the text you started from. You'll need to encode it first and only then treat it like a string of bytes.
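For instance, a corrected toHex along these lines (a sketch, assuming UTF-8 as the interchange encoding):

def toHex(s):
    # encode the unicode string to UTF-8 bytes first, then hex each byte
    return ''.join('%02X' % ord(c) for c in s.encode('utf-8'))

toHex(u"Knödel") then yields '4B6EC3B664656C', which binascii.unhexlify plus .decode('utf-8') can reverse losslessly.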
Can't be done. Using %02X loses too much information. You should be using something like UTF-8 first and converting that, instead of inventing a broken encoding.
>>> u"Knödel".encode('utf-8').encode('hex')
'4b6ec3b664656c'
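Going the other way, the same bytes decode cleanly back to the original (Python 2 shown, since the question uses u'' literals):

>>> '4b6ec3b664656c'.decode('hex').decode('utf-8')
u'Kn\xf6del'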
When I was working with Unicode in a VB app a while ago, the first 1 or 2 digits would be removed if they were a "0", meaning "&H00A2" would automatically be converted to "&HA2". I just created a small function to check the length of the string and, if it was less than 4 chars, add the missing 0's. I'm not sure if this is what's happening to you, but I thought I would give this bit of information as something to be aware of.

Returning the first N characters of a unicode string

I have a string in unicode and I need to return the first N characters.
I am doing this:
result = unistring[:5]
but of course the length of unicode strings != length of characters.
Any ideas? The only solution is using re?
Edit: More info
unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]
returns-> ?
I think that unicode characters are two bytes each; that's why this happens. If I do:
result = unistring[:2]
I get
M
which is correct,
So, should I always slice*2 or should I convert to something?
Unfortunately, for historical reasons, prior to Python 3.0 there are two string types: byte strings (str) and Unicode strings (unicode).
Prior to the unification in Python 3.0 there are two ways to declare a string literal: unistring = "Μεταλλικα", which is a byte string, and unistring = u"Μεταλλικα", which is a unicode string.
The reason you see ? when you do result = unistring[:1] is because some of the characters in your Unicode text cannot be correctly represented in the non-unicode string. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece for example.
So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO
When you say:
unistring = "Μεταλλικα" #Metallica written in Greek letters
You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:
unistring = "Μεταλλικα".decode('utf-8')
or by using the unicode literal in a source file with the right encoding declaration
# coding: UTF-8
unistring = u"Μεταλλικα"
The unicode string will do what you want when you do unistring[:5].
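Concretely, assuming the terminal/source encoding is UTF-8:

>>> unistring = "Μεταλλικα".decode('utf-8')
>>> print unistring[:5]
Μεταλ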
There is no correct straight-forward approach with any type of "Unicode string".
Even Python's "Unicode" UTF-16 string has variable-length characters, so you can't always just cut with ustring[:5]: some Unicode code points may use more than one "character", i.e. surrogate pairs.
So if you want to cut 5 code points (note these are not characters), you may have to analyze the text; see http://en.wikipedia.org/wiki/UTF-8 and http://en.wikipedia.org/wiki/UTF-16 for the definitions. You may need to use some bit masks to figure out boundaries.
Also, you still do not get characters that way, because a character is not a code point. For example, the word "שָלוֹם" -- "Shalom", peace in Hebrew -- consists of 4 characters but 6 code points: letter "shin", vowel "a", letter "lamed", letter "vav", vowel "o", and final letter "mem".
The same goes for most western languages, where a letter with diacritics may be represented as two code points. Search for example for "unicode normalization".
So... if you really need the first 5 characters, you have to use tools like the ICU library. For example, there is an ICU library for Python that provides a character-boundary iterator.
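For the normalization point specifically, the standard library's unicodedata module shows how composed and decomposed forms of the same character differ in length:

>>> import unicodedata
>>> s = u'e\u0301'                          # 'e' plus a combining acute accent
>>> len(s)
2
>>> len(unicodedata.normalize('NFC', s))    # composed form: one code point
1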

How to check if a string in Python is in ASCII?

I want to check whether a string is in ASCII or not.
I am aware of ord(), however when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understood it is caused by the way I built Python (as explained in ord()'s documentation).
Is there another way to check?
I think you are not asking the right question.
A string in Python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.
Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" This you can answer by trying:
try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not an ascii-encoded unicode string"
else:
    print "it may have been an ascii-encoded unicode string"
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.
def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())
To check, pass the test string:
>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True
New in Python 3.7 (bpo32677)
No more tiresome/inefficient ascii checks on strings; the new built-in str/bytes/bytearray method .isascii() will check whether the string is ascii.
print("is this ascii?".isascii())
# True
Vincent Marchetti has the right idea, but str has no decode method in Python 3. In Python 3 you can make the same test with str.encode:
try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii
Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
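Wrapped up as a reusable predicate, the same test might look like this (a sketch):

def is_ascii(mystring):
    try:
        mystring.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True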
Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.
Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0 to 0x10FFFF (1,114,111). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.
Instead of ord(u'é'), try this:
>>> [ord(x) for x in u'é']
That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 770].
Instead of chr() to reverse this, there is unichr():
>>> unichr(233)
u'\xe9'
This character may actually be represented by either a single or multiple unicode "code points", which themselves represent either graphemes or characters. It's either "e with an acute accent" (i.e., code point 233), or "e" (code point 101) followed by "an acute accent on the previous character" (code point 770). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.
Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.
Ran into something like this recently - for future reference
import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'
which you could use with:
string_ascii = string.decode(encoding['encoding']).encode('ascii')
How about doing this?
import string

def isAscii(s):
    # Note: string.ascii_letters contains only a-z and A-Z, so this also
    # rejects digits, punctuation and whitespace, not just non-ASCII.
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True
I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).
My first step should have been to check the type of the string; I didn't realize that I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.
If you're getting a rude and persistent
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)
particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode; for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for a better understanding of how terrible this can be.)
Eventually I determined that what I wanted to do was this:
escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))
Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):
# -*- coding: utf-8 -*-
That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').
>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'
To improve on Alexander's solution, from Python 2.6 (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html
from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)
You could use a regular expression library that accepts the POSIX standard [[:ASCII:]] definition.
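A sketch of how that might look with the third-party regex module (an assumption: the answer does not name a library, and the built-in re module does not support POSIX classes):

import regex  # third-party: pip install regex

def is_ascii(s):
    # [[:ascii:]] is a POSIX character class; anchoring covers the whole string
    return bool(regex.match(r'[[:ascii:]]*$', s))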
A string (str type) in Python 2 is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ascii string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.
However, if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.
>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True
I imagine a regular expression is well-optimized for this.
import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))
To include an empty string as ASCII, change the + to *.
To prevent your code from crashing, you may want to use a try-except to catch TypeErrors.
>>> ord("¶")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
For example
def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False
I use the following to determine if the string is ascii or unicode:
>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>>
Then just use a conditional block to define the function:
def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False
