Character count of Unicode string [duplicate] - python

How would I get the character count of the below in python?
s = 'הוא אוסף אתכם מחר בשלוש וחצי.'
Char count: 29
Char length: 52
len(s) = 52
? = 29

Decode your byte string (according to whatever encoding it's in, UTF-8 maybe) -- the len of the resulting Unicode string is what you're after.
In fact, best practice is to decode inputs as soon as possible, deal only with actual text in your code (i.e., unicode in Python 2; it's just the way ordinary strings are in Python 3), and, if need be, encode again just as you're outputting.
Byte strings should be handled in your program only if it's specifically about byte strings (e.g., controlling or monitoring some hardware device) -- far more programs are about text, and thus, except where indispensable at some I/O boundaries, they should deal exclusively with text strings (spelled unicode in Python 2 :-).
But if you do want to keep s as a bytestring nevertheless,
len(s.decode('utf-8'))
(or whatever other encoding you're using to represent text as byte strings) should still do what you request.
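As a minimal sketch of the "decode at the boundary" advice, assuming Python 3 and assuming the incoming bytes are UTF-8 encoded (the names raw, text, byte_count and char_count are made up for this illustration):

```python
# Assumption: the data arrives as UTF-8 encoded bytes, e.g. from a socket
# or a file opened in binary mode. Here we simulate that by encoding.
raw = 'הוא אוסף אתכם מחר בשלוש וחצי.'.encode('utf-8')

byte_count = len(raw)        # length in bytes of the encoded form

# Decode once, as early as possible, and work with text from here on.
text = raw.decode('utf-8')
char_count = len(text)       # length in Unicode code points

print(byte_count, char_count)
```

If the encoding guess is wrong, decode() may raise UnicodeDecodeError, which is usually preferable to silently counting the wrong thing.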

Use a unicode string (in Python 2, prefix the literal with u):
s = 'הוא אוסף אתכם מחר בשלוש וחצי.'
len(s) #52
s = u'הוא אוסף אתכם מחר בשלוש וחצי.'
len(s) #29

Related

Byte string and unicode string inputs [duplicate]

I'm a Python noob. I was reading through some documentation and I came across something that baffled me.
What is the difference between byte strings and Unicode strings in Python, especially in terms of what is input and what is output?
Please explain using the simplest terms possible.
N.B.: I use Python 3.x
I searched around and found that byte strings can only contain byte characters, which exclude punctuation marks and other unicode characters. Unicode strings can contain, well, all unicode characters.
In Python 2.x, byte strings are written much like ordinary strings while unicode strings have a prefixed "u".
a = 'foobar'    # byte string
b = u'foo-bar'  # unicode string
It's written the opposite way for Python 3.x
a = b'foobar'   # byte string
b = 'foo-bar'   # unicode string
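A small sketch of how the two types relate in Python 3 may help here. It assumes UTF-8 as the encoding, and the variable names are only illustrative. Note that a bytes literal can hold any ASCII character, punctuation included; what it cannot hold directly is non-ASCII text.

```python
text = 'foo-bar'     # str: a sequence of Unicode code points
data = b'foo-bar'    # bytes: a sequence of integers in range(256)

# Converting between the two always goes through an encoding.
encoded = 'café'.encode('utf-8')    # str -> bytes
decoded = encoded.decode('utf-8')   # bytes -> str

# The same text has different lengths depending on which type holds it.
lengths = (len(decoded), len(encoded))

print(type(text).__name__, type(data).__name__, lengths)
```

Input read from a file opened in binary mode, or from a socket, arrives as bytes; input() and files opened in text mode give str, with the decoding done for you.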

Does python automatically decode ASCII and UTF-8 byte strings? [duplicate]

From what I understand, a Python 3 string is a sequence of bytes that has been decoded to be readable by humans, and a Python 3 bytes object is the raw bytes, which are not human readable. What I'm having trouble understanding, however, is why strings encoded with UTF-8 or ASCII are displayed as a string prefixed by a b, rather than as a sequence of bytes.
string = "I am a string"
# prints a sequence of bytes, like I would expect
string.encode("UTF-16")
b'\xff\xfeI\x00 \x00a\x00m\x00 \x00a\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'
# Prints a sequence of human readable characters, which I don't understand
string.encode("UTF-8")
b'I am a string'
Why does a string encoded by UTF-8 or ASCII not display a sequence of bytes?
UTF-8 is a backwards-compatible superset of ASCII, i.e. anything that’s valid ASCII is valid UTF-8 and everything present in ASCII is encoded by UTF-8 using the same byte as ASCII. So it’s not “UTF-8 or ASCII” so much as “just some of ASCII”. Try other Unicode:
>>> "café".encode("UTF-8")
b'caf\xc3\xa9'
or other ASCII that wouldn’t be very helpful to look at in character form:
>>> "hello\f\n\t\r\v\0\N{SOH}\N{DEL}".encode("UTF-8")
b'hello\x0c\n\t\r\x0b\x00\x01\x7f'
The repr of bytes displays printable characters instead of \xnn escapes when possible because that's helpful if you do happen to have bytes that contain ASCII.
And, of course, it’s still a well-formed bytes literal:
>>> b'I am a string'[0]
73
Additionally: From the docs
While bytes literals and representations are based on ASCII text,
bytes objects actually behave like immutable sequences of integers,
with each value in the sequence restricted such that 0 <= x < 256
(attempts to violate this restriction will trigger ValueError). This
is done deliberately to emphasise that while many binary formats
include ASCII based elements and can be usefully manipulated with some
text-oriented algorithms, this is not generally the case for arbitrary
binary data
-emphasis added.
At the end of the day, this is a design choice that python made for displaying bytes.
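To make the "sequence of integers" point concrete, here is a short sketch (Python 3; the variable names are chosen just for this example) showing that the readable repr and the underlying integers are two views of the same object:

```python
data = "I am".encode("utf-8")

shown = repr(data)     # the ASCII-based display form
values = list(data)    # the integers actually stored

# Values outside 0 <= x < 256 are rejected when building a bytes object.
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(shown, values, out_of_range_rejected)
```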

Convert utf-8 string to bytes and back in Python 2.7

I have a program which takes a string, makes a list with its byte representation, and then converts the list back to a string. This is really easy if the string contains only ASCII characters:
def messagetobitlist(message):
    bitlist = []
    for i in message:
        for x in (format(ord(i), '08b')):
            bitlist.append(int(x))
    return bitlist
And then I simply convert it back with unichr (chr would also work).
However, I want to expand the code and make it capable of accepting strings with accents and foreign characters. To do this I thought of encoding the string in UTF-8 and creating the bitlist, but when I try to convert it back it doesn't work, since the characters are represented with a different number of bytes and the code is not capable of telling beforehand whether it has to read just one byte or more. I tried to encode every character with 4 bytes (since that is the maximum in UTF-8), but this really does seem a waste of space and it doesn't work anyway.
Is there a solution to have a function that does this while still being somewhat space-conservative?
EDIT: Whoops, wrote Python 3 instead of Python 2.7
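One possible approach, sketched below, is to stop working character by character: encode the whole string to UTF-8 once, turn each byte into 8 bits, and on the way back regroup the bits into bytes and let the UTF-8 decoder work out the character boundaries. This is only a sketch; the function names are made up, and it is written for Python 3 (it should also run on Python 2.7 provided message is a unicode object, though that is not tested here).

```python
def message_to_bitlist(message):
    """Encode text as UTF-8 and return its bits as a flat list of 0/1 ints."""
    bitlist = []
    # bytearray yields integers on both Python 2 and Python 3.
    for byte in bytearray(message.encode("utf-8")):
        bitlist.extend(int(bit) for bit in format(byte, "08b"))
    return bitlist


def bitlist_to_message(bitlist):
    """Regroup bits into bytes, then decode the bytes as UTF-8 text."""
    data = bytearray()
    for start in range(0, len(bitlist), 8):
        chunk = bitlist[start:start + 8]
        data.append(int("".join(str(bit) for bit in chunk), 2))
    return bytes(data).decode("utf-8")


bits = message_to_bitlist(u"caf\u00e9")
print(len(bits), bitlist_to_message(bits))
```

Because UTF-8 is variable width, ASCII characters still cost only 8 bits each, so nothing is padded to 4 bytes.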

Python read a single unicode character from the user

I am searching for a method to get a single Unicode character from the standard input. Recently, I saw this topic, in which the solution works only for ASCII characters, not for Unicode ones.
Using the function getch() cited in the mentioned topic, when the user types a Unicode character, it is represented as more than one ASCII character. In fact, getch() only returns the first part (byte). The remaining bytes are only accessible by calling getch() again (however, I do not know how to tell how many bytes remain).
Is there a way to actually get a single unicode character from the input?
Thanks!
If you are using UTF-8, the first byte of a multibyte character tells you how many bytes there are. So something like this can work:
c = getch()
first_byte = ord(c)
bytes_remain = 0
while (first_byte >> (6 - bytes_remain)) & 0b11 == 0b11:
    bytes_remain += 1
    c += getch()
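The same idea can be spelled out with explicit bit patterns. The sketch below is for Python 3 and assumes a hypothetical read_byte callable standing in for getch() that returns exactly one byte as a bytes object; both helper names are invented for this example.

```python
import io


def utf8_sequence_length(first_byte):
    """Return the total length of a UTF-8 sequence given its lead byte (an int)."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: plain ASCII
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte: 0x%02x" % first_byte)


def read_char(read_byte):
    """Read one complete UTF-8 encoded character using read_byte()."""
    data = read_byte()
    for _ in range(utf8_sequence_length(data[0]) - 1):
        data += read_byte()
    return data.decode("utf-8")


# Usage with an in-memory stream standing in for the keyboard.
stream = io.BytesIO("é€a".encode("utf-8"))
chars = [read_char(lambda: stream.read(1)) for _ in range(3)]
print(chars)
```

Decoding at the end also validates the continuation bytes, so malformed input raises UnicodeDecodeError instead of producing garbage.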

Large strings and len()

This may be a newbie question, but here it goes. I have a large string (167572 bytes) with both ASCII and non ASCII characters. When I use len() on the string I get the wrong length. It seems that len() doesn't count 0x0A characters. The only way I can get the actual length of the string is with this code:
totalLen = 0
for x in test:
    totalLen += 1
for x in test:
    if x == '\x0a':
        totalLen += 1
print totalLen
What is wrong with len()? Or am I using it wrong?
You are confusing encoded byte strings with Unicode text. In UTF-8, for example, between 1 and 4 bytes are used to encode any given character; in UTF-16, each character is encoded using at least 2 bytes.
In Python 2, a str is a series of bytes; to get Unicode text you have to decode the string with an appropriate codec. If your text is encoded using UTF-8, for example, you can decode it with:
test = test.decode('utf8')
On the other hand, data written to a file is always encoded, so a unicode string of length 10 could take up 20 bytes or more in a file if written using the UTF-16 codec.
Most likely you are getting confused by such 'wider' characters, not by whether or not your \n (ASCII 10) characters are counted correctly.
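A quick way to see the difference between character count and encoded size is to encode the same text with several codecs. This is a Python 3 sketch with an arbitrary sample string; the variable names are illustrative only.

```python
text = "h\u00e9llo\n"    # 6 characters, one of them non-ASCII, plus a newline

char_count = len(text)                         # counts code points, \n included
utf8_size = len(text.encode("utf-8"))          # the accented letter takes 2 bytes
utf16le_size = len(text.encode("utf-16-le"))   # 2 bytes per character here
utf16_size = len(text.encode("utf-16"))        # same, plus a 2-byte byte order mark

print(char_count, utf8_size, utf16le_size, utf16_size)
```

The newline is counted by len() in every case; only the encoding changes how many bytes the text occupies.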
Please do yourself a favour and read up on Unicode and encodings:
Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Python Unicode HOWTO.
Could it be that you're expecting it to contain \r\n, i.e. ASCII 13 (carriage return) followed by ASCII 10 (line feed), or that you look at the string once it's been written out to a text file, which adds these?
It's hard to be specific since you don't give a lot of detail, i.e. where the string's data comes from.
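To illustrate the newline possibility, the sketch below (Python 3) writes text through a text-mode wrapper configured to translate \n to \r\n, which is what text-mode files do by default on Windows. The in-memory buffer and the variable names are stand-ins chosen for this example.

```python
import io

text = "line one\nline two\n"

# A binary buffer wrapped in a text layer that writes Windows line endings.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding="ascii", newline="\r\n")
wrapper.write(text)
wrapper.flush()

written = buffer.getvalue()

# Each \n became \r\n, so the stored size grows by one byte per line.
print(len(text), len(written))
```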
