Why does this string gets printed out like this? - python

i am playing around with string formatting. And actually i trying to understand the following piece of code:
mystring = "\x80" * 50;
print mystring
output:
>>>
€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€
>>>
the output is one string of Euro sings. But why is this like that? This is no ASCII afaik, and the other question i am asking myself is why does it not print out the hex \x80 ? Thanks in advance

As for the first question, \x80 Is interpreted as \u0080. A nice explanation can be found at Bytes in a unicode Python string.
Edit:
#Joran Besley is right, so let me rephrase it:
u'\x80' is equal to u'\u0080'.
In fact:
unicode(u'\u0080')
>>> u'\x80'
and that's because Python < 3 prefers \x as escaping representation of Unicode characters when possible, that is as long as the code point is less than 256. After that it uses the normal \u:
unicode(u'\u2019')
>>> u'\u2019' # curved quotes in windows-1252
Where the character is then mapped depends on your terminal encoding. As Joran said, you are probably using Windows-1252 or something close to it, where the euro symbol is the hex byte 0x80. In iso-8898-15 for example the hex value is 0xa4:
"\xa4".decode("iso-8859-15") == "\x80".decode('windows-1252')
>>> True
If you are curious about your terminal encoding you can get it from sys
import sys
sys.stdin.encoding
>>> 'UTF-8' # my terminal
sys.stdout.encoding
>>> 'UTF-8' # same as above
I hope it makes up for my mistake.

A little tinkering in IDLE produced this output.
>>> a = "\x80"
>>> a
'\x80'
>>> print a * 50
€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€
>>> print a
€
>>>
The first thing that stands out is the '\' character. This character is used for escaping characters in strings. You can learn about escaping characters in the link below.
http://en.wikipedia.org/wiki/Escape_character
Changing the string slightly tells us that escaping is occurring.
>>> print '\x8'
ValueError: invalid \x escape
What I think is happening is the escape is causing the string to be looked up in the ASCII (or similar) table.

It depends on your terminal encoding ... in the windows terminal that encodes to a bunch of C-cedilla's
if you want to see the "\x80" you can print repr(mystring)
furthermore 0x80 = 128 which is the (not ascii,since ascii only technically goes to 0x7f) value of the euro
specifically that is how "Windows-1252" encodes the euro sign (actually apparently thats how almost all the "Windows-125x" encode the euro sign)
this answer has lots more info
Hex representation of Euro Symbol €
furthermore you can convert it to unicode
unicode_ch = "\x80".decode("Windows-1252") #it is now decoded into unicode
print repr(unicode_ch) # \u20AC the unicode equivalent of Euro
print unicode_ch #as long as your terminal can handle it

Related

String has unicode code points embedded, how to convert? Python 3 [duplicate]

I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e, both are valid characters in their own right, u'\xc3\xb3' == ó, but they're not what's supposed to be there)
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
a) Try to put it through the method below.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
You should use:
>>> title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))

Understanding unicode and encoding in Python

When I enter following in the python 2.7 console
>>>'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>>u'áíóús'
u'\xe1\xed\xf3\xfas'
I get the above output. What is the difference between the two? I understand the basics of unicode, and different kind of encoding like UTF8, UTF16 etc. But, I don't understand what is being printed on the console or how to make sense of it.
u'áíóús' is a string of text. What you see echoed in the REPL is the canonical representation of that object:
>>> print u'áíóús'
áíóús
>>> print repr(u'áíóús')
u'\xe1\xed\xf3\xfas'
The things like \xe1 are related to hexadecimal ordinals of each character:
>>> [hex(ord(c)) for c in u'áíóús']
['0xe1', '0xed', '0xf3', '0xfa', '0x73']
Only the last character was in the ascii range, i.e. ordinals in range(128), so only that last character "s" is plainly visible in Python 2.x:
>>> chr(0x73)
's'
'áíóús' is a string of bytes. What you see printed is an encoding of the same text characters, with your terminal emulator assuming the encoding:
>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'.encode('utf-8')
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'

How to print Unicode like “u{variable}” in Python 2.7?

For example, I can print Unicode symbol like:
print u'\u00E0'
Or
a = u'\u00E0'
print a
But it looks like I can't do something like this:
a = '\u00E0'
print someFunctionToDisplayTheCharacterRepresentedByThisCodePoint(a)
The main use case will be in loops. I have a list of unicode code points and I wish to display them on console. Something like:
with open("someFileWithAListOfUnicodeCodePoints") as uniCodeFile:
for codePoint in uniCodeFile:
print codePoint #I want the console to display the unicode character here
The file has a list of unicode code points. For example:
2109
OOBO
00E4
1F1E6
The loop should output:
℉
°
ä
🇦
Any help will be appreciated!
This is probably not a great way, but it's a start:
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä
First, we get the integer represented by the hexadecimal string x. We pack that into a byte string, which we can then decode using the utf_32_be encoding.
Since you are doing this a lot, you can precompile the struct:
int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
for code_point in fh:
print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')
If you think it's clearer, you can also use the decode method instead of the unicode type directly:
>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
Python 3 added a to_bytes method to the int class that lets you bypass the struct module:
>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
"ä"
You want print unichr(int('00E0',16)). Convert the hex string to an integer and print its Unicode codepoint.
Caveat: On Windows codepoints > U+FFFF won't work.
Solution: Use Python 3.3+ print(chr(int(line,16)))
In all cases you'll still need to use a font that supports the glyphs for the codepoints.
These are unicode code points but lack the \u python unicode-escape. So, just put it in:
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uniCodeFile:
for codePoint in uniCodeFile:
print "\\u" + codePoint.strip()).decode("unicode-escape")
Whether this works on a given system depends on the console's encoding. If its a Windows code page and the characters are not in its range, you'll still get funky errors.
In python 3 that would be b"\\u".

How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?

For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003foo\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-saavy).
EDIT: See also Python Standard Encodings.
On Python 2.5 the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure if the newer version of Python changed the unicode name, but here only worked with the underscore.
Anyway, this is it.
At some point you will run into issues when you encounter special characters like Chinese characters or emoticons in a string you want to decode i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
For my case (twitter data processing), I decoded as follows to allow me to see all characters with no errors
>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
>>> <foo>
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from,
but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quote string and the dash right before the closing 3-quotes.
Using a 3-quoted string will ensure that if the user enters ' \\" ' (spaces added for visual clarity) in the string it would not disrupt the evaluator;
The dash at the end is a failsafe in case the user's string ends with a ' \" ' . Before we assign the result we slice the inserted dash with [:-1]
So there would be no need to worry about what the users enter, as long as it is captured in raw format.
It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'

How to check if a string in Python is in ASCII?

I want to I check whether a string is in ASCII or not.
I am aware of ord(), however when I try ord('é'), I have TypeError: ord() expected a character, but string of length 2 found. I understood it is caused by the way I built Python (as explained in ord()'s documentation).
Is there another way to check?
I think you are not asking the right question--
A string in python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.
Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" -- This you can answer
by trying:
try:
mystring.decode('ascii')
except UnicodeDecodeError:
print "it was not a ascii-encoded unicode string"
else:
print "It may have been an ascii-encoded unicode string"
def is_ascii(s):
return all(ord(c) < 128 for c in s)
In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.
def isascii(s):
"""Check if the characters in string s are in ASCII, U+0-U+7F."""
return len(s) == len(s.encode())
To check, pass the test string:
>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True
New in Python 3.7 (bpo32677)
No more tiresome/inefficient ascii checks on strings, new built-in str/bytes/bytearray method - .isascii() will check if the strings is ascii.
print("is this ascii?".isascii())
# True
Vincent Marchetti has the right idea, but str.decode has been deprecated in Python 3. In Python 3 you can make the same test with str.encode:
try:
mystring.encode('ascii')
except UnicodeEncodeError:
pass # string is not ascii
else:
pass # string is ascii
Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.
Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0-1112064. But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.
Instead of ord(u'é'), try this:
>>> [ord(x) for x in u'é']
That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 770].
Instead of chr() to reverse this, there is unichr():
>>> unichr(233)
u'\xe9'
This character may actually be represented either a single or multiple unicode "code points", which themselves represent either graphemes or characters. It's either "e with an acute accent (i.e., code point 233)", or "e" (code point 101), followed by "an acute accent on the previous character" (code point 770). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.
Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing how how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.
Ran into something like this recently - for future reference
import chardet
encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
print 'string is in ascii'
which you could use with:
string_ascii = string.decode(encoding['encoding']).encode('ascii')
How about doing this?
import string
def isAscii(s):
for c in s:
if c not in string.ascii_letters:
return False
return True
I found this question while trying determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).
My first step should have been to check the type of the string- I didn't realize there I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.
If you're getting a rude and persistent
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)
particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode- for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)
Eventually I determined that what I wanted to do was this:
escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))
Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):
# -*- coding: utf-8 -*-
That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').
>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'àéç'
To improve Alexander's solution from the Python 2.6 (and in Python 3.x) you can use helper module curses.ascii and use curses.ascii.isascii() function or various other: https://docs.python.org/2.6/library/curses.ascii.html
from curses import ascii
def isascii(s):
return all(ascii.isascii(c) for c in s)
You could use the regular expression library which accepts the Posix standard [[:ASCII:]] definition.
A sting (str-type) in Python is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represent an ascii string, a string in a 8-bit charset like ISO-8859-1 or a string encoded with UTF-8 or UTF-16 or whatever.
However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
Like #RogerDahl's answer but it's more efficient to short-circuit by negating the character class and using search instead of find_all or match.
>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True
I imagine a regular expression is well-optimized for this.
import re
def is_ascii(s):
return bool(re.match(r'[\x00-\x7F]+$', s))
To include an empty string as ASCII, change the + to *.
To prevent your code from crashes, you maybe want to use a try-except to catch TypeErrors
>>> ord("¶")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
For example
def is_ascii(s):
try:
return all(ord(c) < 128 for c in s)
except TypeError:
return False
I use the following to determine if the string is ascii or unicode:
>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>>
Then just use a conditional block to define the function:
def is_ascii(input):
if input.__class__.__name__ == "str":
return True
return False

Categories