How can I get a ° (degree) character into a string?
This is the most coder-friendly version of specifying a Unicode character:
degree_sign = u'\N{DEGREE SIGN}'
Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database
Reference: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
Note:
"N" must be uppercase in the \N construct to avoid confusion with the \n newline character
The character name inside the curly braces can be any case
It's easier to remember the name of a character than its Unicode code point. It's also more readable, and therefore debugging-friendly. The character substitution happens at compile time: the .py[co] file will contain a constant for u'°':
>>> import dis
>>> c= compile('u"\N{DEGREE SIGN}"', '', 'eval')
>>> dis.dis(c)
1 0 LOAD_CONST 0 (u'\xb0')
3 RETURN_VALUE
>>> c.co_consts
(u'\xb0',)
>>> c= compile('u"\N{DEGREE SIGN}-\N{EMPTY SET}"', '', 'eval')
>>> c.co_consts
(u'\xb0-\u2205',)
>>> print c.co_consts[0]
°-∅
>>> u"\u00b0"
u'\xb0'
>>> print _
°
BTW, all I did was search "unicode degree" on Google. This brings up two results:
"Degree sign U+00B0" and "Degree Celsius U+2103", which are actually different:
>>> u"\u2103"
u'\u2103'
>>> print _
℃
Put this line at the top of your source file:
# -*- coding: utf-8 -*-
If your editor uses a different encoding, substitute it for utf-8.
Then you can include UTF-8 characters directly in the source.
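For example, a sketch of such a source file (the variable name is just for illustration):
# -*- coding: utf-8 -*-
temperature = u"23.5 °C"  # the ° is typed directly into the UTF-8-encoded source
print(temperature)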
You can also use chr(176) to print the degree sign.
Here is an example using the Python 3.6.5 interactive shell:
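>>> chr(176)
'°'
>>> print("The outside temperature is 24.5" + chr(176) + "C")
The outside temperature is 24.5°C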
Just use \xb0 in a string; Python will convert it automatically:
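For example:
>>> print("45\xb0C")
45°C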
The answers above assume that UTF-8 encoding can safely be used; this one specifically targets Windows.
The Windows console normally uses CP850 encoding, not UTF-8, so if you try to use a UTF-8-encoded source file, you get those two (incorrect) characters ┬░ instead of a degree sign °.
Demonstration (using Python 2.7 in a Windows console):
deg = u'\xb0'  # Unicode code point for the degree sign
print deg.encode('utf8')
actually outputs ┬░.
Fix: just force the correct encoding (or better, use Unicode):
local_encoding = 'cp850' # adapt for other encodings
deg = u'\xb0'.encode(local_encoding)
print deg
or if you use a source file that explicitly defines an encoding:
# -*- coding: utf-8 -*-
local_encoding = 'cp850' # adapt for other encodings
print " The current temperature in the country/city you've entered is " + temp_in_county_or_city + "°C.".decode('utf8').encode(local_encoding)
Using a Python f-string, f"{var}", you can write:
theta = 45
print(f"Theta {theta}\N{DEGREE SIGN}.")
Output: Theta 45°.
Improving on tzot's answer:
Special case: Sending a string to a 16x2 or 20x4 LCD (e.g. attached to a Raspberry Pi with I2C LCD display):
f"Outside Temp: {my_temp_variable}{chr(223)}"
Related
For example, I can print Unicode symbol like:
print u'\u00E0'
Or
a = u'\u00E0'
print a
But it looks like I can't do something like this:
a = '\u00E0'
print someFunctionToDisplayTheCharacterRepresentedByThisCodePoint(a)
The main use case will be in loops. I have a list of unicode code points and I wish to display them on console. Something like:
with open("someFileWithAListOfUnicodeCodePoints") as uniCodeFile:
for codePoint in uniCodeFile:
print codePoint #I want the console to display the unicode character here
The file has a list of unicode code points. For example:
2109
00B0
00E4
1F1E6
The loop should output:
℉
°
ä
🇦
Any help will be appreciated!
This is probably not a great way, but it's a start:
>>> import struct
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä
First, we get the integer represented by the hexadecimal string x. We pack that into a byte string, which we can then decode using the utf_32_be encoding.
Since you are doing this a lot, you can precompile the struct:
int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
for code_point in fh:
print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')
If you think it's clearer, you can also use the decode method instead of the unicode type directly:
>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
Python 3 added a to_bytes method to the int class that lets you bypass the struct module:
>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
"ä"
You want print unichr(int('00E0',16)). Convert the hex string to an integer and print its Unicode codepoint.
Caveat: On Windows codepoints > U+FFFF won't work.
Solution: Use Python 3.3+ print(chr(int(line,16)))
In all cases you'll still need to use a font that supports the glyphs for the codepoints.
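Putting it together, a minimal sketch of the whole loop on Python 3.3+, using the file from the question:
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for line in fh:
        # int() tolerates the trailing newline; 16 = hexadecimal
        print(chr(int(line, 16)))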
These are Unicode code points that lack the \u Python unicode-escape prefix, so just add it:
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uniCodeFile:
for codePoint in uniCodeFile:
print "\\u" + codePoint.strip()).decode("unicode-escape")
Whether this works on a given system depends on the console's encoding. If it's a Windows code page and the characters are not in its range, you'll still get funky errors.
In Python 3, where the file opened in binary mode yields bytes, the prefix would be b"\\u".
I'm trying to use Python's sub function from the re module to recognize and change a pattern in a string. Below is my code.
old_string = "afdëhë:dfp"
newString = re.sub(ur'([aeiouäëöüáéíóúàèìò]|ù:|e:|i:|o:|u:|ä:|ë:|ö:|ü:|á:|é:|í:|ó:|ú:|à:|è:|ì:|ò:|ù:)h([aeiouäëöüáéíóúàèìòù])', ur'\1\2', old_string)
So what I'm looking to get after the code is applied is afdëë:dfp (without the h). So I'm trying to match a vowel (sometimes with accents, sometimes with a colon after it) then the h then another vowel (sometimes with accents). So a few examples...
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë
So I'm trying to remove the h when it is between two vowels, and also when it follows a vowel with a colon after it and precedes another vowel (i.e. a:ha). Any help is greatly appreciated. I've been playing around with this for a while.
A single user-perceived character may consist of multiple Unicode codepoints. Such characters can break u'[abc]'-like regex that sees only codepoints in Python. To workaround it, you could use u'(?:a|b|c)' regex instead. In addition, don't mix bytes and Unicode strings i.e., old_string should be also Unicode.
Applying the last rule fixes your example.
You could write your regex using lookahead/lookbehind assertions:
# -*- coding: utf-8 -*-
import re
from functools import partial
old_string = u"""
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë"""
# (?<=a|b|c)(:?)h(?=a|b|c)
chars = u"a e i o u ä ë ö ü á é í ó ú à è ì ò".split()
pattern = u"(?<=%(vowels)s)(:?)h(?=%(vowels)s)" % dict(vowels=u"|".join(chars))
remove_h = partial(re.compile(pattern).sub, ur'\1')
# remove 'h' followed and preceded by vowels
print(remove_h(old_string))
Output
ò:a becomes ò:a
ä:à becomes ä:à
aa becomes aa
üa becomes üa
ëë becomes ëë
For completeness, you could also normalize all Unicode strings in the program using unicodedata.normalize() function (see the example in the docs, to understand why you might need it).
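For instance, NFC normalization composes a base letter and a combining mark into the single precomposed code point, so both spellings of ë compare equal and match the same character class:
import unicodedata

decomposed = u'e\u0308'  # 'e' + COMBINING DIAERESIS, rendered like ë
composed = unicodedata.normalize('NFC', decomposed)
print(composed == u'\xeb')  # True: u'\xeb' is the precomposed ë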
It is an encoding issue. Different combinations of file encoding and old_string being non-Unicode behave differently across Python versions.
For example, your code works fine for Python 2.6 to 2.7 this way (all data below is cp1252-encoded):
# -*- coding: cp1252 -*-
old_string = "afdëhë:dfp"
but fails with SyntaxError: Non-ASCII character '\xeb' if no encoding is specified in the file.
However, those lines fail for Python 2.5 with:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
Meanwhile, for all Pythons, the substitution fails to remove the h when old_string is not Unicode:
# -*- coding: utf8 -*-
old_string = "afdëhë:dfp"
So you have to provide the correct encoding and define old_string as a Unicode string as well; for example, this one will do:
# -*- coding: cp1252 -*-
old_string = u"afdëhë:dfp"
So I have something like this:
x = "CЕМЬ"
x[:len(x)-1]
Which is to remove the last character from the string.
However it doesn't work and gives me an error. I figured it's because it's Unicode. So how do you do this simple formatting on non-ANSI strings?
That's because in Python 2.x, "CЕМЬ" is a strange way of writing the byte string b'C\xd0\x95\xd0\x9c\xd0\xac'.
You want a character string. In Python 2.x, character strings are prefixed with a u:
x = u"CЕМЬ"
x[:-1] # Returns u"CЕМ" (len(x) is implicit for negative values)
If you're writing this in a program (as opposed to an interactive shell), you will want to specify a source code encoding. To do that, simply add the following line to the beginning of the file, where utf-8 matches your file encoding:
# -*- coding: utf-8 -*-
Save the file with UTF-8 encoding:
# -*- coding: utf-8 -*-
x = u'CЕМЬ'
print x[:-1] #prints CЕМ
x = u'some string'
x2 = x[:-1]
Python file
# -*- coding: UTF-8 -*-
a = 'Köppler'
print a
print a.__class__.__name__
mydict = {}
mydict['name'] = a
print mydict
print mydict['name']
Output:
Köppler
str
{'name': 'K\xc3\xb6ppler'}
Köppler
It seems that the name remains the same, but only when printing the dictionary do I get this strange escaped character string. What am I looking at, then? Is that the UTF-8 representation?
The reason for that behavior is that the __repr__ function in Python 2 escapes non-ASCII unicode characters. As the link shows, this is fixed in Python 3.
Yes, that's the UTF-8 representation of ö (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS). It consists of a 0xC3 octet followed by a 0xB6 octet. UTF-8 is a very elegant encoding, I think, and worth reading up on. The history of its design (on a placemat in a diner) is described here by Rob Pike.
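You can verify the two octets directly (Python 2 shown, matching the question):
>>> u'\xf6'.encode('utf-8')
'\xc3\xb6'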
There are two methods in Python for displaying objects: str() and repr(). str() is what print uses internally; however, dict's str() apparently uses repr() for its keys and values.
As has been mentioned, repr() escapes Unicode characters.
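A quick interactive illustration (Python 2):
>>> print repr(u'\xf6')  # repr() escapes the non-ASCII character
u'\xf6'
>>> print u'\xf6'        # str()/print shows the character itself
ö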
It seems you are using Python 2.x, where you have to specify that the object is actually a Unicode string, not plain ASCII. You declared that the source is UTF-8, so you actually typed two bytes for your ö, and since the literal is a regular byte string, you got the two escaped chars.
Try specifying the string as Unicode: a = u'Köppler'. You may need to encode it before printing, depending on your console encoding: print a.encode('utf-8')
I have a string that looks like so:
6Â 918Â 417Â 712
The clear-cut way to trim this string (as I understand Python) is simple: the string is in a variable called s, so we get:
s.replace('Â ', '')
That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.
I never quite could understand how to switch between different encodings.
Here's the code, it really is just the same as above, but now it's in context. The file is saved as UTF-8 in notepad and has the following header:
#!/usr/bin/python2.4
# -*- coding: utf-8 -*-
The code:
f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id':'main_count'})
# making a print 's' here works fine; it shows 6Â 918Â 417Â 712
s.replace('Â ','')
save_main_count(s)
It gets no further than s.replace...
Throw out all characters that can't be interpreted as ASCII:
def remove_non_ascii(s):
    return "".join(c for c in s if ord(c) < 128)
Keep in mind that this is guaranteed to work with the UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).
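For example, on the byte string from the question (Python 2):
>>> remove_non_ascii('6\xc2\xa0918\xc2\xa0417\xc2\xa0712')
'6918417712'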
Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ascii unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.
See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding
To enable utf-8 source encoding, this would go in one of the top two lines:
# -*- coding: utf-8 -*-
The above is in the docs, but this also works:
# coding: utf-8
Additional considerations:
The source file must be saved using the correct encoding in your text editor as well.
In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u"") But in Python 3, just use quotes. In Python 2, you can from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware this affects the entire current module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
string.replace returns a new string and does not edit in place, so make sure you're using the return value as well
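For example:
s = s.replace(u"Â ", u"")  # bind the result; replace() leaves the original unchanged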
>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
The following code will replace all non ASCII characters with question marks.
"".join([x if ord(x) < 128 else '?' for x in s])
Using Regex:
import re
strip_unicode = re.compile("([^-_a-zA-Z0-9!##%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')
Way too late for an answer, but the original string was in UTF-8, and '\xc2\xa0' is UTF-8 for NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8') (\xa0 displays as a space when the bytes are decoded incorrectly as Windows-1252 or latin-1):
Example (Python 3)
s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE
Output
6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "")
print s
This will print out 6 918 417 712
I know it's an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes above 128 (or other if necessary).
Usage: str.translate(table[, deletechars])
>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )
>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6 918 417 712'
Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don't want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.
With unicode strings, the translation table is not a 256-character string but a dict with the ord() of relevant characters as keys. But anyway getting a proper ascii string from a unicode string is simple enough, using the method mentioned by truppo above, namely : unicode_string.encode("ascii", "ignore")
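For example, with a Unicode string, the table maps code points to replacement strings (or to None to delete them):
>>> u'R\xe9sultat'.translate({ord(u'\xe9'): u'e'})
u'Resultat'
>>> u'R\xe9sultat'.translate({ord(u'\xe9'): None})
u'Rsultat'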
As a summary, if for some reason you absolutely need to get an ascii string (for instance, when you raise a standard exception with raise Exception, ascii_message ), you can use the following function:
trans_table = ''.join([chr(i) for i in range(128)] + ['?'] * 128)

def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)
The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by '?'. This is often useful, for instance for indexing purposes.
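One way to build that kind of conversion with the standard library instead of a hand-written table (a sketch, not the only approach): decompose the accented characters with NFKD, then drop the combining marks:
import unicodedata

def fold_ascii(s):
    # NFKD splits 'é' into 'e' + combining accent; 'ignore' drops the accent
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

print fold_ascii(u'R\xe9sultat')  # prints: Resultat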
s.replace(u'Â ', '')  # the u before the string is important
and make your .py file Unicode (e.g. save it as UTF-8).
This is a dirty hack, but may work.
s2 = ""
for i in s:
if ord(i) < 128:
s2 += i
For what it's worth, my character set was UTF-8, and I had included the classic "# -*- coding: utf-8 -*-" line.
However, I discovered that I didn't have Universal Newlines when reading this data from a webpage.
My text had two words separated by "\r\n", but I was only splitting on "\n" and replacing it, which left the "\r" behind.
Once I looped through and saw the character set in question, I realized the mistake.
So, it could also be within the ASCII character set, but a character that you didn't expect.
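In that situation, a sketch of the safer approach (splitlines() handles "\r\n", "\r", and "\n" uniformly):
data = "first\r\nsecond"
words = data.splitlines()  # ['first', 'second'] -- no stray '\r' left behind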
My 2 pennies, with Beautiful Soup:
string = '<span style="width: 0px> dirty text begin ( ĀĒēāæśḍṣ <0xa0> ) dtext end </span></span>'
string = string.encode().decode('ascii', errors='ignore')
print(string)
will give
<span style="width: 0px> dirty text begin ( ) dtext end </span></span>