Printing all unicode characters in Python

I've written some code to create all 4-digit combinations of the hexadecimal system, and now I'm trying to use that to print out all the Unicode characters that are associated with those values. Here's the code I'm using to do this:
char_list =["0","1","2","3","4","5","6","7","8","9","A","B","C","D","E","F"]
pairs = []
all_chars = []
# Construct pairs list
for char1 in char_list:
    for char2 in char_list:
        pairs.append(char1 + char2)
# Create every combination of unicode characters ever
for pair1 in pairs:
    for pair2 in pairs:
        all_chars.append(pair1 + pair2)
# Print all characters
for code in all_chars:
    expression = "u'\u" + code + "'"
    print "{}: {}".format(code,eval(expression))
And here is the error message I'm getting:
Traceback (most recent call last):
  File "C:\Users\andr7495\Desktop\unifun.py", line 18, in <module>
    print "{}: {}".format(code,eval(expression))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)
The exception is thrown when the code tries to print u"\u0080"; however, I can do this in the interactive interpreter without a problem.
I've tried casting the results to unicode and specifying that errors be ignored, but it's not helping. I feel like I'm missing a basic understanding of how Unicode works. Is there anything I can do to get my code to print out all valid Unicode characters?

import sys
# sys.maxunicode is the highest valid code point, so add 1 to include it
for i in xrange(sys.maxunicode + 1):
    print unichr(i)

You're trying to format a Unicode character into a byte string. You can remove the error by using a Unicode string instead:
print u"{}: {}".format(code,eval(expression))
      ^
The other answers do a better job of simplifying the original problem, though; you're definitely doing things the hard way.

It is likely a problem with your terminal (cmd.exe is notoriously bad at this). Most of the time when you print, you are printing to a terminal, and that ends up trying to encode the text. If you run your code in IDLE or some other environment that can render Unicode, you should see the characters. Also, you should not use eval; try this instead:
for uni_code in range(...):
    print hex(uni_code),unichr(uni_code)
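
The same idea can be sketched without eval for the 4-digit hex codes from the question. This is only an illustration and assumes Python 3 (the question uses Python 2): chr(int(code, 16)) stands in for the eval call, the helper name bmp_characters is made up here, and surrogate code points are skipped because they are not characters on their own.

```python
import itertools

HEX_DIGITS = "0123456789ABCDEF"

def bmp_characters():
    """Yield (code, character) pairs for every 4-digit hex code point."""
    for digits in itertools.product(HEX_DIGITS, repeat=4):
        code = "".join(digits)
        value = int(code, 16)
        # Skip surrogates (U+D800..U+DFFF): they cannot be encoded by themselves.
        if 0xD800 <= value <= 0xDFFF:
            continue
        yield code, chr(value)

# Show three ASCII entries only, so this runs on any terminal encoding.
for code, char in itertools.islice(bmp_characters(), 0x41, 0x44):
    print("{}: {}".format(code, char))
```

Converting the hex string with int(code, 16) avoids building and evaluating source code, which is both slower and riskier than a plain conversion.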

Here's a rewrite of the examples in this article that saves the list to a file.
Python 3.x:
import sys
txtfile = "unicode_table.txt"
print("creating file: " + txtfile)
# errors='ignore' silently drops code points that UTF-16 cannot encode (the surrogates)
with open(txtfile, "w", encoding="utf-16", errors='ignore') as F:
    for uc in range(sys.maxunicode + 1):
        line = "%s %s" % (hex(uc), chr(uc))
        print(line, file=F)

Related

Unicode to ASCII string in Python 2

Trying to make a wall display of current MET data for a selected airport.
This is my first use of a Raspberry Pi 3 and of Python.
The plan is to read from a net data provider and show selected data on a LCD display.
The LCD library seems to work only in Python 2. JSON data seems to be easier to handle in Python 3.
The question "python json loads unicode" probably addresses my problem, but I do not understand what to do.
So, what should I change in my code?
Minimal example demonstrating my problem:
#!/usr/bin/python
import I2C_LCD_driver
import urllib2
import urllib
import json
mylcd = I2C_LCD_driver.lcd()
mylcd.lcd_clear()
url = 'https://avwx.rest/api/metar/ESSB'
request = urllib2.Request(url)
response = urllib2.urlopen(request).read()
data = json.loads(response)
str1 = data["Altimeter"], data["Units"]["Altimeter"]
mylcd.lcd_display_string(str1, 1, 0)
The error is as follows:
$ python Minimal.py
Traceback (most recent call last):
  File "Minimal.py", line 18, in <module>
    mylcd.lcd_display_string(str1, 1, 0)
  File "/home/pi/I2C_LCD_driver.py", line 161, in lcd_display_string
    self.lcd_write(ord(char), Rs)
TypeError: ord() expected a character, but string of length 4 found

It's a little bit hard to tell without seeing the inside of mylcd.lcd_display_string(), but I think the problem is here:
str1 = data["Altimeter"], data["Units"]["Altimeter"]
I suspect that you want str1 to contain something of type string, with a value like "132 metres". Try adding a print statement just after, so that you can see what str1 contains.
str1 = data["Altimeter"], data["Units"]["Altimeter"]
print( "str1 is: {0}, of type {1}.".format(str1, type(str1)) )
I think you will see a result like:
str1 is: ('foo', 'bar'), of type <type 'tuple'>.
The mention of "type tuple", the parentheses, and the comma indicate that str1 is not a string.
The problem is that the comma in the assignment statement does not do concatenation, which is perhaps what you are expecting. It joins the two values into a tuple. To concatenate, a flexible and sufficient method is to use the str.format() method:
str1 = "{0} {1}".format(data["Altimeter"], data["Units"]["Altimeter"])
print( "str1 is: {0}, of type {1}.".format(str1, type(str1)) )
Then I expect you will see a result like:
str1 is: 132 metres, of type <type 'str'>.
That value of type "str" should work fine with mylcd.lcd_display_string().
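
To see the difference in isolation, here is a small sketch written for Python 3. The data dictionary is a made-up stand-in for the decoded JSON response, so the values are assumptions rather than real API output.

```python
# Hypothetical stand-in for the decoded JSON; real values come from the API.
data = {"Altimeter": "1013", "Units": {"Altimeter": "hPa"}}

# A comma on the right-hand side builds a tuple of two strings.
as_tuple = data["Altimeter"], data["Units"]["Altimeter"]

# str.format() builds one string out of the two values.
as_text = "{0} {1}".format(data["Altimeter"], data["Units"]["Altimeter"])

print(type(as_tuple).__name__, as_tuple)
print(type(as_text).__name__, as_text)

# Looping over the tuple yields whole strings, which is why ord() complained
# about a string of length 4; looping over the text yields single characters.
item_lengths = [len(item) for item in as_tuple]
char_lengths = [len(char) for char in as_text]
```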

You are passing in a tuple, not a single string:
str1 = data["Altimeter"], data["Units"]["Altimeter"]
mylcd.lcd_display_string() expects a single string instead. Perhaps you meant to concatenate the two strings:
str1 = data["Altimeter"] + data["Units"]["Altimeter"]
Python 2 will implicitly encode Unicode strings to ASCII byte strings where a byte string is expected; if your data is not ASCII encodable you'd have to encode your data yourself, with a suitable codec, one that maps your Unicode codepoints to the right bytes for your LCD display (I believe you can have different character sets stored in ROM). This likely would involve a manual translation table as non-ASCII LCD ROM character sets seldom correspond with standard codecs.
If you just want to force the string to contain ASCII characters only, encode with the error handler set to 'ignore':
str1 = (data["Altimeter"] + data["Units"]["Altimeter"]).encode('ASCII', 'ignore')
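
The manual translation table mentioned above could look roughly like the following Python 3 sketch. Everything in it is an assumption for illustration: the name LCD_CHAR_MAP, the helper to_lcd_bytes, and the byte values, which have to be taken from the datasheet of the actual display controller.

```python
# Hypothetical mapping from Unicode characters to byte values in the display's
# character ROM. The positions below are assumed, not taken from a datasheet.
LCD_CHAR_MAP = {
    "\u00b0": 0xDF,  # degree sign (assumed position)
    "\u00e4": 0xE1,  # a with diaeresis (assumed position)
}

def to_lcd_bytes(text, fallback=ord("?")):
    """Convert text to a list of byte values for the display."""
    result = []
    for char in text:
        if ord(char) < 128:
            result.append(ord(char))  # plain ASCII passes through unchanged
        else:
            # Look up non-ASCII characters, falling back to "?" when unmapped.
            result.append(LCD_CHAR_MAP.get(char, fallback))
    return result

print(to_lcd_bytes("21\u00b0C"))
```

How these byte values are handed to the display depends on the driver library, which is not shown in the question.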

Subtitle Project: How to solve the unicode reading failure?

I'm working on a subtitle project.
It's complicated overall, but all I want here is to append some text to the end of every line in a converted ASS file (currently still a txt file while I experiment).
Untouched lines (I won't go into Aegisub problems here):
Dialogue: 0,0:00:00.00,0:00:03.90,Default,,0,0,0,,Hello, viewers. This is The Reassembler,
Dialogue: 0,0:00:03.90,0:00:07.04,Default,,0,0,0,,the show where we take everyday objects in their component form
Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly.
Objective:
Every line in the dialogue section appended with
'\N{\3c&HAA0603&\fs31\b1}'
Dialogue: 0,0:00:00.00,0:00:03.90,Default,,0,0,0,,Hello, viewers. This is The Reassembler,\N{\3c&HAA0603&\fs31\b1}
Dialogue: 0,0:00:03.90,0:00:07.04,Default,,0,0,0,,the show where we take everyday objects in their component form\N{\3c&HAA0603&\fs31\b1}
Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly.\N{\3c&HAA0603&\fs31\b1}
The Python 3.x code:
text1 = open('d:\Programs\sub1.txt','r')
text2 = open('e:\modsub.ass','w+')
alltext1 = text1.read()
lines = alltext1.split('\n')
for i in range(lines.index('[Events]')+1,len(lines)):
    lines[i] += ' hello '
print(lines)
text2.write(str(lines))
text1.close()
text2.close()
1. Python apparently doesn't recognize one or two characters in it as Unicode:
'\N{\3c&HAA0603&\fs31\b1}'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-23: unknown Unicode character name
How can I deal with this without affecting the output?
2. When I used ' hello ' instead of the subtitling code, the output was this:
'Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly. hello ', 'Dialogue: 0,0:00:10.24,0:00:11.72,Default,,0,0,0,,That feels very nice. hello ', 'Dialogue: 0,0:00:11.72,0:00:13.36,Default,,0,0,0,,Oh, yes. Look at that! hello ',
and so on, instead of one subtitle line per line of output.
How can I make the strings line up and get rid of the quotes and commas?

Use a raw string literal, i.e. replace:
'\N{\3c&HAA0603&\fs31\b1}'
with:
r'\N{\3c&HAA0603&\fs31\b1}'
In this way the interpreter will not try to look for the unicode character named \3c&HAA0603&\fs31\b1 which does not exist.
>>> '\N{\3c&HAA0603&\fs31\b1}'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-23: unknown Unicode character name
>>> r'\N{\3c&HAA0603&\fs31\b1}'
'\\N{\\3c&HAA0603&\\fs31\\b1}'
>>> print(r'\N{\3c&HAA0603&\fs31\b1}')
\N{\3c&HAA0603&\fs31\b1}

The problem is that you're using a string with \ characters in it, without escaping them. You need to double them up or use the r'' notation.
lines[i] += '\\N{\\3c&HAA0603&\\fs31\\b1}'
or
lines[i] += r'\N{\3c&HAA0603&\fs31\b1}'
As for your other problem, you're writing str(lines) which shows a literal representation. Use '\n'.join(lines) + '\n' instead.
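
Both fixes can be combined in a short Python 3 sketch. The helper name append_tag is made up, and restricting the change to lines that start with "Dialogue:" is an assumption about what "every line in the dialogue section" means.

```python
# Raw string literal: the backslashes stay literal, so \N{...} is not
# interpreted as a named Unicode escape.
TAG = r'\N{\3c&HAA0603&\fs31\b1}'

def append_tag(text, tag=TAG):
    """Append tag to every line that starts with 'Dialogue:'."""
    lines = text.split('\n')
    tagged = [line + tag if line.startswith('Dialogue:') else line
              for line in lines]
    # Join with newlines instead of str(lines), which would write the
    # list's literal representation (brackets, quotes and commas).
    return '\n'.join(tagged)

sample = "[Events]\nDialogue: 0,0:00:00.00,0:00:03.90,Default,,0,0,0,,Hello\n"
print(append_tag(sample))
```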

Python: UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-39: ordinal not in range(128)

I've got a data of twitter log and I have to sort the file to show each user's retweeted tweet ranking.
Here's the code.
import codecs
with codecs.open('hoge_qdata.tsv', 'r', 'utf-8') as tweets:
    tweet_list = tweets.readlines()
    tweet_list.pop(0)
facul={}
for t in tweet_list:
    t = t.split('\t')
    t[-2] = int(t[-2])
    if t[-2] <= 0:
        continue
    if not t[0] in facul:
        facul[t[0]] = []
    facul[t[0]].append(t)
def cmp_retweet(a,b):
    if a[-2] < b[-2]:
        return 1
    if a[-2] > b[-2]:
        return -1
    return 0
for f in sorted(facul.keys()):
    facul[f].sort(cmp=cmp_retweet)
    print ('[%s]' %(f))
    for t in facul[f][:5]:
        print ('%d:%s:%s' % (t[-2], t[2], t[-1].strip())
Somehow I got an error saying:
print '%d:%s:%s' %(t[-2], t[2], t[-1].strip())
UnicodeEncodeError: 'ascii' codec can't encode characters in position
34-39: ordinal not in range(128)
Looks like Japanese language letters can't be decoded. How can I fix this?
I tried to use sys.setdefaultencoding("utf-8") but then I got an error:
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
This is how I tried it:
import codecs
import sys
sys.setdefaultencoding("utf-8")
with codecs.open('hoge_qdata.tsv', 'r', 'utf-8') as tweets:
    tweet_list = tweets.readlines()
p.s. I am using Python version 2.7.5

The basic issue, as you have discovered, is that ASCII cannot represent much of unicode.
So you have to make a choice on how to handle it:
don't display non-ASCII chars
display non-ASCII chars as some other type of representation
The first choice would look like this:
for t in facul[f][:5]:
    print ('%d:%s:%s' % (
        t[-2],
        t[2].encode('ascii', errors='ignore'),
        t[-1].encode('ascii', errors='ignore').strip()
    ))
While the second choice would replace ignore with something like replace, xmlcharrefreplace, or backslashreplace.
Here's the reference.
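
For comparison, this short Python 3 sketch encodes one sample string with each of the handlers named above. The sample text and the variable names are chosen here for illustration only.

```python
text = "caf\u00e9"  # "cafe" with an accented final letter (U+00E9)

HANDLERS = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")

# Encode the same text to ASCII once per error handler.
results = {name: text.encode("ascii", errors=name) for name in HANDLERS}

for name in HANDLERS:
    print("{:18} {!r}".format(name, results[name]))
```

The first handler drops the character, the second substitutes a question mark, and the last two keep the information in an escaped, ASCII-only form.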

The error message is giving you two clues: first, the problem is in the statement
print '%d:%s:%s' %(t[-2], t[2], t[-1].strip())
Second, the problem is related to an encode operation. If you don't remember what is meant by "encode", now would be a good time to re-read the Unicode HOWTO in the Python 2.7 docs.
It looks like your list t[] contains Unicode strings. The print() statement is emitting byte strings. The conversion of Unicode strings to byte strings is encoding. Because you aren't specifying an encoding, Python is implicitly doing a default encoding. It uses the ascii codec, which cannot handle any accented or non-Latin characters.
Try splitting that print statement into two parts. First, insert the unicode t[] values into a unicode format string. Note the use of the u'' syntax. Second, encode the resulting unicode string to UTF-8 and print it.
s = u'%d:%s:%s' %(t[-2], t[2], t[-1].strip())
print s.encode('utf8')
(I haven't tested this change to your code. Let me know if it doesn't work.)
I think sys.setdefaultencoding() is probably a red herring, but I don't know your environment well.
By the way, the statement, as you write it above, has unbalanced parentheses. Did you drop a right parenthesis when you pasted in the code?
print ('%d:%s:%s' %(t[-2], t[2], t[-1].strip())
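
The two-step approach (format as text, then encode once) can be sketched in Python 3 as follows. The helper name format_row and the sample values are made up for illustration.

```python
def format_row(count, user, text):
    """Build the whole line as text first, then encode it once at the end."""
    line = "%d:%s:%s" % (count, user, text.strip())
    return line.encode("utf-8")

# Sample row with three Japanese characters written as escapes.
row = format_row(12, "hoge", "\u65e5\u672c\u8a9e\n")
print(repr(row))
```

Keeping everything as text until the final step means there is only one place where an encoding has to be chosen.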

Python-How to solve UnicodeEncodeError [duplicate]

import urllib, urllib2
from bs4 import BeautifulSoup, Comment
strg=""
iter=1
url='http://www.amazon.in/product-reviews/B00EOPJEYK/ref=cm_cr_pr_top_link_1? ie=UTF8&pageNumber=1&showViewpoints=0&sortBy=bySubmissionDateDescending'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
rows =soup.find_all('div',attrs={"class" : "reviewText"})
for row in soup.find_all('div',attrs={"class" : "reviewText"}):
    strg = strg +str(iter)+"." + row.text + "\n\n"
    iter=iter+1
with open('outp.txt','w') as f:
    f.write(strg)
    f.close()
I need this code to write the contents of the variable strg to the file outp.txt.
Instead I get this error:
Traceback (most recent call last):
File "C:\Python27\demo_amazon.py", line 14, in <module>
f.write(strg)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 226: ordinal not in range(128)
strg stores the required output. I guess there is some problem in the write statement. How can I solve this?
Kindly help.
Thank you.

Well, first of all, if you want to get rid of the Unicode errors, you should switch to Python 3, where strings are Unicode by default instead of byte strings as in Python 2.
That said, to get rid of the UnicodeEncodeError exception, you should do:
with open('outp.txt','w') as f:
    f.write(strg.encode('utf8'))
For reference, see that question. Also try to use Unicode strings as much as possible, so that you rarely have to convert between encodings: write u"this is a unicode string" instead of "this is an ascii string".
Thus, in your for loop:
strg = strg +str(iter)+"." + row.text + "\n\n"
should instead be:
strg = strg +unicode(iter)+u"." + row.text + u"\n\n"
and strg should be defined as strg = u""
N.B.: f.close() in your code is redundant, because the with statement takes care of closing the file when you exit the with block, through the __exit__() method of the file object.
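
An alternative to calling .encode() by hand is to let the file object do the encoding. The sketch below uses io.open, which accepts an encoding argument in both Python 2.7 and Python 3; it is written for Python 3, and the review texts and the temporary path are made up for illustration.

```python
import io
import os
import tempfile

# Made-up review texts; the first one contains the bullet U+2022 from the error.
reviews = ["Great phone \u2022 fast delivery", "Battery could be better"]

# Build the output as text, numbering each review as in the question.
text = "".join("{}.{}\n\n".format(n, review)
               for n, review in enumerate(reviews, 1))

path = os.path.join(tempfile.mkdtemp(), "outp.txt")

# The file object encodes to UTF-8 on write, so no explicit .encode() is needed.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading it back with the same encoding returns the original text.
with io.open(path, encoding="utf-8") as f:
    round_trip = f.read()
```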

Basically you have a non-ASCII character. I suggest using Unidecode which will try and find the "closest" ASCII character to the offending one. So, for instance it would turn é into e.
So you'd just do
from unidecode import unidecode
f.write(unidecode(strg))

How to encode/decode this file in Python?

I am planning to make a little Python game that will randomly print keys (English) out of a dictionary and the user has to input the value (in German). If the value is correct, it prints 'correct' and continue. If the value is wrong, it prints 'wrong' and breaks.
I thought this would be an easy task but I got stuck on the way. My problem is I do not know how to print the German characters. Let's say I have a file 'dictionary.txt' with this text:
cat:Katze
dog:Hund
exercise:Übung
solve:lösen
door:Tür
cheese:Käse
And I have this code just to test how the output looks like:
# -*- coding: UTF-8 -*-
words = {} # empty dictionary
with open('dictionary.txt') as my_file:
    for line in my_file.readlines():
        if len(line.strip())>0: # ignoring blank lines
            elem = line.split(':') # split on ":"
            words[elem[0]] = elem[1].strip() # appending elements to dictionary
print words
Obviously the result of the print is not as expected:
{'cheese': 'K\xc3\xa4se', 'door': 'T\xc3\xbcr',
'dog': 'Hund', 'cat': 'Katze', 'solve': 'l\xc3\xb6sen',
'exercise': '\xc3\x9cbung'}
So where do I add the encoding and how do I do it?
Thank you!

You are looking at byte string values, printed as repr() results because they are contained in a dictionary. String representations can be re-used as Python string literals and non-printable and non-ASCII characters are shown using string escape sequences. Container values are always represented with repr() to ease debugging.
Thus, the string 'K\xc3\xa4se' contains two non-ASCII bytes with hex values C3 and A4, a UTF-8 combo for the U+00E4 codepoint.
You should decode the values to unicode objects:
with open('dictionary.txt') as my_file:
    for line in my_file: # just loop over the file
        if line.strip(): # ignoring blank lines
            key, value = line.decode('utf8').strip().split(':')
            words[key] = value
or better still, use codecs.open() to decode the file as you read it:
import codecs
with codecs.open('dictionary.txt', 'r', 'utf8') as my_file:
    for line in my_file:
        if line.strip(): # ignoring blank lines
            key, value = line.strip().split(':')
            words[key] = value
Printing the resulting dictionary will still use repr() results for the contents, so now you'll see u'cheese': u'K\xe4se' instead, because \xe4 is the escape code for the Unicode code point U+00E4, the ä character. Print individual words if you want the actual characters to be written to the terminal:
print words['cheese']
But now you can compare these values with other data that you decoded, provided you know their correct encoding, and manipulate them and encode them again to whatever target codec you needed to use. print will do this automatically, for example, when printing unicode values to your terminal.
You may want to read up on Unicode and Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
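
As a self-contained variant of the decoding approach above, here is a Python 3 sketch that writes a small sample file and reads it back as text. The helper name load_words, the temporary path, and the use of split(":", 1) are choices made here for illustration.

```python
import io
import os
import tempfile

# Write a small sample file so the sketch does not depend on an existing one.
path = os.path.join(tempfile.mkdtemp(), "dictionary.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write("cat:Katze\n\ncheese:K\u00e4se\ndoor:T\u00fcr\n")

def load_words(filename):
    """Read key:value lines into a dict, decoding the file as UTF-8."""
    words = {}
    with io.open(filename, encoding="utf-8") as my_file:
        for line in my_file:
            if line.strip():  # ignore blank lines
                key, value = line.strip().split(":", 1)
                words[key] = value
    return words

words = load_words(path)
print(sorted(words))
```

Because the values are decoded text, len(words["cheese"]) is 4 characters rather than the 5 bytes of the UTF-8 form.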

This is how you should do it.
def game(input,answer):
    if input == answer:
        sentence = "You got it!"
        return sentence
    elif input != answer:
        wrong = "sorry, wrong answer"
        return wrong
