Japanese characters are garbled when connecting to mysql from python using docker - python

I created Python and MySQL containers using Docker, connected to MySQL from Python, and tried to display a table with fetchall, but the Japanese characters were garbled.
I entered the MySQL container and queried the table directly; depending on the character-encoding settings, the output may or may not be garbled.
Because of how the data was stored, it seems to display correctly when read as latin1.
Does anyone know how to display this data from Python? Or is it not possible to read the data as latin1?
[In case of garbled characters↓]
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
[If the characters are not garbled↓]
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
So I changed the character encoding from the Python code and tried again while printing out the settings, but the characters were still garbled when read from Python, even though the encoding settings were the same. I have tried the charset option and changing the charset (all utf8mb4, all utf8, and so on), and various other things, but the characters remain garbled.
[test.py]
# coding: latin1
import mysql.connector

# Connect to MySQL from Python
cnx = mysql.connector.connect(
    host='',
    port='3306',
    user='user',
    password='pass',
    database='test_database',
    charset='latin1'
)
cursor_ = cnx.cursor()
cursor_.execute("use test_database;")

# Change the character encoding and show the resulting settings
query1 = "SET CHARACTER SET latin1;"
cursor_.execute(query1)
cursor_.execute("show variables like 'chara%';")
rows = cursor_.fetchall()
for row in rows:
    print(row)

cursor_.execute("select * from test;")
rows = cursor_.fetchall()
for row in rows:
    print(row)
('character_set_client', 'latin1')
('character_set_connection', 'utf8mb4')
('character_set_database', 'utf8mb4')
('character_set_filesystem', 'binary')
('character_set_results', 'latin1')
('character_set_server', 'utf8mb4')
('character_set_system', 'utf8')
('character_sets_dir', '/usr/share/mysql/charsets/')
Also, I tried attaching an encoding call to the print, but it failed because each row is a tuple. I don't know how to solve this either.
for row in rows:
    print(row.encode('utf8'))  # also tried 'latin1'
AttributeError: 'tuple' object has no attribute 'encode'
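One approach worth sketching here (this is not from the original post): have the driver hand back raw bytes and decode them yourself. mysql.connector accepts use_unicode=False, which makes fetched column values come back as bytearray instead of str, so the stored bytes can be re-decoded with whatever encoding the data was really written in. A minimal sketch, assuming the stored bytes are actually cp932 (Shift_JIS); the true source encoding has to be confirmed against the data:

import mysql.connector

# use_unicode=False returns raw bytes instead of decoded strings,
# bypassing the server-side charset conversion entirely
cnx = mysql.connector.connect(
    host='',
    port='3306',
    user='user',
    password='pass',
    database='test_database',
    charset='latin1',
    use_unicode=False
)
cursor_ = cnx.cursor()
cursor_.execute("select * from test;")
for row in cursor_.fetchall():
    # cp932 is only a guess at the real encoding of the stored data
    print(tuple(col.decode('cp932', errors='replace')
                if isinstance(col, (bytes, bytearray)) else col
                for col in row))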

Related

Python's exif module and Umlauts in JPEG Metadata

I am writing a little script that should help me edit the EXIF metadata of JPEG files in Python, especially the 'artist' field, using the exif module in Python 3. However, as I am German, I have to work on a few files where the author field contains an Umlaut, such as 'ü'. If I now open one of these files in 'rb' mode, create an exif Image object with myimgobj=Image(myfile), and try to access myimgobj.artist, I get a long list of multiple (!) UnicodeDecodeErrors which are basically all the same:
'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)
For some of the error messages, it is not position 9, but 0, but I guess this can all be traced back to the same reason - the Umlaut. Everything works fine if there is no Umlaut in the field.
Is there any way I can work with the exif package and extract the artist, even if it contains an Umlaut?
Edit: To provide a minimal example, please consider any JPEG image where you set the artist field to 'ä' (I'd upload one, but the EXIF tags get removed during the upload). It then fails, for example, when I try to print the artist like this:
from exif import Image

with open('Umlaut.jpg','rb') as imgfile:
    my_image = Image(imgfile)
    print(my_image.artist)
Use the following:
import exifread

with open('Umlaut.jpg','rb') as imgfile:
    tags = exifread.process_file(imgfile)

print(tags)  # all tags
for i, tag in enumerate(tags):
    print(i, tag, tags[tag])  # tag by tag
Result, tested with the string äüÃ (== b'\xc3\xa4\xc3\xbc\xc3\x83'.decode('utf8')) inserted manually into the Authors field: .\SO\65720067.py
{'Image Artist': (0x013B) ASCII=äüÃ @ 2122, 'Image ExifOffset': (0x8769) Long=2130 @ 30, 'Image XPAuthor': (0x9C9D) Byte=äüÃ @ 4210, 'Image Padding': (0xEA1C) Undefined=[] @ 62, 'EXIF Padding': (0xEA1C) Undefined=[] @ 2148}
0 Image Artist äüÃ
1 Image ExifOffset 2130
2 Image XPAuthor äüÃ
3 Image Padding []
4 EXIF Padding []
In the light of these facts, you can change your code to
from exif import Image

with open('Umlaut.jpg','rb') as imgfile:
    my_image = Image(imgfile)
    # print(my_image.artist)  # error described below
    print(my_image.xp_author)  # äüÃ as expected
BTW, running your code unchanged, the following occurs (where every … Horizontal Ellipsis represents a bunch of messages in the full error traceback):
…
+--------+-----------+-------+----------------------+------------------+
| Offset | Access    | Value | Bytes                | Type             |
+--------+-----------+-------+----------------------+------------------+
|        |           |       |                      | AsciiZeroTermStr |
|        | [0:0]     | ''    |                      |                  |
| 0      | --error-- |       | c3 a4 c3 bc c3 83 00 |                  |
+--------+-----------+-------+----------------------+------------------+
UnicodeDecodeError occurred during unpack operation:
'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
…
+--------+-----------+-------+-------------------+----------+
| Offset | Access    | Value | Bytes             | Type     |
+--------+-----------+-------+-------------------+----------+
|        |           |       |                   | AsciiStr |
|        | [0:0]     | ''    |                   |          |
| 0      | --error-- |       | c3 a4 c3 bc c3 83 |          |
+--------+-----------+-------+-------------------+----------+
UnicodeDecodeError occurred during unpack operation:
'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

How do I decode b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"?

[Summary]:
The data grabbed from the file is
b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
How can I decode these bytes into readable Chinese characters?
======
I extracted some game scripts from an exe file. The file is packed with Enigma Virtual Box and I unpacked it.
Then I'm able to see the scripts' names just right, in English, as they are supposed to be.
When analyzing these scripts, I get an error that looks like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
I changed the decoding to GBK, and the error disappeared.
But the output file is not readable. It includes readable English characters and non-readable content which is supposed to be in Chinese. Example:
chT0002>pDIӘIʆ
I tried different encodings for saving the file and they show the same result, so the problem might be on the decoding part.
The data grabbed from the file is
b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
I tried many ways but I just can't decode these bytes into readable Chinese characters. Is there anything wrong with the file itself? Or somewhere else? I really need help, please.
One of the scripts is attached here.
In order to reliably decode bytes, you must know how the bytes were encoded. I will borrow the quote from the python codecs docs:
Without external information it’s impossible to reliably determine which encoding was used for encoding a string.
Without this information, there are ways to try and detect the encoding (chardet seems to be the most widely-used). Here's how you could approach that.
import chardet
data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
detected = chardet.detect(data)
decoded = data.decode(detected["encoding"])
The above example, however, does not work in this case because chardet isn't able to detect the encoding of these bytes. At that point, you'll have to either use trial-and-error or try other libraries.
One method you could use is to simply try every standard encoding, print out the result, and see which encoding makes sense.
codecs = [
    "ascii", "big5", "big5hkscs", "cp037", "cp273", "cp424", "cp437", "cp500", "cp720",
    "cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857", "cp858", "cp860",
    "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875",
    "cp932", "cp949", "cp950", "cp1006", "cp1026", "cp1125", "cp1140", "cp1250",
    "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257",
    "cp1258", "cp65001", "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312",
    "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2",
    "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1",
    "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", "iso8859_6", "iso8859_7",
    "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_11", "iso8859_13", "iso8859_14",
    "iso8859_15", "iso8859_16", "johab", "koi8_r", "koi8_t", "koi8_u", "kz1048",
    "mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman",
    "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", "shift_jisx0213",
    "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7",
    "utf_8", "utf_8_sig",
]
data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

for codec in codecs:
    try:
        print(f"{codec}, {data.decode(codec)}")
    except UnicodeDecodeError:
        continue
Output
cp037, nC«^ýËfimb«[
cp273, nC«¢ýËfimb«¬
cp437, ò├è░ìsåëöéè║
cp500, nC«¢ýËfimb«¬
cp720, ـ├è░së¤éè║
cp737, Χ├Λ░ΞsΗΚΦΓΛ║
cp775, Ģ├Ŗ░ŹsåēöéŖ║
cp850, ò├è░ìsåëöéè║
cp852, Ľ├Ő░ŹsćëöéŐ║
cp855, Ћ├і░ЇsєЅћѓі║
cp856, ץ├ך░םsזיפגך║
cp857, ò├è░ısåëöéè║
cp858, ò├è░ìsåëöéè║
cp860, ò├è░ìsÁÊõéè║
cp861, þ├è░Þsåëöéè║
cp862, ץ├ך░םsזיפגך║
cp863, Ï├è░‗s¶ëËéè║
cp864, ¼ﺃ├٠┌s│┬½∙├ﻑ
cp865, ò├è░ìsåëöéè║
cp866, Х├К░НsЖЙФВК║
cp875, nCα£δΉfimbας
cp949, 빩뒺뛱냹봻듆
cp1006, ﺣﺍsﭦ
cp1026, nC«¢`Ëfimb«¬
cp1125, Х├К░НsЖЙФВК║
cp1140, nC«^ýËfimb«[
cp1250, •ĂŠ°Ťs†‰”‚Šş
cp1251, •ГЉ°Ќs†‰”‚Љє
cp1256, •أٹ°چs†‰”‚ٹ؛
gbk, 暶姲峴唹攤姾
gb18030, 暶姲峴唹攤姾
latin_1, ðsº
iso8859_2, Ă°sş
iso8859_4, ðsē
iso8859_5, УАsК
iso8859_7, Γ°sΊ
iso8859_9, ðsº
iso8859_10, ðsš
iso8859_11, รฐsบ
iso8859_13, Ć°sŗ
iso8859_14, ÃḞsẃ
iso8859_15, ðsº
iso8859_16, Ă°sș
koi8_r, ∙ц┼╟█s├┴■┌┼╨
koi8_u, ∙ц┼╟█s├┴■┌┼╨
kz1048, •ГЉ°Қs†‰”‚Љғ
mac_cyrillic, Х√К∞НsЖЙФВКЇ
mac_greek, ïΟäΑçsÜâî²äΚ
mac_iceland, ï√ä∞çsÜâîÇä∫
mac_latin2, ē√äįćsÜČĒāäļ
mac_roman, ï√ä∞çsÜâîÇä∫
mac_turkish, ï√ä∞çsÜâîÇä∫
ptcp154, •ГҠ°ҚsҶү”ӮҠә
shift_jis_2004, 陛寛行̹狽桓
shift_jisx0213, 陛寛行̹狽桓
utf_16, 쎕낊玍覆芔몊
utf_16_be, 闃誰赳蚉钂誺
utf_16_le, 쎕낊玍覆芔몊
Edit: After running all of the seemingly legible results through Google Translate, I suspect this encoding is UTF-16 big-endian. Here are the results:
+-----------+---------------+--------------------+--------------------------+
| Encoding  | Decoded       | Language Detected  | English Translation      |
+-----------+---------------+--------------------+--------------------------+
| gbk       | 暶姲峴唹攤姾  | Chinese            | Jian Xian JiaoTanJiao    |
| gb18030   | 暶姲峴唹攤姾  | Chinese            | Jian Xian Jiao Tan Jiao  |
| utf_16    | 쎕낊玍覆芔몊  | Korean             | None                     |
| utf_16_be | 闃誰赳蚉钂誺  | Chinese            | Who is the epiphysis?    |
| utf_16_le | 쎕낊玍覆芔몊  | Korean             | None                     |
+-----------+---------------+--------------------+--------------------------+
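To spot-check that guess directly, you can decode with the suspected codec alone (a one-off sketch, not part of the original output):

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
# UTF-16 BE pairs the twelve bytes into six big-endian code units,
# which come out as the six CJK characters seen in the table above
print(data.decode("utf_16_be"))  # 闃誰赳蚉钂誺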

How can I create a table in python?

I have found a way to list all the alt codes but I want to put them in a
table so it looks something like this.
This is what I have tried:
variable = -1
for i in range(55295):
    print("---------")
    variable = variable + 1
    print(str(variable) + " " + chr(variable))
This code will print all the alt codes.
To get it into a table I tried this. (It has a time delay)
import time

variable = -1
#for i in range(55295):
for i in range(15):
    print("---------")
    variable = variable + 1
    print(" | " + str(variable) + " | " + chr(variable) + " | ")
    time.sleep(0.0001)
print("---------------------------------------------------")
I have run out of ideas, can you help please?
(This is the first time I've asked a question on here.)
You cannot print characters with such low codes on a 21st century system. The characters in the image are those that appeared on old MS-DOS systems (according to Wikipedia: Code Page 437). However, modern systems work with Unicode fonts, and the codes below 32 (space) are control codes, reserved for special purposes. The code 9, for example, inserts a Tab, and 10 puts the text cursor on a new line.
(This was also the case on those old systems but you could circumvent this by writing immediately into the video buffer. Nowadays, that is no longer an option on most computers.)
To get the modern equivalent of the old characters, you need a lookup list that translates them. I copied mine from the wiki page linked to above. Note that there is no official representation of the code 0000; I changed it to a space. This is only for the control codes below 32. There are a few codes above 126 that also may not show "correctly" (as in "not as on antique computers" 😄), but you can look them up on the wiki page.
To correctly align one- and two-digit numbers, use print formatting. Aligning can be done with functions such as rjust and .format; but, coming from a C background, I prefer what the documentation calls "Old style formatting" (https://docs.python.org/3/library/stdtypes.html#old-string-formatting).
cp437 = [0x0020, 0x263A, 0x263B, 0x2665, 0x2666, 0x2663, 0x2660, 0x2022, 0x25D8, 0x25CB,
         0x25D9, 0x2642, 0x2640, 0x266A, 0x266B, 0x263C, 0x25BA, 0x25C4, 0x2195, 0x203C,
         0x00B6, 0x00A7, 0x25AC, 0x21A8, 0x2191, 0x2193, 0x2192, 0x2190, 0x221F, 0x2194,
         0x25B2, 0x25BC]

for i in range(15):
    print("+----+-----+")
    print("| %2d | %s |" % (i, chr(cp437[i])))
print("+----+-----+")
This produces the following table:
+----+-----+
|  0 |   |
+----+-----+
|  1 | ☺ |
+----+-----+
|  2 | ☻ |
+----+-----+
|  3 | ♥ |
+----+-----+
|  4 | ♦ |
+----+-----+
|  5 | ♣ |
+----+-----+
|  6 | ♠ |
+----+-----+
|  7 | • |
+----+-----+
|  8 | ◘ |
+----+-----+
|  9 | ○ |
+----+-----+
| 10 | ◙ |
+----+-----+
| 11 | ♂ |
+----+-----+
| 12 | ♀ |
+----+-----+
| 13 | ♪ |
+----+-----+
| 14 | ♫ |
+----+-----+
You can try to use the pyplot table method: I have modified and extended your example code so that it looks like:
import matplotlib.pyplot as plt
import time

variable = 96
chars = []
#for i in range(55295):
for i in range(26):
    print("---------")
    variable = variable + 1
    chars.append([str(variable), chr(variable)])
    print(" | " + str(variable) + " | " + chr(variable) + " | ")
    time.sleep(0.0001)
print("---------------------------------------------------")

plt.figure('Table')
plt.axis('off')
columns = ['Alt Codes', 'Characters']
tab = plt.table(cellText=chars, loc='center', cellLoc='center', colLabels=columns)
plt.show()
The result looks like:
[png file of the table saved with matplotlib]
I did not succeed in making the characters with alt codes starting from 0 visible in the table (only in the console).
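If you also want the table written to a PNG like the one above, matplotlib can save the figure before showing it; a minimal sketch (the filename is illustrative):

# Call before plt.show(); bbox_inches='tight' crops the empty axes
# area so the image contains mostly the table itself
plt.savefig('table.png', bbox_inches='tight')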

How can I use a list-like type to generate a markdown with string.Template python?

I have the following Template
from string import Template
myTemplate = '''$heading
| Name | Age |
| ---- |---- |
'''
The problem is that I don't know when writing the template how many people there will be in the table. So I would like to pass in a list of tuples such as:
myTemplate.substitute(...=[("Tom", "23"), ("Bill", "43"), ("Tim", "1")])
How can this be done? If I just add in a placeholder for the list with tuples, this would not work since the surrounding formatting of the data would be lost.
I would like the template to capture the formatting and the list to capture the data and keep those two elements separate.
The result should be as follows:
| Name | Age |
| ---- |---- |
| Tom | 23 |
| Bill | 43 |
| Tim | 1 |
There may be a reason for not wanting to import a fully featured templating engine, such as wanting to run the code in a seriously resource-limited environment. If so, it's not hard to do this in a few lines of code.
The following can cope with a list of tuples of up to 26 elements, identified as $A to $Z in the template string, and returns a list of template expansions.
from string import Template

def iterate_template(template, items):
    AZ = ['ABCDEFGHIJKLMNOPQRSTUVWXYZ'[i:i+1] for i in range(26)]  # ['A','B',... 'Z']
    return [Template(template).safe_substitute(dict(zip(AZ, elem)))
            for elem in items]
Edit: for efficiency I should probably have instantiated the Template once and used it multiple times in the list comprehension:
def iterate_template(template, items):
    AZ = ['ABCDEFGHIJKLMNOPQRSTUVWXYZ'[i:i+1] for i in range(26)]  # ['A','B',... 'Z']
    tem = Template(template)
    return [tem.safe_substitute(dict(zip(AZ, elem))) for elem in items]
Examples of use:
>>> table = [('cats','feline'), ('dogs','canine')]
>>> iterate_template('| $A | $B |', table)
['| cats | feline |', '| dogs | canine |']
>>> x = Template('$heading\n$stuff').substitute(
...     heading='This is a title',
...     stuff='\n'.join(iterate_template('| $A | $B | $C |',
...         [('cats','feline'), ('dogs', 'canine', 'pack')]))  # slight oops
... )
>>> print(x)
This is a title
| cats | feline | $C |
| dogs | canine | pack |
I recommend Mustache. This is a simple template engine that can do what you need.
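For illustration, a minimal sketch using the chevron package (one Python implementation of Mustache; the template text and the names heading/people/name/age are my own, and pip install chevron is assumed). A Mustache section such as {{#people}}...{{/people}} repeats its body once per list element, which is exactly the per-row formatting the question asks for:

import chevron

# One row is emitted per element of the 'people' list; the section
# tags {{#people}} and {{/people}} delimit the repeated block
template = (
    "{{heading}}\n"
    "| Name | Age |\n"
    "| ---- |---- |\n"
    "{{#people}}"
    "| {{name}} | {{age}} |\n"
    "{{/people}}"
)

print(chevron.render(template, {
    "heading": "People",
    "people": [{"name": "Tom", "age": "23"},
               {"name": "Bill", "age": "43"},
               {"name": "Tim", "age": "1"}],
}))

This prints the heading followed by the table from the question, one | Name | Age | row per dict in the list.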

Convert Python Pretty table to CSV using shell or batch command line

What's an easy way to convert the output of Python PrettyTable to a programmatically usable format such as CSV?
The output looks like this:
C:\test> nova list
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks                          |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | -          | Running     | OpenStack-net=10.0.0.1, 10.0.0.3  |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
Perhaps this will get you close:
nova list | grep -v '\-\-\-\-' | sed 's/^[^|]\+|//g' | sed 's/|\(.\)/,\1/g' | tr '|' '\n'
This will strip the --- lines
Remove the leading |
Replace all but the last | with ,
Replace the last | with \n
Here's a real ugly one-liner
import csv

s = """\
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks                          |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | -          | Running     | OpenStack-net=10.0.0.1, 10.0.0.3  |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""

result = [tuple(filter(None, map(str.strip, splitline))) for line in s.splitlines() for splitline in [line.split("|")] if len(splitline) > 1]

# 'w' with newline='' (rather than 'wb') is what Python 3's csv module expects
with open('output.csv', 'w', newline='') as outcsv:
    writer = csv.writer(outcsv)
    writer.writerows(result)
I can unwrap it a bit to make it nicer:
splitlines = s.splitlines()
splitdata = [line.split("|") for line in splitlines]
# toss the lines that don't have any data in them -- pure separator lines
splitdata = [line for line in splitdata if len(line) > 1]
header, *data = [[field.strip() for field in line if field.strip()] for line in splitdata]
result = [header] + data
# I'm really just separating these, then re-joining them, but sometimes having
# the headers separately is an important thing!
Or possibly more helpful:
result = []
for line in s.splitlines():
    splitdata = line.split("|")
    if len(splitdata) == 1:
        continue  # skip lines with no separators
    linedata = []
    for field in splitdata:
        field = field.strip()
        if field:
            linedata.append(field)
    result.append(linedata)
@AdamSmith's answer has a nice method for parsing the raw table string. Here are a few additions to turn it into a generic function (I chose not to use the csv module, so there are no additional dependencies).
def ptable_to_csv(table, filename, headers=True):
    """Save PrettyTable results to a CSV file.

    Adapted from @AdamSmith https://stackoverflow.com/questions/32128226

    :param PrettyTable table: Table object to get data from.
    :param str filename: Filepath for the output CSV.
    :param bool headers: Whether to include the header row in the CSV.
    :return: None
    """
    raw = table.get_string()
    data = [tuple(filter(None, map(str.strip, splitline)))
            for line in raw.splitlines()
            for splitline in [line.split('|')] if len(splitline) > 1]
    if table.title is not None:
        data = data[1:]
    if not headers:
        data = data[1:]
    with open(filename, 'w') as f:
        for d in data:
            f.write('{}\n'.format(','.join(d)))
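For reference, a quick usage sketch (the table contents here are made up for illustration):

from prettytable import PrettyTable

# Build a small table, then dump it through the helper above
t = PrettyTable(['Name', 'Age'])
t.add_row(['Tom', '23'])
t.add_row(['Bill', '43'])
ptable_to_csv(t, 'people.csv')  # people.csv gets: Name,Age / Tom,23 / Bill,43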
Here's a solution using a regular expression. It also works for an arbitrary number of columns (the number of columns is determined by counting the number of plus signs in the first input line).
input_string = """spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks                          |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | -          | Running     | OpenStack-net=10.0.0.1, 10.0.0.3  |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
import re, csv, sys

def pretty_table_to_tuples(input_str):
    lines = input_str.split("\n")
    num_columns = len(re.findall(r"\+", lines[0])) - 1
    line_regex = r"\|" + (r" +(.*?) +\|" * num_columns)
    for line in lines:
        m = re.match(line_regex, line.strip())
        if m:
            yield m.groups()

w = csv.writer(sys.stdout)
w.writerows(pretty_table_to_tuples(input_string))
