I am reading from a pyodbc connection whose source data is probably encoded in Big5.
for example:
cursor = myob_cnxn.cursor()
cursor.execute(f"select * from {tablename}")
rows = cursor.fetchall()
In the rows list, an address column can look like this:
s=rows[8628][3]
'¤EÀs»ò¦a¹D62¸¹¥Ã¦w¼s³õ2¼Ó8«Ç'
This is probably Big5; the source is a Hong Kong MYOB file. If I use the application's export feature and open the result with Big5 encoding, I get Chinese characters.
MYOB stores data in a file, and I think it follows the encoding of the Windows machine (which is Big5). I am running my code on such a Windows desktop, so my 32-bit Python 3.7 environment is on the same machine as the MYOB executable.
If I save that 'string' via vi and reopen it with Big5 encoding, I get Chinese characters, but also some errors that are not present in the application's export output.
Python thinks it is a string.
I have tried passing Big5 encoding to pyodbc, but it makes no difference.
That is, myob_cnxn.setencoding(encoding="big5")  # this is not going to work, and it doesn't
I am stuck with no good ideas. If I could get raw bytes back instead of decoded strings, I might have a chance, but I don't actually know what I am getting with pyodbc and this connection.
The answer came from @lenz (so far, in a comment).
This is my implementation, which uses errors="ignore" and therefore throws away some characters. For me, this is an acceptable outcome for this data.
big5_decoded = value.encode("latin-1", errors="ignore").decode("big5", errors="ignore")
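Applied to the whole result set, that one-liner can be wrapped in a small helper (a sketch only; the helper name and the list-of-lists shape are my own):
def fix_big5(value):
    # Round-trip: recover the original bytes, then decode them as Big5.
    # errors="ignore" drops characters that do not survive, which is
    # acceptable for this data.
    if isinstance(value, str):
        return value.encode("latin-1", errors="ignore").decode("big5", errors="ignore")
    return value

cursor = myob_cnxn.cursor()
cursor.execute(f"select * from {tablename}")
rows = [[fix_big5(value) for value in row] for row in cursor.fetchall()]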
I'm learning Python and tried to make a hangman game (called "forca" in Portuguese; I didn't know its English name). For those who aren't familiar with the game, the player must discover a secret word by guessing one letter at a time.
In my code, I get a collection of secret words which is imported from a txt file using the following code:
words_bank = open('palavras.txt', 'r')
words = []
for line in words_bank:
    words.append(line.strip().lower())
words_bank.close()
print(words)
The output of print(words) is ['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3'], but if I try print('maçã, açaí, tucumã') in order to check the special characters, everything is printed correctly. It looks like the issue is in encoding (or decoding; I'm still reading lots of articles to really understand the difference) the special characters read from files.
The content of line 1 of my code is # coding: utf-8 because, after some research, I found out that I have to declare the encoding used in the source file. Before adding it, I was receiving the following message when running the code:
File "path/forca.py", line 12
SyntaxError: Non-ASCII character '\xc3' in file path/forca.py on line 12, but no encoding declared
Line 12 content: print('maçã, açaí, tucumã')
Things that I've already tried:
adding encode='utf-8' as a parameter in open('palavras.txt', 'r')
adding decode='utf-8' as a parameter in open('palavras.txt', 'r')
the same as above, but with latin1
substituting the content of line 1 with # coding: latin1
My OS is Ubuntu 20.04 LTS, my IDE is VS Code.
Nothing works!
I don't know what to search for or what to do anymore.
SOLUTION HERE
Thanks to the help given by the friends above, I was able to find out that the real problem was the combination of the VS Code extension (Code Runner) and the python alias version on Ubuntu 20.04 LTS.
Code Runner is set to run code in the terminal in my setup, so apparently, when it called python, the alias pointed to Python 2.7.x. To overcome this I used this thread to set Python 3 as the default.
It's done! Whenever python is called, both in the terminal and in VS Code with Code Runner, all special characters work just fine.
Thanks, everybody, for your time and your help =)
This only happens when using Python 2.x.
The error is probably because you're printing the list itself rather than the items in the list.
When calling print(words) (words is a list), Python invokes a special function called repr on the list object. The list builds its summary representation by calling repr on each child in the list, then produces a neat string visualisation.
In Python 2, repr(string) returns an ASCII-only representation (with escapes) rather than something suitable for your terminal.
Instead, try:
for x in words:
    print(x)
Note: the keyword argument for open is encoding, e.g.
open('myfile.txt', encoding='utf-8')
You should always, always pass the encoding option to open. Python 3 on Linux and Mac will usually assume UTF-8 (based on the locale), while Python 3 on Windows will use a legacy 8-bit code page, so relying on the default is not portable; UTF-8 is only slated to become the default in a future Python release.
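Putting both points together for the original question, a minimal sketch (assuming palavras.txt is saved as UTF-8):
# Read the word bank with an explicit encoding, then print each word.
with open('palavras.txt', 'r', encoding='utf-8') as words_bank:
    words = [line.strip().lower() for line in words_bank]

for word in words:
    print(word)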
See Python 2.x vs 3.x behaviour:
Py2
>>> print ['maçã', 'açaí', 'tucumã']
['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']
>>> repr('maçã')
"'ma\\xc3\\xa7\\xc3\\xa3'"
>>> print 'maçã'
maçã
Py3
>>> print(['maçã', 'açaí', 'tucumã'])
['maçã', 'açaí', 'tucumã']
>>> repr('maçã')
"'maçã'"
I'm trying to add data to the DB with an external script.
In this script, I first create a list of model instances and then add them to the DB with the bulk_create method.
from shop.models import SpeciesOfWood

species_of_wood = [
    SpeciesOfWood(title="Ель"),
    SpeciesOfWood(title="Кедр"),
    SpeciesOfWood(title="Пихта"),
    SpeciesOfWood(title="Лиственница")
]

SpeciesOfWood.objects.bulk_create(species_of_wood)
This code works well in terms of adding rows to the DB, but I don't know what happened to the values I wanted to add; here is a screenshot:
I have already tried adding:
# -*- coding: utf-8 -*-
a u prefix to the title values
But it didn't change anything.
UPD 1
I tried creating the objects one by one with SpeciesOfWood.objects.create(...) and it also didn't change anything.
UPD 2
I tried adding Cyrillic data via the admin panel, and it works fine; the data looks the way I wanted. I still don't know why data added via the script ends up with the wrong encoding while data added via the admin panel is fine.
UPD 3
I tried using SpeciesOfWood.objects.create(...) via python manage.py shell, and it works well if I type it by hand. Also, it may be relevant: I am executing this dummy-data script using this code:
$ python manage.py shell
>>> exec(open("my_script.py").read())
This looks suspiciously like your database is misconfigured, or the software with which you're reading the data back is: the characters in the table image correspond to your original data encoded to UTF-8 and then decoded as Windows-1251 (a "legacy" Cyrillic encoding, although Wikipedia tells me it remains extremely popular):
>>> print("Ель".encode('utf-8').decode('windows-1251'))
Р•Р»СЊ
This means either that the database is configured such that the reader assumes Windows-1251 encoding, or that the software you use to view the database content assumes the database returns data in whatever encoding is set up on the system (and your desktop environment is configured for Cyrillic using Windows-1251 / cp1251).
Either way it doesn't look like an issue with the input to me as the data is originally encoded / stored as UTF-8.
The answer lies in the way I was executing the script. The default encoding of open() is platform dependent (whatever locale.getpreferredencoding() returns). I was mistaken in thinking that the default encoding when reading a file is UTF-8.
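One way to avoid this (a sketch, assuming my_script.py is saved as UTF-8) is to pass the encoding explicitly, so open() no longer relies on locale.getpreferredencoding():
>>> exec(open("my_script.py", encoding="utf-8").read())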
When I copy and paste the sentence How brave they’ll all think me at home! into a blank TextEdit rtf document on the Mac, it looks fine. But if I create an apparently identical rtf file programmatically and write the same sentence into it, on opening it in TextEdit it appears as How brave they’ll all think me at home! In the following code, the printed output is OK, but the file, when viewed in TextEdit, has problems with the right single quotation mark (here used as an apostrophe), Unicode U+2019.
header = r"""{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf400
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 """
sen = 'How brave they’ll all think me at home!'

with open('staging.rtf', 'w+') as f:
    f.write(header)
    f.write(sen)
    f.write('}')

with open('staging.rtf') as f:
    output = f.read()

print(output)
I’ve discovered from https://www.i18nqa.com/debug/utf8-debug.html that this may be caused by “UTF-8 bytes being interpreted as Windows-1252”, and that makes sense as it seems that ansicpg1252 in the header indicates US Windows.
But I still can’t work out how to fix it, even having read the similar issue here: Encoding of rtf file. I’ve tried replacing ansi with mac, without effect, and adding encoding='utf8' to the open call doesn’t seem to help either.
(The reason for using rtf by the way is to be able to export sentences with colour-coded words, allow them to be manually edited, then read back in for further processing).
OK, I've found the answer myself. I needed to pass encoding='windows-1252' both when writing the rtf file and when reading it back.
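In other words (a sketch of that fix, reusing the header and sentence from the question):
# Write and read the RTF with the encoding that the \ansicpg1252 header declares.
with open('staging.rtf', 'w+', encoding='windows-1252') as f:
    f.write(header)
    f.write(sen)
    f.write('}')

with open('staging.rtf', encoding='windows-1252') as f:
    print(f.read())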
With the following code:
import pandas as pd

filename = r"/path/to/my/file.csv"

with open(filename) as f:
    data_frame = pd.read_csv(f,
                             usecols=['col1', 'col2'],
                             parse_dates=['DateProd', 'DateStart', 'DateEnd'],
                             header=0,
                             delimiter=';',
                             encoding='latin-1')

print(data_frame)
When this is executed locally, it prints the expected dataframe with proper accents. When executed in an Airflow task on a remote worker, it fails with the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2581: invalid continuation byte
When looking at the full debug stack trace (sorry, sensitive information, I can't fully provide it), I can see that encoding='latin-1' was definitely specified, and it still fails with the UnicodeDecodeError. Has anybody noticed a similar issue? I've been running in circles, trying as many encodings as possible, but nothing seems to work.
I forgot to mention that the file is a remote file on a Samba share. Whether I try to read it directly with smbclient.open() or copy it over locally and then open it, I get the same result: UnicodeDecodeError. When I try the same thing locally (either reading directly from the Samba share or copying it over first), everything seems fine; I noticed I don't even need to specify the encoding, it is found automatically, and the accents are displayed properly.
Another update: it seems that whether the file is read from the Samba share or not makes no difference. I managed to run the Docker image that is used on the remote worker, and I can reproduce the issue with everything hosted locally, whether I open the file beforehand, read it entirely before giving it to pandas, or simply give the filename to read_csv.
The engine does not seem to make a difference either: specifying engine='python' or engine='c' yields the same result.
Another update: the same issue also happens with a fresh Ubuntu Docker image. I'm guessing some locales need to be installed before it is able to parse these characters.
I've figured it out.
On a Windows machine, the default encoding seems to be different; I don't even have to specify an encoding for it to work. Not so in the container. Thus, I need to specify the encoding when opening files in the container. The following should work:
import pandas as pd

filename = r"/path/to/my/file.csv"

with open(filename, encoding='latin-1') as f:
    data_frame = pd.read_csv(f,
                             usecols=['col1', 'col2'],
                             parse_dates=['DateProd', 'DateStart', 'DateEnd'],
                             header=0,
                             delimiter=';')
    # Notice the lack of encoding='latin-1' here.

print(data_frame)
But! SambaHook essentially returns pysmbclient's SambaClient. When you open a file with this SambaClient, there is no way to specify the file's encoding. So, locally, on a Windows machine, everything seems to work fine, while in a Linux container it fails with the UnicodeDecodeError. Looking under the hood, I've found that it essentially copies the file over before simply calling open() on it.
For now, this is my solution: copy the file over with the SambaClient returned by SambaHook into a temp file, open it with the proper encoding, and ask pandas to parse it (see the sketch below). I will see what I can do about improving SambaHook and pysmbclient so that others can specify the encoding when opening a file.
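A rough sketch of that workaround, where client.download is a placeholder for whatever method the SambaClient exposes to copy the remote file locally (the names here are illustrative, not the real API):
import tempfile

import pandas as pd


def read_remote_csv(client, remote_path):
    # Copy the remote file to a local temp file, then parse it with an
    # explicit encoding so the container's default (UTF-8) is never used.
    with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
        client.download(remote_path, tmp.name)  # hypothetical copy step
        with open(tmp.name, encoding="latin-1") as f:
            return pd.read_csv(f, header=0, delimiter=";")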
First of all, thank you to everyone on Stack Overflow for past, present, and future help. You've all saved me from disaster (both of my own design and otherwise) too many times to count.
The present issue is part of a decision at my firm to transition from a Microsoft SQL Server 2005 database to PostgreSQL 9.4. We have been following the notes on the Postgres wiki (https://wiki.postgresql.org/wiki/Microsoft_SQL_Server_to_PostgreSQL_Migration_by_Ian_Harding), and these are the steps we're following for the table in question:
Download table data [on Windows client]:
bcp "Carbon.consensus.observations" out "Carbon.consensus.observations" -k -S [servername] -T -w
Copy to Postgres server [running CentOS 7]
Run Python pre-processing script on Postgres server to change encoding and clean:
import sys
import os
import re
import codecs
import fileinput

base_path = '/tmp/tables/'
cleaned_path = '/tmp/tables_processed/'
files = os.listdir(base_path)

for filename in files:
    source_path = base_path + filename
    temp_path = '/tmp/' + filename
    target_path = cleaned_path + filename
    BLOCKSIZE = 1048576  # or some other, desired size in bytes
    with open(source_path, 'r') as source_file:
        with open(target_path, 'w') as target_file:
            start = True
            while True:
                contents = source_file.read(BLOCKSIZE).decode('utf-16le')
                if not contents:
                    break
                if start:
                    if contents.startswith(codecs.BOM_UTF8.decode('utf-8')):
                        contents = contents.replace(codecs.BOM_UTF8.decode('utf-8'), ur'')
                contents = contents.replace(ur'\x80', u'')
                contents = re.sub(ur'\000', ur'', contents)
                contents = re.sub(ur'\r\n', ur'\n', contents)
                contents = re.sub(ur'\r', ur'\\r', contents)
                target_file.write(contents.encode('utf-8'))
                start = False
    for line in fileinput.input(target_path, inplace=1):
        if '\x80' in line:
            line = line.replace(r'\x80', '')
        sys.stdout.write(line)
Execute SQL to load table:
COPY consensus.observations FROM '/tmp/tables_processed/Carbon.consensus.observations';
The issue is that the COPY command is failing with a unicode error:
[2015-02-24 19:52:24] [22021] ERROR: invalid byte sequence for encoding "UTF8": 0x80
Where: COPY observations, line 2622420: "..."
Given that this could very likely be because of bad data in the table (which also contains legitimate non-ASCII characters), I'm trying to find the actual byte sequence in context, and I can't find it anywhere (using sed to look at the line in question, regexes to replace the character as part of the preprocessing, etc.). For reference, this grep returns nothing:
cat /tmp/tables_processed/Carbon.consensus.observations | grep --color='auto' -P "[\x80]"
What am I doing wrong in tracking down where this byte sequence sits in context?
I would recommend loading the SQL file (which appears to be /tmp/tables_processed/Carbon.consensus.observations) into an editor that has a hex mode. This should allow you to see it (depending on the exact editor) in context.
gVim (or terminal-based Vim) is one option I would recommend.
For example, if I open in gVim an SQL copy file that has this content:
1 1.2
2 1.1
3 3.2
I can then convert it into hex mode via the command %!xxd (in gVim or terminal Vim) or the menu option Tools > Convert to HEX.
That yields this display:
0000000: 3109 312e 320a 3209 312e 310a 3309 332e 1.1.2.2.1.1.3.3.
0000010: 320a 2.
You can then run %!xxd -r to convert it back, or the Menu option Tools > Convert back.
Note: This actually modifies the file, so it would be advisable to do this to a copy of the original, just in case the changes somehow get written (you would have to explicitly save the buffer in Vim).
This way, you can see both the hex sequences on the left, and their ASCII equivalent on the right. If you search for 80, you should be able to see it in context. With gVim, the line numbering will be different for both modes, though, as is evidenced by this example.
It's likely the first 80 you find will be that line, though, since if there were earlier ones, it likely would've failed on those instead.
Another tool which might help that I've used in the past is the graphical hex editor GHex. Since that's a GNOME project, not quite sure it'll work with CentOS. wxHexEditor supposedly works with CentOS and looks promising from the website, although I've not yet used it. It's pitched as a "hex editor for massive files", so if your SQL file is large, that might be the way to go.
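If you would rather stay on the command line, a short Python scan can also report each raw 0x80 byte with its offset and some surrounding bytes (a minimal sketch):
# Scan the processed file for raw 0x80 bytes and show where they occur.
with open('/tmp/tables_processed/Carbon.consensus.observations', 'rb') as f:
    data = f.read()

offset = data.find(b'\x80')
while offset != -1:
    context = data[max(offset - 20, 0):offset + 20]
    print(offset, repr(context))
    offset = data.find(b'\x80', offset + 1)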