Strange symbols instead of Cyrillic in Django - Python

I'm trying to add data to the DB with an external script.
In this script I first create a list of model instances, then insert them into the DB with the bulk_create method.
from shop.models import SpeciesOfWood

species_of_wood = [
    SpeciesOfWood(title="Ель"),
    SpeciesOfWood(title="Кедр"),
    SpeciesOfWood(title="Пихта"),
    SpeciesOfWood(title="Лиственница")
]
SpeciesOfWood.objects.bulk_create(species_of_wood)
This code works in terms of adding data to the DB, but I don't know what happens to the values I wanted to add; here is a screenshot:
I already tried to add:
# -*- coding: utf-8 -*-
a u prefix to the title values
But it didn't change anything.
UPD 1
I tried to create the models one by one with SpeciesOfWood.objects.create(...) and it also doesn't change anything.
UPD 2
I tried to add Cyrillic data via the admin panel, and it works fine; the data looks the way I wanted. I still don't know why data added via the script ends up with the wrong encoding while data added via the admin panel is fine.
UPD 3
I tried to use SpeciesOfWood.objects.create(...) via python manage.py shell, and it works well when I type it by hand. Also, this may be relevant: I execute the dummy-data script like this:
$ python manage.py shell
>>> exec(open("my_script.py").read())

This looks suspiciously like your database is misconfigured, or the software with which you're reading the data back is: the characters in the table image correspond to your original data encoded to UTF-8 and then decoded as Windows-1251 (a "legacy" Cyrillic encoding, although Wikipedia tells me it remains extremely popular):
>>> print("Ель".encode('utf-8').decode('windows-1251'))
Р•Р»СЊ
This means either the database is configured such that the reader assumes Windows-1251 encoding, or the software you use to view the database content assumes the database returns data in whatever encoding is set up on the system (and your desktop environment is configured in Cyrillic using windows-1251 / cp1251).
Either way it doesn't look like an issue with the input to me as the data is originally encoded / stored as UTF-8.
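The mis-decoding is lossless here, which is what makes this diagnosis possible: re-encoding the garbled text back to Windows-1251 recovers the original UTF-8 bytes. A small sketch of that round trip:

```python
# Simulate the mojibake: UTF-8 bytes wrongly decoded as Windows-1251.
mangled = "Ель".encode("utf-8").decode("windows-1251")
print(mangled)  # Р•Р»СЊ - the garbage seen in the table

# Reversing the two steps recovers the original text, confirming the theory.
recovered = mangled.encode("windows-1251").decode("utf-8")
print(recovered)  # Ель
```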

The answer lies in the way I was executing the script: the default encoding used by open() is platform dependent (whatever locale.getpreferredencoding() returns). I was mistaken in thinking that the default encoding when reading a file is UTF-8.
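A minimal demonstration of the fix (using a temporary file rather than the original my_script.py): pass the encoding to open() explicitly instead of relying on the platform default.

```python
import os
import tempfile

# Write a script containing Cyrillic literals, UTF-8 encoded on disk.
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write('species = ["Ель", "Кедр"]\n')

# Explicit encoding: safe even when locale.getpreferredencoding()
# is not UTF-8 (e.g. cp1251 on a Russian Windows install).
ns = {}
with open(path, encoding="utf-8") as f:
    exec(f.read(), ns)
print(ns["species"][0])  # Ель

os.remove(path)
```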

Related

pyodbc encoding question: big5 text is being mangled

I am reading from a pyodbc connection with source data which is probably encoded in big5.
for example:
cursor = myob_cnxn.cursor()
cursor.execute(f"select * from {tablename}")
rows = cursor.fetchall()
In the rows list, an address column can look like this:
s=rows[8628][3]
'¤EÀs»ò¦a¹D62¸¹¥Ã¦w¼s³õ2¼Ó8«Ç'
This is probably big5; the source is a Hong Kong MYOB file. If I use the export feature of this application and open in big5 encoding, I get chinese characters.
MYOB stores data in a file, and I think it follows the encoding of the Windows machine (which is big5). I am running my code on such a Windows desktop, so my 32 bit python 3.7 environment is the same as the MYOB executable.
If I save that 'string' via vi and open it in big5 encoding, I get chinese characters, but also some errors, errors which are not present in the application's export feature.
python thinks it is a string.
I have tried passing big5 encoding to pyodbc but it makes no difference.
that is, myob_cnxn.setencoding(encoding="big5") #this is not going to work, and it doesn't
I am stuck with no good ideas. I think if I could get binary results instead of decoded strings I might have a chance, but I don't actually know what I am getting with pyodbc and this connection.
The answer came from #lenz (so far, in a comment).
This is my implementation, which uses errors="ignore" and therefore throws away some characters. For me, this is an acceptable outcome for this data.
big5_decoded = value.encode("latin", errors="ignore").decode("big5", errors="ignore")
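The trick works because Latin-1 maps every byte 0x00-0xFF to a character, so encoding the mangled string back to Latin-1 losslessly recovers the driver's original bytes, which can then be decoded as Big5. A sketch with simulated data (the errors="ignore" above is only needed when the driver's decode has already dropped or remapped some bytes):

```python
# Simulate what pyodbc returned: Big5 bytes decoded as Latin-1.
raw = "香港".encode("big5")       # actual bytes as stored by MYOB
mangled = raw.decode("latin-1")   # mojibake, as seen in rows[...]

# Undo the wrong decode, then apply the right one.
fixed = mangled.encode("latin-1").decode("big5")
print(fixed)  # 香港
```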

Special characters like ç and ã aren't decoded when the text is obtained from a file

I'm learning Python and tried to make a hangman game (I didn't know the game's name in English, sorry). For those who aren't familiar with this game, the player must discover a secret word by guessing one letter at a time.
In my code, I get a collection of secret words which is imported from a txt file using the following code:
words_bank = open('palavras.txt', 'r')
words = []
for line in words_bank:
    words.append(line.strip().lower())
words_bank.close()
print(words)
The output of print(words) is ['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3'], but if I try print('maçã, açaí, tucumã') to check the special characters, everything prints correctly. It looks like the issue is in encoding (or decoding... I'm still reading lots of articles to really understand it) special characters from files.
Line 1 of my code is # coding: utf-8 because after some research I found out that I have to declare the encoding of the source file. Before adding it, I received the following message when running the code:
File "path/forca.py", line 12
SyntaxError: Non-ASCII character '\xc3' in file path/forca.py on line 12, but no encoding declared
Line 12 content: print('maçã, açaí, tucumã')
Things that I've already tried:
add encode='utf-8' as parameter in open('palavras.txt', 'r')
add decode='utf-8' as parameter in open('palavras.txt', 'r')
same as above but with latin1
substitute line 1 content for #coding: latin1
My OS is Ubuntu 20.04 LTS, my IDE is VS Code.
Nothing works!
I don't know what to search for or what to do anymore.
SOLUTION HERE
Thanks to the help given by the friends above, I was able to find out that the real problem was the combination of the VS Code extension (Code Runner) and the python alias on Ubuntu 20.04 LTS.
Code Runner is set to run code in the terminal in my setup, so when it called python, the alias pointed to Python 2.7.x. To overcome this I used this thread to set Python 3 as the default.
It's done! Whenever python is called, both in the terminal and in VS Code with Code Runner, all special characters work just fine.
Thanks everybody for your time and your help =)
This only happens when using Python 2.x.
The problem is that you're printing a list, not the items in the list.
When calling print(words) (where words is a list), Python invokes a special function called repr on the list object. The list then builds a summary representation by calling repr on each child, producing a neat string visualisation.
In Python 2, repr(string) returns an ASCII representation (with escapes) rather than a version suitable for your terminal.
Instead, try:
for x in words:
    print(x)
Note: the option for open is encoding, e.g.:
open('myfile.txt', encoding='utf-8')
You should always, always pass the encoding option to open. The default is platform dependent (locale.getpreferredencoding()): Python on Linux and macOS will usually assume UTF-8, while Python on Windows will use a legacy 8-bit code page unless UTF-8 mode is enabled.
See Python 2.x vs 3.x behaviour:
Py2
>>> print ['maçã', 'açaí', 'tucumã']
['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']
>>> repr('maçã')
"'ma\\xc3\\xa7\\xc3\\xa3'"
>>> print 'maçã'
maçã
Py3
>>> print(['maçã', 'açaí', 'tucumã'])
['maçã', 'açaí', 'tucumã']
>>> repr('maçã')
"'maçã'"
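Putting the answer together in one runnable sketch (with a temporary file standing in for palavras.txt): write and read with an explicit UTF-8 encoding, and print the items rather than the list.

```python
import os
import tempfile

# Create a stand-in for palavras.txt, explicitly UTF-8 on disk.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("Maçã\nAçaí\nTucumã\n")

# Read it back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as words_bank:
    words = [line.strip().lower() for line in words_bank]

# Print items, not the list, so no repr escapes appear.
for word in words:
    print(word)

os.remove(path)
```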

ArcPy and Python encoding messing up?

I am faced with strange behavior between ArcPy and Python encoding. I work with Visual Studio 2010 Shell with Python Tools for Visual Studio (PTVS) installed. I isolated the problem in a simple py script file, which contains the commands below. In Visual Studio I have set "Advanced Save Options..." to "UTF-8 without signature". The script simply prints an accented string, imports the arcpy module, then prints the same string again. Importing ArcPy seems to change the Python encoding setup, but I don't know why, and I would like to restore it correctly because it causes problems a little bit everywhere in the original script.
I checked the Python encodings folder and erased every pyc file. Then I ran the script, and it generated 3 pyc files:
cp850.pyc (which corresponds to my stdout.encoding)
cp1252.pyc (which corresponds to my Windows environment encoding)
utf_8.pyc (which fits the encoding of my script)
When ArcPy is imported, something alters the encoding, which affects the initial variables.
Why?
Is it possible, with some Python command, to find where the ArcPy cp1252 encoding is set and read it, so that I can write a function that deals with it?
# -*- coding: utf-8 -*-
import sys
print('Loaded encoding : %(t)s' % {'t': sys.getdefaultencoding()})
reload(sys)  # See Stack Overflow question 2276200
sys.setdefaultencoding('utf-8')
print('Set default encoding : %(t)s' % {'t': sys.getdefaultencoding()})
print('')
texte = u'Récuperation des données'
print('Original type : %(t)s' % {'t': type(texte)})
print('Original text : %(t)s' % {'t': texte})
print('')
import arcpy
print('imported arcpy')
print('Loaded encoding : %(t)s' % {'t': sys.getdefaultencoding()})
print('')
print('arcpy mess up original type : %(t)s' % {'t': type(texte)})
print('arcpy mess up original text : %(t)s' % {'t': texte})
print('')
print('arcpy mess up reencoded with cp1252 type : %(t)s' % {'t': type(texte.encode('cp1252'))})
print('arcpy mess up reencoded with cp1252 text : %(t)s' % {'t': texte.encode('cp1252')})
raw_input()
and when I run the script, I get these results :
Loaded encoding : ascii
Set encoding : utf-8
Original type : type 'unicode'
Original text : Récuperation des données <--- This is right
import arcpy
Loaded encoding : utf-8
arcpy mess up original type : type 'unicode'
arcpy mess up original text : R'cuperation des donn'es> <--- This is wrong
arcpy mess up ReEncode with cp1252 type : type 'str'
arcpy mess up ReEncode with cp1252 text : Récuperation des données> <--- This is fits with the original unicode
Answering my question.
From ESRI support, I got this information :
By default, python in the command line is not going to change the code page to a UTF-8 based text for print statements to show up in Unicode. ArcGIS on the other hand specifically allows unicode values to be passed to it and has changed the code page within the command line so that the values you see printed are the values ArcGIS is using. This is why the command line should be the only environment where you see the import sys followed by import arcpy give you a different printed value.
Since my application runs scripts that do not always need arcpy, depending on what I want it to do, I solved my problem with a generic function that deals with the encoding, whether or not arcpy has been imported, using the information provided by:
Coding_CMD_Window = sys.stdout.encoding
Coding_OS = locale.getpreferredencoding()
Coding_Script = sys.getdefaultencoding()
Coding2Use = Coding_CMD_Window
if 'arcpy' in sys.modules:
    Coding2Use = Coding_OS
Also, I made sure that all of my scripts had the proper UTF-8 encoding without signature.
Hope this helps anyone.
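A Python 3 sketch of such a generic function (the selection rule is the one described above; the function name and signature are mine):

```python
import locale
import sys

def pick_output_encoding(loaded_modules=None):
    """Return the encoding to use for console output: the console's own
    encoding normally, but the OS preferred encoding once arcpy has been
    imported (per the ESRI explanation above)."""
    if loaded_modules is None:
        loaded_modules = sys.modules
    if "arcpy" in loaded_modules:
        return locale.getpreferredencoding()
    # sys.stdout.encoding can be None when stdout is replaced/redirected.
    return sys.stdout.encoding or locale.getpreferredencoding()
```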
For those in doubt, try something like the following (e.g., in a .py file):
import codecs
# import arcpy
f = codecs.open('utf.file.txt', encoding='utf-8-sig')  # assuming a BOM is present
l = f.readlines()
print u''.join(l)
Then run the same code once more, but first remove the hash comment from the arcpy line. It'll take about 6 seconds more time.
What I get is perfectly fine text running the first version, gibberish when allowing arcpy to load.
ArcGIS for Desktop version used: 10.2.1

Python and EasyEclipse: identical code but results differ (encoding)

I can't believe it. After I solved my question in Problems with Encoding in Eclipse Console and Python, I thought I wouldn't run into problems here again. But now this:
I have a program test.py in the project TestMe that looks like this:
print "ö"
-> Run as... Python Run results in
ö
So far so good. When I now copy the program in EasyEclipse by right-click/copy and paste, I receive the program copy of test.py in the same project, which looks exactly the same:
print "ö"
-> But Run as... Python Run results in
ö
I noticed that the file properties changed from encoding UTF-8 to Default, but changing it back to UTF-8 doesn't help either.
Another difference between the two files is the line ending, which is "Windows" in the original file and "Unix" in the copy (great definition of copy, btw). Changing this in Notepad++ also doesn't change anything.
I am perplexed...
Set up:
Python 2.5
Windows 7
Easy Eclipse 1.2.2.2
Settings that I've set to UTF-8 / Windows:
Project/Rightclick/Properties
File/Rightclick/Properties
Window/Preferences/Workspace
There are several places to change the encoding, broadest first:
Workspace Window > Preferences > General > Workspace
Project Properties
File Properties
Run Configuration.
Using the first method is the most useful, as the others, including the console, inherit from it by default, which is probably what you want.

Problems with Encoding in Eclipse Console and Python

I guess I need some help regarding encodings in Python (2.6) and Eclipse. I used Google and the SO search and tried a lot of things, but as a matter of fact I don't get it.
So, how do I achieve, that the output in the Eclipse console is able to show äöü etc.?
I tried:
Declaring the document encoding in the first line with
# -*- coding: utf-8 -*-
I changed the encoding settings in Window/Preferences/General/Workspace and Project/Properties to UTF-8
As nothing changed, I tried the following things alone and in combination, but nothing seemed to work:
Changing stdout as mentioned in the Python Cookbook:
sys.stdout = codecs.lookup("utf-8")[-1](sys.stdout)
Adding a unicode u prefix:
print u"äöü".encode('UTF8')
reloading sys (I don't know what for but it doesn't work either ;-))
I am trying to do this in order to debug the encoding-problems I have in my programs... (argh)
Any ideas? Thanks in advance!
EDIT:
I work on Windows 7 and it is EasyEclipse
Got it! If you have the same problem, go to
Run/Run Configurations/Common and select UTF-8 (for example) as the console encoding.
So, finally, print "ö" results in "ö"
Even though this is a bit of an old question, I'm new on Stack Overflow and I'd like to contribute a bit. You can change the default encoding in Eclipse (currently Neon) for all text editors from the menu Window -> Preferences -> General -> Workspace : Text file encoding