Writing from Python to a database with an encoding different from utf8 - python

Python 3.7.2
I write strings from my Python code into my database. My strings contain Latin and Cyrillic characters, so in the database I use the 1-byte encoding koi8-r. The surprising thing is that my strings are written to the database without distortion, even though utf8 and koi8r assign completely different byte values to the same characters (for example, as with ascii and utf8). Sometimes characters from other scripts appear in the text, and then write errors appear.
This raises a few questions:
Who converts the strings: the database, or the aiomysql library that I use to write to the database?
How can I quickly remove non-koi8-r characters in Python / MariaDB to avoid errors? (A possible approach is sketched below.)
Is there a multibyte encoding that stores Latin and Cyrillic characters in a single byte and characters from other scripts in more bytes?
Thank you in advance for participating in the conversation.
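One possible answer to the second question, as a minimal sketch on the Python side: round-trip the string through the koi8_r codec with errors="replace", so anything the column's charset cannot hold becomes "?" instead of raising an error at INSERT time (the function name below is made up for illustration).

def to_koi8r_safe(text):
    # Characters with no koi8-r equivalent become "?" instead of raising
    # UnicodeEncodeError when the string is written to the database.
    return text.encode("koi8_r", errors="replace").decode("koi8_r")

print(to_koi8r_safe("Привет, world, 你好"))   # -> Привет, world, ??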

Here's the processing when INSERTing:
The Client has the characters encoded with charset-1.
You told MySQL that that was the case when you connected or via SET NAMES.
The column that the characters will be inserted into is declared to be charset-2.
The INSERT converts from charset-1 to charset-2. So, all is well.
Upon SELECTing, the same thing happens, except that the conversion is in the other direction.
What you are doing is OK. But, going forward, everyone 'should' use UTF-8 characters in clients and CHARACTER SET utf8mb4 for columns. You will essentially have to switch to that if you ever branch out beyond what your current character sets allow, which may be nothing more than Russian and English.
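For illustration, a minimal aiomysql sketch that declares the client character set explicitly when connecting; the host, credentials, and the notes/body table and column are placeholders, and the key detail is the charset argument, which plays the role of charset-1 above:

import asyncio
import aiomysql

async def main():
    # charset tells MariaDB what encoding this client sends and expects back
    # (charset-1); the column's own charset (charset-2) is set in the table DDL.
    conn = await aiomysql.connect(
        host="127.0.0.1", user="user", password="secret",
        db="mydb", charset="utf8mb4",
    )
    cur = await conn.cursor()
    await cur.execute(
        "INSERT INTO notes (body) VALUES (%s)",
        ("Latin and Кириллица in one string",),
    )
    await conn.commit()
    await cur.close()
    conn.close()

asyncio.run(main())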

Related

Python unicode errors reading files generated by other apps

I'm getting decoding exception errors when reading export files from multiple applications. I have been running into this for a month, learning far more about unicode than I ever wanted to know, but some fundamentals are still missing. I understand UTF, I understand codepages, and I understand how they tend to be used in practice (e.g. a single codepage per document, though I can't imagine that's still true today--see the back page of a health statement with 15 languages).
Is it true that utf-8 can and does encode every possible unicode char? How then is it possible for one application to write a utf-8 file and another to not be able to read it?
When UTF is used, codepages are NOT used, is that correct? As I think it through, the codepage is an older scheme made obsolete by UTF, though I'm sure there are some exceptions.
UTF could also be looked at as a data compression scheme rather than an encoding one.
But there I'm stuck: in practice, I have 6 different applications made in different countries which can create export files, 3 in utf-8 and 3 in cp1252, yet Python 3.7 cannot read them without error:
'charmap' codec can't decode byte 0x9d in position 1555855: character maps to <undefined>
'charmap' codec can't decode byte 0x81 in position 4179683: character maps to <undefined>
I use Edit Pro to examine the files, and it reads them successfully. It points to a line that contains an extra pair of special double quotes:
"Metro Exodus review: “Not only the best Metro yet, it's one of the best shooters in years” | GamesRadar+"
Removing that ” allows python to continue reading in the file, to the next error.
Python reports it as char x9d, but a really old editor (Codewright, I believe) reports it as x94. Verified on the internet that it is an x94 and x93 pair, so it must be true. ;-)
It is very troublesome that I don't know for sure what the actual bytes are, as there are so many layers of translation, interpretation, format for display, etc.
So the Visual Studio debug report of x9d is a misdirect. What's going on with the Python library that it would report this?
How is this possible? I can find no info about how chars in one codepage can be invalid under utf (if that's the problem). What would I search under?
It should not be this hard. I have 30 years experience in programming c++, sql, you name it, learning new libraries, languages is just breakfast.
I also do not understand why the information to handle this is so hard to find. Surely numerous other programmers doing data conversions, import/exports between applications have run into this for decades.
The files I'm importing are csv files from 6 apps, and json files from another. The 6 apps export in utf-8 and cp1252 (as reported by Edit Pro) and the other app exports json in utf-8, though I could also choose csv.
The 6 apps run on an iPhone and export files I'm attempting to read on windows 10. I'm running python 3.7.8, though this problem has persisted since 3.6.3.
Thanks in advance
Dan
The error 'charmap' codec can't decode byte... shows that you are not using utf-8 to read the file. That's the source of your struggles on this one. Unless the file starts with a BOM (byte order mark), you kinda have to know how the file was encoded to decode it correctly.
utf-8 encodes all unicode characters and python should be able to read them all. Displaying is another matter. You need font files for the unicode characters to do that part. You were reading in "charmap", not "utf-8" and that's why you had the error.
"when utf is used" ... there are several UTF encodings. utf-8, utf-16-be (big endian), utf-16-le (little endian), utf-16 (synonym for utf-16-le), utf-32 variants (I've never seen this in the wild) and variants that include the BOM (byte order mark) which is an optional set of characters at the start of the file describing utf encoding type.
But yes, UTF encodings are meant to replace the older codepage encodings.
No, it's not compression. The encoded stream could be larger than the bytes needed to hold the string in memory. This is especially true of utf-8, less so with utf-16 (that's why Microsoft went with utf-16). But utf-8, as a superset of ASCII that does not have the byte order issues utf-16 has, offers many other advantages (that's why all the sane people chose it). I can't think of a case where a UTF encoding would ever be smaller than the count of its characters.
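A minimal sketch of the fix described above, using placeholder filenames; the point is simply to name the encoding instead of letting open() fall back to the platform default (on Windows that default is often cp1252, i.e. the "charmap" codec named in the error message):

with open("export_utf8.csv", encoding="utf-8") as f:       # file known to be UTF-8
    text = f.read()

with open("export_cp1252.csv", encoding="cp1252") as f:    # file known to be cp1252
    text = f.read()

# If the encoding is uncertain, errors="replace" swaps undecodable bytes for
# U+FFFD instead of raising, so the file can at least be inspected.
with open("export_unknown.csv", encoding="utf-8", errors="replace") as f:
    text = f.read()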

New line with invisible character

I'm sure this has been answered before but after attempting to search for others who had the problem I didn't have much luck.
I am using csv.reader to parse a CSV file. The file is in the correct format, but on one of the lines of the CSV file I get the notification "list index out of range" indicating that the formatting is wrong. When I look at the line, I don't see anything wrong. However, when I go back to the website where I got the text, I see a square/rectangle symbol where there is a space. This symbol must be leading csv.reader to treat that as a new line symbol.
A few questions: 1) What is this symbol and why can't I see it in my text files? 2) How do I avoid having these treated as new lines? I wonder if the best way is to find and replace them given that I will be processing the file multiple times in different ways.
Here is the symbol:
Update: When I copy and paste the symbol into Google it searches for Â (a-circumflex). However, when I copy and paste Â into my documents, it shows up correctly. That leads me to believe that the symbol is not actually Â.
This looks like a charset problem. The "Â" is what the UTF-8 encoding of a non-breaking space looks like when it is read as latin-1. Assuming you are running Windows, you are using one of the latin codepages as your character set; UTF-8 is the default encoding on OSX and Linux-based OSs. Most text editors use the OS locale as their default encoding, and thus save files created with them as latin-1. A lot of programmers on OSX have problems with non-breaking spaces because it is very easy to type one by mistake (it is Option+Spacebar) and impossible to see.
In Python >= 3.1, the csv reader supports dialects for solving those kinds of problems. If you know what program was used to create the csv file, you can manually specify a dialect, like 'excel'. You can use a csv sniffer to automatically deduce it by peeking into the file.
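A minimal sketch of that approach, assuming the file is UTF-8 and using a made-up path; the Sniffer guesses the dialect from a sample, and non-breaking spaces are normalized while reading:

import csv

with open("data.csv", newline="", encoding="utf-8") as f:
    sample = f.read(4096)
    f.seek(0)
    dialect = csv.Sniffer().sniff(sample)       # deduce delimiter/quoting by peeking
    for row in csv.reader(f, dialect):
        # replace non-breaking spaces (U+00A0) with ordinary spaces
        row = [cell.replace("\u00a0", " ") for cell in row]
        print(row)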
Life Management Advice: If you happen to see weird characters anywhere, always assume charset problems. There is an awesome charset problem debug table HERE.

Which character encoding Python 3.x supports for file I/O?

I had a problem writing a UTF-8 supported character (\ufffd) to a text file. I was wondering what is the most inclusive character set Python 3.x supports for writing string data to files.
I was able to overcome the problem by
valEncoded = (origVal.encode(encoding='ASCII', errors='replace')).decode()
which basically filtered out non-ASCII characters from origVal. But I figure Python file I/O should support more than ASCII, which is pretty conservative. So I am looking for what is the most inclusive character set supported.
Any of the UTF encodings should work:
UTF-8 is typically the most compact (particularly if the text is largely ASCII compatible), and portable between systems with different endianness without requiring a BOM (byte order mark). It's the most common encoding used on non-Windows systems that support the full Unicode range (and the most common encoding used for serving data over a network, e.g. HTML, XML, JSON)
UTF-16 is commonly used by Windows (the system "wide character" APIs use it as the native encoding)
UTF-32 is uncommon, and wastes a lot of disk space, with the only real benefit being a fixed ratio between characters and bytes (you can divide file size by four after subtracting the BOM and you get the number of characters)
In general, I'd recommend going with UTF-8 unless you know it will be consumed by another tool that expects UTF-16 or the like.
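A quick round-trip check along those lines, with a placeholder filename; declaring encoding="utf-8" lets the full Unicode range (including \ufffd) be written and read back unchanged:

text = "ASCII, accented é, Кириллица, and even \ufffd survive UTF-8"

with open("out.txt", "w", encoding="utf-8") as f:
    f.write(text)

with open("out.txt", encoding="utf-8") as f:
    assert f.read() == text   # nothing was replaced or dropped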

Dealing with encoding inconsistencies/cleaning hidden characters from webpage

I scraped the link below and I want to process the text for further analysis using Python. The segment at issue is "kwa vimada wake". I want to end up with text that corresponds to the way it is intended to display (and does display on my browser), as "kwa vimada wake". However, there are hidden characters around "vimada", which you can see if you copy the text and paste it into a program like Notepad++. These mess with my tokenization and NLP processing (the POS tagger doesn't recognize the word, for example) and seem not to stay consistent between my script and other programs (after using machine learning and then loading the results in my script, I end up with vimadaÃ, which it can't match with vimada�).
The webpage seems to be using UTF-8 encoding and my files are saved with UTF-8 encoding. If I could solve this issue and eliminate any strange/hidden characters, I would have no issues with consistency across files or using it as input into NLP tools.
My script is using # -*- coding: utf-8 -*-
I would prefer to work with the text I've already downloaded because security changes to the site have made re-scraping it impractical. My database has it saved as "kwa âvimadaâ wake". The begin/end characters display in Notepad++ as three characters each: [â][PAD][SOS] and [â][PAD][SGCI].
I want to remove Unicode white space/hidden characters and convert all variants of punctuation like apostrophes, quotation marks, hyphens, etc. into their ASCII equivalents. I would prefer to keep accented characters as is. However, not all accented characters are currently being interpreted correctly: some are encoded incorrectly, and some were changed on the website, presumably due to software changes, and show up as HTML character references such as &#233; (for é). So a simple deletion of a class of characters won't clean the data properly. I'm using Python 2.7.
http://www.jamiiforums.com/threads/rais-dhaifu-ccm-uchaguzi-2015.459292/#post-6461865
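A minimal sketch of the requested cleanup; it should run on both Python 2.7 and 3 (the u"" literals are accepted by both), the mapping is only a starting set rather than a complete inventory of "fancy" punctuation, and accented letters pass through untouched:

# -*- coding: utf-8 -*-
PUNCT_MAP = {
    u"\u2018": u"'", u"\u2019": u"'",    # curly single quotes
    u"\u201c": u'"', u"\u201d": u'"',    # curly double quotes
    u"\u2013": u"-", u"\u2014": u"-",    # en dash, em dash
    u"\u00a0": u" ",                     # non-breaking space -> plain space
    u"\u200b": u"", u"\ufeff": u"",      # zero-width space, stray BOM
}
TRANSLATION = {ord(k): v for k, v in PUNCT_MAP.items()}

def clean(text):
    # Only the characters listed above are touched; accented letters stay as is.
    return text.translate(TRANSLATION)

print(clean(u"kwa \u201cvimada\u201d wake"))   # -> kwa "vimada" wake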

Character Encoding

My text editor allows me to code in several different character formats: Ansi, UTF-8, UTF-8 (No BOM), UTF-16LE, and UTF-16BE.
What is the difference between them?
What is commonly regarded as the best format (I'm using Python, if that makes a difference)?
"Ansi" is a misnomer and usually refers to some 8-bit encoding that's the default on the current platform (on "western" Windows installations that's usually Windows-1252). It only supports a small set of characters (256 different characters at most).
UTF-8 is a variable-length, ASCII-compatible encoding capable of storing any and all Unicode characters. It's a pretty good choice for western text that should support all Unicode characters and a very viable choice in the general case.
"UTF-8 (no BOM)" is the name Windows gives to using UTF-8 without writing a Byte Order Marker. Since a BOM is not needed for UTF-8, it shouldn't be used and this would be the correct choice (pretty much everyone else calls this version simply "UTF-8"!).
UTF-16LE and UTF-16BE are the Little Endian and Big Endian versions of the UTF-16 encoding. Like UTF-8, UTF-16 is capable of representing any Unicode character; however, it is not ASCII-compatible.
Generally speaking UTF-8 is a great overall choice and has wide compatibility (just make sure not to write the BOM, because no BOM is what most other software expects).
UTF-16 could take less space if the majority of your text is composed of non-ASCII characters (i.e. doesn't use the basic latin alphabet).
"Ansi" should only be used when you have a specific need to interact with a legacy application that doesn't support Unicode.
An important thing about any encoding is that it is metadata that needs to be communicated in addition to the data. This means that you must know the encoding of a byte stream to interpret it as text correctly. So you should either use formats that document the actual encoding used (XML is a prime example here) or standardize on a single encoding in a given context and use only that.
For example, if you start a software project, then you can specify that all your source code is in a given encoding (again: I suggest UTF-8) and stick with that.
For Python files specifically, there's a way to specify the encoding of your source files.
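A minimal example of that declaration (PEP 263); it goes on the first or second line of the source file, and since Python 3 it is only needed when the file is saved in something other than the default UTF-8 (or when running Python 2):

# -*- coding: utf-8 -*-
greeting = "naïve café"   # non-ASCII literals decoded using the declared encoding
print(greeting)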
Note that "ANSI" is usually CP1252.
You'll probably get greatest utility with UTF-8 No BOM. Forget that ANSI and ASCII exist, they are deprecated dinosaurs.
