Dealing with encoding inconsistencies/cleaning hidden characters from webpage


I scraped the link below and I want to process the text for further analysis using Python. The segment at issue is "kwa vimada wake". I want to end up with text that corresponds to the way it is intended to display (and does display in my browser), as "kwa vimada wake". However, there are hidden characters around "vimada", which you can see if you copy the text and paste it into a program like Notepad++. These mess with my tokenization and NLP processing (the POS tagger doesn't recognize the word, for example) and seem not to stay consistent between my script and other programs (after using machine learning and then loading the results in my script, I end up with vimadaÃ, which it can't match with vimada�).
The webpage seems to be using UTF-8 encoding and my files are saved with UTF-8 encoding. If I could solve this issue and eliminate any strange/hidden characters, I would have no issues with consistency across files or using it as input into NLP tools.
My script is using # -*- coding: utf-8 -*-
I would prefer to work with the text I've already downloaded because security changes to the site have made re-scraping it impractical. My database has it saved as "kwa âvimadaâ wake". The begin/end characters display in Notepad++ as three characters each: [â][PAD][SOS] and [â][PAD][SGCI].
I want to remove unicode white space/hidden characters and convert all variants of punctuation like apostrophes, quotation marks, hyphens, etc. into their ASCII equivalents. I would prefer to keep accented characters as is. However, not all accented characters are currently being interpreted correctly. Some are encoded incorrectly, some were changed on the website presumably due to software changes and show up as html code like é. So a simple deletion of a class of characters won't clean the data properly. I'm using python 2.7.
http://www.jamiiforums.com/threads/rais-dhaifu-ccm-uchaguzi-2015.459292/#post-6461865
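
A note on the three-character sequences described above: Notepad++ shows [PAD], [SOS] and [SGCI] for the code points U+0080, U+0098 and U+0099, so the stored sequences correspond to the bytes E2 80 98 and E2 80 99, which are the UTF-8 encodings of the curly single quotes U+2018 and U+2019 read as Latin-1. A minimal sketch of one possible cleanup follows, written for Python 3 (the question uses 2.7). The function names and the punctuation table are illustrative assumptions, and the repair step only helps when a whole string was misdecoded in the same way.

```python
import html

# Assumed mapping for illustration; extend it with whatever variants occur in the data.
PUNCT_TO_ASCII = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2013: "-",   # en dash
    0x2014: "-",   # em dash
    0x00A0: " ",   # no-break space
    0x200B: None,  # zero-width space (deleted)
    0xFEFF: None,  # byte order mark / zero-width no-break space (deleted)
}


def repair_mojibake(text):
    """Undo UTF-8 bytes that were decoded as Latin-1; leave other text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean(text):
    text = repair_mojibake(text)           # fix misdecoded byte sequences
    text = html.unescape(text)             # turn HTML entities back into characters
    return text.translate(PUNCT_TO_ASCII)  # map punctuation variants to ASCII


# The stored value from the question, written with explicit escapes.
stored = "kwa \u00e2\u0080\u0098vimada\u00e2\u0080\u0099 wake"
print(clean(stored))
```

Accented characters that are already correct pass through unchanged, because re-decoding them as UTF-8 fails and the original string is returned.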

Related

Writing from Python to a database with an encoding different from utf8

Python 3.7.2
I write strings from my Python code into my database. My strings contain Latin and Cyrillic characters, so in the database I use the 1-byte encoding koi8-r. The miracle is that my strings are written to the database without distortion, although utf8 and koi8r encode characters with completely different byte sequences (unlike, for example, ascii and utf8). Sometimes characters from other scripts appear in the text, and then write errors occur.
Therefore, the question appears:
Who converts the strings: the database, or the aiomysql library that I use to write to the database?
How can I quickly remove non-koi8-r characters in Python / MariaDB to avoid errors?
Is there a multibyte encoding that stores Latin and Cyrillic characters in the first byte, and other scripts in additional bytes?
Thank you in advance for participating in the conversation.

Here's the processing when INSERTing:
The Client has the characters encoded with charset-1.
You told MySQL that that was the case when you connected or via SET NAMES.
The column that the characters will be inserted into is declared to be charset-2.
The INSERT converts from charset-1 to charset-2. So, all is well.
Upon SELECTing, the same thing happens, except that the conversion is in the other direction.
What you are doing is OK. But, going forward, everyone 'should' use UTF-8 characters in clients and CHARACTER SET utf8mb4 for columns. You will essentially have to change to such if you ever branch out beyond what your character sets allow, which may be nothing more than Russian and English.
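
To illustrate the conversion described above and the second question, here is a minimal Python sketch using only the standard codecs. The function name strip_non_koi8r is a hypothetical helper, not part of aiomysql or MariaDB. It shows that the same text has different byte sequences in the two encodings, and one way to drop characters that koi8-r cannot represent before inserting.

```python
def strip_non_koi8r(text):
    """Drop every character that has no representation in koi8-r."""
    return text.encode("koi8_r", errors="ignore").decode("koi8_r")


word = "Привет"
utf8_bytes = word.encode("utf-8")    # two bytes per Cyrillic letter
koi8_bytes = word.encode("koi8_r")   # one byte per Cyrillic letter

sample = "Привет, world! 你好"
print(len(utf8_bytes), len(koi8_bytes))
print(strip_non_koi8r(sample))
```

Using errors="replace" instead of errors="ignore" would keep a "?" placeholder for each dropped character, which can make data loss easier to spot.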

the different output of rare traditional Chinese with python running on Windows10 and Linux

I ran the Python 3 code print(u'𩝫') on both Windows 10 and Linux.
Linux shows the expected result '𩝫', but Windows shows '口口'.
At first, I thought it was because the Windows system language was Simplified Chinese, so I changed it to Traditional Chinese (Taiwan). But it still didn't work. I have tried a lot of methods, including codec.encode() and codec.decode(), but all failed.
Now my question is: how do I show the expected result '𩝫' on Windows?

Each character cell in the Windows console contains a single 16-bit character (WCHAR), which limits the console to the Unicode Basic Multilingual Plane (BMP), i.e. the first 65,536 code points, U+0000 through U+FFFF. The character "𩝫" is U+2976B, which has to be encoded as a pair of UTF-16 surrogate codes (U+D865 and U+DF6B) and stored in two consecutive character cells. Individually, surrogate codes are not valid Unicode characters. Typically they display as an empty box, or some other default glyph.
The surrogate pair can be copied from the console to the clipboard and pasted in another window. If the target window supports rendering non-BMP characters, then it should display properly as "𩝫".
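
The surrogate pair mentioned above can be checked with a few lines of arithmetic. This is a small sketch; utf16_surrogates is a hypothetical helper name, and the result is cross-checked against Python's own UTF-16 encoder.

```python
def utf16_surrogates(ch):
    """Return the UTF-16 code units for a single character as a tuple of ints."""
    cp = ord(ch)
    if cp < 0x10000:
        return (cp,)                    # BMP character: one 16-bit unit
    offset = cp - 0x10000               # 20-bit value split into two halves
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return (high, low)


char = "\U0002976B"
units = utf16_surrogates(char)
print([hex(u) for u in units])

# Cross-check against the built-in encoder (big-endian, no byte order mark).
encoded = char.encode("utf-16-be")
```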

New line with invisible character

I'm sure this has been answered before but after attempting to search for others who had the problem I didn't have much luck.
I am using csv.reader to parse a CSV file. The file is in the correct format, but on one of the lines of the CSV file I get the notification "list index out of range" indicating that the formatting is wrong. When I look at the line, I don't see anything wrong. However, when I go back to the website where I got the text, I see a square/rectangle symbol where there is a space. This symbol must be leading csv.reader to treat that as a new line symbol.
A few questions: 1) What is this symbol and why can't I see it in my text files? 2) How do I avoid having these treated as new lines? I wonder if the best way is to find and replace them given that I will be processing the file multiple times in different ways.
Here is the symbol:
Update: When I copy and paste the symbol into Google it searches for Â (a-circumflex). However, when I copy and paste Â into my documents, it shows up correctly. That leads me to believe that the symbol is not actually Â.

This looks like a charset problem. The "Â" is what the first byte of a UTF-8 non-breaking space looks like when it is read as latin-1. Assuming you are running Windows, you are probably using one of the Latin code pages as your character set, whereas UTF-8 is the default encoding for OS X and Linux-based OSs. Most text editors use the OS locale as their default, and thus encode files created with them as latin-1. A lot of programmers on OS X have problems with non-breaking spaces because they are very easy to type by mistake (Option+Spacebar) and impossible to see.
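
The effect can be reproduced in a few lines. This sketch assumes Python 3; normalize_spaces is a hypothetical helper that shows one way to replace the character before further processing.

```python
NBSP = "\u00a0"  # no-break space

# A no-break space is two bytes in UTF-8: 0xC2 0xA0.
utf8_bytes = ("price" + NBSP + "list").encode("utf-8")

# Reading those bytes as latin-1 turns 0xC2 into "Â" and keeps 0xA0 as a no-break space.
misread = utf8_bytes.decode("latin-1")


def normalize_spaces(text):
    """Replace no-break spaces with ordinary spaces."""
    return text.replace(NBSP, " ")


print(repr(misread))
print(repr(normalize_spaces(utf8_bytes.decode("utf-8"))))
```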
In Python >= 3.1, the csv reader supports dialects for solving these kinds of problems. If you know what program was used to create the csv file, you can manually specify a dialect, like 'excel'. You can also use a csv sniffer (csv.Sniffer) to deduce the dialect automatically by peeking into the file.
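
A minimal sketch of the sniffer approach, using an in-memory sample rather than a real file. The sample data is made up for illustration, and restricting the candidate delimiters is an assumption that makes detection more predictable. Note that a dialect describes delimiters and quoting, not the file's character encoding.

```python
import csv
import io

# Made-up sample standing in for the first few lines of a file.
sample = "name;city\nAnna;Berlin\nBen;Paris\n"

# Ask the sniffer to choose between the candidate delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";,")

rows = list(csv.reader(io.StringIO(sample), dialect))
print(dialect.delimiter, rows)
```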
Life Management Advice: If you happen to see weird characters anywhere, always assume charset problems. There is an awesome charset problem debug table HERE.

Private Unicode Character displays differently in Python 3 Interpreter

So I created a unicode character privately using Private Character Editor on Windows 10. The character was saved with the code E000. I copied it from the Character Map and pasted into a text editor and it worked. However, when I paste it into the Python IDLE editor it changes to a different unicode character, even before running the program. I can't use u'unicode_string' or anything like that because my unicode character doesn't even work in the interpreter. I am new to programming.
My question is, how do I use my private unicode character in Python 3.4?
This is what I see on Notepad.
This is what I see on Python 3.4 interpreter.

Python isn't really the interesting part of this, rather the shell or terminal is. In our case, Windows uses special code points to represent private character encodings. To get those, you need to get a hex dump of the character on a shell in Windows, then you can render the character in Python.
NOTE: Use Unicode points E021 or higher, since lower number code points are usually used for control, and it seems that the Windows shell that the python interpreter and IDLE use doesn't let you override those with private characters.
Demonstration
I tested your issue by generating a private character of my own. I will put an image of my test here since it wouldn't be rendered properly in text here on Stack Overflow.
Explanation
I used the Character Map program in Windows 10 to copy the symbol and paste it into my python environment. The environment may truncate it on the right since it is a wide character and the environment didn't seem to like that. (I moved the cursor around to get it to render full-width.)
Then I proceeded to get the hexdump of the code point by encoding the character using the default utf-8 encoding, which turned out to be \xee\x80\xa1 as a bytes object.
Next I printed the data as a string to show you a common error, and what would be printed if you attempted to print a string of those bytes.
Then, I printed b'\xee\x80\xa1', which is how you would actually use the symbol in your software.
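
The byte sequence above can be reproduced without a screenshot. This is a small sketch for Python 3, using the code point E021 from the note; whether the glyph renders depends on the font and terminal, which the code cannot check.

```python
import unicodedata

private_char = "\ue021"                     # code point in the Private Use Area

encoded = private_char.encode("utf-8")      # the three UTF-8 bytes
decoded = b"\xee\x80\xa1".decode("utf-8")   # back to a one-character string

# "Co" is the Unicode general category for private-use code points.
category = unicodedata.category(private_char)
print(encoded, decoded == private_char, category)
```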

You can use the \u escape sequence in your Python source code, like so:
my_unicode_string = 'This is my Unicode character: \ue000'
print(my_unicode_string)

Python, Windows, Ansi - encoding, again

Hello there,
even if i really tried... im stuck and somewhat desperate when it comes to Python, Windows, Ansi and character encoding. I need help, seriously... searching the web for the last few hours wasn't any help, it just drives me crazy.
I'm new to Python, so i have almost no clue what's going on. I'm about to learn the language, so my first program, which is almost done, should automatically generate music-playlists from a given folder containing mp3s. That works just fine, besides one single problem...
...i can't write Umlaute (äöü) to the playlist-file.
After i found a solution for "wrong-encoded" Data in the sys.argv i was able to deal with that. When reading Metadata from the MP3s, i'm using some sort of simple character substitution to get rid of all those international special chars, like french accents or this crazy skandinavian "o" with a slash in it (i don't even know how to type it...). All fine.
But i'd like to write at least the mentioned Umlaute to the playlist-file, those characters are really common here in Germany. And unlike the Metadata, where i don't care about some missing characters or miss-spelled words, this is relevant - because now i'm writing the paths to the files.
I've tried so many various encoding and decoding methods, i can't list them all here.. heck, i'm not even able to tell which settings i tried half an hour ago. I found code online, here, and elsewhere, that seemed to work for some purposes. Not for mine.
I think the tricky part is this: it seems like the Problem is the so-called Ansi format of the files i need to write. Correct - i actually need this Ansi-stuff. About two hours ago i actually managed to write whatever i'd like to a UTF-8 file. Works like a charm... until i realized that my Player (Winamp, old Version) somehow doesn't work with those UTF-8 playlist files. It couldn't resolve the Path, even if it looks right in my editor.
If i change the file format back to Ansi, Paths containing special chars get corrupted. I'm just guessing, but if Winamp reads this UTF-8 files as Ansi, that would cause the Problem i'm experiencing right now.
So...
I DO have to write äöü in a path, or it will not work
It DOES have to be an ANSI-"encoded" file, or it will not work
Things like line.write(str.decode('utf-8')) break the function of the file
A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned Metadata and allowed characters in it...)
Oh, and i'm using Python 2.7.3. Third-Party modules dependencies, you know...
Is there ANYONE who could guide me towards a way out of this encoding hell? Any help is welcome. If i need 500 lines of Code for another functions or classes, i'll type them. If there's a module for handling such stuff, let me know! I'd buy it! Anything helpful will be tested.
Thank you for reading, thanks for any comment,
greets!

As mentioned in the comments, your question isn't very specific, so I'll try to give you some hints about character encodings, see if you can apply those to your specific case!
Unicode and Encoding
Here's a small primer about encoding. Basically, there are two ways to represent text in Python:
unicode. You can consider that unicode is the ultimate encoding, you should strive to use it everywhere. In Python 2.x source files, unicode strings look like u'some unicode'.
str. This is encoded text - to be able to read it, you need to know the encoding (or guess it). In Python 2.x, those strings look like 'some str'.
This changed in Python 3 (unicode is now str and str is now bytes).
How does that play out?
Usually, it's pretty straightforward to ensure that your code uses unicode for its execution, and uses str for I/O:
Everything you receive is encoded, so you do input_string.decode('encoding') to convert it to unicode.
Everything you need to output is unicode but needs to be encoded, so you do output_string.encode('encoding').
The most common encodings are cp1252 on Windows (on US or EU systems), and utf-8 on Linux.
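
The decode-on-input, encode-on-output rule can be sketched as follows. This uses Python 3 naming, where encoded data is bytes and text is str; the sample bytes and encodings are assumptions chosen for illustration.

```python
# Input arrives as encoded bytes; here "Grüße" as cp1252 (assumed source encoding).
raw_input_bytes = b"Gr\xfc\xdfe"

# Decode once at the boundary, then work with text everywhere inside the program.
text = raw_input_bytes.decode("cp1252")

# Encode once when the text leaves the program, in whatever the consumer expects.
output_bytes = text.encode("utf-8")

print(repr(text), output_bytes)
```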
Applying this to your case
I DO have to write äöü in a path, or it will not work
Windows natively uses unicode for file paths and names, so you should actually always use unicode for those.
It DOES have to be an ANSI-"encoded" file, or it will not work
When you write to the file, be sure to always run your output through output.encode('cp1252') (or whatever encoding ANSI would be on your system).
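
One way to do this without calling encode on every line is to let the file object do the encoding. This is a minimal sketch assuming cp1252 is the ANSI code page; write_playlist and the sample path are hypothetical. io.open accepts an encoding argument in both Python 2.7 and Python 3, as long as the values written are unicode text.

```python
import io
import os
import tempfile


def write_playlist(paths, filename, encoding="cp1252"):
    """Write one path per line, encoded with the given (assumed ANSI) code page."""
    with io.open(filename, "w", encoding=encoding) as playlist:
        for path in paths:
            playlist.write(path + u"\n")


# Demo: write a path containing umlauts to a temporary file and read the raw bytes.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.m3u")
write_playlist([u"C:\\Musik\\M\u00e4dchen\\Gr\u00f6\u00dfe \u00fcber.mp3"], demo_path)
with open(demo_path, "rb") as handle:
    written = handle.read()
print(written)
```

Each umlaut ends up as a single byte in the file, which is what a player that reads the playlist as ANSI expects.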
Things like line.write(str.decode('utf-8')) break the function of the file
By now you probably realized that:
If str is indeed a str instance, Python will try to convert it to unicode using the utf-8 encoding, but then try to encode it again (likely in ascii) to write it to the file
If str is actually a unicode instance, Python will first encode it (likely in ascii, and that will probably crash) to then be able to decode it.
Bottom line: you need to know what str is. If it is unicode, you should encode it. If it's already encoded, don't touch it (or decode it then encode it if the encoding is not the one you want!).
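
That decision can be written down as a small helper. This sketch uses Python 3 types (str for unicode text, bytes for encoded data); to_encoded and its parameters are hypothetical names, and the caller still has to know the source encoding of any bytes it passes in.

```python
def to_encoded(value, target="cp1252", source=None):
    """Return value as bytes in the target encoding.

    Text is encoded. Bytes are returned untouched unless a different
    source encoding is given, in which case they are re-encoded.
    """
    if isinstance(value, bytes):
        if source is None or source == target:
            return value                # already encoded: don't touch it
        value = value.decode(source)    # wrong encoding: decode first
    return value.encode(target)


print(to_encoded("\u00e4"))                       # text gets encoded
print(to_encoded(b"\xe4"))                        # bytes pass through
print(to_encoded(b"\xc3\xa4", source="utf-8"))    # UTF-8 bytes are re-encoded
```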
A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned Metadata and allowed characters in it...)
Not a surprise, this only tells Python what encoding should be used to read your source file so that non-ascii characters are properly recognized.
Oh, and i'm using Python 2.7.3. Third-Party modules dependencies, you know...
Python 3 probably is a big update in terms of unicode and encoding, but that doesn't mean Python 2.x can't make it work!
Will that solve your issue?
You can't be sure, it's possible that the problem lies in the player you're using, not in your code.
Once you output it, you should make sure that your script's output is readable using reference tools (such as Windows Explorer). If it is, but the player still can't open it, you should consider updating to a newer version.

On Windows there is a special encoding available called mbcs, which converts between the current default ANSI code page and Unicode.
For example on a Spanish Language PC:
u'ñ'.encode('mbcs') -> '\xf1'
'\xf1'.decode('mbcs') -> u'ñ'
On Windows, ANSI means the current default multi-byte code page: for western European languages this is Windows-1252 (closely related to ISO-8859-1), for central and eastern European languages Windows-1250 (the counterpart of ISO-8859-2), and other code pages for other languages as appropriate.
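
Because the mbcs codec exists only on Windows, a script that must also run elsewhere needs a fallback. A minimal sketch, assuming cp1252 is an acceptable stand-in on other platforms; ANSI_CODEC, to_ansi and from_ansi are hypothetical names.

```python
import sys

# "mbcs" is only registered on Windows; cp1252 is an assumed fallback elsewhere.
ANSI_CODEC = "mbcs" if sys.platform == "win32" else "cp1252"


def to_ansi(text):
    """Encode unicode text with the ANSI code page (or the fallback)."""
    return text.encode(ANSI_CODEC)


def from_ansi(data):
    """Decode ANSI-encoded bytes back to unicode text."""
    return data.decode(ANSI_CODEC)


print(ANSI_CODEC, to_ansi(u"\u00f1"))
```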
More info available at:
https://docs.python.org/2.4/lib/standard-encodings.html
See also:
https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding

# -*- coding comments declare the character encoding of the source code (and therefore of byte-string literals like 'abc').
Assuming that by "playlist" you mean m3u files, then based on this specification you may be at the mercy of the mp3 player software you are using. This spec says only that the files contain text, no mention of what character encoding.
I have personally observed that various mp3 encoding software will use different encodings for mp3 metadata. Some use UTF-8, others ISO-8859-1. So you may have to allow encoding to be specified in configuration and leave it at that.
