I have a text file that contains lines like:
Name; Country
josué ségura;FR
Dr Gérald KIERZEK⚡👨â€âš•ï¸;FR #contains emoji
I need to decode this text as UTF-8, but I can't find a solution in Python.
I found a JavaScript solution on the internet, but I have never used JavaScript; I need a Python solution that can decode all the text (every line) as UTF-8.
Thank you very much
This is text that was originally encoded as UTF-8 but has been decoded with an 8-bit encoding (perhaps cp1252 or some other Windows encoding, perhaps latin-1). This is known as mojibake.
It can be correctly decoded by encoding as latin-1 to get bytes, then decoding as UTF-8.
>>> s = '33;josué ségura;FR'
>>> s.encode('latin').decode('utf-8')
'33;josué ségura;FR'
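To repair a whole file rather than a single string, the same round-trip can be applied line by line. A minimal sketch, assuming the file itself is saved as UTF-8 (so the mojibake characters survive the read) and using the hypothetical names broken.txt and fixed.txt; if a line contains cp1252-specific characters such as the euro sign, swap 'latin-1' for 'cp1252' in the encode step:

with open('broken.txt', encoding='utf-8') as src, open('fixed.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        # encode back to the bytes that were originally mis-decoded,
        # then decode those bytes properly as UTF-8
        dst.write(line.encode('latin-1').decode('utf-8'))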
I have to process an input text file, which can be in ANSI, and convert it to UTF-8, whilst doing some processing of the lines read. In Python, that'll amount to
with open(input_file_location, 'r', newline='\r\n', encoding='cp1252') as old, open(output_file_location, 'w', encoding='utf_8') as new:
    for line in old:
        modified = ...  # do processing here
        new.write(modified)
However, this will work as expected only if the input file is ANSI (Windows). If, however, the input file was originally UTF-8, the above code runs silently, reading it as if it were ANSI, and the output is not what was expected.
So the question is: how do I handle the scenario where the existing file is already UTF-8, so that I either read it as UTF-8 or, better, avoid the whole of the above processing?
Thanks
UTF-8 is more constraining than cp1252, and both are ASCII-compatible. So you can start by reading the file as UTF-8; if that works, you're fine (it's either plain ASCII or valid UTF-8). If it does not, fall back to cp1252.
Alternatively you could try running chardet on it, but that's not necessarily more reliable: every byte is "valid" in ISO-8859 encodings (of which cp1252 is a derivative), so every file "decodes properly"; the result is just garbage.
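That fallback takes only a few lines; a minimal sketch, assuming cp1252 is the only alternative and reusing the input_file_location name from the question:

try:
    with open(input_file_location, encoding='utf-8') as f:
        text = f.read()
except UnicodeDecodeError:
    # not valid UTF-8, so re-read with the Windows encoding
    with open(input_file_location, encoding='cp1252') as f:
        text = f.read()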
There isn't a guaranteed way to determine the encoding of a file if it isn't known in advance. However, if you are sure that the possibilities are restricted to UTF-8 and cp1252, then the following approach may work:
Open the file in binary mode and read the first three bytes. If these bytes are b'\xef\xbb\xbf' then the encoding is extremely likely to be 'utf-8-sig', a Microsoft variant of UTF-8 (unless you have cp1252 files that legitimately begin with "ï»¿"). See the final paragraph of this section of the codecs docs.
Assume UTF-8. Both UTF-8 and cp1252 will decode bytes in the ASCII range (0-127) identically. Single bytes with the high bit set are not valid UTF-8, so if the file is encoded as cp1252 and contains such bytes a UnicodeDecodeError will be raised.
Catch the above UnicodeDecodeError and try again with cp1252.
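Put together, the three steps might look like this; a minimal sketch, assuming the only candidates are UTF-8 (possibly with a BOM) and cp1252:

def read_text(path):
    # step 1: check for a UTF-8 BOM
    with open(path, 'rb') as f:
        if f.read(3) == b'\xef\xbb\xbf':
            with open(path, encoding='utf-8-sig') as g:
                return g.read()
    # step 2: assume UTF-8
    try:
        with open(path, encoding='utf-8') as f:
            return f.read()
    # step 3: fall back to cp1252
    except UnicodeDecodeError:
        with open(path, encoding='cp1252') as f:
            return f.read()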
I want to treat an Outlook .msg file as a string and check if a substring exists in it.
I thought importing the win32 library, which is suggested in similar SO threads, would be overkill.
Instead, I tried to just open the file the same way as a .txt file:
file_path= 'O:\\MAP\\177926 Delete comiitted position.msg'
mail = open(file_path)
mail_contents = mail.read()
print(mail_contents)
However, I get
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 870: character maps to <undefined>
Is there any decoding I can specify to make it work?
I have also tried
mail = open(file_path, encoding='utf-8')
which returns
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
Unless you're willing to do a lot of work, you really should use a library for this.
First, a .msg file is a binary file, so the contents should not be read in as a string. A string is often terminated with a null byte, and binary files can have a lot of those inside, which could mean you're not looking at all the data (it might depend on the implementation).
Also, the .msg file can have plain ascii and/or unicode in different parts/blocks of the file, so it would be really hard to treat this as one string to search for a substring.
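That said, if all you need is a crude check for a plain-ASCII substring, reading the raw bytes sidesteps the decode errors entirely. A minimal sketch, with the caveat that .msg files typically store their Unicode strings as UTF-16-LE, so the search term should be tried in both representations (the needle here is a placeholder):

needle = 'delete committed position'
with open(file_path, 'rb') as f:
    data = f.read()
# try both the 8-bit and the UTF-16-LE form of the search term
found = needle.encode('ascii') in data or needle.encode('utf-16-le') in data
print(found)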
As an alternative you could save the mails as .eml (i.e. the plain-text version of an e-mail), but there would still be some problems to overcome in order to search for specific text:
All data in an e-mail is lower ASCII (1-127), which means special characters have to be encoded down to lower-ASCII bytes. There are several different encodings for the headers (for example 'Subject'), the body, and attachments.
Body text: it can be plain text or HTML (or both). Lines and words can be split because there is a maximum line length. Different encodings can be used, even base64, in which you would never find the text you're looking for.
A lot more would have to be done to properly decode everything, but this should give you an idea of the work you would have to do in order to find the text you're looking for.
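If you do go the .eml route, Python's standard email package can undo most of those encodings for you. A minimal sketch, assuming the mail was saved as message.eml (a hypothetical name) and searching only the text parts:

import email
from email import policy

with open('message.eml', 'rb') as f:
    msg = email.message_from_binary_file(f, policy=policy.default)

needle = 'delete committed position'
# walk every MIME part; get_content() decodes base64/quoted-printable
# and the declared charset for us
for part in msg.walk():
    if part.get_content_type() in ('text/plain', 'text/html'):
        if needle in part.get_content():
            print('found in', part.get_content_type())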
When you face this type of issue, it is good practice to try Python's latin-1 encoding.
mail = open(file_path, encoding='Latin-1')
Windows' cp1252 encoding is often confused with Python's actual latin-1. The latter maps all 256 possible byte values to the first 256 Unicode code points, so decoding with it can never fail.
My files are in US-ASCII and a command like a = file('main.html') and a.read() loads them as ASCII text. How do I get it to load as UTF-8?
The problem I am trying to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating, as in template_str.format(attrib=val). But the string to interpolate is from a superset of ASCII.
Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?
You are trying to open files without specifying an encoding, which means that Python uses the default value (ASCII).
You need to decode the byte-string explicitly, using the .decode() function:
template_str = template_str.decode('utf8')
The val variable you tried to interpolate into your template is itself a unicode value, and Python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine both; it'll use the default encoding to do so.
Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.
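For completeness, reading the file with an explicit encoding avoids the implicit ASCII decode altogether. A minimal sketch for Python 2, assuming main.html is UTF-8 and val is the unicode value from your code:

import io

# io.open takes an encoding argument and returns unicode in Python 2
with io.open('main.html', encoding='utf-8') as f:
    template_str = f.read()  # already unicode, no implicit ASCII decode

result = template_str.format(attrib=val)  # both sides are now unicode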
A solution that works in Python 2:
import codecs
fo = codecs.open('filename.txt', 'r', 'ascii')
content = fo.read() ## returns unicode
assert type(content) == unicode
fo.close()
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str
I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.
You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.
I read in some data from a Danish text file, but I can't seem to find a way to decode it.
The original text is "dør" but in the raw text file it's stored as "d√∏r".
So I tried the obvious:
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message would be "dør"?
C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in utf-8 encoding internally.
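Putting those pieces together for Python 2, a minimal sketch (the filename is a placeholder):

import codecs
import sys

# make print safe regardless of what encoding the terminal reports
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

with codecs.open('danish.txt', encoding='utf-8') as f:
    for line in f:           # each line is a unicode object
        print line.rstrip()  # prints "dør" correctly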
Is there a way to recognize if a text file is UTF-8 in Python?
I would really like to get if the file is UTF-8 or not. I don't need to detect other encodings.
You mentioned in a comment you only need to detect UTF-8. If you know the alternative consists of only single-byte encodings, then there is a solution that often works.
If you know it's either UTF-8 or a single-byte encoding like latin-1, then try opening it first as UTF-8 and then as the other encoding. If the file contains only ASCII characters, it will end up opened as UTF-8 even if it was intended as the other encoding. If it contains any non-ASCII characters, this will almost always correctly detect the right character set between the two.
try:
    # or codecs.open on Python <= 2.5
    # or io.open on Python > 2.5 and <= 2.7
    filedata = open(filename, encoding='UTF-8').read()
except UnicodeDecodeError:
    filedata = open(filename, encoding='other-single-byte-encoding').read()
Your best bet is to use the chardet package from PyPI, either directly or through UnicodeDammit from BeautifulSoup:
chardet 1.0.1
Universal encoding detector
Detects:
ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-2, windows-1250 (Hungarian)
ISO-8859-5, windows-1251 (Bulgarian)
windows-1252 (English)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)
Requires Python 2.1 or later
However, some files will be valid in multiple encodings, so chardet is not a panacea.
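Typical usage looks like this; a minimal sketch, assuming chardet is installed and the filename is a placeholder:

import chardet

with open('unknown.txt', 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
if result['encoding'] is not None:
    text = raw.decode(result['encoding'])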
Reliably? No.
In general, a byte sequence has no meaning unless you know how to interpret it -- this goes for text files, but also integers, floating point numbers, etc.
But, there are ways of guessing the encoding of a file, by looking at the byte order mark (if there is one) and the first chunk of the file (to see which encoding yields the most sensible characters). The chardet library is pretty good at this, but be aware it's only a heuristic, albeit a rather powerful one.