Encoding issue when writing to text file, with Python

I'm writing a short Python script to 'manually' rearrange a CSV file into proper JSON syntax. From the input file I use readlines() to get the file as a list of rows, which I manipulate and concatenate into a single string that is then written to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and the output is double-spaced horizontally (a whitespace character appears between every two characters). As far as I can tell the problem has to do with the encoding, but I haven't been able to figure out what exactly. When I check the encoding of the input and output files (via the .encoding attribute), both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)

# make each line of the input file a value in a list
lines = my_file.readlines()

# break each line into a list so that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")

J_string = txt_to_JSON(lines)

json_file = open("output_file.txt", "w+")
json_file.write(J_string)  # note: the original snippet wrote 'jstring', an undefined name
json_file.close()

All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
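For example, a minimal sketch of guessing an unknown encoding with chardet (using the file name from the question; the result is only a guess, so check the reported confidence):

import chardet  # pip install chardet

with open("input_file.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'UTF-16', 'confidence': 0.99, ...}
text = raw.decode(guess["encoding"])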
I highly recommend Ned Batchelder's presentation http://nedbatchelder.com/text/unipain.html for details.
There's an explanation of the use of "unicode" as an encoding name on Windows here: What's the difference between Unicode and UTF-8?
TL;DR:
Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode", as that is also what they use internally.
Even though Python 2 is fairly lenient about str/unicode conversions, you should get used to always decoding on input and encoding on output.
In your case:

filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()

decoded_data = encoded_data.decode("utf-16")

# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)
encoded_result = result.encode("utf-16")  # really, just using UTF-8 for everything makes things a lot easier

outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)

You need to tell Python which character encoding to use to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
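Applied to the code in the question, a minimal sketch might look like this (assuming Python 2.7 and that the input file is UTF-16, which the "extra space between every character" symptom suggests; swap in the real encoding if it differs):

import io

# io.open gives encoding-aware file objects on Python 2.7
with io.open("input_file.txt", encoding="utf-16") as f:
    lines = [line.split(u"\t") for line in f]  # unicode strings throughout

J_string = txt_to_JSON(lines)  # the helper from the question, assumed to return a unicode string

with io.open("output_file.txt", "w", encoding="utf-8") as out:
    out.write(J_string)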

Related

Ignore UTF-8 decoding errors in CSV

I have a CSV file (which I have no control over). It's the result of concatenating multiple CSV files. Most of the file is UTF-8 but one of the files that went into it had fields that are encoded in what looks like Windows-1251.
I actually only care about one of the fields which contains a URL (so it's valid ASCII/UTF-8).
How do I ignore decoding errors in the other CSV fields if I only care about one field that I know is ASCII? Alternatively, for a more general solution, how do I change the encoding of each line of a CSV file when there's an encoding error?
csv.reader and csv.DictReader take lists of strings (a list of lines) as input, not just file objects.
So, open the file in binary mode (mode="rb"), figure out the encoding of each line, decode the line using that encoding and append it to a list and then call csv.reader on that list.
One simple heuristic is to try to read each line as UTF-8 and if you get a UnicodeDecodeError, try decoding it as the other encoding. We can make this more general by using the chardet library (install it with pip install chardet) to guess the encoding of each line if you can't decode it as UTF-8, instead of hardcoding which encoding to fall back on:
import csv
import chardet  # pip install chardet

my_csv = "some/path/to/your_file.csv"

lines = []
with open(my_csv, "rb") as f:
    for line in f:
        detected_encoding = chardet.detect(line)["encoding"]
        try:
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            line = line.decode(detected_encoding)
        lines.append(line)

reader = csv.DictReader(lines)
for row in reader:
    do_stuff(row)
If you do want to just hardcode the fallback encoding and don't want to use chardet (there's a good reason not to: it's not always accurate), you can simply replace the variable detected_encoding with "Windows-1251" or whatever encoding you want in the code above.
This is of course not perfect, because the fact that a line decodes successfully with some encoding doesn't mean it's actually in that encoding. If you only have to do this a few times, it's better to do something like print out each line together with its detected encoding and figure out by hand where one encoding starts and the other ends. Ultimately, the right strategy is probably to reverse the step that led to the broken input (the concatenation of the files) and re-do it correctly, normalizing the parts to the same encoding before concatenating them.
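For example, a rough sketch of that re-concatenation (the file names and per-file encodings here are just placeholders for whatever your real sources are):

sources = [("part1.csv", "utf-8"), ("part2.csv", "windows-1251")]

with open("combined.csv", "w", encoding="utf-8") as out:
    for path, enc in sources:
        with open(path, encoding=enc) as f:
            out.write(f.read())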
In my case, I counted how many lines were detected as which encoding
import chardet
from collections import Counter

my_csv_file = "some_file.csv"
with open(my_csv_file, "rb") as f:
    encodings = Counter(chardet.detect(line)["encoding"] for line in f)
print(encodings)
and realized that my whole file was actually encoded in some other, third encoding. Running chardet on the whole file detected the wrong encoding, but running it on each line detected a bunch of encodings and the second most common one (after ascii) was the correct encoding I needed to use to read the whole file. So ultimately all I needed was
with open(my_csv, encoding="latin_1") as f:
    reader = csv.DictReader(f)
    for row in reader:
        do_stuff(row)
You could try using the Compact Encoding Detection library instead of chardet. It's what Google Chrome uses so maybe it'll work better, but it's written in C++ instead of Python.

How not to decode escaped sequences when reading from file but keep the string representation

I am reading in a text file that contains lines with binary data dumped in an encoded fashion, but still as a string (at least in Emacs):
E.g.:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
This is perfectly fine for me, and when I read in that file I want to keep this string and not decode or change it in any way. However, when I read in the file, Python does the decoding. How can I prevent that?
with open("/path/to/file") as file:
for line in file:
print line
the output will look like:
'���k���G�r��#�\0320^��\021�C\035\000�\016ׁ��'
but should look like:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
Edit: However, this encoded data is not the only data contained but part of a larger text dump.
You can read the file in binary mode with the 'rb' option and it will retain the data as-is.
Ex:

with open(PathToFile, 'rb') as file:
    raw_binary_data = file.read()
    print(raw_binary_data)
If you really want the octal representation, you can define a function that prints it back out.

import string

def octal_print(s):
    # keep printable characters as-is; show everything else as a \ooo octal escape
    print(''.join(map(lambda x: x if x in string.printable else '\\' + oct(ord(x))[2:], s)))

s = '\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207'
octal_print(s)
# prints:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\320^\242\367\21\227C\35\0\207
Based on the answer of James, I adapted the octal_print function to discriminate between actual octals and innocent characters:

def octal_print(s):
    charlist = list()
    for character in s:
        try:
            # plain ASCII characters pass through unchanged
            character.decode('ascii')
            charlist.append(character)
        except UnicodeDecodeError:
            # non-ASCII bytes are shown as \ooo octal escapes (Python 2)
            charlist.append('\\' + oct(ord(character))[1:])
    return ''.join(charlist)

Python - Unicode file IO

I have a one-line txt file with a bunch of Unicode characters and no spaces.
Example:
🅾🆖🆕Ⓜ🆙🆚🈁🈂
And I want to output a txt file with one character on each line
When I try to do this I think I end up splitting the Unicode characters; how can I go about this?
There is no such thing as a text file with a bunch of Unicode characters; it only makes sense to speak of a "unicode object" once the file has been read and decoded into Python objects. The data in the text file is encoded, one way or another.
So, the problem is about reading the file in the correct way in order to decode the characters to unicode objects correctly.
import io

enc_source = enc_target = 'utf-8'

with io.open('my_file.txt', encoding=enc_source) as f:
    the_line = f.read().strip()

with io.open('output.txt', mode='w', encoding=enc_target) as f:
    f.writelines([c + '\n' for c in the_line])
Above I am assuming the target and source file encodings are both utf-8. This is not necessarily the case, and you should know what the source file is encoded with. You get to choose enc_target, but somebody has to tell you enc_source (the file itself can't tell you).
This works in Python 3.5
line = "😀👍"
with open("file.txt", "w", encoding="utf8") as f:
    f.write("\n".join(line))

Python 3.5.1 mixed line code file UTF-8 and UTF-16

I have successfully been parsing data files that I receive, using a simple Python script I wrote. The files I get are like this:
file.txt, ~50 columns of data, x 1000s of rows
abcd1,1234a,efgh1,5678a,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
Unfortunately, sometimes some of the lines contain UTF-16 symbols and look like this:
abcd1,12341,efgh1,UTF-16 symbols here,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
I have been able to use the "latin-1" encoding for commands in my script, like:

open('file fixed.txt', 'w', encoding="latin-1").writelines([line for line in open('file.txt', 'r', encoding="latin-1")])
My problem lies in code such as:
for line in fileinput.FileInput('file fixed.txt', inplace=1):
    line = line.replace(":", ",")
    print (line, ",")
I am unable to get past the encoding errors for the last command. I have tried enforcing the coding of:
# -*- coding: latin-1 -*-
At the top of the document, as well as before the last mentioned command (the find and replace). How can I get these mixed-encoding files to work with the above command? I would like to preserve the UTF-16 (Unicode) symbols as they appear in the new file. Thanks in advance.
EDIT: Thanks to Alexis I was able to determine that fileinput would not work for setting another encoding method. I used the below to resolve my issue.
f = open(filein,'r', encoding="latin-1")
filedata = f.read()
f.close()
newdata = filedata.replace("old data","new data")
f = open(fileout,'w', encoding="latin-1")
f.write(newdata)
f.close()
You can tell fileinput how to open your files. As the documentation says:
You can control how files are opened by providing an opening hook via the openhook parameter to fileinput.input() or FileInput(). The hook must be a function that takes two arguments, filename and mode, and returns an accordingly opened file-like object. Two useful hooks are already provided by this module.
So you'd do it like this:
def open_utf16(name, m):
    return open(name, m, encoding="utf-16")

for line in fileinput.FileInput("file fixed.txt", openhook=open_utf16):
    ...
I use "utf-16" as the encoding since this is your file's encoding, not "latin-1". 8-bit encodings don't have error checking so Latin1 will read the bytes without noticing there's anything wrong, but you're likely to have problems down the line. If this gives you errors, your file is not in utf-16.
If your file has mixed encoding, you need to read it as binary and then decode different parts as necessary, or just process the whole thing as binary instead. The latin-1 solution in the question works by accident really.
In your example that would be something like:
with open('the/path', 'rb') as fi:
    data = fi.read().replace(b'old data', b'new data')

with open('other/path', 'wb') as fo:
    fo.write(data)
This is the closest to what you ask for: as far as I understand, you don't even care about the field with the potentially different encoding; you just want to change some content and copy the rest of the file as-is. Binary mode allows you to do that.

Converting from utf-16 to utf-8 in Python 3

I'm programming in Python 3 and I'm having a small problem which I can't find any reference to it on the net.
As far as I understand, the default string encoding in Python 3 is UTF-16, but I must work with UTF-8, and I can't find the command that will convert from the default one to UTF-8.
I'd appreciate your help very much.
In Python 3 there are two different datatypes that are important when you are working with string manipulation. First there is the string class, an object that represents Unicode code points. The important thing to get is that this string is not a bunch of bytes, but really a sequence of characters. Secondly, there is the bytes class, which is just a sequence of bytes, often representing a string stored in some encoding (like UTF-8 or ISO-8859-15).
What does this mean for you? As far as I understand, you want to read and write UTF-8 files. Let's make a program that replaces all 'ć' characters with 'ç':
def main():
    # First open an output file. We give open() an encoding to let Python know that
    # whatever we print to the file should be encoded as UTF-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Read every line. We give open() the encoding so it returns Unicode strings.
        for line in open('input_file', encoding='utf-8'):
            # Replace the characters we want. A string literal in Python 3 is already a
            # Unicode string, so no worries about encoding there. Because we opened the
            # output file with the utf-8 encoding, print() encodes the whole string to
            # UTF-8 on the way out.
            # Note: file=out_file (the original passed out_file as a positional argument),
            # and end='' because 'line' already ends with a newline.
            print(line.replace('ć', 'ç'), end='', file=out_file)
So when should you use bytes? Not often. An example I can think of would be when you read something from a socket. If you have the data in a bytes object, you can make it a Unicode string with bytes.decode('encoding'), and vice versa with str.encode('encoding'). But as said, you probably won't need it.
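As a tiny illustration of that round trip:

text = 'ćevapčići'            # str: a sequence of Unicode code points
data = text.encode('utf-8')   # bytes: b'\xc4\x87evap\xc4\x8di\xc4\x87i'
back = data.decode('utf-8')   # str again
assert back == text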
Still, because it is interesting, here is the hard way, where you encode everything yourself:
def main():
    # Open the output file in binary mode, so we write bytes to it instead of strings.
    with open('output_file', 'wb') as out_file:
        # Read every line. Again, we open the input in binary mode, so we get bytes.
        for line_bytes in open('input_file', 'rb'):
            # Convert the bytes to a string (the original had bytes.decode('utf-8'),
            # which doesn't reference the line at all)
            line_string = line_bytes.decode('utf-8')
            # Replace the characters we want
            line_string = line_string.replace('ć', 'ç')
            # Encode the result back to bytes
            out_bytes = line_string.encode('utf-8')
            # Write the bytes to the output file
            out_file.write(out_bytes)
Good reading about this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Really recommended read!
Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(P.S. As you see, I didn't mention UTF-16 in this post. I actually don't know whether Python uses it as its internal encoding or not, but it is totally irrelevant: the moment you are working with a string, you work with characters (code points), not bytes.)
