Python - Unicode file IO

I have a one-line text file containing a run of Unicode characters with no spaces, for example:
🅾🆖🆕Ⓜ🆙🆚🈁🈂
I want to output a text file with one character on each line.
When I try to do this, I think I end up splitting the Unicode characters. How can I go about this?

There is no such thing as a text file with a bunch of unicode characters, it only makes sense to speak about a "unicode object" once the file has been read and decoded into Python objects. The data in the text file is encoded, one way or another.
So, the problem is about reading the file in the correct way in order to decode the characters to unicode objects correctly.
import io
enc_source = enc_target = 'utf-8'
with io.open('my_file.txt', encoding=enc_source) as f:
    the_line = f.read().strip()
with io.open('output.txt', mode='w', encoding=enc_target) as f:
    f.writelines([c + '\n' for c in the_line])
Above I am assuming the target and source file encodings are both utf-8. This is not necessarily the case, and you should know what the source file is encoded with. You get to choose enc_target, but somebody has to tell you enc_source (the file itself can't tell you).
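To see why the decoding step matters, here is a minimal sketch (Python 3, assuming UTF-8 input). The function name `split_characters` and the sample string are made up for illustration. It contrasts the decoded string, which iterates one code point at a time, with the encoded bytes, which are longer and would be cut mid-character if sliced.

```python
import io


def split_characters(src_path, dst_path, enc_source="utf-8", enc_target="utf-8"):
    """Write each code point of the source file on its own line of the target."""
    with io.open(src_path, encoding=enc_source) as src:
        text = src.read().strip()
    with io.open(dst_path, mode="w", encoding=enc_target) as dst:
        dst.write("\n".join(text))
    return list(text)


sample = "\u24c2\U0001f195"  # two characters (hypothetical sample data)
raw = sample.encode("utf-8")

# Iterating over the decoded string gives one item per code point ...
chars = list(sample)
# ... while the UTF-8 form uses several bytes per character, so slicing
# the raw bytes would split characters apart.
byte_count = len(raw)
```

Note that a code point is not always what a reader perceives as one character: sequences that use combining marks or variation selectors would still be split by this approach.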

This works in Python 3.5
line = "😀👍"
with open("file.txt", "w", encoding="utf8") as f:
    f.write("\n".join(line))

Related

Ignore UTF-8 decoding errors in CSV

I have a CSV file (which I have no control over). It's the result of concatenating multiple CSV files. Most of the file is UTF-8 but one of the files that went into it had fields that are encoded in what looks like Windows-1251.
I actually only care about one of the fields which contains a URL (so it's valid ASCII/UTF-8).
How do I ignore decoding errors in the other CSV fields if I only care about one field which I know is ASCII? Alternatively, for a more useful solution how do I change the encoding for each line of a CSV file if there's an encoding error?
csv.reader and csv.DictReader take lists of strings (a list of lines) as input, not just file objects.
So, open the file in binary mode (mode="rb"), figure out the encoding of each line, decode the line using that encoding and append it to a list and then call csv.reader on that list.
One simple heuristic is to try to read each line as UTF-8 and if you get a UnicodeDecodeError, try decoding it as the other encoding. We can make this more general by using the chardet library (install it with pip install chardet) to guess the encoding of each line if you can't decode it as UTF-8, instead of hardcoding which encoding to fall back on:
import csv
import chardet
my_csv = "some/path/to/your_file.csv"
lines = []
with open(my_csv, "rb") as f:
    for line in f:
        try:
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            # Guess the encoding only for lines that are not valid UTF-8
            detected_encoding = chardet.detect(line)["encoding"]
            line = line.decode(detected_encoding)
        lines.append(line)
reader = csv.DictReader(lines)
for row in reader:
    do_stuff(row)
If you do want to just hardcode the fallback encoding and don't want to use chardet (there's a good reason not to, it's not always accurate), you can just replace the variable detected_encoding with "Windows-1251" or whatever encoding you want in the code above.
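As a rough sketch of that hardcoded variant, using only the standard library: the helper names `decode_with_fallback` and `read_rows` are made up here, and `cp1251` is just the guess from the question, not something the file can confirm.

```python
import csv


def decode_with_fallback(raw_line, fallback="cp1251"):
    """Decode bytes as UTF-8, falling back to another codec on failure."""
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps the row parseable even if the guess is wrong
        return raw_line.decode(fallback, errors="replace")


def read_rows(raw_lines, fallback="cp1251"):
    """Parse an iterable of byte lines as CSV rows (dicts keyed by the header)."""
    decoded = (decode_with_fallback(line, fallback) for line in raw_lines)
    return list(csv.DictReader(decoded))
```

You would call it as `read_rows(open(my_csv, "rb"))`. Since only the URL column matters in the question, a wrong guess for the other fields is harmless as long as the line still decodes.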
This is of course not perfect, because a line that decodes successfully with some encoding is not necessarily written in that encoding. If you don't have to do this more than a few times, it's better to do something like print out each line and its detected encoding and try to figure out by hand where one encoding starts and the other ends. Ultimately the right strategy to pursue here might be to reverse the step that led to the broken input (concatenating the files) and then redo it correctly (by normalizing them to the same encoding before concatenating).
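If the original files are still available, the "normalize before concatenating" idea could look something like the sketch below. It assumes you know each file's encoding; the function name `concatenate_as_utf8` and the `(path, encoding)` pair format are invented for this example.

```python
import io


def concatenate_as_utf8(sources, out_path):
    """Decode each (path, encoding) pair and append it to one UTF-8 file."""
    # newline="" leaves the line endings of the inputs untouched
    with io.open(out_path, "w", encoding="utf-8", newline="") as out:
        for path, encoding in sources:
            with io.open(path, encoding=encoding, newline="") as src:
                out.write(src.read())
```

This only joins the files as-is: repeated header rows, or a file that does not end with a newline, would still need handling before the result is a clean CSV.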
In my case, I counted how many lines were detected as which encoding
import chardet
from collections import Counter
my_csv_file = "some_file.csv"
with open(my_csv_file, "rb") as f:
    encodings = Counter(chardet.detect(line)["encoding"] for line in f)
print(encodings)
and realized that my whole file was actually encoded in some other, third encoding. Running chardet on the whole file detected the wrong encoding, but running it on each line detected a bunch of encodings and the second most common one (after ascii) was the correct encoding I needed to use to read the whole file. So ultimately all I needed was
with open(my_csv, encoding="latin_1") as f:
    reader = csv.DictReader(f)
    for row in reader:
        do_stuff(row)
You could try using the Compact Encoding Detection library instead of chardet. It's what Google Chrome uses so maybe it'll work better, but it's written in C++ instead of Python.

Fixing corrupt encoding (with Python)

I have a bunch of text files containing Korean characters with the wrong encoding. Specifically, it seems the characters are encoded with EUC-KR, but the files themselves were saved as UTF-8 with a BOM.
So far I have managed to fix a file with the following steps:
1. Open the file with EditPlus (it shows the file's encoding as UTF8+BOM)
2. In EditPlus, save the file as ANSI
3. Lastly, in Python:
import codecs
with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()
with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:
codecs.open(html, 'rb', encoding='utf-8-sig')
However, I haven't been able to figure out how to do the second step (saving as ANSI).
I am presuming here that you have text already encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin 1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring.
Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:
import io
with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')
with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 library also backported to Python 2.
Demo:
>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
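The same round trip in Python 3 syntax, as a minimal sketch. The function name `repair_double_encoded` is made up, and it assumes, like the answer above, that the damage was exactly "EUC-KR bytes read as Latin-1, then saved as UTF-8 with a BOM".

```python
def repair_double_encoded(raw, original_encoding="euc-kr"):
    """Undo 'bytes misread as Latin-1, then saved as UTF-8 with a BOM'."""
    text = raw.decode("utf-8-sig")          # strips the BOM, yields mojibake
    original_bytes = text.encode("latin1")  # recover the original byte values
    return original_bytes.decode(original_encoding)


broken = b"\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba"
fixed = repair_double_encoded(broken)
print(fixed)
```

If the intermediate step had used cp1252 rather than Latin-1, bytes in the 0x80-0x9F range would map differently and the middle codec would have to change accordingly.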

Encoding issue when writing to text file, with Python

I'm writing a program to 'manually' rearrange a CSV file into proper JSON syntax, using a short Python script. I use readlines() to read the input file as a list of rows, which I manipulate and concatenate into a single string that is then written to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and it is double-spaced horizontally (a whitespace character is added between each pair of characters). As far as I can understand, the problem has to do with the encoding, but I haven't been able to figure out what. When I check the encoding of the input and output files (using the .encoding attribute), they both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string
file_name = "input_file.txt"
my_file = open(file_name)
# make each line of input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list
for i in range(0,len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
I highly recommend Ned Batchelder's presentation
http://nedbatchelder.com/text/unipain.html
for details.
There's an explanation about the use of "unicode" as an encoding on windows: What's the difference between Unicode and UTF-8?
TLDR:
Microsoft uses UTF16 as encoding for unicode strings, but decided to call it "unicode" as they also use it internally.
Even if Python2 is a bit lenient as to string/unicode conversions, you should get used to always decode on input and encode on output.
In your case
filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF16")
# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)
encoded_result = result.encode("UTF-16") #really, just using UTF8 for everything makes things a lot easier
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
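A small sketch of why UTF-16 is a plausible guess for the "space between every character" symptom, plus a way to let io.open do the decoding and encoding. The sample row and the function name `transcode` are hypothetical, and UTF-16 as the input encoding is still only an assumption about the asker's file.

```python
import io

text = u"\u05e9\u05dc\u05d5\u05dd,1"  # hypothetical Hebrew sample row
utf16_bytes = text.encode("utf-16-le")
# In UTF-16, every ASCII character is followed by a NUL byte; many viewers
# show that NUL as a space, hence the "double-spaced" look.
ascii_part = utf16_bytes[-4:]


def transcode(src_path, dst_path, src_enc="utf-16", dst_enc="utf-8"):
    """Decode on input, encode on output; return the decoded text."""
    with io.open(src_path, encoding=src_enc) as src:
        data = src.read()
    with io.open(dst_path, "w", encoding=dst_enc) as dst:
        dst.write(data)
    return data
```

Because the encodings are named explicitly, the result does not depend on the system default, which is what makes the script portable.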
You need to tell Python to use the Unicode character encoding to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python

How do I convert LF to CRLF?

I found a list of the majority of English words online, but the line breaks are of unix-style (encoded in Unicode: UTF-8). I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html
How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.
This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard
It should be:
bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard
How can I convert my files to this type? Note: it's 26 files (one per letter) with 80,000 words or so in total (so the program should be very fast).
I don't know where to start because I've never worked with unicode. Thanks in advance!
Using rU as the parameter (as suggested), with this in my code:
with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
    my_file.close()
I get this error:
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    addWords('B Words')
  File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
    for line in my_file:
  File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>
Can anyone help me with this?
Instead of converting, you should be able to just open the file using Python's universal newline support:
f = open('words.txt', 'rU')
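(Note the U. In Python 3 the flag is unnecessary, because text mode already accepts all newline styles by default, and Python 3.11 removed it.)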
You can use the replace method of strings. Like
txt.replace('\n', '\r\n')
EDIT :
in your case :
with open('input.txt') as inp, open('output.txt', 'w') as out:
    txt = inp.read()
    txt = txt.replace('\n', '\r\n')
    out.write(txt)
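One caveat with doing this in text mode: on Windows, Python already translates '\n' to '\r\n' when writing, so an explicit '\r\n' can end up as '\r\r\n'. A sketch that sidesteps this by working on bytes (the function names `lf_to_crlf` and `convert_file` are made up for illustration):

```python
def lf_to_crlf(data):
    """Return bytes with every line ending normalised to CRLF."""
    # Collapse existing CRLF first so it is not turned into CR CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file(src_path, dst_path):
    # Binary mode: no newline translation and no decoding, so the
    # file's text encoding never comes into play.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(lf_to_crlf(src.read()))
```

Reading each file in one go should be fine for word lists of this size (26 files, about 80,000 words in total).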
You don't need to convert the line endings in the files in order to be able to iterate over them. As suggested by NPE, simply use python's universal newlines mode.
The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8, but Python is using the cp1252 encoding (the default on your system) to decode the bytes read from the file into Python 3 strings (i.e. sequences of Unicode code points). As the traceback shows, this happens while iterating over the file, before str(line) is ever reached. There are bytes in those files that cannot be decoded with cp1252, and that causes the UnicodeDecodeError.
If you pass the encoding explicitly, as in open(my_file_name, encoding='utf-8'), you should no longer get the UnicodeDecodeError; str(line) is then redundant because line is already a string. Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
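For reference, a minimal Python 3 sketch of reading such a word list with the encoding stated explicitly. The function name `load_words` is hypothetical, and UTF-8 is taken from the question's description of the files.

```python
import io


def load_words(path, encoding="utf-8"):
    """Return the non-empty lines of a word list, without line endings."""
    # Text mode with default newline handling accepts \n, \r\n and \r,
    # so the file does not need converting first.
    with io.open(path, encoding=encoding) as word_file:
        return [line.strip() for line in word_file if line.strip()]
```

Something like `new_words.extend(load_words(my_file_name))` would then replace the loop from the question.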
Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.
You can use the cereja package:
pip install cereja==1.2.0
import cereja
cereja.lf_to_crlf(dir_or_file_path)
or
cereja.lf_to_crlf(dir_or_file_path, ext_in=[".py", ".csv"])
You can substitute for any standard. See the filetools module

Write unicode content and unicode file name in Windows

#source file is encoded in utf8
import urllib2
import re
req = urllib2.urlopen('http://people.w3.org/rishida/scripts/samples/hungarian.html')
c = req.read()#.decode('utf-8')
p = r'title="This is Latin script \(Hungarian language\)">(.+)'
text = re.search(p, c).group(1)
name = text[:10]+'.txt' #file name will have special chars in it
f = open(name, 'wb')
f.write(text) #content of file will have special chars in it
f.close()
x = raw_input('done')
As you can see the script does a couple things:
- Reads content that is known to have unicode characters from a webpage into a variable
(The source file is saved in utf-8, but this should not make a difference unless unicode strings are actually being defined in the source code. As you can see, the unicode string is being defined dynamically into a variable, so the encoding of the source shouldn't matter in this scenario.)
- Writes a file with a name containing unicode characters
- Writes unicode content into this file as well
Here's the weird behavior I get (Windows 7, Python 2.7) :
When I don't use the decode function:
c = req.read()
The NAME of the file will come out gibberish, but the CONTENT of the file will come out readable (that is you can see the correct unicode hungarian characters)
Yet, when I USE the decode function:
c = req.read().decode('utf-8')
It will NOT ERROR on opening the file (really creating it with 'w' mode)
and the resulting file's NAME will be readable, yep now it shows the correct unicode characters.
So far so good right?
Well, then it WILL ERROR on trying to write the unicode content to the file:
f.write(text) #content of file will have special chars in it
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)
You see, I can't seem to have the cake and eat it too...
Either I can correctly write the NAME of the file or I can correctly write the CONTENT of the file..
How can I do both?
I've also tried writing the file with
f = codecs.open(name, encoding='utf-8', mode='wb')
But it also errors..
The only problem for you seems to be just "unreadable" file name from your original source file. This can solve your problem:
f = open(name.decode('utf-8').encode( sys.getfilesystemencoding() ) , 'wb')
While winterTTR's answer does work.. I've realized that this approach is convoluted.
Rather, all you really need to do is encode the data you write to the file. The name you don't need to encode and both the name and the content will come out "readable".
content = '\xunicode chars'.decode('utf-8')
f = open(content[:5]+'.txt', 'wb')
f.write(content.encode('utf-8'))
f.close()
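For comparison, a sketch of the same idea in Python 3, where the file name is an ordinary str and the content is encoded by the file object. The function name `save_text` and its parameters are invented for this example.

```python
import io
import os


def save_text(directory, text, name_length=5):
    """Write text to a UTF-8 file named after its first few characters."""
    path = os.path.join(directory, text[:name_length] + ".txt")
    # The name is passed as text; only the content gets an explicit encoding.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)
    return path
```

The rule of thumb is the same as in the answer above: keep everything as unicode text inside the program and encode only at the point where the content is written out.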
