Converting a Massive Unicode Text File to ASCII - python

I've received several text files, where each file contains thousands of lines of text. Because the files use Unicode encoding, each file ends up being around 1GB. I know this might sound borderline ridiculous, but it unfortunately is the reality.
I'm using Python 2.7 on a Windows 7 machine. I've only started using Python but figured this would be a good chance to really start using the language. You've gotta use it to learn it, right?
What I'm hoping to do is to make a copy of each of these massive files. The new copies would use ASCII character encoding and would ideally be significantly smaller in size. I know that changing the character encoding is a solution because I've had success opening a file in MS WordPad and saving it as a regular text file.
Using WordPad is a manual and slow process: I need to open the file, which takes forever because it's so big, and then save it as a new file, which also takes forever since it's so big. I'd really like to automate this by having a script run in the background while I work on other things. I've written a bit of Python to do this, but it's not working correctly. What I've done so far is the following:
import io
import os

def convertToAscii():
    # Getting a list of the current files in the directory
    cwd = os.getcwd()
    current_files = os.listdir(cwd)

    # I don't want to mess with all of the files, so I'll just pick the second one,
    # since the first file is the script itself
    test_file = current_files[1]

    # Determining a new name for the ASCII-encoded file
    file_name_length = len(test_file)
    ascii_file_name = test_file[:file_name_length - 3 - 1] + "_ASCII" + test_file[file_name_length - 3 - 1:]

    # Then we open the new blank file
    the_file = open(ascii_file_name, 'w')

    # Finally, we open our original file for testing...
    with io.open(test_file, encoding='utf8') as f:
        # ...read it line by line
        for line in f:
            # ...encode each line into ASCII
            line.encode("ascii")
            # ...and then write the ASCII line to the new file
            the_file.write(line)

    # Finally, we close the new file
    the_file.close()

convertToAscii()
And I end up with the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
But that doesn't make any sense.... The first line within all of the text files is either a blank line or a series of equal signs, such as ===========.
I was wondering if someone would be able to put me onto the right path for this. I understand that doing this operation can take a very long time since I'm essentially reading each file line by line and then encoding the string into ASCII. What must I do in order to get around my current issue? And is there a more efficient way to do this?

For characters that exist in ASCII, UTF-8 already uses just one byte each. Opening a UTF-8 file that contains only single-byte characters and saving it as ASCII should therefore be a no-op.
For there to be any size difference, your files must be in some wider Unicode encoding, such as UTF-16 / UCS-2. That would also explain why the utf8 codec complains about unexpected bytes in the source file.
Find out what encoding your files actually use, then save them with the utf8 codec. That way the files will be just as small (equivalent to ASCII) for single-byte characters, but if the source happens to contain any multi-byte characters, the resulting file will still be able to represent them and you won't be doing a lossy conversion.
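One quick way to check is to look at the first few bytes of a file for a byte order mark; the 0xff your error complains about is what the start of a UTF-16 little-endian BOM (FF FE) looks like when read as UTF-8. A minimal sketch (the file name and the guess_encoding helper are just illustrative, and the absence of a BOM doesn't prove the file is UTF-8):
# Sketch: sniff a BOM to guess the encoding (works on Python 2.7 and 3.x).
import codecs

def guess_encoding(path):
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'          # UTF-8 with a signature
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'             # the utf-16 codec consumes the BOM itself
    return None                     # no BOM: the encoding must be known some other way

print(guess_encoding('huge_file.txt'))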

There's a potential speedup if you avoid splitting the file into lines, since the only thing that you're doing is joining the lines back together. This allows you to process the input in larger blocks.
Using the shutil.copyfileobj function (which is just read and write in a loop):
import shutil

with open('input.txt', encoding='u16') as infile, \
     open('output.txt', 'w', encoding='u8') as outfile:
    shutil.copyfileobj(infile, outfile)
(This is Python 3, where the encoding argument can be passed directly to the built-in open; it works the same way with the library function io.open.)
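On Python 2.7, as in the question, the equivalent would look roughly like this (a sketch, assuming the input really is UTF-16; confirm that first):
# Python 2.7 sketch: io.open takes the encoding argument the same way.
import io
import shutil

with io.open('input.txt', encoding='utf-16') as infile, \
     io.open('output.txt', 'w', encoding='utf-8') as outfile:
    shutil.copyfileobj(infile, outfile)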

Related

Ignore UTF-8 decoding errors in CSV

I have a CSV file (which I have no control over). It's the result of concatenating multiple CSV files. Most of the file is UTF-8 but one of the files that went into it had fields that are encoded in what looks like Windows-1251.
I actually only care about one of the fields which contains a URL (so it's valid ASCII/UTF-8).
How do I ignore decoding errors in the other CSV fields if I only care about one field that I know is ASCII? Alternatively, for a more useful solution, how do I change the encoding for each line of a CSV file when there's an encoding error?
csv.reader and csv.DictReader take lists of strings (a list of lines) as input, not just file objects.
So, open the file in binary mode (mode="rb"), figure out the encoding of each line, decode the line using that encoding and append it to a list and then call csv.reader on that list.
One simple heuristic is to try to decode each line as UTF-8 and, if you get a UnicodeDecodeError, decode it as the other encoding instead. We can make this more general by using the chardet library (install it with pip install chardet) to guess the encoding of any line that can't be decoded as UTF-8, instead of hardcoding which encoding to fall back on:
import csv
import chardet

my_csv = "some/path/to/your_file.csv"

lines = []
with open(my_csv, "rb") as f:
    for line in f:
        # Guess this line's encoding so we have a fallback ready
        detected_encoding = chardet.detect(line)["encoding"]
        try:
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: fall back to whatever chardet guessed
            line = line.decode(detected_encoding)
        lines.append(line)

reader = csv.DictReader(lines)
for row in reader:
    do_stuff(row)
If you do want to just hardcode the fallback encoding and don't want to use chardet (there's a good reason not to: it's not always accurate), you can simply replace the variable detected_encoding with "Windows-1251" or whatever encoding you want in the code above.
This is of course not perfect, because a line successfully decoding with some encoding doesn't mean it's actually in that encoding. If you don't have to do this more than a few times, it's better to do something like print out each line and its detected encoding and figure out by hand where one encoding starts and the other ends. Ultimately the right strategy here might be to reverse the step that led to the broken input (the concatenation of the files) and then redo it correctly (by normalizing everything to the same encoding before concatenating).
In my case, I counted how many lines were detected as which encoding
import chardet
from collections import Counter

my_csv_file = "some_file.csv"

with open(my_csv_file, "rb") as f:
    encodings = Counter(chardet.detect(line)["encoding"] for line in f)

print(encodings)
and realized that my whole file was actually encoded in some other, third encoding. Running chardet on the whole file detected the wrong encoding, but running it on each line detected a bunch of encodings and the second most common one (after ascii) was the correct encoding I needed to use to read the whole file. So ultimately all I needed was
with open(my_csv, encoding="latin_1") as f:
    reader = csv.DictReader(f)
    for row in reader:
        do_stuff(row)
You could try using the Compact Encoding Detection library instead of chardet. It's what Google Chrome uses so maybe it'll work better, but it's written in C++ instead of Python.

Python keeps showing cp1250 character encoding in files

I have an exercise to make a script which converts UTF-16 files to UTF-8, so I wanted to have one example file with UTF-16 encoding. The problem is that the encoding Python shows me for every file is 'cp1250' (no matter the format, .csv or .txt). What am I missing here? I also have example files from the Internet, but Python recognizes them as cp1250 too. Even when I save a file as UTF-8, Python still shows cp1250.
This is the code I use:
with open('FILE') as f:
    print(f.encoding)
The result of open is simply a file object using your system's default encoding. To open it in something else, you have to say so explicitly.
To actually convert a file, try something like
with open('input', encoding='cp1252') as input, \
     open('output', 'w', encoding='utf-16le') as output:
    for line in input:
        output.write(line)
Converting a legacy 8-bit file to Unicode isn't really useful because it only exercises a small subset of the character set. See if you can find a good "hello world" sample file. https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html is one for UTF-8.
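Once you do have a genuine UTF-16 sample file, the exercise itself is the same pattern with the encodings swapped; a minimal sketch (file names are placeholders):
# Sketch: the actual exercise, converting a UTF-16 file to UTF-8 (Python 3).
with open('sample_utf16.txt', encoding='utf-16') as src, \
     open('sample_utf8.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)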

In Python how do I tell if file has a utf-8 signature before I start reading

I am writing a filter that reads a file (possibly stdin) and possibly writes to stdout. If the input file starts with a byte order mark I want the output file to have one, otherwise I don't.
If I open a file with a BOM as utf-8 then the first character my program reads is the BOM (\uFEFF), which I don't want. If I open a file without a BOM as utf-8-sig it reads it properly, but I can't tell whether to open the output as utf-8 (which will have no signature) or utf-8-sig, which will start the file with a signature mark.
So what I want to do is to look at the first character of the file and, based on its value, decide which encoding to use to open the file. If it were a disk file I could simply close it and reopen it, but because it could be stdin I can't do that. I can work around it by having my program check the first character it reads and output it if it is a signature mark, but I'm wondering if there's a better way. I observe that if I open the file as
f = io.open(inFile, encoding="utf-8", buffering=1)
and then do f.buffer.peek(1), I get back a Unicode object with the first characters of the file. That seems odd, because what I would expect from peek() on the buffered IOBase class would be a Unicode object of length 1. Now I could look at the first character of what peek returns and, if it is a signature mark, read one character and write it to the output, or else discard it and open the output file with utf-8-sig; but I don't really like basing a solution on so much observed, rather than documented, behavior.
Any ideas how to do this right?
[I solved this in my case by having the parsing logic ignore the BOM.]
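For what it's worth, the peek-at-the-raw-bytes workaround described above might look roughly like this (an untested sketch for Python 2.7; note that peek() isn't guaranteed to return three bytes in a single call):
# Sketch: decide the output encoding by peeking at the raw input bytes.
import codecs
import io
import sys

raw_in = io.open(sys.stdin.fileno(), mode='rb', closefd=False)
has_bom = raw_in.peek(3)[:3] == codecs.BOM_UTF8   # peek() does not consume the bytes

# 'utf-8-sig' strips a BOM on input if one is present, and is harmless if not.
f_in = io.TextIOWrapper(raw_in, encoding='utf-8-sig')

# Mirror the input: 'utf-8-sig' writes a BOM at the start, plain 'utf-8' does not.
out_encoding = 'utf-8-sig' if has_bom else 'utf-8'
f_out = io.open(sys.stdout.fileno(), mode='w', encoding=out_encoding,
                newline='\n', closefd=False)

for line in f_in:
    f_out.write(line)
f_out.flush()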
I also find that when I open the output as
fOut = io.open(sys.stdout.fileno(), mode="at", encoding="utf-8", closefd=False)
I get an extra <CR> at the end of each line (<CR><CR><LF>) which I don't get if I open an actual file.
I tried opening with mode="ab" which didn't work. What does seem to work is to add newline="\n":
fOut = io.open(sys.stdout.fileno(), mode="at", encoding="utf-8", newline="\n", closefd=False)
I think the reason this "works" is the same anomaly. As coded, it should be using LF as the terminator; in fact, if you feed it input with UNIX line terminators it still outputs DOS line terminators.
I am running Python 2.7.13 on Windows 10.
Thanks.
Gary

Unpickling data pickled in Python 2.5 with Python 3.1, then uncompressing with zlib

In Python 2.5 I stored data using this code:
def GLWriter(file_name, string):
    import cPickle
    import zlib

    data = zlib.compress(str(string))
    file = open(file_name, 'w')
    cPickle.dump(data, file)
It worked fine; I was able to read the data back by doing that process in reverse. It didn't need to be secure, just something that wasn't readable to the human eye. If I put "test" into it and then opened the file it created, it looked like this:
S'x\x9c+I-.\x01\x00\x04]\x01\xc1'
p1
.
For various reasons we're forced to use Python 3.1 now and we need to code something that can read these data files.
Pickle no longer accepts a string input so I've had to open the file with "rb". When I do that and try opening it with pickle.load(file), I get this error:
File "<stdin>", line 1, in <module>
File "C:\Python31\lib\pickle.py", line 1365, in load
encoding=encoding, errors=errors).load()
UnicodeDecodingError: 'ascii' codec can't decode byte 0x9c in position 1: ordinal not in range(128)
Figuring that I might not be able to open the file with pickle, I started doing some research and found that pickle just wraps a few characters around each side of the main block of data that zlib produces. I then tried to trim it down to zlib's output and put that through zlib.decompress. My issue there is that it reads the file and interprets the likes of "\x04" as four characters rather than one. A lot of testing and searching later, I can't find a way to make pickle load the file, or to make Python recognise these codes so I can put the data through zlib.
So my question is this:
How can I recover the original data using Python 3.1?
I would love to ask my clients to install Python2.5 and do it manually but that's not possible.
Many thanks for your assistance!
The problem is that Python 3 is attempting to convert the pickled Python 2 string into a str object, when you really need it to be bytes. It does this using the ascii codec, which doesn't support all 256 8-bit characters, so you are getting an exception.
You can work around this by using the latin-1 encoding (which supports all 256 characters), and then encoding the string back into bytes:
s = pickle.load(f, encoding='latin1')
b = s.encode('latin1')
print(zlib.decompress(b))
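Putting the pieces together, the read side might look like the following (a sketch based on the lines above; the file name is a placeholder):
# Sketch: recovering the Python 2.5 data in Python 3.
import pickle
import zlib

with open('legacy_data.dat', 'rb') as f:      # binary mode is required
    s = pickle.load(f, encoding='latin1')     # each byte maps to the same code point
    compressed = s.encode('latin1')           # back to the original byte string
    original = zlib.decompress(compressed)

print(original)                               # for the "test" example: b'test'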
Python 3 makes a distinction between binary data and strings. Pickle needs binary data, but you are opening the file as text. The solution is to use:
open(file_name, 'wb')

Can seek and tell work with UTF-8 encoded documents in Python?

I have an application that generates some large log files > 500MB.
I have written some utilities in Python that allows me to quickly browse the log file and find data of interest. But I now get some datasets where the file is too big to load it all into memory.
I thus want to scan the document once, build an index, and then load into memory only the section of the document that I want to look at.
This works for me when I open a 'file', read it one line at a time and store the offset with file.tell().
I can then come back to that section of the file later with file.seek(offset, 0).
My problem, however, is that the log files may contain UTF-8, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek and tell, but they do not match up.
I assume that codecs needs to do some buffering or maybe it returns character counts instead of bytes from tell?
Is there a way around this?
If true, this sounds like a bug or limitation of the codecs module, as it's probably confusing byte and character offsets.
I would use the regular open() function for opening the file, then seek()/tell() will give you byte offsets that are always consistent. Whenever you want to read, use f.readline().decode('utf-8').
Beware though, that using the f.read() function can land you in the middle of a multi-byte character, thus producing an UTF-8 decode error. readline() will always work.
This doesn't transparently handle the byte-order mark for you, but chances are your log files do not have BOMs anyway.
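A minimal sketch of that approach (Python 2; the log file name and the line number are placeholders, and the file is opened in binary mode so the offsets are unambiguous byte positions):
# Build a byte-offset index once, then seek straight back to any line later.
offsets = []
with open('app.log', 'rb') as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        offsets.append(pos)

# Later: jump to, say, line 43 and decode it on the way out.
with open('app.log', 'rb') as f:
    f.seek(offsets[42])
    print(f.readline().decode('utf-8'))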
For UTF-8, you don't actually need to open the file with codecs.open. Instead, it is reliable to read the file as a byte string first, and only then decode an individual section (by invoking the .decode method on that string). Breaking the file at line boundaries is safe; the only unsafe way to split it would be in the middle of a multi-byte character (which you can recognize by its byte values of 128 and above).
Much of what goes on with UTF-8 in Python makes sense if you look at how it is done in Python 3. In your case, it'll make quite a bit more sense if you read the Files chapter in Dive into Python 3: http://diveintopython3.org/files.html
The short of it, though, is that file.seek and file.tell work with byte positions, whereas unicode characters can take up multiple bytes. Thus, if you do:
f.seek(10)
f.read(1)
f.tell()
You can easily get something other than 11, depending on how many bytes the one character you read took up.
Update: You can't do seek/tell on the object returned by codecs.open(). You need to use a normal file, and decode the strings to unicode after reading.
I do not know why it doesn't work, but I can't make it work. The seek seems to only work once, for example; then you need to close and reopen the file, which is of course not useful.
The tell does not use character positions, but it also doesn't show you where your position in the decoded stream is (probably just where the underlying file object is in reading from disk).
So, probably because of some sort of underlying buffering, you can't do it. But decoding after reading works just fine, so go for that.
