I have a Python program that parses text files line by line. A few of these lines are corrupt, meaning they contain non-UTF-8 characters. Once a line has a corrupt character, the whole line is waste, so solutions that delete or replace single characters won't do. Priority number 1 is to delete any line containing non-UTF-8 characters, but saving those lines to another file for further inspection would also be of interest. All the solutions I've found so far only delete/replace the non-UTF-8 characters themselves.
My main language is python, however I am working in Linux so bash etc is a viable solution.
I don't know Python well enough to use it for an answer, so here's a Perl version. The logic should be pretty similar:
#!/usr/bin/env perl
use warnings;
use strict;
use Encode;
# One argument: filename to log corrupt lines to. Reads from standard
# input, prints valid lines on standard output; redirect to another
# file if desired.
# Treat input and outputs as binary streams, except STDOUT is marked
# as UTF8 encoded.
open my $errors, ">:raw", $ARGV[0] or die "Unable to open $ARGV[0]: $!\n";
binmode STDIN, ":raw";
binmode STDOUT, ":raw:utf8";
# For each line read from standard input, print it to standard
# output if valid UTF-8, otherwise log it.
while (my $line = <STDIN>) {
    eval {
        # Default decode behavior is to replace invalid sequences with U+FFFD.
        # Raise an error instead.
        print decode("UTF-8", $line, Encode::FB_CROAK);
    } or print $errors $line;
}
close $errors;
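Since the question names Python as the main language, here is a rough Python 3 sketch of the same logic (my own translation, not part of the original answer; it mirrors the Perl version's single filename argument for the error log):

```python
import sys

def split_utf8_lines(in_stream, out_stream, err_stream):
    # in_stream and err_stream are binary streams; out_stream is text.
    for raw in in_stream:
        try:
            out_stream.write(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # The whole line is waste; log the raw bytes for inspection.
            err_stream.write(raw)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python filter.py bad_lines.txt < input.txt > clean.txt
    with open(sys.argv[1], "wb") as errors:
        split_utf8_lines(sys.stdin.buffer, sys.stdout, errors)
```

Reading from `sys.stdin.buffer` keeps the input as raw bytes, so a corrupt line never blows up before it can be inspected.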
I have been working on a script to process Project Gutenberg texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. My regex will always fail on the first chapter marker on the first line if it includes the ^ caret to anchor the match at the beginning of the line, because the BOM is consumed as the first character. (Example regex: ^Chapter)
What I've discovered is that if I do not include the caret, it won't fail on the first line, and then <feff> is included in the heading after I've processed it. An example:
<h1><feff>Chapter I</h1>
The advice according to this SO question (from which I learned of the BOM) is to fix your script to not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec but discuss errors I never encounter and do not discuss the syntax for opening a file with the template decoder.
To be clear:
I generally use pipelines of the following format:
cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>
And I am opening the file with the following syntax:
import sys
fin = sys.stdin
if '-i' in sys.argv:  # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')
for line in fin:
    ...Processing here...
My question is: what is the proper way to handle this? Do I remove the BOM before processing the text, and if so, how? Or do I use a decoder on the file before processing it? (I am reading from stdin, so how would I accomplish that?)
The files are stored in UTF-8 encoding with DOS endings (\r\n). I convert them in vim to UNIX file format before processing using set ff=unix (I have to do several manual pre-processing tasks before running the script).
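For reference, the vim conversion step can also be folded into the pipeline. A minimal sketch (my own, assuming Python 3) that normalizes DOS line endings byte-wise:

```python
def dos2unix(in_stream, out_stream):
    # Replace CRLF with LF while copying; working on bytes keeps the
    # conversion indifferent to the rest of the content (including a BOM).
    for raw in in_stream:
        out_stream.write(raw.replace(b"\r\n", b"\n"))
```

Hooked up to `sys.stdin.buffer` and `sys.stdout.buffer`, this slots into the `cat -s ... | python ...` pipeline above.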
As a complement to the existing answer, it is possible to filter the UTF-8 BOM from stdin with the codecs module. You simply use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader:
import sys
import codecs
# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')
if '-i' in sys.argv:  # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
               encoding='utf_8_sig', errors='replace')
for line in fin:
    ...Processing here...
In Python 3, stdin should be auto-decoded properly, but if it's not working for you (and for Python 2) you need to set the PYTHONIOENCODING environment variable before invoking your script, like:
PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>
Notice that this setting also makes stdout work with UTF-8-SIG, so your <outfile> will maintain the original encoding.
For your -i parameter, just do open(path, 'rt', encoding="UTF-8-SIG")
You really don't need to import codecs or anything to deal with this. As lenz suggested in comments, just check for the BOM and throw it out:
for line in input:
    if line.startswith("\ufeff"):
        line = line[1:]  # trim the BOM away
    # the rest of your code goes here as usual
In Python 3.9 the default encoding for standard input seems to be utf-8, at least on Linux:
In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
sys.stdin has the method reconfigure() (note that encoding is a keyword-only argument):
sys.stdin.reconfigure(encoding="utf-8-sig")
which should be called before any attempt to read the standard input. This decodes the BOM, which will no longer appear when reading sys.stdin.
Hi, I'm trying to extract audio from a video file using ffmpeg with the following function (in Python 2):
def extractAudio(path):
    command = ''.join(('ffmpeg -i "', path, '" -ab 160k -ac 2 -ar 44100 -vn audio.wav'))
    print(command)
    subprocess.call(command, shell=True)
the above print statement successfully prints the following:
ffmpeg -i "C:/Users/pruthvi/Desktop/vidrec/temp\TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4" -ab 160k -ac 2 -ar 44100 -vn audio.wav
but in the next statement, it fails and throws the following error:
Traceback (most recent call last):
  File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 53, in <module>
    main()
  File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 46, in main
    extractAudio(os.path.join(di,each))
  File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 28, in extractAudio
    subprocess.call(command,shell=True)
  File "C:\Python27\lib\subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "C:\Python27\lib\subprocess.py", line 710, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 928, in _execute_child
    args = '{} /c "{}"'.format (comspec, args)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128)
I've tried all the possible solutions from previous questions, like encoding with the correct type, setting PYTHONIOENCODING, etc., but none seems to work. If I convert the string to ASCII, it'll no longer function, because that removes the non-ASCII characters and ends up as file not found, so the audio will not be extracted from the target file. Any help is appreciated, thanks :)
To experiment, you can use the following code:
# -*- coding: utf-8 -*-
import subprocess
def extractAudio():
    path = u'C:/Users/pruthvi/Desktop/vidrec/temp\TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4'
    command = ''.join(('ffmpeg -i "', path, '" -ab 160k -ac 2 -ar 44100 -vn audio.wav'))
    print(command)
    subprocess.call(command, shell=True)

extractAudio()
Because you are passing a unicode string to subprocess.call, Python tries to encode it to an encoding it thinks the filesystem/OS will understand. For some reason it chooses ASCII, which is wrong.
You can try using the correct encoding via
subprocess.call(command.encode(sys.getfilesystemencoding()))
You have two problems:
Python 2
The subprocess module breaks when using Unicode in any of the arguments. This issue is fixed in Python 3: you can pass any Unicode file names and arguments to subprocess, and it will properly forward them to the child process.
ffmpeg
ffmpeg itself cannot open these files, something that you can easily verify by just trying to run it from the command line:
C:\temp>fancy αβγ.m4v
... lots of other output
fancy a�?.m4v: Invalid data found when processing input
(my code page is windows-1252, note how the Greek α got replaced with a Latin a)
You cannot fix this problem, but you can work around it, see bobince's answer.
Same as in your previous question: most cross-platform software on Windows can't handle non-ASCII characters in filenames.
Python's subprocess module uses interfaces based on byte strings. Under Windows, the command line is based on Unicode strings (technically UTF-16 code units), so the MS C runtime converts the byte strings into Unicode strings using an encoding (the ‘ANSI’ code page) that varies from machine to machine and which can never include all Unicode characters.
If your Windows installation were a Korean one, your ANSI code page would be 949 Korean and you would be able to write the command by saying one of:
subprocess.call(command.encode('cp949'))
subprocess.call(command.encode('mbcs'))
(where mbcs is short for ‘multi-byte character set’, which is a synonym for the ANSI code page on Windows.) If your installation isn't Korean you'll have a different ANSI code page and you will be unable to write that filename into a command as your command line encoding won't have any Hangul in it. The ANSI encoding is never anything sensible like UTF-8 so no-one can reliably use subprocess to execute commands with all Unicode characters in.
As discussed in the previous question, Python includes workarounds for Unicode filenames to use the native Win32 API instead of the C standard library. In Python 3 it also uses Win32 Unicode APIs for creating processes, but this is not the case back in Python 2. You could potentially hack something up yourself by calling the Win32 CreateProcessW command through ctypes, which gives you direct access to Windows APIs, but it's a bit of a pain.
...and it would be of no use anyway, because even if you did get non-ANSI characters into the command line, the ffmpeg command would itself fail. This is because ffmpeg is also a cross-platform application that uses the C standard libraries to read command lines and files. It would fail to read the Korean in the command line argument, and even if you got it through somehow it would fail to read the file of that name!
This is a source of ongoing frustration on the Windows platform: although it supports Unicode very well internally, most tools that run over the top of it can't. The answer should have been for Windows to support UTF-8 in all the byte-string interfaces it implements, instead of the sad old legacy ANSI code pages that no-one wants. Unfortunately Microsoft have repeatedly declined to take even the first steps towards making UTF-8 a first-class citizen on Windows (namely fixing some of the bugs that stop UTF-8 working in the console). Sorry.
Unrelated: this:
''.join(('ffmpeg -i "',path,'"...
is generally a bad idea. There are a number of special characters in filenames that would break that command line and possibly end up executing all kinds of other commands. If the input paths were controlled by someone untrusted that would be a severe security hole. In general when you put a command line together from variables you need to apply escaping to make the string safe for inclusion, and the escaping rules on Windows are complex and annoying.
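A sketch of the safer pattern (my rewrite, not the asker's code): pass the arguments as a list and drop shell=True, so no quoting rules apply at all:

```python
import subprocess

def build_ffmpeg_args(path):
    # Build the argument list; each element reaches the child process
    # verbatim, so quoting and escaping are never an issue.
    return ["ffmpeg", "-i", path,
            "-ab", "160k", "-ac", "2",
            "-ar", "44100", "-vn", "audio.wav"]

def extract_audio(path):
    # No shell=True: the list form bypasses the shell entirely.
    return subprocess.call(build_ffmpeg_args(path))
```

On Python 3 this also sidesteps the encoding problem, since Unicode arguments are forwarded to the child process directly.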
You can avoid both the escaping problem and the Unicode problem by keeping everything inside Python. Instead of launching a command to invoke the ffmpeg code, you could use a module that brings the functionality of ffmpeg into Python, such as PyFFmpeg.
Or a cheap 'n' cheerful 'n' crappy workaround would be to copy/move the file to a known-safe name in Python, run the ffmpeg command using the static filename, and then rename/copy the file back...
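That workaround could look roughly like this (a sketch; `run_with_safe_name` and the default temporary filename are made-up names, and error handling is omitted):

```python
import os
import shutil
import subprocess

def run_with_safe_name(path, command_for, safe_name="input_tmp.mp4"):
    # Copy the awkwardly named file to an ASCII-only temporary name,
    # run the command built for that name, then clean up the copy.
    shutil.copy(path, safe_name)
    try:
        return subprocess.call(command_for(safe_name))
    finally:
        os.remove(safe_name)
```

Called as, e.g., `run_with_safe_name(path, lambda p: ["ffmpeg", "-i", p, "-vn", "audio.wav"])`.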
I have a number of large comma-delimited text files (the biggest is about 15GB) that I need to process using a Python script. The problem is that the files sporadically contain DOS EOF (Ctrl-Z) characters in the middle of them. (Don't ask me why, I didn't generate them.) The other problem is that the files are on a Windows machine.
On Windows, when my script encounters one of these characters, it assumes it is at the end of the file and stops processing. For various reasons, I am not allowed to copy the files to any other machine. But I still need to process them.
Here are my ideas so far:
Read the file in binary mode, throwing out bytes that equal chr(26). This would work, but it would take approximately forever.
Use something like sed to eliminate the EOF characters. Unfortunately, as far as I can tell, sed on Windows has the same problem and will quit when it sees the EOF.
Use some kind of Notepad program and do a find-and-replace. But it turns out that Notepad-type programs don't cope well with 15GB files.
My IDEAL solution would be some way to just read the file as text and simply ignore the Ctrl-Z characters. Is there a reasonable way to accomplish this?
It's easy to use Python to delete the DOS EOF chars; for example,
def delete_eof(fin, fout):
    BUFSIZE = 2**15
    EOFCHAR = b"\x1a"  # chr(26); as bytes, this works in Python 3 as well
    data = fin.read(BUFSIZE)
    while data:
        fout.write(data.translate(None, EOFCHAR))
        data = fin.read(BUFSIZE)

import sys
ipath = sys.argv[1]
opath = ipath + ".new"
with open(ipath, "rb") as fin, open(opath, "wb") as fout:
    delete_eof(fin, fout)
That takes a file path as its first argument, and copies the file but without chr(26) bytes to the same file path with .new appended. Fiddle to taste.
By the way, are you sure that DOS EOF characters are your only problem? It's hard to conceive of a sane way in which they could end up in files intended to be treated as text files.
I know that I should open a binary file using "rb" instead of "r" because Windows behaves differently for binary and non-binary files.
But I don't understand what exactly happens if I open a file the wrong way and why this distinction is even necessary. Other operating systems seem to do fine by treating both kinds of files the same.
Well, this is for historical (or, as I like to say it, hysterical) reasons. The file open modes are inherited from the C stdio library, and hence we follow it.
For Windows, there is no difference between text and binary files at the file-system level, just like in any of the Unix clones. No, I mean it! - there are (were) file systems/OSes in which a text file is a completely different beast from an object file, and so on. In some you had to specify the maximum length of lines in advance, and fixed-size records were used... fossils from the times of 80-column paper punch cards and such. Luckily, not so in Unices, Windows and Mac.
However - all other things being equal - Unix, Windows and Mac historically differ in which characters they use in the output stream to mark the end of one line (or, same thing, as the separator between lines). In Unix, \x0A (\n) is used; in Windows, the sequence of two characters \x0D\x0A (\r\n); on old Mac, just \x0D (\r). Here are some clues on the origin of those two symbols: ASCII code 10 is called Line Feed (LF) and, when sent to a teletype, would cause it to move down one line (Y++) without changing its horizontal (X) position. Carriage Return (CR) - ASCII 13 - on the other hand, would cause the printing carriage to return to the beginning of the line (X=0) without scrolling one line down. So when sending output to the printer, both \r and \n had to be sent so that the carriage would move to the beginning of a new line. When typing on a terminal keyboard, operators naturally expect to press one key, not two, at the end of a line; on the Apple ][ that was the 'Return' key (\r).
At any rate, this is how things settled. C's creators were concerned about portability - much of Unix was written in C, unlike earlier OSes, which were written in assembler. They did not want to deal with each platform's quirks of text representation, so they added this evil hack to their I/O library: depending on the platform, the input and output to a file are "patched" on the fly so that the program sees new lines the righteous, Unix way - as '\n' - no matter whether it was '\r\n' from Windows or '\r' from Mac. So the developer need not worry about which OS the program ran on; it could still read and write text files in the native format.
There was a problem, however - not all files are text; there are other formats, and they are very sensitive to replacing one character with another. So they thought: we will call those "binary files" and indicate that to fopen() by including 'b' in the mode - and this flags the library not to do any behind-the-scenes conversion. And that's how it came to be the way it is :)
So to recap: if a file is opened with 'b' in binary mode, no conversions take place. If it was opened in text mode, then depending on the platform some conversion of the newline character(s) may occur - towards the Unix point of view. Naturally, on a Unix platform there is no difference between reading/writing a "text" or a "binary" file.
This mode is about conversion of line endings.
When reading in text mode, the platform's native line endings (\r\n on Windows) are converted to Python's Unix-style \n line endings. When writing in text mode, the reverse happens.
In binary mode, no such conversion is done.
Other platforms usually do fine without the conversion, because they store line endings natively as \n. (An exception is Mac OS, which used to use \r in the old days.) Code relying on this, however, is not portable.
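The translation is easy to observe from Python itself: the `newline` argument to Python 3's `open()` makes it explicit on any platform (a small demonstration of my own, not tied to Windows):

```python
import os
import tempfile

# Write "\n" through a text-mode file configured for Windows-style
# line endings, then read the raw bytes back in binary mode.
path = os.path.join(tempfile.gettempdir(), "newline_demo.txt")
with open(path, "w", newline="\r\n") as f:
    f.write("one\ntwo\n")          # "\n" is translated on write
with open(path, "rb") as f:
    raw = f.read()                 # binary mode: no translation
print(raw)                         # b'one\r\ntwo\r\n'
with open(path, "r") as f:
    print(f.read())                # universal newlines turn "\r\n" back into "\n"
os.remove(path)
```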
In Windows, text mode will convert the newline \n to a carriage return followed by a newline \r\n.
If you read text in binary mode, there are no problems. If you read binary data in text mode, it will likely be corrupted.
When reading, text mode converts \r\n back to \n, so for plain text you usually won't notice a difference. When writing text files in text mode, though, Windows will automatically mess up your line breaks (it will add \r's before the \n's). That's why you should use "wb".
I have the following file:
abcde
kwakwa
<0x1A>
line3
linllll
Where <0x1A> represents a byte with the hex value of 0x1A. When attempting to read this file in Python as:
for line in open('t.txt'):
    print line,
It only reads the first two lines, and exits the loop.
The solution seems to be to open the file in binary (or universal newline) mode - 'rb' or 'rU'. Can you explain this behavior?
0x1A is Ctrl-Z, and DOS historically used that as an end-of-file marker. For example, try using a command prompt and "type"ing your file. It will only display the content up to the Ctrl-Z.
Python uses the Windows CRT function _wfopen, which implements the "Ctrl-Z is EOF" semantics.
Ned is of course correct.
If your curiosity runs a little deeper, the root cause is backwards compatibility taken to an extreme. Windows is compatible with DOS, which used Ctrl-Z as an optional end of file marker for text files. What you might not know is that DOS was compatible with CP/M, which was popular on small computers before the PC. CP/M's file system didn't keep track of file sizes down to the byte level, it only kept track by the number of floppy disk sectors. If your file wasn't an exact multiple of 128 bytes, you needed a way to mark the end of the text. This Wikipedia article implies that the selection of Ctrl-Z was based on an even older convention used by DEC.
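Coming back to the practical question above, the binary-mode workaround can be sketched like this (`read_all_lines` is a made-up helper name):

```python
def read_all_lines(path):
    # 'rb' bypasses the CRT's text-mode handling, so 0x1A is just
    # another byte and reading continues to the real end of the file.
    with open(path, "rb") as f:
        return [line.rstrip(b"\r\n") for line in f]
```

Applied to the file from the question, this returns all five lines, with the lone Ctrl-Z byte preserved as `b"\x1a"` for the caller to skip or keep.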