Remove non-ASCII characters interpreted as EOF from text file - python

Following up on my earlier question here: Row limit in read.table.ffdf?
I have a text file with >285 million records, but about two-thirds of the way through there are several non-ASCII characters that are being interpreted by AWK as well as several R packages (ff, data.table) as EOF bytes. It appears that the characters were originally entered as degree signs, but appear in text editors as boxes (see example here). When I try to read in the text file using these methods it just stops when it encounters the first character, with no error messages as if it's complete.
For now I was able to open the file in a text editor to remove these characters. But this is not a long-term solution for this dataset given its size; I need to be able to remove or bypass them without having to open the whole file. I've tried using the quote option in R, and tried replacing all non-ASCII and 'CTRL-M' characters specifically during an awk import, but the read process always stops at the first character. Any solutions? I'm using R and awk now, but am open to other options (python?). Thanks!

gawk -v BINMODE=3 '{gsub(/[[:cntrl:]]/,"")}1'
will remove them.
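Since the question is open to Python, a similar filter can be sketched there. This is a minimal sketch and not part of the answer above: it assumes the file can be processed as raw bytes, and that every control byte other than tab, LF, and CR, plus every byte of 0x80 or above, should be dropped. The name `strip_unwanted_bytes` and the chunk size are hypothetical.

```python
# Bytes to keep: printable ASCII plus tab, line feed, and carriage return (an assumption).
KEEP = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
DROP = bytes(b for b in range(256) if b not in KEEP)


def strip_unwanted_bytes(src, dst, chunk_size=1 << 20):
    """Copy src to dst in fixed-size chunks, deleting every byte listed in DROP.

    Both arguments must be binary file objects, so no byte is ever
    interpreted as an end-of-file marker and the file is never held
    in memory as a whole.
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # bytes.translate(None, delete) removes the listed bytes and maps nothing.
        dst.write(chunk.translate(None, DROP))
```

It would be called with files opened as `open(path, "rb")` and `open(out_path, "wb")`. Dropping all bytes of 0x80 and above also destroys any legitimate non-ASCII text, so the `KEEP` set may need adjusting.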

Related

Python file.write inserts unintended special characters

The code in the screenshot below does a simple regex-based string replacement on a file, replacing two tab characters in the source text file with "TT" in the target file. The replacement works fine, but for some reason the operation also adds weird special characters (a question mark in a diamond) to the file.
How can I avoid this?
From the documentation of file.truncate:
The current file position is not changed.
That means the system probably replaced the old contents of the file with null bytes. You have to f.seek(0) after truncating.
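A minimal sketch of that fix, assuming the file is opened in "r+" mode and rewritten in place (the original code is only available as a screenshot, so the function name `replace_in_place` and its parameters are hypothetical):

```python
import re


def replace_in_place(path, pattern, repl):
    """Rewrite a text file in place without leaving null bytes behind."""
    with open(path, "r+", encoding="utf-8", newline="") as f:
        text = f.read()
        f.truncate(0)  # empties the file, but the position stays where read() left it
        f.seek(0)      # move back to the start before writing
        f.write(re.sub(pattern, repl, text))
```

Without the `f.seek(0)` call, the write would start at the old offset and the gap before it would typically be filled with null bytes.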

Python: how to get rid of non-ascii characters being read from a file

I am processing, with python, a long list of data that looks like this
The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved in this site)
29/07/2016 04:00:12 0.125143
Now, when I read such file into a script using something like open and readlines, there is an error, reading
SyntaxError: EOL while scanning string literal
I know (or can look up) the replace and regex functions, but I cannot use them in my script. The biggest problem is that wherever I include or read such a strange character, an error occurs that points to the very line where it is read, so I cannot do anything with them.
Are you reading a file? If so, try to extract values using regexps, not to remove extra characters:
re.search(r'^([\d/: ]{19})', line).group(1)
re.search(r'([\d.]{7})', line).group(1)
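The two patterns above can be tried on a single line as follows. This is only a sketch: the helper name `extract_fields` and the sample line with stray characters between the fields are made up for illustration.

```python
import re


def extract_fields(line):
    """Return (timestamp, value) pulled out of a line by the two patterns above."""
    timestamp = re.search(r'^([\d/: ]{19})', line).group(1)
    value = re.search(r'([\d.]{7})', line).group(1)
    return timestamp, value


# Hypothetical input: a timestamp, two stray characters, then the measurement.
sample = "29/07/2016 04:00:12 \u00c2\u00a0 0.125143"
timestamp, value = extract_fields(sample)
```

Note that `{7}` captures exactly seven characters, so a value such as 0.125143 comes back as 0.12514; the quantifier would need widening to keep all the digits.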
I find that the re.findall works. (I am sorry I do not have time to test all other methods, since the significance of this job has vanished, and I even forget this question itself.)
import re

def extract_numbers(str_i):
    # Capture day, month, year, hour, minute, second, and both halves of the decimal value
    pat = r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
    match_h = re.findall(pat, str_i)
    return match_h[0]

# ....
# `f` is the handle of the file in question
lines = f.readlines()
for l in lines:
    ls_f = extract_numbers(l)
    # process them....

Saving file with apostrophe in the name (Python 3.4)

Trying to save image files in batches. Works nicely, but the list of names for each file sometimes includes apostrophes, and everything stops.
The offending script is:
pic.save(r"C:\Python34\Scripts\{!s}.jpg".format(name))
The apostrophes in the names aren't a problem when I embed them in a url with selenium
browser.get("https://website.com/{!s}".format(name))
or when I print the destination file name, e.g.
print(r"C:\Python34\Scripts\{!s}.jpg".format(name))
Which is fine to turn out like
C:\Python34\Scripts['It's fine'].jpg
so I assume this kind of problem has something to do with the save function.
The traceback points to the pic.save line, then into PIL\Image.py, and reports OSError: [Errno 22] Invalid argument for the save destination.
Using Windows 7 if that matters.
Probably super-novice error, but I've been reading threads and can't figure this out--workaround would be cleaning the list of apostrophes before using it, which would be annoying but acceptable.
Any help appreciated.
---edited to fix double quotes as single, just mistyped when writing this post...doh.
It's not a Python problem but a Windows one, or rather a matter of the file system's file naming rules. From the MSDN:
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
On UNIX type systems, all except the / would be valid (although most would be a bad idea). A further "character", binary zero 0x00, is invalid on most file systems.
Rules for URLs are different again.
So you are going to have to write a sanitiser for filenames avoiding these characters. A regular expression would probably be the easiest, but you will have to choose replacement characters that don't occur naturally.
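A minimal sketch of such a sanitiser, assuming an underscore is an acceptable replacement character (the name `sanitise_filename` and the default replacement are hypothetical choices, not something the answer prescribes):

```python
import re

# The reserved characters listed above, plus binary zero.
_RESERVED = re.compile(r'[<>:"/\\|?*\x00]')


def sanitise_filename(name, replacement="_"):
    """Replace every character that Windows reserves in file names."""
    return _RESERVED.sub(replacement, name)
```

Apostrophes are not in the reserved list, so a name like `It's fine` passes through unchanged. The function is meant for the bare file name only, since applying it to a full path would also replace the drive colon and the path separators.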
Edit: I was assuming that Error 22 was reporting an invalid filename, but I was wrong, it actually means "The device does not recognise the command".
See https://stackoverflow.com/questions/19870570/pil-giving-oserror-errno-22-when-opening-gif. The accepted reply is rather weird though.
I Google'd "python PIL OSError Errno 22", you might like to try the same and see if any of the conditions apply to you, but clearly you are not alone, if that's any consolation.
Sorry I can't do more.

How to exclude \n and \r from tell() count in Python 2.7

I want to keep track of the file pointer on a simple text file (just a few lines), after having used readline() on it. I observed that the tell() function also counts the line endings.
My questions:
How can I instruct the code to skip counting the line endings?
How can I do this regardless of the line ending type (so that it works the same whether the text file uses just \n, just \r, or both)?
You are navigating into trouble.
Don't do that: either use the number "tell" gives you, or count what you have in memory, regardless of the file contents.
You won't be able to correlate a position in text, read into memory, to a physical place in a text file: text files are not meant for that. They are meant to be read one line at a time, or in whole: your program consumes the text and lets the OS worry about the file position.
You can open your file in binary mode, read its contents as they are into memory, and have some method of retrieving readable text from those contents as needed - doing this with a proper class can make it not that messy.
Consider the problem you already have with the line-endings which could be either "\n" or "\r\n" and still count as a single character, and now, imagine that situation one hundred fold more complex if the file has a single utf-8 encoded character that takes more than one byte to encode.
And even in binary files, knowing the absolute file pointer position is only useful in a handful of situations where, usually, one would be better off using a database engine to start with.
tell is tell. It counts the number of bytes from the start of the file to the cursor. \n and \r are bytes, so they get counted. If you want to count the number of bytes, but not count certain characters, you will have to do it manually:
data_read = … # data you have already read
len([b for b in data_read if b not in '\r\n'])
The bad news is that it's far more annoying to do this than just looking at tell. The good news is that it answers both your questions.
or, I suppose you could do
yourfile.tell() - data_read.count('\r') - data_read.count('\n')
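The subtraction above can be sketched as a small helper. This assumes the file is opened in binary mode, so that tell() and the data read so far are both measured in bytes; the function name `position_without_line_endings` is hypothetical.

```python
import io


def position_without_line_endings(f, data_read):
    """Return f.tell() minus the CR and LF bytes contained in data_read.

    data_read must hold everything read from f since the start of the file.
    """
    return f.tell() - data_read.count(b"\r") - data_read.count(b"\n")


# Stand-in for a file opened with open(path, "rb"); it mixes \r\n and \n endings.
f = io.BytesIO(b"ab\r\ncd\nef")
data_read = f.readline()
first = position_without_line_endings(f, data_read)
data_read += f.readline()
second = position_without_line_endings(f, data_read)
```

Because \r and \n are counted separately, the result does not depend on which line ending style the file uses.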
result = re.sub("[\r\n]", "", subject)
http://regex101.com/r/kM6dA1
Match a single character present in the list below «[\r\n]»
A carriage return character «\r»
A line feed character «\n»

Ignoring special characters when processing txt using python

My Python program processes a txt file, but it stops as soon as it meets certain bad characters.
I managed to ignore some of them by
while line.endswith(u'\u0085') or line.endswith(u'\u001E') or line.endswith(u'\u001D') or line.endswith(u'\u001C') or line.endswith(u'\u001A') or line.endswith(u'\u2028'):
But this solution doesn't work if there are other bad characters in the document.
So I just want to ignore ALL special characters that don't show up in Notepad++.
Another solution might be to delete all special characters that don't show up in Notepad++.
Either solution will be great.
