Ignoring special characters when processing txt using python - python

My python program processes txt file but it stops once it meets some bad characters.
I managed to ignore some of them by
while line.endswith(u'\u0085') or line.endswith(u'\u001E') or line.endswith(u'\u001D') or line.endswith(u'\u001C') or line.endswith(u'\u001A') or line.endswith(u'\u2028'):
But well, this solution doesn't work if there is still other bad character in the document.
So, I just want to ignore ALL special characters that don't show up in notepad++.
Maybe another solution would be to delete all such special characters that don't show up in notepad++
Either solution will be great.

Related

Python: how to get rid of non-ascii characters being read from a file

I am processing, with python, a long list of data that looks like this
The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved in this site)
29/07/2016 04:00:12 0.125143
Now, when I read such file into a script using something like open and readlines, there is an error, reading
SyntaxError: EOL while scanning string literal
I know (or may look up usage of) replace and regex functions, but I cannot do them in my script. The biggest problem is that anywhere I include or read such strange character, error occurs, pointing on the very line it is read. So I cannot do anything to them.
Are you reading a file? If so, try to extract values using regexps, not to remove extra characters:
re.search(r'^([\d/: ]{19})', line).group(1)
re.search(r'([\d.]{7})', line).group(1)
I find that the re.findall works. (I am sorry I do not have time to test all other methods, since the significance of this job has vanished, and I even forget this question itself.)
def extract_numbers(str_i):
pat="(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
match_h = re.findall(pat, str_i)
return match_h[0]
# ....
# `f` is the handle of the file in question
lines =f.readlines()
for l in lines:
ls_f =extract_numbers(l)
# process them....

Convert .py files to correct encoding for Python 3

I just pulled from a git repo where the users are on Python 2. My system is running Python 3 and with no changes in the code, I am getting this error:
TabError: inconsistent use of tabs and spaces in indentation
It appears that the solution is to change the char set encoding of the .py files, but working in emacs, I'm not clear how to do this. I'm seeing these instructions:
https://www.emacswiki.org/emacs/ChangingEncodings
but I don't understand how to apply these for utf-8. I'd appreciate any suggestions.
Exists a command untabify:
Convert all tabs in region to multiple spaces, preserving columns.
If called interactively with prefix ARG, convert for the entire
buffer.
I.e. call it with C-u to convert all TABs in buffer.
As comment points out correctly: tabify does the inverse, converts multiple spaces to tabs - while using spaces seems a common convention not just in Python.
This is not a python 2/3 issue, it looks like something in that git repo has wrong indentation. The easiest fix would be to replace all tab characters in all the files with spaces using something like sed

Remove non-ASCII characters interpreted as EOF from text file

Following up on my earlier question here: Row limit in read.table.ffdf?
I have a text file with >285 million records, but about two-thirds of the way through there are several non-ASCII characters that are being interpreted by AWK as well as several R packages (ff, data.table) as EOF bytes. It appears that the characters were originally entered as degree signs, but appear in text editors as boxes (see example here). When I try to read in the text file using these methods it just stops when it encounters the first character, with no error messages as if it's complete.
For now I was able to open the file in a text editor to remove these characters. But this is not a long-term solution for this dataset given its size; I need to be able to remove or bypass them without having to open the whole file. I've tried using the quote option in R, and tried replacing all non-ASCII and 'CTRL-M' characters specifically during an awk import, but the read process always stops at the first character. Any solutions? I'm using R and awk now, but am open to other options (python?). Thanks!
gawk -v BINMODE=3 '{gsub(/[[:cntrl:]]/,"")}1
will remove them.

How to open a file in its default program with python

I want to open a file in python 3.5 in its default application, specifically 'screen.txt' in Notepad.
I have searched the internet, and found os.startfile(path) on most of the answers. I tried that with the file's path os.startfile(C:\[directories n stuff]\screen.txt) but it returned an error saying 'unexpected character after line continuation character'. I tried it without the file's path, just the file's name but it still didn't work.
What does this error mean? I have never seen it before.
Please provide a solution for opening a .txt file that works.
EDIT: I am on Windows 7 on a restricted (school) computer.
It's hard to be certain from your question as it stands, but I bet your problem is backslashes.
[EDITED to add:] Or actually maybe it's something simpler. Did you put quotes around your pathname at all? If not, that will certainly not work -- but once you do, you will find that then you need the rest of what I've written below.
In a Windows filesystem, the backslash \ is the standard way to separate directories.
In a Python string literal, the backslash \ is used for putting things into the string that would otherwise be difficult to enter. For instance, if you are writing a single-quoted string and you want a single quote in it, you can do this: 'don\'t'. Or if you want a newline character, you can do this: 'First line.\nSecond line.'
So if you take a Windows pathname and plug it into Python like this:
os.startfile('C:\foo\bar\baz')
then the string actually passed to os.startfile will not contain those backslashes; it will contain a form-feed character (from the \f) and two backspace characters (from the \bs), which is not what you want at all.
You can deal with this in three ways.
You can use forward slashes instead of backslashes. Although Windows prefers backslashes in its user interface, forward slashes work too, and they don't have special meaning in Python string literals.
You can "escape" the backslashes: two backslashes in a row mean an actual backslash. os.startfile('C:\\foo\\bar\\baz')
You can use a "raw string literal". Put an r before the opening single or double quotes. This will make backslashes not get interpreted specially. os.startfile(r'C:\foo\bar\baz')
The last is maybe the nicest, except for one annoying quirk: backslash-quote is still special in a raw string literal so that you can still say 'don\'t', which means you can't end a raw string literal with a backslash.
The recommended way to open a file with the default program is os.startfile. You can do something a bit more manual using os.system or subprocess though:
os.system(r'start ' + path_to_file')
or
subprocess.Popen('{start} {path}'.format(
start='start', path=path_to_file), shell=True)
Of course, this won't work cross-platform, but it might be enough for your use case.
For example I created file "test file.txt" on my drive D: so file path is 'D:/test file.txt'
Now I can open it with associated program with that script:
import os
os.startfile('d:/test file.txt')

How to recognize special eol character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see if I print it, or call unicode(special_eol) is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.
To determine what specific character that is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) a ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'

Categories