How to recognize a special EOL character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see if I print it, or call unicode(special_eol) is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.

To determine which specific character it is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
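The same idea can be sketched for Python 3, where every str is already Unicode. This is a minimal sketch, not part of the original answer: the helper name describe_unusual_chars and the sample string are made up for illustration. It reports each control or non-ASCII character as an escaped literal, a code point, and its Unicode name.

```python
import unicodedata

def describe_unusual_chars(text):
    """List each distinct control or non-ASCII character found in text.

    Hypothetical helper for illustration; returns tuples of
    (escaped literal, code point, Unicode name).
    """
    seen = []
    for ch in text:
        is_unusual = ord(ch) < 32 or ord(ch) > 126
        if is_unusual and ch not in seen:
            seen.append(ch)
    return [
        # Control characters have no Unicode name, hence the default.
        (ascii(ch), "U+%04X" % ord(ch), unicodedata.name(ch, "<no name>"))
        for ch in seen
    ]

# Made-up sample: a Unicode LINE SEPARATOR and a form feed acting as line breaks.
sample = "first line\u2028second line\x0cthird"
for escaped, code_point, name in describe_unusual_chars(sample):
    print(escaped, code_point, name)
```

Once the code point is known, the character can be written as an escape in source code, for example my_str.replace('\u2028', '').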

Related

Can I override u-strings (u'example') in Python 2?

In debugging upgrading to Python 3, it would be useful to be able to override the u'' string prefix to call my own function or replace with a non-u string.
I've tried things like unichr = chr which is useful for my debugging but doesn't accomplish the above.
module.uprefix = str is the type of solution I'm looking for.
You basically can't; as others have noted in the comments, the u-prefix is handled very early, well before anything where an in-code assignment would take effect.
About the best you could do is use ast.parse to read a module on disk (without importing it) and find all the u'' strings; it distinguishes the prefixes. That would help you find them in a Python-aware way, more reliably than just searching for u' and u", but the difference probably wouldn't be large, especially if you search with word boundaries (regex \bu['"]). Unless you somehow have a lot of u' and u" in your program that aren't the prefixes?
>>> import ast
>>> ast.dump(ast.parse('"abc"', mode='eval'))
"Expression(body=Constant(value='abc', kind=None))"
>>> ast.dump(ast.parse('u"abc"', mode='eval'))
"Expression(body=Constant(value='abc', kind='u'))"
Per the comments, what are you trying to do? I've migrated a lot of code from Python 2 to Python 3 and never needed this... There may be a different way to achieve the same goal?
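The ast-based search described above might look roughly like this. It is a sketch under assumptions: it needs Python 3.8 or later (where string literals are ast.Constant nodes with a kind attribute), and the function name find_u_strings and the sample source are invented for illustration.

```python
import ast

def find_u_strings(source):
    """Return sorted (line number, value) pairs for every u-prefixed literal.

    Hypothetical helper; relies on Constant.kind being 'u' for
    u'' literals, as shown in the ast.dump output above.
    """
    tree = ast.parse(source)
    hits = [
        (node.lineno, node.value)
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and getattr(node, "kind", None) == "u"
    ]
    return sorted(hits)

# The last literal contains the text u'y but is not itself u-prefixed,
# which is the case a plain text search would get wrong.
code = 'a = u"abc"\nb = "def"\nc = [u\'x\', "u\'y"]\n'
print(find_u_strings(code))
```

To scan a real module, read its text with open(path).read() and pass that in; nothing is imported or executed.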

Python: how to get rid of non-ascii characters being read from a file

I am processing, with python, a long list of data that looks like this
The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved in this site)
29/07/2016 04:00:12 0.125143
Now, when I read such file into a script using something like open and readlines, there is an error, reading
SyntaxError: EOL while scanning string literal
I know (or can look up) the replace and regex functions, but I cannot use them in my script. The biggest problem is that wherever I include or read such a strange character, an error occurs, pointing at the very line where it is read. So I cannot do anything with them.
Are you reading a file? If so, try to extract values using regexps, not to remove extra characters:
re.search(r'^([\d/: ]{19})', line).group(1)
re.search(r'([\d.]{7})', line).group(1)
I find that the re.findall works. (I am sorry I do not have time to test all other methods, since the significance of this job has vanished, and I even forget this question itself.)
import re

def extract_numbers(str_i):
    # Raw string so the regex backslashes are not treated as string escapes.
    pat = r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
    match_h = re.findall(pat, str_i)
    return match_h[0]
# ....
# `f` is the handle of the file in question
lines = f.readlines()
for l in lines:
    ls_f = extract_numbers(l)
    # process them....
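An alternative that matches the question's title more literally is to drop the non-ASCII characters first and then parse what is left. The following is a sketch only, written for Python 3; the function name parse_line and the stray characters in the sample line are assumptions, since the original characters did not survive on this page.

```python
import re

# Date, time, and a decimal value, separated by any non-digit junk.
PATTERN = re.compile(r"(\d{2}/\d{2}/\d{4})\D*(\d{2}:\d{2}:\d{2})\D*(\d+\.\d+)")

def parse_line(line):
    """Strip non-ASCII characters, then pull out date, time and value."""
    # encode(..., errors="ignore") silently drops anything outside ASCII.
    ascii_only = line.encode("ascii", errors="ignore").decode("ascii")
    match = PATTERN.search(ascii_only)
    if match is None:
        return None
    date, time, value = match.groups()
    return date, time, float(value)

# Made-up line with stray non-ASCII characters between the fields.
print(parse_line("29/07/2016\u00c2\u00a0 04:00:12\u00c2\u00a0 0.125143\n"))
```

Returning None for lines that do not match lets the caller skip headers or damaged rows instead of failing on an IndexError.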

Python CSV writer, how to handle quotes in order to avoid triple quotes in output

I am working with Python's CSV module, specifically the writer. My question is how can I add double quotes to a single item in a list and have the writer write the string the same way as a print statement would?
for example:
import csv
#test "data"
test = ['item1','01','001',1]
csvOut = csv.writer(open('file.txt','a')) #'a' used for keeping past results
test[1] = '"'+test[1]+'"'
print test
#prints: ['item1', '"01"', '001', 1]
csvOut.writerow(test)
#written in the output file: item1,"""01""",001,1
#I was expecting: item1,"01",001,1
del csvOut
I tried adding a quoting=csv.QUOTE_NONE option, but that raised an error. I am guessing this is related to the many csv dialects; I was hoping to avoid digging too far into that.
In retrospect I could probably have built my initial data set smarter and perhaps avoided the need for this situation but at this point curiosity is really getting the better of me (this is a simplified example): how do you keep the written output from adding those extra quotes?
It's not actually triple-quoting, although it looks that way. Try it with another example to see:
test = ['item1', 'abc"def']
Now you'll see that it writes this:
"abc""def"
In other words, it's just wrapping quotes around your string, and escaping the literal quote characters by doubling them, because that's how default Excel-style CSV handles quote characters.
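The wrapping and doubling can be seen directly by writing to an in-memory buffer. This is a small Python 3 sketch; the helper name to_csv_line is invented here, and lineterminator is set to "\n" only to keep the output easy to compare.

```python
import csv
import io

def to_csv_line(row, **fmt):
    """Write one row with csv.writer and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerow(row)
    return buf.getvalue()

# The field '"01"' is wrapped in quotes and each literal quote is doubled.
print(to_csv_line(['item1', '"01"', '001', 1]), end="")
print(to_csv_line(['item1', 'abc"def']), end="")

# Reading the line back recovers the original field, quotes included.
print(next(csv.reader(['item1,"""01""",001,1'])))
```

So the file is not corrupted: any CSV reader using the same dialect gets "01" (with its quotes) back as the second field.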
The question is, what format do you want here? Almost anything you want (within reason) is doable, but you have to pick something. Backslash-escaping quotes? Backslash-escaping everything instead of using quotes in the first place? Single quotes instead of double quotes?
For example, this looks like an answer:
csvOut = csv.writer(open('file.txt','a'), quotechar="'")
… until you have an item like Filet O'Fish and the whole thing gets single-quoted and the ' gets doubled and you have the exact same problem you were trying to avoid. If you're aiming for human readability, and ' is a lot less common in your data than ", that may actually be the right answer, but it's not a perfect answer.
And really, no answer can be perfect: you need some way to either quote or escape commas—and other things, like newlines—and the way you do that is going to add at least one more character that needs to be quote-doubled or escaped. If you know there are never any commas, newlines, etc. in your data, and there's at least one other character you know will never show up, you can get away with setting either quotechar to that other character, or escapechar to that other character and quoting=QUOTE_NONE. But the first time someone unexpectedly uses the character you were sure would never appear, your code will break, so you'd better actually be sure.
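As a sketch of the first option, here is what setting quotechar to a character assumed never to occur in the data looks like in Python 3. The choice of '|' and the helper name to_csv_line are assumptions made for this example.

```python
import csv
import io

def to_csv_line(row, **fmt):
    """Write one row with csv.writer and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerow(row)
    return buf.getvalue()

# With '|' as the quote character, a double quote is ordinary data
# and is written through unchanged.
print(to_csv_line(['item1', '"01"', '001', 1], quotechar='|'), end="")

# The catch: as soon as a field contains '|', it is quoted and doubled,
# which is the original problem with a different character.
print(to_csv_line(['item1', 'a|b'], quotechar='|'), end="")
```

Whoever reads the file must be told about the non-standard quotechar, otherwise a default reader will interpret the double quotes as quoting again.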
Quotes get escaped because your data could contain a comma. If you don't want quotes escaped, you probably don't want a CSV file at all; just join the fields on a comma (this will break downstream if your data ever has a comma in it).
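A plain join might be sketched like this in Python 3. The function name join_row and the decision to raise on an embedded comma or newline are assumptions added here so the downstream breakage shows up early rather than silently.

```python
def join_row(row):
    """Join fields with commas, with no quoting or escaping at all."""
    fields = [str(item) for item in row]
    for field in fields:
        # Without quoting, these characters would corrupt the row structure.
        if "," in field or "\n" in field:
            raise ValueError("field cannot be written unquoted: %r" % field)
    return ",".join(fields)

print(join_row(['item1', '"01"', '001', 1]))
```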

pydoc.render_doc() adds characters - how to avoid that?

There are already some questions touching this but no one seems to actually solve it.
import pydoc
hlpTxt = pydoc.render_doc(help)
already does what I want! It looks flawless when printed to the (right) console, but it has these extra characters included:
_\x08_H\x08He\x08el\x08lp\x08pe\x08er\x08r
In Maya, for instance, it looks like it's filled up with ◘-symbols, while help() renders it flawlessly as well.
Removing \x08 leaves me with an extra letter each:
__HHeellppeerr
which is also not very useful.
Someone commented that it works for him when piped to a subprocess or into a file. I also failed to do that already. Is there another way than
hlpFile = open('c:/help.txt', 'w')
hlpFile.write(hlpTxt)
hlpFile.close()
? Because this leaves me with the same problem. Notepad++ actually shows BS symbols at those places. Yes, for backspace, obviously.
Anyway: There must be a reason that these symbols are added and removing them afterwards might work but I can't imagine there isn't a way to have them not created in the first place!
So finally is there another pydoc method I'm missing? Or a str.encode/decode thing I have not yet seen?
btw: I'm not looking for help.__doc__!
In Python 2, you can remove the boldface sequences with pydoc.plain:
pydoc.plain(pydoc.render_doc(help))
>>> help(pydoc.plain)
Help on function plain in module pydoc:
plain(text)
    Remove boldface formatting from text.
In Python 3, pydoc.render_doc accepts a renderer:
pydoc.render_doc(help, renderer=pydoc.plaintext)
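The two approaches can be compared in a short Python 3 sketch. The helper names strip_overstrike and plain_help are invented for illustration. The extra characters are overstrike sequences: each bold letter is emitted as letter, backspace, letter, so removing every "character followed by backspace" pair leaves the plain text.

```python
import pydoc
import re

def strip_overstrike(text):
    """Drop each 'char + backspace' pair, leaving the second copy of the char."""
    return re.sub(".\x08", "", text)

def plain_help(obj):
    """Render help text without boldface, using the plaintext renderer."""
    return pydoc.render_doc(obj, renderer=pydoc.plaintext)

# The sequence quoted in the question collapses to the intended word.
print(strip_overstrike("_\x08_H\x08He\x08el\x08lp\x08pe\x08er\x08r"))

# Rendering with the plaintext renderer never produces backspaces at all.
print("\x08" in plain_help(len))
```

This also explains why deleting only \x08 gave __HHeellppeerr: the first copy of each letter has to go together with the backspace.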

Python Printing from python32

I can't get Python to print a word doc. What I am trying to do is to open the Word document, print it and close it. I can open Word and the Word document:
import win32com.client
msword = win32com.client.Dispatch("Word.Application")
msword.Documents.Open("X:\Backoffice\Adam\checklist.docx")
msword.visible= True
I have tried next to print
msword.activedocument.printout("X:\Backoffice\Adam\checklist.docx")
I get the error of "print out not valid".
Could someone shed some light on this how I can print this file from Python. I think it might be as simple as changing the word "printout". Thanks, I'm new to Python.
msword.ActiveDocument gives you the current active document. The PrintOut method prints that document: it doesn't take a document filename as a parameter.
From http://msdn.microsoft.com/en-us/library/aa220363(v=office.11).aspx:
expression.PrintOut(Background, Append, Range, OutputFileName, From, To, Item,
Copies, Pages, PageType, PrintToFile, Collate, FileName, ActivePrinterMacGX,
ManualDuplexPrint, PrintZoomColumn, PrintZoomRow, PrintZoomPaperWidth,
PrintZoomPaperHeight)
Specifically Word is trying to use your filename as a boolean Background which may be set True to print in the background.
Edit:
Case matters and the error is a bit bizarre. msword.ActiveDocument.PrintOut() should print it. msword.ActiveDocument.printout() throws an error complaining that 'PrintOut' is not a property.
I think what happens internally is that Python tries to compensate when you don't match the case on properties but it doesn't get it quite right for methods. Or something like that anyway. ActiveDocument and activedocument are interchangeable but PrintOut and printout aren't.
You probably have to escape the backslash character \ with \\:
msword.Documents.Open("X:\\Backoffice\\Adam\\checklist.docx")
EDIT: Explanation
The backslash is usually used to declare special characters. For example \n is the special character for a new-line. If you want a literal \ you have to escape it.
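A short sketch of the escaping rule, using the path from the question plus a made-up C:\temp example. A raw string (r"...") is an equivalent way to write the same literal without doubling every backslash.

```python
# Doubled backslashes and a raw string produce the same path.
escaped = "X:\\Backoffice\\Adam\\checklist.docx"
raw = r"X:\Backoffice\Adam\checklist.docx"
print(escaped == raw)

# Where it goes wrong: \t is a tab, so the unescaped literal below
# is one character shorter than the real path.
print(len("C:\temp"), len(r"C:\temp"))
```

In the question's path, \B, \A and \c happen not to be recognized escapes, so Python keeps those backslashes as-is, which is why the unescaped version may appear to work; relying on that is fragile.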
