Python - how to read .pages document in python?

I am new to Python and I am trying to read/edit a .pages document in Python.
I used this code:
with open(directory+"example.pages") as f:
    print(f.read())
However, the output is unreadable. I am trying to guess the root cause of the issue and it seems to be an encoding issue.
['PK\x03\x04\x14\x00\x00\x00\x00\x00}\xa2\x99G^8\xfd\\\x00\x0f\x00\x00\x00\x0f\x00\x00\x12\x00\x00\x00Index/Document.iwa\x00\xfc\x0e\x00\xa75\xf0\xa54\x08\x01\x120\x08\x90N\x12\x03\x01\x00\x05\x18\xb3\x01"']
['\x03']
['\x01', '\x10\x01\x18\x00*\x18\xc2\x13\xa0\x11\x99\x11\x9c\x11\x9f\x11\x9a\x11\xa1\x11\xc3\x13\x9b\x11\x9e\x11\x9d\x11\x98\x11\x12\x03\x08\x9d\x11\x1a\x03\x08\x9f\x11"\x03\x08\xa0\x112\x03\x08\x9c\x11:\x03\x08\xa1\x11z=']
['\x05:\x03\x08\x98\x11\x1a\x02en"\x03\x08\x99\x11*\x03\x08\xc3\x132\x03\x08\x9b\x11:\x03\x08\x9a\x11#\x00J\x15Application/Blank/ISOj\x03\x08\xc2\x13\xa2\x01\x03\x08\x9e\x11\xf5\x01\xec\xd1\x14D\xfd\x01\xf6xRD\x85\x02\x8c\xc5bB\x8d\x05\x06\x00\x95\x05\x06\x00\x9d\x05\x06\xf0J\xa5\x02w\xbb', 'B\xad\x02(\x14*B\xb5\x02\x00\x00\x80?\xb8\x02\x00\xc0\x02\x00\xd0\x02\x00\xda\x02\x07mfp2462\xe2\x02\x06iso-a4\xe8\x02\x00\x0f\x08\xc2\x13\x12']
['\x08\xe0\x04\x12\x03\x01\x05\x00\x18\x00\x19\x08\xa0\x11\x12\x14\x08\xd1\x0f\x12\x05\xf90g*\x08\xd9\x11\xbd\x11\xeb\x11\xec\x11\x08\x00', '\xdf$\x05salut*']
['\x07\x05\x120\xd9\x112\x08']
['\x06\x08\x00\x10\x00\x18\x00:\x11\x15\x10\xbd\x11P\x01b\x11', '\x08\xeb\x11r\x15"\x04\x8a\x01\x11\x16', '\xec\x11\x9a\x01\x05\x17\x10\x12\x02en\xc2', '\x01D\x1c$\x08\xec\x11\x12\x1f\x08\x9b-z\x005)y\xf0L\x1d\x10\x01\x18\x01*\x08\xa5\x13\xa6\x13\xa7\x13\xa4\x13\x88\x01\x01\x90\x01\x00\x98\x01\x00\xa0\x01\x00\.....

I don't know Pages in particular, but a file saved by an editor with that much formatting, layout etc. is likely to be almost unreadable by humans.
What you are displaying is a lot of characters, most of them non-printable; I see a couple of commas and spaces but that is it. Have you tried looking at the file in a standard text editor (like TextEdit on macOS or Notepad on Windows) to see what it is you are dealing with?
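One more observation on the dump in the question: it starts with PK\x03\x04, which is the signature of a ZIP local file header, and it contains the member name Index/Document.iwa. That suggests the document is a ZIP container of binary parts rather than text in some unusual encoding, so it should be opened in binary mode. Below is a minimal sketch using only the standard library; list_zip_members is a helper name made up for this illustration, and the in-memory archive is a stand-in for a real file (older Pages documents may be package folders instead, in which case this check simply returns None).

```python
import io
import zipfile


def list_zip_members(source):
    """Return the member names of a ZIP container, or None if it is not one.

    `source` may be a path or a binary file-like object.
    """
    if not zipfile.is_zipfile(source):
        return None
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


# Stand-in for a real document: an in-memory archive that reuses the
# member name visible in the dump (the payload here is a placeholder).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("Index/Document.iwa", b"\x00placeholder")
demo.seek(0)

print(list_zip_members(demo))
```

With a real file you would pass its path, for example list_zip_members(directory + "example.pages"). Listing the members only shows the structure; the .iwa parts themselves are binary and are not readable as plain text.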

Related

Specify format(s) of text copied from Qt app? Somehow crashes other app when pasted

I'm making a small Qt/PySide2/QML app whose entire purpose is to conveniently generate and copy some text for pasting into another program. (In case it matters, it's https://lackeyccg.com/; I suspect the only fact about it that may be relevant is that it's old enough that it doesn't play nicely with Unicode at all.) I'm currently running macOS 10.13, in case the particular clipboard is relevant.
In PySide, I copy text like so:
clipboard = QGuiApplication.clipboard()
clipboard.setText(text_to_copy)
Calling clipboard.mimeData().formats() tells me that the data is formatted as text/plain. Switching to a text editor, web browser, etc. and pasting works just fine. Additionally, if I then select what I've pasted, copy that, and paste it into LackeyCCG, everything is again fine. Unfortunately, pasting directly into LackeyCCG after copying in Qt crashes Lackey.
I've verified this with several test strings, ruling out possible causes like non-ASCII characters or newlines; it seems the only thing that doesn't cause a crash is an empty string.
I'm guessing this has something to do with which text formats Qt provides. By running osascript -e 'the clipboard as record' | less on the command line, I can inspect the contents of the system clipboard. Text, when copied from several text editors, as well as from Chrome, contains the formats <<class utf8>>, <<class ut16>>, and string. (The string version has newlines replaced by carriage returns, oddly enough.) In contrast, the text copied from my Qt app contains string, Unicode text, and <<class ut16>> (and its string has ordinary newlines).
I don't have the firmest grasp on the particulars of text encodings, but it seems possible that the operative difference here is the lack of a UTF-8 version. Obviously most modern apps are smart enough to interpret what Qt gives them, even though it's different from what most apps apparently produce. But for those of us trying to paste into abandonware, is there a way to force Qt/PySide2 to output text in specific formats? (Or any insight on what the problem could be, if that's not it?)

While I still don't know if it's possible with PySide2's own mechanisms, I found a blindingly simple solution: https://pypi.org/project/pyperclip/
import pyperclip
pyperclip.copy(text_to_copy)
The clipboard then contains a version in UTF-8 and everything works perfectly.

PyPDF2 can't read non-English characters, returns empty string on extractText()

I'm working on a script that will extract data from a large PDF file (40-60+ pages long).
The file isn't in English; it contains Greek characters. Everything seems fine until I run PyPDF2's extractText() function to get a given page's contents, at which point it returns an empty string.
I'm new to this library and I don't know how to fix this problem.

PyPDF2's "Extract Text" looks like it will either Work Just Fine, or Fail Completely. There are no parameters you can pass in to try to get things to work properly. It'll work or it won't.
You may not be able to fix this problem. If you can successfully copy/paste the text in Acrobat/Reader, then it's possible to extract the text. So what happens when you try to copy/paste out of Reader? Don't try this with some other third party PDF viewer, use Adobe software. You'll probably have to abandon PyPDF2 and move on to some other PDF API, but if Reader can do it, it's a fixable problem.
There are three different things in a PDF that can look like letters to the human eye.
1. Letters in the PDF in some text encoding. There are several fixed encodings, plus PDF allows you to embed your own custom encodings (often used with font subsets). Software can create PDFs that look fine but can't really be copy/pasted from, even by Adobe.
2. Path art that just happens to look an awful lot like letters. "Start drawing a line here, draw a straight line to there, then a curve like this to there" and so on. If you're curious, PDF uses Bezier curves to define its curves. Not terribly related to your question, but interesting.
3. Bitmaps (.jpeg/gif/etc images) that define a grid of pixels.
In the past, Reader has only been able to handle text type 1 above, and then only if the text was encoded properly. Broken custom encodings are alarmingly common (or were 7+ years ago when I stopped working on PDF software).
With broken type 1s, and all of 2 and 3, the only thing you can do is to run OCR on the PDF. OCR: Optical Character Recognition. There are several open source OCR projects out there, as well as commercial ones.

New line with invisible character

I'm sure this has been answered before but after attempting to search for others who had the problem I didn't have much luck.
I am using csv.reader to parse a CSV file. The file is in the correct format, but on one of the lines of the CSV file I get the notification "list index out of range" indicating that the formatting is wrong. When I look at the line, I don't see anything wrong. However, when I go back to the website where I got the text, I see a square/rectangle symbol where there is a space. This symbol must be leading csv.reader to treat that as a new line symbol.
A few questions: 1) What is this symbol and why can't I see it in my text files? 2) How do I avoid having these treated as new lines? I wonder if the best way is to find and replace them given that I will be processing the file multiple times in different ways.
Here is the symbol:
Update: When I copy and paste the symbol into Google it searches for Â (a-circumflex). However, when I copy and paste Â into my documents, it shows up correctly. That leads me to believe that the symbol is not actually Â.

This looks like a charset problem. The "Â" is what you get when a non-breaking space encoded as UTF-8 (the bytes 0xC2 0xA0) is decoded as Latin-1: 0xC2 shows up as "Â" and 0xA0 stays an invisible non-breaking space. Assuming you are running Windows, you are probably using one of the Latin code pages as your character set, whereas UTF-8 is the default encoding on OS X and Linux-based OSs. Most text editors take their default from the OS locale, and thus encode files created with them as Latin-1. A lot of programmers on OS X have problems with non-breaking spaces because they are very easy to type by mistake (Option+Spacebar) and impossible to see.
In Python >= 3.1, the csv reader supports dialects for solving these kinds of problems. If you know what program was used to create the CSV file, you can manually specify a dialect, like 'excel'. You can also use csv.Sniffer to deduce it automatically by peeking into the file.
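The mix-up is easy to reproduce. The sketch below encodes a non-breaking space as UTF-8, decodes the bytes with the wrong charset, and then shows one possible cleanup; normalize_spaces is a helper name made up for this illustration, and replacing the character with an ordinary space is an assumption about what you want for your data.

```python
NBSP = "\u00a0"  # non-breaking space

utf8_bytes = NBSP.encode("utf-8")        # two bytes: 0xC2 0xA0
misread = utf8_bytes.decode("latin-1")   # decoded with the wrong charset

print(utf8_bytes)
print(repr(misread))


def normalize_spaces(text):
    """Replace non-breaking spaces with ordinary spaces."""
    return text.replace(NBSP, " ")


print(repr(normalize_spaces("10\u00a0000")))
```

Opening the file with the encoding it was actually written in (for example open(path, encoding="utf-8", newline="")) avoids the stray "Â" in the first place; the replacement step is only needed if the non-breaking spaces themselves get in the way.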
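As a rough sketch of the sniffing approach, the example below lets csv.Sniffer guess the delimiter from a small sample and then parses with the detected dialect. The sample data and the list of candidate delimiters are assumptions made for illustration. Note that a dialect only describes delimiters and quoting; the file's character encoding is a separate setting passed to open().

```python
import csv
import io

# Made-up sample; with a real file you would read the first few KB instead.
sample = "name;city\nAda;London\nLinus;Helsinki\n"

# Restricting the candidate delimiters makes the guess more predictable.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
rows = list(csv.reader(io.StringIO(sample), dialect))

print(repr(dialect.delimiter))
for row in rows:
    print(row)
```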
Life Management Advice: If you happen to see weird characters anywhere, always assume charset problems. There is an awesome charset problem debug table HERE.

txt file appears blank with .write() python

I am on a Windows machine and am trying to write a couple thousand lines to a text file using IPython. To test this I am just trying to get some text to appear in the file.
My code is as follows:
path="\Users\\*****\Desktop"
with open(path+'newheaders.txt','wb') as f:
    f.write('new text')
This question (.write not working in Python) is answered and seems like it should have solved my issue but when I open the text file it is still blank.
I tested the file using the code below and the text appears to be there.
with open(path+'newheaders.txt','r') as f:
    print f.read()
Any ideas?

This 'should' work as written. A few things to try (I would put this in a comment but I lack sufficient reputation):
Delete the file and make sure the program is creating the file
Try writing as 'wt' rather than binary to see if we can narrow down the problem that way.
Remove all the business with the path and just try to write the file in the current directory.
What text editor are you using? Is it possible it's not refreshing the blank file?
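On the path suggestion: joining a folder and a file name with + adds no separator, so path+'newheaders.txt' in the question names a file called Desktopnewheaders.txt one level above the Desktop, which may be why the file on the Desktop looks blank. The sketch below (Python 3, with a temporary folder standing in for the Desktop, which is an assumption for illustration) compares the two spellings and prints the absolute paths so you can see where the file really lands.

```python
import os
import tempfile

base = tempfile.mkdtemp()  # hypothetical stand-in for the Desktop folder

glued = base + "newheaders.txt"                # no separator is inserted
joined = os.path.join(base, "newheaders.txt")  # separator added for you

with open(joined, "w") as f:
    f.write("new text")

# Printing absolute paths shows exactly which file each spelling refers to.
print(os.path.abspath(glued))
print(os.path.abspath(joined))
```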

create pdf from python

I'm looking to generate PDFs from a Python application.
They start relatively simple but some may become more complex (essentially letter-like documents, but they will later include watermarks, for example).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and end up with a valid file, I want to avoid using complex libs that may not do entirely what I want. Some seem to have bit-rotted and are no longer supported (pypdf and PyPDF2). Especially when I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?

borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
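The offset bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single page, the standard Helvetica font, and ASCII text without parentheses or backslashes (those would need escaping). build_minimal_pdf is a name made up for this example, not part of any library.

```python
def build_minimal_pdf(text="Hello, world"):
    """Assemble a one-page PDF by hand and return it as bytes."""
    # Page content: select font F1 at 24pt, move to (72, 720), draw the text.
    stream = b"BT /F1 24 Tf 72 720 Td (" + text.encode("ascii") + b") Tj ET"

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus entry 0.
    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


pdf_bytes = build_minimal_pdf("Hello, world")
print(len(pdf_bytes), "bytes")
```

Writing pdf_bytes to a file opened with mode "wb" should give something a PDF viewer can open. Because the offsets are measured while the bytes are being appended, they stay correct when the text changes, which is the part that is tedious to do by hand.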
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
but really you should probably just use a library
Thanks to @LukasGraf for providing this link, http://www.gnupdf.org/Introduction_to_PDF, which shows how to create a simple hello world PDF from scratch.

As long as you're working in Python 2.7, ReportLab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general, hopefully the learning curve won't be too steep.

I recommend that you use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.
