Determining centered pdf content in python

I'm trying to analyze texts from movie scripts and need a way to grab the specific character lines. Character lines are easily visible because they are always centered and formatted like a block quote. Here's an example.
So I would want to get characters' blocks of lines. However, when I read the pdf with something like pdfplumber, it doesn't specify that there was any difference in formatting there, so it will print out something like:
--
CLEMENTINE
God, yes. You've saved my life! Brrr!
The waitress pours the coffee.
WAITRESS
You know what you want?
--
I don't want the "The waitress pours the coffee." line to be clumped in with the character's actual speaking lines. Is there any way (using pdfplumber or any other module) that I could extract that centering or change in margins somehow? I don't know how else to specify that this text is different. It's easy to eyeball, but the program isn't picking up the difference.
Thanks!

Unfortunately, in PDF compilations you can throw all human concepts out of the pram.
ALL text is generally treated as equal, though some can be more equal than others.
So there is no such thing as tabs or centering, since normally every line is simply centered between its own start and end.
So how many of those justified lines are also centered?
There is no flag for justification or for left or right alignment; those terms are meaningless to a printer. It just blobs out big letters, small letters, or letters that may look like ALL CAPS, and printing has no need for words. Literally, a PDF is just "go here, go there, and put some characters or marks on the page".
If we load the URL for page 4 into a PDF editor we can see how it was constructed.
So it is unusual that the text is only ragged right (just as it would be from a line printer or typewriter); I had expected ragged left too. In either case, however, there is no way to differentiate one text line from another. The typewritten face is naturally one height, so only human intelligence can say what is dialogue and what is a stage direction.
So you ask how to tell the difference, and the answer is clear. Luckily, unlike most other PDFs, this one has a semblance of indentation (very rare). It was built using Microsoft Word but follows stagecraft conventions: "Professional screenwriting software takes care of this by automatically tabbing down to a new line in dialogue. There may be small discrepancies between them but nothing to get too hung up on."
      approaches her with a coffee pot.
                           CLEMENTINE
                 Hi, it's me again! My home away from...
It may vary from document to document, but in the PDF copy linked above:
6 spaces is a stage direction or slugline
27 spaces is a CHARACTER
17 spaces is "Dialogue", but without any quote marks
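The indentation rule above can be sketched as a small classifier. This is a minimal sketch, not a definitive implementation: it assumes the lines were extracted with their leading spaces preserved (pdfplumber's layout-preserving text extraction is one candidate, though the answer does not confirm which tool was used), the role names and the `tolerance` parameter are made up for illustration, and the widths 6, 17 and 27 apply only to this particular script.

```python
# Indent widths reported for this particular script; other PDFs will differ.
INDENT_ROLES = {6: "direction", 17: "dialogue", 27: "character"}


def classify_line(line, tolerance=2):
    """Return a role for one line based on its leading-space count."""
    text = line.strip()
    if not text:
        return "blank"
    indent = len(line) - len(line.lstrip(" "))
    # Pick the known indent closest to the measured one.
    nearest = min(INDENT_ROLES, key=lambda known: abs(known - indent))
    if abs(nearest - indent) > tolerance:
        return "unknown"
    role = INDENT_ROLES[nearest]
    # At the left margin, ALL CAPS marks a slugline rather than a direction.
    if role == "direction" and text.isupper():
        return "slugline"
    return role


def dialogue_blocks(lines):
    """Group dialogue lines under the CHARACTER name that precedes them."""
    blocks, speaker = [], None
    for line in lines:
        role = classify_line(line)
        if role == "character":
            speaker = line.strip()
            blocks.append((speaker, []))
        elif role == "dialogue" and speaker is not None:
            blocks[-1][1].append(line.strip())
        elif role != "blank":
            speaker = None  # a direction or slugline ends the speech
    return blocks


sample = [
    " " * 27 + "CLEMENTINE",
    " " * 17 + "God, yes. You've saved my life! Brrr!",
    " " * 6 + "The waitress pours the coffee.",
    " " * 27 + "WAITRESS",
    " " * 17 + "You know what you want?",
]
blocks = dialogue_blocks(sample)
for name, speech in blocks:
    print(name, "->", " ".join(speech))
```

Lines whose indent matches none of the known widths come back as "unknown", which is where side-by-side dialogue would most likely end up and need manual handling.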
However, there is always a fly in the ointment: here there are two characters speaking, so the character names are moved and the dialogue starts where a stage direction (mixed case) or slugline (ALL CAPS) would be expected.
She scrambles in the window. Joel looks around, panicked.
^ JOEL VOICE-OVER
^ (whisper) I couldn't believe you did
Clementine. that. I was paralyzed with
^ fear.

Related

Extracting Data from Unstructured PDFs in Python

Like it says, I'm trying to find a method to extract data from PDFs in Python. I've explored a few solutions already, but I'm not finding any solution that fits the need.
The PDF I have is scanned in, but I can use Tesseract to turn it into a text pdf if necessary. The goal in the short term is to grab a few values from the PDF and store them. The large scale goal is to get a large number of these PDFs and perform this task automatically. I know how to store the data if I can get it out of the PDF, my problem is actually getting the values out.
I'm not at liberty to display the PDF, but below is an example of what the document looks like.
Sorry for my crude art, I figured this would be easier than recreating an empty copy of the PDF, but I can make a better mock up if necessary. The fields I would like to extract are highlighted in red. Wherever it says TITLE: next to a field is where title would appear on the document, usually on a separate line, save for the field at the bottom.
I've tried a few tools, notably Azure Cognitive Services and PyPDF2. However, the issue I usually run into is that the output has each group of words on an individual line, which does not work if the title of a form field is above it, as in the example table below:
left    center    right
One     Two       Three
The output returns left, then center, then right, then One, then Two, then Three. If the field for Two or One was left blank, searching for 3 rows below right would not give me the expected output.
I've run into a few other bugs with other solutions, like needing to have bounding boxes on my PDF for it to work, but I'm starting to run out of solutions to find, and I was wondering if anyone had any ideas for how I can get this task done.
There are multiple pages, however I only really need 1-2, and I only have 1 scanned with Tesseract. The format stays relatively the same, although each pdf is independently scanned in so there could be minor changes there.
Any and all help is greatly appreciated.

How can I style a single letter or word in a sentence drawn with PIL?

It seems there's no out-of-the-box way to do this (I've skimmed the docs), but just in case: if I want to draw a sentence to an image with PIL but want control over individual words (or even individual letters), how can I do that?
I'm also open to solutions that use a library other than PIL.
The image below shows the kind of effect I'm going for.
There is no direct way to do this. However, you can determine the text size: http://pillow.readthedocs.io/en/5.2.x/reference/ImageDraw.html#PIL.ImageDraw.PIL.ImageDraw.ImageDraw.textsize.
You could draw the first part of your text, get the size of that text to calculate the position for the next part, and so on. That way, you could style each part of the text in any manner you see fit.
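The draw-measure-advance approach described above might look like the sketch below. It assumes a Pillow release that provides `ImageDraw.textlength` (the `textsize` method linked above belongs to older releases), and it varies only the fill colour per segment. The helper names `layout_segments` and `draw_styled_line` and the sample sentence are hypothetical.

```python
def layout_segments(segments, measure, start_x=0):
    """Return (x, text, style) triples, placing each segment after the previous one."""
    placed, x = [], start_x
    for text, style in segments:
        placed.append((x, text, style))
        x += measure(text)  # advance by the width of what was just placed
    return placed


def draw_styled_line(segments, size=(400, 40), origin=(10, 10)):
    """Draw segments left to right, each in its own fill colour."""
    # Imported here so the layout helper above stays usable without Pillow.
    from PIL import Image, ImageDraw, ImageFont

    image = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()
    placed = layout_segments(
        segments, lambda text: draw.textlength(text, font=font), origin[0]
    )
    for x, text, fill in placed:
        draw.text((x, origin[1]), text, fill=fill, font=font)
    return image


# Hypothetical sentence with one word picked out in red.
parts = [("The quick ", "black"), ("brown", "red"), (" fox", "black")]
# A stand-in measure (10 px per character) shows the position arithmetic.
positions = layout_segments(parts, measure=lambda text: 10 * len(text))
```

Calling `draw_styled_line(parts)` would return an image with "brown" in red. Using a different font per segment would also mean measuring each segment with its own font.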

Tell if text of PDF is visible or not

I'm parsing some PDF files using the pdfminer library.
I need to know if the document is a scanned document, where the scanning machine places the scanned image on top and OCR-extracted text in the background.
Is there a way to identify whether text is visible, given that OCR machines do place it on the page for selection?
Generally the problem is distinguishing between two very different, but similar looking cases.
In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.
Here's the PDF as text with the image truncated: http://pastebin.com/a3nc9ZrG
In the other case there's a background image that covers most of the page with the text in front of it.
Telling them apart is proving difficult for me.
Your question is a bit confusing so I'm not really sure what is going to help you the most. However, you describe two ways to "hide" text from OCR. Both I think are detectable but one is much easier than the other.
Hidden text
Hidden text is regular or invisible text that is placed behind something else. In other words, you use the stacking order of objects to hide some of them. The only way you can detect this type of case is by figuring out where all of the text objects on the page are (calculating their bounding boxes isn't trivial but certainly possible) and then figuring out whether any of the images on the page overlaps that text and is in front of it. Some additional comments:
Theoretically, something other than an image could be hiding it, but in your OCR case I would guess it's always an image.
Though an image may be overlapping the text, it may also be transparent in some way. In that case, the text underneath may still shine through. In your case of a general OCR engine, that is probably not likely.
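The overlap test described above can be sketched in a few lines. This is only an illustration under stated assumptions: each page object is represented by a hypothetical `(kind, bbox, label)` tuple with an `(x0, y0, x1, y1)` box, the list is already in rendering order (later items drawn on top), and transparency is ignored. Getting the boxes and the rendering order out of a real PDF is the hard part and is not shown.

```python
def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def covered_text(objects):
    """Return labels of text items that a later-drawn image overlaps.

    `objects` is a list of (kind, bbox, label) tuples in rendering order.
    """
    hidden = []
    for index, (kind, bbox, label) in enumerate(objects):
        if kind != "text":
            continue
        later = objects[index + 1:]
        if any(k == "image" and overlaps(bbox, b) for k, b, _ in later):
            hidden.append(label)
    return hidden


# Made-up page: OCR text first, a full-page scan on top, then a visible caption.
page = [
    ("text", (40, 750, 120, 765), "Chicken"),
    ("image", (0, 0, 612, 792), "scan"),
    ("text", (40, 700, 200, 715), "caption drawn on top"),
]
print(covered_text(page))
```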
Invisible text
PDF supports invisible text. More precisely, PDF supports different text rendering modes; those rendering modes determine whether characters are filled, outlined, filled and outlined, or invisible (there are yet other possibilities). In the PDF file you posted, you find this fragment:
BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
That's an invisible chicken right there! The instruction "3 Tr" sets the text rendering mode to "3", which is equal to "invisible" or "neither stroked nor filled" as the PDF specification very elegantly puts it.
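A rough way to look for this in a content stream is to track the current rendering mode and collect the strings shown while it is 3. The sketch below assumes the stream has already been decompressed into text, handles only simple `(...) Tj` strings without nested or escaped parentheses, and ignores `TJ` arrays and graphics-state save and restore, so it is a starting point rather than a parser. The function name is hypothetical.

```python
import re

# Matches either "<n> Tr" (set rendering mode) or "(...) Tj" (show a string).
TOKEN = re.compile(r"(?P<mode>\d+)\s+Tr\b|\((?P<text>[^()]*)\)\s*Tj\b")


def invisible_strings(content):
    """Return strings shown with Tj while the text rendering mode is 3."""
    mode, found = 0, []  # mode 0 (fill) is the initial rendering mode
    for match in TOKEN.finditer(content):
        if match.group("mode") is not None:
            mode = int(match.group("mode"))
        elif mode == 3:
            found.append(match.group("text"))
    return found


# The fragment quoted above.
fragment = """BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
"""
print(invisible_strings(fragment))
```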
It's worthwhile mentioning that these two techniques can be used interchangeably by OCR engines. Placing invisible text on top of a scanned image is actually good practice because it means that most PDF viewers will allow you to select the text. Some PDF viewers that I looked at at some point didn't allow text selection if the text was "behind" the image.
I don't have a copy of the PDF 1.7 specification, but I suspect that the objects on a page are rendered in order, that is, the preceding objects end up covered up by succeeding objects.
Thus, you would have to iterate through the layout objects (See Performing Layout Analysis) and calculate where everything falls on the page, their dimensions, and their rendering order (and possibly their transparency).
As the pdfminer documentation mentions, PDF is evil.

Python Pygtk making colored tags in Text View dynamically

I have seen some ways of making colored text in a TextView in Python PyGTK. The issue seems to be that they either print all the text in that colour or make the whole line that colour, rather than colouring only certain items.
I want it so that when I type "" it is coloured blue, and if there is a "string" in the text view it is orange or any kind of colour,
and if there is a '#comment' then it is italicized and grey.
Not sure if it helps, but I have a part where, as I am typing, it writes the text to a page. Is it possible to keep this kind of syntax colouring active?
I hope this makes sense.
any help is much appreciated! Thank you!
Use GtkSourceView for syntax highlighting. Don't reinvent the wheel.
In general, what you are looking for, I'd say, is to use regular expressions (the re module; there is an abundance of questions on this here, probably some for the exact patterns you need) to find the patterns you mention above in your TextBuffer. That means you need to connect a signal to the buffer so you can see what the user types. Then you'll need a set of TextTags (one tag per formatting rule or pattern) to apply to the regions of the buffer where the regular expressions match the patterns you've described. Finally, you apply the tags to the buffer, and those TextTags can reformat the text display in the TextView in an array of ways (as the documentation says here).
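The pattern-matching half of that approach can be sketched without any GTK code. The example below is a minimal illustration: the tag names and the two regular expressions are hypothetical, a "#" inside a quoted string would wrongly be treated as a comment, and the GTK side (connecting to the buffer's change signal and applying a TextTag to each span) is left out.

```python
import re

# One (tag name, pattern) pair per formatting rule; the names are hypothetical.
RULES = [
    ("comment", re.compile(r"#.*")),
    ("string", re.compile(r'"[^"\n]*"')),
]


def find_tag_spans(text):
    """Return (tag_name, start, end) character offsets for every rule match."""
    spans = []
    for name, pattern in RULES:
        for match in pattern.finditer(text):
            spans.append((name, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])


source = 'x = "hello"  # greet'
spans = find_tag_spans(source)
for name, start, end in spans:
    print(name, repr(source[start:end]))
```

Each span carries character offsets, which is the form a text buffer would need in order to apply the matching tag to that region after every edit.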
Without any supplied code, it's hard to be precise on where you might be having a problem.
Hope it points you in the right direction...
Mind, though, that if you override the GTK theme, another user could have a theme with, e.g., an orange background in the TextViews, so you should make sure that it works visually regardless of which theme is in use.

best way to print data in columnar format?

I am using Python to read in data in a user-unfriendly format and transform it into an easier-to-read format. The records I am outputting are usually going to be just a last name, first name, and room code.
I would like to output a series of pages, each containing a contiguous subset of the total records, divided into multiple columns, each of which contains a contiguous subset of the total records on the page. (So in other words, you'd read down the first column, move to the next column, move to the next column, etc., and then start over on the next page...)
The problem I am facing now is that for output formats, I'm almost certainly limited to HTML (and Javascript, CSS, etc.) What is the best way to get the data into this columnar format? If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally, for instance, I could easily print tables of 5x20, but I don't know if there's a way to indicate a page break -- and I don't know if there's any way to calculate programmatically how many records will fit on the page.
How would you approach this?
EDIT: The reason I said that I was limited in output: I have to produce the file on one computer, then bring it to a different computer upon which we cannot install new software and on which the selection of existing software is not optimal. The file itself is only going to be used to make a physical printout (which is what the end users will actually work with), but my time on the computer that I can print from is going to be limited, so I need to have the file all ready to go and print right away without a lot of tweaking.
Right now I've managed to find a word processor that I can use on the target machine, so I'm going to see if I can target a format that the word processor uses.
EDIT: Once I knew there was a word processor I could use, I made a simple skeleton file with the settings that I wanted (column and tab settings, monospaced font in a small point size, etc.) and then measured how many characters I got per line of a column and how many lines I got per column. I've watched the runs pretty carefully to make sure that there weren't some strange lines that somehow overflowed the characters-per-line guideline (which shouldn't happen with monospaced font, of course, but how many times do you end up having to figure out why that thing that "shouldn't" happen is happening anyways?)
If there hadn't been a word processor on the target machine that I could use, I probably would have looked at PDF as an output format.
"If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally"
You do know that.
You know the size of your paper. You know the size of your font. You can easily do the math.
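That arithmetic could be sketched as follows. Every number here is an assumption for illustration: US Letter paper, half-inch margins, a 10 pt monospaced font whose characters are 0.6 of the point size wide, 1.2 line spacing, and 30 characters per record. The second helper fills columns top to bottom and pages in order, as the question describes.

```python
def page_capacity(page_w_in, page_h_in, margin_in, font_pt, chars_per_record,
                  line_spacing=1.2, char_width_ratio=0.6):
    """Estimate (rows, columns) of records that fit on one page."""
    points_per_inch = 72
    usable_w = (page_w_in - 2 * margin_in) * points_per_inch
    usable_h = (page_h_in - 2 * margin_in) * points_per_inch
    # One record per text line; line height is font size times spacing.
    rows = int(usable_h // (font_pt * line_spacing))
    # Monospaced font: every character has the same assumed width.
    record_width = chars_per_record * font_pt * char_width_ratio
    columns = int(usable_w // record_width)
    return rows, columns


def paginate(records, rows, columns):
    """Split records into pages of columns; each column reads top to bottom."""
    per_page = rows * columns
    pages = []
    for start in range(0, len(records), per_page):
        chunk = records[start:start + per_page]
        pages.append([chunk[i:i + rows] for i in range(0, len(chunk), rows)])
    return pages


rows, columns = page_capacity(8.5, 11, 0.5, font_pt=10, chars_per_record=30)
# Tiny demonstration with 2 rows and 2 columns per page.
pages = paginate(list(range(7)), rows=2, columns=2)
```

The real printable area depends on the printer and the program doing the printing, so the estimate is worth checking against one test page before relying on it.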
"almost certainly limited to HTML..." doesn't make much sense. Is this a web application? The page can have a "Previous" and "Next" button to step through the pages? Pick a size that looks good to you and display one page full with "Previous" and "Next" buttons.
If it's supposed to be one HTML page that prints correctly, that's hard. There are CSS things you can do, but you'll be happier creating a PDF file.
Get PyX or ReportLab and create a PDF that prints properly.
Personally, I have no patience with any of this. I try to put this kind of thing into a CSV file. My users can then open the CSV with a spreadsheet tool (OpenOffice.org has a good one), adjust the columns, and print from it.
