I am using Python to read in data in a user-unfriendly format and transform it into an easier-to-read format. The records I am outputting are usually going to be just a last name, first name, and room code.
I would like to output a series of pages, each containing a contiguous subset of the total records, divided into multiple columns, each of which contains a contiguous subset of the total records on the page. (So in other words, you'd read down the first column, move to the next column, move to the next column, etc., and then start over on the next page...)
The problem I am facing now is that for output formats, I'm almost certainly limited to HTML (and JavaScript, CSS, etc.). What is the best way to get the data into this columnar format? If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally, for instance, I could easily print tables of 5x20, but I don't know if there's a way to indicate a page break -- and I don't know if there's any way to calculate programmatically how many records will fit on the page.
How would you approach this?
EDIT: The reason I said that I was limited in output: I have to produce the file on one computer, then bring it to a different computer upon which we cannot install new software and on which the selection of existing software is not optimal. The file itself is only going to be used to make a physical printout (which is what the end users will actually work with), but my time on the computer that I can print from is going to be limited, so I need to have the file all ready to go and print right away without a lot of tweaking.
Right now I've managed to find a word processor that I can use on the target machine, so I'm going to see if I can target a format that the word processor uses.
EDIT: Once I knew there was a word processor I could use, I made a simple skeleton file with the settings that I wanted (column and tab settings, monospaced font in a small point size, etc.) and then measured how many characters I got per line of a column and how many lines I got per column. I've watched the runs pretty carefully to make sure that there weren't some strange lines that somehow overflowed the characters-per-line guideline (which shouldn't happen with monospaced font, of course, but how many times do you end up having to figure out why that thing that "shouldn't" happen is happening anyways?)
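For illustration, the measured limits (characters per column line, lines per column) can drive a small layout routine. This is a minimal sketch, not the code actually used: the function name `paginate` and its parameters are hypothetical, and it assumes each record is already formatted as a single string.

```python
def paginate(records, lines_per_column, columns_per_page, column_width):
    """Lay records out down each column, then across columns, then onto the next page."""
    per_page = lines_per_column * columns_per_page
    pages = []
    for start in range(0, len(records), per_page):
        chunk = records[start:start + per_page]
        # Split this page's records into contiguous columns.
        columns = [chunk[i:i + lines_per_column]
                   for i in range(0, len(chunk), lines_per_column)]
        rows = []
        for r in range(lines_per_column):
            cells = [col[r] if r < len(col) else "" for col in columns]
            # Truncate and pad every cell so a monospaced font lines the columns up.
            line = "".join(c[:column_width].ljust(column_width) for c in cells)
            rows.append(line.rstrip())
        # Drop empty rows at the bottom of a partially filled page.
        while rows and not rows[-1]:
            rows.pop()
        pages.append("\n".join(rows))
    return pages
```

Joining the returned pages with a form feed (`"\f".join(pages)`) is one way to mark page breaks in plain text, although whether a given word processor honours that character is an assumption to check on the target machine.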
If there hadn't been a word processor on the target machine that I could use, I probably would have looked at PDF as an output format.
"If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally"
You do know that.
You know the size of your paper. You know the size of your font. You can easily do the math.
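As a rough sketch of that math: the function below is hypothetical, and it assumes a monospaced font whose character width is about 0.6 of the point size (true of Courier, but not of every font), a line spacing of 1.2 times the point size, and equal margins on all sides.

```python
def page_capacity(page_w_in, page_h_in, margin_in, font_pt, record_chars,
                  line_spacing=1.2, char_aspect=0.6, gutter_chars=2):
    """Estimate (lines per column, columns per page) for a monospaced layout.

    Sizes are in inches except font_pt (points, 72 per inch). The default
    line_spacing and char_aspect are assumptions; measure them for the real font.
    """
    usable_w = (page_w_in - 2 * margin_in) * 72
    usable_h = (page_h_in - 2 * margin_in) * 72
    lines = int(usable_h // (font_pt * line_spacing))
    chars_per_line = int(usable_w // (font_pt * char_aspect))
    # Each column needs record_chars plus a gutter, except that the last
    # column needs no gutter after it.
    columns = (chars_per_line + gutter_chars) // (record_chars + gutter_chars)
    return lines, columns
```

For US Letter paper with half-inch margins, a 10pt font, and 28-character records, `page_capacity(8.5, 11, 0.5, 10, 28)` estimates 60 lines and 3 columns. Printer drivers can shrink the printable area, so treat the result as a starting point.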
"almost certainly limited to HTML..." doesn't make much sense. Is this a web application? The page can have a "Previous" and "Next" button to step through the pages? Pick a size that looks good to you and display one page full with "Previous" and "Next" buttons.
If it's supposed to be one HTML page that prints correctly, that's hard. There are CSS things you can do, but you'll be happier creating a PDF file.
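For reference, one of the "CSS things" is a forced page break between blocks. The sketch below is only an illustration: `render_pages` is a hypothetical name, the rows and columns per page must still be chosen by hand, and how faithfully a browser honours `break-before` when printing should be tested on the target machine.

```python
from html import escape


def render_pages(records, rows, cols):
    """Return one HTML document with a table per printed page, filled column by column."""
    per_page = rows * cols
    parts = [
        "<!DOCTYPE html><html><head><meta charset='utf-8'><style>",
        # Force a page break before every table except the first.
        "table.page + table.page { break-before: page; page-break-before: always; }",
        "td { font: 8pt monospace; padding: 0 1em; }",
        "</style></head><body>",
    ]
    for start in range(0, len(records), per_page):
        chunk = records[start:start + per_page]
        parts.append("<table class='page'>")
        for r in range(rows):
            cells = []
            for c in range(cols):
                i = c * rows + r  # column-major index: read down, then across
                text = escape(chunk[i]) if i < len(chunk) else ""
                cells.append(f"<td>{text}</td>")
            parts.append("<tr>" + "".join(cells) + "</tr>")
        parts.append("</table>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

The older `page-break-before` property is included next to `break-before` because older browsers may only recognise the former.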
Get PyX or ReportLab and create a PDF that prints properly.
I -- personally -- have no patience with any of this. I try to put this kind of thing into a CSV file. My users can then open the CSV file with a spreadsheet tool (OpenOffice.org has a good one), adjust the columns, and print from there.
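A minimal sketch of the CSV route using Python's standard `csv` module. The header names are assumptions based on the fields mentioned in the question (last name, first name, room code), and `records_to_csv` is a hypothetical helper.

```python
import csv
import io


def records_to_csv(records):
    """Serialize (last name, first name, room code) tuples as CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["last_name", "first_name", "room_code"])
    writer.writerows(records)  # fields containing commas are quoted automatically
    return buffer.getvalue()
```

To write a file instead, open it with `open(path, "w", newline="")` and pass the file object to `csv.writer` in place of the buffer.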
Related
Like it says, I'm trying to find a method to extract data from PDFs in Python. I've explored a few solutions already, but I'm not finding any solution that fits the need.
The PDF I have is scanned in, but I can use Tesseract to turn it into a text pdf if necessary. The goal in the short term is to grab a few values from the PDF and store them. The large scale goal is to get a large number of these PDFs and perform this task automatically. I know how to store the data if I can get it out of the PDF, my problem is actually getting the values out.
I'm not at liberty to display the PDF, below is an example of what the document looks like.
Sorry for my crude art, I figured this would be easier than recreating an empty copy of the PDF, but I can make a better mock up if necessary. The fields I would like to extract are highlighted in red. Wherever it says TITLE: next to a field is where title would appear on the document, usually on a separate line, save for the field at the bottom.
I've tried using a few tools, notably Azure Cognitive Services and PyPDF2. However, the issue I usually run into is that the output puts each group of words on its own line, which does not work if the title of a form field is above its value, as in the example table below:
left    center    right
One     Two       Three
The output returns left, then center, then right, then One, then Two, then Three. If the field for Two or One was left blank, searching for 3 rows below right would not give me the expected output.
I've run into a few other bugs with other solutions, like needing to have bounding boxes on my PDF for it to work, but I'm starting to run out of solutions to find, and I was wondering if anyone had any ideas for how I can get this task done.
There are multiple pages; however, I only really need 1-2 of them, and I have only run 1 through Tesseract so far. The format stays relatively the same, although each PDF is scanned in independently, so there could be minor variations.
Any and all help is greatly appreciated.
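One possible direction, sketched under assumptions: if the OCR step can report a bounding box for each word (Tesseract's TSV output includes `left`, `top`, `width`, and `height` columns), a value can be looked up by position rather than by line order. The function `value_below`, the dictionary layout, and the `max_gap` threshold are all hypothetical, and multi-word labels would need extra handling.

```python
def value_below(words, label, max_gap=60):
    """Return the text of the nearest word directly under `label`, or None.

    `words` is a list of dicts with keys text, left, top, width, height
    (pixel units). A word counts as "under" the label when it overlaps the
    label horizontally and starts within max_gap pixels below it.
    """
    anchor = next((w for w in words if w["text"] == label), None)
    if anchor is None:
        return None
    a_left = anchor["left"]
    a_right = anchor["left"] + anchor["width"]
    a_bottom = anchor["top"] + anchor["height"]
    candidates = []
    for w in words:
        w_left, w_right = w["left"], w["left"] + w["width"]
        gap = w["top"] - a_bottom
        overlaps = w_left < a_right and w_right > a_left
        if overlaps and 0 <= gap <= max_gap:
            candidates.append((gap, w["text"]))
    # A blank field yields no candidate, so nothing from a neighbouring
    # column is picked up by mistake.
    return min(candidates)[1] if candidates else None
```

With the example table, a blank "Two" makes `value_below(words, "center")` return `None` instead of shifting to "Three". Because each PDF is scanned independently, the overlap test and `max_gap` would likely need tuning.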
I am currently building a set of multiple graphs for my personal company using Datadog. I love how it works but there is only one thing I have not been able to sort out. Whenever my data is generated every 5 minutes, there are times where one or multiple values will come in at '0' which is what I want. The problem is Datadog is for some reason not taking these values into account and so until that same value finally comes in with something other than '0' then nothing will show up saying the value was '0' and then it changed to something else. Instead the graph chooses to create a straight line from the last recorded non-zero value straight to the newest non-zero value. I would love to know how I can get Datadog to consider the zeros and graph them.
In addition, if possible, I would also love to know how I could say something like "if this previous value existed and then on the next set of data it does not show up at all (not even as "0") just assign a "0" to it until it once again appears on the data". Of course for this to be looked into I would need the first problem dealt with.
Here is an image of how it looks right now, which is NOT how I want it to look. The red line shows where all the "0" values land; the green boxes show the last recorded non-zero values.
Example of '0' values not being graphed properly
I have tried looking through most of the Datadog documentation as well as their posted YouTube videos, with no luck. For some reason they do not address this, even when it is right in front of them in their examples. I expected to find some info online, but there seem to be few resources at the moment. That made me think this could be the best place to finally get an answer.
I believe you are looking for interpolation. You specified a couple of use cases, so you may have to experiment with a few of the options depending on what your data looks like. For example, Fill Zero satisfies one of them.
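To make the idea concrete, here is a plain-Python sketch of what zero-filling a series means. It is not Datadog's implementation or API: the function name and the (timestamp, value) representation are assumptions for illustration only.

```python
def fill_zero(points, start, end, step):
    """Return one (timestamp, value) pair per interval, using 0 where no point exists.

    `points` is an iterable of (timestamp, value) pairs whose timestamps fall
    on multiples of `step` (for example 300 seconds for 5-minute data).
    """
    by_time = dict(points)
    return [(t, by_time.get(t, 0)) for t in range(start, end + 1, step)]
```

For example, `fill_zero([(0, 5), (600, 7)], 0, 600, 300)` yields `[(0, 5), (300, 0), (600, 7)]`, so the line drops to zero at the missing interval instead of running straight from 5 to 7.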
Datadog Graph Functions
I'm trying to analyze texts from movie scripts and need a way to grab the specific character lines. Character lines are easily visible because they are always centered and formatted like a block quote. Here's an example.
So I would want to get characters' blocks of lines. However, when I read the pdf with something like pdfplumber, it doesn't specify that there was any difference in formatting there, so it will print out something like:
--
CLEMENTINE
God, yes. You've saved my life! Brrr!
The waitress pours the coffee.
WAITRESS
You know what you want?
--
I don't want the "The waitress pours the coffee." line to be clumped in with the character's actual speaking lines. Is there any way (using pdfplumber or any other module) that I could extract that centering/changed margin somehow? I don't really know how else to specify that this text is different. It's easy to eyeball, but the program isn't picking up the difference.
Thanks!
Unfortunately, in PDF compilations you can throw all human concepts out of the pram.
ALL text is generally treated as an equal but some can be more so.
So there is no such thing as tabs, or centered since normally all lines are centered between their start and end.
So how many of those justified lines are also centered?
However, there is no flag for justification or for left or right alignment. Those terms are meaningless to a printer: it just blobs out big letters, small letters, or letters that may look like ALL CAPS, and printing has no need for words. Literally, a PDF is just "go here, go there, and put some characters or marks on the page."
If we load the URL for page 4 into a PDF editor we can see how it was constructed.
So it is unusual that the text is only ragged right (just like it would be from a line printer or typewriter); I had expected ragged left too. However, in either case there is no way to differentiate any one text line from another. The typewritten face is naturally one height, and thus only human intelligence can say what is dialogue and what is a stage direction.
So you ask how to tell the difference, and the answer is clear. Luckily, unlike other PDFs, this one has a semblance of indentation (very rare). It was built using Microsoft Word but follows stagecraft conventions: "Professional screenwriting software takes care of this by automatically tabbing down to a new line in dialogue. There may be small discrepancies between them but nothing to get too hung up on."
approaches her with a coffee pot.
CLEMENTINE
Hi, it's me again! My home away from...
It may vary from document to document but in the PDF copy linked above
++ 6 spaces is a Stage direction or slugline
& 27 spaces is a CHARACTER
& 17 spaces is a "Dialogue" but without any quote marks
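The indentation rule above could be sketched as follows. This assumes the text has been extracted with its leading spaces preserved (how to do that depends on the extraction tool), and that the 6/17/27 offsets hold for the document at hand. The names `classify` and `dialogue_blocks` and the tolerance of 2 spaces are hypothetical.

```python
# Indent widths observed in this particular PDF; other scripts may differ.
INDENT_KINDS = {6: "direction", 17: "dialogue", 27: "character"}


def classify(line, tolerance=2):
    """Label a line by its leading-space count, or return None for blank lines."""
    if not line.strip():
        return None
    indent = len(line) - len(line.lstrip(" "))
    nearest = min(INDENT_KINDS, key=lambda n: abs(n - indent))
    if abs(nearest - indent) > tolerance:
        return "unknown"
    return INDENT_KINDS[nearest]


def dialogue_blocks(lines):
    """Group dialogue lines under the most recent character name."""
    blocks = []
    speaking = False
    for line in lines:
        kind = classify(line)
        if kind == "character":
            blocks.append((line.strip(), []))
            speaking = True
        elif kind == "dialogue" and speaking:
            blocks[-1][1].append(line.strip())
        elif kind is not None:
            speaking = False  # a stage direction or unknown line ends the speech
    return [(name, " ".join(parts)) for name, parts in blocks]
```

Run on the excerpt from the question, this keeps "The waitress pours the coffee." out of Clementine's speech. It does not handle side-by-side dialogue, where two characters share a line.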
However, flies always turn up in the ointment, and here two characters speak side by side. The character names are therefore moved, and the dialogue starts where a stage direction (Mixed Case) or slugline (ALL CAPS) would be expected.
She scrambles in the window. Joel looks around, panicked.
^ JOEL VOICE-OVER
^ (whisper) I couldn't believe you did
Clementine. that. I was paralyzed with
^ fear.
I have a logic analyser project that records several hundred million 16bit values (~100-500 million) and I need to display anything from a few hundred samples to the entire capture as the user zooms.
When you zoom out the whole system gets a huge performance hit as it's loading a massive chunk from the file.
I just thought this morning that it would be more efficient to "stride" through the file at the user's screen resolution. You can't physically display anything between pixels anyway. This doesn't solve the massive file size hit in memory, though.
Is there a way I can take a huge data set and stream/chunk it down efficiently?
I was thinking streaming from start to start + view size by horiz resolution. This makes a very choppy zoom though.
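A minimal sketch of that striding idea, assuming the capture is a raw file of little-endian unsigned 16-bit samples with no header (the real file format is not stated). The function name and parameters are hypothetical.

```python
import struct


def strided_samples(path, start, count, width_px, sample_bytes=2):
    """Read roughly one sample per horizontal pixel from a window of the capture.

    Only the samples that will be drawn are read, so memory use depends on
    the screen width rather than on the size of the capture.
    """
    stride = max(1, count // width_px)
    out = []
    with open(path, "rb") as f:
        for i in range(start, start + count, stride):
            f.seek(i * sample_bytes)
            raw = f.read(sample_bytes)
            if len(raw) < sample_bytes:
                break  # reached the end of the file
            out.append(struct.unpack("<H", raw)[0])
    return out
```

One caveat of pure striding is that short pulses between the sampled positions disappear, which matters for a logic analyser. Keeping the minimum and maximum of each skipped range avoids that, at the cost of reading more data.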
Program uses python but I am open to calling something in c if it already exists.
Well, I don't know if this is actually a question about programming or about design overall.
For the "zooming" problem with visualizations, I suggest:
- Have a pre-computed/cached version for some zoom levels. Ideally, the gradation should be calculated based on user behaviour.
- When the user zooms in, you simultaneously:
  1. calculate the "proper" data, or load pre-computed aggregated data from the deeper zoom layer and crop it to your view frame
  2. cheat by rendering low-res data from the previous layer, or smooth it by some approximation (but make sure to somehow tell the user that the data is not finalized)
Aside from that, consider whether you can optimize the way you store the data. Trees may make your life way easier, both for partial disk reads/searches and for storing aggregated data.
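As an illustration of pre-computed zoom levels, the sketch below builds a min/max pyramid in memory. It is a toy version under stated assumptions: the names and the reduction factor of 4 are hypothetical, and for hundreds of millions of samples each level would need to live in a compact on-disk array rather than in Python lists.

```python
def build_pyramid(samples, factor=4):
    """Return a list of levels; each level holds (min, max) pairs, coarser each time."""
    levels = [[(s, s) for s in samples]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = []
        for i in range(0, len(prev), factor):
            group = prev[i:i + factor]
            nxt.append((min(lo for lo, _ in group), max(hi for _, hi in group)))
        levels.append(nxt)
    return levels


def pick_level(levels, first, last, width_px, factor=4):
    """Pick the coarsest level with at least width_px buckets in samples [first, last)."""
    span = last - first
    level = 0
    while level + 1 < len(levels) and span // (factor ** (level + 1)) >= width_px:
        level += 1
    scale = factor ** level
    # Crop the chosen level to the view frame.
    return level, levels[level][first // scale:(last + scale - 1) // scale]
```

Storing the minimum and maximum of each bucket means a short pulse still shows up as a vertical bar when zoomed out, instead of vanishing between pixels.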
In my opinion, there is no point to display even a few hundred samples unless they form some kind of image/shape. I guess one can look at hundred numbers if they are properly structured (colored). Several hundred - doubt it - here you replace actual data with some visualization (plots, charts, maps, ...).
To approach the problem, you may define some rule to stop displaying actual data at all. For instance, if digit height becomes less than, say, 10 pixels, you display some kind of message ("selected numbers are from rows 200...300, columns 400..500") or some graphical alternative with corner coordinates and the amount of numbers.
I have (financial) data that I get in real time using an API, and I'd like to display it in a customised manner (a bit like the result of JavaScript code). For example, if I want to display 10x10 prices, update them as I receive the data, and colour each one green if it is higher than the previous price and red if it is lower, how should I do that, and what should I use?
I assume there exists a way to do so using Python, but I can't formulate my request briefly, so I only get results that confuse me more when using search engines...
Could someone help me by explaining where I can get started with that?
I'll give you an overview because what you want is a generalized approach and most UI packages (if not all) should be able to handle this. First, you need to pick a package to write your UI with. There are a number of these available for Python: see here. I'm not sure what your other requirements are so you'll have to choose the one you want yourself. Once you've picked it out, you'll basically go through and create a grid structure composed of individual cells. Each cell will contain a currency value. You'll then add an event for each cell that captures an "on-change" event for the value in the cell. If the new value is greater than the old one, you color it green. If it's less, color it red. You may also want to add a timer for each cell so that the color fades after a period of time.
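As one possible instance of that approach, here is a sketch using tkinter, which ships with most Python installations. The answer does not prescribe a package, so this choice is an assumption. The names `price_color` and `PriceGrid` are hypothetical, and the fading timer is left out.

```python
def price_color(old, new):
    """Return the display color for a price given the previous price (None if there is none)."""
    if old is None or new == old:
        return "black"
    return "green" if new > old else "red"


class PriceGrid:
    """A rows x cols grid of labels that recolors each cell when its price changes."""

    def __init__(self, parent, rows=10, cols=10):
        import tkinter as tk  # imported lazily so price_color works without a GUI

        self.last = {}    # (row, col) -> previous price
        self.labels = {}  # (row, col) -> tkinter Label
        for r in range(rows):
            for c in range(cols):
                label = tk.Label(parent, text="-", width=10)
                label.grid(row=r, column=c)
                self.labels[(r, c)] = label

    def update(self, row, col, price):
        """Show a new price in one cell and color it relative to the old price."""
        key = (row, col)
        color = price_color(self.last.get(key), price)
        self.labels[key].config(text=f"{price:.2f}", fg=color)
        self.last[key] = price
```

To use it, create a window with `root = tkinter.Tk()`, build `grid = PriceGrid(root)`, call `grid.update(row, col, price)` whenever the API delivers a value (for example from a callback scheduled with `root.after`), and then start `root.mainloop()`.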