How to find table grid lines in PDF files? - python

To more accurately extract table-like data embedded within table cells, I would like to be able to identify table cell boundaries in PDFs like this:
I have tried extracting such tables using Camelot, pdfplumber, and PyMuPDF, with varying degrees of success. But due to the inconsistency of the PDFs we receive, I'm not able to reliably get accurate results, even when specifying the table bounds.
I find that the results are better if I extract each table cell individually, by specifying the cell boundaries explicitly. I have tested this by manually entering the boundaries, which I get using Camelot's visual debugging tool.
My challenge is how to identify table cell boundaries programmatically, since the table may start anywhere on the page, and the cells are of variable vertical height.
It seems to me that one could do this by finding the coordinates of the row separator lines, which are so obvious visually to a human. But I have not figured out how to find these lines using python tools. Is this possible, or are there other/better ways to solve this problem?

I recently had a similar use case where I needed to work out the boundaries in code. For your use case, there are two options:
1. If you want the boundary of the entire table, you can do the following:
import pdfplumber
pdf = pdfplumber.open('file_name.pdf')
p0 = pdf.pages[req_page]  # go to the required page (req_page is a 0-based index)
finder = p0.debug_tablefinder()  # TableFinder holding the tables pdfplumber identifies
req_table = finder.tables[i]  # suppose you want to use the i-th table
req_table.bbox  # bounding box of the table: (x0, top, x1, bottom)
2. If you want to visit each cell in the table and extract, say, the words from it:
import pdfplumber
pdf = pdfplumber.open('file_name.pdf')
p0 = pdf.pages[req_page]  # go to the required page (req_page is a 0-based index)
finder = p0.debug_tablefinder()  # TableFinder holding the tables pdfplumber identifies
req_table = finder.tables[i]  # suppose you want to use the i-th table
cells = req_table.cells  # list of cell bounding boxes, each (x0, top, x1, bottom)
for cell in cells[start:stop]:  # iterate through the required cells
    words = p0.crop(cell).extract_words()  # extract the words inside this cell
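If the table finder misses some cells, the row separator lines mentioned in the question can sometimes be read directly from the page's vector graphics. The following is a minimal sketch, assuming the separators are drawn as line objects of the kind pdfplumber exposes through page.lines (dicts with x0, x1, top and bottom keys). Some PDFs draw them as thin rectangles instead, in which case page.rects would be the place to look. The helper names row_boundaries and row_bboxes, and the min_width and tol thresholds, are hypothetical and chosen only for illustration.

```python
def row_boundaries(lines, min_width=50.0, tol=1.0):
    """Return the sorted 'top' coordinates of long horizontal lines.

    `lines` is a list of dicts with x0, x1, top and bottom keys
    (assumed to look like the entries of pdfplumber's page.lines).
    """
    tops = sorted(
        ln["top"]
        for ln in lines
        # keep lines that are (nearly) flat and wide enough to be a row separator
        if abs(ln["bottom"] - ln["top"]) <= tol and (ln["x1"] - ln["x0"]) >= min_width
    )
    merged = []
    for t in tops:
        # merge separators that sit within `tol` of the previous one
        if not merged or t - merged[-1] > tol:
            merged.append(t)
    return merged


def row_bboxes(tops, x0, x1):
    """Turn consecutive separators into (x0, top, x1, bottom) boxes."""
    return [(x0, a, x1, b) for a, b in zip(tops, tops[1:])]


# Made-up line objects standing in for page.lines
demo_lines = [
    {"x0": 50, "x1": 550, "top": 100.0, "bottom": 100.0},
    {"x0": 50, "x1": 550, "top": 100.4, "bottom": 100.4},  # near-duplicate
    {"x0": 50, "x1": 550, "top": 140.0, "bottom": 140.0},
    {"x0": 50, "x1": 50, "top": 100.0, "bottom": 180.0},   # vertical line, ignored
    {"x0": 50, "x1": 550, "top": 180.0, "bottom": 180.0},
]
demo_tops = row_boundaries(demo_lines)
demo_boxes = row_bboxes(demo_tops, 50, 550)
```

With a real page, one would pass page.lines to row_boundaries and hand each resulting box to page.crop(), in the same way as the cell loop above.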

Related

Is it possible to calculate column totals/sums within a python panel tabulator table?

I can't seem to find anyone who has been able to do this.
I've searched all of the Panel Tabulator documentation, and it seems to suggest this may be possible through the underlying Tabulator API, but I can't figure out how.
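One dependable approach is to compute the totals with pandas and append them as a final row before handing the data to pn.widgets.Tabulator. (The underlying Tabulator JavaScript library does offer topCalc and bottomCalc column options such as "sum", "avg", "min" and "max"; whether Panel passes these through its configuration parameter is something to verify against your Panel version.)
import pandas as pd
import panel as pn
pn.extension("tabulator")
# Load the data into a DataFrame
df = pd.read_csv("data.csv")
# Compute the totals of the numeric columns and append them as a final row
totals_row = pd.DataFrame([df.sum(numeric_only=True)])
table_df = pd.concat([df, totals_row], ignore_index=True)
# Display the table, including the totals row
table = pn.widgets.Tabulator(table_df)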

(python) using a csv file to store details of a map

I am creating a text-based game in Python. In it, I will be using a CSV file to store the different tiles on the map. I would like to know what code I would need to request a map tile by its coordinates.
For example, if I were to create a tile with the coordinates x = 5, y = 6, it would store the information (GRASS1S2s1w, for example) in the fifth column and the sixth row.
I would also like to know how to read the specific cell in which the data is stored.
Any alternative ways of doing this (not CSV) will be ignored. This is for a school project and I am too far through to change from CSV (I would have to change a lot of words in my plan).
Note: GRASS1S2I3Sc means 'grass tile' (GRASS), "stone" (1S), "scrap" (2S) and "wood" (1W)
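Make a 2D list (a list of rows) containing all the information. Because each inner list is one row of the CSV file, the row index (y) comes first and the column index (x) second, both counted from 0. That way you can access the value at a specific coordinate like
grid[y][x]
Then save the list with csv.writer.
You can read the existing CSV file back into a list with csv.reader to access the information.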
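A minimal sketch of the approach described above, using only the standard csv module. The function names, the map.csv filename and the 8 by 8 map size are hypothetical. Coordinates are counted from 1 here so that x = 5, y = 6 means the fifth column and sixth row, as in the question.

```python
import csv


def load_map(path):
    """Read the CSV file into a list of rows (each row is a list of strings)."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f)]


def save_map(path, grid):
    """Write the list of rows back to the CSV file."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(grid)


def get_tile(grid, x, y):
    """Return the tile at column x, row y (both counted from 1)."""
    return grid[y - 1][x - 1]


def set_tile(grid, x, y, value):
    """Store a tile code at column x, row y (both counted from 1)."""
    grid[y - 1][x - 1] = value


if __name__ == "__main__":
    # Build an empty 8 x 8 map, place one tile, save it and read it back.
    grid = [["" for _ in range(8)] for _ in range(8)]
    set_tile(grid, 5, 6, "GRASS1S2s1w")
    save_map("map.csv", grid)
    print(get_tile(load_map("map.csv"), 5, 6))
```

Keeping the coordinate conversion inside get_tile and set_tile means the rest of the game never has to remember that the row index comes first.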

Adding a border to PDF from ArcMap export using Arcpy

I'm trying to add a border to a PDF exported from ArcMap, using arcpy. I've not been able to find the answer to this anywhere, nor does arcpy seem to have any documentation on this.
Oddly enough, the map layout from which I'm exporting already has a black border around it, but when I export to a PDF, there is no border. My code here:
#Export to PDF
currentMXD_Map = (r"myMap.mxd")
mxd_Map = arcpy.mapping.MapDocument(currentMXD_Map)
df_Map = arcpy.mapping.ListDataFrames(mxd_Map,"*")[0]
arcpy.mapping.ExportToPDF(mxd_Map, r"myMap.pdf", df_Map,
                          df_export_width=3300,
                          df_export_height=2550)
mxd_Map.save()
I would think arcpy.mapping has a method to add border to a PDF export (or in the map layout). What can I try next?
arcpy is not designed for map or layout authoring. It is designed to manipulate existing layouts or maps. Here's a quote from the documentation:
The arcpy.mapping module was designed so that it can be used to modify
existing elements within already existing map documents (.mxd) or
layer files (.lyr). In other words, it helps with the automation of
existing features but it can't be used to author new objects.
The easiest way to "add" a border is to have a border already in your map layout with size set to 0 or positioned off screen and then to use arcpy to make it visible or move it where you want. It seems you already have the border so maybe it's not in the right place or is set to 0 width.
Either way, you can access the border element by giving it a name in ArcMap and then retrieving it with ListLayoutElements.
First, fill in the "Element Name" field in the element's properties in ArcMap. Notice how I've set the height and width to 0 so that it won't normally be visible.
Then access the element with ListLayoutElements:
# We want the first border element because we are assuming there is only one.
# Iterate or change the index depending on your scenario.
# mxd is your MapDocument (mxd_Map in the question); x and y are the desired width and height.
borderElement = arcpy.mapping.ListLayoutElements(mxd, "GRAPHIC_ELEMENT", "border_element")[0]
borderElement.elementHeight = y
borderElement.elementWidth = x

Using Python to Recognize Fixed Width Text

I am provided with a text file that contains many fixed width tables as well as accompanying text that does not belong to any table. Say my file looks something like the following:
The following is a fixed width table. This paragraph contains some description or summary of the results in the table below.
[Insert Fixed Width Table Here]
This is a second paragraph describing what is in the second fixed width table.
[Insert Second Fixed Width Table Here]
This is a third paragraph describing a third table.
[Third Table Goes Here]
...
Ideally, I want to parse this text file into something like a list of tuples, where each tuple contains the description of the table as its first element (so a string such as "This is a third paragraph describing a third table."), and a pandas data frame containing the actual table data as the second element.
Now I already know that the pandas package has a read_fwf that can intelligently parse fixed width text into a data frame. However, before I can call read_fwf, I have to separate the contents of the fixed width tables from the rest of the text first. Is there any way that I can easily use python to figure out where my fixed width tables begin and where they end?
The paragraphs of text describing the tables come in many different forms, so I can't "easily" label certain lines as being paragraph lines and not table lines based on what words they contain. Also, there are no extra line breaks between the end of each table and the beginning of the text describing the next table, so I can't use the existence of an empty line to figure out where a table ends either. Instead, I have to actually look at the contents of the text and see if the text is "fixed width". (I guess I could look for two or more spaces right next to each other to figure out whether a line is possibly fixed width, but that seems like an imperfect solution because plain text could contain two or more consecutive spaces as well.)
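The consecutive-spaces heuristic mentioned above can be made somewhat more robust by requiring several wide gaps per line and at least two such lines in a row. The following is a rough sketch under those assumptions, not a definitive parser. The names looks_tabular and split_sections and the min_gaps and min_rows thresholds are hypothetical, and real files may need different thresholds.

```python
import re

# A run of two or more spaces between non-space characters suggests column padding.
GAP = re.compile(r"\S {2,}(?=\S)")


def looks_tabular(line, min_gaps=2):
    """Heuristic: a line with several wide internal gaps is probably a table row."""
    return len(GAP.findall(line)) >= min_gaps


def split_sections(text, min_rows=2):
    """Group lines into (description, table_lines) tuples.

    Prose that is not followed by a table is dropped.
    """
    sections = []
    description, table = [], []
    for line in text.splitlines():
        if looks_tabular(line):
            table.append(line)
            continue
        if len(table) >= min_rows:
            # a prose line closes the table collected so far
            sections.append((" ".join(description), table))
            description = []
        elif table:
            # too few rows to count as a table: treat them as prose
            description.extend(row.strip() for row in table)
        table = []
        if line.strip():
            description.append(line.strip())
    if len(table) >= min_rows:
        sections.append((" ".join(description), table))
    return sections


sample = "\n".join([
    "This paragraph describes the first table.",
    "name    qty   price",
    "apple     3    1.20",
    "pear     12    0.80",
    "This is a second paragraph describing the second table.",
    "id   score   grade",
    "1      9.5       A",
])
sections = split_sections(sample)
```

Each table's lines can then be joined with newlines and passed to pandas.read_fwf through io.StringIO to obtain the data frame for that tuple.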

set_style in Python xlwt library does not work for cells containing text

I'm working with an Excel spreadsheet in Python. I want to change the background color of an entire row if a condition is not met. However, after running my code, only the background color of the empty cells in those rows (the ones not containing any characters) is changed. My first 9 columns contain information, and my code only changes the background color from column J to Z.
from xlrd import open_workbook
from xlwt import Workbook, easyxf
Error_Style = easyxf('pattern: pattern solid, fore_colour red;',)
[...]
else:
    w_sheet.row(row_index).set_style(Error_Style)
[...]
I was wondering if I am using the wrong pattern with easyxf.
Whenever you write a cell using xlwt, you write not only the value but also the style for that cell, which overrides any previous styling in that cell. You unfortunately can't just "inherit" the row styling. For now, the way to achieve what you want is to create styles that you will use when writing to the affected rows, and include those styles at the same time that you are writing the values.
For example, I have a report in which the backgrounds are supposed to alternate between white and gray. Though I'd like to just set all the even-numbered rows to gray independently of the values, instead I have to have a pair of styles, and choose the appropriate one at the appropriate time. It goes something like this:
styles = (easyxf(), easyxf('pattern: pattern solid, fore_color gray25'))
for rx, record in enumerate(records, start=1):
    style = styles[rx % 2]  # Do your own conditional style selection here
    for cx, value in enumerate(record):
        ws.write(rx, cx, value, style)
In actuality, I have more than just one pair of styles, I have several pairs. (Various different columns have different numeric formats, some percentages, some dates, etc.) So for my own case, it's even more complicated than what I've shown above. But hopefully, this illustrates what I mean by "choose the appropriate style at the appropriate time".
