Using Python to Recognize Fixed Width Text - python

I am provided with a text file that contains many fixed width tables as well as accompanying text that does not belong to any table. Say my file looks something like the following:
The following is a fixed width table. This paragraph contains some description or summary of the results in the table below.
[Insert Fixed Width Table Here]
This is a second paragraph describing what is in the second fixed width table.
[Insert Second Fixed Width Table Here]
This is a third paragraph describing a third table.
[Third Table Goes Here]
...
Ideally, I want to parse this text file into something like a list of tuples, where each tuple contains the description of the table as its first element (so a string such as "This is a third paragraph describing a third table."), and a pandas data frame containing the actual table data as the second element.
I already know that the pandas package has a read_fwf function that can intelligently parse fixed width text into a data frame. However, before I can call read_fwf, I have to separate the contents of the fixed width tables from the rest of the text. Is there an easy way to use Python to figure out where my fixed width tables begin and end?
The paragraphs of text describing the tables come in many different forms, so I can't easily label lines as paragraph lines rather than table lines based on the words they contain. Also, there are no extra line breaks between a table and the start of the text describing the next table, so I can't use an empty line to find where a table ends either. Instead, I have to look at the contents of the text and see whether it is "fixed width". (I guess I could look for two or more consecutive spaces to decide whether a line is possibly fixed width, but that seems like an imperfect solution because plain text could contain two or more consecutive spaces as well.)
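The two-or-more-spaces idea mentioned above can be sketched as follows. This is only a heuristic, not a reliable detector: it assumes that table rows contain at least one interior run of two or more spaces, that prose lines do not, and that a table has at least two consecutive rows. The names looks_tabular, split_blocks and min_rows are hypothetical, introduced only for this illustration.

```python
import re

# Two or more spaces sitting between two non-space characters.
GAP = re.compile(r"\S {2,}\S")


def looks_tabular(line):
    """Heuristic: the line has at least one interior run of 2+ spaces."""
    return bool(GAP.search(line))


def split_blocks(lines, min_rows=2):
    """Return (description, table_lines) pairs from a mixed list of lines."""
    blocks = []  # finished (description, table_lines) pairs
    desc = []    # prose lines collected for the current block
    table = []   # candidate table lines collected for the current block
    for line in lines:
        if looks_tabular(line):
            table.append(line)
            continue
        # A non-tabular line ends any table in progress.
        if len(table) >= min_rows:
            blocks.append((" ".join(desc), table))
            desc = []
        else:
            # Too short to count as a table: treat those lines as prose.
            desc.extend(t.strip() for t in table)
        table = []
        if line.strip():
            desc.append(line.strip())
    # Flush a table that runs to the end of the input.
    if len(table) >= min_rows:
        blocks.append((" ".join(desc), table))
    return blocks
```

Each table_lines list could then be handed to pandas, for example with pandas.read_fwf(io.StringIO("\n".join(table_lines))). Prose that happens to contain double spaces on consecutive lines would still be misclassified, so the result is worth checking by hand.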

How to find table grid lines in PDF files?

To more accurately extract table-like data embedded within table cells, I would like to be able to identify table cell boundaries in PDFs like this:
I have tried extracting such tables using Camelot, pdfplumber, and PyMuPDF, with varying degrees of success. But due to the inconsistency of the PDFs we receive, I'm not able to reliably get accurate results, even when specifying the table bounds.
I find that the results are better if I extract each table cell individually, by specifying the cell boundaries explicitly. I have tested this by manually entering the boundaries, which I get using Camelot's visual debugging tool.
My challenge is how to identify table cell boundaries programmatically, since the table may start anywhere on the page, and the cells are of variable vertical height.
It seems to me that one could do this by finding the coordinates of the row separator lines, which are visually obvious to a human. But I have not figured out how to find these lines using Python tools. Is this possible, or are there other or better ways to solve this problem?
I recently had a similar use case where I needed to figure out the boundaries via code itself. For your use case, there are two options:
If you want to identify the boundary of the entire table, you can do the following:
import pdfplumber
pdf = pdfplumber.open('file_name.pdf')
p0 = pdf.pages[req_page] # go to the required page
tables = p0.debug_tablefinder() # list of tables which pdfplumber identifies
req_table = tables.tables[i] # Suppose you want to use ith table
req_table.bbox # gives you the bounding box of the table (coordinates)
If you want to visit each cell in the table and extract, say, words from them:
import pdfplumber
pdf = pdfplumber.open('file_name.pdf')
p0 = pdf.pages[req_page] # go to the required page
tables = p0.debug_tablefinder() # list of tables which pdfplumber identifies
req_table = tables.tables[i] # Suppose you want to use ith table
cells = req_table.cells # gives list of all cells in that table
for cell in cells[i:j]: # iterating through the required cells
    p0.crop(cell).extract_words() # extract the words

Using Python & NLP, how can I extract certain text strings & corresponding numbers preceding the strings from Excel column having a lot of free text?

I am relatively new to Python and very new to NLP (and nltk). I have searched the net for guidance but have not found a complete solution. Unfortunately, the sparse code I have been playing with is on another network, but I am including an example spreadsheet. I would like to get suggested steps in plain English (more detailed than I have below) so I can first try to script it myself in Python 3. Unless it would simply be easier for you to just help with the scripting... in which case, thank you.
Problem: A few columns of an otherwise robust spreadsheet are very unstructured with anywhere from 500-5000 English characters that tell a story. I need to essentially make it a bit more structured by pulling out the quantifiable data. I need to:
1) Search for a string in the user supplied unstructured free text column (The user inputs the column header) (I think I am doing this right)
2) Make that string a NEW column header in Excel (I think I am doing this right)
3) Grab the number before the string (This is where I am getting stuck. And as you will see in the sheet, sometimes there is no space between the number and text and of course, sometimes there are misspellings)
4) Put that number in the NEW column on the same row (Have not gotten to this step yet)
I will have to do this repeatedly for multiple keywords but I can figure that part out, I believe, with a loop or something. Thank you very much for your time and expertise...
If I'm understanding this correctly, first we need to obtain the numbers from the string of text.
cell_val = sheet1wb1.cell(row=rowNum,column=4).value
This will create a list containing every number that appears as a separate whitespace-delimited token in the string:
new_ = [int(s) for s in cell_val.split() if s.isdigit()]
print(new_)
You can then use the list to assign values to the new column. For example, write the first number in the list to the 5th column:
sheet1wb1.cell(row=rowNum, column=5).value = str(new_[0])  # index 0 is the first number
I think I have found what I am looking for. https://community.esri.com/thread/86096 has 3 or 4 scripts that seem to do the trick. Thank you..!
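For step 3 of the question (grabbing the number before a keyword, even when there is no space between them), a regular expression may be more forgiving than split() and isdigit(). The following is a minimal sketch, assuming the number is a plain integer written directly before the keyword; number_before is a hypothetical helper name, and misspelled keywords are not handled.

```python
import re


def number_before(text, keyword):
    """Return the integer immediately preceding `keyword` in `text`, or None."""
    # (\d+) captures the digits, \s* allows zero or more spaces before the
    # keyword, and re.IGNORECASE tolerates differences in capitalisation.
    pattern = re.compile(r"(\d+)\s*" + re.escape(keyword), re.IGNORECASE)
    match = pattern.search(text)
    return int(match.group(1)) if match else None
```

For example, number_before("Found 12 apples and 3pears", "pears") returns 3, and the function returns None when the keyword is absent, so the caller can leave the new cell empty in that case.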

Offset in reading columns in a textfile with matplotlib

I have a text file containing an array of numbers from which I want to plot certain columns vs other columns. I defined a column function so I can assign a name to each column and then plot them, as in this sample code:
def column(matrix, i):
    return [float(row.split()[i]) for row in matrix]
Db = open('ResolutionEffects', 'r')
HIcontour = column(Db, 1)
Db.seek(1)
However when I display a column in my terminal to check that Python is indeed reading the right one, it appears that the first value of the column (as returned in my terminal) is actually the first value of the NEXT column in the text file. All the other numbers are from the correct column. There are no blank spaces or lines in the text file. As far as I can tell this offset happens to every column after the first one.
If anyone can tell why this is happening, or find a more robust way to read columns in text files I would greatly appreciate it.
Indeed, I found np.loadtxt to be a lot more robust. After converting my text file to a data file (.dat) I simply use this:
import numpy as np
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots()
a = np.loadtxt('ResolutionEffects.dat', usecols=(0, 1, 11, 12))
ax1.plot(a[:, 0], a[:, 1], 'dk', label='HI')
ax1.plot(a[:, 2], a[:, 3], 'dr', label='CO')
No weird offsets or bugs anymore :) Thanks Ajean and jedwards!
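A likely cause of the original offset is the call Db.seek(1), which rewinds the file to byte offset 1 rather than to the start (offset 0). The sketch below reproduces the symptom under the assumption that the first field of the first row is a single character; the io.StringIO object and its values are hypothetical stand-ins for the real file.

```python
import io


def column(rows, i):
    """Return column i of whitespace-separated rows as floats."""
    return [float(row.split()[i]) for row in rows]


# Stand-in for the data file (hypothetical values).
db = io.StringIO("1 10 100\n2 20 200\n3 30 300\n")

first = column(db, 1)    # iterating consumes the whole file

db.seek(1)               # rewinds to offset 1, so the leading "1" is skipped
shifted = column(db, 1)  # the first row now splits as ["10", "100"]

db.seek(0)               # rewinds to the true start of the file
correct = column(db, 1)

print(shifted)  # first value comes from the next column
print(correct)
```

With seek(1) only the first row loses its leading field, which matches the description of a single wrong value at the top of every column read after the first one.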

How to insert a carriage return in a ReportLab paragraph?

Is there a way to insert a carriage return in a Paragraph in ReportLab? I am trying to concatenate a "\n" to my paragraph string but this isn't working.
Title = Paragraph("Title" + "\n" + "Page", myStyle)
I want to do this since I am putting names into cells and want to control how many names lie on a line in a cell (ideally 1). One cell can contain multiple names, but within that cell I would like each name to be on its own line, hence the need to insert a new line.
At some point I'm getting a "flowable too large for frame" error (I think it has something to do with a table being too large or having too many merged rows). The only way I can think of to avoid this is to have only one name per line in a cell, so that I can limit table size based on a count of names and split the tables into smaller tables.
Seems like there has to be a much cleaner way of doing this. Any suggestions?
If you want to start a new line within a paragraph (regardless of whether you are in a table or not), you can use the <br/> tag. This should work for you as well:
Title = Paragraph("Title" + "<br/>" + "Page", myStyle)
(credit: Reportlab - how to introduce line break if the paragraph is too long for a line)
A Paragraph is a Flowable in reportlab. A newline character will not work within a flowable in the way you want it to. If your Paragraph is within a table (as you suggest), you might consider creating a cell without a flowable. For example, you might do this:
data = [['Title\nPage', 'Name', 'Exists'], # note the newline character
        ['', 'George', 'True']]
t = Table(data, style=style_)
...
The above example will make the first data cell two lines tall (but still a single cell).
If you really need to preserve the style of the Paragraph flowable, however, you could insert two paragraphs into the same cell:
title1 = Paragraph("Title", myStyle)
title2 = Paragraph("Page", myStyle)
cell = [title1, title2] # put this in a single cell of your table

how to group objects in reportlab, so that they stay together across new pages

I'm generating some PDF files using reportlab. I have a certain section that is repeated. It consists of a header and a table:
Story.append(Paragraph(header_string, styleH))
Story.append(table)
How can I group the paragraph with the table (in LaTeX I would put them into the same environment) so that in case of a page break, the paragraph and table stay together? Currently the paragraph sometimes floats at the end of one page and the table starts at the top of the next page.
You can try to put them together in a KeepTogether flowable, like so:
Story.append(KeepTogether([Paragraph(header_string, styleH), table]))
However be aware that, last I checked, the implementation was not perfect and would still split up items too frequently. I know it does a good job of keeping a single flowable together that would otherwise split, like if you were to say:
Story.append(KeepTogether(Paragraph(header_string, styleH)))
then that paragraph would not get split unless it was impossible for it not to be.
If KeepTogether doesn't work for you, I'd suggest creating a custom Flowable with your paragraph and table inside it and then during layout make sure your custom Flowable subclass does not allow itself to be split up.
This is the solution that I found by going through the reportlab source code:
paragraph = Paragraph(header_string, styleH)
paragraph.keepWithNext = True
Story.append(paragraph)
Story.append(table)
Using a ParagraphStyle might actually be better, so I figured I'd add it to this very old answer.
I found this in their changelog after seeing @memyself's answer.
* `KeepWithNext` improved:
Paragraph styles have long had an attribute keepWithNext, but this was
buggy when set to True. We believe this is fixed now. keepWithNext is important
for widows and orphans control; you typically set it to True on headings, to
ensure at least one paragraph appears after the heading and that you don't get
headings alone at the bottom of a column.
header = ParagraphStyle(name='Heading1', parent=normal, fontSize=14, leading=19,
                        spaceAfter=6, keepWithNext=1)
