Extracting tables from PDF using tabula-py fails to properly detect rows - python

Problem
I want to extract a 70-page vocabulary table from a PDF and turn it into a CSV to use in [any vocabulary learning app].
Tabula-py and its read_pdf function are a popular solution for extracting tables, and it detected the columns well without any fine-tuning. However, it had difficulties with multi-line rows, splitting each line of a cell into a separate row.
For example, in the PDF you would only have columns 2 and 3. Since tables on Stack Overflow don't seem to allow multi-line content either, I added row numbers; just merge the three lines marked as row 1 in your head.
Row number | German                      | Latin
-----------|-----------------------------|-----------------------------
1          | First word                  | Translation for first word
1          | with many lines of content  | [phonetic vocabulary thingy]
1          | and more lines              |
2          | Second word                 | Translation for second word
Instead of fine-tuning the read_pdf parameters, are there ways around that?

You may want to use PyMuPDF. Since your table cells are bounded by ruling lines, this is a relatively easy case.
I have published a script to answer a similar question here.
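For a rough idea, here is a minimal sketch of that approach, assuming a recent PyMuPDF (1.23+) with the built-in page.find_tables(); the file names are placeholders:
import csv
import fitz  # PyMuPDF

doc = fitz.open("vocabulary.pdf")  # placeholder file name
all_rows = []
for page in doc:
    # find_tables() uses the ruling lines around the cells,
    # so a multi-line cell stays inside one row
    for table in page.find_tables().tables:
        all_rows.extend(table.extract())  # one list of cell strings per row
with open("vocabulary.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter=";").writerows(all_rows)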

Possible solution
Instead of experimenting with tabula-py, which is perfectly legitimate of course, you can export the PDF from Adobe Acrobat using File -> Export a PDF -> HTML Web Page.
You then read it using
import pandas as pd
dfs = pd.read_html("file.html", header=0, encoding='utf-8')
to get a list of pandas dataframes. You could also use BeautifulSoup4 or similar solutions to extract the tables.
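If you prefer BeautifulSoup4, a minimal sketch of the same extraction could look like this (assuming the exported file is file.html; read_html is the shorter route):
from bs4 import BeautifulSoup

with open("file.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")
tables = []
for table in soup.find_all("table"):
    # One list of cell strings per table row
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in table.find_all("tr")]
    tables.append(rows)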
To match tables with the same column names (e.g., in a vocabulary table) and save them as csv, you can do this:
from collections import defaultdict

unique_columns_to_dataframes = defaultdict(list)
# Join the column names into a single string: a hashable key for the dictionary.
possible_column_variations = [("%%".join(list(df.columns.values)), i) for i, df in enumerate(dfs)]
for k, v in possible_column_variations:
    unique_columns_to_dataframes[k].append(v)

for k, v in unique_columns_to_dataframes.items():
    new_df = pd.concat([dfs[i] for i in v])
    new_df.reset_index(drop=True, inplace=True)
    # Save under a unique name hashed from the column names; summing character codes
    # is not collision-free, but unlikely to collide for a small number of tables.
    new_df.to_csv("Df_" + str(sum([ord(c) for c in k])) + ".csv", index=False, sep=";", encoding='utf-8')

Related

Extraction of complete rows from CSV using a list, when we don't know the row indices

Can somebody help me solve the problem below?
I have a relatively large CSV with over 1 million rows x 4000 columns. Case ID is one of the first column headers in the CSV. I need to extract the complete rows belonging to a few case IDs, which are documented in a list as faulty IDs.
Note: I don't know the row indices of the required case IDs.
Example: the CSV is production_data.csv and the faulty IDs are faulty_Id = [50055, 72525, 82998, 1555558].
We need to extract the complete rows for these IDs.
Best Regards
If your faulty ID column is present as a header in the CSV file, you can use pandas' read_csv, set that column as the index, and extract rows based on the faulty IDs. For more info, attach sample data of the CSV.
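A minimal sketch of that idea, reading in chunks so the 1M x 4000 file never has to fit in memory at once ("Case ID" is taken from the question; adjust it to the real header):
import pandas as pd

faulty_ids = [50055, 72525, 82998, 1555558]
matches = []
for chunk in pd.read_csv("production_data.csv", chunksize=100000):
    # Keep only rows whose Case ID is in the faulty list
    matches.append(chunk[chunk["Case ID"].isin(faulty_ids)])
pd.concat(matches).to_csv("faulty_rows.csv", index=False)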
"CSV, which is relatively large with over 1 million rows X 4000 columns"
As CSVs are just text files, and this one is probably too big to feasibly load as a whole, I suggest using the built-in fileinput module. If the ID is the first column, create extractfaults.py as follows:
import fileinput

faulty = ["50055", "72525", "82998", "1555558"]
for line in fileinput.input():
    # fileinput line numbers start at 1, so this keeps the header line
    if fileinput.lineno() == 1:
        print(line, end='')
    elif line.split(",", 1)[0] in faulty:
        print(line, end='')
and use it following way
python extractfaults.py data.csv > faultdata.csv
Explanation: keep lines which are either the first line (the header) or have one of the provided IDs (I used the optional second .split argument to limit the number of splits to 1). Note the usage of end='', as fileinput keeps the original newlines. My solution assumes that IDs are not quoted and the ID is the first column; if either of these does not hold, feel free to adjust the code to your purposes.
The best way for you is to use a database like Postgres or MySQL. You can copy your data into the database first and then easily operate on rows and columns. A flat file is not the best solution in your case, since you need to load all the data from the file into memory to process it, and opening such a large file takes a lot of time as well.
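As a rough sketch of that idea using Python's built-in sqlite3 standing in for Postgres/MySQL (the column name case_id is hypothetical, and note that a 4000-column table may exceed default column-count limits in some databases):
import sqlite3
import pandas as pd

conn = sqlite3.connect("production.db")
# Copy the CSV into the database in chunks; it never sits in memory whole
for chunk in pd.read_csv("production_data.csv", chunksize=100000):
    chunk.to_sql("production", conn, if_exists="append", index=False)

faulty_ids = [50055, 72525, 82998, 1555558]
placeholders = ",".join("?" * len(faulty_ids))
rows = pd.read_sql_query(
    "SELECT * FROM production WHERE case_id IN (%s)" % placeholders,
    conn, params=faulty_ids)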

How to make read_csv more flexible with numbers and whitespace

I want to read a .txt file with pandas, and the problem is that the separator/delimiter consists of a number followed by a minimum of two blanks.
I already tried something similar to this code (How to make separator in pandas read_csv more flexible wrt whitespace?):
pd.read_csv("whitespace.txt", header=None, delimiter=r"\s+")
This only works if the separator is one blank or more. So I adjusted it to the following:
delimiter=r"\d\s\s+"
But this separates my dataframe whenever it sees two blanks or more, whereas I strictly need the number followed by at least two blanks. Does anyone have an idea how to fix it?
My data looks as follows:
I am an example of a dataframe
I have Problems to get read
100,00
So How can I read it
20,00
so the first row should be:
I am an example of a dataframe I have Problems to get read 100,00
followed by the second row:
So How can I read it 20,00
I'd try it like this: I'd manipulate the text file before attempting to parse it into a dataframe, as follows:
import pandas as pd
import re

with open("whitespace.txt", "r") as f:
    g = f.read().replace("\n", " ")
# Append '#' after every number like 100,00 to mark the end of each record
prepared_text = re.sub(r'(\d+,\d+)', r'\1#', g)
df = pd.DataFrame({'My columns': prepared_text.split('#')})
print(df)
This gives the following:
My columns
0 I am an example of a dataframe I have Problems...
1 So How can I read it 20,00
2
I guess this would suffice as long as the input file isn't too large, but using the re module and substitution gives you the control you seek.
The (\d+,\d+) parentheses mark a group which we want to match. We're basically matching any of your numbers in your text file.
Then we use \1, which is called a backreference: it refers back to the matched group when specifying the replacement. So each matched number, e.g. 100,00, is replaced by itself followed by #.
Then we use the inserted character as a delimiter.
There are some good examples here:
https://lzone.de/examples/Python%20re.sub

Finding most and least frequent rows of a text file in python

I have a text file with only one column containing text content. I want to find the top 3 most frequent items and the 3 least frequent items. I have tried some of the solutions in other posts, but I am not able to get what I want. I tried finding the mode as shown below, but it just outputs all the rows. I also tried using Counter and its most_common function, but they do the same thing, i.e. print all the rows in the file. Any help is appreciated.
# My Code
import pandas as pd
df = pd.read_csv('sample.txt')
print(df.mode())
You can use Python's built-in Counter.
from collections import Counter

# Read the file directly into a Counter, one entry per stripped line
with open('file') as f:
    cnts = Counter(l.strip() for l in f)

# Display the 3 most common lines
cnts.most_common(3)
# Display the 3 least common lines
cnts.most_common()[-3:]
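If you want to stay in pandas, value_counts does the same job; a minimal sketch (header=None because the file is just raw lines):
import pandas as pd

df = pd.read_csv('sample.txt', header=None, names=['line'])
counts = df['line'].value_counts()  # sorted from most to least frequent
print(counts.head(3))  # 3 most frequent items
print(counts.tail(3))  # 3 least frequent items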

How to merge columns with no header name in a python script?

My Python script parses some text from an Excel file. It strips whitespace, changes the delimiters (from " : " to " , "), and outputs to a CSV file. Much of the data looks like this:
(screenshot of what the data looks like in Excel)
Values end up shifted by a single column due to there being an extra comma or two (CSV == comma-separated values).
I have tried using if statements to add or remove commas to shore it up, but that ends up completely messing up the original relative order. It's driving me nuts!
To try it another way, I installed the pandas library (a data manipulation library) using pip.
Is it possible to merge columns that have no column headers inside a single DataFrame? There's plenty of advice regarding separate DataFrames, but not much for a single one.
Furthermore, how can I merge the columns while retaining the row positions? The emails are in the correct row positions but not the column positions.
Or am I on the wrong track completely? Is pandas overkill for a simple parsing script? I've been learning Python as I go along to try to complete the script, so I might have missed a simple way of doing it.
Some sample data:
C5XXEmployeeNumXX,C5XXEmployeeNumXX,JohnSmith,1,,John,,Smith,,IT Supp.Centre,EU,,London1,,,59XXXX,ITServiceDesk,LOND01,,,,Notmaintained,,,,,,,,john.smith#company.com,
Snippet of parsing logic
for line in f:
    # finds the identifier for users
    if ':LON ' in line:
        # parsing logic:
        # delimiters are swapped, whitespace is scrubbed
        line = line.replace(':', ',')
        line = line.replace(' ', '')
You can use a separator/delimiter of your choice. Check out: https://docs.python.org/2/library/csv.html#csv.Dialect.delimiter.
Also, regarding the order: if you are reading rows into a list, the order should be fine, but if you are reading the contents of a row into a dict, it is normal that the order is not preserved.
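A minimal sketch of that advice applied to your snippet, assuming you want to strip whitespace per field instead of deleting every space (which would also mangle values containing spaces); the file names are placeholders:
import csv

with open("input.txt") as src, open("output.csv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter=",")
    for line in src:
        # finds the identifier for users, as in your snippet
        if ':LON ' in line:
            # split on the original ':' delimiter and strip each field,
            # so the column count stays fixed and the order is preserved
            fields = [field.strip() for field in line.split(':')]
            writer.writerow(fields)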

CSV to Python Dictionary with all column names?

I'm still pretty new to using Python to program from scratch, so as an exercise I thought I'd take a file that I process using SQL and try to duplicate the functionality using Python. It seems that I want to take my (compressed, zip) CSV file and create a dict of it (or maybe a dict of dicts?). When I use DictReader, I get the first row as a key rather than each column as its own key, e.g.:
import csv, sys, zipfile
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file = zipfile.ZipFile(sys.argv[0])
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel'):
    pass
Yields:
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=MLS_ACCT PARCEL_ID AREA COUNTY STREET_NUM STREET_NAME CITY ZIP STATUS PROP_TYPE LIST_PRICE LIST_DATE DOM DATE_MODIFIED BATHS_HALF BATHS_FULL BEDROOMS ACREAGE YEAR_BUILT YEAR_BUILT_DESC OWNER_NAME SOLD_DATE WITHDRAWN_DATE STATUS_DATE SUBDIVISION PENDING_DATE SOLD_PRICE,
value=492859 28-15-3-009-001.0000 200 JEFF 3828 ORLEANS RD MOUNTAIN BROOK 35243 A SFR 324900 3/3/2011 2 3/4/2011 12:04:11 AM 0 2 3 0 1968 EXIST SPARKS 3/3/2011 11:54:56 PM KNOLLWOOD
So what I'm looking for is a column for MLS_ACCT and a separate one for PARCEL_ID, etc., so I can then do things like average prices over all items that contain KNOLLWOOD in the SUBDIVISION field, with further subsetting by date range, date sold, etc.
I know well how to do it with SQL, but as I said, I'm trying to gain some Python skills here.
I have been reading for the last few days but have yet to find any very simple illustrations of this sort of use case. Pointers to such docs would be appreciated. I realize I could use memory-resident SQLite, but again my desire is to learn the Python approach. I've read some on NumPy and SciPy and have Sage loaded, but still can't find useful illustrations, since those tools seem focused on arrays with only numbers as elements, and I have a lot of string matching to do, as well as date-range calculations and comparisons.
Eventually I'll need to substitute values in the table (since I have dirty data); I do this now by having a "translate table" which contains all dirty variants and provides a "clean" answer for final use.
Are you sure that this is a file with comma-separated values? It seems like the fields are being delimited by tabs.
If this is correct, specify a tab delimiter in the DictReader constructor.
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    for key in row:
        print 'key=%s, value=%s' % (key, row[key])
Source: http://docs.python.org/library/csv.html
Writing the operation in pure Python is certainly possible, but you'll have to choose your algorithms then. The row output you've posted above looks a whole lot like the parsing has gone wrong; in fact, it seems not to be a CSV at all, is it a TSV? Try passing delimiter='\t' or dialect=csv.excel_tab to DictReader.
Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require information that is not present in the CSV itself, such as how dates are represented and which columns are dates.
An example of getting a column-oriented data structure (however, involving loading the whole file):
import csv
allrows=list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns=dict([(x[0],x[1:]) for x in zip(*allrows)])
The intermediate steps of going to list and storing in a variable aren't necessary. The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:
matchingrows=[rownum for (rownum,value) in enumerate(columns['one']) if value>2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.
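For example, a small sketch building on the columns dict above (LIST_DATE and its M/D/YYYY format are taken from the sample row; adjust to your real data):
from datetime import datetime

# Parse the LIST_DATE column (e.g. "3/3/2011") into datetime objects
list_dates = [datetime.strptime(d, "%m/%d/%Y") for d in columns['LIST_DATE']]
# Row numbers of listings from March 2011 or later
recent = [i for i, d in enumerate(list_dates) if d >= datetime(2011, 3, 1)]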
At first glance it seems like your input might not actually be CSV; maybe it is tab-delimited instead. Check out the docs at python.org: you can create a Dialect and use that to change the delimiter.
import csv
csv.register_dialect('exceltab', delimiter='\t')
for row in csv.DictReader(items_file, dialect='exceltab'):
    pass
