I have a table with wrapped text in a PDF file.
I used tabula to extract the table from the PDF file:
file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1,pages=1,lattice=True)
table[0]
However, the end result looks like this:
Is there a way to interpret a line break or wrapped text in a PDF table as part of its own row, rather than as extra rows?
The end result using tabula should look like this:
You need to add a parameter. Replace
file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1,pages=1)
table[0]
with
file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1,pages=1, lattice = True)
table[0]
All this according to the documentation here.
Here is an example:
See the article "https://effectivehealthcare.ahrq.gov/sites/default/files/pdf/methods-guidance-tests-bias_methods.pdf"
import tabula
import io
import pandas as pd
file1 = r"C:\Users\s-degossondevarennes\.......\Desktop\methods-guidance-tests-bias_methods.pdf"
table = tabula.read_pdf(file1, pages=3, lattice=True)
df = table[0]
df = df.drop(['Unnamed: 1','Unnamed: 2','Description','Unnamed: 3'],axis=1)
df
returns:
Unnamed: 0 \
0 NaN
1 Spectrum effect
2 Context bias
3 Selection bias
4 NaN
5 Variation in test execution
6 Variation in test technology
7 Treatment paradox
8 Disease progression bias
9 NaN
10 Inappropriate reference\rstandard
11 Differential verification bias
12 Partial verification bias
13 NaN
14 Review bias
15 Clinical review bias
16 Incorporation bias
17 Observer variability
18 NaN
19 Handling of indeterminate\rresults
20 Arbitrary choice of threshold\rvalue
Source of Systematic Bias
0 Population
1 Tests may perform differently in various sampl...
2 Prevalence of the target condition varies acco...
3 The selection process determines the compositi...
4 Test Protocol: Materials and Methods
5 A sufficient description of the execution of i...
6 When the characteristics of a medical test cha...
7 Occurs when treatment is started on the basis ...
8 Occurs when the index test is performed an unu...
9 Reference Standard and Verification Procedure
10 Errors of imperfect reference standard bias th...
11 Part of the index test results is verified by ...
12 Only a selected sample of patients who underwe...
13 Interpretation
14 Interpretation of the index test or reference ...
15 Availability of clinical data such as age, sex...
16 The result of the index test is used to establ...
17 The reproducibility of test results is one det...
18 Analysis
19 A medical test can produce an uninterpretable ...
20 The selection of the threshold value for the i...
The three dots in the column Source of Systematic Bias show that everything that was in that cell, including line breaks, is considered a single cell (item), not multiple cells. Another proof of that is
df.iloc[2,1]
returns the cell content:
'Prevalence of the target condition varies according to setting and may affect\restimates of test performance. Interpreters may consider test results to be\rpositive more frequently in settings with higher disease prevalence, which may\ralso affect estimates of test performance.'
There must be something with your pdf. If it's available online, share the link and I'll take a look.
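If you then want to flatten the \r line breaks that tabula keeps inside each cell, a minimal cleanup sketch (assuming a pandas DataFrame like the df above) could be:
# Replace the carriage-return line breaks inside each cell with spaces
df = df.replace(r'\r', ' ', regex=True)
df.iloc[2, 1]  # the wrapped text is now a single line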
I'm working on a naive multinomial bayes classifier for articles in Pandas and have run into a bit of an issue with performance. My repo is here if you want the full code and the dataset I'm using: https://github.com/kingcodefish/multinomial-bayesian-classification/blob/master/main.ipynb
Here's my current setup with two dataframes: df for the articles with lists of tokenized words and word_freq to store precomputed frequency and P(word | category) values.
for category in df['category'].unique():
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]  # The number of categorized articles
    p_cat = cat_articles / df.shape[0]  # P(Cat) = # of articles per category / # of articles
    df[category] = df['content'].apply(lambda x: category_filter[category_filter['word'].isin(x)]['p_given_cat'].prod()) * p_cat
Example data:
df
category content
0 QUEER VOICES [online, dating, thoughts, first, date, grew, ...
1 COLLEGE [wishes, class, believe, generation, better, j...
2 RELIGION [six, inspiring, architectural, projects, revi...
3 WELLNESS [ultramarathon, runner, micah, true, died, hea...
4 ENTERTAINMENT [miley, cyrus, ball, debuts, album, art, cyrus...
word_freq
category word freq p_given_cat
46883 MEDIA seat 1.0 0.333333
14187 CRIME ends 1.0 0.333333
81317 WORLD NEWS seat 1.0 0.333333
12463 COMEDY living 1.0 0.200000
20868 EDUCATION director 1.0 0.500000
Please note that the word_freq table is a cross product of the categories x words, so every word appears once and only once in each category, meaning the table does not contain duplicate (category, word) pairs. Also, the freq column has been increased by 1 to avoid zero values (Laplace smoothing).
After running the above, I do this to find the max category P (each category's P is stored in a column after its name) and get the following:
df['predicted_category'] = df[df.columns.difference(['category', 'content'])].idxmax(axis=1)
df = df.drop(df.columns.difference(['category', 'content', 'predicted_category']), axis=1).reset_index(drop = True)
category content \
0 POLITICS [bernie, sanders, campaign, split, whether, fi...
1 COMEDY [bill, maher, compares, police, unions, cathol...
2 WELLNESS [busiest, people, earth, find, time, relax, th...
3 ENTERTAINMENT [lamar, odom, gets, standing, ovation, first, ...
4 GREEN [lead, longer, life, go, gut]
predicted_category
0 ARTS
1 ARTS
2 ARTS
3 TASTE
4 GREEN
This method seems to work well, but it is unfortunately really slow. I am using a large dataset of 200,000 articles with short descriptions, and operating on only 1% of it takes almost a minute. I know it's because I am looping through the categories instead of relying on vectorization, but I am very, very new to Pandas and formulating this succinctly as a groupby escapes me (especially with the two data tables, which might also be unnecessary), so I'm looking for suggestions here.
Thanks!
Just in case someone happens to come across this later...
Instead of representing my categories x words as a cross product of every possible word of every category, which inflated to over 3 million rows in my data set, I decided to reduce them to only the necessary ones per category and provide a default value for ones that did not exist, which ended up being about 600k rows.
But the biggest speedup came from changing to the following:
import numpy as np

for category in df['category'].unique():
    # Calculate P(Category)
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]
    p_cat = cat_articles / df.shape[0]
    # Create a word -> P(word | category) dictionary for quick lookups
    category_dict = category_filter.set_index('word').to_dict()['p_given_cat']
    # For every article, take the product of the P(word | category) values of its words,
    # then multiply by P(category) to get the Bayes score.
    df[category] = df['content'].apply(lambda x: np.prod([category_dict.get(y, 0.001 / (cat_articles + 0.001)) for y in x])) * p_cat
I created a dictionary from the two columns, with word as the key and P(word | category) as the value. This reduced the problem to a quick dictionary lookup for each element of each list and computing that product.
This ended up being about 100x faster, parsing the whole dataset in ~40 seconds.
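As a toy illustration of that dictionary-lookup pattern (the miniature word_freq below is made up, not from the real dataset):
import pandas as pd

# Hypothetical slice of word_freq for a single category
word_freq = pd.DataFrame({
    'word': ['seat', 'ends', 'living'],
    'p_given_cat': [0.333333, 0.333333, 0.200000],
})

category_dict = word_freq.set_index('word').to_dict()['p_given_cat']
print(category_dict.get('seat'))           # 0.333333
print(category_dict.get('unseen', 0.001))  # fallback for words missing from this category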
I need to extract the dates from the following Series:
0 03/25/93 Total time of visit (in minutes):\n
1 6/18/85 Primary Care Doctor:\n
2 sshe plans to move as of 7/8/71 In-Home Servic...
3 7 on 9/27/75 Audit C Score Current:\n
4 2/6/96 sleep studyPain Treatment Pain Level (N...
5 .Per 7/06/79 Movement D/O note:\n
6 4, 5/18/78 Patient's thoughts about current su...
7 10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8 3/7/86 SOS-10 Total Score:\n
9 (4/10/71)Score-1Audit C Score Current:\n
10 (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...
11 4/09/75 SOS-10 Total Score:\n
12 8/01/98 Communication with referring physician...
13 1/26/72 Communication with referring physician...
14 5/24/1990 CPT Code: 90792: With medical servic...
15 1/25/2011 CPT Code: 90792: With medical servic...
16 4/12/82 Total time of visit (in minutes):\n
17 1; 10/13/1976 Audit C Score, Highest/Date:\n
I am trying to do so with the following regex:
df.str.extract('.(\d{1,4}/\d{1,4}/\d{1,4}).')
But why is it dropping the first digit of the two-digit months even though I am specifying {1,4}?
For example, from rows 7 and 17 it should extract '10/24/89' and '10/13/1976' respectively, instead of '0/24/89' and '0/13/1976'.
I also tried adding '?:' at the beginning of the capture group, but it does not work.
Thanks beforehand!
The leading . in your pattern has to match something before the capture group, and when the date sits at the start of the string that something is the date's first digit. I would put word boundaries around the dates instead, i.e. use this pattern:
\b(\d+/\d+/\d+)\b
Updated code:
df['date'] = df['col'].str.extract(r'\b(\d+/\d+/\d+)\b')
Here is a regex demo showing that the above pattern works correctly.
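A minimal sketch of the fix, assuming the text lives in a column named col (a hypothetical name), run on two of the sample rows:
import pandas as pd

df = pd.DataFrame({'col': [
    '10/24/89 CPT Code: 90801',
    '1; 10/13/1976 Audit C Score, Highest/Date:',
]})

# Raw string, so \b is a word boundary and not a backspace character
df['date'] = df['col'].str.extract(r'\b(\d+/\d+/\d+)\b', expand=False)
print(df['date'].tolist())  # ['10/24/89', '10/13/1976']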
I am trying to read a CSV, then iterate through an SDE to find matching features and their fields, and then print them.
There is a table in the list and I'm not able to skip over it and continue reading the csv.
I get the "IOError: table 1 does not exist" and I only get the features that come before the table.
import arcpy
from arcpy import env
import sys
import os
import csv
with open('C:/Users/user/Desktop/features_to_look_for.csv', 'r') as t1:
    objectsinESRI = [r[0] for r in csv.reader(t1)]

env.workspace = "//conn/features#dev.sde"
fcs = arcpy.ListFeatureClasses('sometext.*')

for fcs in objectsinESRI:
    fieldList = arcpy.ListFields(fcs)
    for field in fieldList:
        print fcs + " " + ("{0}".format(field.name))
Sample csv rows (can't seem to post a screenshot of the excel file)
feature 1
feature 2
feature 3
feature 4
table 1
feature 5
feature 6
feature 7
feature 8
feature 9
Result
feature 1
feature 2
feature 3
feature 4
Desired Result
feature 1
feature 2
feature 3
feature 4
feature 5
feature 6
feature 7
feature 8
feature 9
So, as stated, I have no clue about arcpy, but this seems the way to start. Looking at the docs, your objectsInEsri seems to be the equivalent of the datasets in the example. From there I extrapolate the following code which, depending on what print(fc) prints, you may need to extend with yet another for.
So try this:
for object in objectsInEsri:
    for fc in fcs:
        print(fc)
Or maybe this:
for object in objectsInEsri:
    for fc in fcs:
        for field in arcpy.ListFields(fc):
            print(object + " " + ("{0}".format(field.name)))
I may be completely wrong, of course, but just write the outermost for first, see what it gives you, and keep building from there :)
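For what it's worth, a minimal sketch of the skip-and-continue idea, where the try/except around ListFields is my own assumption (the question reports an IOError for "table 1") rather than something taken from either code block above:
import csv
import arcpy
from arcpy import env

with open('C:/Users/user/Desktop/features_to_look_for.csv', 'r') as t1:
    objectsinESRI = [r[0] for r in csv.reader(t1)]

env.workspace = "//conn/features#dev.sde"

for name in objectsinESRI:
    try:
        # ListFields fails for names it cannot resolve (e.g. "table 1")
        for field in arcpy.ListFields(name):
            print(name + " " + field.name)
    except IOError:
        # Skip the entry and carry on with the rest of the csv
        continue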
I have a spreadsheet with several columns containing survey responses. This spreadsheet will be merged into others and I will then have duplicate rows similar to the ones below. I will then need to take all questions with the same text and calculate the percentages of the answers based on the entirety of the merged document.
Example Excel Data
**Poll Question** **Poll Responses**
The content was clear and effectively delivered 37 Total Votes
Strongly Agree 24.30%
Agree 70.30%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
The Instructor(s) were engaging and motivating 37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
I would attend another training session delivered by this Instructor(s) 37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 5.40%
Disagree 0.00%
Strongly Disagree 0.00%
This was a good format for my training 37 Total Votes
Strongly Agree 24.30%
Agree 62.20%
Neutral 8.10%
Disagree 2.70%
Strongly Disagree 2.70%
Any comments/suggestions about this training course? 5 Total Votes
My method for calculating a non-percent number of votes will be to convert the percentages to a number, e.g. find and extract 37 from 37 Total Votes, then use the following formula to get the number of users that voted for that particular answer: percent * total / 100.
So 24.30 * 37 / 100 = 8.99 rounded up means 9 out of 37 people voted for "Strongly Agree".
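A quick sketch of that arithmetic (the regex and variable names below are only for illustration):
import re

response = '37 Total Votes'
total = int(re.match(r'(\d+)\s+Total\s+Votes', response).group(1))  # 37

percent = 24.30
votes = round(percent * total / 100)  # 8.991 -> 9
print(total, votes)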
Here's an example spreadsheet of what I'd like to be able to do:
**Poll Question** **Poll Responses** **non-percent** **subtotal**
... 37 Total Votes 0 37
... 24.30% 9 37
... 70.30% 26 37
... 2.70% 1 37
... 2.70% 1 37
... 0.00% 0 37
(note: non-percent and subtotal would be newly created columns)
Currently I take a folder full of .xls files and loop through it, saving each file to another folder in .xlsx format. Inside that loop, I've added a comment block that contains my # NEW test CODE, where I'm trying to put the logic to do this.
As you can see, I'm trying to target the cell and get its value, use some regex to extract the number from it, and then add it to the subtotal column in that row. I then want to keep adding it until I see a new row containing x Total Votes.
Here's my current code:
import re

import numpy as np
import pandas as pd

files = get_files('/excels/', '.xls')
df_array = []

for i, f in enumerate(files, start=1):
    sheet = pd.read_html(f, attrs={'class': 'reportData'}, flavor='bs4')
    event_id = get_event_id(pd.read_html(f, attrs={'id': 'eventSummary'}))
    event_title = get_event_title(pd.read_html(f, attrs={'id': 'eventSummary'}))
    filename = event_id + '.xlsx'
    rel_path = 'xlsx/' + filename
    writer = pd.ExcelWriter(rel_path)
    for df in sheet:
        # NEW test CODE
        q_total = 0
        df.columns = df.columns.str.strip()
        if df[df['Poll Responses'].str.contains("Total Votes")]:
            # if df['Poll Responses'].str.contains("Total Votes"):
            q_total = re.findall(r'.+?(?=\sTotal\sVotes)', df['Poll Responses'].str.contains("Total Votes"))[0]
            print(q_total)
        # df['Question Total'] = np.where(df['Poll Responses'].str.contains("Total Votes"), 'yes', 'no')
        # END NEW test Code
        df.insert(0, 'Event ID', event_id)
        df.insert(1, 'Event Title', event_title)
        df.to_excel(writer, 'sheet')
    writer.save()
    # progress of entire list
    if i <= len(files):
        print('\r{:*^10}{:.0f}%'.format('Converting: ', i/len(files)*100), end='')
print('\n')
TL;DR
This seems very convoluted, but if I can get the two new columns that contain the total votes for a question and the number (not percentage) of votes for an answer, then I can do some VLOOKUP magic for this on the merged document. Any help or methodology suggestions would be greatly appreciated. Thanks!
I solved this; I'll post the pseudocode below:
I loop through each sheet. Inside that loop, I loop through each row using for n, row in enumerate(df.itertuples(), 1):.
I get the value of the field that might contain "Poll Response" poll_response = str(row[3])
Using an if / else I check if the poll_response contains the text "Total Votes". If it does, it must be a question, otherwise it must be a row with an answer.
In the if for the question, I get the cells that contain the data I need. I then have a function that compares the question text with the question text of every object already in the array. If it's a match, I simply update the fields of the object; otherwise I create a new question object.
Otherwise the row is an answer row, and I use the question text to find the object in the array and update/add the answers or data.
This process loops through all the rows in each spreadsheet, and now I have my array full of unique question objects.
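A minimal sketch of that pseudocode (the Question class, the sheets list, and the column positions are my assumptions for illustration, not the exact production code):
import re

class Question(object):
    # Hypothetical container for one unique question
    def __init__(self, text, total_votes):
        self.text = text
        self.total_votes = total_votes
        self.answers = {}              # answer text -> vote count

questions = []                         # array of unique question objects
sheets = []                            # assumed: list of DataFrames, one per converted spreadsheet

def find_question(text):
    # Compare the question text with the text of every object already collected
    for q in questions:
        if q.text == text:
            return q
    return None

for df in sheets:
    current = None
    for n, row in enumerate(df.itertuples(), 1):
        poll_response = str(row[3])    # the field that might contain "Total Votes"
        if 'Total Votes' in poll_response:
            # Question row: extract the total, then find or create the question object
            total = int(re.search(r'(\d+)\s+Total\s+Votes', poll_response).group(1))
            text = str(row[2])         # assumed position of the question text
            current = find_question(text) or Question(text, 0)
            if current not in questions:
                questions.append(current)
            current.total_votes += total
        elif current is not None:
            # Answer row: turn the percentage into a vote count and update the object
            votes = round(float(poll_response.rstrip('%')) * total / 100)
            current.answers[str(row[2])] = current.answers.get(str(row[2]), 0) + votes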
I have a pandas DataFrame and I'm trying to remove all hypotheses that can be rejected.
Here is a snippet of the df in question:
best value p_value
0 11.9549 0.986927
1 11.9588 0.986896
2 12.1185 0.985588
3 12.1682 0.985161
4 12.3907 0.983131
5 12.4148 0.982899
6 12.6273 0.980750
7 12.9020 0.977680
8 13.4576 0.970384
9 13.5058 0.969679
10 13.5243 0.969405
11 13.5886 0.968439
12 13.8025 0.965067
13 13.9840 0.962011
14 14.1896 0.958326
15 14.3939 0.954424
16 14.6229 0.949758
17 14.6689 0.948783
18 14.9464 0.942626
19 15.1216 0.938494
20 15.5326 0.928039
21 17.7720 0.851915
22 17.8668 0.847993
23 17.9662 0.843822
24 19.2481 0.785072
25 19.5257 0.771242
I want to remove the elements with a p_value greater than a critical threshold alpha by selecting the ones that fall below alpha. The p value is calculated using scipy.stats.chisqprob(chisq, df), where chisq is the chi squared statistic and df is the degrees of freedom. This is all done using the custom method self.get_p_values shown below.
def reject_null_hypothesis(self, alpha, df):
    assert alpha > 0
    assert alpha < 1
    p_value = self.get_p_values(df)  # calculates the data frame above
    return p_value.loc[p_value['best value']
I'm then calling this method using:
PE=Modelling_Tools.PE_Results(PE_file) #Modelling.Tools is the module and PE_Results is the class which is given the data 'PE_file'
print PE.reject_null_hypothesis(0.5,25)
From what I've read, this should do what I want, but I'm new to pandas and this code returns the DataFrame unchanged.
Are you getting any errors when you run this? I ask because:
print PE.reject_null_hypothesis(0.5, 25)
is passing 25, an int object, into reject_null_hypothesis() in the last argument position, instead of a pandas.DataFrame object.
(Apologies. I would respond with this as a comment instead of an answer, but I only have 46 reputation at the moment, and 50 is needed to comment.)
Refer to indexing with a boolean array:
df[df.p_value < threshold]
Turns out there is a simple way to do what I want. Here is the code for those who want to know.
def reject_null_hypothesis(self, alpha, df):
    '''
    alpha = critical threshold for the chisq statistic
    df = degrees of freedom

    Values below this critical threshold are rejected.
    Values above this threshold are not 'proven' but
    cannot be rejected and must therefore be subject to
    further statistics.
    '''
    assert alpha > 0
    assert alpha < 1
    p_value = self.get_p_values(df)
    passed = p_value[p_value.loc[:, 'p_value'] > alpha].index
    return p_value[:max(passed)]