I am new to Python. I am trying to extract mixed fractions from a PDF file using Python, but I have no idea which tool I should use. My sample PDF contains only one page with simple text. I would like to extract the part name and the length of the part. A screenshot of the sample PDF page is shown in the image link (Page 1 of Pdf - Screenshot). The PDF file can be downloaded from the following link (Sample Pdf).
EDIT 1: - UPDATED
Thank you for suggesting Pdfplumber. It is a great tool. I could extract information with it. Though in some cases, when I extract length, I get the whole number combined with denominator. Say, if I have 36 1/2 as length (as shown in screenshot), then I get the value as 362 inches.
import pdfplumber
with pdfplumber.open("Sample.pdf") as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()
    for row in text.split('\n'):
        if 'inches' in row:
            num = row.split()[0]
            print(num)
Output: 362
This code works for me in most cases. Just in some cases, I get 362 as my output, instead of getting 36 as a separate value. How could I resolve this issue?
pdfplumber gives output like this:
shape: square
part name: square
1
36 𝑖𝑛𝑐ℎ𝑒𝑠
2
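One way to handle the layout shown above is to look at the lines directly before and after the "inches" line and treat them as numerator and denominator. The following is only a minimal sketch: the function name parse_lengths is made up for illustration, and it assumes the numerator always lands on the previous line and the denominator on the next one, as in this sample. It also normalizes the text with NFKC first, because the extracted word here is made of styled Unicode letters rather than plain ASCII. It does not cover the case where the digits are fused into a single token such as 362.

```python
import re
import unicodedata
from fractions import Fraction


def parse_lengths(text):
    """Rebuild mixed fractions whose numerator and denominator were
    extracted on the lines above and below the whole-number line.

    Assumption (not guaranteed by pdfplumber): the layout matches the
    sample output, i.e. numerator / "<whole> inches" / denominator.
    """
    # NFKC turns styled letters such as mathematical italics into ASCII.
    lines = [unicodedata.normalize("NFKC", ln).strip() for ln in text.split("\n")]
    results = []
    for i, line in enumerate(lines):
        match = re.match(r"(\d+)\s+inches\b", line)
        if not match:
            continue
        value = Fraction(int(match.group(1)))
        above = lines[i - 1] if i > 0 else ""
        below = lines[i + 1] if i + 1 < len(lines) else ""
        # Only add a fractional part when both neighbours are plain integers.
        if above.isdigit() and below.isdigit() and int(below) != 0:
            value += Fraction(int(above), int(below))
        results.append(value)
    return results


sample = "shape: square\npart name: square\n1\n36 𝑖𝑛𝑐ℎ𝑒𝑠\n2"
print(parse_lengths(sample))  # [Fraction(73, 2)], i.e. 36 1/2
```

Using Fraction keeps the value exact; call float() on the result if a decimal such as 36.5 is more convenient.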
I would suggest using pdfplumber; it's a very powerful and well-documented tool for extracting text, tables, and images from PDFs.
Moreover, it has a very convenient function, called crop, that allows you to crop and extract just the portion of the page that you need.
Just as an example, the code would be something like this (note that this will work with any number of pages):
import pdfplumber

filename = 'path/to/your/PDF'
# Fractions of the page size (0 to 1); replace the placeholders with your own values
crop_coords = [x0, top, x1, bottom]
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for i, page in enumerate(pdf.pages):
        my_width = page.width
        my_height = page.height
        # Crop pages
        my_bbox = (crop_coords[0]*float(my_width), crop_coords[1]*float(my_height), crop_coords[2]*float(my_width), crop_coords[3]*float(my_height))
        page_crop = page.crop(bbox=my_bbox)
        text = text+str(page_crop.extract_text()).lower()
        pages.append(page_crop)
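Here is the explanation of the coords. Each one is a fraction (0 to 1) of the page width or height, measured from the left or upper side of the page:
x0 = Distance from the left vertical cut to the left side of the page.
top = Distance from the upper horizontal cut to the upper side of the page.
x1 = Distance from the right vertical cut to the left side of the page.
bottom = Distance from the lower horizontal cut to the upper side of the page.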
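The conversion from fractional coordinates to the absolute bounding box can be checked without opening a PDF. The helper below is a small sketch (fractional_bbox is a name made up here, not part of pdfplumber), assuming all four values are fractions measured from the left and upper edges, which is what the multiplication by page width and height in the code above implies.

```python
def fractional_bbox(page_width, page_height, crop_coords):
    """Convert (x0, top, x1, bottom) fractions into absolute page units."""
    x0, top, x1, bottom = crop_coords
    # Reject boxes that are empty, inverted, or fall outside the page.
    if not (0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1):
        raise ValueError("expected 0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1")
    return (x0 * page_width, top * page_height,
            x1 * page_width, bottom * page_height)


# Hypothetical 612 x 792 point page: keep the upper half.
print(fractional_bbox(612, 792, (0.0, 0.0, 1.0, 0.5)))  # (0.0, 0.0, 612.0, 396.0)
```

The resulting tuple has the same order as the my_bbox value passed to page.crop in the answer.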
Related
We have invoices coming in on paper. We take images of these invoices, and wish to extract the information contained within the cells of the tabular region(s) and export it as CSV or similar.
The tables include multiple columns, and the cells contain numbers and words.
I have been searching around for ML-based Python procedures to have this performed, expecting this to be a relatively straightforward task (or maybe I'm mistaken), yet not much luck in coming across a procedure.
I can detect the horizontal and vertical lines, and combine them to locate the cells. But retrieving the information contained within the cells seems to be problematic.
Could I please get help?
I followed one procedure from this reference, yet came across an error with "bitnot":
import pytesseract
extract=[]
for i in range(len(order)):
    for j in range(len(order[i])):
        inside=''
        if(len(order[i][j])==0):
            extract.append(' ')
        else:
            for k in range(len(order[i][j])):
                side1,side2,width,height = order[i][j][k][0],order[i][j][k][1], order[i][j][k][2],order[i][j][k][3]
                final_extract = bitnot[side2:side2+h, side1:side1+width]
                final_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 1))
                get_border = cv2.copyMakeBorder(final_extract,2,2,2,2, cv2.BORDER_CONSTANT,value=[255,255])
                resize = cv2.resize(get_border, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
                dil = cv2.dilate(resize, final_kernel,iterations=1)
                ero = cv2.erode(dil, final_kernel,iterations=2)
                ocr = pytesseract.image_to_string(ero)
                if(len(ocr)==0):
                    ocr = pytesseract.image_to_string(ero, config='--psm 3')
                inside = inside +" "+ ocr
            extract.append(inside)
a = np.array(extract)
dataset = pd.DataFrame(a.reshape(len(hor), total))
dataset.to_excel("output1.xlsx")
The error I get is this:
final_extract = bitnot[side2:side2+h, side1:side1+width]
NameError: name 'bitnot' is not defined
I'm trying to extract information from a PDF using the package PDFQuery. The information is not in the same location every time so I need to have a query tag. First, I wrote the function:
def clean_text_data(text):
    return text.split(':')[1]
I then wrote a function to extract the text:
Date = clean_text_data(pdf.pq('LTTextLineHorizontal:contains("Date")').text())
The problem, however, is that (for some reason) almost all of the data is in the next 'LTTextLineHorizontal'.
The XML looks like this:
<LTTextLineHorizontal bbox="[58.501, 377.094, 78.501, 385.094]" height="8.0" width="20.0" word_margin="0.1" x0="58.501" x1="78.501" y0="377.094" y1="385.094"><LTTextBoxHorizontal bbox="[58.501, 377.094, 78.501, 385.094]" height="8.0" index="39" width="20.0" x0="58.501" x1="78.501" y0="377.094" y1="385.094">Date: </LTTextBoxHorizontal></LTTextLineHorizontal>
<LTTextLineHorizontal bbox="[107.249, 377.334, 147.281, 385.334]" height="8.0" width="40.032" word_margin="0.1" x0="107.249" x1="147.281" y0="377.334" y1="385.334"><LTTextBoxHorizontal bbox="[107.249, 377.334, 147.281, 385.334]" height="8.0" index="40" width="40.032" x0="107.249" x1="147.281" y0="377.334" y1="385.334">02/26/2020 </LTTextBoxHorizontal></LTTextLineHorizontal>
Here the Date is 02/26/2020, but it is in the box immediately following. How do I create a function to extract the following box?
You do something like this:
label = pdf.pq('LTTextLineHorizontal:contains("Date")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
In this first part, I'm finding the area of the PDF that contains "Date" and extracting the coordinates of its bounding box, so (x0, y0) corresponds to the lower-left corner of wherever "Date" is written.
name = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (
    left_corner, bottom_corner - 12, left_corner + 350, bottom_corner)).text()
Afterward, I offset those coordinates to create a new bbox that contains the information I'm actually looking for, and I get its .text().
The coordinates are offset in points, which you can measure with Acrobat's ruler.
Source is here: https://pypi.org/project/pdfquery/#quick-start
The quickstart guide has a really good example.
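For the XML in the question, the value sits to the right of the label on the same text line, so the offsets need to point rightward rather than downward. The sketch below only builds and checks the box, using the coordinates copied from that XML. The helper names, the 350-point width, and the 2-point padding are assumptions to tune for your own documents.

```python
def value_bbox(label_x1, label_y0, label_y1, width=350.0, pad=2.0):
    """Box that starts at the label's right edge and spans the same text line."""
    return (label_x1, label_y0 - pad, label_x1 + width, label_y1 + pad)


def contains(outer, inner):
    """True if `inner` lies entirely inside `outer`; both are (x0, y0, x1, y1)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def in_bbox_selector(bbox):
    """Build the PDFQuery selector string for a bounding box."""
    return 'LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % bbox


# Coordinates copied from the XML in the question.
label = (58.501, 377.094, 78.501, 385.094)    # "Date: "
value = (107.249, 377.334, 147.281, 385.334)  # "02/26/2020 "

box = value_bbox(label[2], label[1], label[3])
print(contains(box, value))  # True: the date fits inside the search box
print(contains(box, label))  # False: the label itself is excluded
print(in_bbox_selector(box))
```

With a loaded document, the label coordinates would come from label.attr('x1'), label.attr('y0'), and label.attr('y1'), and the selector string would be passed to pdf.pq(...).text().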
I am trying to make a program that will scrape the text off of a screenshot using Tesseract and Python. I have no issue getting one piece of it; however, some text is lighter colored and is not being picked up by Tesseract. Below is an example of a picture I am using:
I am able to get the text at the top of the picture, but not the 3 options below.
Here is the code I am using for grabbing the text
result = pytesseract.image_to_string(
    screen, config="load_system_dawg=0 load_freq_dawg=0")
print("below is the total value scraped by the tesseract")
print(result)
# Split up newlines until we have our question and answers
parts = result.split("\n\n")
question = parts.pop(0).replace("\n", " ")
q_terms = question.split(" ")
q_terms = list(filter(lambda t: t not in stop, q_terms))
q_terms = set(q_terms)
parts = "\n".join(parts)
parts = parts.split("\n")
answers = list(filter(lambda p: len(p) > 0, parts))
When I have plain text in black without a colored background, I can get the answers array to be populated by the 3 options below; however, not in this case. Is there any way I can go about fixing this?
You're missing a binarization (thresholding) step.
In your case you can simply apply a binary threshold to the grayscale image.
Here is the result image with threshold = 177
Here you can learn more about thresholding with the OpenCV Python library
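To make the thresholding step concrete, here is what a binary threshold computes per pixel, written in plain Python on a tiny made-up grayscale patch. This is only an illustration of the rule; on a real screenshot you would let OpenCV do it, for example with cv2.threshold(gray, 177, 255, cv2.THRESH_BINARY), which applies the same comparison to every pixel.

```python
def binary_threshold(gray, thresh=177, maxval=255):
    """Map each grayscale pixel to maxval if it exceeds thresh, else to 0."""
    return [[maxval if px > thresh else 0 for px in row] for row in gray]


# Hypothetical 2x3 patch: light text (200) on a mid-tone background (120).
patch = [[120, 200, 120],
         [200, 200, 120]]
print(binary_threshold(patch))  # [[0, 255, 0], [255, 255, 0]]
```

After this step the light text becomes pure white on a black background. If Tesseract still struggles, inverting the result so the text is dark on a light background may be worth trying.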
I would like to test the accuracy of a Highcharts graph presenting data from a JSON file (which I already read) using Python and Selenium Webdriver.
How can I read the Highchart data from the website?
thank you,
Evgeny
The highchart data is converted to an SVG path, so you'd have to interpret the path yourself. I'm not sure why you would want to do this, actually: in general you can trust 3rd party libraries to work as advertised; the testing of that code should reside in that library.
If you still want to do it, then you'd have to dive into Javascript to retrieve the data. Taking the Highcharts Demo as an example, you can extract the data points for the first line as shown below. This will give you the SVG path definition as a string, which you can then parse to determine the origin and the data points. Comparing this to the size of the vertical axis should allow you to calculate the value implied by the graph.
import re

# Get the origin and datapoints of the first line (the path's 'd' attribute)
s = selenium.get_eval(
    "window.jQuery('svg g.highcharts-tracker path:eq(0)').attr('d')")
splitted = re.split(r'\s+L\s+', s)
origin = splitted[0].split(' ')[1:]
data = [p.split(' ') for p in splitted[1:]]
# Convert to floats
origin = [float(origin[0]), float(origin[1])]
data = [[float(x), float(y)] for x, y in data]
# Get the min and max y-axis value and position
min_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').text()"))
max_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').text()"))
min_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').attr('y')"))
max_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').attr('y')"))
# Calculate the value based on the retrieved positions
y_scale = min_y_pos - max_y_pos
y_range = max_y_val - min_y_val
# Fraction of the axis height, measured from the top of the plot area
y_fraction = data[0][1] / y_scale
value = max_y_val - (y_range * y_fraction)
Disclaimer: I didn't have the time to fully verify it, but something along these lines should give you what you want.
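The arithmetic can be tried out separately from the browser. The sketch below parses a simple path string and maps pixel positions to axis values by linear interpolation. The function names and the sample path are made up for illustration, and it assumes the path contains only M and L commands and that the path points and the axis positions share one coordinate system, which may not hold on a real chart.

```python
import re


def parse_path(d):
    """Return the (x, y) pairs of an SVG path made only of M/L commands."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", d)
    return [(float(x), float(y)) for x, y in zip(numbers[0::2], numbers[1::2])]


def pixel_to_value(y, min_pos, max_pos, min_val, max_val):
    """Linearly map pixel y to an axis value.

    min_pos and max_pos are the pixel positions of the axis minimum and
    maximum; in SVG the minimum usually has the larger y coordinate.
    """
    fraction = (min_pos - y) / (min_pos - max_pos)
    return min_val + fraction * (max_val - min_val)


# Hypothetical path and axis: pixel 200 is value 0, pixel 0 is value 50.
points = parse_path("M 10 200 L 20 150 L 30 100")
print([pixel_to_value(y, 200, 0, 0, 50) for _, y in points])  # [0.0, 12.5, 25.0]
```

The values computed this way can then be compared against the numbers in the JSON file, allowing for a small tolerance because of pixel rounding.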
I'm automatically generating a PDF-file with Platypus that has dynamic content.
This means that the length of the text content (which is directly at the bottom of the PDF file) may vary.
However, a page break may be inserted when the content is too long.
This is because I use a "static" spacer:
s = Spacer(width=0, height=23.5*cm)
As I always want to have only one page, I somehow need to set the height of the Spacer dynamically, so that the Spacer takes up whatever space is left on the page.
Now, how do I get the "rest" of the height that is left on my page?
I sniffed around in the reportlab library a bit and found the following:
Basically, I decided to use a frame into which the flowables will be printed. f._aH returns the height of the Frame (we could also calculate this by hand). Subtracting the heights of the other two flowables, which we get through wrap, we get the remaining height which is the height of the Spacer.
from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import Frame, Spacer

elements = []
elements.append(Flowable1)
elements.append(Flowable2)
c = Canvas(path)
f = Frame(fx, fy, fw, fh, showBoundary=0)
# compute the available height for the spacer
sheight = f._aH - (Flowable1.wrap(f._aW, f._aH)[1] + Flowable2.wrap(f._aW, f._aH)[1])
# create spacer
s = Spacer(width=0, height=sheight)
# insert the spacer between the two flowables
elements.insert(1, s)
# draw the list of elements into the frame
f.addFromList(elements, c)
c.save()
Tested and works fine.
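One edge case is worth guarding against: if the flowables are together taller than the frame, the subtraction above yields a negative height. A small helper can clamp the result. This is only a sketch of the arithmetic with made-up numbers (spacer_height is not a ReportLab function), and the content would still overflow the frame in that situation.

```python
def spacer_height(available_height, flowable_heights):
    """Height left over for a spacer, clamped so it never goes negative."""
    return max(0.0, available_height - sum(flowable_heights))


print(spacer_height(700.0, [120.0, 80.0]))   # 500.0
print(spacer_height(700.0, [500.0, 300.0]))  # 0.0
```

In the code above, available_height would be f._aH and the list would hold the heights returned by each flowable's wrap call.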
As far as I can see, you want to have a footer, right?
Then you should do it like this:
def _laterPages(canvas, doc):
    canvas.drawImage(
        os.path.join(settings.PROJECT_ROOT, 'templates/documents/pics/footer.png'),
        left_margin, bottom_margin - 0.5*cm, frame_width, 0.5*cm)

doc = BaseDocTemplate(filename, showBoundary=False)
# elements is the list of flowables; _firstPage is a page callback like _laterPages (not shown)
doc.multiBuild(elements, _firstPage, _laterPages)