I am trying to create a series of name badges, and rather than doing it all by hand I'm trying to do it via Python. I am using the win32com.client approach to create a table in MS Word to hold each name badge. However, the images I am inserting into each cell are pushed up against the top of the cell, whereas I want them moved down a bit (the image is oversized, I know, but that can be dealt with later).
As you can see, the image is right against the top border and I want it pushed down. I have tried adding newlines before it (demonstrated below), but this seems to have had no effect. This is my loop for generating the badges:
for i in range(10):
    cell_col = i % cols + 1
    cell_row = i // cols + 1  # integer division so the row index stays an int
    cell_range = table.Cell(cell_row, cell_col).Range
    cell_range.ParagraphFormat.SpaceBefore = 0
    cell_range.ParagraphFormat.SpaceAfter = 3
    table.Cell(cell_row, cell_col).Range.InsertBefore('\n')
    cell_range.InlineShapes.AddPicture(os.path.join(os.path.abspath("."), filename))
    table.Cell(cell_row, cell_col).Range.InsertAfter('\n'+hold[i])
    table.Cell(cell_row, cell_col).Height = 150
    table.Cell(cell_row, cell_col).Width = 250
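The index arithmetic in the loop above can be checked on its own, without Word running. This is a minimal sketch; the helper name cell_position and the value cols = 2 are assumptions for illustration, not part of the original script.

```python
def cell_position(index, cols):
    """Return the 1-based (row, column) of the index-th badge, filling row by row."""
    row, col = divmod(index, cols)  # same as index // cols and index % cols
    return row + 1, col + 1


cols = 2  # assumed number of badge columns in the table
for i in range(4):
    print(i, cell_position(i, cols))
```

Using divmod keeps both values as integers, which matters because Word's table.Cell(row, column) expects whole numbers.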
We receive paper invoices, take images of them, and wish to extract the information contained within the cells of the tabular region(s) and export it as CSV or similar.
The tables include multiple columns, and the cells contain numbers and words.
I have been searching for ML-based Python procedures to do this, expecting it to be a relatively straightforward task (maybe I'm mistaken), but have had little luck finding one.
I can detect the horizontal and vertical lines, and combine them to locate the cells. But retrieving the information contained within the cells seems to be problematic.
Could I please get help?
I followed one procedure from this reference, yet came across an error with "bitnot":
import pytesseract
for i in range(len(order)):
    for j in range(len(order[i])):
        extract.append(' ')
        for k in range(len(order[i][j])):
            side1,side2,width,height = order[i][j][k][0],order[i][j][k][1], order[i][j][k][2],order[i][j][k][3]
            final_extract = bitnot[side2:side2+h, side1:side1+width]
            final_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 1))
            get_border = cv2.copyMakeBorder(final_extract,2,2,2,2, cv2.BORDER_CONSTANT,value=[255,255])
            resize = cv2.resize(get_border, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
            dil = cv2.dilate(resize, final_kernel,iterations=1)
            ero = cv2.erode(dil, final_kernel,iterations=2)
            ocr = pytesseract.image_to_string(ero)
            ocr = pytesseract.image_to_string(ero, config='--psm 3')
            inside = inside +" "+ ocr
a = np.array(extract)
dataset = pd.DataFrame(a.reshape(len(hor), total))
The error I get is this:
final_extract = bitnot[side2:side2+h, side1:side1+width]
NameError: name 'bitnot' is not defined
I am new to Python and am trying to extract mixed fractions from a PDF file, but I have no idea which tool I should use. My sample PDF contains only one page with simple text. I would like to extract the part name and the length of the part. A screenshot of the sample page is shown in the image link (Page 1 of Pdf - Screenshot), and the PDF file can be downloaded from the following link (Sample Pdf).
Thank you for suggesting pdfplumber. It is a great tool, and I could extract information with it. In some cases, though, when I extract the length I get the whole number combined with the denominator. Say I have 36 1/2 as the length (as shown in the screenshot): I then get the value 362 inches.
import pdfplumber
with pdfplumber.open("Sample.pdf") as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()
    for row in text.split('\n'):
        if 'inches' in row:
            num = row.split()[0]
            print(num)
Output: 362
This code works for me in most cases. Just in some cases, I get 362 as my output, instead of getting 36 as a separate value. How could I resolve this issue?
pdfplumber gives output like that
shape: square
part name: square
36 𝑖𝑛𝑐ℎ𝑒𝑠
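Once a row of text does contain the fraction as separate tokens, the length can be parsed with the standard library alone. The sketch below is only an illustration: the helper name parse_length and the regular expression are assumptions, and it does not stop pdfplumber from merging "36" and "2" when the PDF itself stores the fraction in a way that loses the spacing. The NFKC normalization step is there because the output above shows "inches" in mathematical italic letters, which a plain 'inches' in row test would not match.

```python
import re
import unicodedata
from fractions import Fraction

# Whole number, optional "numerator/denominator", then the word "inches".
MIXED_LENGTH = re.compile(r"(\d+)(?:\s+(\d+)\s*[/\u2044]\s*(\d+))?\s*inches")


def parse_length(row):
    """Return the length in a row as a Fraction, or None if no length is found."""
    # NFKC folds styled letters such as mathematical italics back to plain ASCII.
    row = unicodedata.normalize("NFKC", row)
    match = MIXED_LENGTH.search(row)
    if match is None:
        return None
    whole, numerator, denominator = match.groups()
    value = Fraction(int(whole))
    if numerator is not None:
        value += Fraction(int(numerator), int(denominator))
    return value


print(parse_length("36 1/2 inches"))
```

Returning a Fraction keeps the value exact; float(parse_length(row)) converts it when a decimal number is needed.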
I would suggest using pdfplumber; it's a very powerful and well-documented tool for extracting text, tables, and images from PDFs.
Moreover, it has a very convenient function, called crop, that allows you to crop and extract just the portion of the page that you need.
Just as an example, the code would be something like this (note that this will work with any number of pages):
import pdfplumber

filename = 'path/to/your/PDF'
# Placeholders: replace with fractions of the page size between 0 and 1 (explained below)
crop_coords = [x0, top, x1, bottom]
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for i, page in enumerate(pdf.pages):
        my_width = page.width
        my_height = page.height
        # Crop pages
        my_bbox = (crop_coords[0]*float(my_width), crop_coords[1]*float(my_height), crop_coords[2]*float(my_width), crop_coords[3]*float(my_height))
        page_crop = page.crop(bbox=my_bbox)
        text = text+str(page_crop.extract_text()).lower()
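Here is the explanation of the coords. Each one is a fraction of the page size, and pdfplumber measures all four from the top-left corner of the page:
x0 = Distance from the left side of the page to the left vertical cut, as a fraction of the page width.
top = Distance from the upper side of the page to the upper horizontal cut, as a fraction of the page height.
x1 = Distance from the left side of the page to the right vertical cut, as a fraction of the page width.
bottom = Distance from the upper side of the page to the lower horizontal cut, as a fraction of the page height.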
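The conversion from fractional coords to an absolute bounding box can be sketched without opening a PDF. The function name fractional_bbox and the page size used here are assumptions for illustration only.

```python
def fractional_bbox(crop_coords, page_width, page_height):
    """Scale (x0, top, x1, bottom) fractions to absolute page coordinates."""
    x0, top, x1, bottom = crop_coords
    # Each pair must be ordered and stay inside the page.
    if not (0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1):
        raise ValueError("crop_coords must be ordered fractions between 0 and 1")
    return (x0 * page_width, top * page_height,
            x1 * page_width, bottom * page_height)


# Hypothetical 600 x 800 page: keep the lower half of the middle columns.
print(fractional_bbox((0.25, 0.5, 0.75, 1.0), 600, 800))
```

The returned tuple has the same layout as my_bbox in the code above, so it can be passed to page.crop in the same way.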
I'm currently having a little issue with a FITS file. The data is in table format, a format I haven't previously used. I'm a Python user and rely heavily on astropy.io.fits to manipulate FITS images. A quick output of the info gives:
No.    Name         Type      Cards   Dimensions   Format
0    PRIMARY     PrimaryHDU      60   ()
1                BinTableHDU     29   3072R x 2C   [1024E, 1024E]
The header for the BinTableHDU is as follows:
XTENSION= 'BINTABLE' /Written by IDL: Mon Jun 22 23:28:21 2015
BITPIX = 8 /
NAXIS = 2 /Binary table
NAXIS1 = 8192 /Number of bytes per row
NAXIS2 = 3072 /Number of rows
PCOUNT = 0 /Random parameter count
GCOUNT = 1 /Group count
TFIELDS = 2 /Number of columns
TFORM1 = '1024E ' /Real*4 (floating point)
TFORM2 = '1024E ' /Real*4 (floating point)
TUNIT1 = '1e-6cts/s/arcmin^2' /
TUNIT2 = '1e-6cts/s/arcmin^2' /
HISTORY g000m90r1b120pm.fits created on 10/08/97. PI channel range: 8: 19
PIXTYPE = 'HEALPIX ' / HEALPIX pixelisation
ORDERING= 'NESTED ' / Pixel ordering scheme, either RING or NESTED
NSIDE = 512 / Healpix resolution parameter
NPIX = 3145728 / Total number of pixels
OBJECT = 'FULLSKY ' / Sky coverage, either FULLSKY or PARTIAL
FIRSTPIX= 0 / First pixel # (0 based)
LASTPIX = 3145727 / Last pixel # (zero based)
GRAIN = 0 / GRAIN = 0: No index,
COMMENT GRAIN =1: 1 pixel index for each pixel,
COMMENT GRAIN >1: 1 pixel index for Grain consecutive pixels
BAD_DATA= -1.63750E+30 / Sentinel value given to bad pixels
COORDSYS= 'G ' / Pixelization coordinate system
COMMENT G = Galactic, E = ecliptic, C = celestial = equatorial
I'd like to access the fits image which is stored within the TTYPE labeled 'COUNT-RATE', and then have this in a format with which I can then add to other count-rate arrays with the same dimensions.
I started with my usual procedure for opening a FITS file:
from astropy.io import fits
import numpy
hdulist_RASS_SXRB_R1 = fits.open('/Users/.../RASS_SXRB_R1.fits')
image_XRAY_SKYVIEW_R1 = hdulist_RASS_SXRB_R1[1].data
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
image_XRAY_SKYVIEW_header_R1 = hdulist_RASS_SXRB_R1[1].header
But this is coming back with IndexError: too many indices for array. I've had a look at accessing table data in the astropy documentation here (Accessing data stored as a table in a multi-extension FITS (MEF) file)
If anyone has a tried and tested method for accessing such images from a fits table I'd be very grateful! Many thanks.
I can't be sure without seeing the full traceback but I think the exception you're getting is from this:
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
There's no reason to manually wrap numpy.array() around the array. It's already a Numpy array. But in this case it's a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html).
@Andromedae93's answer is the right one. But for general documentation on this, see also: http://docs.astropy.org/en/stable/io/fits/index.html#working-with-table-data
However, the way you're working (which is fine for images) of manually calling fits.open, accessing the .data attribute of the HDU, etc. is fairly low level, and Numpy structured arrays are good at representing tables, but not great for manipulating them.
You're better off generally using Astropy's higher-level Table interface. A FITS table can be read directly into an Astropy Table object with Table.read(): http://docs.astropy.org/en/stable/io/unified.html#fits
The only reason the same thing doesn't exist for FITS images is that there's no generic "Image" class yet.
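As a cross-check on the header shown in the question, the numbers are consistent with each table column holding one full-sky HEALPix map split into rows of 1024 values. The sketch below only does that arithmetic, using the standard HEALPix relation NPIX = 12 * NSIDE^2. That flattening a column row by row gives the pixels in NESTED order is an assumption based on the header keywords, not something verified against the file.

```python
# Values copied from the BinTableHDU header in the question.
NSIDE = 512
ROWS = 3072             # NAXIS2
VALUES_PER_ROW = 1024   # TFORM = '1024E'
COLUMNS = 2             # TFIELDS
BYTES_PER_FLOAT32 = 4   # 'E' is a 4-byte float


def healpix_npix(nside):
    """Number of pixels in a full-sky HEALPix map of the given resolution."""
    return 12 * nside ** 2


npix = healpix_npix(NSIDE)
values_per_column = ROWS * VALUES_PER_ROW
bytes_per_row = COLUMNS * VALUES_PER_ROW * BYTES_PER_FLOAT32

print(npix, values_per_column, bytes_per_row)
```

Both counts agree with NPIX and NAXIS1 in the header, which suggests that flattening one column to a one-dimensional array gives an array that can be added element by element to other maps of the same NSIDE.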
I used astropy.io.fits during my internship in astrophysics, and this is my process for opening a .fits file and doing some operations:
from astropy.io import fits
# Opening the .fits file which is named SMASH.fits
field = fits.open('SMASH.fits')
# Data fits reading
tbdata = field[1].data
Now, with this method, tbdata is a NumPy array and you can do lots of things with it.
For example, if you have data like:
ID, Name, Object
1, HD 1527, Star
2, HD 7836, Star
3, NGC 6739, Galaxy
If you want to select the data in one column:
Data_name = tbdata['Name']
You will get :
HD 1527
HD 7836
NGC 6739
I don't know exactly what you want to do with your data, but I can help you ;)
I have a script which is designed to place inset maps onto specific pages while exporting Data Driven Pages. The script is an amalgamation of a friend's work and some of my own code from other projects.
The issue is that the code exports pages 15 and 16 twice, once with my inset maps and once without, and I can't figure out why.
I think it has something to do with the indentation within the loop, but I can't get it to behave in any other way. Any help would be appreciated!
import arcpy, os, time, datetime
from datetime import datetime
start_time = datetime.now()
PageNumber = "Page "
# Create an output directory variable i.e the location of your maps folder
outDir = r"C:\Users\support\Desktop\Python\Book of Reference"
# Create a new, empty pdf document in the specified output directory
# This will be your final product
finalpdf_filename = outDir + r"\FinalMapBook.pdf"
if os.path.exists(finalpdf_filename): # Check to see if file already exists, delete if it does
    os.remove(finalpdf_filename)
finalPdf = arcpy.mapping.PDFDocumentCreate(finalpdf_filename)
# Create a Data Driven Pages object from the mxd you wish to export
mxdPath = r"C:\Users\support\Desktop\Python\Book Of Reference\Book_Of_Reference_20160526_Python_Test.mxd"
tempMap = arcpy.mapping.MapDocument(mxdPath)
tempDDP = tempMap.dataDrivenPages
# Create objects for the layout elements that will be moving, e.g., inset data frame, scale text
Page15 = arcpy.mapping.ListDataFrames(tempMap)[1]
Page16 = arcpy.mapping.ListDataFrames(tempMap)[2]
# Instead of exporting all pages at once, you will need to use a loop to export one at a time
# This allows you to check each index and execute code to add inset maps to the correct pages
for pgIndex in range(1, tempDDP.pageCount + 1, 1):
    # Create a name for the pdf file you will create for each page
    temp_filename = r"C:\Users\support\Desktop\Python\Book of Reference\Book of Reference" + \
                    str(pgIndex) + ".pdf"
    if os.path.exists(temp_filename):
        os.remove(temp_filename) #Removes pdf if it is already in the folder
    # Code for setting up the inset map on the first page #
    if (pgIndex == 15):
        # Set position of inset map to place it on the page layout
        Page15.elementPositionX = 20.1717
        Page15.elementPositionY = 2.0382
        # Set the desired size of the inset map for this page
        Page15.elementHeight = 9.7337
        Page15.elementWidth = 12.7115
        # Set the desired extent for the inset map
        Page15insetExtent = arcpy.Extent(518878,108329,519831,107599)
        Page15insetExtent = Page15insetExtent
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        Page15.elementPositionX = 50 #Move the Inset back off the page
        arcpy.RefreshActiveView() #Refresh to ensure the Inset has been removed
        print PageNumber + str(pgIndex)
    if (pgIndex == 16):
        # Set up inset map
        Page16.elementPositionX = 2.1013
        Page16.elementPositionY = 18.1914
        Page16.elementHeight = 9.7337
        Page16.elementWidth = 12.7115
        Page16insetExtent = arcpy.Extent(520012, 107962, 521156,107086)
        Page16insetExtent = Page16insetExtent
        print PageNumber + str(pgIndex)
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        print PageNumber + str(pgIndex)
        Page16.elementPositionX = 50
    # Else function takes care of the pages that don't have insets and just iterates through using the loop on line 28
    else :
        tempDDP.exportToPDF(temp_filename, "RANGE", pgIndex)
        print PageNumber + str(pgIndex)
# Clean up
del tempMap
# Update the properties of the final pdf
# Save your result
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
I believe your problem is that when pgIndex is 15 it performs the export as intended, and then checks whether pgIndex is 16. It is not, so it drops into the else and re-exports the page without the inset maps. I would recommend changing the second if to an elif.
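The difference between the two control flows can be reproduced without arcpy. In this sketch the export calls are replaced by appending labels to a list, and the function names are made up for the illustration.

```python
def exports_with_two_ifs(page):
    """Mimic the original flow: two independent ifs, the second with an else."""
    exports = []
    if page == 15:
        exports.append("inset 15")
    if page == 16:
        exports.append("inset 16")
    else:
        # Runs for every page that is not 16, including page 15.
        exports.append("plain")
    return exports


def exports_with_elif(page):
    """Mimic the suggested flow: one if / elif / else chain."""
    exports = []
    if page == 15:
        exports.append("inset 15")
    elif page == 16:
        exports.append("inset 16")
    else:
        exports.append("plain")
    return exports


for page in (14, 15, 16):
    print(page, exports_with_two_ifs(page), exports_with_elif(page))
```

With two separate ifs, page 15 is exported once with the inset and once more by the else branch; with elif, every page is exported exactly once.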
I have 3 parameters:
startLine, startColumn and width (here 2, 8, 3).
How can I erase the selected area without writing blanks into each cell?
(Here there are only 30 lines, but there could potentially be 10,000 lines.)
Right now I'm successfully counting the number of lines, but I can't work out how to select and delete an area.
self.startLine = 2
self.startColumn = 8
self.width = 3
self.xl = client.Dispatch("Excel.Application")
self.xl.Visible = 1
self.xl.ScreenUpdating = False
self.worksheet = self.xl.Workbooks.Open(r"c:\test.xls")  # raw string so that \t is not read as a tab
sheet = self.xl.Sheets("data")
#Count the number of line of the record
nb = 0
while sheet.Cells(self.startLine + nb, self.startColumn).Value is not None:
    nb += 1
#must select from startLine,startColumn to startColumn+width,nb
#and then erase
PS: the code works; I may have left out some parts due to copy/paste errors. In reality, the handling of the Excel file is managed by several classes inheriting from each other.
What I usually do is record a macro in Excel and then try to re-hack the VB in Python. For deleting content I got something like this, which should not be hard to convert to Python:
In Python it should be something like:
Working example:
from win32com.client.gencache import EnsureDispatch
exc = EnsureDispatch("Excel.Application")
exc.Visible = 1
this worked for me
xl = EnsureDispatch('Excel.Application')
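For the area described in the question, one way to avoid touching cells one by one is to build a single A1-style address and clear that range in one call. The helpers below are a sketch with made-up names (column_letter, block_address), and they assume that width counts columns including the start column. Only the address arithmetic is shown as runnable code; the COM call in the final comment is untested here and relies on the Range.ClearContents method of the Excel object model.

```python
def column_letter(index):
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def block_address(start_line, start_column, width, line_count):
    """Return the A1-style address of a block of width columns and line_count rows."""
    first = column_letter(start_column)
    last = column_letter(start_column + width - 1)
    return "{}{}:{}{}".format(first, start_line, last, start_line + line_count - 1)


# Parameters from the question (2, 8, 3), with 29 counted lines as an example.
address = block_address(2, 8, 3, 29)
print(address)

# With a worksheet object obtained through win32com (not executed here):
# sheet.Range(address).ClearContents()
```

Because the whole block is cleared with one call, the cost does not grow with one COM round trip per cell, which matters when there are thousands of lines.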