How is the text from this pdf encoded?

I have some PDFs with data about machine parts and I am trying to extract sizes. I extracted the text from a PDF via pypdfium2.
import pypdfium2 as pdfium
pdf = pdfium.PdfDocument("myfile.pdf")
page=pdf[1]
textpage = page.get_textpage()
Most of the text is readable but for some reason the important data is not readable when extracted.
In the extracted string the relevant part is like this
Readable text \r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15 readable text
I also tried tika and PyMuPDF. They only give me the question mark character for those parts.
I know the mangled part (\r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15) should be 3,0 8,8 +0,058/0 5,0 4,0 4,5.
My current idea is to make my own encoding table, but I wanted to ask if there is a better method and if this looks familiar to someone.
I have about 52 files with around 200 occurrences each.
While the PDFs are not confidential, I don't want to post links because they are not my intellectual property.
Update:
I tried to find out more about the fonts.
from pdfreader import PDFDocument
fd = open("myfile", "rb")
doc = PDFDocument(fd)
page = next(doc.pages())
font_keys = sorted(page.Resources.Font.keys())
for font_key in font_keys:
    font = page.Resources.Font[font_key]
    print(f"{font_key}: {font.Subtype}, {font.BaseFont}, {font.Encoding}")
gives:
R13: Type0, UHIIUQ+MetaPlusBold-Roman-Identity-H, Identity-H
R17: Type0, EWGLNL+MetaPlusBold-Caps-Identity-H, Identity-H
R20: Type1, NRVKIY+Meta-LightLF, {'Type': 'Encoding', 'BaseEncoding': 'WinAnsiEncoding', 'Differences': [33, 'agrave', 'degree', 39, 'quoteright', 177, 'endash']}
R24: Type0, IKRCND+MetaPlusBold-Italic-Identity-H, Identity-H
Edit:
I am not interested in help translating it manually; I can do that myself. I am interested in a solution that works by script, for example a script that extracts the fonts with their code maps from the PDF and then uses those to translate the unreadable parts.

This is a not uncommon CID CMap substitution, shown here in Python escape notation. It is usually specific to a single font whose name carries a random six-letter prefix, e.g. UHIIUQ+FontName, and is often found with subset fonts that embed only a limited range of characters.
should be 3,0 8,8 +0,058/0 5,0 4,0 4,5
\r\n = CR LF (Windows line ending, \x0d\x0a)
\x13 has been mapped to 3
\x0c has been mapped to ,
\x10 has been mapped to 0
(literal space)
\x18 = 8
\x0c = ,
\x18 = 8
(literal space)
\x0b has been mapped to +
\x10 = 0
\x0e has been mapped to , (very odd, see \x0c)
\x10 = 0
\x15 = 5
\x18 = 8
\x0f has been mapped to /
\x10 = 0
(literal space)
\x15 = 5
\x0c = ,
\x10 = 0
(literal space)
\x14 = 4
\x0c = ,
\x10 = 0
(literal space)
\x14 = 4
\x0c = ,
\x15 = 5
So the \x0# codes (low-order control codes) stand for punctuation and the \x1# codes stand for digits.
It is unknown whether \x2# codes are used for letters; the CMap table should be queried for the full details.
Since \x0e differs from \x0c, I suspect it should be the decimal separator, a dot.
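One pattern in the sample is worth checking before building a table by hand: every mangled code equals the ASCII code of the expected character minus 0x20 (\x13 + 0x20 = 0x33 = '3', \x0c + 0x20 = 0x2c = ','). Under that reading \x0e becomes 0x2e, a dot, which fits the suspicion about the decimal separator. The sketch below applies that shift. The function name and the choice to leave \r and \n untouched are assumptions, and the font's ToUnicode CMap, if present, remains the authoritative source.

```python
def unshift_low_codes(text, offset=0x20, keep="\r\n"):
    """Shift control-range characters up by `offset`; leave everything else alone.

    Assumption: the pattern seen in the sample (code = ASCII - 0x20) holds for
    the whole font. Characters listed in `keep` are treated as real line breaks.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x20 and ch not in keep:
            out.append(chr(ord(ch) + offset))
        else:
            out.append(ch)
    return "".join(out)


mangled = ("\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 "
           "\x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15")
print(unshift_low_codes(mangled))
# prints: 3,0 8,8 +0.058/0 5,0 4,0 4,5
```

Note the ambiguity this leaves open: with the same shift a minus sign would arrive as \r and an asterisk as \n, so the `keep` default is a guess that should be checked against a page where the expected text is known.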

Here is example code to get the source of a font's CMAP with PyMuPDF:
import fitz
doc = fitz.open("some.pdf")
# assume that we know a font's xref already
# extract the xref of its CMAP:
cmap_xref = doc.xref_get_key(xref, "ToUnicode")[1]  # second string is 'nnn 0 R'
if cmap_xref.endswith("0 R"):  # check if a CMAP exists at all
    cxref = int(cmap_xref.split()[0])
else:
    raise ValueError("no CMAP found")
print(doc.xref_stream(cxref).decode())  # convert bytes to string
Example output:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R63 def
1 begincodespacerange
<00><ff>
endcodespacerange
12 beginbfrange
<20><20><0020>
<2e><2e><002e>
<30><31><0030>
<43><46><0043>
<49><49><0049>
<4c><4d><004c>
<4f><50><004f>
<61><61><0061>
<63><69><0063>
<6b><70><006b>
<72><76><0072>
<78><79><0078>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
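Once the CMap source is available as text, the bfrange entries can be turned into a lookup table. This is a minimal sketch using only the standard library; `parse_bfranges` is a hypothetical helper, and it deliberately handles only the simple one-entry-per-line form `<lo><hi><dst>` with a four-digit target. Array targets, multi-character targets, and bfchar sections are skipped.

```python
import re

# One simple bfrange entry per line: <lo><hi><dst>
ENTRY = re.compile(
    r"\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*"
)


def parse_bfranges(cmap_text):
    """Return {source code: Unicode character} from simple bfrange entries."""
    mapping = {}
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for line in section.splitlines():
            match = ENTRY.fullmatch(line)
            if not match:
                continue  # array form or anything else this sketch ignores
            lo, hi, dst = match.groups()
            if len(dst) != 4:
                continue  # skip multi-character targets in this sketch
            start, end, target = int(lo, 16), int(hi, 16), int(dst, 16)
            for step in range(end - start + 1):
                mapping[start + step] = chr(target + step)
    return mapping


sample = """4 beginbfrange
<20><20><0020>
<2e><2e><002e>
<30><31><0030>
<43><46><0043>
endbfrange"""

table = parse_bfranges(sample)
print(table[0x31], table[0x46])
# prints: 1 F
```

The resulting dictionary maps the character codes used in the content stream to Unicode, which is the table the question proposes to build by hand.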

Related

MS Word Manipulation using Python

I have an MS Word document and I want to apply the following settings automatically using Python:
1. Font type = Trebuchet MS
2. Font size = 11
3. Table and appendices font size = 10
4. Line spacing = multiple of 1.15
5. Space before paragraph = 0
6. Space after paragraph = 0
7. Paragraph should be justified
8. Number should be aligned to the bottom right corner of document
9. Text in the table should be justified
10. Insert footer and header automatically
11. Tool should ensure that a document with a single page shall not have a page number.
12. Tool should ensure that for documents exceeding one page, page numbers shall be inserted at the right-hand side of the document.
13. Tool should ensure that the cover page shall not be assigned a page number.
14. Tool should ensure that Roman numbers (i, ii, iii, ...) are used only on the preliminary pages, including table of contents, preface, abbreviations, list of tables, executive summary.
15. Tool should ensure that Arabic numbers (1, 2, 3, ...) are used only for the main text of the report and appendices.
16. Tool should ensure that the year in any document is written in full for the preceding year and with the last two digits for the current year. For example, instead of writing 2020/2021, write 2020/21.
17. Tool should ensure that only English (United Kingdom) vocabulary is used and NOT English (United States). Example: "analyse" (English UK) vs "analyze" (English US).
18. Tool should ensure that numbers presented in a paragraph that are less than 10 are written in words (one, two, three, ...).
19. Tool should ensure that numbers presented in a paragraph that are 10 and above are written in numerals.
20. Tool should ensure that numbers within a table are expressed in numerals even if they are less than 10.
This is my code so far:
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Inches, Pt
path = 'C:\\Users\\Gaston\\Documents\\Words\\test.docx'
doc = Document(path)
# Set the default font on the 'Normal' style
style = doc.styles['Normal']
font = style.font
font.name = 'Trebuchet MS'
font.size = Pt(11)
# Note: this formats only the newly added, empty paragraph
paragraph = doc.add_paragraph()
paragraph_format = paragraph.paragraph_format
paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
paragraph_format.right_indent = Inches(1)
paragraph_format.space_before = Pt(0)
paragraph_format.space_after = Pt(0)
# A float is a multiple of single line spacing; a Length would mean a fixed height
paragraph_format.line_spacing = 1.15
doc.save(path)
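Rules 16 and 18 are plain text transformations, so they can be prototyped on strings before being wired into python-docx. The following is only a sketch under assumptions: the function names are hypothetical, a "standalone" number is taken to be a single digit 1-9 that is not part of a larger number, decimal, or date, and applying the result back to paragraphs is left out.

```python
import re

WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
         "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def shorten_year_range(text):
    """Rule 16: write 2020/2021 as 2020/21 when the years are consecutive."""
    def repl(match):
        first, second = match.group(1), match.group(2)
        if int(second) == int(first) + 1:
            return f"{first}/{second[2:]}"
        return match.group(0)
    return re.sub(r"\b(\d{4})/(\d{4})\b", repl, text)


def spell_small_numbers(text):
    """Rule 18: spell out standalone digits 1-9 in running text."""
    # Not preceded by a word character or . , / and not followed by a word
    # character or by . , / plus a digit (so 12, 3.5 and 1/2 are left alone).
    pattern = r"(?<![\w.,/])([1-9])(?!\w|[.,/]\d)"
    return re.sub(pattern, lambda m: WORDS[m.group(1)], text)


sentence = "We tested 3 pumps and 12 valves in 2020/2021."
print(spell_small_numbers(shorten_year_range(sentence)))
# prints: We tested three pumps and 12 valves in 2020/21.
```

Rule 20 (numerals inside tables) suggests running `spell_small_numbers` on body paragraphs only and skipping table cells.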

Accessing additional paragraph-style properties with python-docx

I am trying to parse a Word document using python-docx, but have trouble getting the correct styles of paragraphs. I have uploaded a simplified version of the file to Dropbox.
The document's 'Normal' style uses the 'Garamond' font, but this is changed so that everywhere I click in the file, the font is 'Calibri (Body)'.
When I use the 'Style inspector' in Word on the first line, it shows: "Paragraph formatting" is Normal + Plus: Centered, Left: 0 cm, Before: 0 pt, and "Text level formatting" is Default Paragraph Font + Plus: +Body (Calibri), 14 pt, Bold, Underline.
When I do the same on a non-bold text in the table, I get: "Paragraph formatting" is Normal + Plus: +Body (Calibri), Before: 0 pt, and "Text level formatting" is Default Paragraph Font + Plus: <none>.
That is, the font is changed on different levels inside and outside of the table. In both cases, however, I do not know how to get this info using python-docx:
import docx
doc = docx.Document('test.docx')
par = doc.paragraphs[0]
#par = doc.tables[0].cell(0,1).paragraphs[0]
print(f"'{par.style.name}'")
print(f"'{par.style.font.name}'")
print(f"'{par.runs[0].font.name}'")
print(f"'{par.runs[0].style.name}'")
print(f"'{par.runs[0].style.font.name}'")
c = doc.tables[0].cell(1,0)
for par in c.paragraphs:
    print(f"{len(par.runs)}", end=' ')
c.paragraphs[0].add_run('Very short summary')
doc.save('test_ed.docx')
returns
'Normal'
'Garamond'
'None'
'Default Paragraph Font'
'None'
1 0 0 0 0 0 0 0 0 1
In other words, I do not see any sign that the document actually uses the Calibri font.
It returns exactly the same if I use the second par definition (from the table).
Moreover, looking at the resulting test_ed.docx, the added line is using 'Garamond', even though Word shows the other empty paragraphs as using 'Calibri (Body)'.
So, my question is how to detect the actual format of the text and how to copy it to new paragraphs?
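One possible explanation, offered as an assumption since the file is not available here: Word stores a theme font such as "+Body (Calibri)" in the `w:asciiTheme` attribute of a `w:rFonts` element rather than in `w:ascii`, and python-docx's `font.name` reports only the latter. A minimal sketch that inspects a raw XML fragment with the standard library follows; `font_hints` and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def font_hints(xml_fragment):
    """Return (explicit_font, theme_font) for the first w:rFonts below the root."""
    root = ET.fromstring(xml_fragment)
    rfonts = root.find(f".//{{{W}}}rFonts")
    if rfonts is None:
        return None, None
    return rfonts.get(f"{{{W}}}ascii"), rfonts.get(f"{{{W}}}asciiTheme")


# Hypothetical run properties that refer to the theme's body font
sample = (
    f'<w:rPr xmlns:w="{W}">'
    '<w:rFonts w:asciiTheme="minorHAnsi" w:hAnsiTheme="minorHAnsi"/>'
    '<w:b/>'
    '</w:rPr>'
)
print(font_hints(sample))
# prints: (None, 'minorHAnsi')
```

If the real document shows the same pattern, the `None` values printed above would be expected, and the actual typeface would have to be looked up in the document's theme part.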

Parse unstructured text in python

I am new to Python and am trying to read a PDF file to pull the ID No. I have been successful so far in extracting the text from the PDF file using pdfplumber. Below is the code block:
import pdfplumber
with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
print(raw_text)
Here is the text output:
Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal. Learn
More text here..
Even more text here..
Mr Jane Doe
Jack & Jill Street Learn more about your
www.abc.com
....
....
....
However, I am unable to find the optimum way to parse this unstructured text further. The final output I am expecting is just the ID No., i.e. 10101010. On a side note, the script will be run against a fairly large set of PDFs, so performance is a concern.

Try using a regular expression:
import pdfplumber
import re
with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
m = re.search(r'ID No\. : (\d+)', raw_text)
if m:
    print(m.group(1))
Of course you'll have to iterate over all the PDF's contents - not just the first page! Also ask yourself if it's possible that there's more than one match per page. Anyway: you know the structure of the input better than I do (and we don't have access to the sample file), so I'll leave it as an exercise for you.
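A sketch of that iteration, kept independent of the PDF library so it can be tried on plain strings. `find_ids` is a hypothetical helper; it assumes each page's text arrives as a string (or None for a page without text) and collects every match rather than only the first.

```python
import re

# Compiled once; tolerant of varying whitespace around the colon
ID_PATTERN = re.compile(r"ID No\.\s*:\s*(\d+)")


def find_ids(page_texts):
    """Collect every ID found in an iterable of page texts."""
    ids = []
    for text in page_texts:
        if text:  # skip pages that yielded no text
            ids.extend(ID_PATTERN.findall(text))
    return ids


pages = [
    "Welcome to ABC\n01 January, 1991\nID No. : 10101010\nWelcome to your ABC portal.",
    None,
    "More text here..",
]
print(find_ids(pages))
# prints: ['10101010']
```

With pdfplumber the input could be a generator such as `(page.extract_text() for page in pdf_file.pages)`, so pages are processed one at a time.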

If the length of the ID number is always the same, I would try to find its location with the find function. position = raw_text.find('ID No. : ') should return the position of the I in ID No., so position + 9 should be the first digit of the ID. If the number always has a length of 8, you could get it with int(raw_text[position+9:position+17]).
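The same idea as a small function, with two guards added: `str.find` returns -1 when the label is missing, and the slice is checked to be all digits. This is a sketch; `id_after_label` is a hypothetical name, and it returns the ID as a string (an assumption made here so that leading zeros survive) rather than an int.

```python
def id_after_label(raw_text, label="ID No. : ", length=8):
    """Return the fixed-length digit string that follows `label`, or None."""
    position = raw_text.find(label)
    if position == -1:
        return None  # label not present on this page
    start = position + len(label)
    candidate = raw_text[start:start + length]
    if len(candidate) == length and candidate.isdigit():
        return candidate
    return None


text = "Welcome to ABC\n01 January, 1991\nID No. : 10101010\nWelcome to your ABC portal."
print(id_after_label(text))
# prints: 10101010
```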

If you are new to Python and actually need to process serious amounts of data, I suggest that you look at Scala as an alternative.
For data processing in general, and regular expression matching in particular, the time it takes to get results is much reduced.
Here is an answer to your question in Scala instead of Python:
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor
val fil = "ABC.pdf"
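val pageCount = new PdfReader(fil).getNumberOfPages
// pages are numbered from 1, so use an inclusive range to reach the last page
val textFromPage = (1 to pageCount).par.map(page => PdfTextExtractor.getTextFromPage(new PdfReader(fil), page)).mkString
val r = "ID No\\. : (\\d+)".r
// group(1) is the captured number; group(0) would be the whole match
val res = for (m <- r.findAllMatchIn(textFromPage)) yield m.group(1)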
res.foreach(println)

What is a working method for extracting numeric values with associated data from open text?

I tried to look for a solution but nothing was giving me quite what I needed. I'm not sure regex can do what I need.
I need to process a large amount of data where license information is provided. I just need to grab the number of licenses and the name for each license then group and tally the license counts for each company.
Here's an example of the data pulled:
L00129A578-E105C1D138 1 Centralized Recording
$42.00
L00129A677-213DC6D60E 1 Centralized Recording
$42.00
1005272AE2-C1D6CACEC8 5 Station
$45.00
100525B658-3AC4D2C93A 5 Station
$45.00
I would need to grab the license count and license name then add like objects so it would grab (1 Centralized Recording, 1 Centralized Recording, 5 Station, 5 Station) then add license counts and output (2 Centralized Recording, 10 Station)
What would be the easiest way to implement this?

It looks like you're trying to ignore the license number, and get the count and name. So, the following should point you on your way for your data, if it is as uniform as it seems:
import re
r = re.compile(r"\s+(\d+)\s+([A-Za-z ]+)")
m = r.search(" 1 Centralized")
m.groups()
# ('1', 'Centralized')
That regex just says, "Require but ignore 1 or more spaces, pay attention to the string of digits after it, require but ignore 1 or more spaces after it, and pay attention to the capital letters, lower case letters, and spaces after it." (You may need to trim off a newline when you're done.)
The file-handling bit would look like:
with open('/path/to/your_data_file.txt') as f:
    for line in f:
        # run regex and do stuff for each line
        pass
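Putting the regex and the loop together, the grouping and tallying can be done with `collections.Counter`. This is a sketch run against the sample lines from the question; `tally_licenses` is a hypothetical name, and the pattern assumes each license line has the form key, count, name, while price lines simply fail to match.

```python
import re
from collections import Counter

# license key, then the count, then the license name up to the end of the line
LINE = re.compile(r"^\S+\s+(\d+)\s+([A-Za-z ]+?)\s*$")


def tally_licenses(lines):
    """Sum the license counts per license name."""
    totals = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:  # price lines such as "$42.00" do not match
            totals[match.group(2)] += int(match.group(1))
    return totals


data = """L00129A578-E105C1D138 1 Centralized Recording
$42.00
L00129A677-213DC6D60E 1 Centralized Recording
$42.00
1005272AE2-C1D6CACEC8 5 Station
$45.00
100525B658-3AC4D2C93A 5 Station
$45.00"""

print(dict(tally_licenses(data.splitlines())))
# prints: {'Centralized Recording': 2, 'Station': 10}
```

Passing an open file object instead of `data.splitlines()` works the same way, since the pattern tolerates the trailing newline.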

import re, io, pandas as pd
a = open('your_data_file.txt').read()  # re.sub needs the text, not the file object
pd.read_csv(io.StringIO(re.sub(r'(?m).*\s(\d+)\s+(.*\S+)\s+$\n|.*','\\1,\\2',a)),
            header=None).groupby(1).sum()[0].to_dict()

Pandas is a good tool for jobs like this. You might have to play around with it a bit. You will also need to export your Excel file as a .csv file. In the interpreter, try:
import pandas
raw = pandas.read_csv('myfile.csv')
print(raw.columns)
That will give you the column headings for the csv file. If you have headers name and nums, then you can extract those as a list of tuples as follows:
extract = list(zip(raw.name, raw.nums))
You can then sort this list by name:
extract = sorted(extract)
Pandas probably has a method for compressing this easily, but I can't recall it so:
def accum(c):
    nm = c[0][0]
    count = 0
    result = []
    for x in c:
        if x[0] == nm:
            count += x[1]
        else:
            result.append((nm, count))
            nm = x[0]
            count = x[1]
    result.append((nm, count))
    return result
done = accum(extract)
Now you can write this to a text file as follows (f-strings require Python 3.6+):
with open("myjob.txt", "w+") as fout:
    for x in done:
        line = f"name: {x[0]} count: {x[1]} \n"
        fout.write(line)

How to add space between lines within a single paragraph with Reportlab

I have a block of text that is dynamically pulled from a database and is placed in a PDF before being served to a user. The text is being placed onto a lined background, much like notepad paper. I want to space the text so that only one line of text is between each background line.
I was able to use the following code to create a vertical spacing between paragraphs (used to generate another part of the PDF).
style = getSampleStyleSheet()['Normal']
style.fontName = 'Helvetica'
style.spaceAfter = 15
style.alignment = TA_JUSTIFY
story = [Paragraph(choice.value,style) for choice in chain(context['question1'].itervalues(),context['question2'].itervalues())]
generated_file = StringIO()
frame1 = Frame(50,100,245,240, showBoundary=0)
frame2 = Frame(320,100,245,240, showBoundary=0)
page_template = PageTemplate(frames=[frame1,frame2])
doc = BaseDocTemplate(generated_file,pageTemplates=[page_template])
doc.build(story)
However, this won't work here because I have only a single, large paragraph.

Pretty sure what you want to change is the leading. From chapter 6 of the user manual:
To get double-spaced text, use a high leading. If you set autoLeading (default "off") to "min" (use observed leading even if smaller than specified) or "max" (use the larger of observed and specified) then an attempt is made to determine the leading on a line by line basis. This may be useful if the lines contain different font sizes etc.
Leading is defined earlier, in chapter 2:
Interline spacing (Leading)
The vertical offset between the point at which one line starts and where the next starts is called the leading offset.
So try different values of leading, for example:
style = getSampleStyleSheet()['Normal']
style.leading = 24
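To match a ruled background, the leading has to equal the distance between the background lines, and the frame then holds a whole number of such lines. A small arithmetic sketch, assuming a line pitch of 24 pt, the 240 pt frame height from the question, and ReportLab's default frame padding of 6 pt on each side; `lines_that_fit` is a hypothetical helper.

```python
def lines_that_fit(frame_height, leading, padding=6):
    """Number of text lines of the given leading that fit in a frame.

    `padding` is applied at the top and at the bottom of the frame.
    """
    usable = frame_height - 2 * padding
    return int(usable // leading)


line_pitch = 24  # assumed distance between the background lines, in points
print(lines_that_fit(240, line_pitch))
# prints: 9
```

The vertical position of the first line still has to be aligned with the background by adjusting the frame's position or padding.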

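Add leading to the ParagraphStyle:
from reportlab.lib import colors
from reportlab.lib.styles import ParagraphStyle
orden = ParagraphStyle('orden')
orden.leading = 14
orden.borderPadding = 10
orden.backColor = colors.gray
orden.fontSize = 14
Generate the PDF:
from io import BytesIO
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.platypus import Paragraph
buffer = BytesIO()
p = canvas.Canvas(buffer, pagesize=letter)
text = Paragraph("TEXT Nro 0001", orden)
text.wrapOn(p, 500, 10)
text.drawOn(p, 45, 200)
p.showPage()
p.save()
pdf = buffer.getvalue()
buffer.close()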
