how can I extract Chinese text from PDF using simple ‘with open’? - python

I need to extract pdf text using python,but pdfminer and others are too big to use,but when using simple "with open xxx as xxx" method, I met a problem , the content part didn't extract appropriately. The text looks like bytes because it start with b'. My code and the result screenshot:
with open(r"C:\Users\admin\Desktop\aaa.pdf","rb") as file:
aa=file.readlines()
for a in aa:
print(a)
Output Screenshot:

To generate an answer from the comments...
when using simple "with open xxx as xxx" method, I met a problem , the content part didn't extract appropriately
The reason is that PDF is not a plain text format but instead a binary format whose contents may be compressed and/or encrypted. For example the object you posted a screenshot of,
4 0 obj
<</Filter/FlateDecode/Length 210>>
stream
...
endstream
endobj
contains FLATE compressed data between stream and endstream (which is indicated by the Filter value FlateDecode).
But even if it was not compressed or encrypted, you might still not recognize any text displayed because each PDF font object can use its own, completely custom encoding. Furthermore, glyphs you see grouped in a text line do not need to be drawn by the same drawing instruction in the PDF, you may have to arrange all the strings in drawing instructions by coordinate to be able to find the text of a text line.
(For some more details and backgrounds read this answer which focuses on the related topic of replacement of text in a PDF.)
Thus, when you say
pdfminer and others are too big to use
please consider that they are so big for a reason: They are so big because you need that much code for adequate text extraction. This is in particular true for Chinese text; for simple PDFs with English text there are some short cuts working in benign circumstances, but for PDFs with CJK text you should not expect such short cuts.
If you want to try nonetheless and implement text extraction yourself, grab a copy of ISO 32000-1 or ISO 32000-2 (Google for pdf32000 for a free copy of the former) and study that pdf specification. Based on that information you can step by step learn to parse those binary strings to pdf objects, find content streams therein, parse the instructions in those content streams, retrieve the text pieces drawn by those instructions, and arrange those pieces correctly to a whole text.
Don't expect your solution to be much smaller than pdfminer etc...

Related

How to parse this kind of PDF with python

I am trying to parse the pdf found here: https://corporate.lowes.com/sites/lowes-corp/files/annual-report/lowes-2020ar.pdf with python. It seems to be text-based, according to the copy/paste test, and the first several pages parse just fine using, e.g. pymupdf.
However, after about page 12, there seems to be an internal change in the document encoding. For example, this section from page 18:
It looks like text, but when you copy and paste it, it becomes:
%A>&1;<81
FB9#4AH4EL
%BJ8XF8#C?BL874CCEBK<#4G8?L
9H??G<#84FFB6<4G8F4A7
C4EGG<#84FFB6<4G8F
CE<#4E<?L<AG;8.A<G87,G4G8F4A74A474"A9<F64?
J88KC4A787BHEJBE>9BE68
;<E<A:4FFB6<4G8F<AC4EGG<#8
F84FBA4?
4A79H??G<#8CBF<G<BAFGB9H?9<??G;8F84FBA4?78#4A7B9BHE,CE<A:F84FBA
<A6E84F8778#4A77HE<A:G;8(/"C4A78#<6
4F6HFGB#8EF9B6HF87BA;B#8<#CEBI8#8AGCEB=86GF
4A74A4G<BAJ<78899BEGGB#B7<9LBHEFGBE8?4LBHG
What is going on here? Will I need to use OCR to parse a file like this? Or is there some way of translating that the stuff above back to text?
Pages 13 to 100 have been imported also there are other odd practices thus suggest you will get 12 good pages then need to OCR 13-100 then probably good 3 pages from 101-104 again see https://stackoverflow.com/a/68627207/10802527
The majority of Pages 13-100 contain structured text that is described as Roman, and coincidentally the Romans were fond of encoding messages by sliding the alphabet a few step to the right or left and that's exactly what's happening here by character sliding we could extract much of the corrupted text using chars+n so read
A and replace with n
B and replace with o
C and replace with p
etc. but I will leave it there as I have little time to do 90 pages of analysis on a bad file font definition.
I tried Acrobat and Exchange plus others all agreed the text was defined as a reasonable form of Times Roman thus nothing to fix but content is meaningless nevertheless Selecting the characters for "We" (08) generally jumped to another instance suggesting there could be some slight possibility of redemption but then yet again the same two characters stopped on occasion at "ai" which is what's needed so I would say the file is Borked.
In theory the corruption should be recoverable in the PDF by remapping that font (at least for those pages), and with good Char remapping by adding or subtracting accordingly the plain text may be more easily converted.

PyPDF2 can't read non-English characters, returns empty string on extractText()

i'm working on a script that will extract data from a large PDF File (40-60 plus, pages long)
that isn't in English but the file contains Greek characters and all seems good until i run the extractText() function of PyPDF2 to get the givens page contents, then it returns an empty string.
I'm new to this library and i don't know what to do, to fix this problem!!
PyPDF2's "Extract Text" looks like it will either Work Just Fine, or Fail Completely. There's no parameters you can pass in to try to get things to work properly. It'll work or it won't.
You may not be able to fix this problem. If you can successfully copy/paste the text in Acrobat/Reader, then it's possible to extract the text. So what happens when you try to copy/paste out of Reader? Don't try this with some other third party PDF viewer, use Adobe software. You'll probably have to abandon PyPDF2 and move on to some other PDF API, but if Reader can do it, it's a fixable problem.
There are three different things in a PDF that can look like letters to the human eye.
Letters in the PDF in some text encoding. There are several fixed encodings, plus PDF allows you to embed your own custom encodings (often used with font subsets). Software can create PDFs that look fine but can't really be copy/pasted from, even by Adobe.
Path art that just happens to look an awful lot like letters. "Start drawing a line here, draw a straight line to there, then a curve like this to there" and so on. If you're curious, PDF uses Bezier curves to define its curves. Not terribly related to your question, but interesting.
Bit maps (.jpeg/gif/etc images) that define a grid of pixels.
In the past, Reader has only been able to handle text type 1 above, and then only if the text was encoded properly. Broken custom encodings are alarmingly common (or were 7+ years ago when I stopped working on PDF software).
With broken type 1s, and all of 2 and 3, the only thing you can do is to run OCR on the PDF. OCR: Optical Character Recognition. There are several open source OCR projects out there, as well as commercial ones.

Read PDF summary with Python

I'm trying to read some PDF documents with Python.
I would like to extract a summary in the first page.
Does it exist a library able to do it?
There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google reveals that there are several text summarizer python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky - even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
If you're using MacOS, there is a built-in text summarizing Service. Right click on any selected text and click "Summarize" to activate. Though it seems hard to incorporate this into any automated process.

Python Slate Library: PDF text extraction concatenating words

Just trying to extract the text from a PDF in Python, using the Slate Library and PyPDF2. Unfortunately some PDFs are being output with multiple words merged/concatenated together. This seems to happen intermittently, for example for some PDFs words are extracted with the spaces between them correctly, whereas others are not.
One example of a PDF where words are not extracted correctly is included and available for download (SO wouldn't let me upload it) here. The output from
slate.PDF(open(name, 'rb') ).text()
is (or at least a segment is):
,notonadhocprocedures,andcanbeusedwithdatacollectedatmul-tiplespatialresolutions(Kulldorff1999).Ifdataontheabundanceofataxonovertimeareavailable,thesedatacanbeincorporatedintoanSTPSanalysistoincreasethesensitivityandreliabilityofthemodeltodetectsightingclusters,
where of course the first comma-separated token should be not on adhoc procedures
Does anybody know why this is happening, or have a better idea of a library to use for PDF text extraction?
Thanks for the help!

Is there a way to automate specific data extraction from a number of pdf files and add them to an excel sheet?

Regularly I have to go through a list of pdf files and search for specific data and add them to an excel sheet for later review. As the number of pdf files are around 50 per month, it is both time taking and frustrating to do it manually.
Can the process be automated in windows by python or any other scripting language? I require to have all the pdf files in a folder and run the script which will generate an excel sheet with all the data added. The pdf files with which I work are tabular and have similar structures.
Yes. And no. And maybe.
The problem here is not extracting something from a PDF document. Extracting something is almost always possible and there are plenty of tools available to extract content from a PDF document. Text, images, whatever you need.
The major problem (and the reason for the "no" or "maybe") is that PDF in general is not a structured file format. It doesn't care about columns, paragraphs, tables, sentences or even words. In the general case it cares only about characters on a page in a specific location.
This means that in the general case you cannot query a PDF document and ask it for every paragraph or for the third sentence in the fifth paragraph. You can ask a library to get all of the text or all of the text in a specific location. And then you have to hope the library is able to extract the text you need in a legible format. Because there doesn't even have to be the case that you can copy and paste or otherwise extra understandable characters from a PDF file. Many PDF files don't even contain enough information for that.
So... If you have a certain type of document and you can test that it predictably behaves a certain way with a certain extraction engine, then yes, you can extract information from a PDF file.
If the PDF files you receive are different all the time or the layout on the page is totally different every time than the answer is probably that you cannot reliably extract the information you want.
As a side note:
There are certain types of PDF documents that are easier to handle than others so if you're lucky that might make your life easier. Two examples:
Many PDF files will in fact contain textual information in such a way that it can be extracted in a legible way. PDF files that follow certain standards (such as PDF/A-1a, PDF/A-2a or PDF/A-2u etc...) are even required to be created this way.
Some PDF files are "tagged" which means they contain additional structural information that allows you to extract information in an easier and more meaningful way. This structure would in fact identify paragraphs, images, tables etc and if the tagging was done in a good way it could make the job of content extraction much easier.
You could use pdf2text2 in Python to extract data from your PDF.
Alternatively you can use pdftotext that is part of the Xpdf suite

Categories