Extract text from PDF file with Python

I would like to extract text, including tables, from a PDF file.
I tried camelot, but it only extracts table data, not text.
I also tried PyPDF2, but it can't read Chinese characters.
Here is the PDF sample to read.
Are there any recommended Python packages for text extraction?

By far the simplest way is to extract the text with a single OS shell command using the Poppler PDF utilities (often bundled with Python PDF libraries), then post-process that output in your Python script as required.
>pdftotext -layout -f 1 -l 1 -enc UTF-8 sample.pdf
NOTE: some of the text is embedded in image form to the right of the logo. That area can be extracted separately using pdftoppm -png or pdfimages and then passed to OCR tools, which give inferior output quality, for those smaller areas.
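The command above can also be driven from Python. The following is a minimal sketch, assuming the Poppler pdftotext binary is installed and on PATH; the function names build_pdftotext_command and extract_text are hypothetical helpers made up for illustration, not part of any library.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, first_page=1, last_page=1):
    """Return the argument list for the pdftotext call shown above."""
    return [
        "pdftotext",
        "-layout",              # keep the physical layout so table columns stay aligned
        "-f", str(first_page),  # first page to convert
        "-l", str(last_page),   # last page to convert
        "-enc", "UTF-8",        # output encoding
        str(pdf_path),
        str(txt_path),
    ]


def extract_text(pdf_path, first_page=1, last_page=1):
    """Run pdftotext (assumed to be on PATH) and return the extracted text."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    command = build_pdftotext_command(pdf_path, txt_path, first_page, last_page)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return txt_path.read_text(encoding="utf-8")
```

Calling extract_text("sample.pdf") would write sample.txt next to the PDF and return its contents as a string, which can then be cleaned up in Python as needed.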

Related

How to remove text layer from pdf using python

I need to remove all text information from a PDF file, so the resulting file should look like a scan: only images wrapped as a PDF, with no text that you can copy or select. Right now I'm using this Ghostscript command:
import os
...
os.system(f"gs -o {output_path} -sDEVICE=pdfwrite -dFILTERTEXT {input_path}")
Unfortunately, with some documents it removes not only the text layer but also the actual pixels of the characters. Sometimes I cannot see any rendered text on the page at all, which is not what I need.
Is there a stable and fast solution using Python or pip-installable tools? It would be wonderful if I could solve this with PyMuPDF (fitz), but I couldn't find anything about it.

Is there a way to bulk convert docx files into PDF

Is there a way to convert docx files to PDF in bulk?
I'm doing a mail merge to generate a huge number of letters in Word format, and I'm struggling to convert those letters to PDF.
https://pandoc.org/demos.html - pandoc is a library and a command-line tool, and Python bindings are also available. Demos 35 and 36 show docx used as an input file, so I think it would work.
If direct PDF output is unavailable (I don't think I've used pandoc that way), you can also output to LaTeX, which can then be compiled to PDF (using MiKTeX or another LaTeX distribution).
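A bulk conversion along these lines could be sketched as below. This assumes the pandoc command-line tool is installed together with a PDF engine (pandoc typically needs a LaTeX distribution to write PDF directly); build_pandoc_commands and convert_folder are hypothetical helper names used only for illustration.

```python
import subprocess
from pathlib import Path


def build_pandoc_commands(folder):
    """Return one pandoc command per .docx file in `folder`, sorted by name."""
    commands = []
    for docx in sorted(Path(folder).glob("*.docx")):
        pdf = docx.with_suffix(".pdf")  # letter.docx -> letter.pdf in the same folder
        commands.append(["pandoc", str(docx), "-o", str(pdf)])
    return commands


def convert_folder(folder):
    """Convert every .docx in `folder`; assumes pandoc and a PDF engine are installed."""
    for command in build_pandoc_commands(folder):
        subprocess.run(command, check=True)  # stop at the first failed conversion
```

Note that pandoc converts the document content rather than reproducing Word's exact page layout, so mail-merge letters with heavy formatting may look different in the resulting PDF.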

I need to extract text from a PDF file and make a new .txt file to put it in

I need help with a Python script that reads a PDF file, copies every word in it into a new .txt file (each word on its own line), then deletes the repeated words, counts the words after that, and prints the count on the last line.
Install these libraries:
PyPDF2 (to convert simple, text-based PDF files into text readable by Python)
textract (to convert non-trivial, scanned PDF files into text readable by Python)
nltk (to clean and convert phrases into keywords)
Each of these libraries can be installed with the following command in the terminal (on macOS):
pip install Libraryname
See this tutorial: https://medium.com/#rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
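A minimal sketch of the script the question asks for follows. It assumes PyPDF2 version 3.0 or later for the PDF step, and it reads the question as "write each distinct word once, then the number of distinct words on the last line"; lowercasing the words before comparing them is also an assumption. The function names are hypothetical.

```python
import re


def read_pdf_text(pdf_path):
    """Return the text of a text-based PDF (assumes PyPDF2 >= 3.0 is installed)."""
    from PyPDF2 import PdfReader  # imported here so the helpers below work without it

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def unique_words(text):
    """Return the distinct words of `text`, lowercased, in order of first appearance."""
    seen = set()
    result = []
    for word in re.findall(r"\w+", text.lower()):
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


def write_word_list(text, txt_path):
    """Write one unique word per line, then the unique-word count on the last line."""
    words = unique_words(text)
    with open(txt_path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write(word + "\n")
        handle.write(str(len(words)) + "\n")
    return len(words)
```

A typical call would be write_word_list(read_pdf_text("input.pdf"), "words.txt"), where both file names are placeholders. Scanned PDFs without a text layer would need an OCR-based tool such as textract instead of PyPDF2.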
Use textract; it supports many file types, including PDF, so textract is the better choice.
Follow these links:
https://github.com/deanmalmgren/textract
https://textract.readthedocs.io/en/latest/
Did you search Stack Overflow for answers?
Here you can find some pretty good answers about how to extract text from a PDF file (look at Jakobovski's answer):
How to extract text from a PDF file?
Here you can find information about writing/editing/creating .txt files:
https://www.guru99.com/reading-and-writing-files-in-python.html

Getting Chinese text from pdf, font encoding issue

I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from lots of .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images, but proper documents with selectable text. If I use Adobe Acrobat Pro X and export to .txt, my output looks like:
!!
F/.....e..................!
216.. ..... .... ....
........
If I output to .doc, .docx, .rtf, or even copy-paste into any text editor, it looks like this:
ҁϦљӢख़ε༊౗ݢ୏ቹៜϐѦჾѱ൑॥ᓀϩ݋ӵΠ
I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste it. I thought it might be a font issue; the font is DFKaiShu sb-estd-bf, which I already have installed (it appears to come with Windows automatically).
I do have a workaround, but it's ugly and very difficult to automate: I print the PDF to a PDF (or any sort of image), then use Adobe Pro's built-in OCR, then convert to a Word document (it still does not convert correctly to .txt). Ultimately I need to do this for ~2000 documents, each of which can be up to 200 pages.
Is there any other way to do this? Why is exporting or copy-pasting not working correctly? I have uploaded a 2-page sample to Google Drive here.

How do I resize a PDF?

I have 100 PDF files in a folder in A0 format, and I'd like to scale them to A3.
I have tried various methods, but none of them worked. My goal is to create a script (preferably in Python) that scales all the files to A3.
I tried to use PIL, but my version of Python doesn't support it.
You can run commands in the Windows command processor
to perform these functions:
-Merge PDF files together, or split them apart
-Encrypt and decrypt
-Scale, crop and rotate pages
-Read and set document info and metadata
-Copy, add or remove bookmarks
-Stamp logos, text, dates, page numbers
-Add or remove attachments
-Losslessly compress PDF files
It also works on other operating systems such as macOS / Ubuntu / Linux.
Scale command example:
cpdf -scale-to-fit a3portrait in.pdf -o out.pdf
This uses the Coherent PDF Command Line Tools (cpdf), which you can download from here.
You can also use a Python script to run the command for you in the command processor / terminal.
Hope my post helped you, good luck.
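Running that cpdf command over a whole folder from Python could look like the sketch below. It assumes the cpdf executable is installed and on PATH, and it reuses the -scale-to-fit a3portrait invocation quoted above; build_cpdf_commands and scale_folder are hypothetical helper names, and writing the results to a separate output folder is a design choice made here so the A0 originals are left untouched.

```python
import subprocess
from pathlib import Path


def build_cpdf_commands(in_folder, out_folder, paper="a3portrait"):
    """Return one cpdf scale command per PDF in `in_folder`, sorted by name."""
    commands = []
    for pdf in sorted(Path(in_folder).glob("*.pdf")):
        target = Path(out_folder) / pdf.name  # same file name, different folder
        commands.append(["cpdf", "-scale-to-fit", paper, str(pdf), "-o", str(target)])
    return commands


def scale_folder(in_folder, out_folder):
    """Scale every PDF in `in_folder` to A3; assumes cpdf is on PATH."""
    Path(out_folder).mkdir(parents=True, exist_ok=True)
    for command in build_cpdf_commands(in_folder, out_folder):
        subprocess.run(command, check=True)  # stop at the first failed file
```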
You could try the Linux pdftk program (it might also be available on Windows, I am not sure). Once installed, you can use the command below to convert an A0 PDF to an A3 PDF. Note that convert, as written in the command, is the ImageMagick command-line tool rather than pdftk:
convert -page a0 infile.pdf -page a3 outfile.pdf
