How can I read pdf in python? [duplicate] - python

This question already has answers here:
How to extract text from a PDF file?
(33 answers)
Closed 5 years ago.
How can I read a PDF in Python?
I know one way to convert it to text, but I want to read the content directly from the PDF.
Can anyone explain which Python module is best for PDF extraction?

You can use the PyPDF2 package:
# install PyPDF2
pip install PyPDF2
Once you have it installed:
# importing all the required modules
import PyPDF2
# creating a pdf reader object
reader = PyPDF2.PdfReader('example.pdf')
# print the number of pages in pdf file
print(len(reader.pages))
# print the text of the first page
print(reader.pages[0].extract_text())
Follow the documentation.
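To go beyond the first page, the same calls can be looped over every page. This is a minimal sketch, assuming a PyPDF2 version that provides PdfReader; the helper names extract_all_text and read_pdf_text are made up for illustration and are not part of PyPDF2.

```python
def extract_all_text(reader):
    """Return the text of every page, joined with newlines.

    `reader` is expected to behave like PyPDF2.PdfReader: it exposes a
    `pages` sequence whose items provide extract_text().
    """
    # Fall back to an empty string if a page yields no text.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def read_pdf_text(path):
    """Open the PDF at `path` and return the text of all its pages."""
    import PyPDF2  # imported lazily so the helper above works without it

    return extract_all_text(PyPDF2.PdfReader(path))
```

Usage would look like print(read_pdf_text('example.pdf')). Scanned PDFs that contain only images will generally produce little or no text this way.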

You can use the textract module in Python.
To install it:
pip install textract
To read a PDF:
import textract
# process() returns the extracted text as bytes
text = textract.process('path/to/pdf/file', method='pdfminer')
For details, see the Textract documentation.
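Because textract.process returns bytes, the result usually needs decoding before it can be handled as a string. A small sketch, assuming the output is UTF-8 encoded; the function name pdf_to_str is hypothetical.

```python
def pdf_to_str(path, encoding="utf-8"):
    """Extract text from the PDF at `path` and return it as a str."""
    import textract  # imported lazily; requires `pip install textract`

    raw = textract.process(path, method="pdfminer")  # bytes
    # Assumption: the bytes are UTF-8 encoded; pass another encoding if not.
    return raw.decode(encoding)
```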

Related

Convert the docx file into pdf in python

I am working on a report generator. I installed python-docx (pip install python-docx) and use import docx.
I have created a new docx file and edited it, but I want to save it as a PDF instead of a docx file. The script will later be converted into an EXE file.
Please help.
from docx import Document
doc = Document()
doc.add_heading('Report', 0)
# I know how to save the file as docx,
# but I want to save it as a PDF.
# I cannot finish the program and then convert manually,
# because this script will run as an EXE.
doc.save('report.docx')
I tried doc.save('report.pdf'), but it did not work.
I found something here: https://medium.com/analytics-vidhya/3-methods-to-convert-docx-files-into-pdf-files-using-python-b03bd6a56f45 I personally think the easiest way to do it is the docx2pdf module.
You can use the Python package docx2pdf*:
pip install docx2pdf
Then import and call the convert function after saving the file with doc.save('report.docx'):
from docx2pdf import convert
convert("report.docx", "report.pdf")
The docx file must exist before it can be converted.
* This does not work on Linux, because docx2pdf requires Microsoft Word to be installed.
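Put together, the save-then-convert flow could look like the sketch below. The function name save_report_as_pdf and its convert parameter are illustrative assumptions; the parameter exists only so the flow can be tried without Microsoft Word, and it falls back to docx2pdf.convert.

```python
from pathlib import Path


def save_report_as_pdf(doc, pdf_path, convert=None):
    """Save a python-docx Document as .docx, then convert that file to PDF.

    Returns the path of the intermediate .docx file.
    """
    pdf_path = Path(pdf_path)
    docx_path = pdf_path.with_suffix(".docx")
    doc.save(str(docx_path))  # the .docx must exist before converting
    if convert is None:
        # Needs Microsoft Word installed (Windows or macOS).
        from docx2pdf import convert
    convert(str(docx_path), str(pdf_path))
    return docx_path
```

With the question's code, this would be called as save_report_as_pdf(doc, 'report.pdf') in place of doc.save('report.docx').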
Try the msoffice2pdf library, which uses the Microsoft Office or LibreOffice installation available in the environment.
https://pypi.org/project/msoffice2pdf/

How to read '.doc' file with python-docx module

I'm trying to read a .doc file with the python-docx module.
I'm doing:
import docx
path = 'Sample-doc-file-100kb.doc'
doc = docx.Document(path)
# extracting text from doc
This works fine for .docx, but for a .doc file it gives this error: ValueError: file 'Sample-doc-file-100kb.doc' is not a Word file, content type is 'application/vnd.openxmlformats-officedocument.themeManager+xml'.
I searched and found that the docx module doesn't work with the older .doc format. I looked into converting .doc to .docx, but all the solutions are Windows-dependent.
I'm running this code on AWS Lambda, so I can't use those methods.
Is there any platform-independent way to convert .doc to .docx, or to read a .doc file directly?
Regarding "convert doc to docx (platform independent)":
If you can provide a working LibreOffice or OpenOffice installation, you might try using unoconv for the doc-to-docx conversion, as it "is a command line tool to convert any document format that LibreOffice can import to any document format that LibreOffice can export."
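Calling unoconv from Python might look like the following sketch. It assumes the unoconv executable is on PATH and uses its -f option to select the target format; the function names are made up for illustration.

```python
import shutil
import subprocess


def unoconv_command(doc_path, target_format="docx"):
    """Build the unoconv command line that converts doc_path."""
    return ["unoconv", "-f", target_format, str(doc_path)]


def convert_doc_to_docx(doc_path):
    """Run unoconv on doc_path; raises if the tool is missing or fails."""
    if shutil.which("unoconv") is None:
        raise RuntimeError("unoconv is not installed or not on PATH")
    subprocess.run(unoconv_command(doc_path), check=True)
```

After a successful run, the converted file can be opened with docx.Document as in the question.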
On Ubuntu, you can install antiword with this command:
apt-get install antiword
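Once antiword is installed, it prints the text of a .doc file to standard output, so Python can capture it through subprocess. A minimal sketch, assuming the antiword binary is on PATH; read_doc_text and its runner parameter are hypothetical names, and runner exists only to make the function easy to try without the binary.

```python
import subprocess


def read_doc_text(doc_path, runner=subprocess.run):
    """Return the plain text that antiword extracts from a .doc file."""
    result = runner(
        ["antiword", str(doc_path)],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on failure
    )
    return result.stdout
```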

Docx to pdf using pandoc in python

I am quite new to Python, so this may be a silly question, but I can't seem to find the solution anywhere.
I have a Django site that I am running locally on my machine for development.
On the site I want to convert a docx file to PDF, and I want to use pandoc to do this. I know there are other methods, such as online APIs or Python modules such as "docx2pdf". However, I want to use pandoc for deployment reasons.
I installed pandoc from my terminal using brew install pandoc,
so it should be installed correctly.
In my Django project I am doing:
import pypandoc
import docx
from django.http import FileResponse
def making_a_doc_function(request):
    doc = docx.Document()
    doc.add_heading("MY DOCUMENT")
    doc.save('thisisdoc.docx')
    pypandoc.convert_file('thisisdoc.docx', 'docx', outputfile="thisisdoc.pdf")
    pdf = open('thisisdoc.pdf', 'rb')
    response = FileResponse(pdf)
    return response
The docx file gets created with no problem, but no PDF is created. I get an error that says:
Pandoc died with exitcode "4" during conversion: b'cannot produce pdf output from docx\n'
Does anyone have any ideas?
The second argument to convert_file is the output format or, in this case, the format through which pandoc generates the PDF. Pandoc doesn't know how to produce a PDF through docx, hence the error.
Use pypandoc.convert_file('thisisdoc.docx', 'latex', outputfile="thisisdoc.pdf") or pypandoc.convert_file('thisisdoc.docx', 'pdf', outputfile="thisisdoc.pdf") instead.
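Wrapped in a small helper, the corrected call might look like this sketch. The name docx_to_pdf is hypothetical. Note that pandoc also needs a PDF engine (by default a LaTeX installation) to write PDF output, which this sketch assumes is present.

```python
def docx_to_pdf(docx_path, pdf_path):
    """Convert docx_path to pdf_path with pandoc, through pypandoc."""
    import pypandoc  # imported lazily; requires pypandoc and pandoc

    # 'pdf' is the target format; the output file name must end in .pdf.
    pypandoc.convert_file(docx_path, "pdf", outputfile=pdf_path)
    return pdf_path
```

In the view from the question, docx_to_pdf('thisisdoc.docx', 'thisisdoc.pdf') would replace the failing convert_file line, after doc.save(...) and before the PDF is opened.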

How do I open the Python NLTK downloader? [duplicate]

This question already has answers here:
How do I download NLTK data?
(15 answers)
Closed 3 years ago.
I'm trying to download all the NLTK packages:
import nltk
nltk.download()
But the GUI doesn't open.
This is what appears instead:
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
When I type d I can see the list of available packages, but I'm not offered the option to download them.
How can I open the GUI, or how can I download them all from the console? Thanks.
Try
nltk.download('all')
This downloads all the data, so there is no need to download each package individually.
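If the full collection is more than you need, nltk.download also accepts a single package identifier. The sketch below downloads a chosen list without opening the interactive menu; the function name is made up, and 'punkt' and 'stopwords' are only example identifiers.

```python
def download_nltk_packages(names):
    """Download each named NLTK package; return {name: success flag}."""
    import nltk  # imported lazily; requires `pip install nltk`

    # quiet=True suppresses the progress output of each download.
    return {name: nltk.download(name, quiet=True) for name in names}
```

For example, download_nltk_packages(['punkt', 'stopwords']). From a shell, python -m nltk.downloader all does the same as nltk.download('all').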

I installed a pip package, how to know the module name to import? [duplicate]

This question already has answers here:
How to find "import name" of any package in Python?
(5 answers)
Closed 5 years ago.
I am writing a Python web page scraper.
A tutorial told me to use the BeautifulSoup package, so I installed it using pip.
Then, in my script, I tried import BeautifulSoup as bs, but I got an error saying there is no module named BeautifulSoup.
Is there a reliable way to get the module name of an installed package?
Try this: from bs4 import BeautifulSoup
Edit: This was already answered by @jonsharpe and @Vinícius Aguiar in the comments under the question.
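For the general question, the standard library can report which top-level modules an installed distribution provides. This is a sketch, assuming Python 3.10 or later, where importlib.metadata.packages_distributions() is available; the helper names normalize and import_names_for are made up for illustration.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a distribution name (case, '-', '_' and '.' are equivalent)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def import_names_for(dist_name, mapping=None):
    """Return the importable top-level names provided by a pip package.

    `mapping` maps import names to distribution names; by default it is
    read from the current environment (requires Python 3.10+).
    """
    if mapping is None:
        mapping = metadata.packages_distributions()
    wanted = normalize(dist_name)
    return sorted(
        module
        for module, dists in mapping.items()
        if any(normalize(dist) == wanted for dist in dists)
    )
```

With beautifulsoup4 installed, import_names_for('beautifulsoup4') should include 'bs4', which matches the import shown above.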
