How to read '.doc' file with python-docx module - python

I'm trying to read a .doc file with the python-docx module.
I'm doing:
import docx
path = 'Sample-doc-file-100kb.doc'
doc = docx.Document(path)
#extracting texts from doc
This works fine for .docx but gives ValueError: file 'Sample-doc-file-100kb.doc' is not a Word file, content type is 'application/vnd.openxmlformats-officedocument.themeManager+xml' error for .doc file.
I searched and found that the docx module doesn't work with the older .doc format. I also looked into converting .doc to .docx, but all the solutions I found are Windows-dependent.
I'm running this code on AWS Lambda, so I can't use those methods.
Is there any platform-independent way either to convert .doc to .docx or to read a .doc file directly?

convert doc to docx (platform independent)
If you are able to provide a working LibreOffice or OpenOffice installation, then you might try using unoconv to do the doc to docx conversion, as it
is a command line tool to convert any document format that LibreOffice
can import to any document format that LibreOffice can export.
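A minimal sketch of the same idea without unoconv, assuming LibreOffice is installed and its `soffice` executable is on the PATH (on AWS Lambda it would have to be supplied separately, for example in a layer or container image). The helper names `build_convert_command` and `convert_doc_to_docx` are hypothetical; `--headless`, `--convert-to` and `--outdir` are LibreOffice's own command-line switches.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, target="docx", soffice="soffice"):
    """Return the LibreOffice command line that converts src to the target format."""
    return [
        soffice,
        "--headless",             # run without opening a GUI
        "--convert-to", target,   # output format, e.g. "docx"
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_doc_to_docx(src, out_dir):
    """Convert a .doc file and return the path where the .docx is expected."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if LibreOffice exits with a non-zero status
    subprocess.run(build_convert_command(src, out_dir), check=True)
    return out_dir / (Path(src).stem + ".docx")
```

The returned path can then be passed to docx.Document() as in the question.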

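Alternatively, to extract the text of a .doc file directly, you can use antiword, which can be installed on Ubuntu with this command: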
apt-get install antiword
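antiword prints the plain text of a .doc file to standard output, so one way to use it from Python is to capture that output. A rough sketch, assuming the `antiword` binary is installed and on the PATH; `read_doc_text` is a hypothetical helper, and decoding the output as UTF-8 is an assumption that depends on how antiword is configured.

```python
import subprocess


def read_doc_text(path, antiword="antiword"):
    """Run antiword on a .doc file and return the text it prints."""
    result = subprocess.run(
        [antiword, str(path)],
        capture_output=True,  # collect stdout instead of printing it
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced, not fatal
    return result.stdout.decode("utf-8", errors="replace")
```

For the file in the question this would be called as read_doc_text('Sample-doc-file-100kb.doc').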

Related

Convert the docx file into pdf in python

I am working on a report generator. I installed the library with pip install python-docx and use
import docx.
I have created a new docx file and edited it, but I want to save it as a PDF instead of a docx file. The script will also be converted into an EXE file.
Please help.
(pip install python-docx)
from docx import Document
doc=Document()
doc.add_heading('Report', 0)
# Now to save file, I know to save in docx,
# But, I want to save in pdf
# I can not finish the program and then manually convert
# As this script will run as an
# **EXE**
doc.save('report.docx')
I tried saving with doc.save('report.pdf'), but it did not work.
I found something here: https://medium.com/analytics-vidhya/3-methods-to-convert-docx-files-into-pdf-files-using-python-b03bd6a56f45 I personally think the easiest way to do it is the docx2pdf module.
You can use the Python package docx2pdf*:
pip install docx2pdf
Then import and call the convert function:
from docx2pdf import convert
convert("report.docx", "report.pdf")
Call it after saving with doc.save('report.docx'); the docx file must exist before it can be converted.
*Unless you work on a Linux machine: docx2pdf requires Microsoft Word to be installed.
Try the msoffice2pdf library, which uses either Microsoft Office or LibreOffice installed in the environment.
https://pypi.org/project/msoffice2pdf/
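Putting the docx2pdf steps together, here is a small sketch that saves the document first and only then converts it. The function names `docx2pdf_supported` and `save_report_as_pdf` are hypothetical, and the platform check simply encodes the Microsoft Word requirement mentioned above; it does not verify that Word is actually installed.

```python
import platform


def docx2pdf_supported(system=None):
    """docx2pdf drives Microsoft Word, so it is limited to Windows and macOS."""
    system = system or platform.system()
    return system in ("Windows", "Darwin")


def save_report_as_pdf(doc, docx_path="report.docx", pdf_path="report.pdf"):
    """Save a python-docx Document, then convert the saved file to PDF."""
    if not docx2pdf_supported():
        raise RuntimeError("docx2pdf needs Microsoft Word (Windows or macOS)")
    from docx2pdf import convert  # imported lazily: optional dependency
    doc.save(docx_path)           # the .docx must exist before conversion
    convert(docx_path, pdf_path)
    return pdf_path
```

With the doc object from the question, the call would be save_report_as_pdf(doc).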

Converting (ideally) doc to pdf with python or docx to pdf, but I get error

I am working in ios and with spyder (anaconda) trying the following code in order to convert docx files which are in a directory (folder_path):
from docx2pdf import convert
import os

no_pdfs = []
i = 1
for filename in os.listdir(os.path.normcase(folder_path)):
    filename = os.path.join(folder_path, filename)
    try:
        convert(filename, os.path.splitext(filename)[0] + '.pdf')
        print(f"DONE - {i}: {os.path.basename(filename)}")
        i += 1
    except Exception:
        no_pdfs.append(os.path.basename(filename))
print(no_pdfs)
I use try - except in my code because there is the .DS_Store that appears with ios and nothing happens.
If I call convert() directly, without the try/except, I get the error: ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html but I am not really able to understand what goes wrong.
Another thing is that my initial files are not actually .docx but .doc, and I would really appreciate advice on how to convert doc to pdf, or doc to docx to pdf.
Any help will be much appreciated!
If you haven't resolved this yet, you could try installing Parallels Access from the Apple store on your ios device, but it sounds like you just have to update your packages (pip install --upgrade PackageName). The code you provided might be working; the error is being raised by the packages named in the message (jupyter and ipywidgets). Also, with docx2pdf, have you installed Word on your device? From the creator of docx2pdf: "Unfortunately, it requires Microsoft Office to be installed and thus only works on Windows and macOS. – Al Johri"
Also, "An efficient way to convert document to pdf format" is worth reading.
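Separately from the Word requirement, the .DS_Store problem from the question can be avoided by selecting only Word files up front instead of catching every exception. A minimal sketch; `find_word_files` and `WORD_SUFFIXES` are hypothetical names.

```python
from pathlib import Path

WORD_SUFFIXES = {".doc", ".docx"}


def find_word_files(folder_path):
    """Return the Word documents in folder_path, sorted by path."""
    return sorted(
        p for p in Path(folder_path).iterdir()
        if p.is_file()
        and p.suffix.lower() in WORD_SUFFIXES   # case-insensitive extension check
        and not p.name.startswith(".")          # skips .DS_Store and other hidden files
    )
```

The loop in the question could then iterate over find_word_files(folder_path), so that any remaining exception points at a real conversion problem.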

Docx to pdf using pandoc in python

I am quite new to Python, so this may be a silly question, but I can't seem to find the solution anywhere.
I have a Django site that I am running locally on my machine just for development.
On the site I want to convert a docx file to PDF, and I want to use pandoc to do this. I know there are other methods, such as online APIs or Python modules such as "docx2pdf". However, I want to use pandoc for deployment reasons.
I have installed pandoc from my terminal using brew install pandoc,
so it should be installed correctly.
In my Django project I am doing:
import docx
import pypandoc
from django.http import FileResponse

def making_a_doc_function(request):
    doc = docx.Document()
    doc.add_heading("MY DOCUMENT")
    doc.save('thisisdoc.docx')
    pypandoc.convert_file('thisisdoc.docx', 'docx', outputfile="thisisdoc.pdf")
    pdf = open('thisisdoc.pdf', 'rb')
    response = FileResponse(pdf)
    return response
The docx file gets created with no problem, but no PDF is created. I am getting an error that says:
Pandoc died with exitcode "4" during conversion: b'cannot produce pdf output from docx\n'
Does anyone have any ideas?
The second argument to convert_file is output format, or, in this case, the format through which pandoc generates the pdf. Pandoc doesn't know how to produce a PDF through docx, hence the error.
Use pypandoc.convert_file('thisisdoc.docx', 'latex', outputfile="thisisdoc.pdf") or pypandoc.convert_file('thisisdoc.docx', 'pdf', outputfile="thisisdoc.pdf") instead.
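As a sketch, the corrected call can be wrapped in a small helper. `docx_to_pdf` is a hypothetical name, and this assumes that both the pandoc binary and a LaTeX engine such as pdflatex are installed, since pandoc's default PDF route goes through LaTeX.

```python
def docx_to_pdf(docx_path, pdf_path):
    """Convert docx_path to pdf_path with pandoc, going through LaTeX."""
    import pypandoc  # imported lazily: third-party dependency

    # "latex" names the intermediate format; the .pdf output file is what
    # makes pandoc build a PDF from it
    pypandoc.convert_file(docx_path, "latex", outputfile=pdf_path)
    return pdf_path
```

In the view from the question, docx_to_pdf('thisisdoc.docx', 'thisisdoc.pdf') would replace the failing convert_file line, and the returned path can be opened in 'rb' mode for FileResponse.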

How can I read pdf in python? [duplicate]

This question already has answers here:
How to extract text from a PDF file?
How can I read a PDF in Python?
I know one way is to convert it to text, but I want to read the content directly from the PDF.
Can anyone explain which Python module is best for PDF extraction?
You can use the PyPDF2 package.
# install PyPDF2
pip install PyPDF2
Once you have it installed:
# importing all the required modules
import PyPDF2
# creating a pdf reader object
reader = PyPDF2.PdfReader('example.pdf')
# print the number of pages in pdf file
print(len(reader.pages))
# print the text of the first page
print(reader.pages[0].extract_text())
Follow the documentation.
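To read the whole document rather than only the first page, the same reader object can be looped over. A minimal sketch; `extract_all_text` is a hypothetical helper that works on any object exposing `pages` whose items have `extract_text()`, which is the interface used in the snippet above.

```python
def extract_all_text(reader):
    """Join the text of every page of a PdfReader-like object."""
    # extract_text() can come back empty for scanned or image-only pages,
    # so fall back to an empty string for those
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

With PyPDF2 installed, this would be called as extract_all_text(PyPDF2.PdfReader('example.pdf')).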
You can use the textract module in Python.
To install it:
pip install textract
To read a PDF:
import textract
# process() returns the extracted text as bytes
text = textract.process('path/to/pdf/file', method='pdfminer')
For details, see the Textract documentation.

How to open / read password protected xls or xlsx (Excel) file using python in Linux?

I want to open password protected xls or xlsx using python. Generally I use xlrd to process xls or xlsx files but it cannot open password protected excel files. I tried to use pywin32 but I was not able to install it on my Linux system.
You can do that using a method outlined in my answer to a similar question.
It does not require Excel to be installed and, because it's pure Python, it's cross-platform too!
msoffcrypto-tool supports password-protected (encrypted) Microsoft Office documents, including the older XLS binary file format.
Install msoffcrypto-tool:
pip install msoffcrypto-tool
You could create an unencrypted version of the workbook from the command line:
msoffcrypto-tool Myfile.xlsx Myfile-decrypted.xlsx -p "caa team"
Or, you could use msoffcrypto-tool as a library. While you could write an unencrypted version to disk like above, you may prefer to create a decrypted in-memory file and pass this to your Python Excel library (openpyxl, xlrd, etc.).
import io
import msoffcrypto
import openpyxl

decrypted_workbook = io.BytesIO()
with open('Myfile.xlsx', 'rb') as file:
    office_file = msoffcrypto.OfficeFile(file)
    office_file.load_key(password='caa team')
    office_file.decrypt(decrypted_workbook)

# `filename` can also be a file-like object.
workbook = openpyxl.load_workbook(filename=decrypted_workbook)
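The decryption step can also be factored into a reusable helper so that the same buffer serves either Excel library. This is only a sketch under the assumptions of the snippet above; `decrypt_to_memory` is a hypothetical name.

```python
import io


def decrypt_to_memory(path, password):
    """Return a BytesIO holding the decrypted copy of an encrypted Office file."""
    import msoffcrypto  # imported lazily: third-party dependency

    decrypted = io.BytesIO()
    with open(path, "rb") as encrypted_file:
        office_file = msoffcrypto.OfficeFile(encrypted_file)
        office_file.load_key(password=password)
        office_file.decrypt(decrypted)
    decrypted.seek(0)  # rewind so a reader starts at the first byte
    return decrypted
```

For an .xlsx file the buffer can go to openpyxl.load_workbook(filename=buffer); for the older .xls format, which openpyxl does not read, xlrd.open_workbook(file_contents=buffer.getvalue()) should work instead.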
