For a DOCX document I do:
import zipfile
from bs4 import BeautifulSoup
document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')
How do I do this for a DOC document?
You don't.
DOCX files are tough enough to process, even though they're XML-based and documented by international standards organizations. DOC files are binary and proprietary.
Don't try to process DOC files directly. Convert them to DOCX first.
See:
Convert .doc to .docx using C#
Automation: how to automate transforming .doc to .docx?
multiple .doc to .docx file conversion using python
Python & MS Word: Convert .doc to .docx?
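As a rough sketch of the "convert first" advice, the snippet below shells out to LibreOffice in headless mode. It assumes LibreOffice is installed and that its `soffice` executable is on PATH; the function names `build_convert_command` and `convert_doc_to_docx` are made up for this example and are not taken from the linked answers.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts doc_path to DOCX."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert a .doc file and return the expected path of the new .docx."""
    # Assumption: the LibreOffice launcher is called "soffice" on this system.
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

Once the conversion has run, the resulting .docx can be opened with the same zipfile approach used in the question.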
Below is a simple example that writes an XML file and should read it back. The writing works, but I am not sure how to read the file back. Below is some sample code. How do I get these values from the XML file?
file1 = 'result1.xml'
fs = cv2.FileStorage(file1, cv2.FILE_STORAGE_WRITE)
fs.write('var1', 1)
fs.write('var2', 2)
fs = cv2.FileStorage(file1,cv2.FILE_STORAGE_READ)
fn = fs.real
Python's standard library includes modules for parsing XML data.
Here is where you can find the documentation: XML Library
Be careful when using it: as that page itself warns, these modules are not secure against maliciously constructed XML data.
Here is another useful website: How to parse XML files using Python?
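A minimal sketch of reading the values back with the standard library's xml.etree.ElementTree. The XML layout in `SAMPLE` is an assumption about what result1.xml looks like (a root element with one child per variable), so check the real file first; `read_values` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed layout of result1.xml; inspect the real file before relying on it.
SAMPLE = """<?xml version="1.0"?>
<opencv_storage>
<var1>1</var1>
<var2>2</var2>
</opencv_storage>
"""


def read_values(xml_text):
    """Return a dict mapping each child tag of the root to its integer value."""
    root = ET.fromstring(xml_text)
    return {child.tag: int(child.text) for child in root}


values = read_values(SAMPLE)
print(values)
```

For a file on disk, `ET.parse('result1.xml').getroot()` gives the same root element. Only do this with files you trust, given the security warning in the documentation.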
I'm using comtypes with Python 3.6 and trying to read Office documents, as I need to extract the text from these files.
I understand that for Word and PowerPoint, this is how to open files using comtypes:
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(filename)
ppt = comtypes.client.CreateObject('PowerPoint.Application')
prs = ppt.Presentations.Open(filename)
How about Outlook files (.msg)? I tried the following code, but it doesn't work:
ol = comtypes.client.CreateObject('Outlook.Application')
msg = ol.MailItem.Open(filename)
I ended up using the approach from this thread instead of what I was testing in my question.
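For reference, one way to open a .msg file through Outlook's COM object model is the MAPI namespace's OpenSharedItem method. This is only a sketch and not necessarily the approach from the thread mentioned above; it assumes Windows with Outlook installed, and `read_msg_text` and `main` are hypothetical names.

```python
def read_msg_text(outlook_app, filename):
    """Return (subject, body) of a .msg file opened through Outlook's COM API.

    outlook_app is an Outlook.Application COM object; filename should be an
    absolute path to the .msg file.
    """
    namespace = outlook_app.GetNamespace("MAPI")
    item = namespace.OpenSharedItem(filename)
    return item.Subject, item.Body


def main(filename):
    # Imported here because comtypes is only usable on Windows.
    import comtypes.client

    ol = comtypes.client.CreateObject("Outlook.Application")
    subject, body = read_msg_text(ol, filename)
    print(subject)
    print(body)
```

Passing the application object into `read_msg_text` keeps the COM-specific setup in one place, so the text-extraction step can be exercised with a stand-in object.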
I have a question about splitting PDF files. I have a collection of PDF files that I want to split by paragraph, so that each paragraph of a PDF file becomes a file of its own. I would appreciate any help with this, preferably in Python, but if that is not possible, any language will do.
You can use pdftotext for this and wrap it in a Python subprocess call. Alternatively, you could use a library that already does this for you, such as textract. Here is a quick example. Note: I have used a run of four or more whitespace characters as the delimiter to split the text into a list of paragraphs; you might want to use a different technique.
import re
import textract
# Read the content of the PDF as text; textract returns bytes, so decode them.
text = textract.process('file_name.pdf').decode('utf-8')
# Use runs of four or more whitespace characters as the paragraph delimiter.
print(re.split(r'\s{4,}', text))
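To get from a list of paragraphs to one file per paragraph, as the question asks, something like the following could work. It is a sketch that assumes the extracted text separates paragraphs with blank lines, which may not hold for every PDF; `split_paragraphs`, `write_paragraphs`, and the file naming scheme are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def write_paragraphs(text, out_dir, stem="paragraph"):
    """Write each paragraph to its own numbered .txt file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, paragraph in enumerate(split_paragraphs(text), start=1):
        path = out_dir / f"{stem}_{index:03d}.txt"
        path.write_text(paragraph, encoding="utf-8")
        paths.append(path)
    return paths


# Stand-in for text extracted from a PDF.
sample = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\n\nThird."
with tempfile.TemporaryDirectory() as tmp:
    written = write_paragraphs(sample, tmp)
    names = [p.name for p in written]
    print(names)
```

Swap the regular expression in `split_paragraphs` for whatever delimiter matches your extracted text.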
Does anyone know a Python library to read docx files?
I have a Word document that I am trying to read data from.
There are a couple of packages that let you do this.
Check
python-docx.
docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx.
From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip files with a changed extension, this shows how to access the contents.
This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files.
In that case, you would likely have to convert .doc to .docx first. antiword, which extracts plain text from .doc files, is another option.
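To illustrate the "a .docx is a .zip" point, here is a small sketch that reads word/document.xml with only the standard library. It builds a tiny stand-in archive in memory so it runs without a real file; `docx_paragraphs` is a hypothetical helper, and real documents contain much more markup (tables, headers, footnotes) that this ignores.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(file_or_path):
    """Return the text of each <w:p> paragraph in a .docx file."""
    with zipfile.ZipFile(file_or_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


# Build a minimal stand-in archive in memory (not a complete, valid .docx).
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)
buffer.seek(0)

print(docx_paragraphs(buffer))
```

With a real file, pass its path instead of the in-memory buffer, for example `docx_paragraphs('file.docx')`.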
python-docx can read as well as write.
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to "Automate the Boring Stuff with Python" by Al Sweigart for the pointer.
See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library, available on PyPI. Then you can use the following:
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
A quick search of PyPI turns up the docx package.
import docx

def main():
    try:
        doc = docx.Document('test.docx')  # Creating word reader object.
        data = ""
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()
And don't forget to install python-docx first (pip install python-docx).
How can I search for a word or a line in a PDF file?
Is there an existing module that does this concisely?
Thank you in advance,
There's a library called pyPdf.
It is a pure-Python library built as a PDF toolkit.
You can extract text (using the extractText() method) and then search the PDF with something like the following code.
import pyPdf
pdf = pyPdf.PdfFileReader(open(path, "rb"))
# Pages are zero-indexed, so getPage(1) returns the second page.
content = pdf.getPage(1).extractText()
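Once the text of each page has been extracted, the search itself is plain string work. The sketch below assumes you have already collected one string per page (for example by calling extractText() on every page); `find_word` is a hypothetical helper, and because extracted PDF text does not always keep its line breaks, a "line" here may end up being a whole page.

```python
import re


def find_word(page_texts, word):
    """Return (page_number, line) pairs for lines that contain word.

    page_texts is a list of strings, one per page. Matching is
    case-insensitive and only matches whole words.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if pattern.search(line):
                hits.append((page_number, line.strip()))
    return hits


# Stand-in for text extracted from a two-page PDF.
pages = ["Intro text\nNothing here", "The Invoice total\nis listed below"]
print(find_word(pages, "invoice"))
```

Page numbers in the result are one-based, unlike the zero-based index passed to getPage().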