Save text to a docx file in Python

I have some structured text in Python and I want to save it to a docx file.
Something like:
text = "A simple text\n Structured"
with open('docx_file.docx', 'w') as f:
    f.write(text)

Check python-docx package:
from docx import Document
document = Document()
document.add_heading('A simple text', level=1)
document.add_paragraph('some more text ... ')
document.save('docx_file.docx')
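To tie this back to the string in the question, here is a minimal sketch that splits the text on newlines and adds each non-empty line as its own paragraph. The helper names split_lines and save_text_as_docx are made up for illustration; only Document, add_paragraph and save come from python-docx.

```python
def split_lines(text):
    """Split text on newlines, trimming whitespace and dropping empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def save_text_as_docx(text, path):
    """Write each non-empty line of `text` as a paragraph in a .docx file."""
    # Imported here so split_lines stays usable without python-docx installed.
    from docx import Document

    document = Document()
    for line in split_lines(text):
        document.add_paragraph(line)
    document.save(path)


# Example call (requires `pip install python-docx`):
# save_text_as_docx("A simple text\n Structured", 'docx_file.docx')
```

Whether to strip the leading whitespace of each line is a design choice; drop the strip() calls if the indentation is part of the structure you want to keep.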

A docx file is not a plain text file, so writing a string to it with open() will not produce a valid document. Unless you want to avoid using a library for this, I recommend following https://grokonez.com/python/how-to-read-write-word-docx-files-in-python-docx-module.
Unless you need to use a "fancy" format like docx, I would recommend just writing plain text to a txt file.

Related

Reading from pdf file to text yields no results

So I'm trying something very simple: I just want to read text from a pdf file into a variable, that's it. This is what I'm getting:
Does anyone know a reliable way to read a pdf into a text file?
Try the following library - pdfplumber:
import pdfplumber
pdf_file = pdfplumber.open('anyfile.pdf')
page = pdf_file.pages[0]
text = page.extract_text()
print(text)
pdf_file.close()
I haven't used PyPDF2 before but pdfplumber seems to work well for me.

How to copy certain strings from txt file to Word doc using Python?

I want to copy a word or string from a txt file into a specific cell of a table in a Word file.
Can someone guide me on how to do it?
Best Regards,
Usman
If your question is how to write a Word (.docx) file, there is a library called python-docx (imported as docx). Install it using pip:
pip install python-docx
Here is a short example that writes a docx file for you.
from docx import Document
document = Document()
document.add_heading('Document Title', 0)
p = document.add_paragraph('A plain paragraph having some text')
document.save('demo.docx')
Here is a script that will read a text file and add the lines that match a condition to a Word file. Change the matches_my_condition function to suit your own needs.
from docx import Document
def matches_my_condition(line):
    """Return True if the given line should be added to the document."""
    # True if the word 'cake' appears in the line
    return 'cake' in line
# Prepare document
document = Document()
with open('my_text_file.txt', 'r') as textfile:
    for line in textfile:
        if matches_my_condition(line):
            # Strip the trailing newline so it does not end up inside the paragraph
            document.add_paragraph(line.rstrip('\n'))
document.save('my_cake_file.docx')
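The question also asks about placing the text in a table, which the script above does not cover. As a rough sketch, python-docx tables can be filled by assigning to the text attribute of a cell. The function names, the two-column layout and the header labels below are assumptions made for illustration.

```python
def find_matching_lines(lines, keyword):
    """Return the lines that contain `keyword`, without trailing newlines."""
    return [line.rstrip('\n') for line in lines if keyword in line]


def write_lines_to_table(lines, path):
    """Create a .docx file holding a two-column table with one row per line."""
    # Imported here so find_matching_lines works without python-docx installed.
    from docx import Document

    document = Document()
    table = document.add_table(rows=1, cols=2)
    header = table.rows[0].cells
    header[0].text = 'No.'
    header[1].text = 'Text'
    for number, line in enumerate(lines, start=1):
        cells = table.add_row().cells
        cells[0].text = str(number)
        cells[1].text = line
    document.save(path)


# Example call (requires `pip install python-docx`):
# with open('my_text_file.txt') as textfile:
#     write_lines_to_table(find_matching_lines(textfile, 'cake'), 'my_table.docx')
```

To fill a cell of a table that already exists in a document, open the file with Document(path) and assign to document.tables[0].cell(row, column).text, where row and column are zero-based indexes.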

Splitting PDF files into Paragraphs

I have a question regarding the splitting of pdf files. I have a collection of pdf files that I want to split by paragraph, so that each paragraph of a pdf file becomes a file of its own. I would appreciate any help with this, preferably in Python, but if that is not possible any language will do.
You can use pdftotext for this and wrap it in a Python subprocess call. Alternatively, you could use another library that already does this for you, such as textract. Here is a quick example. Note: I used a run of four or more whitespace characters as the delimiter to split the text into a list of paragraphs; you might want to use a different technique.
import re
import textract
# Read the content of the pdf as text (textract returns bytes, so decode it)
text = textract.process('file_name.pdf').decode('utf-8')
# Use four or more whitespace characters as the paragraph delimiter to split the text into a list of paragraphs
print(re.split(r'\s{4,}', text))
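Since the question asks for each paragraph to end up in its own file, here is a small standard-library sketch of that last step. It assumes the text has already been extracted from the pdf and that paragraphs are separated by blank lines, which is a different delimiter from the one above and will not hold for every pdf. The function names and the paragraph_NNN.txt naming scheme are illustrative.

```python
import re
from pathlib import Path


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines."""
    parts = re.split(r'\n\s*\n', text)
    return [part.strip() for part in parts if part.strip()]


def write_paragraph_files(text, out_dir):
    """Write each paragraph to its own numbered .txt file and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, paragraph in enumerate(split_paragraphs(text), start=1):
        target = out_path / f'paragraph_{index:03d}.txt'
        target.write_text(paragraph, encoding='utf-8')
        paths.append(target)
    return paths
```

Swapping the regular expression in split_paragraphs is enough to try another delimiter, such as the four-space rule used above.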

Read Docx files via python

Does anyone know a python library to read docx files?
I have a word document that I am trying to read data from.
There are a couple of packages that let you do this. Check:
python-docx.
docx2txt (note that it does not seem to work with .doc). It reportedly extracts more information than python-docx.
From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip archives with a different extension, their contents can be accessed with ordinary zip tools.
This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files.
In that case, you would likely have to convert doc -> docx first. antiword is an option.
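The point about .docx being a zip archive can be sketched with the standard library alone. The example below assumes the main body lives in word/document.xml inside the archive and collects the text of the w:t elements under each w:p paragraph element. The function name is illustrative, and the sketch ignores tabs, line breaks, headers and footers, so treat it as a demonstration of the file layout and not as a replacement for the packages above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file, using only the stdlib."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read('word/document.xml')
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # A paragraph's text is spread over one or more <w:t> elements
        texts = [node.text or '' for node in para.iter(W_NS + 't')]
        paragraphs.append(''.join(texts))
    return paragraphs
```

Calling read_docx_paragraphs('file.docx') returns a list of strings, one per paragraph, including paragraphs that sit inside table cells.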
python-docx can read as well as write.
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to "How to Automate the Boring Stuff with Python" by Al Sweigart for the pointer.
See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library, available on PyPI. Then you can use the following:
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
A quick search of PyPI turns up the docx package.
import docx
def main():
    try:
        doc = docx.Document('test.docx')  # Create the Word reader object
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        # Join the paragraphs into a single newline-separated string
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return
if __name__ == '__main__':
    main()
And don't forget to install python-docx first (pip install python-docx).

read ms word with python

I'm trying to read an MS Word file with StringIO, but the output comes out as a strange string:
from docx import Document
import StringIO
import cStringIO
files = "D:/Workspace/Python scripting/test.docx"
document = Document(files)
f = cStringIO.StringIO()
document.save(f)
contents = f.getvalue()
print contents
Thanks for any help in advance
document.save(f) saves the file to a string, formatted as a .docx file. You're then reading that string, which will do exactly the same thing as f=open(files).read(). If you want the text in the document, you should use python-docx's API for that. I haven't used it before, but the documentation is here:
https://python-docx.readthedocs.org/en/latest/index.html
It looks like you could use something like this:
paragraphs=document.paragraphs
This is the list of Paragraph objects in the document. You can get the text of those paragraphs like this:
text="\n".join([paragraph.text for paragraph in paragraphs])
text will then contain the text of the document.
