Extract text from a PDF file object in Python

Can we extract text from a PDF file object obtained from a request? For example:
f = request.FILES.get('file', None)
Can we extract the document's text from f, the same way we would read the contents of a text file object?

Try the textract library:
http://textract.readthedocs.io/en/latest/
It supports many formats, including PDF. Note that textract.process takes a file path rather than a file object:
import textract
text = textract.process("path/to/file.extension")
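Because textract.process expects a path, one way to handle an uploaded file object is to copy it to a temporary file first. The following is a minimal sketch under that assumption, not a definitive implementation; extract_text_from_upload and its extract parameter are hypothetical names, and the extractor is passed in so that textract.process (or any other path-based function) can be plugged in.

```python
import os
import shutil
import tempfile


def extract_text_from_upload(uploaded, extract, suffix=".pdf"):
    """Copy a binary file-like object to a temp file and run a path-based extractor on it.

    `uploaded` is anything with a .read() method (assumed to include the
    object returned by request.FILES.get('file')). `extract` is a callable
    that takes a file path. The suffix is kept because path-based
    extractors typically pick a parser from the file extension.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(uploaded, tmp)
        tmp_path = tmp.name
    try:
        return extract(tmp_path)
    finally:
        # Remove the temporary copy whether or not extraction succeeded.
        os.remove(tmp_path)
```

With textract installed, a call might look like extract_text_from_upload(f, textract.process), where f is the object from the question. textract.process returns bytes, so the result usually needs .decode('utf-8') before it is used as text.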

Related

Reading nothing from a pdf file using PyPDF2

I want to convert a PDF file to JSON format, so I was using the PyPDF2 module to read the PDF. But I am unable to read it: it gives me some "\n" characters but no text. The PDF I am using can be retrieved from here:
pdf_to_json.pdf
The code I am using is:
import PyPDF2
file = open("pdf_to_json.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file)
page_one = pdf.getPage(0)
page_one.extractText()
It's returning something like this:
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
DISCLAIMER: The PDF is in Spanish.

Reading from pdf file to text yields no results

So I'm trying something very simple: I just want to read text from a PDF file into a variable, that's it. This is what I'm getting:
Does anyone know a reliable way to just read a PDF into a text file?
Try the pdfplumber library:
import pdfplumber
pdf_file = pdfplumber.open('anyfile.pdf')
page = pdf_file.pages[0]
text = page.extract_text()
print(text)
pdf_file.close()
I haven't used PyPDF2 before but pdfplumber seems to work well for me.
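The snippet above reads only the first page. A minimal sketch of reading every page could look like the following; join_page_texts and read_pdf_text are hypothetical helper names, and the `or ""` guard is an assumption to cover pages where extract_text() finds no text and returns None.

```python
def join_page_texts(page_texts):
    """Join per-page strings with newlines, treating None (no text found) as empty."""
    return "\n".join(text or "" for text in page_texts)


def read_pdf_text(path):
    """Return the text of all pages of the PDF at `path` as one string."""
    # Imported here so join_page_texts stays usable without pdfplumber installed.
    import pdfplumber

    # The context manager closes the file even if extraction raises.
    with pdfplumber.open(path) as pdf:
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

If the result is empty or only whitespace, the PDF may contain scanned images rather than a text layer, in which case a text extractor alone will not help.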

save text to a docx file python

I have some structured text in Python and want to save it to a docx file.
Something like:
text = "A simple text\n Structured"
with open('docx_file.docx', 'w') as f:
    f.write(text)
Check the python-docx package:
from docx import Document
document = Document()
document.add_heading('A simple text', level=1)
document.add_paragraph('some more text ... ')
document.save('docx_file.docx')
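Applied to the text variable from the question, one possible approach is to write each line as its own paragraph. This is a minimal sketch, assuming one paragraph per non-blank line is the desired structure; text_to_paragraphs and save_text_as_docx are hypothetical helper names.

```python
def text_to_paragraphs(text):
    """Split text into lines and drop blank ones; each line becomes one paragraph."""
    return [line for line in text.splitlines() if line.strip()]


def save_text_as_docx(text, path):
    """Write each non-blank line of `text` as a paragraph in a new docx file."""
    # Imported here so text_to_paragraphs stays usable without python-docx installed.
    from docx import Document

    document = Document()
    for line in text_to_paragraphs(text):
        document.add_paragraph(line)
    document.save(path)
```

For example, save_text_as_docx("A simple text\n Structured", "docx_file.docx") would produce a document with two paragraphs.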
A docx file is not a plain text file, so writing a string to it with open() will not produce a valid document. Unless you have a reason to avoid libraries, I recommend https://grokonez.com/python/how-to-read-write-word-docx-files-in-python-docx-module.
Unless you need a "fancy" format like docx, I would recommend just writing plain text to a txt file.

How to get XML from DOC (not DOCX)?

For a DOCX document I do:
import zipfile
from bs4 import BeautifulSoup
document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')
How can I do this for a DOC document?
You don't.
DOCX files are tough enough to process, even though they're XML-based and documented by international standards organizations. DOC files are binary and proprietary.
Don't try to process DOC files directly. Convert them to DOCX first.
See:
Convert .doc to .docx using C#
Automation: how to automate transforming .doc to .docx?
multiple .doc to .docx file conversion using python
Python & MS Word: Convert .doc to .docx?
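As one illustration of the "convert first" advice, the sketch below shells out to LibreOffice in headless mode. It assumes LibreOffice is installed and that its soffice executable is on the PATH; this is one of several possible converters, not necessarily the method the linked answers use. build_convert_command and convert_doc_to_docx are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to docx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert a .doc file to .docx and return the expected output path."""
    # check=True raises CalledProcessError if LibreOffice exits with an error.
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The converted file can then be opened with zipfile as in the question and word/document.xml read from it.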

How to copy paragraph of docx to another docx and retain all styles

I am reading sections/paragraphs from an input docx file and copying their content into another docx file at a particular section. The content has images, tables and bullet points in between the text. However, I'm getting only the text, not the images, tables and bullet points that appear in between.
The Tika module is able to read the whole content, but the whole docx comes back as a single string, so I'm unable to iterate over the sections and I'm also unable to edit the output docx file (to paste the content into it).
I tried python-docx, but it reads only the text content and does not identify the images and tables that sit between the text inside a section. python-docx identifies all the images and tables present in the whole document, not those belonging to a particular paragraph.
I tried unzipping the Word file to XML, but the XML keeps the images in a separate folder. Also, the code does not identify the bullets.
def tika_extract_data(input_file, output_file):
    import tika, collections
    from tika import unpack
    parsed = collections.OrderedDict()
    parsed = unpack.from_file(input_file)
    with open(output_file, 'w') as f:
        for line in parsed:
            if line == 'content':
                lines = parsed[line]
                # print(lines)
                for indx, j in enumerate(lines.split("\\n")):
                    print(j)
I expected the output file to have all of its sections replaced with the copied input section content (images, tables, SmartArt and formatting).
Instead, the output file has only the text data.
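To see where the tables, images and bullets sit relative to the text, one option is to walk the top-level elements of word/document.xml in document order. The following is a minimal inspection sketch using only the standard library; it does not copy content between files. It assumes the usual WordprocessingML layout, in which list items carry a w:numPr element in their paragraph properties, inline images appear as w:drawing elements, and image files are stored under word/media/. iter_body_blocks and list_media are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace prefix used by WordprocessingML elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def iter_body_blocks(docx_path):
    """Yield (kind, text, has_drawing, is_list_item) per top-level block, in order."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    for child in body:
        # Concatenate every text run nested inside this block.
        text = "".join(t.text or "" for t in child.iter(W + "t"))
        if child.tag == W + "p":
            has_drawing = child.find(".//" + W + "drawing") is not None
            is_list_item = child.find(W + "pPr/" + W + "numPr") is not None
            yield ("paragraph", text, has_drawing, is_list_item)
        elif child.tag == W + "tbl":
            yield ("table", text, False, False)
        # Other elements (for example section properties) are skipped.


def list_media(docx_path):
    """Return the archive members stored under word/media/ (typically images)."""
    with zipfile.ZipFile(docx_path) as archive:
        return [name for name in archive.namelist() if name.startswith("word/media/")]
```

Because paragraphs and tables come back in the order they appear in the body, the blocks belonging to one section can be grouped by watching for the heading paragraph that starts the next section.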
