Read all types of files in Python

I am trying to extract information from different types of files in Python (.pdf, .doc, .docx) and convert them to .txt, but while processing different files I get spaces and newlines where they are not wanted, among many other issues. I have tried PyPDF2 and PDF Manager. Please suggest something I can use to extract information from these files.
EDIT
Currently I am looking for something that can help me extract the exact text from .pdf files. I have tried PyPDF, PDFMiner and PDF Manager, and I am getting issues with some PDFs in all of them.

Personally, I think pdfminer is the best Python module for extracting information from PDFs. Get it here.
I think you can refer to this link for the corresponding file formats.
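
As a minimal sketch of what that can look like with the pdfminer.six fork (the entry point differs in older pdfminer releases, and the file names here are placeholders):

from pdfminer.high_level import extract_text

# Extract all text from the PDF in one call; pdfminer's layout analysis
# usually reduces the stray spaces and line breaks other tools produce.
text = extract_text("input.pdf")

with open("output.txt", "w", encoding="utf-8") as out:
    out.write(text)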

Related

How to extract contents from a WMI file in Python?

I have WMI files on my hard drive, which I can open using some supporting tools (LogReader). These WMI files contain some report information as a text module. I want to convert these files into .txt, .xml, or some other data type that is suitable for use in Python, so I can use the information for further tasks. I tried to extract the contents of the WMI files in Python but couldn't accomplish it.
Is there any way to solve this problem?

How to convert XML Word Documents to DOCX?

I have been given a series of folders containing large numbers of Word documents in .xml format. They each contain some VBA code, but the code in all of them has already been run, so I don't need to keep it.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple Python script using python-docx:
import os
from docx import Document

folderPath = <folderpath>  # path to the folder of .xml files
fileNamesList = os.listdir(folderPath)

for xmlFileName in fileNamesList:
    # Open each file, then re-save it under a .docx name
    currentDoc = Document(os.path.join(folderPath, xmlFileName))
    docxFileName = xmlFileName.replace('.xml', '.docx')
    currentDoc.save(os.path.join(folderPath, docxFileName))
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or with using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there a more suitable package or technique I should be using?
It's not clear just what sort of files those .xml files are, but I suspect they are the transitional format used, I think, in Word 2003, which was XML-based but not the Open Packaging Convention (OPC) format used in Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-docx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script that prints them without the conversion step might be a good option. python-docx doesn't print .docx files; it only reads and writes them.
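
If you'd rather stay in Python than write VBA, the same one-time conversion can be sketched with Word COM automation via pywin32. This is only an outline: it assumes Word is installed, that Word can actually open these .xml files, and the folder path is a placeholder:

import os
import win32com.client  # pip install pywin32

WD_FORMAT_DOCX = 16  # wdFormatDocumentDefault, i.e. .docx

folder = r"C:\path\to\xml\files"  # placeholder

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    for name in os.listdir(folder):
        if not name.lower().endswith(".xml"):
            continue
        # Let Word open the transitional-XML file and re-save it as .docx
        doc = word.Documents.Open(os.path.join(folder, name))
        doc.SaveAs2(os.path.join(folder, name[:-4] + ".docx"),
                    FileFormat=WD_FORMAT_DOCX)
        doc.Close(False)  # close without re-saving
finally:
    word.Quit()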

How to retrieve highlighted sequences of an .epub file in Python?

I'm used to highlighting the important parts of the .epub books I read on my Kobo reader, and I'd like to write a script that extracts these highlighted parts and saves them in a .txt file.
I've checked out the epub library documentation, but I couldn't find anything relevant to my problem.
Can anyone give me some tips on how to select and save the highlighted parts of my epub files? Is there a specific tag I should search in the file?
The highlights and the annotations are stored in the KoboReader.sqlite file, not in each EPUB file.
You might want to check out my script:
http://github.com/pettarin/export-kobo
or search for Calibre plugins (there are several that can export highlights/annotations).
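
As a rough illustration of where that data lives, here is a stdlib-only sketch. The Bookmark table and the column names below are assumptions based on common Kobo firmware layouts and may differ on your device:

import sqlite3

# KoboReader.sqlite sits in the hidden .kobo folder on the device.
conn = sqlite3.connect("KoboReader.sqlite")

# Assumed schema: each highlight/annotation is a row in the Bookmark table.
rows = conn.execute(
    "SELECT VolumeID, Text, Annotation FROM Bookmark WHERE Text IS NOT NULL"
)

with open("highlights.txt", "w", encoding="utf-8") as out:
    for volume_id, text, annotation in rows:
        out.write(f"{volume_id}\n{text.strip()}\n")
        if annotation:
            out.write(f"Note: {annotation}\n")
        out.write("\n")

conn.close()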

Processing a PDF for information extraction

I am working on a project where I have a PDF file that describes a health policy. What I need to do is extract the information from this PDF and save it in some form such that I can answer questions related to the policy by extracting information from it.
This PDF is quite large, so I want to divide it according to its sections, so that when a query related to some particular area comes in, I won't have to go through the entire document.
I tried solving this with some PDF converters that turn PDFs into HTML, but they don't convert the PDF properly, so headings don't end up in heading tags. Also, even if I convert it properly and get the sections out of the document, I don't know how to store this data (I mean, in which form should I store it).
Is there any other solution with which I can achieve this? I am using Python, and I can also use NLTK if needed. Also, the format of the PDFs is not fixed; my code should work on any kind of PDF.
PDFMiner is great in that it has a location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if your docs have a consistent PDF structure you might be able to get something working.
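
For example, a short sketch using pdfminer.six's layout analysis, which exposes a bounding box for every text block (the file name is a placeholder, and any heading heuristic built on the positions would be yours to design):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def iter_text_blocks(path):
    """Yield (page_number, bounding_box, text) for each text block."""
    for page_no, page in enumerate(extract_pages(path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                yield page_no, element.bbox, element.get_text().strip()

for page_no, bbox, text in iter_text_blocks("policy.pdf"):
    print(page_no, bbox, text[:60])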

How can I extract the tables, text and the pictures in ODT (OpenDocument Text) format using Python?

How can I extract the tables, text and pictures in an ODT (OpenDocument Text) file and output them to another ODT file using Python on Ubuntu?
OOoPy seems to be a good fit. I've never used it, but it comes with documentation and code examples, and it can read and write ODT files.
An easy way is to just rename foo.odt to foo.zip and then extract it. The extracted directory contains many files, including a Pictures directory.
However, I think it's better to change its type to .docx and then do the extraction on the .docx, because that way the images are extracted with better names (image1, image2, ...).
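
Since an ODT file is just a ZIP archive, a small stdlib-only sketch can pull out the images and the document XML without renaming anything (the file and folder names here are placeholders):

import zipfile

with zipfile.ZipFile("document.odt") as odt:
    # Text, tables, and document structure all live in content.xml.
    content_xml = odt.read("content.xml").decode("utf-8")

    # Embedded images sit under the Pictures/ directory inside the archive.
    for name in odt.namelist():
        if name.startswith("Pictures/"):
            odt.extract(name, "extracted")

print(content_xml[:200])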
