I have a test for a job application in which I need to read some .doc files. Does anyone know a library that can do this? I started with plain Python code:
f = open('test.doc', 'r')
f.read()
but this does not return a readable string. I need to convert it to UTF-8.
Edit: I just want to get the text from this file.
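One reason plain open() and read() give unreadable output is that a legacy .doc file is a binary container, not text. A minimal sketch of checking what a file really is from its first bytes is shown below; the function name sniff_word_format and the returned labels are made up for illustration, and the check is only a heuristic.

```python
# Leading bytes ("magic numbers") of formats that often carry a .doc extension
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary Word (.doc)
ZIP_MAGIC = b"PK\x03\x04"                        # zip container used by .docx


def sniff_word_format(data: bytes) -> str:
    """Guess the real format of a file from its first bytes (heuristic only)."""
    head = data[:512].lstrip()
    if data.startswith(OLE2_MAGIC):
        return "doc"
    if data.startswith(ZIP_MAGIC):
        return "docx-or-other-zip"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head[:100].lower().startswith((b"<html", b"<!doctype", b"<?xml")):
        return "html-or-xml"
    return "unknown"


# Usage (hypothetical file name):
#   with open("test.doc", "rb") as f:
#       print(sniff_word_format(f.read(512)))
```

Knowing the real format first helps you pick a reader that matches it, since the file extension alone can be misleading.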
You can use the textract library.
It takes care of both .doc and .docx files:
import textract
text = textract.process("path/to/file.extension")
You can also use antiword directly (sudo apt-get install antiword). It prints the text of a .doc file to standard output, so you can redirect it into a plain text file:
antiword filename.doc > filename.txt
Ultimately, textract uses antiword in the backend for .doc files.
You can use the python-docx2txt library to read text from Microsoft Word documents. It is an improvement over the python-docx library as it can, in addition, extract text from links, headers and footers. It can even extract images.
You can install it by running: pip install docx2txt.
Let's read a Microsoft Word document:
import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)
EDIT:
This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.
I was trying to do the same. I found lots of information on reading .docx but much less on .doc. Anyway, I managed to read the text using the following:
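import os
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
# Word resolves relative paths against its own working directory, so pass an absolute path
doc = word.Documents.Open(os.path.abspath("myfile.doc"))
print(doc.Range().Text)
# Close the document without saving and shut Word down
doc.Close(False)
word.Quit()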
The answer from Shivam Kotwalia works perfectly. However, the result is returned as bytes. Sometimes you may need it as a string, for example to run a regex on it.
I recommend the following code (the first two lines are from Shivam Kotwalia's answer):
import textract
text = textract.process("path/to/file.extension")
text = text.decode("utf-8")
The last line converts the bytes object text to a string.
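If the bytes are not valid UTF-8, decode("utf-8") raises UnicodeDecodeError. The sketch below tries a list of encodings in turn; the helper name bytes_to_text and the choice of cp1252 as the fallback are assumptions for illustration, not something textract prescribes.

```python
def bytes_to_text(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Decode raw bytes, trying each encoding in turn."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what decodes and mark the rest with U+FFFD
    return raw.decode(encodings[0], errors="replace")
```

The returned value is an ordinary str, so it can be passed straight to re.findall or similar functions.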
I agree with Shivam's answer, except that textract is not available for Windows.
Also, antiword sometimes fails to read '.doc' files and gives an error:
'filename.doc' is not a word document. # This happens when the file wasn't generated by MS Office, e.g. web pages may be stored offline in .doc format.
So I came up with the following workaround to extract the text:
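from bs4 import BeautifulSoup as bs
with open(filename) as f:
    soup = bs(f.read(), "html.parser")
# Drop <style> and <script> elements so only the visible text remains
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# Remove tabs and newlines, then trim surrounding whitespace
text = tmpText.replace('\t', '').replace('\n', '').strip()
print(text)
This script works for files that are really HTML pages saved with a .doc extension.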
Have fun!
Prerequisites :
install antiword : sudo apt-get install antiword
install docx : pip install docx
# Python 2 code (cStringIO and the print statement do not exist in Python 3)
from subprocess import Popen, PIPE
from docx import opendocx, getdocumenttext
from cStringIO import StringIO

def document_to_text(filename, file_path):
    # Run antiword on the file and capture what it prints to stdout
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    return stdout.decode('ascii', 'ignore')

print document_to_text('your_file_name','your_file_path')
Notice: new versions of python-docx removed this function. Make sure to pip install docx and not the new python-docx.
I looked for a solution for a long time. There is not much material about .doc files; I finally solved the problem by converting the .doc file to .docx:
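import os
from win32com import client as wc
w = wc.Dispatch('Word.Application')
# Or use the following method to start a separate process:
# w = wc.DispatchEx('Word.Application')
doc = w.Documents.Open(os.path.abspath('test.doc'))
# 16 is wdFormatDocumentDefault (.docx); an absolute path makes the output location predictable
doc.SaveAs(os.path.abspath('test_docx.docx'), 16)
doc.Close()
w.Quit()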
I had to do the same to search through a ton of *.doc files for a specific number and came up with:
special_chars = {
    "b'\\t'": '\t',
    "b'\\r'": '\n',
    "b'\\x07'": '|',
    "b'\\xc4'": 'Ä',
    "b'\\xe4'": 'ä',
    "b'\\xdc'": 'Ü',
    "b'\\xfc'": 'ü',
    "b'\\xd6'": 'Ö',
    "b'\\xf6'": 'ö',
    "b'\\xdf'": 'ß',
    "b'\\xa7'": '§',
    "b'\\xb0'": '°',
    "b'\\x82'": '‚',
    "b'\\x84'": '„',
    "b'\\x91'": '‘',
    "b'\\x93'": '“',
    "b'\\x96'": '-',
    "b'\\xb4'": '´'
}

def get_string(path):
    string = ''
    with open(path, 'rb') as stream:
        stream.seek(2560)  # Offset - text starts after byte 2560
        current_stream = stream.read(1)
        # Stop at the 0xFA terminator, or at end of file (read() returns b'')
        while current_stream and not (str(current_stream) == "b'\\xfa'"):
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum():
                        string += char
                except UnicodeDecodeError:
                    string += ''
            current_stream = stream.read(1)
    return string
I'm not sure how 'clean' this solution is, but it works well with regex.
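Most entries in the special_chars table match the Windows-1252 (cp1252) code page, so a shorter variant may be to decode whole chunks with that codec and translate only the Word-specific control bytes. This is a sketch under that assumption; decode_doc_bytes and WORD_CONTROLS are hypothetical names, and unlike get_string it keeps punctuation and spaces and turns byte 0x96 into an en dash rather than a hyphen.

```python
# Word-specific control bytes, taken from the special_chars table above
WORD_CONTROLS = {"\r": "\n", "\x07": "|"}


def decode_doc_bytes(chunk: bytes) -> str:
    """Decode a run of 8-bit text bytes from a .doc file as Windows-1252."""
    # errors="ignore" drops the few byte values cp1252 leaves undefined
    text = chunk.decode("cp1252", errors="ignore")
    return text.translate(str.maketrans(WORD_CONTROLS))
```

The offset and the terminator byte still have to be found as in get_string; this only replaces the per-byte lookup.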
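!pip install python-docx
import docx
# Open the Word file in binary mode
with open("file.docx", "rb") as doc:
    # Create the python-docx Document object
    document = docx.Document(doc)
# Join the text of all paragraphs
text = "\n".join(paragraph.text for paragraph in document.paragraphs)
print(text)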
The following code reads a .doc file in Python. Install the related packages first (requests and pywin32), then run it and check the result.
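import os
from io import BytesIO

import pythoncom
import requests
import win32com.client as win32

# doc_file and request come from the surrounding web handler (not shown)
if doc_file:
    _file = requests.get(request.values['MediaUrl0'])
    doc_file_link = BytesIO(_file.content)
    file_path = os.path.join(os.getcwd(), 'data.doc')
    # Save the download to disk so Word can open it
    with open(file_path, 'wb') as E:
        E.write(doc_file_link.getbuffer())
    pythoncom.CoInitialize()  # initialize COM for this thread
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(file_path)
    doc.Activate()
    doc_data = doc.Range().Text
    print(doc_data)
    doc.Close(False)
    if os.path.exists(file_path):
        os.remove(file_path)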
👉 The original file is in .docx format and contains multiple tables, but it seems to have format problems, so python-docx cannot read it.
✔️ 1. Manual solution:
Opening the file in Word and using the [Save As...] menu fixes the problem. A prompt box appears offering to upgrade the file to the newest format.
❓ 2. Question:
How can I perform the [Save As] step through python-docx, so that the .docx file is upgraded to the latest format?
😃 Thanks for any suggestions!
3. Appendix
from docx import Document
from win32com import client as wc
file = 'D:\\1.docx'
word = wc.Dispatch("Word.Application")
word.Visible = False
doc = word.Documents.Open(file)
doc.SaveAs("{}".format(file), 12)
doc.Close()
word.Quit()
As a compromise, we first create a blank DOCX and then use the Win32 libraries to copy the whole content into it. This has been tested and works.
I am still looking for a better method.
I am new to coding python and have trouble when I print out from a file (only tried from .rtf) as it displays all the file properties. I've tried a variety of ways to code the same thing, but the output is always similar. Example of the code and the output:
opener=open("file.rtf","r")
print(opener.read())
opener.close()
The file only contains this:
Camila
Employees
Try it
But the outcome is always:
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Camila\
\
Employees\
\
Try it}
Help? How to stop that from happening or what am I doing wrong?
The RTF file type contains more information than just the text, such as fonts.
Python reads the RTF file as plain text and therefore includes this information.
If you want only the plain text, you need a module that can translate it, such as striprtf.
Make sure the module is installed by running this on the command line:
pip install striprtf
Then, to get your text:
from striprtf.striprtf import rtf_to_text
# The with block closes the file even if rtf_to_text raises
with open("file.rtf", "r") as file:
    plaintext = rtf_to_text(file.read())
Use this package https://github.com/joshy/striprtf.
from striprtf.striprtf import rtf_to_text
rtf = "some rtf encoded string"
text = rtf_to_text(rtf)
print(text)
I am trying to extract text from a PDF using the pdfminer.six library, which I have already installed in my virtual environment. Here is my code:
import pdfminer as miner
text = miner.high_level.extract_text('file.pdf')
print(text)
but when I execute the code with python pdfreader.py I get the following error :
Traceback (most recent call last):
File ".\pdfreader.py", line 9, in <module>
text = miner.high_level.extract_text('pdfBulletins/corona1.pdf')
AttributeError: module 'pdfminer' has no attribute 'high_level'
I suspect it has something to do with the Python path, because I installed pdfminer inside my virtual environment, but I see that this installed pdf2txt.py outside in my system python install. Is this behavior normal? I mean something that happens inside my venv should not alter my system Python installation.
I successfully extracted the text using pdf2txt.py utility that comes with pdfminer.six library (from command line and using system python install), but not from the code inside my venv project. My pdfminer.six version is 20201018
What could be the problem with my code ?
extract_text from pdfminer.high_level accepts several optional parameters. The code below uses pdfminer.six, passes them explicitly (with what I believe are their default values) and extracts the text from my PDF files.
from pdfminer.high_level import extract_text
with open('my_file.pdf', 'rb') as pdf_file:
    text = extract_text(pdf_file, password='', page_numbers=None, maxpages=0, caching=True, codec='utf-8', laparams=None)
print(text)
Here are a couple of additional posts that I wrote on extracting text from PDF files that might be useful:
Unsuccessful attempt to extract text data from PDF
How to convert whole pdf to text in python
how to write code to extract a specific text and integer on the same line from a pdf file using python?
Python Data Extraction from an Encrypted PDF
Your problem is trying to use a function from a module you have not imported. Importing pdfminer does NOT automatically also import pdfminer.high_level.
This works:
from pdfminer.high_level import extract_text
text = extract_text('file.pdf')
print(text)
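The same behaviour can be reproduced with a throwaway package, which may make the rule easier to see. This is only an illustration: demo_pkg and its high_level module are made-up names created in a temporary directory, and it assumes (as the AttributeError suggests) that the package's __init__.py does not import the submodule itself.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk: demo_pkg/__init__.py and demo_pkg/high_level.py
tmp_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(tmp_dir, "demo_pkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")  # empty, so the package imports no submodules itself
with open(os.path.join(pkg_dir, "high_level.py"), "w") as f:
    f.write("def extract_text(path):\n    return 'text from ' + path\n")

sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

import demo_pkg
before = hasattr(demo_pkg, "high_level")  # submodule not loaded yet

import demo_pkg.high_level
after = hasattr(demo_pkg, "high_level")   # the import bound the attribute

print(before, after)
```

Only after the explicit import of the submodule does the attribute exist on the package object.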
Try pdfreader to extract text (both plain text and text containing PDF operators) from a PDF document.
Here is sample code that extracts both from all document pages.
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
You'll need to install pdfminer.six instead of just pdfminer:
pip install pdfminer.six
Only after that, you can import extract_text as:
from pdfminer.high_level import extract_text
Problem in my case
pdfminer and pdfminer.six were both installed, so
from pdfminer.high_level import extract_text then tried to use the wrong package.
Solution
For me, uninstalling pdfminer worked:
pip uninstall pdfminer
Now you should have only pdfminer.six installed and should be able to import extract_text.
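To see which of the two distributions is present in the active environment, something like the following sketch may help. It uses the standard importlib.metadata module (Python 3.8+); the helper name installed_versions is made up for illustration.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions(["pdfminer", "pdfminer.six"]))
```

Run it with the interpreter of the virtual environment you actually use, since the result depends on which environment is active.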
I am trying to convert .doc documents to .docx documents using python. Getting inspiration from this post, I have tried the following code :
import subprocess
import glob
import os
root = "//PARADFS101/7folder/LIAGREV/Documents/RFP/"
data_path = root + '/data2/'
os.chdir(data_path)
for doc in glob.iglob("*.doc"):
    print(doc)
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', doc], shell = True)
Unfortunately, literally nothing happens: I get no error message, the code runs, and the docs are detected (which I checked with print), but I don't get any result. Any idea how I can troubleshoot this?
EDITS:
I am running on Windows, hence shell = True
I have tried double quotes
I have tried without spaces in the names
When I execute the subprocess command on one file alone, I get 1 as output, which I don't know how to interpret...
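The 1 returned by subprocess.call is the exit code of the command, and a non-zero value generally signals failure. One way to troubleshoot, sketched below, is to check whether soffice is on PATH and to capture the command's stderr instead of discarding it. The helper names build_convert_command and run_and_report are hypothetical; the --outdir option is added so the output location is explicit.

```python
import shutil
import subprocess
import sys


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Argument list for a headless LibreOffice .doc -> .docx conversion."""
    return [soffice, "--headless", "--convert-to", "docx",
            "--outdir", out_dir, doc_path]


def run_and_report(cmd):
    """Run cmd without a shell and return (exit code, stdout, stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# shutil.which returns None when the executable is not on PATH
print("soffice found at:", shutil.which("soffice"))

# Demonstration with the Python interpreter itself, which is always available
code, out, err = run_and_report([sys.executable, "--version"])
print(code, out or err)
```

If shutil.which returns None, passing the full path to the soffice executable as the soffice argument is one thing to try. Without shell=True a missing executable raises FileNotFoundError instead of silently returning 1.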
My text file includes "SSS™" as one of its words and I am trying to find it using regular expression. My problem is with finding ™ superscript. My code is:
import re
path='G:\python_code\A.txt'
f_general=open(path, 'r')
special=re.findall(r'\U2122',f_general.read())
print(special)
but it doesn't print anything. How can I fix it?
It may have to do with the encoding of your file. Try this:
import re
path = r"g:\python_code\A.txt"  # raw string, so the backslashes are taken literally
f_general=open(path, "r", encoding="UTF-16")
data = f_general.read()
special=re.findall(chr(8482), data)
print(special)
print(chr(8482))
Note that I'm using the decimal code point of the trademark sign (8482). This is the site I use:
https://www.ascii.cl/htmlcodes.htm
Open your file in Notepad, choose Save As, and select the Unicode encoding (which is UTF-16); then this should all work. Working with non-ASCII characters can be a hassle. I am using Python 3.6; in Python 2.x the open() call and chr() would need changes (see the update below).
Note that when chr(8482) is printed on your command line it will probably just show as a T; at least that is what I get on Windows.
update
Try this for Python 2; it should capture the word before the trademark sign:
import re
with open(r"g:\python_code\A.txt", "rb") as f:
    data = f.read().decode("UTF-16")
# chr() only covers 0-255 in Python 2, so write the sign as a unicode escape
regex = re.compile(u"\\S+\u2122")
match = re.search(regex, data)
if match:
    print(match.group(0))
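As a side note on the original pattern: \U must be followed by eight hex digits (\U00002122), while \u takes four (\u2122), so r'\U2122' is not a valid escape for the trademark sign. A minimal Python 3 sketch of the search, using an in-memory sample string instead of the file, could look like this; words_with_trademark is a made-up helper name.

```python
import re

TRADEMARK = "\u2122"  # the trademark sign; same character as chr(8482)


def words_with_trademark(text):
    """Return every run of non-space characters that ends in the trademark sign."""
    return re.findall(r"\S+" + TRADEMARK, text)


sample = "Our product SSS\u2122 ships today"
matches = words_with_trademark(sample)
```

With the real file, pass the decoded contents (for example the data variable from the answer above) instead of sample.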