I have code written in Python that reads PDF files and converts them to text files.
The problem occurred when I tried to read Arabic text from PDF files. I know the error is in the encoding/decoding process, but I don't know how to fix it.
The system converts Arabic PDF files, but the resulting text file is empty
and this error is displayed:
Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)
Code:
import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime
def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path
print "\n"
folder = check_path("Provide absolute path for the folder: ")
list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            list.append(t)
m=len(list)
print (m)
i=0
while i<=m-1:
    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"
    tail=tail.replace(".pdf",".txt")
    name=head+var+tail
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f=open(name,'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i=i+1
You have a couple of problems:
content.encode('utf-8') doesn't do anything on its own. The return value is the encoded content, but you have to assign it back to a variable. Better yet, open the file with an encoding and write Unicode strings to that file; content appears to be Unicode data.
Example (works for both Python 2 and 3):
import io
f = io.open(name,'w',encoding='utf8')
f.write(content)
If you don't close the file properly, you may see no content because the file is not flushed to disk. You have f.close not f.close(). It's better to use with, which ensures the file is closed when the block exits.
Example:
import io
with io.open(name,'w',encoding='utf8') as f:
    f.write(content)
In Python 3 you don't need to import and use io.open; the built-in open is equivalent (io.open still works, though). Python 2 needs the io.open form.
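On Python 3 the same write can therefore be just this (a minimal sketch, reusing the name and content variables from your loop):
# Python 3: the built-in open accepts an encoding directly
with open(name, 'w', encoding='utf8') as f:
    f.write(content)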
You can use another library called pdfplumber instead of pypdf or PyPDF2:
import pdfplumber
import arabic_reshaper
from bidi.algorithm import get_display

with pdfplumber.open(r'example.pdf') as pdf:
    my_page = pdf.pages[10]
    thepages = my_page.extract_text()
    reshaped_text = arabic_reshaper.reshape(thepages)
    bidi_text = get_display(reshaped_text)
    print(bidi_text)
Related
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt
#converts pdf, returns its text content as a string
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    filepath = open(fname, 'rb')
    for page in PDFPage.get_pages(filepath, pagenums):
        interpreter.process_page(page)
    filepath.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text
def convertMultiple(pdfDir, txtDir):
    if pdfDir == "": pdfDir = os.getcwd() + "\\" #if no pdfDir passed in
    for pdf in os.listdir(pdfDir): #iterate through pdfs in pdf directory
        fileExtension = pdf.split(".")[-1]
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf
            text = convert(pdfFilename) #get string of text content of pdf
            textFilename = txtDir + pdf + ".txt"
            textFile = open(textFilename, "w") #make text file
            textFile.write(text) #write text to text file
            #textFile.close
pdfDir = (r"FK_EPPS")
txtDir = (r"FK_txt")
convertMultiple(pdfDir, txtDir)
I tried to convert multiple PDF files in a folder called FK_EPPS into txt files and write them to a different folder called FK_txt, but it says that there is no such file or directory. I put the folders exactly at those paths. I tried to find the solution but there is still an error. Can you help me understand why this happens?
/usr/local/lib/python2.7/dist-packages/pdfminer/__init__.py:20: UserWarning: On January 1st, 2020, pdfminer.six will stop supporting Python 2. Please upgrade to Python 3. For more information see https://github.com/pdfminer/pdfminer.six/issues/194
warnings.warn('On January 1st, 2020, pdfminer.six will stop supporting Python 2. Please upgrade to Python 3. For '
Traceback (most recent call last):
File "/home/a1-re/Documents/pdftotext/1.py", line 44, in <module>
convertMultiple(pdfDir, txtDir)
File "/home/a1-re/Documents/pdftotext/1.py", line 36, in convertMultiple
text = convert(pdfFilename) #get string of text content of pdf
File "/home/a1-re/Documents/pdftotext/1.py", line 21, in convert
filepath = file(fname, 'rb')
IOError: [Errno 2] No such file or directory: 'pdf1831150030.pdf'
(There is no way the traceback that you show is correct. With your sample input, the error should have contained FK_EPPS at the start.)
You forgot that a path and a filename must be separated from each other with the appropriate separator for your OS.
You could have seen this immediately if you had printed out the value of fname at the start of that convert function. You make the same mistake for the text output filename; that one is harder to notice because it does not raise an error, it only creates a wrongly named file.
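A sketch of the fix, assuming the rest of your script stays as it is: build both filenames with os.path.join so the right separator is inserted for you, and print the path before opening it to confirm it points where you expect.
import os

def convertMultiple(pdfDir, txtDir):
    if pdfDir == "": pdfDir = os.getcwd()          # if no pdfDir passed in
    for pdf in os.listdir(pdfDir):                 # iterate through pdfs in pdf directory
        if pdf.split(".")[-1] == "pdf":
            pdfFilename = os.path.join(pdfDir, pdf)        # directory + separator + filename
            print(pdfFilename)                             # check the full path before converting
            text = convert(pdfFilename)                    # get string of text content of pdf
            textFilename = os.path.join(txtDir, pdf + ".txt")
            with open(textFilename, "w") as textFile:      # make text file and close it properly
                textFile.write(text)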
My code works ok except for hashing. It works fine on hashing text files but as soon as it encounters a jpg or other file type, it crashes. I know it's some type of encoding error, but I'm stumped on how to encode it properly for non-text files.
#import libraries
import os
import time
from datetime import datetime
import logging
import hashlib
from prettytable import PrettyTable
from pathlib import Path
import glob
#user input
path = input ("Please enter directory: ")
print ("===============================================")
#processing input
if os.path.exists(path):
    print("Processing directory: ", (path))
else:
    print("Invalid directory.")
    logging.basicConfig(filename="error.log", level=logging.ERROR)
    logging.error(' The directory is not valid, please run the script again with the correct directory.')
print ("===============================================")
#process directory
directory = Path(path)
paths = []
filename = []
size = []
hashes = []
modified = []
files = list(directory.glob('**/*.*'))
for file in files:
    paths.append(file.parents[0])
    filename.append(file.parts[-1])
    size.append(file.stat().st_size)
    modified.append(datetime.fromtimestamp(file.stat().st_mtime))
    with open(file) as f:
        hashes.append(hashlib.md5(f.read().encode()).hexdigest())
#output into table
report = PrettyTable()
column_names = ['Path', 'File Name', 'File Size', 'Last Modified Time', 'MD5 Hash']
report.add_column(column_names[0], paths)
report.add_column(column_names[1], filename)
report.add_column(column_names[2], size)
report.add_column(column_names[3], modified)
report.add_column(column_names[4], hashes)
report.sortby = 'File Size'
print(report)
Change the following lines:
with open(file) as f:
    hashes.append(hashlib.md5(f.read().encode()).hexdigest())
to
with open(file, "rb") as f:
hashes.append(hashlib.md5(f.read()).hexdigest())
This way you read the contents directly as bytes and calculate the hash from those bytes.
Your version tried to read the file as text and then re-encode it to bytes.
Reading a file as text means the code tries to decode it with the system's default encoding. For some byte combinations this fails, because they are not valid code points in that encoding.
So just read everything directly as bytes.
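If you ever have to hash very large files, the same idea (binary mode, no decoding) also works chunk by chunk. This is just an optional sketch, not something your script strictly needs; the md5_of_file helper is only an illustrative name:
import hashlib

def md5_of_file(filepath, chunk_size=8192):
    """Illustrative helper: hash any file type by feeding raw bytes to md5 in chunks."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()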
I have a Python script that lists all the PDF files in a specified directory and converts them to text files.
The system works perfectly; the problem is that when a PDF contains Arabic text the script crashes because it is unable to search in the PDF file.
I know that PDFs are binary, but I do not know how to read Arabic and convert it to text.
How do I fix this error?
I tried to encode to UTF-8 and decode it, but it still does not work.
If I run the code with the lines near the bottom commented out, the result is an empty converted text file.
If I uncomment those lines in order to encode and decode, the result is an empty converted text file with this error:
Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 63, in
    content.decode('ascii', 'ignore')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)
Code:
import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime
def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path
print "\n"
folder = check_path("Provide absolute path for the folder: ")
list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            ##print(t)
            ##list.extend(t)
            list.append(t)
## print(list)
m=len(list)
print (m)
i=0
while i<=m-1:
    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"
    tail=tail.replace(".pdf",".txt")
    name=head+var+tail
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    ##pdf = pyPdf.PdfFileReader(codecs.open(path, "rb", encoding='UTF-8'))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f=open(name,'w')
    ## all these lines were added by me ##
    ##f.decode(content.encode('UTF-8'))
    ##content.encode('utf-8')
    ##content.decode('ascii', 'ignore')
    ##content.decode('unicode-escape')
    ##f.write(content)
    f.write(content.encode('UTF-8'))
    f.close
    i=i+1
5 years later: pypdf has gone through major updates. Extracting Arabic text should work fine the standard way:
from pypdf import PdfReader
reader = PdfReader("arabic.pdf")
full_text = ""
for page in reader.pages:
    full_text += page.extract_text() + "\n"
print(full_text)
If some PDF is causing issues, please report a bug. You need to share the pypdf version and the file that causes the issue.
What NOT to do
Don't do anything with encode / decode. It's not necessary.
Don't use PyPDF2 / PyPDF3 / PyPDF4: The community moved to pypdf.
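Applied to the original task (one .txt per .pdf in a folder), a minimal sketch could look like this; folder stands for whatever input directory you point it at (an assumption here), and the output is written as UTF-8:
from pathlib import Path
from pypdf import PdfReader

folder = "pdf-folder"                      # assumed input directory, adjust as needed
for pdf_path in Path(folder).glob("*.pdf"):
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() for page in reader.pages)
    # write the extracted (possibly Arabic) text next to the PDF as UTF-8
    pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")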
I have the following script to process filenames with non-latin characters:
import os
filelst = []
allfile = os.listdir(os.getcwd())
for file in allfile:
    if os.path.isfile(file):
        filelst.append(file)
w = open(os.getcwd()+'\\_filelist.txt','w+')
for file in allfile:
    w.write(file)
    w.write("\n")
w.close()
filelist in my folder:
new 1.py
ああっ女神さまっ 小っちゃいって事は便利だねっ.1998.Ep01-08.x264.AC3-CalChi.avi
ああっ女神さまっ 小っちゃいって事は便利だねっ.1998.Ep01-08.x264.AC3-CalChi.srt
output in _filelist.txt:
new 1.py
???????? ??????????????.1998.Ep01-08.x264.AC3-CalChi.avi
???????? ??????????????.1998.Ep01-08.x264.AC3-CalChi.srt
You should get the list of files as Unicode strings instead, by passing a Unicode file path to listdir. As you're using getcwd, use os.getcwdu().
Then open your output file with a text-encoding wrapper. The io module is the new way to do this (io handles universal newlines correctly).
Putting it all together:
import os
import io
filelst = []
allfile = os.listdir(os.getcwdu())
for file in allfile:
    if os.path.isfile(file):
        filelst.append(file)
w = io.open(os.getcwd()+'\\_filelist.txt','w+', encoding="utf-8")
for file in allfile:
    w.write(file)
    w.write("\n")
w.close()
In Windows and OS X this will just work, as filename translation is enforced. On Linux, a filename can be in any encoding (or none at all!). Therefore, make sure that whatever is creating your files (avi + srt) uses UTF-8, that your terminal is set to UTF-8, and that your locale is UTF-8.
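If you are not sure what your environment is set to, a quick diagnostic sketch with the standard library prints the relevant encodings:
import sys
import locale

print(sys.getfilesystemencoding())     # encoding Python uses for file names
print(locale.getpreferredencoding())   # default encoding used for text files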
You need to open your file with a proper encoding to write Unicode to it. You can use the codecs module to open the file:
import codecs
with codecs.open(os.getcwd()+'\\_filelist.txt','w+',encoding='your-encoding') as w:
    for file in allfile:
        w.write(file + '\n')
You can use UTF-8 as your encoding, which is a universal encoding, or another suitable encoding depending on your Unicode data. Also note that instead of opening the file and closing it manually, you can use the with statement to open the file, which closes it automatically at the end of the block.
I am trying to write an application that opens the txt files inside selected (sub)folder(s), replaces all occurrences of the letter "ž" with the letter "š", and saves the result in UTF-8 format.
This is what I managed to do so far (VERSION 2 - see edit):
import os
import codecs
startIn = os.getcwd()
print()
print("Pregledujem: " + startIn + "\\")
print("-------------------------")
for dirName, subdirList, fileList in os.walk(startIn):
    print()
    print("Trenutna mapa: " + dirName + "\\")
    for fname in fileList:
        if fname.endswith(".srt"):
            fullpath = dirName + "\\" + fname
            print(" Podnapis: " + fname)
            with codecs.open(fullpath, 'r+', "UTF-8-sig") as cursub:
                lines = cursub.read().replace("ž","š")
                cursub.seek(0)
                cursub.write(lines)
EDIT
Replacing the letters now works like it should, but I still can't figure out how to properly encode the file to UTF-8.
The current version outputs the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 220: invalid start byte
If you want to read and write, open in r+ mode:
cursub = codecs.open(filename, 'r+',"utf-8")
lines = cursub.read().replace("ž", "š")
cursub.seek(0) # go back to start of file
cursub.write(lines) # rewrite updated lines
Using with will close the file automatically:
with codecs.open(filename, 'r+', "utf-8") as cursub:
    lines = cursub.read().replace("ž", "š")
    cursub.seek(0)
    cursub.write(lines)
If you are going to edit (or rather rewrite) a file, you shouldn't open it in write mode, because that makes it impossible to read from it.
Either read the full file into memory first, or write to a copy while reading from the original (or make a copy first and read from the copy, rewriting the original).
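For example, the "write to a copy while reading from the original" variant could look roughly like this. It is only a sketch: it assumes the input really is UTF-8 with an optional BOM (which is exactly what the UnicodeDecodeError above says is not always the case), fullpath is the file found by the os.walk loop above, and the .tmp name is just an illustrative choice:
import io
import os

src = fullpath                  # the .srt path found by os.walk above
tmp = src + ".tmp"              # illustrative temporary name for the copy

with io.open(src, "r", encoding="utf-8-sig") as infile, \
        io.open(tmp, "w", encoding="utf-8") as outfile:
    for line in infile:
        outfile.write(line.replace("ž", "š"))

os.replace(tmp, src)            # swap the rewritten copy in place of the original (Python 3.3+)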