I am trying to write an application that opens the txt files inside selected (sub)folder(s), replaces all occurrences of the letter "ž" with "š", and saves the result in UTF-8 format.
This is what I managed to do so far (VERSION 2 - see edit):
import os
import codecs

startIn = os.getcwd()
print()
print("Pregledujem: " + startIn + "\\")
print("-------------------------")
for dirName, subdirList, fileList in os.walk(startIn):
    print()
    print("Trenutna mapa: " + dirName + "\\")
    for fname in fileList:
        if fname.endswith(".srt"):
            fullpath = dirName + "\\" + fname
            print("  Podnapis: " + fname)
            with codecs.open(fullpath, 'r+', "UTF-8-sig") as cursub:
                lines = cursub.read().replace("ž", "š")
                cursub.seek(0)
                cursub.write(lines)
EDIT
Replacing the letters now works like it should, but I still can't figure out how to properly encode the file TO UTF-8.
The current version outputs the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 220: invalid start byte
If you want to read and write, open in r+ mode:

cursub = codecs.open(filename, 'r+', "utf-8")
lines = cursub.read().replace("ž", "š")
cursub.seek(0)       # go back to start of file
cursub.write(lines)  # rewrite updated lines
Using with will close the file automatically:
with codecs.open(filename, 'r+', "utf-8") as cursub:
    lines = cursub.read().replace("ž", "š")
    cursub.seek(0)
    cursub.write(lines)
If you are going to edit (or rather rewrite) a file, you shouldn't open it in write mode ('w'), because that truncates the file and makes it impossible to read the original contents.
Either read the full file into memory first or write to a copy while reading from the original (or make a copy first and read from the copy, rewriting the original).
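As for the remaining UnicodeDecodeError: it means the file is not UTF-8 to begin with. The byte 0x9a happens to be "š" in Windows-1250, the usual legacy encoding for Slovenian subtitles. A minimal sketch, assuming the source files really are Windows-1250 (cp1250): read the whole file into memory with the legacy codec, then rewrite it as UTF-8.

import codecs

fullpath = "example.srt"  # hypothetical path; assumes a Windows-1250 encoded source

# read the whole file into memory, decoding from the legacy encoding
with codecs.open(fullpath, 'r', 'cp1250') as src:
    text = src.read().replace("ž", "š")

# rewrite the same file, this time encoded as UTF-8
with codecs.open(fullpath, 'w', 'utf-8') as dst:
    dst.write(text)

Reading everything first and reopening in 'w' mode also sidesteps the r+ caveat above: the original content is safely in memory before the file is truncated.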
Related
I'm parsing a file that is ISO-8859-1 and has Chinese, Japanese, and Korean characters in it.
import os
from os import listdir

cnt = 0
base_path = 'data/'
cwd = os.path.abspath(os.getcwd())
for f in os.listdir(base_path):
    path = cwd + '/' + base_path + f
    cnt = 0
    with open(path, 'r', encoding='ISO-8859-1') as file:
        for line in file:
            print('line {}: {}'.format(cnt, line))
            cnt += 1
The code runs, but it prints broken characters. Other Stack Overflow questions suggest I use encode and decode. For example, for Korean text, I tried
file.read().encode('latin1').decode('euc-kr'), but that didn't do anything. I also tried to convert the files to UTF-8 using iconv, but the characters are still broken in the converted text file.
Any suggestions would be much appreciated.
Sorry, no. ISO-8859-1 cannot contain any Chinese, Japanese, or Korean characters; the code page doesn't support them in the first place.
What your code does is ask Python to assume the file is in ISO-8859-1 encoding and return the characters as Unicode (which is how Python strings are built). If you do not pass the encoding parameter to open(), the default is the platform's preferred locale encoding (often UTF-8), and you still get Unicode back, i.e. logical characters without any particular encoding attached.
Now the question is how those CJK characters are actually encoded in the file. If you know the answer, you can just put the right encoding parameter in open() and it will work right away. Say it is EUC-KR as you mentioned; the code should be:
with open(path, 'r', encoding='euc-kr') as file:
    for line in file:
        print('line {}: {}'.format(cnt, line))
        cnt += 1
If you feel frustrated, please take a look at chardet. It should help you detect the encoding of the text. Example:
import chardet

with open(path, 'rb') as file:
    rawdata = file.read()

guess = chardet.detect(rawdata)  # e.g. {'encoding': 'EUC-KR', 'confidence': 0.99}
text = rawdata.decode(guess['encoding'])

cnt = 0
for line in text.splitlines():
    print('line {}: {}'.format(cnt, line))
    cnt += 1
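Once chardet has named the encoding, converting the file to UTF-8 on disk is one more step. A hedged sketch building on the snippet above (path is the same hypothetical input file):

import chardet

path = 'data/example.txt'  # hypothetical input file

with open(path, 'rb') as file:
    rawdata = file.read()

encoding = chardet.detect(rawdata)['encoding']

# decode with the detected encoding, then rewrite the file as UTF-8
text = rawdata.decode(encoding)
with open(path, 'w', encoding='utf-8') as out:
    out.write(text)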
I have a Python script that lists all the PDF files in a specified directory and converts them to text files.
The system works perfectly; the problem is that when I have ARABIC text, the script crashes because it is unable to search in the PDF file.
I know that PDFs are binary, but I do not know how to read ARABIC and convert it to text.
How do I fix this error?
I tried to encode to UTF-8 and decode it, but it is still not working.
If I run the code with the lines at the bottom commented out, the result is an empty converted text file.
If I uncomment those lines in order to encode and decode, the result is an empty converted text file with this error:
Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 63, in
    content.decode('ascii', 'ignore')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)
Code:

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path

print "\n"
folder = check_path("Provide absolute path for the folder: ")

list = []
directory = folder
for root, dirs, files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t = os.path.join(directory, filename)
            ##print(t)
            ##list.extend(t)
            list.append(t)
## print(list)

m = len(list)
print(m)
i = 0
while i <= m - 1:
    path = list[i]
    print(path)
    head, tail = os.path.split(path)
    var = "\\"
    tail = tail.replace(".pdf", ".txt")
    name = head + var + tail
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    ##pdf = pyPdf.PdfFileReader(codecs.open(path, "rb", encoding='UTF-8'))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f = open(name, 'w')
    ## all these lines below were added by me ##
    ##f.decode(content.encode('UTF-8'))
    ##content.encode('utf-8')
    ##content.decode('ascii', 'ignore')
    ##content.decode('unicode-escape')
    ##f.write(content)
    f.write(content.encode('UTF-8'))
    f.close
    i = i + 1
5 years later: pypdf has gone through major updates. Extracting Arabic text should work fine the standard way:
from pypdf import PdfReader

reader = PdfReader("arabic.pdf")
full_text = ""
for page in reader.pages:
    full_text += page.extract_text() + "\n"
print(full_text)
If some PDF is causing issues, please report a bug. You need to share the pypdf version and the file that causes the issue.
What NOT to do
Don't do anything with encode / decode. It's not necessary.
Don't use PyPDF2 / PyPDF3 / PyPDF4: The community moved to pypdf.
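To mirror the original script (walk a folder and drop a .txt next to every .pdf) with modern pypdf, a minimal sketch; the folder path is a hypothetical placeholder:

import os
from pypdf import PdfReader

folder = r"C:\pdfs"  # hypothetical folder; adjust as needed

for root, dirs, files in os.walk(folder):
    for filename in files:
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(root, filename)
            reader = PdfReader(pdf_path)
            text = "\n".join(page.extract_text() or "" for page in reader.pages)
            txt_path = pdf_path[:-4] + ".txt"
            # Python 3's open() takes an encoding, so no manual encode step is needed
            with open(txt_path, "w", encoding="utf-8") as out:
                out.write(text)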
I have code written in Python that reads PDF files and converts them to text files.
The problem occurred when I tried to read Arabic text from PDF files. I know that the error is in the encoding/decoding process, but I don't know how to fix it.
The system converts Arabic PDF files, but the text file is empty
and it displays this error:

Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)
Code:
import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path

print "\n"
folder = check_path("Provide absolute path for the folder: ")

list = []
directory = folder
for root, dirs, files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t = os.path.join(directory, filename)
            list.append(t)

m = len(list)
print(m)
i = 0
while i <= m - 1:
    path = list[i]
    print(path)
    head, tail = os.path.split(path)
    var = "\\"
    tail = tail.replace(".pdf", ".txt")
    name = head + var + tail
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f = open(name, 'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i = i + 1
You have a couple of problems:
content.encode('utf-8') doesn't do anything on its own. The return value is the encoded content, but you have to assign it to a variable. Better yet, open the file with an encoding and write Unicode strings to that file; content appears to be Unicode data.
Example (works for both Python 2 and 3):
import io
f = io.open(name,'w',encoding='utf8')
f.write(content)
If you don't close the file properly, you may see no content because the file is never flushed to disk. You have f.close, not f.close(). It's better to use with, which ensures the file is closed when the block exits.
Example:
import io

with io.open(name, 'w', encoding='utf8') as f:
    f.write(content)
In Python 3 you don't need to import io; the built-in open is equivalent, though the io.open form still works. Python 2 needs the io.open form.
You can use another library called pdfplumber instead of pypdf or PyPDF2:
import pdfplumber
import arabic_reshaper
from bidi.algorithm import get_display

with pdfplumber.open(r'example.pdf') as pdf:
    my_page = pdf.pages[10]
    thepages = my_page.extract_text()

reshaped_text = arabic_reshaper.reshape(thepages)
bidi_text = get_display(reshaped_text)
print(bidi_text)
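A note on why the two extra libraries are needed: extracted Arabic text comes back in logical order with isolated letter forms, so arabic_reshaper rejoins the glyphs and python-bidi reorders them for display. A hedged sketch extending the same approach to every page of a hypothetical file:

import pdfplumber
import arabic_reshaper
from bidi.algorithm import get_display

# hypothetical input file; process every page instead of just page 10
with pdfplumber.open(r'example.pdf') as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if text:  # guard against pages with no extractable text
            print(get_display(arabic_reshaper.reshape(text)))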
I have the following script to process filenames with non-Latin characters:
import os

filelst = []
allfile = os.listdir(os.getcwd())
for file in allfile:
    if os.path.isfile(file):
        filelst.append(file)

w = open(os.getcwd() + '\\_filelist.txt', 'w+')
for file in allfile:
    w.write(file)
    w.write("\n")
w.close()
filelist in my folder:
new 1.py
ああっ女神さまっ 小っちゃいって事は便利だねっ.1998.Ep0108.x264.AC3CalChi.avi
ああっ女神さまっ 小っちゃいって事は便利だねっ.1998.Ep0108.x264.AC3CalChi.srt
output in _filelist.txt:
new 1.py
???????? ??????????????.1998.Ep01-08.x264.AC3-CalChi.avi
???????? ??????????????.1998.Ep01-08.x264.AC3-CalChi.srt
You should get the list of files as Unicode strings instead, by passing a Unicode file path to listdir. As you're using getcwd, use os.getcwdu().
Then open your output file with a text-encoding wrapper. The io module is the new way to do this (it handles universal newlines correctly).
Putting it all together:
import os
import io

filelst = []
allfile = os.listdir(os.getcwdu())
for file in allfile:
    if os.path.isfile(file):
        filelst.append(file)

w = io.open(os.getcwd() + '\\_filelist.txt', 'w+', encoding="utf-8")
for file in allfile:
    w.write(file)
    w.write("\n")
w.close()
On Windows and OS X this will just work, as filename translation is enforced. On Linux, a filename can be in any encoding (or none at all!). Therefore, ensure that whatever is creating your files (avi + srt) uses UTF-8, and that your terminal and locale are set to UTF-8.
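For what it's worth, under Python 3 this whole problem disappears: os.listdir already returns str (Unicode) and the built-in open takes an encoding. A minimal Python 3 sketch of the same task, keeping the question's output filename:

import os

# Python 3: listdir returns Unicode strings, no getcwdu() needed
allfile = [f for f in os.listdir(os.getcwd()) if os.path.isfile(f)]

with open('_filelist.txt', 'w', encoding='utf-8') as w:
    for name in allfile:
        w.write(name + '\n')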
You need to open your file with a proper encoding to write Unicode into it. You can use the codecs module for opening the file:
import codecs

with codecs.open(os.getcwd() + '\\_filelist.txt', 'w+', encoding='your-encoding') as w:
    for file in allfile:
        w.write(file + '\n')
You can use UTF-8 as your encoding, which is a universal choice, or another suitable encoding depending on your text. Also note that instead of opening the file and closing it manually, you can use the with statement, which closes the file automatically at the end of the block.
I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.
#!/usr/bin/env python
import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think the file may be
    # encoded in inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')
    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)
    # now start iterating over our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, the continue
            # keyword will start again with the second encoding
            # from the tuple and so on... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue
    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)
    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()
How can I do that?
Thank you
import codecs
f = codecs.open(filename, 'r', 'cp1251')
u = f.read() # now the contents have been transformed to a Unicode string
out = codecs.open(output, 'w', 'utf-8')
out.write(u) # and now the contents have been output as UTF-8
Is this what you intend to do?
This is just a guess, since you didn't specify what you mean by "doesn't work".
If the file is being generated properly but appears to contain garbage characters, the application you're viewing it with likely does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file: the 3 bytes 0xEF, 0xBB, 0xBF (unencoded).
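Rather than emitting those three bytes by hand, the 'utf-8-sig' codec writes the BOM for you on the first write; a quick sketch with a hypothetical output path:

import codecs

# 'utf-8-sig' prepends the UTF-8 BOM (0xEF 0xBB 0xBF) automatically
with codecs.open('output.txt', 'w', 'utf-8-sig') as out:
    out.write(u'\u0442\u0435\u0441\u0442')  # hypothetical Cyrillic sample ("тест")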
If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:
import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)
This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".
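Putting both answers together, a hedged rewrite of the original helper for the encodings actually involved (Windows-1251 in, UTF-8 out), keeping the question's .bak backup step:

import codecs
import shutil

def convert_to_utf8(filename):
    # read and decode the original Windows-1251 content
    with codecs.open(filename, 'r', 'cp1251') as f:
        data = f.read()
    # keep a backup copy before rewriting in place
    shutil.copy(filename, filename + '.bak')
    # rewrite the same file encoded as UTF-8
    with codecs.open(filename, 'w', 'utf-8') as f:
        f.write(data)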