I have this file write function:
def filewrite(folderpath, filename, strdata, encmode):
    try:
        path = os.path.join(folderpath, filename)
        if not path:
            return
        create_dir_path(folderpath)
        #path = os.path.join(folderpath, filepath)
        with codecs.open(path, mode='w', encoding=encmode) as fp:
            fp.write(unicode(strdata))
    except Exception, e:
        raise Exception(e)
which I am using to write data to a file:
filewrite(folderpath, filename, strdata, 'utf-16')
But when I try to read this file, I get the exception:
Exception: UTF-16 stream does not start with BOM
My file read function is shown below:
def read_in_chunks(file_object, chunk_size=4096):
    try:
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data
    except Exception, ex:
        raise ex
def fileread(folderPath, fileName, encmode):
    try:
        path = os.path.join(folderPath, fileName)
        fileData = ''
        if os.access(path, os.R_OK):
            with codecs.open(path, mode='r', encoding=encmode) as fp:
                for block in read_in_chunks(fp):
                    fileData = fileData + block
            return fileData
        return ''
    except Exception, ex:
        raise ex
Please let me know what I am doing wrong here.
Thanks
There doesn't appear to be anything wrong with your code. Running it on my machine creates the proper BOM at the start of the file automatically.
The BOM is a sequence of bytes at the start of the file that indicates the byte order in which multi-byte encodings such as UTF-16 should be read - you can read about system endianness if you're interested.
If you're running on a Mac or Linux you should be able to run hd your_utf16file or hexdump your_utf16file to check the raw bytes inside the file. Running your code, I saw the correct bytes 0xff 0xfe at the beginning of mine.
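If you'd rather check from Python instead, here is a quick sketch (the filename is a placeholder) that inspects those first two bytes directly:
# Open in binary mode so the raw BOM bytes are visible.
with open('your_utf16file', 'rb') as fp:
    print(repr(fp.read(2)))  # expect the bytes 0xff 0xfe for a little-endian BOM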
Try replacing the reading portion of your fileread function with
with codecs.open(path, mode='r', encoding=encmode) as fp:
    for block in fp:
        print block
to ensure you can still read the file after eliminating external factors (your read_in_chunks function).
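As a quick sanity check with all external factors removed, here is a minimal round-trip sketch (the filename test16.txt is a placeholder); codecs adds the UTF-16 BOM on write and consumes it on read:
import codecs

# Write a UTF-16 file, then read it back through the same codec.
with codecs.open('test16.txt', mode='w', encoding='utf-16') as fp:
    fp.write(u'hello')
with codecs.open('test16.txt', mode='r', encoding='utf-16') as fp:
    print(fp.read())  # prints: hello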
I have a script which wants to load integers from a text file. If the file does not exist, I want the user to be able to browse for a different file (or the same file in a different location; I have a UI implementation for that).
What I don't get is the purpose of exception handling, or catching exceptions. From what I have read, it seems to be something you can use to log errors, but if an input is needed, catching the exception won't fix that. I am wondering whether a while loop in the except block is the approach to use (or whether I should not use try/except for loading a file at all)?
try:
    with open(myfile, 'r') as f:
        contents = f.read()
        print("From text file : ", contents)
except FileNotFoundError as Ex:
    print(Ex)
You need to use a while loop and a variable to track whether the file has been found; if it is not found, read a new file name from input() and try again, and so on:
filenotfound = True
file_path = myfile
while filenotfound:
    try:
        with open(file_path, 'r') as f:
            contents = f.read()
            print("From text file : ", contents)
        filenotfound = False
    except FileNotFoundError as Ex:
        file_path = str(input())
        filenotfound = True
I'm trying to open a text file and then read through it replacing certain strings with strings stored in a dictionary. Based on answers to Replacing words in text file using a dictionary and How to search and replace text in a file using Python?
Like this:
# edit print line to print (line)
import fileinput

text = "sample file.txt"
fields = {"pattern 1": "replacement text 1", "pattern 2": "replacement text 2"}

for line in fileinput.input(text, inplace=True):
    line = line.rstrip()
    for field in fields:
        if field in line:
            line = line.replace(field, fields[field])
    print (line)
My file is encoded in UTF-8.
When I run this, the console shows this error:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
When I add encoding="utf8" to fileinput.FileInput(), it shows the error:
TypeError: __init__() got an unexpected keyword argument 'encoding'
When I add openhook=fileinput.hook_encoded("utf8") to fileinput.FileInput(), it shows the error:
ValueError: FileInput cannot use an opening hook in inplace mode
I do not want to pass 'ignore' for the errors argument and silently drop characters.
I have a file and a dictionary, and I want to replace the dictionary keys in the file with their values.
Source file in utf-8:
Plain text on the line in the file.
This is a greeting to the world.
Hello world!
Here's another plain text.
And here too!
I want to replace the word world with the word earth.
In dictionary: {"world": "earth"}
Modified file in utf-8:
Plain text on the line in the file.
This is a greeting to the earth.
Hello earth!
Here's another plain text.
And here too!
The fileinput library has several problems that I addressed in the past in a blog post; one of these is that you can't set the encoding and use in-place file rewriting.
The following code can do this, but you have to replace your print() calls with writes to the outgoing file object:
from contextlib import contextmanager
import io
import os

@contextmanager
def inplace(filename, mode='r', buffering=-1, encoding=None, errors=None,
            newline=None, backup_extension=None):
    """Allow for a file to be replaced with new content.

    Yields a tuple of (readable, writable) file objects, where writable
    replaces readable.

    If an exception occurs, the old file is restored, removing the
    written data.

    mode should *not* use 'w', 'a' or '+'; only read-only modes are
    supported.

    """
    # move existing file to backup, create new file with same permissions
    # borrowed extensively from the fileinput module
    if set(mode).intersection('wa+'):
        raise ValueError('Only read-only file modes can be used')

    backupfilename = filename + (backup_extension or os.extsep + 'bak')
    try:
        os.unlink(backupfilename)
    except os.error:
        pass
    os.rename(filename, backupfilename)
    readable = io.open(backupfilename, mode, buffering=buffering,
                       encoding=encoding, errors=errors, newline=newline)
    try:
        perm = os.fstat(readable.fileno()).st_mode
    except OSError:
        writable = open(filename, 'w' + mode.replace('r', ''),
                        buffering=buffering, encoding=encoding, errors=errors,
                        newline=newline)
    else:
        os_mode = os.O_CREAT | os.O_WRONLY | os.O_TRUNC
        if hasattr(os, 'O_BINARY'):
            os_mode |= os.O_BINARY
        fd = os.open(filename, os_mode, perm)
        writable = io.open(fd, "w" + mode.replace('r', ''), buffering=buffering,
                           encoding=encoding, errors=errors, newline=newline)
        try:
            if hasattr(os, 'chmod'):
                os.chmod(filename, perm)
        except OSError:
            pass
    try:
        yield readable, writable
    except Exception:
        # move backup back
        try:
            os.unlink(filename)
        except os.error:
            pass
        os.rename(backupfilename, filename)
        raise
    finally:
        readable.close()
        writable.close()
        try:
            os.unlink(backupfilename)
        except os.error:
            pass
So your code would look like:
text = "sample file.txt"
fields = {"pattern 1": "replacement text 1", "pattern 2": "replacement text 2"}

with inplace(text, encoding='utf8') as (infh, outfh):
    for line in infh:
        for field in fields:
            if field in line:
                line = line.replace(field, fields[field])
        outfh.write(line)
Note that you don't have to strip the newline now, since outfh.write() does not append one the way print() does.
I tried to use this:
with open(fileName1, "r+", encoding="utf8", newline='') as fileIn, \
        open(fileName1, "r+", encoding="utf8", newline='') as fileOut:
    for line in fileIn:
        for field in fields:
            if field in line:
                line = line.replace(field, fields[field])
        fileOut.write(line)
Note: when using one file, leftover text is pushed to the end of the file.
So far I have not figured out why. The amount of leftover does not correspond to the number of replacements (the number of replacements is greater than the number of leftover lines).
Pseudo-mathematically: len(original) < len(modified) + len(trailing leftover of original).
I would like to fix it.
Edit: when I use two files, everything works correctly. Change fileName1 to fileName2 in the second open() and change the mode argument to "w+".
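For reference, here is a minimal sketch of the two-file variant described in the edit (fileName1, fileName2 and fields are assumed to be defined as before):
# Read from fileName1 and write the replaced lines to a second file.
with open(fileName1, "r+", encoding="utf8", newline='') as fileIn, \
        open(fileName2, "w+", encoding="utf8", newline='') as fileOut:
    for line in fileIn:
        for field in fields:
            if field in line:
                line = line.replace(field, fields[field])
        fileOut.write(line)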
I created a management command in which I want to download a CSV file from FTP and update the database if necessary.
The code that I have is as follows:
class Command(BaseCommand):
    @staticmethod
    def update_database(row):
        code = row['CODE']
        vat_number = row['BTWNR']
        if code != "" and vat_number != "":
            # pdb.set_trace()
            print("Code: {0} --- BTW: {1}".format(code, vat_number))

    def read_file(self):
        path_to_relations_csv_file = "{0}/{1}".format(destination, file_name)
        with open(path_to_relations_csv_file) as csvfile:
            relations_reader = csv.DictReader(csvfile, delimiter=';')
            for row in relations_reader:
                self.update_database(row)

    def handle(self, *args, **options):
        # Open ftp connection
        ftp = ftplib.FTP(ftp_host, ftp_username, ftp_password)
        try:
            ftp.cwd(source)  # Go to the remote folder where the relations file is
            os.chdir(destination)  # Go to the local folder where the relations file will be downloaded
            print("Switched to the directory successfully. Current directory: {}".format(destination))
        except OSError:
            pass
        except ftplib.error_perm:
            print("Error: could not change to {0}".format(source))
            sys.exit("Ending Application")

        try:
            # Get the file
            relations = open(file_name, "wb")  # Opens the local file
            ftp.retrbinary('RETR {0}'.format(file_name), relations.write)  # Writes the remote file to it
            print("File {0} downloaded successfully.".format(file_name))
            relations.close()  # Closes the local file
            print("Local file closed.")
            ftp.quit()  # Closes the ftp connection
            print("FTP connection closed.")
            try:
                self.read_file()
            except:
                print("Error: Unable to read the file.")
        except:
            print("Error: File {0} could not be downloaded.".format(file_name))
But in the read_file method the for loop gives me the error. If I place pdb.set_trace() before the for loop, I can see that relations_reader is <csv.DictReader object at 0x10e67a6a0>, so it seems OK, but if I try to loop over it, it goes to the except and executes print("Error: Unable to read the file.").
The paths are correct.
If the same code is executed as a separate file with python file_name.py, and not as a command with python manage.py file_name, everything works fine.
Any idea what I am doing wrong?
The error iterator should return strings, not bytes (did you open the file in text mode?) means you need to open the CSV differently than you currently do.
To solve it, change the opening mode and the encoding to the ones that fit your CSV file:
with open(path_to_relations_csv_file, "rt", encoding='utf-8') as csvfile:
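Applied to the management command above, read_file would then look something like this sketch (the 'utf-8' encoding is an assumption; match it to your actual file):
def read_file(self):
    path_to_relations_csv_file = "{0}/{1}".format(destination, file_name)
    # Open in text mode with an explicit encoding so csv gets str, not bytes.
    with open(path_to_relations_csv_file, "rt", encoding='utf-8') as csvfile:
        relations_reader = csv.DictReader(csvfile, delimiter=';')
        for row in relations_reader:
            self.update_database(row)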
My code snippet can extract a file from a GZ archive and save it as a .txt file, but sometimes that file contains weird, gibberish-like text which crashes the extraction.
Method I use:
def unpackgz(name, path):
    file = path + '\\' + name
    outfilename = file[:-3] + ".txt"
    inF = gzip.open(file, 'rb')
    outF = open(outfilename, 'wb')
    outF.write(inF.read())
    inF.close()
    outF.close()
My question is how I can work around this - maybe something similar to with open(file, errors='ignore') as fil:. With the method above I can extract only healthy files.
EDIT to the first question:
def read_corrupted_file(filename):
    with gzip.open(filename, 'r') as f:
        for line in f:
            try:
                string += line
            except Exception as e:
                print(e)
        return string

newfile = open("corrupted.txt", 'a+')
cwd = os.getcwd()
srtNameb = "service" + str(46) + "b.gz"
localfilename3 = cwd + '\\' + srtNameb
newfile.write(read_corrupted_file(localfilename3))
This results in multiple errors.
Fixed to working state:
def read_corrupted_file(filename):
    string = ''
    newfile = open("corrupted.txt", 'a+')
    try:
        with gzip.open(filename, 'rb') as f:
            for line in f:
                try:
                    newfile.write(line.decode('ascii'))
                except Exception as e:
                    print(e)
    except Exception as e:
        print(e)

cwd = os.getcwd()
srtNameb = "service" + str(46) + "b.gz"
localfilename3 = cwd + '\\' + srtNameb
read_corrupted_file(localfilename3)
print('done')
Generally, if the file is corrupt, it will throw an error when you try to unzip it; there is not much you can do to still get the data, but if you just want to stop it crashing, you can use a try/except.
try:
    pass
except Exception as error:
    print(error)
Applying this logic, you could read line by line with gzip inside a try/except, and still move on to the next line when it hits a corrupted section.
import gzip

with gzip.open('input.gz', 'r') as f:
    for line in f:
        print('got line', line)
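Putting both pieces together, here is a minimal sketch (the function name and file paths are placeholders) that skips undecodable lines and stops cleanly if the compressed stream itself turns out to be truncated or corrupt:
import gzip

def extract_readable_lines(archive_path, out_path):
    with gzip.open(archive_path, 'rb') as inF, open(out_path, 'w') as outF:
        while True:
            try:
                line = inF.readline()
            except (EOFError, OSError) as e:
                print(e)  # truncated or corrupt stream: stop reading here
                break
            if not line:
                break  # clean end of file
            try:
                outF.write(line.decode('ascii'))
            except UnicodeDecodeError as e:
                print(e)  # gibberish line: skip it and keep going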
I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.
#!/usr/bin/env python

import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think that the file may be
    # encoded inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')
    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)
    # now start iterating in our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then the continue
            # keyword will start again with the second encoding
            # from the tuple and so on.... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue
    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)
    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()
How can I do that?
Thank you
import codecs
f = codecs.open(filename, 'r', 'cp1251')
u = f.read() # now the contents have been transformed to a Unicode string
out = codecs.open(output, 'w', 'utf-8')
out.write(u) # and now the contents have been output as UTF-8
Is this what you intend to do?
This is just a guess, since you didn't specify what you mean by "doesn't work".
If the file is being generated properly but appears to contain garbage characters, likely the application you're viewing it with does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).
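For example, here is a minimal sketch that prepends the BOM to an existing file ('output.txt' is a placeholder); codecs.BOM_UTF8 is exactly those three bytes:
import codecs

# Read the file back as raw bytes and prepend the UTF-8 BOM if missing.
with open('output.txt', 'rb') as f:
    data = f.read()
if not data.startswith(codecs.BOM_UTF8):  # avoid adding it twice
    with open('output.txt', 'wb') as f:
        f.write(codecs.BOM_UTF8 + data)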
If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:
import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)
This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".