How can I write characters such as § into a file using Python? - python

This is my code for creating the string to be written ('result' is the variable that holds the final text):
fileobj = open('file_name.yml','a+')
begin = initial+":0 "
n_name = '"§'+tag+name+'§!"'
begin_d = initial+"_desc:0 "
n_desc = '"§3'+desc+'§!"'
title = ' '+begin + n_name
descript = ' '+begin_d + n_desc
result = title+'\n'+descript
print()
fileobj.close()
return result
This is my code for actually writing it into the file:
text = writing(initial, tag, name, desc)
override = inserter(fileobj, country, text)
fileobj.close()
fileobj = open('file_name.yml','w+')
fileobj.write(override)
fileobj.close()
(P.S.: inserter is a function which works perfectly; override is the longer string it returns, to be written into the file.)
I have tried this with .txt and .yml files, but in both cases, instead of §, this is what takes its place: \xA7. (I cannot paste the actual text here, as it turns back into the correct character when copied; in the file, however, it appears as \xA7.) Everything else is unaffected, and the code runs fine.
Do let me know if I can improve the question in any way.

You're running into a problem called character encoding. There are two parts to the problem: the first is getting the encoding you want into the file; the second is getting whatever reads the file to interpret it with the same encoding.
The most flexible and common encoding is UTF-8, because it can handle any Unicode character while remaining backwards compatible with the very old 7-bit ASCII character set. Most Unix-like systems like Linux will handle it automatically.
fileobj = open('file_name.yml','w+',encoding='utf-8')
You can set the PYTHONIOENCODING environment variable to make it the default.
Windows operating systems are a little trickier because they'll rarely assume UTF-8, especially if it's a Microsoft program opening the file. There's a magic byte sequence called a BOM that will trigger Microsoft to use UTF-8 if it's at the beginning of a file. Python can add that automatically for you:
fileobj = open('file_name.yml','w+',encoding='utf_8_sig')
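As a sketch (the file names here are illustrative), the difference between the two encodings is visible in the raw bytes: utf_8_sig prepends the three-byte BOM, while plain utf-8 does not, and § itself becomes the two-byte sequence C2 A7 either way:

```python
# Illustrative sketch: write a string containing "§" with and without a BOM.
text = '"§3 example §!"'

with open('demo_utf8.yml', 'w', encoding='utf-8') as f:
    f.write(text)

with open('demo_bom.yml', 'w', encoding='utf_8_sig') as f:
    f.write(text)

# Re-open in binary mode to inspect the actual bytes on disk.
with open('demo_utf8.yml', 'rb') as f:
    plain = f.read()
with open('demo_bom.yml', 'rb') as f:
    bom = f.read()

print(plain[:3])  # b'"\xc2\xa7' -- the quote, then § as UTF-8 bytes C2 A7
print(bom[:3])    # b'\xef\xbb\xbf' -- the BOM
```

Reading the BOM file back with encoding='utf_8_sig' (or 'utf-8-sig') strips the marker again, so the round trip is transparent to Python code.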

Related

Changing Encoding of Text Files Using Python: It's Not Happening

Upon copying or pretty much touching files in any way, Windows changes their encoding to its default 1252: Western European. In the text editor I'm using, EditPad Pro Plus, I can see and convert the encoding. I trust that this conversion works, because I've been working with files between Windows and UNIX, and I know that when my text editor changes encodings, the files are read correctly in UNIX where they caused problems before.
I would like to convert files en masse. So I'm attempting to do that using Python in Windows 10, called from either Powershell (using Python v 3.6.2) or CygWin (using Python v 2.7.13). I see both codecs and io used for the job, and commentary that io is the proper way for Python 3.
But the files are not converted, whether I use codecs or io. The script below successfully copies the files, but my text editor still reports them as 1252. And the UniversalDetector (in the commented-out portions of the script below) reports their encoding as "ascii".
What needs to happen to get these to convert successfully?
import sys
import os
import io
#from chardet.universaldetector import UniversalDetector

BLOCKSIZE = 1048576

#detector = UniversalDetector()

#def get_encoding( current_file ):
#    detector.reset()
#    for line in file(current_file):
#        detector.feed(line)
#        if detector.done: break
#    detector.close()
#    return detector.result['encoding']

def main():
    src_dir = ""
    if len( sys.argv ) > 1:
        src_dir = sys.argv[1]
    if os.path.exists( src_dir ):
        dest_dir = src_dir[:-2]
        for file in os.listdir( src_dir ):
            with io.open( os.path.join( src_dir, file ), "r", encoding='cp1252') as source_file:
                with io.open( os.path.join( dest_dir, file ), "w", encoding='utf8') as target_file:
                    while True:
                        contents = source_file.read( BLOCKSIZE )
                        if not contents:
                            break
                        target_file.write( contents )
            #print( "Encoding of " + file + ": " + get_encoding( os.path.join( dest_dir, file ) ) )
    else:
        print( 'The specified directory does not exist.' )

if __name__ == "__main__":
    main()
I've tried some variations such as opening the file as UTF8, calling read() without the blocksize, and, originally, the encodings were specified a little differently. They all successfully copy the files, but do not encode them as intended.
ASCII is the common subset of a whole lot of encodings. It is a subset of UTF-8, Latin-1, and cp1252 -- and of the whole ISO-8859 family, which has encodings for Russian, Greek etc. If your files are really ASCII, there's nothing to convert, and your system is only saying "cp1252" because the files are compatible with it. You could add a BOM to tag a file as UTF (encoding utf-8-sig), but frankly I don't see the point. UTF doesn't actually need it, because UTF files are recognizable by the structure of multi-byte characters.
If you want to experiment with encodings, use text that contains non-ASCII characters: French, Russian, Chinese, or even English with some accented words (or the silly directed quotes that Microsoft applications like to insert). Save the words "Wikipédia en français" in a file and repeat your experiments, and you'll get very different results.
I strongly recommend using Python 3 for this, and for anything else to do with character encodings. The Python 2 approach to encodings results in a lot of pointless confusion, and was in fact one of the major reasons for breaking compatibility and introducing Python 3.
As a bonus, in Python 3 you can just use open() with an encoding argument. You don't need any modules to change encodings.
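A minimal Python 3 sketch of the conversion, using only the built-in open(); the file names and the sample text are illustrative:

```python
# Hypothetical demo: create a cp1252 file containing non-ASCII text,
# then re-encode it to UTF-8 using nothing but open().
BLOCKSIZE = 1048576

def convert(src, dst):
    # Decode from cp1252, re-encode as UTF-8, block by block.
    with open(src, 'r', encoding='cp1252') as source_file, \
         open(dst, 'w', encoding='utf8') as target_file:
        while True:
            contents = source_file.read(BLOCKSIZE)
            if not contents:
                break
            target_file.write(contents)

with open('sample_1252.txt', 'w', encoding='cp1252') as f:
    f.write('Wikipédia en français')

convert('sample_1252.txt', 'sample_utf8.txt')

with open('sample_utf8.txt', 'rb') as f:
    print(f.read())  # é is now the two bytes C3 A9 instead of cp1252's E9
```

With genuinely non-ASCII content like this, an encoding detector will no longer report "ascii", because the output bytes actually differ from the input.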

Trouble with utf-8 encoding/decoding

I am reading a .csv which is UTF-8 encoded.
I want to create an index and rewrite the csv.
The index is created as an ongoing number and the first letter of a word.
Python 2.7.10, Ubuntu Server
#!/usr/bin/env python
# -*- coding: utf-8 -*-
counter = 0
tempDict = {}
with open(modifiedFile, "wb") as newFile:
    with open(originalFile, "r") as file:
        for row in file:
            myList = row.split(",")
            toId = str(myList[0])
            if toId not in tempDict:
                tempDict[toId] = counter
                myId = str(toId[0]) + str(counter)
                myList.append(myId)
                counter += 1
            else:
                myId = str(toId[0]) + str(tempDict[toId])
                myList.append(myId)
            # and then I write everything into the csv
            for i, j in enumerate(myList):
                if i < 6:
                    newFile.write(str(j).strip())
                    newFile.write(",")
                else:
                    newFile.write(str(j).strip())
                    newFile.write("\n")
The problem is the following.
When a word starts with a fancy letter, such as
Č
É
Ā
...
The id I create starts with a ?, not with the first letter of the word.
The strange part is that within the csv I create, the words with the fancy letters are written correctly. There are no ? or other symbols that would indicate a wrong encoding.
Why is that?
By all means, you should not be learning Python 2 unless there is a specific legacy C extension that you need.
Python 3 makes major changes to the unicode/bytes handling that removes (most) implicit behavior and makes errors visible. It's still good practice to use open('filename', encoding='utf-8') since the default encoding is environment- and platform-dependent.
Indeed, running your program in Python 3 should fix it without any changes. But here's where your bug lies:
toId = str(myList[0])
This is a no-op, since myList[0] is already a str.
myId = str(toId[0]) + str(counter)
This is the bug: toId is a str (a byte string) containing UTF-8 data, so toId[0] grabs only the first byte. For an accented letter, that is just part of a multi-byte UTF-8 sequence, not a valid character on its own.
with open(originalFile, "r") as file:
This is a style error, since it masks the built-in function file.
There are two changes to make this run under Python 2.
Change open(filename, mode) to io.open(filename, mode, encoding='utf-8').
Stop calling str() on strings, since that actually attempts to encode them (in ASCII!).
But you really should switch to Python 3.
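A sketch of the id-building logic in Python 3 (names simplified, sample words made up): because str is Unicode, word[0] is the whole first character, even for an accented letter:

```python
# Hypothetical simplification of the question's loop, in Python 3.
counter = 0
seen = {}

def make_id(word):
    """First character of the word plus an ongoing number per distinct word."""
    global counter
    if word not in seen:
        seen[word] = counter
        counter += 1
    return word[0] + str(seen[word])

print(make_id('Éclair'))  # É0 -- the real first letter, not a stray byte
print(make_id('Čapek'))   # Č1
print(make_id('Éclair'))  # É0 -- repeated words reuse their number
```

In Python 2, 'Éclair'[0] on a UTF-8 byte string would instead return the lone byte \xc3, which is what surfaces as ? in the output.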
There are a few pieces new to 2.6 and 2.7 that are intended to bridge the gap to 3, and one of them is the io module, which behaves in all the nice new ways: Unicode files and universal newlines.
~$ python2.7 -c 'import io,sys;print(list(io.open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
[u'\xc4\n', u'\xf9\n']
~$ python3 -c 'import sys;print(list(open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
['Ä\n', 'ù\n']
This can be useful to write software for both 2 and 3. Again, the encoding argument is optional but on all platforms the default encoding is environment-dependent, so it's good to be specific.
In Python 2.x, strings are non-Unicode by default: str() returns a byte string. Use unicode() instead.
Besides, you must open the file with UTF-8 encoding through codecs.open() rather than the built-in open().

Writing out text with double double quotes - Python on Linux

I'm trying to capture the text output of a query to an SSD (pulling a log page, similar to pulling SMART data), and then write that text out to a log file I update periodically.
My problem happens when the log data for some drives has double double-quotes as a placeholder for a blank field. Here is a snippet of the input:
VER 0x10200
VID 0x15b7
BoardRev 0x0
BootLoadRev ""
When this gets written out (appended) to my own log file, the text gets replaced with several null characters, and when I then try to open it, all the text editors tell me it's corrupted.
The "" characters are replaced by something like this on my Linux system:
BootLoadRev "\00\00\00\00"
Some fields are even longer with the \00 characters. If the "" is not there, things write out OK.
The code is similar to this:
f=open(fileName, 'w')
test_bench.send_command('get_log_page')
identify_data = test_bench.get_data_in()
f.write(identify_data)
f.close()
Is there a way to send this text to a file w/o these nulls causing problems?
Assuming that this is Python 2 (and that your content is thus what Python 3 would call a bytestring), and that your intended data format is raw ASCII, the trivial solution is simply to remove the NULs from your content before you write to disk:
f.write(identify_data.replace('\0', ''))
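In Python 3, where the drive output would arrive as bytes, the same idea is a bytes-level replace before decoding (the sample data below is made up to mirror the snippet in the question):

```python
# Hypothetical sample of the drive's log output, with embedded NULs.
identify_data = b'BootLoadRev "\x00\x00\x00\x00"\n'

# Strip the NUL bytes, then decode to text for writing to the log file.
cleaned = identify_data.replace(b'\x00', b'').decode('ascii')
print(repr(cleaned))  # 'BootLoadRev ""\n'
```

The NULs are what make editors flag the file as binary/corrupted, so removing them before writing restores a plain-text log.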

"ValueError: embedded null character" when using open()

I am taking Python at my college and I am stuck with my current assignment. We are supposed to take two files and compare them. I am simply trying to open the files so I can use them, but I keep getting the error "ValueError: embedded null character".
file1 = input("Enter the name of the first file: ")
file1_open = open(file1)
file1_content = file1_open.read()
What does this error mean?
It seems that you have a problem with the "\" and "/" characters. If you use them in your input, try changing one to the other...
Default encoding of files for Python 3.5 is 'utf-8'.
Default encoding of files for Windows tends to be something else.
If you intend to open two text files, you may try this:
import locale
locale.getdefaultlocale()
file1 = input("Enter the name of the first file: ")
file1_open = open(file1, encoding=locale.getdefaultlocale()[1])
file1_content = file1_open.read()
There is no automatic encoding detection in the standard library, but you can create your own:
def guess_encoding(csv_file):
    """guess the encoding of the given file"""
    import io
    import locale
    with io.open(csv_file, "rb") as f:
        data = f.read(5)
    if data.startswith(b"\xEF\xBB\xBF"):  # UTF-8 with a "BOM"
        return "utf-8-sig"
    elif data.startswith(b"\xFF\xFE") or data.startswith(b"\xFE\xFF"):
        return "utf-16"
    else:  # in Windows, guessing utf-8 doesn't work, so we have to try
        try:
            with io.open(csv_file, encoding="utf-8") as f:
                f.read(222222)
            return "utf-8"
        except UnicodeDecodeError:
            return locale.getdefaultlocale()[1]
and then
file1 = input("Enter the name of the first file: ")
file1_open = open(file1, encoding=guess_encoding(file1))
file1_content = file1_open.read()
Try prefixing the path with r (raw string format):
r'D:\python_projects\templates\0.html'
On Windows, when specifying the full path of a file, we should use a double backslash as the separator, not a single backslash.
For instance, C:\\FileName.txt instead of C:\FileName.txt
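As a quick sketch of why the single backslashes bite (the path here is made up): sequences like \b and \0 are escape codes, so the string you pass to open() is silently mangled, and the \0 is exactly the "embedded null character" the error reports:

```python
bad  = 'C:\bowtie\0.html'   # \b -> backspace (\x08), \0 -> NUL (\x00)
good = r'C:\bowtie\0.html'  # raw string: the backslashes survive intact

print('\x00' in bad)   # True  -- this is the embedded null character
print('\x00' in good)  # False
print(len(bad), len(good))  # the mangled string is two characters shorter
```

Raw strings, doubled backslashes, and forward slashes all avoid the problem; pick one and use it consistently.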
I got this error when copying a file to a folder that starts with a number. If you write the folder path with the double \ sign before the number, the problem will be solved.
The first backslash of the file path throws the error. You need a raw string (the r prefix), or forward slashes:
FileHandle = open(r'..', encoding='utf8')
FilePath = 'C:/FileName.txt'
FilePath = r'C:\FileName.txt'
The problem is due to bytes data that needs to be decoded.
When you enter a variable at the interpreter prompt, it displays its repr, whereas print() takes the str (which is the same in this scenario) and ignores all unprintable characters such as \x00 and \x01, replacing them with something else.
A solution is to "decode" file1_content (ignore bytes):
file1_content = ''.join(x for x in file1_content if x.isprintable())
I was also getting the same error with the following code:
with zipfile.ZipFile(r"C:\local_files\REPORT.zip", mode='w') as z:
    z.writestr(data)
It was happening because I was passing the bytestring i.e. data in writestr() method without specifying the name of file i.e. Report.zip where it should be saved.
So I changed my code and it worked.
with zipfile.ZipFile(r"C:\local_files\REPORT.zip", mode='w') as z:
    z.writestr('Report.zip', data)
If you are trying to open a file then you should use the path generated by os, like so:
import os
os.path.join("path","to","the","file")

Reading non-text files into Python

I want to read in a non-text file. It has the extension ".map" but can be opened by Notepad. How should I open this file through Python?
file = open("path-to-file","r") doesn't work for me. It returns a No such file or directory error.
Here's what my file looks like:
111 + gi|89106884|ref|AC_000091.1| 725803 TCGAGATCGACCATGTTGCCCGCCT IIIIIIIIIIIIIIIIIIIIIIIII 0 14:A>G
457 + gi|89106884|ref|AC_000091.1| 32629 CCGTGTCCACCGACTACGACACCTC IIIIIIIIIIIIIIIIIIIIIIIII 0 4:C>G,22:T>C
779 + gi|89106884|ref|AC_000091.1| 483582 GATCACCCACGCAAAGATGGGGCGA IIIIIIIIIIIIIIIIIIIIIIIII 0 15:A>G,18:C>G
784 + gi|89106884|ref|AC_000091.1| 226200 ACCGATAGTGAACCAGTACCGTGAG IIIIIIIIIIIIIIIIIIIIIIIII 1
If I do the following:
file = open("D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
It still gives me No such file or directory: 'D:\x08owtie-0.12.7-win32\x08owtie-0.12.7\\output_635\results_NC_000117.fna.1.ebwt.map' error. Is this because the file isn't binary or I don't have some permissions?
Would appreciate help with this!
Binary files should use a binary mode.
f = open("path-to-file","rb")
But that won't help if you don't have the appropriate permissions or don't know the format of the file itself.
EDIT:
Look closely at the error message: the filename it is using is not the one you expected, because the backslash sequences in your path were interpreted as escapes.
f = open("D:\\bowtie-0.12.7-win32\\bowtie-0.12.7\\output_635\\results_NC_000117.fna.1.ebwt.map","rb")
f = open(r"D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
You have hit upon a minor difference between Unix and Windows here.
Since you mentioned Notepad, you must be running this on Windows. In DOS/Windows land, opening a binary file requires specifying mode 'b', as others have already indicated. Unix/Linux is a bit more relaxed about this: omitting 'b' will still open a binary file.
The same behavior is exhibited in the C library's fopen() call.
If its a non-text file you could try opening it using binary format. Try this -
with open("path-to-file", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        # Do stuff with byte.
        byte = f.read(1)
The with statement handles opening and closing the file, including if an exception is raised in the inner block.
Of course since the format is binary you need to know what you are going to do after you read. Also, here I read 1 byte at a time, you can define bigger chunk sizes too.
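Reading in bigger chunks can be sketched with the two-argument form of iter(), which calls the read until it returns the sentinel (the demo file below is created just for illustration):

```python
# Create a ten-byte demo file.
with open('demo.bin', 'wb') as f:
    f.write(bytes(range(10)))

chunks = []
with open('demo.bin', 'rb') as f:
    # iter(callable, sentinel) keeps calling f.read(4) until it returns b''.
    for chunk in iter(lambda: f.read(4), b''):
        chunks.append(chunk)

print(chunks)  # [b'\x00\x01\x02\x03', b'\x04\x05\x06\x07', b'\x08\t']
```

The last chunk is simply shorter; only a read that returns the empty bytes object ends the loop.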
UPDATE: Maybe this is not a binary file. You might be having problems with file encoding, the characters might not be ascii or they might belong to unicode charset. Try this -
import codecs
f = codecs.open(u'path-to-file','r','utf-8')
print f.read()
f.close()
If you print this out in the terminal, you might still get gibberish, since the terminal might not support this charset. I would advise you to go ahead and process the text, assuming it's properly opened.
