Changing Encoding of Text Files Using Python: It's Not Happening - python

Upon copying or pretty much touching files in any way, Windows changes their encoding to its default 1252: Western European. In the text editor I'm using, EditPad Pro Plus, I can see and convert the encoding. I trust that this conversion works, because I've been working with files between Windows and UNIX, and I know that when my text editor changes encodings, the files are read correctly in UNIX where they caused problems before.
I would like to convert files en masse, so I'm attempting to do that using Python in Windows 10, called from either PowerShell (using Python v 3.6.2) or Cygwin (using Python v 2.7.13). I see both codecs and io used for the job, and commentary that io is the proper way for Python 3.
But the files are not converted -- codecs or io. The script below successfully copies the files, but my text editor reports them as 1252 still. And the UniversalDetector (in the commented out portions of the script below) reports their encoding as "ascii".
What needs to happen to get these to convert successfully?
import sys
import os
import io
#from chardet.universaldetector import UniversalDetector
BLOCKSIZE = 1048576
#detector = UniversalDetector()
#def get_encoding( current_file ):
# detector.reset()
# for line in file(current_file):
# detector.feed(line)
# if detector.done: break
# detector.close()
# return detector.result['encoding']
def main():
    src_dir = ""
    if len( sys.argv ) > 1:
        src_dir = sys.argv[1]
    if os.path.exists( src_dir ):
        dest_dir = src_dir[:-2]
        for file in os.listdir( src_dir ):
            with io.open( os.path.join( src_dir, file ), "r", encoding='cp1252') as source_file:
                with io.open( os.path.join( dest_dir, file ), "w", encoding='utf8') as target_file:
                    while True:
                        contents = source_file.read( BLOCKSIZE )
                        if not contents:
                            break
                        target_file.write( contents )
            #print( "Encoding of " + file + ": " + get_encoding( os.path.join( dest_dir, file ) ) )
    else:
        print( 'The specified directory does not exist.' )
if __name__ == "__main__":
    main()
I've tried some variations such as opening the file as UTF8, calling read() without the blocksize, and, originally, the encodings were specified a little differently. They all successfully copy the files, but do not encode them as intended.

ASCII is the common subset of a whole lot of encodings. It is a subset of UTF-8, Latin-1, and cp1252, and of the whole ISO-8859 family, which has encodings for Russian, Greek, etc. If your files are really ASCII, there's nothing to convert, and your system is only saying "cp1252" because the files are compatible with it. You could add a BOM to tag a file as UTF-8 (encoding utf-8-sig), but frankly I don't see the point. UTF-8 doesn't actually need it, because UTF-8 files are recognizable by the structure of their multi-byte sequences.
If you want to experiment with encodings, use text that contains non-ASCII characters: French, Russian, Chinese, or even English with some accented words (or the silly directed quotes that Microsoft applications like to insert). Save the words "Wikipédia en français" in a file and repeat your experiments, and you'll get very different results.
I strongly recommend using Python 3 for this, and for anything else to do with character encodings. The Python 2 approach to encodings results in a lot of pointless confusion, and was in fact one of the major reasons for breaking compatibility and introducing Python 3.
As a bonus, in Python 3 you can just use open() with an encoding argument. You don't need any modules to change encodings.
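A minimal sketch of the points above, assuming Python 3. The helper name convert_file and the sample strings are made up for illustration; the helper re-encodes one file from cp1252 to UTF-8 with nothing but the built-in open(), and the two comparisons show why pure-ASCII files look "unconverted".

```python
import os
import tempfile


def convert_file(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Re-encode a text file (hypothetical helper, for illustration only)."""
    # newline="" passes line endings through untouched in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        # Read in 1 MiB chunks until read() returns the empty string.
        for chunk in iter(lambda: src.read(1 << 20), ""):
            dst.write(chunk)


ascii_text = "plain ASCII text"
accented_text = "Wikipédia en français"

# Pure ASCII text yields identical bytes in both encodings ...
same_for_ascii = ascii_text.encode("cp1252") == ascii_text.encode("utf-8")
# ... while text with accented characters does not.
same_for_accented = accented_text.encode("cp1252") == accented_text.encode("utf-8")

# Round trip through real files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write(accented_text.encode("cp1252"))
    convert_file(src, dst)
    with open(dst, "rb") as f:
        converted_bytes = f.read()
```

With an ASCII-only input the output file is byte-for-byte identical to the input, which is why an editor or detector has no reason to report anything other than its default guess.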

Related

How can I write characters such as § into a file using Python?

This is my code for creating the string to be written ('result' is the variable that holds the final text):
fileobj = open('file_name.yml','a+')
begin = initial+":0 "
n_name = '"§'+tag+name+'§!"'
begin_d = initial+"_desc:0 "
n_desc = '"§3'+desc+'§!"'
title = ' '+begin + n_name
descript = ' '+begin_d + n_desc
result = title+'\n'+descript
print()
fileobj.close()
return result
This is my code for actually writing it into the file:
text = writing(initial, tag, name, desc)
override = inserter(fileobj, country, text)
fileobj.close()
fileobj = open('file_name.yml','w+')
fileobj.write(override)
fileobj.close()
(P.S: Override is a function which works perfectly. It returns a longer string to be written into the file.)
I have tried this with .txt and .yml files but in both cases, instead of §, this is what takes its place: xA7 (I cannot copy the actual text into the internet as it changes into the correct character. It is, however, appearing as xA7 in the file.) Everything else is unaffected, and the code runs fine.
Do let me know if I can improve the question in any way.
You're running into a problem called character encoding. There are two parts to the problem - first is to get the encoding you want in the file, the second is to get the OS to use the same encoding.
The most flexible and common encoding is UTF-8, because it can handle any Unicode character while remaining backwards compatible with the very old 7-bit ASCII character set. Most Unix-like systems like Linux will handle it automatically.
fileobj = open('file_name.yml','w+',encoding='utf-8')
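Note that the PYTHONIOENCODING environment variable only sets the encoding of stdin, stdout, and stderr; it does not change the default used by open(). That default comes from the locale (see locale.getpreferredencoding()), or is UTF-8 when Python's UTF-8 mode is enabled (PYTHONUTF8=1, Python 3.7+), so passing encoding= explicitly is the reliable option.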
Windows operating systems are a little trickier because they'll rarely assume UTF-8, especially if it's a Microsoft program opening the file. There's a magic byte sequence called a BOM that will trigger Microsoft to use UTF-8 if it's at the beginning of a file. Python can add that automatically for you:
fileobj = open('file_name.yml','w+',encoding='utf_8_sig')
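To see what actually lands in the file, it can help to look at the bytes each encoding produces for §. This is a small Python 3 sketch; the idea that the "xA7" in the question comes from a single-byte encoding such as cp1252 is an assumption, since the question does not say which default encoding was in effect.

```python
import codecs

section = "\u00a7"  # the § character

utf8_bytes = section.encode("utf-8")      # two bytes in UTF-8
cp1252_bytes = section.encode("cp1252")   # a single byte, 0xA7, in cp1252
with_bom = section.encode("utf-8-sig")    # UTF-8 preceded by the 3-byte BOM

# The BOM that utf-8-sig prepends is the constant codecs.BOM_UTF8.
bom_matches = with_bom.startswith(codecs.BOM_UTF8)
```

A lone 0xA7 byte is not valid UTF-8, so a viewer expecting UTF-8 may fall back to showing it as an escape such as xA7.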

Pycurl: uploading files with filenames in UTF-8

This question is related to this one.
Please read the problem that Chris describes there. I'll narrow it down: there's a CURL error 26 if a filename is utf-8-encoded and contains characters that are not in the range of those supported by non-unicode programs.
Let me explain myself:
local_filename = filename.encode("utf-8")
self.curl.setopt(self.curl.HTTPPOST, [(field, (self.curl.FORM_FILE, local_filename, self.curl.FORM_FILENAME, local_filename))])
I have Windows 7 with Russian set as the language for non-Unicode programs. If I don't encode filename to UTF-8 (and pass filename, not local_filename, to pycurl), everything goes flawlessly if the filename contains either English or Russian chars. But if there is, say, an à, it throws an error 26. If I pass local_filename (so encoded to UTF-8), even Russian chars aren't allowed.
Could you help, please? Thanks!
This is easy to answer, harder to fix:
pycurl uses libcurl for formposting. libcurl uses plain fopen() to open files for posting. Therefore you need to tell libcurl the exact file name that it should open and read from your local file system.
Decompose this problem into 2 components:
1. Tell pycurl which file to open to read file data.
2. Send the filename in the correct encoding to the server.
These may or may not be same encodings.
For 1, use sys.getfilesystemencoding() to convert the unicode filename (which you correctly use throughout your Python code) to a byte string that pycurl/libcurl can open with fopen(). Use strace (Linux) or the equivalent on Windows or OS X to verify that pycurl is opening the correct file path.
If that totally fails you can always feed file data stream from Python via pycurl.READFUNCTION.
For 2, learn how the filename is transmitted during file upload. I don't have a good link; all I know is that it's not trivial, e.g. when it comes to very long file names.
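For component 1, a rough sketch of the encoding step on its own, without pycurl. The helper name encode_for_fopen is hypothetical, and using cp1251 to stand in for "Russian as the language for non-Unicode programs" is an assumption about the asker's machine. The helper returns None when the name cannot be represented in the target encoding, which is the case where falling back to pycurl.READFUNCTION would be needed.

```python
import sys


def encode_for_fopen(filename, encoding=None):
    """Encode a unicode filename for a byte-oriented API such as fopen().

    Hypothetical helper: returns the encoded bytes, or None if the name
    contains characters the encoding cannot represent.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    try:
        return filename.encode(encoding)
    except UnicodeEncodeError:
        return None


# Cyrillic letters exist in cp1251, so this name can be encoded ...
russian_ok = encode_for_fopen("\u0444\u0430\u0439\u043b.mp3", "cp1251")
# ... but "à" (U+00E0) does not, so this one cannot.
accent_fails = encode_for_fopen("\u00e0.mp3", "cp1251")
```

This matches the symptom described in the question: names limited to English or Russian characters work, while a name containing à has no byte representation in that code page.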
I hacked up your code snippet, I have this, it works against nc -vl 5050 at least.
#!/usr/bin/python
import pycurl
c = pycurl.Curl()
filename = u"example-\N{EURO SIGN}.mp3"
with open(filename, "wb") as f:
f.write("\0\xfffoobar\x07\xff" * 9)
local_filename = filename.encode("utf-8")
c.setopt(pycurl.HTTPPOST, [("xxx", (pycurl.FORM_FILE, local_filename, pycurl.FORM_FILENAME, local_filename))])
c.setopt(pycurl.URL, "http://localhost:5050/")
c.setopt(pycurl.HTTPHEADER, ["Expect:"])
c.perform()
My test doesn't cover the case where encoding is different between OS and HTTP.
Should be enough to get you started though, shouldn't it?

Encoding file names to base64 on OS X not correct when using Japanese characters

I have a bunch of files named after people's names (e.g. "john.txt", "mary.txt") but among them are also japanese names (e.g. "fūka.txt", "tetsurō.txt").
What I'm trying to do is to convert names before ".txt" to Base64.
Only problem is that when I take a file name (without extension) and use a web based converter I get a different result than encoding with a help of my Python script.
So... For example when I copy file name part without extension and encode "fūka" in http://www.base64encode.org I get "ZsWra2E=". Same result I get when I take person's name from UTF-8 encoded PostgreSQL database, make it lower case and base64 encode it.
But when I use Python script below I get "ZnXMhGth"
import glob, os
import base64
def rename(dir, pattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):
        title, ext = os.path.splitext(os.path.basename(pathAndFilename))
        t = title.lower().encode("utf-8")
        encoded_string = base64.b64encode(t) + ext
        p = os.path.join(dir, encoded_string)
        os.rename(pathAndFilename, p)
rename(u'./test', u'*.txt')
I get the same results in OS X 10.8 and Linux (files uploaded from Mac to Linux server). Python is 2.7. And I tried also PHP script (the result was same as for Python script).
And similar difference happens when I use names with other characters (e.g. "tetsurō").
One more strange thing ... when I output filename part with a Python script in OS X's Terminal application and then copy this text as a filename ... and THEN encode file name to base64, I get the same result as on a webpage I mentioned above. Terminal has UTF-8 encoding.
Could somebody please explain what I am doing (or thinking) wrong? Is there some little character substitution going on somewhere in between? How can I make the Python script get the same result as the above-mentioned web page? Any hints will be greatly appreciated.
SOLUTION:
With the help of Mark's answer I modified the script and it worked like a charm! Thanks Mark!
import glob, os
import base64
from unicodedata import normalize
def rename(dir, pattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):
        title, ext = os.path.splitext(os.path.basename(pathAndFilename))
        t = normalize('NFC', title.lower()).encode("utf-8") # <-- NORMALIZE !!!
        encoded_string = base64.b64encode(t) + ext
        p = os.path.join(dir, encoded_string)
        os.rename(pathAndFilename, p)
rename(u'./test', u'*.txt')
It appears that the Python script is getting the filename in a decomposed normalization form (NFD), where the ū has been split into two code points: u and a combining macron. The other form (NFC) uses the single character LATIN SMALL LETTER U WITH MACRON. As far as Unicode is concerned, they're canonically equivalent strings even though they don't have the same binary representation.
You might get some more information from this Unicode FAQ: http://www.unicode.org/faq/normalization.html
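The two Base64 strings from the question can be reproduced directly from the two normalization forms. This is a Python 3 sketch (the question used Python 2.7); the variable names are illustrative.

```python
import base64
import unicodedata

# Start from the decomposed spelling: "u" followed by COMBINING MACRON.
name_nfd = "fu\u0304ka"
# NFC composes the pair into the single code point U+016B.
name_nfc = unicodedata.normalize("NFC", name_nfd)

b64_nfd = base64.b64encode(name_nfd.encode("utf-8")).decode("ascii")
b64_nfc = base64.b64encode(name_nfc.encode("utf-8")).decode("ascii")

# Both strings render as "fūka", yet they differ in length and in bytes.
lengths = (len(name_nfd), len(name_nfc))
```

Here b64_nfc is the value the web converter produced and b64_nfd is the value the original script produced, so normalizing to NFC before encoding is what makes the two agree.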

How do I allow opening of files that have Unicode characters in their filenames?

I have this Python script here that opens a random video file in a directory when run:
import glob,random,os
files = glob.glob("*.mkv")
files.extend(glob.glob("*.mp4"))
files.extend(glob.glob("*.tp"))
files.extend(glob.glob("*.avi"))
files.extend(glob.glob("*.ts"))
files.extend(glob.glob("*.flv"))
files.extend(glob.glob("*.mov"))
file = random.choice(files)
print "Opening file %s..." % file
cmd = "rundll32 url.dll,FileProtocolHandler \"" + file + "\""
os.system(cmd)
Source: An answer in my Super User post, 'How do I open a random file in a folder, and set that only files with the specified filename extension(s) should be opened?'
This is called by a BAT file, with this as its script:
C:\Python27\python.exe "C:\Programs\Scripts\open-random-video.py" cd
I put this BAT file in the directory I want to open random videos of.
In most cases it works fine. However, I can't make it open files with Unicode characters (like Japanese or Korean characters in my case) in their filenames.
This is the error message when the BAT file and Python script is run on a directory and opens a file with Unicode characters in its filename:
C:\TestDir>openrandomvideo.BAT
C:\TestDir>C:\Python27\python.exe "C:\Programs\Scripts\open-random-video.py" cd
The filename, directory name, or volume label syntax is incorrect.
Note that the filename of the .FLV video file in that log is changed from its original filename (소시.flv) to '∩╗┐' in the command line log.
EDIT: I learned that the above command line error message is due to saving the BAT file as 'UTF-8 with BOM'. Saving it as 'ANSI or UTF-16' shows the following message instead, but still does not open the file:
C:\TestDir>openrandomvideo.BAT
C:\TestDir>C:\Python27\python.exe "C:\Programs\Scripts\open-random-video.py" cd
Opening file ??.flv...
Now, the filename of the .FLV video file in that log is changed from its original filename (소시.flv) to '??.flv.' in the command line log.
I'm using Python 2.7 on Windows 7, 64-bit.
How do I allow opening of files that have Unicode characters in their filenames?
Just use Unicode literals e.g., u".mp4" everywhere. IO functions in Python will return Unicode filenames back if you give them Unicode input (internally they might use Unicode-aware Windows API):
import os
import random
videodir = u"." # get videos from current directory
extensions = tuple(u".mkv .mp4 .tp .avi .ts .flv .mov".split())
files = [file for file in os.listdir(videodir) if file.endswith(extensions)]
if files: # at least one video file exists
    random_file = random.choice(files)
    os.startfile(os.path.join(videodir, random_file)) # start the video
else:
    print('No %s files found in "%s"' % ("|".join(extensions), videodir,))
If you want to emulate how your web browser would open video files then you could use webbrowser.open() instead of os.startfile() though the former might use the latter internally on Windows anyway.
The error when running the BAT file is because the BAT file itself is saved as "UTF-8 with BOM". The "∩╗┐" characters are not a corrupted filename; they are the three BOM bytes (EF BB BF) stored at the very start of the BAT file, as displayed by the console. Re-save the BAT file as ANSI or UTF-16, which are the only encodings supported for BAT files.
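The '∩╗┐' seen in the command line log can be reproduced by decoding the UTF-8 BOM with an OEM console code page. The sketch below assumes code page 437, which is a guess about the asker's console; other OEM code pages would show different characters for the same three bytes.

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Interpreted as cp437 (assumed console code page), each byte becomes
# one visible character.
bom_as_seen_in_console = bom_bytes.decode("cp437")
```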
Either use Unicode literals as described by J. F. Sebastian, or use Python 3, which always uses Unicode.
(For Python 3, your script will need a minor modification: print is a function now, so you have to put parentheses around the parameter list.)
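For comparison, a rough Python 3 sketch of the same idea. In Python 3, os.listdir() returns str names when given a str path, so no u"" prefixes are needed. The function name pick_random_video is made up for this example, and the demo runs in a temporary directory rather than the asker's video folder; launching the file with os.startfile() (Windows only) is left as a comment.

```python
import os
import random
import tempfile

VIDEO_EXTENSIONS = (".mkv", ".mp4", ".tp", ".avi", ".ts", ".flv", ".mov")


def pick_random_video(directory="."):
    """Return the path of a random video file in directory, or None."""
    videos = [name for name in os.listdir(directory)
              if name.lower().endswith(VIDEO_EXTENSIONS)]
    if not videos:
        return None
    return os.path.join(directory, random.choice(videos))


# Demo: one video with a Korean name and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "소시.flv"), "wb").close()
    open(os.path.join(tmp, "notes.txt"), "wb").close()
    picked = pick_random_video(tmp)
    picked_name = os.path.basename(picked)
    # On Windows the file could now be opened with: os.startfile(picked)
```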
Get in the habit of adding # -*- coding: utf-8 -*- at the top of your source code, so that Python knows how any non-ASCII characters in it are encoded.

Reading non-text files into Python

I want to read in a non text file. It has an extension ".map" but can be opened by notepad. How should I open this file through python?
file = open("path-to-file","r") doesn't work for me. It returns No such file or directory: error.
Here's what my file looks like:
111 + gi|89106884|ref|AC_000091.1| 725803 TCGAGATCGACCATGTTGCCCGCCT IIIIIIIIIIIIIIIIIIIIIIIII 0 14:A>G
457 + gi|89106884|ref|AC_000091.1| 32629 CCGTGTCCACCGACTACGACACCTC IIIIIIIIIIIIIIIIIIIIIIIII 0 4:C>G,22:T>C
779 + gi|89106884|ref|AC_000091.1| 483582 GATCACCCACGCAAAGATGGGGCGA IIIIIIIIIIIIIIIIIIIIIIIII 0 15:A>G,18:C>G
784 + gi|89106884|ref|AC_000091.1| 226200 ACCGATAGTGAACCAGTACCGTGAG IIIIIIIIIIIIIIIIIIIIIIIII 1
If I do the following:
file = open("D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
It still gives me No such file or directory: 'D:\x08owtie-0.12.7-win32\x08owtie-0.12.7\\output_635\results_NC_000117.fna.1.ebwt.map' error. Is this because the file isn't binary or I don't have some permissions?
Would appreciate help with this!
Binary files should use a binary mode.
f = open("path-to-file","rb")
But that won't help if you don't have the appropriate permissions or don't know the format of the file itself.
EDIT:
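Look closely at the error message and you will notice that the filename it is using is not the one you expected. In a normal string literal "\b" is the escape for a backspace character, which is why it shows up as \x08. Either double the backslashes or use a raw string: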
f = open("D:\\bowtie-0.12.7-win32\\bowtie-0.12.7\\output_635\\results_NC_000117.fna.1.ebwt.map","rb")
f = open(r"D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
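The difference between the three spellings is easy to check on a shortened path. This sketch only uses the "D:\bowtie" prefix from the question so that it stays self-contained; the variable names are illustrative.

```python
plain = "D:\bowtie"      # "\b" is parsed as one backspace character (0x08)
doubled = "D:\\bowtie"   # an escaped backslash followed by "b"
raw = r"D:\bowtie"       # raw string: the backslash is kept literally

# The plain literal has lost its backslash and is one character shorter.
plain_has_backslash = "\\" in plain
length_difference = len(raw) - len(plain)
```

The doubled and raw forms are the same string; only the plain literal differs, and it is the one that produces the \x08 in the error message.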
You have hit upon a minor difference between Unix and Windows here.
Since you mentioned Notepad, you must be running this on Windows. In DOS/Windows land, opening a binary file requires specifying attribute 'b' for binary, as others have already indicated. Unix/Linux are a bit more relaxed about this: in Python 2, omitting attribute 'b' will still open a binary file correctly. (In Python 3, text mode always decodes the bytes to str, so use 'b' for binary data on every platform.)
The same behavior is exhibited in the C library's fopen() call.
If it's a non-text file you could try opening it in binary mode. Try this -
with open("path-to-file", "rb") as f:
    byte = f.read(1)
    while byte: # an empty result means end of file
        # Do stuff with byte.
        byte = f.read(1)
The with statement handles opening and closing the file, including if an exception is raised in the inner block.
Of course since the format is binary you need to know what you are going to do after you read. Also, here I read 1 byte at a time, you can define bigger chunk sizes too.
UPDATE: Maybe this is not a binary file. You might be having problems with the file encoding: the characters might not be ASCII, or the file might use one of the Unicode encodings. Try this -
import codecs
f = codecs.open(u'path-to-file','r','utf-8')
print f.read()
f.close()
If you print this out in the terminal, you might still get gibberish, since the terminal might not support this charset. I would advise you to go ahead and process the text, assuming it's been opened properly.