Python keeps showing cp1250 character encoding in files

I have an exercise to write a script that converts UTF-16 files to UTF-8, so I wanted one example file with UTF-16 encoding. The problem is that the encoding Python reports for every file is 'cp1250', no matter the format (.csv or .txt). What am I missing here? I also have example files from the Internet, but Python recognizes them as cp1250 too. Even when I save a file as UTF-8, Python still reports cp1250.
This is the code I use:
with open('FILE') as f:
    print(f.encoding)

The file object returned by open simply uses your system's default encoding. To open it in something else, you have to say so explicitly.
To actually convert a file, try something like
with open('input', encoding='cp1252') as infile, \
        open('output', 'w', encoding='utf-16le') as outfile:
    for line in infile:
        outfile.write(line)
Converting a legacy 8-bit file to Unicode isn't really useful because it only exercises a small subset of the character set. See if you can find a good "hello world" sample file. https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html is one for UTF-8.
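For the exercise itself, the conversion simply runs in the other direction. A minimal sketch, assuming the input really is UTF-16 with a byte order mark (the filenames are placeholders):

# the 'utf-16' codec reads the BOM and picks little- or big-endian automatically
with open('sample_utf16.txt', encoding='utf-16') as infile, \
        open('sample_utf8.txt', 'w', encoding='utf-8') as outfile:
    for line in infile:
        outfile.write(line)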

Related

Wrong encoding when reading file in Python 3?

When I read a file in Python and print it to the screen, certain characters are not read properly; however, the same characters hard-coded into a variable print just fine. Here is an example where "test.html" contains the text "Hallå":
with open('test.html', 'r') as file:
    Str = file.read()
print(Str)
Str = "Hallå"
print(Str)
This generates the following output:
HallÃ¥
Hallå
My guess is that something is wrong with how the data in the file is interpreted when it is read into Python, but I am uncertain what, since Python 3.8.5 supposedly already uses UTF-8 encoding by default.
The open function does not use UTF-8 by default. As the documentation says:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
So, it depends, and to be certain, you have to specify the encoding yourself. If the file is saved in UTF-8, you should do this:
with open('test.html', 'r', encoding='utf-8') as file:
On the other hand, it is not clear whether the file is or is not saved in UTF-8 encoding. If it is not, you'll have to choose a different one.
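To see which default your platform actually picks, a quick check (a minimal sketch) is:

import locale

# this is the encoding open() falls back to when none is given
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows setups

with open('test.html', 'r', encoding='utf-8') as file:
    print(file.read())  # decodes correctly only if the file really is UTF-8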

Converting a Massive Unicode Text File to ASCII

I've received several text files, where each file contains thousands of lines of text. Because the files use Unicode encoding, each file ends up being around 1 GB. I know this might sound borderline ridiculous, but unfortunately it is the reality.
I'm using Python 2.7 on a Windows 7 machine. I've only started using Python but figured this would be a good chance to really start using the language. You've gotta use it to learn it, right?
What I'm hoping to do is make a copy of each of these massive files. The new copies would use ASCII character encoding and would ideally be significantly smaller. I know that changing the character encoding works, because I've had success opening a file in MS WordPad and saving it as a regular text file.
Using WordPad is a manual and slow process: I need to open the file, which takes forever because it's so big, and then save it as a new file, which also takes forever since it's so big. I'd really like to automate this by having a script run in the background while I work on other things. I've written a bit of Python to do this, but it's not working correctly. What I've done so far is the following:
import io
import os

def convertToAscii():
    # Getting a list of the current files in the directory
    cwd = os.getcwd()
    current_files = os.listdir(cwd)
    # I don't want to mess with all of the files, so I'll just pick the second one,
    # since the first file is the script itself
    test_file = current_files[1]
    # Determining a new name for the ASCII-encoded file
    file_name_length = len(test_file)
    ascii_file_name = test_file[:file_name_length - 3 - 1] + "_ASCII" + test_file[file_name_length - 3 - 1:]
    # Then we open the new blank file
    the_file = open(ascii_file_name, 'w')
    # Finally, we open our original file for testing...
    with io.open(test_file, encoding='utf8') as f:
        # ...read it line by line
        for line in f:
            # ...encode each line into ASCII
            line.encode("ascii")
            # ...and then write the ASCII line to the new file
            the_file.write(line)
    # Finally, we close the new file
    the_file.close()

convertToAscii()
And I end up with the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
But that doesn't make any sense.... The first line within all of the text files is either a blank line or a series of equal signs, such as ===========.
I was wondering if someone would be able to put me onto the right path for this. I understand that doing this operation can take a very long time since I'm essentially reading each file line by line and then encoding the string into ASCII. What must I do in order to get around my current issue? And is there a more efficient way to do this?
For characters that exist in ASCII, UTF-8 already encodes them as single bytes. Opening a UTF-8 file containing only single-byte characters and then saving it as ASCII should be a no-op.
For there to be any size difference, your files must be in some wider encoding of Unicode, like UTF-16 / UCS-2. That would also explain the utf8 codec complaining about unexpected bytes in the source file.
Find out what encoding your files actually are, then save using the utf8 codec. That way your files will be just as small as ASCII for single-byte characters, but if your source files happen to contain any multibyte characters, the result will still be able to encode them and you won't be doing a lossy conversion.
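Since the failing byte was 0xff at position 0, the file almost certainly starts with a UTF-16 little-endian byte order mark. A minimal sketch of sniffing the BOM (guess_encoding is a hypothetical helper, not part of any library):

import codecs

def guess_encoding(path):
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'    # this codec consumes the BOM itself
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    return 'utf-8'         # fall back to plain UTF-8

# io.open(test_file, encoding=guess_encoding(test_file)) then decodes
# the file correctly before it is re-encoded as UTF-8.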
There's a potential speedup if you avoid splitting the file into lines, since the only thing that you're doing is joining the lines back together. This allows you to process the input in larger blocks.
Using the shutil.copyfileobj function (which is just read and write in a loop):
import shutil
with open('input.txt', encoding='u16') as infile, \
        open('output.txt', 'w', encoding='u8') as outfile:
    shutil.copyfileobj(infile, outfile)
(This is Python 3, where the encoding argument is passed directly to the built-in open; on Python 2.7 the library function io.open behaves the same way.)

Python 3.5.1 mixed line code file UTF-8 and UTF-16

I have been successfully parsing the data files I receive with a simple Python script I wrote. The files I get are like this:
file.txt, ~50 columns of data, x 1000s of rows
abcd1,1234a,efgh1,5678a,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
Unfortunately, some of the lines sometimes contain UTF-16 symbols, and look like this:
abcd1,12341,efgh1,UTF-16 symbols here,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
I have been able to use the "latin-1" encoding for commands in my script, like:
open('file fixed.txt', 'w', encoding="latin-1").writelines([line for line in open('file.txt', 'r', encoding="latin-1")])
My problem lies in code such as:
for line in fileinput.FileInput('file fixed.txt', inplace=1):
    line = line.replace(":", ",")
    print(line, ",")
I am unable to get past the encoding errors for the last command. I have tried enforcing the coding of:
# -*- coding: latin-1 -*-
At the top of the script, as well as before the last mentioned command (the find and replace). How can I get mixed-encoding files to work with the above command? I would like to preserve the UTF-16 (Unicode) symbols as they appear in the new file. Thanks in advance.
EDIT: Thanks to Alexis I was able to determine that fileinput would not work for setting another encoding method. I used the code below to resolve my issue.
f = open(filein,'r', encoding="latin-1")
filedata = f.read()
f.close()
newdata = filedata.replace("old data","new data")
f = open(fileout,'w', encoding="latin-1")
f.write(newdata)
f.close()
You can tell fileinput how to open your files. As the documentation says:
You can control how files are opened by providing an opening hook via the openhook parameter to fileinput.input() or FileInput(). The hook must be a function that takes two arguments, filename and mode, and returns an accordingly opened file-like object. Two useful hooks are already provided by this module.
So you'd do it like this:
def open_utf16(name, m):
    return open(name, m, encoding="utf-16")

for line in fileinput.FileInput("file fixed.txt", openhook=open_utf16):
    ...
I use "utf-16" as the encoding since this is your file's encoding, not "latin-1". 8-bit encodings don't have error checking so Latin1 will read the bytes without noticing there's anything wrong, but you're likely to have problems down the line. If this gives you errors, your file is not in utf-16.
If your file has mixed encoding, you need to read it as binary and then decode different parts as necessary, or just process the whole thing as binary instead. The latin-1 solution in the question works by accident really.
In your example that would be something like:
with open('the/path', 'rb') as fi:
    data = fi.read().replace(b'old data', b'new data')
with open('other/path', 'wb') as fo:
    fo.write(data)
This is the closest to what you ask for - as far as I understand you don't even care about that field with potentially different encoding - you just want to change some content and copy the rest of the file as is. Binary mode allows you to do that.

Python 2.7: Setting I/O Encoding, ’?

I'm attempting to write a line to a text file in Python 2.7, and have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap')  # Note the strangely shaped apostrophe
However, in output.txt, I get Smith’s BaseBall Cap, instead. Not sure how to correct this encoding problem? Any protips with this sort of issue?
You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
Interpreted as Windows-1252, for example, those three bytes display as â€™, which is the kind of garbage you're seeing.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string: u'Smith’s BaseBall Cap'.
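Put together, a minimal sketch of that suggestion (Python 2.x; the filename is a placeholder):

# -*- coding: utf-8 -*-
import codecs

# codecs.open returns a wrapper that encodes unicode strings on write,
# so the u'' literal reaches the file as proper UTF-8 bytes
f = codecs.open('output.txt', 'w', 'utf-8')
f.write(u'Smith’s BaseBall Cap')
f.close()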

Python open("x", "r") function, how do I know or control which encoding the file is supposed to have?

If a Python script uses the open("filename", "r") function to open, and subsequently read, the contents of a text file, how can I tell which encoding this file is supposed to have?
Note that since I'm executing this script from my own program, if there is any way to control this through environment variables, then that is good enough for me.
This is Python 2.7 by the way.
The code in question comes from Mercurial, it can be given a list of files to, say, add to the repository, through a file on disk, instead of passing them on the command line.
So basically, instead of this:
hg add A B C
I can write out A, B and C to a file, with newlines between each, and then execute the following:
hg add listfile:input.txt
The code that ends up reading this file is this:
files = open(name, 'r').read().split(delimiter)
Hence my question. The answer I was given on IRC when I asked which encoding I should use was this:
it is the same encoding as the one you use on the command line when passing a file argument
I take this to mean it is the same encoding I "use" when I execute Mercurial (hg). Since I have no idea which encoding that is (I just hand everything to the .NET Process object), I am asking here.
You can't. Reading a file is independent of its encoding; you'll need to know the encoding in advance in order to properly interpret the bytes you read in.
For example, if you know the file is encoded in UTF-8:
with open('filename', 'rb') as f:
    contents = f.read().decode('utf-8-sig')  # -sig deals with BOM, if present
Or if you know the file is ASCII only:
with open('filename', 'r') as f:
    contents = f.read()  # results in a str object
If you really don't know the encoding of the file, then there's obviously no guarantee that you can read it properly; however, you can guess at the encoding using a tool like chardet.
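For example, with the third-party chardet package (the detection is a statistical guess, not a guarantee):

import chardet

with open('filename', 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'UTF-16', 'confidence': 1.0}
contents = raw.decode(guess['encoding'])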
UPDATE:
I think I understand your question now. I thought you had a file you needed to write code for, but it seems you have code you need to write a file for ;-)
The code in question probably only deals properly with plain ASCII (it's possible the strings are converted later, but unlikely I think). So you'll want to make a text file that contains only ASCII (codepoint < 128) characters, and make sure it is saved in an ASCII encoding (i.e. not UTF-16 or anything like that). This is a little unfortunate considering that Mercurial deals with filenames, which can contain Unicode characters.
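As a minimal sketch of that (Python 2.7; the filenames are assumed to be plain ASCII):

# Plain str literals are byte strings in Python 2, so ASCII-only
# names written this way produce an ASCII-encoded file
names = ['A', 'B', 'C']
with open('input.txt', 'w') as f:
    f.write('\n'.join(names) + '\n')
# then run: hg add listfile:input.txt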
