I am using Python to write text to an .sps file (an SPSS syntax file).
begin program.
outfile=open("c:/temp/syntax.sps","w+")
outfile.write("some text…")
outfile.close()
end program.
The last character in the text is:
>>> my_text="some text…"
>>> my_text[-1]
'\x85'
If I open the resulting file in Notepad++, I see the text correctly. However, if I open the file in the SPSS syntax editor, I see this:
some text…
Is there a quick way around this using only the standard modules of Python 2.7? I would rather not convert each Unicode character to its equivalent in some other encoding by hand, if possible.
I know that when you Save As a syntax file in SPSS there is an Encoding option (Unicode (UTF-8) vs. Local Encoding).
Not sure what the resolution is here, but try adding this to your Python-generated syntax file, on the very first line:
* Encoding: UTF-8.
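For example, a sketch of where that line would go in the generating code (this assumes the text to be written is available as a unicode object, which is my assumption rather than the asker's original setup):
outfile = open("c:/temp/syntax.sps", "w+")
outfile.write("* Encoding: UTF-8.\n")       # first line: tells SPSS the file is UTF-8
outfile.write(my_text.encode("utf-8"))      # my_text assumed to be a unicode object here
outfile.close()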
In the end, this worked, with the help of the codecs module
begin program.
import codecs
outfile = codecs.open("c:/temp/syntax.sps", "w+", "utf-8-sig")
outfile.write(u"some text…")   # write a unicode object; the codec handles the encoding
outfile.close()
end program.
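For reference, the utf-8-sig codec also writes a UTF-8 BOM at the start of the file, which is presumably what lets SPSS pick the file up as Unicode. On Python 2.6+ the io module offers an equivalent; a sketch under that assumption:
begin program.
import io
outfile = io.open("c:/temp/syntax.sps", "w", encoding="utf-8-sig")
outfile.write(u"some text…")   # io.open requires unicode text
outfile.close()
end program.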
Related
I have an exercise to write a script that converts UTF-16 files to UTF-8, so I wanted to have one example file encoded in UTF-16. The problem is that the encoding Python shows me for every file is 'cp1250' (no matter whether the format is .csv or .txt). What am I missing here? I also have example files from the Internet, but Python recognizes them as cp1250 too. Even when I save a file as UTF-8, Python reports cp1250.
This is the code I use:
with open('FILE') as f:
    print(f.encoding)
The file object returned by open simply uses your system's default encoding. To open the file in some other encoding, you have to say so explicitly.
To actually convert a file, try something like
with open('input', encoding='cp1252') as input, open('output', 'w', encoding='utf-16le') as output:
    for line in input:
        output.write(line)
Converting a legacy 8-bit file to Unicode isn't really useful because it only exercises a small subset of the character set. See if you can find a good "hello world" sample file. https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html is one for UTF-8.
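The asker's exercise goes in the other direction (UTF-16 in, UTF-8 out); a minimal sketch of that, with made-up file names and assuming the input carries a BOM:
with open('input.csv', encoding='utf-16') as src, open('output.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)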
I'm trying to write results from a web scraping to a html file. I'm using Beautiful Soup to scrape links and text from web pages. Then when I'm creating the file and writing to it, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 939-940: ordinal not in range(128)
The line writing to file looks like this:
file_object.write(file_content)
And when I instead do this:
file_object.write(file_content.encode('utf-8'))
I don't get an error, but it can't print special characters, like å or ä.
I realize this is some kind of encoding error, but I can't understand how to get around it. The project in its entirety is located here, line 81, since I had trouble extracting runnable and logical sub parts.
I'm using a Mac, but had similar problem running the same script on a pc. Using python 2.7
Yes, use open() from the codecs module, or, in Python 3, the normal built-in open(), like this:
f = open(path, "wt", encoding="UTF-8")
But, if you don't want to change your code much, you do not need anything special.
The trick is to add the correct BOM (byte order mark) at the beginning of your file, so that the editor that opens it knows it is a UTF-8 file and treats it as such.
The change you should make:
file_object.write('\xef\xbb\xbf'+file_content.encode('utf-8'))
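If you would rather not hand-write the BOM bytes, the utf-8-sig codec emits them automatically on the first write; a sketch of that variant for Python 2.7 (the file name is made up):
import codecs
with codecs.open('output.html', 'w', 'utf-8-sig') as file_object:
    file_object.write(file_content)   # file_content stays a unicode object; no manual .encode() needed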
I am running Win7 x64 and I have Python 2.7.5 x64 installed. I am using Wing IDE 101 4.1.
For some reason, encoding is messed up.
special_str = "sauté"
print special_str
# saut├⌐
special_str
# 'saut\xc3\xa9'
I don't understand why when I try to print it, it comes out weird. When I write it to a notepad text file, it comes out as right ("sauté"). Problem with this is that when I use BeautifulSoup on the string, it comes out containing that weird string "saut├⌐" and then when I output it back into a csv file, I end up with a html chunk containing that weird bit. Help!
You need to declare the encoding of the source file so Python can properly decode your string literals.
You can do this with a special comment at the top of the file (first or second line).
# coding:<coding>
where <coding> is the encoding used when saving the file, for example utf-8.
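For example, a minimal sketch of what the top of such a script might look like, assuming the file is saved as UTF-8:
# -*- coding: utf-8 -*-
special_str = u"sauté"   # a unicode literal, decoded using the declared source encoding
print special_str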
I'm attempting to write a line to a text file in Python 2.7, and I have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap')  # note the strangely shaped (curly) apostrophe
However, in output.txt, I get Smith’s BaseBall Cap, instead. Not sure how to correct this encoding problem? Any protips with this sort of issue?
You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
Interpreted as Windows-1252, those three bytes display as ’.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
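To see the round trip concretely, a short interpreter sketch (Python 2.7):
>>> u'\u2019'.encode('utf-8')               # RIGHT SINGLE QUOTATION MARK as UTF-8
'\xe2\x80\x99'
>>> '\xe2\x80\x99'.decode('windows-1252')   # the same bytes misread in a legacy encoding
u'\xe2\u20ac\u2122'
>>> print '\xe2\x80\x99'.decode('windows-1252')
’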
There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string with u'Smith’s BaseBall Cap'.
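Putting those two suggestions together, a hedged sketch of how the asker's snippet might look:
# -*- coding: utf-8 -*-
import codecs
import os

f = codecs.open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w', 'utf-8')
f.write(u'Smith’s BaseBall Cap')   # unicode literal; codecs handles the UTF-8 encoding on write
f.close()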
A few weeks ago I wrote a CSV parser in Python and it was working great with the provided text file. But when we tried to test it with other files, the problems started.
The first was
ValueError: empty string for float()
for a string like "313.44". The problem was that there were null bytes ('\x00') between the digits.
OK, so I switched to reading it as Unicode with
codecs.open(filename, 'r', 'utf-16')
And then all hell broke loose: missing BOM, problems with the line-ending characters (LF vs. CR+LF), etc.
So can you give me a hint or a workaround for parsing Unicode and non-Unicode files when I do not know what the encoding is, whether a BOM is present, what the line endings are, etc.?
P.S. I am using Python 2.7
The problem was solved using the csv module, as proposed by Daenyth.
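A rough sketch of what that approach might look like on Python 2.7, assuming the input is UTF-16 with a BOM (the helper name and paths are mine):
import codecs
import csv

def read_utf16_csv(path):
    # The utf-16 codec consumes the BOM and handles either byte order;
    # re-encode to UTF-8 because Python 2.7's csv module is byte-oriented.
    with codecs.open(path, 'r', 'utf-16') as f:
        data = f.read().replace('\r\n', '\n').encode('utf-8')
    for row in csv.reader(data.splitlines()):
        yield [cell.decode('utf-8') for cell in row]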
It mainly depends on the Python version you are using, but these two links should help you out:
http://docs.python.org/howto/unicode.html
Character reading from file in Python