Python and parsing unicode files - python

A few weeks ago I wrote a CSV parser in Python, and it worked great with the provided text file. But when we tried to test it with other files, the problems started.
The first was
ValueError: empty string for float()
for a string like "313.44". The problem was that the file was in Unicode, so there were empty bytes ('\x00') between the digits.
OK, I decided to read it as Unicode with
codecs.open(filename, 'r', 'utf-16')
And then all hell broke loose: a missing BOM, problems with the line-ending characters (LF vs. CR+LF), and so on.
So can you give me a hint or a workaround for parsing Unicode and non-Unicode files when I do not know what the encoding is, whether a BOM is present, what the line endings are, etc.?
P.S. I am using Python 2.7

The problem was solved using the csv module as proposed by Daenyth

It mainly depends on the Python version you are using, but these two links should help you out:
http://docs.python.org/howto/unicode.html
Character reading from file in Python
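One way to approach the "unknown encoding, maybe a BOM, unknown line endings" problem is to read the raw bytes first and look for a BOM before decoding. The following is only a minimal sketch, not the method used in the accepted solution: the names guess_encoding and read_lines are hypothetical, the NUL-byte check is a rough heuristic that only suits mostly-ASCII UTF-16 data, and the UTF-8 fallback is an assumption.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw, fallback="utf-8"):
    """Guess a codec name from the first bytes of a file (hypothetical helper)."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM themselves while decoding.
            return name
    # Heuristic (assumption): NUL bytes near the start of BOM-less text
    # suggest UTF-16; the position of the first NUL hints at the byte order.
    if b"\x00" in raw[:64]:
        return "utf-16-le" if raw[1:2] == b"\x00" else "utf-16-be"
    return fallback


def read_lines(raw):
    """Decode raw bytes and split on LF, CR+LF, or CR alike."""
    text = raw.decode(guess_encoding(raw))
    return text.splitlines()
```

For example, read_lines(open(filename, 'rb').read()) would give a list of Unicode lines that can then be split on commas or passed on to further parsing. A guess like this can still be wrong for legacy 8-bit encodings, which carry no marker at all.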

Related

Python keeps showing cp1250 character encoding in files

I have an exercise to write a script that converts UTF-16 files to UTF-8, so I wanted one example file with UTF-16 encoding. The problem is that the encoding Python shows me for every file is 'cp1250' (no matter the format, .csv or .txt). What am I missing here? I also have example files from the Internet, but Python recognizes them as cp1250 too. Even when I save a file as UTF-8, Python shows cp1250.
This is the code I use:
with open('FILE') as f:
    print(f.encoding)
The result of open is simply a file object that uses your system's default encoding; f.encoding reports that default, not the encoding of the file's contents. To open it in something else, you have to specifically say so.
To actually convert a file, try something like
with open('input', encoding='cp1252') as input, open('output', 'w', encoding='utf-16le') as output:
    for line in input:
        output.write(line)
Converting a legacy 8-bit file to Unicode isn't really useful because it only exercises a small subset of the character set. See if you can find a good "hello world" sample file. https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html is one for UTF-8.
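For the exercise as stated (UTF-16 in, UTF-8 out), the same idea can be wrapped in a small function. This is only a sketch: convert_encoding is a hypothetical name, and the default encodings are assumptions that you would adjust to your files. Passing newline="" keeps the original line endings untouched.

```python
import io


def convert_encoding(src_path, dst_path,
                     src_encoding="utf-16", dst_encoding="utf-8"):
    """Copy a text file, re-encoding it line by line (hypothetical helper)."""
    # newline="" disables newline translation in both directions,
    # so CR+LF and LF endings are copied exactly as they appear.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src, \
         io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

The "utf-16" codec expects a BOM when reading and strips it; if your input has no BOM, you would have to name the byte order explicitly, for example "utf-16-le".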

SPSS python - writing Unicode to spss syntax file

I am using Python to write text to an .sps file (an SPSS syntax file).
begin program.
outfile=open("c:/temp/syntax.sps","w+")
outfile.write("some text…")
outfile.close()
end program.
The last character in the text is:
>>> my_text="some text…"
>>> my_text[-1]
'\x85'
If I open the resulting file in Notepad++, I see the text correctly. However, if I open the file in SPSS syntax, I see this:
some text…
Is there a quick way around this, using only the native modules of Python 2.7? I would rather not convert every Unicode character to its equivalent in some other encoding, if possible.
I know that when you SAVE AS a syntax file in SPSS there is an Encoding option (Unicode (UTF-8) vs. Local Encoding).
Not sure what the resolution is here, but try adding this as the very first line of your Python-generated text file:
* Encoding: UTF-8.
In the end, this worked, with the help of the codecs module
begin program.
import codecs
outfile=codecs.open("c:/temp/syntax.sps","w+","utf-8-sig")
outfile.write("some text…")
outfile.close()
end program.
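To see why the "utf-8-sig" codec helps, it can be useful to look at the bytes involved. The sketch below assumes the original byte string was in a Windows code page such as cp1252 (where '\x85' is the ellipsis); that is an assumption based on the output shown in the question, not something the question states.

```python
import codecs

# Assumption: the 8-bit string from the question is cp1252-encoded.
raw = b"some text\x85"
text = raw.decode("cp1252")  # now a Unicode string ending in U+2026

# "utf-8-sig" is plain UTF-8 preceded by a BOM, which lets a reader
# recognise the file as Unicode rather than a local code page.
payload = text.encode("utf-8-sig")
has_bom = payload.startswith(codecs.BOM_UTF8)
```

Here payload starts with the three BOM bytes EF BB BF, followed by the UTF-8 form of the text.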

what is something like %2 in unicode

I am reading someone else's code and have come to the part concerning Unicode, which is always a headache for me. It would really help a lot if you could give some hints.
The situation is this:
I have a stopword file named stopword.txt in the following form:
1 781037
2 650706 damen
3 196100 löwe
4 146044 lego
5 138280 monster
6 136410 high
7 100657 kost%c3%bcm #this % seems to be strange already
8 94084 schuhe
9 93680 kinder
10 87308 mit
and the code that tries to read it looks like this:
with open('%s/%s'%('path_to_stopwords.txt'), 'r') as f:
    stoplines = [line.decode('utf-8').strip() for line in f.readlines()]
This decode('utf-8') seems very mysterious to me. As I understand it, without any extra specification, open reads the file in as strings that are automatically encoded as ASCII (so in this case there is already information loss if the file contains a character whose code point is above 127, like the ö in löwe, because it would be truncated when read in as ASCII?). What is the point of decoding it as UTF-8 after reading it into the program?
To verify my ideas, I tried to check what is in each line with this code:
for line in stoplines:
    print line
which gives me:
%09
%21%21%21
%26
%26amp%3b
%28buch%29
%28gr.
%2b
%2bbarbie
I am quite confused about where these % come from. Have I read in the content of the file correctly?
Thank you very much.
In Python 2, when you open a file and read from it, you get an str instance back, not a unicode string (in Python 3, you'd get a str, which is unicode in Python 3).
str.decode('utf-8') lets you decode that str into a unicode string (assuming the encoding is UTF8!).
It seems like your stopwords are URL-encoded:
import urllib
print urllib.unquote('%c3%bc')
ü
It is indeed redundant to use urlencoding if the file is supposed to be UTF8 (which natively supports characters such as ü), but my intuition would be that this file is in fact ASCII, not UTF8.
All ASCII chars map to the same char in UTF8, so this works, despite being wrong.
A few points:
If the file is UTF-8, you should decode all of it as UTF-8, not line by line. Either read it all and then decode (i.e. f.read().decode("utf-8")) or open it using codecs.open with UTF-8.
You don't need f.readlines(); you can simply do "for line in f". It's more memory efficient and shorter.
'%s/%s'%('path_to_stopwords.txt') does not even work. Make sure you're doing it correctly. You might want to use os.path.join to join the paths.
The % encoding is url encoding. As Thomas above me wrote, you can use urllib.unquote.
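Putting the points above together, the percent-escapes can be undone after reading the lines. The answers use Python 2's urllib.unquote; the sketch below uses the Python 3 equivalent, urllib.parse.unquote, which decodes the escaped bytes as UTF-8 by default. The function name decode_stopwords is hypothetical.

```python
from urllib.parse import unquote  # Python 3; Python 2 has urllib.unquote


def decode_stopwords(lines):
    """Strip each line and undo its URL (percent) encoding."""
    words = []
    for line in lines:
        # Strip first, so that escaped whitespace such as %09 survives.
        word = unquote(line.strip())
        if word:
            words.append(word)
    return words
```

With this, 'kost%c3%bcm' becomes 'kostüm' and '%28buch%29' becomes '(buch)'. Note that unquote leaves a literal '+' alone; only unquote_plus would turn it into a space.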

Python encoding issue involving special characters

I am running Win7 x64 and I have Python 2.7.5 x64 installed. I am using Wing IDE 101 4.1.
For some reason, encoding is messed up.
string = "sauté"
print string
# sauté
string
# 'saut\xc3\xa9'
I don't understand why it comes out weird when I try to print it. When I write it to a Notepad text file, it comes out right ("sauté"). The problem is that when I use BeautifulSoup on the string, the result contains that weird string "saut├⌐", and when I then write it back out to a CSV file, I end up with an HTML chunk containing that weird bit. Help!
You need to declare the encoding of the source file so Python can properly decode your string literals.
You can do this with a special comment at the top of the file (first or second line).
# coding:<coding>
where <coding> is the encoding used when saving the file, for example utf-8.
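The "weird" output itself can be reproduced, which may help in recognising this kind of problem. The sketch below assumes the console was using code page 437 (a common Windows console default); the question does not say so, but the characters shown are what that code page produces from the UTF-8 bytes of "sauté".

```python
# The bytes shown in the question, 'saut\xc3\xa9', are "sauté" in UTF-8.
utf8_bytes = u"saut\u00e9".encode("utf-8")

# Assumption: the console interprets those bytes as cp437,
# one character per byte, which yields the garbled text.
as_seen_in_console = utf8_bytes.decode("cp437")

# Decoding with the encoding that was actually used restores the text.
round_trip = utf8_bytes.decode("utf-8")
```

So the data is intact; it is only being interpreted with the wrong encoding at the point where it is displayed.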

Python 3.1.3 Win 7: csv writerow Error "must be bytes or buffer, not str"

I have a simple script that worked perfectly under Python 2.7.1 on my Win XP machine.
Now I have a Win 7 machine with Python 3.1.3.
The code is:
owriter.writerow(dtime[1][1])
# dtime[1][1] is ['30-Aug-10 16:00:00', '2.5', '15']
I got this error message: TypeError: must be bytes or buffer, not str
What changes should I make?
Thanks.
In Python 2.X, it was required to open the csvfile with 'b' because the csv module does its own line termination handling.
In Python 3.X, the csv module still does its own line-termination handling, but the file must be opened in text mode with an encoding for Unicode strings. The correct way to open a csv file for writing is:
outputfile=open("out.csv",'w',encoding='utf8',newline='')
encoding can be whatever you require, but newline='' suppresses text mode newline handling. On Windows, failing to do this will write \r\r\n file line endings instead of the correct \r\n. This is mentioned in the 3.X csv.reader documentation only, but csv.writer requires it as well.
Probably you need to open the file in text mode. If not, include enough of your code so it's runnable and demonstrates the problem.
Change to str.encode("ascii").
The point is that Python 2.x mixed two uses of the str type: storing byte buffers and storing character strings. In Python 3.x we have proper Unicode support, and byte buffers are a separate type. You can convert between them using str.encode() and bytes.decode(), specifying a character encoding as a parameter each time.
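As a minimal sketch of the Python 3 approach described above, the row from the question can be written like this. The function name write_rows and the UTF-8 default are assumptions for illustration only.

```python
import csv


def write_rows(path, rows, encoding="utf-8"):
    """Write rows of strings to a CSV file in Python 3 (hypothetical helper)."""
    # Text mode with newline="" lets the csv module emit its own
    # "\r\n" line terminator without a second translation on Windows.
    with open(path, "w", encoding=encoding, newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)
```

Calling write_rows("out.csv", [['30-Aug-10 16:00:00', '2.5', '15']]) writes one line ending in \r\n, with no bytes/str conversion needed in the calling code.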
