Why is the first line longer? - python

I'm using Python to read a txt document with:
f = open(path, "r")
for line in f:
    line = line.decode('utf8').strip()
    length = len(line)
    firstLetter = line[:1]
It seems to work, but the first line's length is always longer by... 1.
For example:
if the first line is "XXXX", where X denotes a Chinese character,
then length will be 5, not 4,
and firstLetter will be empty.
But from the second line onward, it works properly.
Thanks!

You have a UTF-8 BOM at the start of your file. Don't faff about inspecting the first character. Instead of the utf8 encoding, use the utf_8_sig encoding with either codecs.open() or your_byte_string.decode() ... this sucks up the BOM if it exists and you don't see it in your code.
>>> bom8 = u'\ufeff'.encode('utf8')
>>> bom8
'\xef\xbb\xbf'
>>> bom8.decode('utf8')
u'\ufeff'
>>> bom8.decode('utf_8_sig')
u'' # removes the BOM
>>> 'abcd'.decode('utf_8_sig')
u'abcd' # doesn't care if no BOM
>>>
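For example, a minimal sketch of the codecs.open() route (path as in your question):
import codecs

f = codecs.open(path, "r", encoding="utf_8_sig")
for line in f:
    line = line.strip()     # already unicode; the BOM, if present, is gone
    length = len(line)      # 4 for the "XXXX" line
    firstLetter = line[:1]  # the first Chinese character
f.close()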

You are probably getting the Byte Order Mark (BOM) as the first character on the first line.
See the answer above for how to deal with it.

Related

Error when reading UTF-8 characters with python

I have the following function in Python, which takes a string as an argument and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):
def filt(word):
    dic = {u'á': 'a', u'ã': 'a', u'â': 'a'}  # the whole dictionary is too big, this is just a sample
    new = ''
    for l in word:
        new = new + dic.get(l, l)
    return new
It is supposed to "filter" all strings in a list that I read from a file using this:
lines = []
with open("to-filter.txt", "r") as f:
    for line in f:
        lines.append(line.strip())
lines = [filt(l) for l in lines]
But I get this:
filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
new = new + dic.get(l, l)
and the strings filtered have characters like '\xc3\xb4' instead of ASCII characters. What should I do?
You're mixing and matching Unicode strs and regular (byte) strs.
Use the io module (add import io at the top) to open and decode your text file to Unicode as it's read:
with io.open("to-filter.txt", "r", encoding="utf-8") as f:
This assumes your to-filter.txt file is UTF-8 encoded.
You can also shrink the file-reading loop to just:
with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
lines is now a list of Unicode strings.
Optional
It looks like you're trying to convert non-ASCII characters to their closest ASCII equivalent. The easy way to do this is:
import unicodedata

def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')
What this does is:
Decomposes each character into their component parts. For example, ã can be expressed as a single Unicode char (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
Encode component parts to ASCII. Non ASCII parts (those with code points greater than U+007F), will be ignored.
Decode back to a Unicode str for convenience.
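For instance, a quick Python 2 interpreter check using ã shows both the decomposition and the ASCII step:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\xe3')  # ã
u'a\u0303'
>>> u'a\u0303'.encode('ascii', 'ignore')
'a'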
Tl;dr
Your code is now:
import io
import unicodedata

def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
lines = [filt(l) for l in lines]
Python 3.x
Although not strictly necessary, you can drop io and use the built-in open() directly.
The root of your problem is that you're not reading Unicode strings from the file, you're reading byte strings. There are three ways to fix this. The first is to open the file with the io module, as suggested by another answer. The second is to convert each string as you read it:
with open("to-filter.txt","r") as f:
for line in f:
lines.append(line.decode('utf-8').strip())
The third way is to use Python 3, which always reads text files into Unicode strings.
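In Python 3 that looks like this (a minimal sketch, assuming the file is UTF-8 encoded):
# Python 3: open() decodes to str for you
with open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = [line.strip() for line in f]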
Finally, there's no need to write your own code to turn accented characters into plain ASCII; the unidecode package does that:
from unidecode import unidecode
print(unidecode(line))
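Putting it together, a minimal Python 2 sketch (assuming to-filter.txt is UTF-8 encoded and unidecode is installed, e.g. via pip install unidecode):
import io
from unidecode import unidecode

with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = [unidecode(line.strip()) for line in f]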

Unicode in python 3

I want to convert a string that contains Unicode escape sequences into ordinary text. For example, the file "input.txt" contains the string '\u0057\u0068\u0061\u0074' and I want to know what it means. If the string is written directly in the code, like:
s = '\u0057\u0068\u0061\u0074'
b = s.encode('utf-8')
print(b)
it works perfectly, but if I try to do the same with the file I get this result: b'\\u0057\\u0068\\u0061\\u0074'.
How can I fix this problem? Windows 8; the files are encoded as 'windows-1251'.
If your file contains those unicode escape sequences, then you can use the unicode_escape “codec” to interpret them after you read the file contents as a string.
>>> s = r'\u0057\u0068\u0061\u0074'
>>> print(s)
\u0057\u0068\u0061\u0074
>>> s.encode('utf-8').decode('unicode_escape')
'What'
Or, you can just read a bytes string directly and decode that:
with open('file.txt', 'br') as f:
    print(f.read().decode('unicode_escape'))
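A self-contained round trip for the file case (a sketch; 'input.txt' stands in for your file, and since the escape sequences themselves are plain ASCII, the windows-1251 encoding does not get in the way):
# write a file containing literal escape sequences, then decode them
with open('input.txt', 'w', encoding='windows-1251') as f:
    f.write(r'\u0057\u0068\u0061\u0074')

with open('input.txt', 'rb') as f:
    print(f.read().decode('unicode_escape'))  # -> What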

removing blank lines from text file output python 3

I wrote a program in Python 3 that edits a text file and outputs the edited version to a new text file, but the new file has blank lines that I can't have, and I can't figure out how to get rid of them.
Thanks in advance.
newData = ""
i=0
run=1
j=0
k=1
seqFile = open('temp100.txt', 'r')
seqData = seqFile.readlines()
while i < 26:
sLine = seqData[j]
editLine = seqData[k]
tempLine = editLine[0:20]
newLine = editLine.replace(editLine, tempLine)
newData = newData+sLine+'\n'+newLine+'\n'
i=i+1
j=j+2
k=k+2
run=run+1
seqFile.close()
new100 = open("new100a.fastq", "w")
sys.stdout = new100
print(newData)
The problem is in this line:
newData = newData+sLine+'\n'+newLine+'\n'
sLine already contains a newline, so you should remove the first '\n'. If editLine is at most 20 characters long (including its newline), newLine keeps that newline too; otherwise the slice cuts it off and you should append the newline yourself.
Try this:
newData = newData + sLine + newLine
if len(seqData[k]) > 20:
    newData += '\n'
sLine already contains newlines. newLine will also contain a newline if editLine is 20 characters long or shorter. You can change
newData = newData+sLine+'\n'+newLine+'\n'
to
newData = newData+sLine+newLine
In cases where editLine is longer than 20 characters, the trailing newline will be cut off when you do tempLine = editLine[0:20] and you will need to append a newline to newData yourself.
According to the Python documentation on readline (which is used by readlines), trailing newlines are kept in each line:
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line). If the size argument is present and non-negative, it is a maximum byte count (including the trailing newline) and an incomplete line may be returned. When size is not 0, an empty string is returned only when EOF is encountered immediately.
In general, you can often get a long way in debugging a program by printing the values of your variables when you get unexpected behaviour. For instance, printing sLine with print(repr(sLine)) would have shown you that there was a trailing newline in there.
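Putting the two answers' advice together, here is a minimal sketch (file names taken from the question) that sidesteps the trailing-newline bookkeeping entirely, by stripping newlines on read and adding back exactly one per output line:
newData = ""
with open('temp100.txt', 'r') as seqFile:
    seqData = seqFile.read().splitlines()  # splitlines() drops the trailing newlines

for j in range(0, 52, 2):          # 26 pairs of lines, as in the original while loop
    sLine = seqData[j]             # first line of the pair, kept as-is
    newLine = seqData[j + 1][:20]  # second line of the pair, truncated to 20 chars
    newData += sLine + '\n' + newLine + '\n'

with open('new100a.fastq', 'w') as new100:
    new100.write(newData)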

Python JSON preserve encoding

I have a file like this:
aarónico
aaronita
ababol
abacá
abacería
abacero
ábaco
#more words, with no ascii chars
When I read and print that file to the console, it prints exactly the same, as expected, but when I do:
f.write(json.dumps({word: Lookup(line)}))
This is saved instead:
{"aar\u00f3nico": ["Stuff"]}
When I expected:
{"aarónico": ["Stuff"]}
I need to get the same when I json.loads() it, but I don't know where or how to do the encoding, or whether it's needed at all.
EDIT
This is the code that saves the data to a file:
with open(LEMARIO_FILE, "r") as flemario:
    with open(DATA_FILE, "w") as f:
        while True:
            word = flemario.readline().strip()
            if word == "":
                break
            print word  # this is correct
            f.write(json.dumps({word: RAELookup(word)}))
            f.write("\n")
And this one loads the data and returns the dictionary object:
with open(DATA_FILE, "r") as f:
    while True:
        new = f.readline().strip()
        if new == "":
            break
        print json.loads(new)  # this is not
I cannot look up the dictionaries if the keys are not the same as the saved ones.
EDIT 2
>>> import json
>>> f = open("test", "w")
>>> f.write(json.dumps({"héllö": ["stuff"]}))
>>> f.close()
>>> f = open("test", "r")
>>> print json.loads(f.read())
{u'h\xe9ll\xf6': [u'stuff']}
>>> "héllö" in {u'h\xe9ll\xf6': [u'stuff']}
False
This is normal and valid JSON behaviour. The \uxxxx escape is also used by Python, so make sure you don't confuse python literal representations with the contents of the string.
Demo in Python 3.3:
>>> import json
>>> print('aar\u00f3nico')
aarónico
>>> print(json.dumps('aar\u00f3nico'))
"aar\u00f3nico"
>>> print(json.loads(json.dumps('aar\u00f3nico')))
aarónico
In python 2.7:
>>> import json
>>> print u'aar\u00f3nico'
aarónico
>>> print(json.dumps(u'aar\u00f3nico'))
"aar\u00f3nico"
>>> print(json.loads(json.dumps(u'aar\u00f3nico')))
aarónico
When reading and writing from and to files, and when specifying just raw byte strings (and "héllö" is a raw byte string) then you are not dealing with Unicode data. You need to learn about the differences between encoded and Unicode data first. I strongly recommend you read at least 2 of the following 3 articles:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
You were lucky with your "héllö" Python byte-string literal; Python managed to decode it automatically for you. The value read back from the file is perfectly normal and correct:
>>> print u'h\xe9ll\xf6'
héllö
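If what you actually want is the accented characters written literally to the file, json.dumps() accepts ensure_ascii=False. A minimal Python 2 sketch (and either way, look dictionaries up with unicode keys, which is why your membership test printed False):
import io
import json

data = {u'h\xe9ll\xf6': [u'stuff']}

# ensure_ascii=False keeps non-ASCII characters literal in the JSON output;
# io.open encodes the resulting unicode string as UTF-8 on write
with io.open("test", "w", encoding="utf-8") as f:
    f.write(json.dumps(data, ensure_ascii=False))

with io.open("test", "r", encoding="utf-8") as f:
    loaded = json.loads(f.read())

print u'h\xe9ll\xf6' in loaded  # True -- unicode key compared against unicode key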

UnicodeDecodeError with seek() and read()

I am following example code in Programming Python, and something confuses me. Here's the code that writes a simple string to a file and then reads it back:
>>> data = 'sp\xe4m' # data to your script
>>> data, len(data) # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8')) # bytes written to file
(b'sp\xc3\xa4m', 5)
>>> f = open('test', mode='w+', encoding='utf8') # use text mode, encoded
>>> f.write(data)
>>> f.flush()
>>> f.seek(0); f.read(1) # ascii bytes work
's'
>>> f.seek(2); f.read(1) # as does 2-byte nonascii
'ä'
>>> data[3] # but offset 3 is not 'm' !
'm'
>>> f.seek(3); f.read(1)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0:
unexpected code byte
Now, what confuses me is this: why does this UnicodeDecodeError happen if the data string is UTF-8 encoded? Reading sequentially with f.read() works fine, but when I use seek to jump and then read(1), this error shows up.
Seeking within a file will move the read pointer by bytes, not by characters. The .read() call expects to be able to read whole characters instead. Because UTF-8 uses multiple bytes for any unicode codepoint beyond the ASCII character set, you cannot just seek into the middle of a multi-byte UTF-8 codepoint and expect .read() to work.
The U+00E4 codepoint (the glyph ä) is encoded to two bytes, C3 and A4. In the file, this means there are now 5 bytes, representing s, p, the hex bytes C3 and A4, then m.
By seeking to position 3, you moved the file pointer to the A4 byte, and calling .read() then fails because without the preceding C3 byte there is not enough context to decode the character. This raises the UnicodeDecodeError; a lone A4 byte is unexpected, as it cannot start a valid UTF-8 sequence.
Seek to position 4 instead:
>>> f.seek(4); f.read(1)
'm'
Better still, don't seek around in UTF-8 data, or open the file in binary mode and decode manually.
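For the binary-mode alternative, a minimal sketch reusing the 'test' file from the session above: read the raw bytes, decode once, then index by character rather than by byte.
with open('test', 'rb') as f:
    raw = f.read()             # all 5 bytes: b'sp\xc3\xa4m'
text = raw.decode('utf8')      # 'späm' -- 4 characters
print(text[3])                 # 'm'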
