Why does ï»¿ appear in my data? - python

I downloaded the file 'pi_million_digits.txt' from here:
https://github.com/ehmatthes/pcc/blob/master/chapter_10/pi_million_digits.txt
I then used this code to open and read it:
filename = 'pi_million_digits.txt'
with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()

print(pi_string[:52] + "...")
print(len(pi_string))
However, the output produced is correct apart from the fact that it is preceded by some strange symbols: "ï»¿3.141..."
What causes these strange symbols? I am stripping the lines, so I'd expect such symbols to be removed.

It looks like you're opening a file with a UTF-8 encoded Byte Order Mark using the ISO-8859-1 encoding (presumably because this is the default encoding on your OS).
If you open it as bytes and read the first line, you should see something like this:
>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
… where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, it looks like what you're getting:
>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'ï»¿3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
… and opening it as UTF-8 shows the actual BOM character U+FEFF:
>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
To strip the mark out, use the special encoding utf-8-sig:
>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
The use of next() in the examples above is just for demonstration purposes. In your code, you just need to add the encoding argument to your open() line, e.g.
with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.
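If you'd rather strip the mark by hand, the standard library exposes the BOM's raw bytes as codecs.BOM_UTF8. A minimal sketch (utf-8-sig already does this for you, so this is just for illustration):
import codecs

with open('pi_million_digits.txt', 'rb') as f:
    raw = f.read()

# Drop the UTF-8 BOM if present, then decode normally
if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]
pi_text = raw.decode('utf-8')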

Related

How to open a file with non-UTF-8 encoded characters?

I want to open a text file (.dat) in Python, and I get the following error:
'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte
but the file is encoded using utf-8, so maybe there is some character that cannot be read. Is there a way to handle the problem without tracking down each weird character? I have a rather huge text file, and it would take me hours to find the non-UTF-8 characters by hand.
Here is my code
import codecs

f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()
It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:
with open('compounds.dat', 'rb') as f:
    data = f.read()
the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
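For instance, a quick sketch (Python 3, where indexing bytes yields an int) to inspect the neighborhood of the offset reported in the traceback:
with open('compounds.dat', 'rb') as f:
    data = f.read()

# 4484 is the position from the error message above
print(data[4474:4494])   # a window of bytes around the offending position
print(hex(data[4484]))   # should show 0x92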
In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):
# If this is Py3, you don't even need the import, just use plain open, which is
# an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)
will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
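You can see the difference on the exact byte from the traceback (0x92 is a continuation byte, hence an invalid start byte in UTF-8):
>>> b'caf\x92'.decode('utf-8', errors='ignore')
'caf'
>>> b'caf\x92'.decode('utf-8', errors='replace')
'caf�'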
If you are working with huge data, it's better to set the encoding explicitly, and if the error persists, add errors="ignore" as well:
with open("filename", 'r', encoding="utf-8", errors="ignore") as f:
    f.read()

read a file and try to remove all non UTF-8 chars

I am trying to read a file and convert its contents to a UTF-8 string, in order to remove some non-UTF-8 characters in it:
file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')
but I got the following error,
AttributeError: 'str' object has no attribute 'decode'
Update: I tried the code as suggested by the answer,
file_str = open(file_path, 'r', encoding='utf-8').read()
but it didn't eliminate the non utf-8 chars, so how to remove them?
Remove the .decode('utf8') call. Your file data has already been decoded, because in Python 3 the open() call with text mode (the default) returned a file object that decodes the data to Unicode strings for you.
You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:
file_str = open(file_path, 'r', encoding='utf8').read()
For example, on Windows, the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have Mojibake, because the UTF-8 data is decoded using CP1252 or a similar 8-bit codec.
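Here is what that Mojibake looks like, taking CP1252 as the 8-bit codec:
>>> 'café'.encode('utf-8').decode('cp1252')
'cafÃ©'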
See the open() function documentation for further details.
If you use
file_str = open(file_path, 'r', encoding='utf8', errors='ignore').read()
then non-UTF-8 characters will simply be dropped. Read the open() function documentation for details; it has a section on the possible values for the errors parameter.

How to open a unicode text file inside a zip?

I tried
import zipfile

with zipfile.ZipFile("5.csv.zip", "r") as zfile:
    for name in zfile.namelist():
        with zfile.open(name, 'rU') as readFile:
            line = readFile.readline()
            print(line)
            split = line.split('\t')
it answers:
b'$0.0\t1822\t1\t1\t1\n'
Traceback (most recent call last):
  File "zip.py", line 6, in <module>
    split = line.split('\t')
TypeError: Type str doesn't support the buffer API
How to open the text file as unicode instead of as b?
To convert a byte stream into Unicode stream, you could use io.TextIOWrapper():
import io
import zipfile

encoding = 'utf-8'
with zipfile.ZipFile("5.csv.zip") as zfile:
    for name in zfile.namelist():
        with zfile.open(name) as readfile:
            for line in io.TextIOWrapper(readfile, encoding):
                print(repr(line))
Note: TextIOWrapper() uses universal newline mode by default. rU mode in zfile.open() is deprecated since version 3.4.
It avoids issues with multibyte encodings described in @Peter DeGlopper's answer.
edit For Python 3, using io.TextIOWrapper as this answer describes is the best choice. The answer below could still be helpful for 2.x. I don't think anything below is actually incorrect even for 3.x, but io.TextIOWrapper is still better.
If the file is utf-8, this will work:
# the rest of the code as above, then:
with zfile.open(name, 'rU') as readFile:
    line = readFile.readline().decode('utf8')
    # etc.
If you're going to be iterating over the file, you can use codecs.iterdecode, but that won't work with readline():
import codecs

with zfile.open(name, 'rU') as readFile:
    for line in codecs.iterdecode(readFile, 'utf8'):
        print line
        # etc.
Note that neither approach is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'. A non-unicode aware tool looking for newlines will split that incorrectly, leaving the null bytes on the following line. In such a case you'd have to use something that doesn't try to split the input by newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for utf-8.
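A quick illustration of the UTF-16 point:
>>> '\n'.encode('utf-16-le')
b'\n\x00'
>>> 'ab\ncd'.encode('utf-16-le').split(b'\n')
[b'a\x00b\x00', b'\x00c\x00d\x00']
The null byte that belongs to the newline ends up at the start of the next chunk.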
The reason why you're seeing that error is because you are trying to mix bytes with unicode. The argument to split must also be byte-string:
>>> line = b'$0.0\t1822\t1\t1\t1\n'
>>> line.split(b'\t')
[b'$0.0', b'1822', b'1', b'1', b'1\n']
To get a unicode string, use decode:
>>> line.decode('utf-8')
'$0.0\t1822\t1\t1\t1\n'

Encoding in python

I have a problem comparing a string from a file with a string I entered in the program. They should be equal, but no matter whether I use decode('utf-8'), I get that they are not equal. Here's the code:
final = open("info", 'r')
exported = open("final", 'w')
lines = final.readlines()
for line in lines:
    if line == "Wykształcenie i praca": #error
        print "ok"
and here is how I save the file that I try to read:
comm_p = bs4.BeautifulSoup(comm)
comm_f.write(comm_p.prettify().encode('utf-8'))
for string in comm_p.strings:
    #print repr(string).encode('utf-8')
    save = string.encode('utf-8')  # this is how I save
    info.write(save)
    info.write("\n")
info.close()
and at the top of the file I have # -*- coding: utf-8 -*-
Any ideas?
This should do what you need:
# -*- coding: utf-8 -*-
import io

with io.open('info', encoding='utf-8') as final:
    lines = final.readlines()

for line in lines:
    if line.strip() == u"Wykształcenie i praca": #error
        print "ok"
You need to open the file with the right encoding, and since your string is not ascii, you should mark it as unicode.
First, you need some basic knowledge about encodings. This is a good place to start. You don't have to read everything right now, but try to get as far as you can.
About your current problem:
You're reading a UTF-8 encoded file (probably), but you're reading it as an ASCII file. open() doesn't do any conversion for you.
So what you need to do (at least), as shown in the sketch below:
- use codecs.open("info", "r", encoding="utf-8") to read the file
- use Unicode strings for comparison: if line.rstrip() == u"Wykształcenie i praca":
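Putting those two steps together, a minimal Python 2 sketch (mirroring the io.open answer above):
# -*- coding: utf-8 -*-
import codecs

with codecs.open("info", "r", encoding="utf-8") as final:
    for line in final:
        # rstrip() drops the trailing newline that readlines() keeps
        if line.rstrip() == u"Wykształcenie i praca":
            print "ok"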
It is likely the difference is in a '\n' character:
readlines doesn't strip '\n' - see Best method for reading newline delimited files in Python and discarding the newlines?
In general it is not a good idea to hard-code a Unicode string in your code; it would be better to read it from a resource file.
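A quick demonstration of the newline difference:
>>> line = u"Wykształcenie i praca\n"
>>> line == u"Wykształcenie i praca"
False
>>> line.rstrip() == u"Wykształcenie i praca"
True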
use unicode for string comparison
>>> s = u'Wykształcenie i praca'
>>> s == u'Wykształcenie i praca'
True
>>>
when it comes to strings, unicode is the smartest move :)

Write unicode content and unicode file name in Windows

#source file is encoded in utf8
import urllib2
import re
req = urllib2.urlopen('http://people.w3.org/rishida/scripts/samples/hungarian.html')
c = req.read()#.decode('utf-8')
p = r'title="This is Latin script \(Hungarian language\)">(.+)'
text = re.search(p, c).group(1)
name = text[:10]+'.txt' #file name will have special chars in it
f = open(name, 'wb')
f.write(text) #content of file will have special chars in it
f.close()
x = raw_input('done')
As you can see, the script does a couple of things:
- Reads content that is known to have unicode characters from a webpage into a variable
(The source file is saved in utf-8, but this should not make a difference unless unicode strings are actually being defined in the source code. As you can see, the unicode string is being defined dynamically in a variable, so what encoding the source is shouldn't matter in this scenario.)
- Writes a file with a name containing unicode characters
- Writes unicode content into this file as well
Here's the weird behavior I get (Windows 7, Python 2.7) :
When I don't use the decode function:
c = req.read()
The NAME of the file will come out gibberish, but the CONTENT of the file will come out readable (that is you can see the correct unicode hungarian characters)
Yet, when I USE the decode function:
c = req.read().decode('utf-8')
It will NOT ERROR on opening the file (really creating it, since it's opened in write mode),
and the resulting file's NAME will be readable; now it shows the correct unicode characters.
So far so good right?
Well, then it WILL ERROR on trying to write the unicode content to the file:
f.write(text) #content of file will have special chars in it
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)
You see, I can't seem to have my cake and eat it too...
Either I can correctly write the NAME of the file, or I can correctly write the CONTENT of the file.
How can I do both?
I've also tried writing the file with
f = codecs.open(name, encoding='utf-8', mode='wb')
but it also errors.
The only problem for you seems to be the "unreadable" file name from your original source file. This can solve your problem:
import sys

f = open(name.decode('utf-8').encode(sys.getfilesystemencoding()), 'wb')
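Note that sys.getfilesystemencoding() simply reports the codec the OS uses for file names; on Windows under Python 2 it typically returns 'mbcs', while most modern Linux systems report 'utf-8':
>>> import sys
>>> sys.getfilesystemencoding()
'mbcs'
(the typical value on Windows; your system may differ)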
While winterTTR's answer does work, I've realized that this approach is convoluted. Rather, all you really need to do is encode the data you write to the file. The name you don't need to encode, and both the name and the content will come out "readable":
# content is a unicode string decoded from UTF-8 bytes
# (the byte literal here is just an example: 'árvíz')
content = '\xc3\xa1rv\xc3\xadz'.decode('utf-8')
f = open(content[:5] + '.txt', 'wb')  # unicode file name, no encoding needed
f.write(content.encode('utf-8'))      # encode the content explicitly
f.close()
