open() and codecs.open() in Python 2.7 behave strangely differently

I have a text file whose first line contains Unicode characters and whose other lines are all ASCII.
I try to read the first line into one variable and all the other lines into another. However, when I use the following code:
# -*- coding: utf-8 -*-
import codecs
import os
filename = '1.txt'
f = codecs.open(filename, 'r', encoding='utf-8')
print f
names_f = f.readline().split(' ')
data_f = f.readlines()
print len(names_f)
print len(data_f)
f.close()
print 'And now for something completely differerent:'
g = open(filename, 'r')
names_g = g.readline().split(' ')
print g
data_g = g.readlines()
print len(names_g)
print len(data_g)
g.close()
I get the following output:
<open file '1.txt', mode 'rb' at 0x01235230>
28
7
And now for something completely differerent:
<open file '1.txt', mode 'r' at 0x017875A0>
28
77
If I don't use readlines(), the whole file is read, not only the first 7 lines, with both codecs.open() and open().
Why does this happen?
And why does codecs.open() read the file in binary mode even though the 'r' mode is passed?
Update: This is the original file: http://www1.datafilehost.com/d/0792d687

Because you used .readline() first, the codecs.open() file has filled a linebuffer; the subsequent call to .readlines() returns only the buffered lines.
If you call .readlines() again, the rest of the lines are returned:
>>> f = codecs.open(filename, 'r', encoding='utf-8')
>>> line = f.readline()
>>> len(f.readlines())
7
>>> len(f.readlines())
71
The work-around is to not mix .readline() and .readlines():
f = codecs.open(filename, 'r', encoding='utf-8')
data_f = f.readlines()
names_f = data_f.pop(0).split(' ') # take the first line.
This behaviour is really a bug; the Python devs are aware of it, see issue 8260.
The other option is to use io.open() instead of codecs.open(); the io library is what Python 3 uses to implement the built-in open() function and is a lot more robust and versatile than the codecs module.
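As a minimal sketch of the io.open() route, the snippet below builds a small sample file (a hypothetical stand-in for 1.txt, with one non-ASCII header line and 77 ASCII lines) and then mixes readline() and readlines() on it. The helper name split_header_and_data and the sample contents are assumptions made for illustration only.

```python
import io
import os
import tempfile


def split_header_and_data(path):
    """Read the first line with readline() and the rest with readlines()."""
    # On Python 3, io.open() is the same function as the built-in open().
    with io.open(path, 'r', encoding='utf-8') as f:
        names = f.readline().split(' ')
        data = f.readlines()
    return names, data


# Build a sample file: one non-ASCII header line plus 77 ASCII lines.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(sample_path, 'w', encoding='utf-8') as out:
    out.write(u'\u0438\u043c\u044f \u0434\u0430\u0442\u0430\n')
    for i in range(77):
        out.write(u'row %d\n' % i)

names, data = split_header_and_data(sample_path)
os.remove(sample_path)
print(len(names), len(data))
```

As for the 'rb' in the output: the codecs documentation states that underlying encoded files are always opened in binary mode, with decoding applied on top of the raw bytes, so the mode shown for the codecs.open() file is expected.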

Related

Overwriting lines in text file [duplicate]

How can I insert a string at the beginning of each line in a text file? I have the following code:
f = open('./ampo.txt', 'r+')
with open('./ampo.txt') as infile:
    for line in infile:
        f.insert(0, 'EDF ')
f.close
I get the following error:
'file' object has no attribute 'insert'
Python comes with batteries included:
import fileinput
import sys
for line in fileinput.input(['./ampo.txt'], inplace=True):
    sys.stdout.write('EDF {l}'.format(l=line))
Unlike the solutions already posted, this also preserves file permissions.
You can't modify a file in place like that. Files do not support insertion. You have to read it all in and then write it all out again.
You can do this line by line if you wish. But in that case you need to write to a temporary file and then replace the original. So, for small enough files, it is just simpler to do it in one go like this:
with open('./ampo.txt', 'r') as f:
    lines = f.readlines()
lines = ['EDF '+line for line in lines]
with open('./ampo.txt', 'w') as f:
    f.writelines(lines)
Here's a solution where you write to a temporary file and move it into place. You might prefer this version if the file you are rewriting is very large, since it avoids keeping the contents of the file in memory, as versions that involve .read() or .readlines() will. In addition, if there is any error in reading or writing, your original file will be safe:
from shutil import move
from tempfile import NamedTemporaryFile
filename = './ampo.txt'
tmp = NamedTemporaryFile(delete=False)
with open(filename) as finput:
    with open(tmp.name, 'w') as ftmp:
        for line in finput:
            ftmp.write('EDF '+line)
move(tmp.name, filename)
For a file not too big:
with open('./ampo.txt', 'rb+') as f:
    x = f.read()
    f.seek(0,0)
    f.writelines(('EDF ', x.replace('\n','\nEDF ')))
    f.truncate()
Note that in this case, where the content grows, f.truncate() is not strictly necessary: the new data overwrites every byte of the old content, so nothing stale is left behind. That is what I observed in my tests.
I am cautious, though, and I think it's better to keep the call anyway. Closing a file, whether through the with statement or otherwise, never shortens it. So when the new content is shorter than the old, the tail of the old content remains in the file unless truncate() is called.
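The role of f.truncate() can be checked with a small experiment. The sketch below uses a temporary file and a hypothetical helper named overwrite (neither is part of the answer above): it writes shorter content over longer content, once without and once with truncate().

```python
import os
import tempfile


def overwrite(path, new_bytes, truncate):
    """Overwrite the file from offset 0, optionally cutting off what is left."""
    with open(path, 'rb+') as f:
        f.write(new_bytes)
        if truncate:
            f.truncate()  # cut the file at the current position


def read_back(path):
    with open(path, 'rb') as f:
        return f.read()


fd, demo_path = tempfile.mkstemp()
os.close(fd)

# Shorter content written over longer content, without truncate():
with open(demo_path, 'wb') as f:
    f.write(b'0123456789')
overwrite(demo_path, b'abc', truncate=False)
without_truncate = read_back(demo_path)

# Same again, this time with truncate():
with open(demo_path, 'wb') as f:
    f.write(b'0123456789')
overwrite(demo_path, b'abc', truncate=True)
with_truncate = read_back(demo_path)

os.remove(demo_path)
print(without_truncate, with_truncate)
```

Without truncate() the old tail survives after the new bytes; with it, the file ends where the write ended.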
For a big file, to avoid loading the whole content of the file into RAM at once:
import os
def addsomething(filepath, ss):
    if filepath.rfind('.') > filepath.rfind(os.sep):
        a,_,c = filepath.rpartition('.')
        tempi = a + 'temp.' + c
    else:
        tempi = filepath + 'temp'
    with open(filepath, 'rb') as f, open(tempi,'wb') as g:
        g.writelines(ss + line for line in f)
    os.remove(filepath)
    os.rename(tempi,filepath)
addsomething('./ampo.txt','WZE')
f = open('./ampo.txt', 'r')
lines = map(lambda l : 'EDF ' + l, f.readlines())
f.close()
f = open('./ampo.txt', 'w')
map(lambda l : f.write(l), lines)
f.close()

Utf-8 decoding with Python

I have a csv with some data, and in one row there is a text that was added after encoding it in utf-8.
This is the text:
"b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'"
I'm trying to use this text to obtain the original characters using the decode function, but it's impossible.
Does anyone know which is the correct procedure to do it?
Assuming that the line in your file is exactly like this:
b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
And reading the line from the file gives the output:
>>> line
"b'\\xe7\\x94\\xb3\\xe8\\xbf\\xaa\\xe8\\xa5\\xbf\\xe8\\xb7\\xaf255\\xe5\\xbc\\x84660\\xe5\\x8f\\xb7\\xe5\\x92\\x8c665\\xe5\\x8f\\xb7 \\xe4\\xb8\\xad\\xe5\\x9b\\xbd\\xe4\\xb8\\x8a\\xe6\\xb5\\xb7\\xe6\\xb5\\xa6\\xe4\\xb8\\x9c\\xe6\\x96\\xb0\\xe5\\x8c\\xba 201205'"
You can try using the eval() function:
with open(r"your_csv.csv", "r") as csvfile:
    for line in csvfile:
        # when you reach the desired line
        b = eval(line).decode('utf-8')
Output:
>>> print(b)
'申迪西路255弄660号和665号 中国上海浦东新区 201205'
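Since eval() executes whatever expression the file happens to contain, a more cautious variant is ast.literal_eval(), which only accepts Python literals. The following is a minimal sketch, assuming Python 3 and assuming the line holds nothing but the b'...' literal; the helper name decode_bytes_literal and the shortened sample line are hypothetical.

```python
import ast


def decode_bytes_literal(text):
    """Turn the text "b'\\xe4...'" back into the string it encodes."""
    # literal_eval() parses literals only; it does not run arbitrary code.
    value = ast.literal_eval(text.strip())
    if not isinstance(value, bytes):
        raise ValueError('expected a bytes literal, got %r' % type(value))
    return value.decode('utf-8')


# A shortened sample line, as it would be read from the file.
line = r"b'\xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7 201205'" + "\n"
decoded = decode_bytes_literal(line)
print(decoded)
```

If the CSV field carries extra surrounding quotes, they would need to be stripped before the call.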
Try this:
a = b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
print(a.decode('utf-8')) #your decoded output
Since you say you are reading from a file, you can also try passing the encoding when opening it:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)

readlines() cannot read lines after using readline()

The following simple code reads a CSV file and returns the number of lines of the file. As you can see in the output, the file has 501 lines.
>>> import codecs
>>> f = codecs.open("tmp.csv", "r", "utf_8")
>>> print len(f.readlines())
501
But if I insert a readline() before using readlines(), the latter does not reach the end of the file.
>>> import codecs
>>> f = codecs.open("tmp.csv", "r", "utf_8")
>>> f.readline()
>>> print len(f.readlines())
1
Is there any basic mistake in my code? How can I mix readline() and readlines()? (actually I don't need to mix these two functions in my real program, but I am just curious...)
You can download the file at
https://dl.dropboxusercontent.com/u/16653989/tmp/tmp.csv
This has something to do with the codecs module, because when you do the same thing with the regular Python open() function, it works as expected:
f = open('tmp.csv')
f.readline()
>>> print len(f.readlines())
500

read whole file at once

I need to read the whole raw content of the file something.zip (without uncompressing it).
I tried
f = open('file.zip')
s = f.read()
f.close()
return s
but it returns only a few bytes and not the whole content. Any idea how to achieve this? Thanks
Use binary mode ('b') when you're dealing with a binary file.
def read_zipfile(path):
    with open(path, 'rb') as f:
        return f.read()
BTW, use the with statement instead of closing the file manually.
As mentioned, in text mode on Windows the byte 0x1A (Ctrl-Z) is treated as an end-of-file marker and terminates the .read() operation. To reproduce and demonstrate this:
# Create file of 256 bytes
with open('testfile', 'wb') as fout:
    fout.write(''.join(map(chr, range(256))))
# Text mode
with open('testfile') as fin:
    print 'Opened in text mode is:', len(fin.read())
    # Opened in text mode is: 26
# Binary mode - note 'rb'
with open('testfile', 'rb') as fin:
    print 'Opened in binary mode is:', len(fin.read())
    # Opened in binary mode is: 256
This should do it:
In [1]: f = open('/usr/bin/ping', 'rb')
In [2]: bytes = f.read()
In [3]: len(bytes)
Out[3]: 9728
For comparison, here's the file I opened in the code above:
-rwx------+ 1 xx yy 9.5K Jan 19 2005 /usr/bin/ping*

Replace a character by another in a file

I'd like to modify some characters of a file in-place, without having to copy the entire content of the file in another, or overwrite the existing one. However, it doesn't seem possible to just replace a character by another:
>>> f = open("foo", "a+") # file does not exist
>>> f.write("a")
1
>>> f.seek(0)
0
>>> f.write("b")
1
>>> f.seek(0)
0
>>> f.read()
'ab'
Here I'd have expected "a" to be replaced by "b", so that the content of the file would be just "b", but this is not the case. Is there a way to do this?
That's because of the mode you're using: in append mode, the file pointer is moved to the end of the file before each write. You should open your file in w+ mode:
f = open("foo", "w+") # file does not exist
f.write("samething")
f.seek(1)
f.write("o")
f.seek(0)
print f.read() # prints "something"
If you want to do that on an existing file without truncating it, you should open it in r+ mode for reading and writing.
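A minimal sketch of the r+ case, assuming the replacement has the same length as the data it overwrites. The helper name replace_byte_at and the temporary file are hypothetical and used only for illustration; binary mode is used so that offsets are plain byte positions.

```python
import os
import tempfile


def replace_byte_at(path, offset, new_byte):
    """Overwrite one byte of an existing file without truncating it."""
    # 'r+b' opens for reading and writing and keeps the existing content.
    with open(path, 'r+b') as f:
        f.seek(offset)
        f.write(new_byte)


# Create an existing file to work on.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(b'samething')

replace_byte_at(path, 1, b'o')

with open(path, 'rb') as f:
    result = f.read()
os.remove(path)
print(result)
```

Only the byte at the given offset changes; the rest of the file and its length stay as they were.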
Truncate the file using file.truncate first:
>>> f = open("foo", "a+")
>>> f.write('a')
>>> f.truncate(0) #truncates the file to 0 bytes
>>> f.write('b')
>>> f.seek(0)
>>> f.read()
'b'
Otherwise, open the file in w+ mode as suggested by @Guillaume.
import fileinput
for line in fileinput.input('abc', inplace=True):
    line = line.replace('t', 'ed')
    print line,
This doesn't replace character by character, instead it scans through each line replacing required character and writes the modified line.
For example:
file 'abc' contains:
i want
to replace
character
After executing, output would be:
i waned
edo replace
characeder
Will it help you? Hope so..
I believe that you may be able to modify the example from this answer.
https://stackoverflow.com/a/290494/1669208
import fileinput
for line in fileinput.input("test.txt", inplace=True):
    print line.replace(char1, char2),
