I am currently reproducing the following Unix command:
cat command.info fort.13 > command.fort.13
in Python with the following:
with open('command.fort.13', 'w') as outFile:
    with open('fort.13', 'r') as fort13, open('command.info', 'r') as com:
        for line in com.read().split('\n'):
            if line.strip() != '':
                print >>outFile, line
        for line in fort13.read().split('\n'):
            if line.strip() != '':
                print >>outFile, line
which works, but there has to be a better way. Any suggestions?
Edit (2016):
This question has started getting attention again after four years. I wrote up some thoughts in a longer Jupyter Notebook here.
The crux of the issue is that my question concerned the (to me unexpected) behavior of readlines. The question I was really aiming at could have been asked more directly, and it would have been better answered with read().splitlines().
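For reference, a minimal illustration of that difference (the sample file and its contents are made up):

# readlines() keeps the trailing newline on each element,
# read().splitlines() strips it.
with open('sample.txt', 'w') as f:
    f.write('first\nsecond\n')

with open('sample.txt') as f:
    print(f.readlines())          # ['first\n', 'second\n']

with open('sample.txt') as f:
    print(f.read().splitlines())  # ['first', 'second']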
The easiest way might be simply to forget about the lines, and just read in the entire file, then write it to the output:
with open('command.fort.13', 'wb') as outFile:
    with open('command.info', 'rb') as com, open('fort.13', 'rb') as fort13:
        outFile.write(com.read())
        outFile.write(fort13.read())
As pointed out in a comment, this can cause high memory usage if either of the inputs is large (as it copies the entire file into memory first). If this might be an issue, the following will work just as well (by copying the input files in chunks):
import shutil

with open('command.fort.13', 'wb') as outFile:
    with open('command.info', 'rb') as com, open('fort.13', 'rb') as fort13:
        shutil.copyfileobj(com, outFile)
        shutil.copyfileobj(fort13, outFile)
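copyfileobj also accepts an optional buffer size (in bytes) as a third argument if the default chunk size ever needs tuning; a small variation on the snippet above, with an arbitrary 1 MiB chunk size:

import shutil

with open('command.fort.13', 'wb') as outFile:
    with open('command.info', 'rb') as com, open('fort.13', 'rb') as fort13:
        # copy in 1 MiB chunks instead of the default buffer size
        shutil.copyfileobj(com, outFile, 1024 * 1024)
        shutil.copyfileobj(fort13, outFile, 1024 * 1024)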
def cat(outfilename, *infilenames):
    with open(outfilename, 'w') as outfile:
        for infilename in infilenames:
            with open(infilename) as infile:
                for line in infile:
                    if line.strip():
                        outfile.write(line)

cat('command.fort.13', 'fort.13', 'command.info')
#!/usr/bin/env python
import fileinput
for line in fileinput.input():
    print line,
Usage:
$ python cat.py command.info fort.13 > command.fort.13
Or to allow arbitrary large lines:
#!/usr/bin/env python
import sys
from shutil import copyfileobj as copy
for filename in sys.argv[1:] or ["-"]:
if filename == "-":
copy(sys.stdin, sys.stdout)
else:
with open(filename, 'rb') as file:
copy(file, sys.stdout)
The usage is the same.
Or on Python 3.3 using os.sendfile():
#!/usr/bin/env python3.3
import os
import sys
output_fd = sys.stdout.buffer.fileno()
for filename in sys.argv[1:]:
    with open(filename, 'rb') as file:
        while os.sendfile(output_fd, file.fileno(), None, 1 << 30) != 0:
            pass
The above sendfile() call is written for Linux > 2.6.33. In principle, sendfile() can be more efficient than a combination of read/write used by other approaches.
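As a hedged sketch (Python 3; the helper name is made up), the same copy can fall back to shutil.copyfileobj on platforms where os.sendfile is unavailable:

import os
import shutil
import sys

def copy_to_stdout(path):
    # Use sendfile() where the platform provides it; otherwise fall back
    # to a plain chunked copy. Python 3 only (sys.stdout.buffer).
    with open(path, 'rb') as f:
        if hasattr(os, 'sendfile'):
            out_fd = sys.stdout.buffer.fileno()
            while os.sendfile(out_fd, f.fileno(), None, 1 << 30):
                pass
        else:
            shutil.copyfileobj(f, sys.stdout.buffer)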
Iterating over a file yields lines.
for line in infile:
    outfile.write(line)
You can simplify this in a few ways:
with open('command.fort.13', 'w') as outFile:
    with open('fort.13', 'r') as fort13, open('command.info', 'r') as com:
        for line in com:
            if line.strip():
                print >>outFile, line
        for line in fort13:
            if line.strip():
                print >>outFile, line
More importantly, the shutil module has the copyfileobj function:
import shutil

with open('command.fort.13', 'w') as outFile:
    with open('command.info', 'r') as com:
        shutil.copyfileobj(com, outFile)
    with open('fort.13', 'r') as fort13:
        shutil.copyfileobj(fort13, outFile)
This doesn't skip the blank lines, but cat doesn't do that either, so I'm not sure you really want to.
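If you did want to skip blank lines while still streaming line by line, a minimal sketch using the question's filenames:

with open('command.fort.13', 'w') as outFile:
    for name in ('command.info', 'fort.13'):
        with open(name) as inFile:
            # writelines with a generator keeps only one line in memory at a time
            outFile.writelines(line for line in inFile if line.strip())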
List comprehensions are awesome for things like this:
with open('command.fort.13', 'w') as output:
    for f in ['fort.13', 'command.info']:
        output.write(''.join([line for line in open(f).readlines() if line.strip()]))
Related
I am trying to alter the code below so that it works in Python 3.4. However, I get the error AttributeError: 'int' object has no attribute 'replace' on the line line.replace(",", "\t"), and I am trying to understand how to rewrite this part of the code.
import os
import gzip
from io import BytesIO
import pandas as pd

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file="
filename = "data/irt_euryld_d.tsv.gz"
outFilePath = filename.split('/')[1][:-3]

response = urllib2.urlopen(baseURL + filename)
compressedFile = BytesIO()
compressedFile.write(response.read())
compressedFile.seek(0)
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')

with open(outFilePath, 'w') as outfile:
    outfile.write(decompressedFile.read().decode("utf-8", errors="ignore"))
# Now have to deal with the tsv file
import csv

csvout = 'C:/Sidney/ECB.tsv'
outfile = open(csvout, "w")

with open(outFilePath, "rb") as f:
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)

outfile.close()
Thank You
You're writing ASCII (by default) with the 'w' mode, but the file you're getting that content from is being read as bytes with the 'rb' mode. Open that file with 'r'.
And then, as Sebastian suggests, just iterate over the file object with for line in f:. f.read() reads the entire file into a single string, so iterating over that iterates over each character of the file. Strictly speaking, since all you're doing is replacing a single character, the end result is identical, but iterating over the file object is preferred because it uses less memory.
Let's make better use of the with construct and go from this:
outfile = open(csvout, "w")

with open(outFilePath, "rb") as f:
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)

outfile.close()
to this:
with open(outFilePath, "r") as f, open(csvout, 'w') as outfile:
for line in f:
outfile.write(line.replace(",", "\t"))
Also, I should note that this is much easier to do with find-and-replace in your text editor of choice (I like Notepad++).
Try rewriting it as this:
with open(outFilePath, "r") as f:
for line in f: #don't iterate over entire file at once, go line by line
line.replace(",", "\t")
outfile.write(line)
Originally you were opening the file in 'read-binary' (rb) mode, so f.read() returned bytes, and iterating over a bytes object yields integers, not the strings you were expecting. In Python, int objects do not have a .replace() method, but str objects do; this is the cause of your AttributeError. Opening the file in regular 'read' (r) mode yields strings, which do have .replace() available to call.
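A quick way to see the difference on Python 3 (illustrative literals only):

# Iterating over bytes yields ints; iterating over str yields
# one-character strings. This is why .replace() fails in binary mode.
for b in b"a,b":
    print(type(b))   # <class 'int'>
for c in "a,b":
    print(type(c))   # <class 'str'>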
Related post on return type of .read() here and more information available from the docs here.
I have written the following script to concatenate all the files in the directory into one single file.
Can this be optimized, in terms of
idiomatic Python
time
Here is the snippet:
import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"
filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")
Use shutil.copyfileobj to copy data:
import glob
import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)
shutil reads from the readfile object in chunks, writing them to the outfile file object directly. Do not use readline() or an iterating buffer, since you do not need the overhead of finding line endings.
Use the same mode for both reading and writing; this is especially important when using Python 3; I've used binary mode for both here.
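A minimal illustration of why matching modes matters on Python 3 (the filenames here are made up):

with open('in.txt', 'w') as f:
    f.write('hello\n')

with open('out.bin', 'wb') as out, open('in.txt', 'r') as src:
    try:
        out.write(src.read())   # writing str into a binary file object
    except TypeError as exc:
        print(exc)              # a bytes-like object is required, not 'str'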
Using Python 2.7, I did some "benchmark" testing of
outfile.write(infile.read())
vs
shutil.copyfileobj(readfile, outfile)
I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of ~2.6 GB. In both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.
When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant:
outfile.write, binary mode: 43 seconds, on average.
shutil.copyfileobj, normal mode: 27 seconds, on average.
The outfile had a final size of 2620 MB in normal read mode vs 2578 MB in binary read mode.
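For reference, a rough sketch of the kind of timing harness such a comparison can use (the helper name and output filename are made up for illustration):

import glob
import shutil
import time

def concat_copyfileobj(outfilename, read_mode='r', write_mode='w'):
    # Concatenate every .txt file except the output itself.
    with open(outfilename, write_mode) as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                continue
            with open(filename, read_mode) as readfile:
                shutil.copyfileobj(readfile, outfile)

start = time.time()
concat_copyfileobj('all_copyfileobj.txt')
print('copyfileobj, text mode: %.1f s' % (time.time() - start))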
You can iterate over the lines of a file object directly, without reading the whole thing into memory:
with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
No need to use that many variables.
with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")
I was curious to check performance further, so I used the answers of Martijn Pieters and Stephen Miller.
I tried binary and text modes, with and without shutil, merging 270 files.
Text mode -
def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())
Binary mode -
def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())
Running times for binary mode -
Shutil - 20.161773920059204
Normal - 17.327500820159912
Running times for text mode -
Shutil - 20.47757601737976
Normal - 13.718038082122803
It looks like shutil performs about the same in both modes, while the plain write approach is noticeably faster in text mode than in binary mode.
OS: macOS 10.14 Mojave, MacBook Air 2017.
The fileinput module provides a natural way to iterate over multiple files
for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
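A self-contained sketch of the same idea (the output filename is illustrative and is excluded from the inputs):

import fileinput
import glob

# Concatenate every .txt file into one output file, line by line.
filenames = [f for f in glob.glob('*.txt') if f != 'combined.txt']
with open('combined.txt', 'w') as outfile:
    for line in fileinput.input(filenames):
        outfile.write(line)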
When I try to run this script, I get this error:
ValueError: I/O operation on closed file.
I checked some similar questions and the docs, but with no success. While the error is clear enough, I haven't been able to figure it out; apparently I am missing something.
# -*- coding: utf-8 -*-
import os
import re

dirpath = 'path\\to\\dir'
filenames = os.listdir(dirpath)
nb = 0

with open('path\\to\\dir\\file.txt', 'w') as outfile:
    for fname in filenames:
        nb = nb + 1
        print fname
        print nb
        currentfile = os.path.join(dirpath, fname)

with open(currentfile) as infile:
    for line in infile:
        outfile.write(line)
Edit: Since I removed the with from open, the error message changed to:
open('C:\\path\\to\\file.txt', 'w') as outfile:
SyntaxError: invalid syntax, with a pointer underneath as.
Edit: Much confusion with this question. In the end, I restored with and fixed the indents a bit, and it is working just fine!
It looks like your outfile block is at the same level as your infile block, which means that at the end of the first with block, outfile is closed and can't be written to. Indent your infile block to be inside your outfile block.
with open('output', 'w') as outfile:
    for a in b:
        with open('input') as infile:
            ...
            ...
You can simplify your code here by using the fileinput module, which also makes the code clearer and less error-prone:
import fileinput
from contextlib import closing
import os
with closing(fileinput.input(os.listdir(dirpath))) as fin, open('output', 'w') as fout:
    fout.writelines(fin)
You use the with context manager, which means the file is closed as soon as you leave the with block. So outfile is already closed by the time you try to write to it; the inner with needs to be nested inside the outer one:
with open('path\\to\\dir\\file.txt', 'w') as outfile:
    for fname in filenames:
        nb = nb + 1
        print fname
        print nb
        currentfile = os.path.join(dirpath, fname)
        with open(currentfile) as infile:
            for line in infile:
                outfile.write(line)
I have to read in a file, change sections of the text here and there, and then write it back out to the same file.
Currently I do:
f = open(file)
file_str = f.read() # read it in as a string, Not line by line
f.close()
#
# do_actions_on_file_str
#
f = open(file, 'w') # to clear the file
f.write(file_str)
f.close()
But I would imagine that there is a more pythonic approach that yields the same result.
Suggestions?
That already looks straightforward and clear. Any suggestion depends on how big the files are; if they are not huge, that is fine. If they are really large, you could process them in chunks.
But you could use a context manager to avoid the explicit closes.
with open(filename) as f:
    file_str = f.read()

# do stuff with file_str

with open(filename, "w") as f:
    f.write(file_str)
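If the file really is too large to hold in memory, one hedged sketch is to stream it through a temporary file in chunks and then replace the original (the process_chunk helper and chunk size are made up, and this assumes the transformation can be applied chunk by chunk):

import os
import tempfile

def rewrite_in_chunks(filename, process_chunk, chunk_size=1 << 20):
    # Stream the file through a temporary file so the whole thing never
    # has to fit in memory, then swap it into place.
    dirname = os.path.dirname(os.path.abspath(filename))
    with open(filename) as src, \
         tempfile.NamedTemporaryFile('w', dir=dirname, delete=False) as tmp:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            tmp.write(process_chunk(chunk))
    os.replace(tmp.name, filename)  # os.replace needs Python 3.3+; os.rename also works on POSIX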
If you work line by line, you can use fileinput in inplace mode:
import fileinput
for line in fileinput.input(mifile, inplace=1):
    print process(line)
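With inplace=1, standard output is redirected into the file being processed, so whatever the loop writes to stdout becomes the new contents of that file. A small sketch (the filename and the comma-to-tab transformation are illustrative only):

import fileinput
import sys

# Each line written to stdout replaces the corresponding original line.
for line in fileinput.input('notes.txt', inplace=1):
    sys.stdout.write(line.replace(',', '\t'))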
If you need to process all the text at once, then your code can be tidied up a bit using with, which takes care of closing the file:
with open(myfile) as f:
    file_str = f.read()

#
# do_actions_on_file_str
#

with open(myfile, 'w') as f:
    f.write(file_str)
I am trying to form a quotes file of a specific user name in a log file. How do I remove every line that does not contain the specific user name in it? Or how do I write all the lines which contain this user name to a new file?
with open('input.txt', 'r') as rfp:
    with open('output.txt', 'w') as wfp:
        for line in rfp:
            if ilikethis(line):
                wfp.write(line)
with open(logfile) as f_in:
    lines = [l for l in f_in if username in l]

with open(outfile, 'w') as f_out:
    f_out.writelines(lines)
Or if you don't want to store all the lines in memory
with open(logfile) as f_in:
    lines = (l for l in f_in if username in l)
    with open(outfile, 'w') as f_out:
        f_out.writelines(lines)
I sort of like the first one better but for a large file, it might drag.
Something along this line should suffice:
newfile = open(newfilename, 'w')
for line in file(filename, 'r'):
    if name in line:
        newfile.write(line)
newfile.close()
See : http://docs.python.org/tutorial/inputoutput.html#methods-of-file-objects
f.readlines() returns a list containing all the lines of data in the file.
An alternative approach to reading lines is to loop over the file object. This is memory efficient, fast, and leads to simpler code
>>> for line in f:
...     print line
Also, you can check out the with keyword. The advantage is that the file is properly closed after its suite finishes:
>>> with open(filename, 'r') as f:
...     read_data = f.read()
>>> f.closed
True
I know you asked for Python, but if you're on Unix, this is a job for grep:
grep name file
If you're not on Unix, well... the answer above does the trick :)