I have a directory that contains a lot of .csv files, and I am trying to write a script that runs on all the files in the directory and performs the following operation:
Remove the first and last lines from all the csv files
I am running the following code:
import glob
list_of_files = glob.glob('path/to/directory/*.csv')
for file_name in list_of_files:
    fi = open(file_name, 'r')
    fo = open(file_name.replace('csv', 'out'), 'w')  # make new output file for each file
    num_of_lines = file_name.read().count('\n')
    file_name.seek(0)
    i = 0
    for line in fi:
        if i != 1 and i != num_of_lines-1:
            fo.write(line)
    fi.close()
    fo.close()
And I run the script using python3 script.py. Though I don't get any error, I don't get any output file either.
There are multiple issues in your code. First of all, you count the number of lines on the filename (a string) instead of the file object, and likewise call seek on it. The second problem is that you initialize i = 0 and compare against it, but it never changes; note also that the first line has index 0, not 1.
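For reference, a minimal repair of your counter-based approach might look like this (besides the fixes above it uses i != 0 rather than i != 1, adds the missing i += 1, and assumes each file ends with a newline so that the '\n' count equals the number of lines):
fi = open(file_name, 'r')
fo = open(file_name.replace('csv', 'out'), 'w')
num_of_lines = fi.read().count('\n')  # count on the file object, not the name
fi.seek(0)
i = 0
for line in fi:
    if i != 0 and i != num_of_lines - 1:  # skip the first (index 0) and last line
        fo.write(line)
    i += 1
fi.close()
fo.close()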
Personally I would just convert the file to a list of "lines", cut off the first and last and write all of them to the new file:
import glob
list_of_files = glob.glob('path/to/directory/*.csv')
for file_name in list_of_files:
    with open(file_name, 'r') as fi:
        with open(file_name.replace('csv', 'out'), 'w') as fo:
            for line in list(fi)[1:-1]:  # for all lines except the first and last
                fo.write(line)
Using with allows you to omit the close calls; they happen implicitly, even if an exception occurs.
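For illustration, the with form is roughly equivalent to this try/finally sketch:
fi = open(file_name, 'r')  # file_name as above
try:
    pass  # work with fi here
finally:
    fi.close()  # runs even if an exception was raised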
In case that still gives no output, you could add a print statement that shows which file is being processed:
print(file_name) # just inside the for-loop before any `open` calls.
Since you're using python-3.5 you could also use pathlib:
import pathlib

path = pathlib.Path('path/to/directory/')
# make sure it's a valid directory
assert path.is_dir(), "{} is not a valid directory".format(path.absolute())
for file_name in path.glob('*.csv'):
    with file_name.open('r') as fi:
        with file_name.with_suffix('.out').open('w') as fo:
            for line in list(fi)[1:-1]:  # for all lines except the first and last
                fo.write(line)
As Jon Clements pointed out, there is a better way than [1:-1] to exclude the first and last line: a generator function. That way you definitely reduce the amount of memory used, and it might also improve the overall performance. For example you could use:
import pathlib
def ignore_first_and_last(it):
    it = iter(it)
    next(it, None)  # skip the first line (no-op on empty input)
    lastline = next(it, None)
    if lastline is None:  # fewer than two lines: nothing to keep
        return
    for nxtline in it:
        yield lastline  # safe to yield: at least one more line follows
        lastline = nxtline

path = pathlib.Path('path/to/directory/')
# make sure it's a valid directory
assert path.is_dir(), "{} is not a valid directory".format(path.absolute())
for file_name in path.glob('*.csv'):
    with file_name.open('r') as fi:
        with file_name.with_suffix('.out').open('w') as fo:
            for line in ignore_first_and_last(fi):  # for all lines except the first and last
                fo.write(line)
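A quick sanity check of the generator, using a hypothetical list in place of a file object:
print(list(ignore_first_and_last(['a\n', 'b\n', 'c\n', 'd\n'])))  # prints ['b\n', 'c\n']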
Related
How can I insert a string at the beginning of each line in a text file? I have the following code:
f = open('./ampo.txt', 'r+')
with open('./ampo.txt') as infile:
    for line in infile:
        f.insert(0, 'EDF ')
f.close
I get the following error:
'file' object has no attribute 'insert'
Python comes with batteries included:
import fileinput
import sys
for line in fileinput.input(['./ampo.txt'], inplace=True):
    sys.stdout.write('EDF {l}'.format(l=line))
Unlike the solutions already posted, this also preserves file permissions.
You can't modify a file inplace like that. Files do not support insertion. You have to read it all in and then write it all out again.
You can do this line by line if you wish. But in that case you need to write to a temporary file and then replace the original. So, for small enough files, it is just simpler to do it in one go like this:
with open('./ampo.txt', 'r') as f:
    lines = f.readlines()
lines = ['EDF ' + line for line in lines]
with open('./ampo.txt', 'w') as f:
    f.writelines(lines)
Here's a solution where you write to a temporary file and move it into place. You might prefer this version if the file you are rewriting is very large, since it avoids keeping the contents of the file in memory, as versions that involve .read() or .readlines() will. In addition, if there is any error in reading or writing, your original file will be safe:
from shutil import move
from tempfile import NamedTemporaryFile

filename = './ampo.txt'
tmp = NamedTemporaryFile(delete=False)
tmp.close()  # we only need the reserved name; reopen it in text mode below
with open(filename) as finput:
    with open(tmp.name, 'w') as ftmp:
        for line in finput:
            ftmp.write('EDF ' + line)
move(tmp.name, filename)
For a file not too big:
with open('./ampo.txt', 'r+') as f:  # text mode: with 'rb+' the str arguments below would fail in Python 3
    x = f.read()
    f.seek(0, 0)
    # note: if the file ends with a newline, this also appends 'EDF ' after it
    f.writelines(('EDF ', x.replace('\n', '\nEDF ')))
    f.truncate()
Note that, in this case (the content only grows), the f.truncate() is not strictly necessary: the file is simply left at its new, larger size. Closing a file does not write any EOF marker; the file's length is just however many bytes have been written to it.
That is exactly why truncate() matters when the content shrinks: the bytes beyond what you rewrote stay in the file, so trailing characters from the original content remain. It is therefore safer to keep the truncate() call in any case.
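A quick sketch with a hypothetical throwaway file, showing what happens without truncate() when the new content is shorter:
with open('demo.txt', 'w') as f:
    f.write('0123456789')
with open('demo.txt', 'r+') as f:
    f.write('AB')  # overwrites only the first two bytes
with open('demo.txt') as f:
    print(f.read())  # prints 'AB23456789': the old tail survives
# calling f.truncate() right after the write would cut the file at position 2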
For a big file, to avoid putting all the content of the file in RAM at once:
import os

def addsomething(filepath, ss):
    # build a temporary filename next to the original
    if filepath.rfind('.') > filepath.rfind(os.sep):
        a, _, c = filepath.rpartition('.')
        tempi = a + 'temp.' + c
    else:
        tempi = filepath + 'temp'
    with open(filepath, 'r') as f, open(tempi, 'w') as g:  # text mode, so ss (a str) can be prepended
        g.writelines(ss + line for line in f)
    os.remove(filepath)
    os.rename(tempi, filepath)

addsomething('./ampo.txt', 'WZE')
f = open('./ampo.txt', 'r')
lines = ['EDF ' + l for l in f.readlines()]
f.close()

f = open('./ampo.txt', 'w')
for l in lines:  # a bare map(f.write, lines) would never run in Python 3, since map is lazy
    f.write(l)
f.close()
I want to replace the first character of each line in the text file.
2 1.510932 0.442072 0.978141 0.872182
5 1.510932 0.442077 0.978141 0.872181
Above is my text file.
import sys
import glob
import os.path

list_of_files = glob.glob('/path/txt/23.txt')
for file_name in list_of_files:
    f = open(file_name, 'r')
    lst = []
    for line in f:
        f = open(file_name, 'w')
        if line.startswith("2 "):
            line = line.replace("2 ", "7")
            f.write(line)
    f.close()
What I want: if a line starts with 2, I want to change that 2 into 7. The problem is that the digit 2 can appear in multiple places on the same line, so if I replace every occurrence and save, everything changes.
Thanks
The proper solution is (pseudo code):
open sourcefile for reading as input
open temporaryfile for writing as output
for each line in input:
    fix the line
    write it to output
close input
close output
replace sourcefile with temporaryfile
We use a temporary file, writing as we go, to avoid potential memory errors.
I leave it up to you to translate this to Python (hint: that's quite straightforward).
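For reference, a minimal translation of that recipe, using the replacement from the question (a leading "2 " becomes "7 "), might look like:
import os
import tempfile

src = '/path/txt/23.txt'
with open(src) as infile, tempfile.NamedTemporaryFile(
        'w', dir=os.path.dirname(src) or '.', delete=False) as outfile:
    for line in infile:
        if line.startswith('2 '):    # fix the line: only the leading 2 changes
            line = '7 ' + line[2:]
        outfile.write(line)          # write it to output
os.replace(outfile.name, src)        # replace sourcefile with temporaryfile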
This is one approach.
Ex:
for file_name in list_of_files:
    data = []
    with open(file_name) as infile:
        for line in infile:
            line = line.rstrip("\n")  # drop the newline so it is not doubled on write
            if line.startswith("2 "):  # check line
                line = " ".join(['7'] + line.split()[1:])  # update line
            data.append(line)
    with open(file_name, "w") as outfile:  # write back to file
        for line in data:
            outfile.write(line + "\n")
I have a text file that has data.
PAS_BEGIN_3600000
CMD_VERS=2
CMD_TRNS=O
CMD_REINIT=
CMD_OLIVIER=
I want to extract data from that file, where nothing is after the equal sign.
So in my new text file, I want to get
CMD_REINIT
CMD_OLIVIER
How do I do this?
My code looks like this right now.
import os, os.path

DIR_DAT = "dat"
DIR_OUTPUT = "output"
print("Psst go check in the ouptut folder ;)")
for roots, dir, files in os.walk(DIR_DAT):
    for filename in files:
        filename_output = "/" + os.path.splitext(filename)[0]
        with open(DIR_DAT + "/" + filename) as infile, open(DIR_OUTPUT + "/bonjour.txt", "w") as outfile:
            for line in infile:
                if not line.strip().split("=")[-1]:
                    outfile.write(line)
I want to collect all the data in a single file, but it doesn't work. Can anyone help me?
The third step is to crawl that new file and keep only unique values: since four files are appended into a single one, some data might be there four, three, or two times.
And I need to keep, in a new file that I will call output.txt, only the lines that are common to all the files.
You can use regex:
import re

data = """PAS_BEGIN_3600000
CMD_VERS=2
CMD_TRNS=O
CMD_REINIT=
CMD_OLIVIER="""

found = re.findall(r"^\s*(.*)=\s*$", data, re.M)
print(found)
Output:
['CMD_REINIT', 'CMD_OLIVIER']
The expression looks for
^\s* line start plus optional whitespace
(.*)= anything before a =, which is captured as a group
\s*$ followed by optional whitespace and line end
using the re.M (multiline) flag.
Read your file's text like so:
with open("yourfile.txt", "r") as f:
    data = f.read()
Write your new file like so:
with open("newfile.txt", "w") as f:
    f.write('\n'.join(found))
You can use http://www.regex101.com to evaluate test text against regex patterns; make sure to switch to its Python mode.
I suggest the following short solution using a comprehension:
with open('file.txt', 'r') as f, open('newfile.txt', 'w') as newf:
    for x in (line.strip()[:-1] for line in f if line.strip().endswith("=")):
        newf.write(f'{x}\n')
Try this pattern: \w+(?==$).
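A quick usage sketch of that pattern against the sample data, with the re.M flag so $ matches at each line end:
import re

data = "CMD_VERS=2\nCMD_TRNS=O\nCMD_REINIT=\nCMD_OLIVIER="
print(re.findall(r"\w+(?==$)", data, re.M))  # prints ['CMD_REINIT', 'CMD_OLIVIER']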
Using a simple iteration.
Ex:
with open(filename) as infile, open(filename2, "w") as outfile:
    for line in infile:  # iterate over each line
        if not line.strip().split("=")[-1]:  # check whether anything follows the "="
            print(line.strip().strip("="))
            outfile.write(line)  # write to new file
Output:
CMD_REINIT
CMD_OLIVIER
I have a huge (240 MB) csv file in which the top 2 rows are junk data. I want to remove this junk data and use the data starting after that.
I would like to know what the best options are. Since it's a large file, creating a copy of the file and editing it would be a time-consuming process.
Below is an example of the csv:
junk,,,
,,,,
No,name,place,destination
1,abx,India,SA
What I would like to have is
No,name,place,destination
1,abx,India,SA
You can do this with tail quite easily:
tail -n+3 foo > result.data
The -n+3 tells tail to start output at line 3, which skips the first two rows. If you only needed to skip the first row, it would be:
tail -n+2 foo > result.data
You can find more ways here
https://unix.stackexchange.com/questions/37790/how-do-i-delete-the-first-n-lines-of-an-ascii-file-using-shell-commands
Just throw those lines away, then use DictReader to parse the header:
import csv

with open("filename") as fp:
    fp.readline()  # discard the first junk line
    fp.readline()  # discard the second junk line
    csvreader = csv.DictReader(fp, delimiter=',')
    for row in csvreader:
        pass  # your code here
Due to the way file systems work, you cannot simply delete the lines from the file directly. Any method to do so will necessarily involve rewriting the entire file with the offending lines removed.
To be safe, before deleting your old file, you'll want to store the new file temporarily until you are sure it has been successfully created. And if you want to avoid reading the entire large file into memory, you'll want to use a generator.
Here's a generator that returns every item from an iterable (such as a file-like object) after a certain number of items have already been returned:
def gen_after_x(iterable, x):
    # Python 3:
    yield from (item for index, item in enumerate(iterable) if index >= x)
    # Python 2 equivalent (use instead of the yield from line):
    # for index, item in enumerate(iterable):
    #     if index >= x:
    #         yield item
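A quick sanity check with a hypothetical list in place of a file object:
print(list(gen_after_x(['a', 'b', 'c', 'd'], 2)))  # prints ['c', 'd']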
To make things simpler, we'll create a function to write the temporary file:
def write_file(fname, lines):
    with open(fname, 'w') as f:
        for line in lines:
            f.write(line)  # lines read from a file already end with '\n'
We will also need the os.remove and os.rename functions from the os module to delete the source file and rename the temp file. And we'll need copyfile from shutil to make a copy, so we can safely delete the source file.
Now to put it all together:
from os import remove, rename
from shutil import copyfile

src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as fin:
    olines = gen_after_x(fin, skip)
    write_file(tmp_file, olines)

src_file_copy = src_file + '_copy'
copyfile(src_file, src_file_copy)
try:
    remove(src_file)
    rename(tmp_file, src_file)
    remove(src_file_copy)
except Exception:
    try:
        copyfile(src_file_copy, src_file)
        remove(src_file_copy)
        remove(tmp_file)
    except Exception:
        pass
    raise
However, I would note that 240 MB isn't such a huge file these days; you may find it faster to do this the usual way since it cuts down on repetitive disk writes:
src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as f:
    lines = f.readlines()
del lines[:skip]
with open(tmp_file, 'w') as f:
    f.writelines(lines)  # readlines() keeps the '\n' on each line

src_file_copy = src_file + '_copy'
copyfile(src_file, src_file_copy)
try:
    remove(src_file)
    rename(tmp_file, src_file)
    remove(src_file_copy)
except Exception:
    try:
        copyfile(src_file_copy, src_file)
        remove(src_file_copy)
        remove(tmp_file)
    except Exception:
        pass
    raise
...or if you prefer the more risky way:
with open(src_file) as f:
    lines = f.readlines()
del lines[:skip]
with open(src_file, 'w') as f:
    f.writelines(lines)
I have a text file with the following contents:
1st line
2nd line
3rd line
4th line
...
Now I want to insert a new line, "zero line", before the 1st line, so the file looks like this:
zero line
1st line
2nd line
3rd line
4th line
...
I know the sed command can do this, but how can I do it using Python? Thanks
You can use fileinput:
>>> import fileinput
>>> for linenum, line in enumerate(fileinput.FileInput("file", inplace=1)):
...     if linenum == 0:
...         print("new line")
...         print(line.rstrip())
...     else:
...         print(line.rstrip())
...
This might be of interest:
http://net4geeks.com/index.php?option=com_content&task=view&id=53&Itemid=11
Adapted to your question:
# read the current contents of the file
f = open('filename')
text = f.read()
f.close()

# open the file again for writing
f = open('filename', 'w')
f.write("zero line\n")  # a single newline, so no blank line is inserted
# write the original contents
f.write(text)
f.close()
Open the file and read the contents into 'text'.
Close the file.
Reopen the file with argument 'w' to write.
Write the text to prepend to the file.
Write the original contents of the file to the file.
Close the file.
Read the warnings in the link.
edit:
But note that this isn't entirely safe: if your Python session crashes after opening the file the second time and before closing it again, you will lose data.
Here's an implementation that fixes some deficiencies in the other approaches presented so far:
it doesn't lose data in case of an error (#kriegar's version does)
it supports empty files (the fileinput version does not)
it preserves the original data: it doesn't mangle trailing whitespace (the fileinput version does)
and it does not read the whole file into memory, as the version from net4geeks.com does.
It mimics fileinput's error handling:
import os
def prepend(filename, data, bufsize=1<<15):
    # backup the file
    backupname = filename + os.extsep + 'bak'
    try: os.unlink(backupname)  # remove previous backup if it exists
    except OSError: pass
    os.rename(filename, backupname)

    # open input/output files, note: outputfile's permissions lost
    with open(backupname) as inputfile, open(filename, 'w') as outputfile:
        # prepend
        outputfile.write(data)
        # copy the rest
        buf = inputfile.read(bufsize)
        while buf:
            outputfile.write(buf)
            buf = inputfile.read(bufsize)

    # remove backup on success
    try: os.unlink(backupname)
    except OSError: pass

prepend('file', '0 line\n')
You could use the cat utility, if it is available, to copy the files. It might be more efficient:
import os
from subprocess import PIPE, Popen

def prepend_cat(filename, data, bufsize=1<<15):
    # backup the file
    backupname = filename + os.extsep + 'bak'
    try: os.unlink(backupname)
    except OSError: pass
    os.rename(filename, backupname)

    # $ echo $data | cat - $backupname > $filename
    with open(filename, 'w') as outputfile:  # note: outputfile's permissions lost
        p = Popen(['cat', '-', backupname], stdin=PIPE, stdout=outputfile)
        p.communicate(data.encode())  # the pipe expects bytes in Python 3

    # remove backup on success
    if p.poll() == 0:
        try: os.unlink(backupname)
        except OSError: pass

prepend_cat('file', '0 line\n')
with open(filename, 'r+') as f:
    lines = f.readlines()
    lines.insert(0, 'zero line\n')
    f.seek(0)
    f.writelines(lines)  # no truncate() needed: the rewritten content is longer than the original
code
L = list()
f = open('text2.txt', 'r')
for line in f.readlines():
    L.append(line)
L.insert(0, "Zero\n")
f.close()

fi = open('text2.txt', 'w')
for line in L:  # xrange is Python 2 only; iterate over the list directly
    fi.write(line)
fi.close()
text2.txt
Hello
The second line
3
4
5
6
output
Zero
Hello
The second line
3
4
5
6
This can be memory heavy and time consuming for large files however.