Delete rows from a huge csv file in python

I have a huge (240 MB) csv file in which the top 2 rows are junk data. I want to remove this junk data and use the data starting after that.
I would like to know what the best options are. Since it is a large file, creating a copy of the file and editing it would be a time-consuming process.
Below is an example of the csv:
junk,,,
,,,,
No,name,place,destination
1,abx,India,SA
What I would like to have is
No,name,place,destination
1,abx,India,SA

You can do this with tail quite easily:
tail -n +3 foo > result.data
tail -n +K starts output at line K, so -n +3 drops the first 2 lines, which matches your example. If you ever need to drop the first 3 lines instead, use:
tail -n +4 foo > result.data
You can find more ways here:
https://unix.stackexchange.com/questions/37790/how-do-i-delete-the-first-n-lines-of-an-ascii-file-using-shell-commands
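If you would rather stay in Python than shell out, here is a minimal sketch of the same idea using itertools.islice (the filenames are the same placeholders as above):
import itertools

with open('foo') as fin, open('result.data', 'w') as fout:
    # islice(fin, 2, None) skips the first 2 lines and yields the rest unchanged
    fout.writelines(itertools.islice(fin, 2, None))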

Just throw those lines away.
Use DictReader to parse the header:
import csv

with open("filename") as fp:
    fp.readline()  # throw away the first junk line
    fp.readline()  # throw away the second junk line
    csvreader = csv.DictReader(fp, delimiter=',')
    for row in csvreader:
        # your code here
        pass

Due to the way file systems work, you cannot simply delete the lines from the file directly. Any method to do so will necessarily involve rewriting the entire file with the offending lines removed.
To be safe, before deleting your old file, you'll want to store the new data in a temporary file until you are sure it has been successfully created. And if you want to avoid reading the entire large file into memory, you'll want to use a generator.
Here's a generator that returns every item from an iterable (such as a file-like object) after a certain number of items have already been returned:
def gen_after_x(iterable, x):
    # Python 3:
    yield from (item for index, item in enumerate(iterable) if index >= x)
    # Python 2 equivalent:
    # for index, item in enumerate(iterable):
    #     if index >= x:
    #         yield item
To make things simpler, we'll create a function to write the temporary file:
def write_file(fname, lines):
    with open(fname, 'w') as f:
        for line in lines:
            f.write(line)  # lines read from a file already end with '\n'
We will also need the os.remove and os.rename functions from the os module to delete the source file and rename the temp file. And we'll need copyfile from shutil to make a copy, so we can safely delete the source file.
Now to put it all together:
from os import remove, rename
from shutil import copyfile

src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as fin:
    olines = gen_after_x(fin, skip)
    write_file(tmp_file, olines)

src_file_copy = src_file + '_copy'
copyfile(src_file, src_file_copy)

try:
    remove(src_file)
    rename(tmp_file, src_file)
    remove(src_file_copy)
except Exception:
    try:
        copyfile(src_file_copy, src_file)
        remove(src_file_copy)
        remove(tmp_file)
    except Exception:
        pass
    raise
However, I would note that 240 MB isn't such a huge file these days; you may find it faster to do this the usual way since it cuts down on repetitive disk writes:
src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as f:
    lines = f.readlines()
for _ in range(skip):
    lines.pop(0)

with open(tmp_file, 'w') as f:
    f.writelines(lines)  # readlines() keeps each line's '\n', so no extra newlines are needed

src_file_copy = src_file + '_copy'
copyfile(src_file, src_file_copy)

try:
    remove(src_file)
    rename(tmp_file, src_file)
    remove(src_file_copy)
except Exception:
    try:
        copyfile(src_file_copy, src_file)
        remove(src_file_copy)
        remove(tmp_file)
    except Exception:
        pass
    raise
...or if you prefer the riskier way:
with open(src_file) as f:
    lines = f.readlines()
for _ in range(skip):
    lines.pop(0)
with open(src_file, 'w') as f:
    f.writelines(lines)
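As another option for keeping memory use low, you can skip the junk lines once and then stream the remainder of the file in large blocks rather than line by line; a minimal sketch, using the same placeholder filenames as above:
import shutil

src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as fin, open(tmp_file, 'w') as fout:
    for _ in range(skip):
        fin.readline()             # consume and discard the junk lines
    shutil.copyfileobj(fin, fout)  # stream the rest of the file in chunks
The same rename-with-backup dance shown above would still apply if you then want to replace the original file with the temporary one.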

Related

Python - copy the content of a file to another one except the first two and last two lines

I need to copy the content of a file to another one except the first two and the last two lines of that file.
For that purpose I wrote this function
def copy_file(self, input_file, output_file):
    line_nr = 1
    fout = open(output_file, "w+")
    with open(input_file, 'r') as fp:
        for line in fp:
            if line_nr == 1 or line_nr == 2 or line_nr == 53 or line_nr == 54:
                line_nr += 1
                continue
            new_line = line.split(' ', 1)[-1]
            fout.write(new_line)
            line_nr += 1
    fout.close()
I was able to achieve what I desired by hardcoding the line numbers that I wanted to ignore, but I would like to make this function more generic. How can I do it?
I want to achieve something like this:
Original file
line to be ignored
line to be ignored
Timestamp message
Timestamp message
line to be ignored
line to be ignored
Expected output
message
message
What you are doing is going through the file manually, line by line, which as you can see can be impractical and tedious. A good alternative is to use list slicing together with f.read().splitlines(True).
Try this:
def copy_file(input_file, output_file):
    fout = open(output_file, "w+")
    with open(input_file, 'r') as fp:
        text = fp.read().splitlines(True)[2:-2]  # keep everything except the first 2 and last 2 lines
        fout.writelines(text)
    fout.close()
The beginning of the code is the same as yours. However, instead of looping through the file line by line, I use fp.read().splitlines(True). This returns the entire file as a list in which each element is one line (the True keeps the line endings). From there, list slicing with [2:-2] gets rid of the first 2 and last 2 elements of the list. Finally, these lines are written to the new file with fout.writelines(text), which writes each string in the list to the file.
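To make the slicing concrete, here is a tiny illustration (with made-up data) of what splitlines(True) returns and what the [2:-2] slice keeps:
text = "skip1\nskip2\nkeep A\nkeep B\nskip3\nskip4\n"
print(text.splitlines(True))
# ['skip1\n', 'skip2\n', 'keep A\n', 'keep B\n', 'skip3\n', 'skip4\n']
print(text.splitlines(True)[2:-2])
# ['keep A\n', 'keep B\n']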
Here is how to do this efficiently, keeping at most "skip_last" elements in memory. This uses itertools.islice to slice the file iterator, and a collections.deque to efficiently keep a buffer of lines:
import itertools
import collections

def copy_file(
    input_file: str,
    output_file: str,
    skip_first: int,
    skip_last: int
) -> None:
    with open(input_file) as fin, open(output_file, 'w') as fout:
        skipped_beginning = itertools.islice(fin, skip_first, None)
        buffer = collections.deque(maxlen=skip_last)
        buffer.extend(itertools.islice(skipped_beginning, skip_last))
        for line in skipped_beginning:
            fout.write(buffer.popleft())
            buffer.append(line)
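For example, with hypothetical filenames, this copies everything except the first two and last two lines:
copy_file('input.log', 'output.log', skip_first=2, skip_last=2)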

Overwriting lines in text file [duplicate]

How can I insert a string at the beginning of each line in a text file? I have the following code:
f = open('./ampo.txt', 'r+')
with open('./ampo.txt') as infile:
    for line in infile:
        f.insert(0, 'EDF ')
f.close
I get the following error:
'file' object has no attribute 'insert'
Python comes with batteries included:
import fileinput
import sys

for line in fileinput.input(['./ampo.txt'], inplace=True):
    sys.stdout.write('EDF {l}'.format(l=line))
Unlike the solutions already posted, this also preserves file permissions.
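If you want a safety copy while rewriting in place, fileinput can also keep a backup of the original file; a small variation:
import fileinput
import sys

# backup='.bak' keeps the original content as ampo.txt.bak
for line in fileinput.input(['./ampo.txt'], inplace=True, backup='.bak'):
    sys.stdout.write('EDF ' + line)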
You can't modify a file in place like that. Files do not support insertion. You have to read it all in and then write it all out again.
You can do this line by line if you wish, but in that case you need to write to a temporary file and then replace the original. So, for small enough files, it is simpler to do it in one go like this:
with open('./ampo.txt', 'r') as f:
    lines = f.readlines()
lines = ['EDF ' + line for line in lines]
with open('./ampo.txt', 'w') as f:
    f.writelines(lines)
Here's a solution where you write to a temporary file and move it into place. You might prefer this version if the file you are rewriting is very large, since it avoids keeping the contents of the file in memory, as versions that involve .read() or .readlines() will. In addition, if there is any error in reading or writing, your original file will be safe:
from shutil import move
from tempfile import NamedTemporaryFile

filename = './ampo.txt'
tmp = NamedTemporaryFile(delete=False)

with open(filename) as finput:
    with open(tmp.name, 'w') as ftmp:
        for line in finput:
            ftmp.write('EDF ' + line)

move(tmp.name, filename)
For a file that is not too big:
with open('./ampo.txt', 'r+') as f:  # text mode so str can be written (Python 3)
    x = f.read()
    f.seek(0, 0)
    f.writelines(('EDF ', x.replace('\n', '\nEDF ')))
    f.truncate()
Note that, in this particular case (the content only grows), f.truncate() is not strictly necessary: the rewritten content is at least as long as the original, so nothing stale is left past what was written. truncate() matters when the new content is shorter than the old one; without it, the tail of the old content remains in the file after the write. Closing the file (which the with statement does) does not remove those leftover bytes, so it is safer to keep the truncate() call anyway.
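A tiny illustration (with a throwaway file) of why truncate() matters when the new content is shorter:
with open('demo.txt', 'w') as f:
    f.write('0123456789')

with open('demo.txt', 'r+') as f:
    f.write('abc')      # overwrites only the first 3 characters

with open('demo.txt') as f:
    print(f.read())     # 'abc3456789' -- the old tail is still there without truncate()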
For a big file, to avoid loading the entire content into RAM at once:
import os

def addsomething(filepath, ss):
    # build a temporary filename next to the original, e.g. 'ampo.txt' -> 'ampotemp.txt'
    if filepath.rfind('.') > filepath.rfind(os.sep):
        a, _, c = filepath.rpartition('.')
        tempi = a + 'temp.' + c
    else:
        tempi = filepath + 'temp'
    # text mode so ss (a str) can be prefixed in Python 3
    with open(filepath, 'r') as f, open(tempi, 'w') as g:
        g.writelines(ss + line for line in f)
    os.remove(filepath)
    os.rename(tempi, filepath)

addsomething('./ampo.txt', 'WZE')
f = open('./ampo.txt', 'r')
lines = ['EDF ' + l for l in f.readlines()]
f.close()
f = open('./ampo.txt', 'w')
f.writelines(lines)  # a bare map() would be lazy in Python 3 and write nothing
f.close()

Write to files with dynamic file names?

I need to split a large txt file (about 100GB, 1 billion rows) by DATE. The file looks like this
ID*DATE*company
1111*201101*geico
1234*201402*travelers
3214*201003*statefarm
...
Basically there are 60 months so I should be getting 60 sub-files. My Python script is
with open("myBigFile.txt") as f:
for line in f:
claim = line.split("*")
with open("DATE-"+str(claim[1])+".txt", "a") as fy:
fy.write(claim[0]+"*"+claim[2]+"\n")
Now, since the number of records is huge, this runs too slowly because it has to open and close a file for every row. So I'm thinking about opening the 60 sub-files first, then scanning the big file once and writing each row to the corresponding sub-file, without closing the sub-files until all the rows have been scanned. However, since Python automatically closes a file whenever its reference is dropped (http://blog.lerner.co.il/dont-use-python-close-files-answer-depends/), I would have to address the file handle through some dynamic name, something like
claim[1].write(claim[0] + "*" + claim[2] + "\n")
Note that I can't just reuse a single name fy and call fy.write(claim[0] + "*" + claim[2] + "\n"), because rebinding fy would close the previous file. Is that possible in Python? Thanks!
How about something like this:
with open("myBigFile.txt") as f:
subfiles = {}
for line in f:
claim = line.split("*")
if not str(claim[1]) in subfiles:
subfiles[str(claim[1])] = open("DATE-" + str(claim[1]) + ".txt", "a")
subfile[str(claim[1])].write(claim[0]+"*"+claim[2]+"\n")
I believe this should do it.
Just to mention, I have currently not placed a limit on the number of files open at a given moment. To implement that, check the size of the dictionary with len() and close all of the files, or a few of them, before opening new ones, as sketched below.
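A minimal sketch of that idea, assuming an arbitrary cap of 100 open handles and a simple close-everything strategy (both are illustrative choices, not part of the original answer):
MAX_OPEN = 100

with open("myBigFile.txt") as f:
    subfiles = {}
    for line in f:
        claim = line.rstrip('\n').split("*")
        date = claim[1]
        if date not in subfiles:
            if len(subfiles) >= MAX_OPEN:
                # close everything and start over; append mode keeps the output correct
                for handle in subfiles.values():
                    handle.close()
                subfiles.clear()
            subfiles[date] = open("DATE-" + date + ".txt", "a")
        subfiles[date].write(claim[0] + "*" + claim[2] + "\n")
    for handle in subfiles.values():
        handle.close()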
Here's a solution that closes the file handles as part of a context manager; unlike the other answers, this also closes the subfiles when an error happens :-)
from contextlib import contextmanager

@contextmanager
def file_writer():
    fp = {}

    def write(line):
        id, date, company = line.rstrip('\n').split('*')
        outdata = "{}*{}\n".format(id, company)
        try:
            fp[date].write(outdata)
        except KeyError:
            fname = 'DATE-{}.txt'.format(date)
            fp[date] = open(fname, 'a')  # should it be a+?
            fp[date].write(outdata)

    try:
        yield write
    finally:
        # the finally clause closes the subfiles even if an error escapes the with block
        for f in fp.values():
            f.close()

def process():
    with open("myBigFile.txt") as f:
        with file_writer() as write:
            for i, line in enumerate(f):
                try:
                    write(line)
                except Exception:
                    print('the error happened on line %d [%s]' % (i, line))
I don't know if there is anything more that can be done speed-wise on a single processor/disk. You can always split the file into n chunks and use n processes to process a chunk each (where n is the number of separate disks you have available).
You can use the csv module to simplify slightly, and use a dictionary to store the file objects:
import csv

with open("myBigFile.txt") as big_file:
    reader = csv.reader(big_file, delimiter='*')
    subfiles = {}
    for id, date, company in reader:
        try:
            subfile = subfiles[date]
        except KeyError:
            subfile = open('DATE-{}.txt'.format(date), 'a')
            subfiles[date] = subfile
        subfile.write('{}*{}\n'.format(id, company))
    for subfile in subfiles.values():
        subfile.close()
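If you prefer, csv.writer can handle the output formatting as well, so the '*' joining and the newline live in one place; a small variation on the same idea (same filenames, with lineterminator='\n' to match the output above):
import csv

with open("myBigFile.txt", newline='') as big_file:
    reader = csv.reader(big_file, delimiter='*')
    files = {}
    writers = {}
    for id, date, company in reader:
        if date not in writers:
            files[date] = open('DATE-{}.txt'.format(date), 'a', newline='')
            writers[date] = csv.writer(files[date], delimiter='*', lineterminator='\n')
        writers[date].writerow([id, company])
    for f in files.values():
        f.close()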

Python glob gives no result

I have a directory that contains a lot of .csv files, and I am trying to write a script that runs on all the files in the directory while doing the following operation:
Remove the first and last lines from all the csv files
I am running the following code:
import glob

list_of_files = glob.glob('path/to/directory/*.csv')
for file_name in list_of_files:
    fi = open(file_name, 'r')
    fo = open(file_name.replace('csv', 'out'), 'w')  # make new output file for each file
    num_of_lines = file_name.read().count('\n')
    file_name.seek(0)
    i = 0
    for line in fi:
        if i != 1 and i != num_of_lines - 1:
            fo.write(line)
    fi.close()
    fo.close()
And I run the script using python3 script.py. Though I don't get any error, I don't get any output file either.
There are multiple issues in your code. First of all, you count the number of lines on the file name (a string) instead of the file object. The second problem is that you initialize i = 0 and compare against it, but it never changes, so no line is ever skipped as intended.
Personally I would just convert the file to a list of "lines", cut off the first and last, and write the rest to the new file:
import glob

list_of_files = glob.glob('path/to/directory/*.csv')
for file_name in list_of_files:
    with open(file_name, 'r') as fi:
        with open(file_name.replace('csv', 'out'), 'w') as fo:
            for line in list(fi)[1:-1]:  # all lines except the first and last
                fo.write(line)
Using with open allows you to omit the close calls (they are done implicitly), even if an exception occurs.
In case that still gives no output, you could add a print statement that shows which file is being processed:
print(file_name)  # just inside the for loop, before any `open` calls
Since you're using python-3.5 you could also use pathlib:
import pathlib

path = pathlib.Path('path/to/directory/')
# make sure it's a valid directory
assert path.is_dir(), "{} is not a valid directory".format(path.absolute())

for file_name in path.glob('*.csv'):
    with file_name.open('r') as fi:
        with pathlib.Path(str(file_name).replace('.csv', '.out')).open('w') as fo:
            for line in list(fi)[1:-1]:  # all lines except the first and last
                fo.write(line)
As Jon Clements pointed out, there is a better way than [1:-1] to exclude the first and last line: use a generator function. That way you definitely reduce the amount of memory used, and it might also improve the overall performance. For example, you could use:
import pathlib

def ignore_first_and_last(it):
    it = iter(it)
    firstline = next(it)   # read and discard the first line
    lastline = next(it)
    for nxtline in it:
        yield lastline     # always lag one line behind, so the last line is never yielded
        lastline = nxtline

path = pathlib.Path('path/to/directory/')
# make sure it's a valid directory
assert path.is_dir(), "{} is not a valid directory".format(path.absolute())

for file_name in path.glob('*.csv'):
    with file_name.open('r') as fi:
        with pathlib.Path(str(file_name).replace('.csv', '.out')).open('w') as fo:
            for line in ignore_first_and_last(fi):  # all lines except the first and last
                fo.write(line)
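As a side note, pathlib can also build the output name for you: with_suffix avoids the string round-trip. A small variation of the loop above, reusing path and ignore_first_and_last:
for file_name in path.glob('*.csv'):
    out_name = file_name.with_suffix('.out')  # e.g. data.csv -> data.out
    with file_name.open('r') as fi, out_name.open('w') as fo:
        for line in ignore_first_and_last(fi):
            fo.write(line)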

Python - Script that appends rows; checks for duplicates before writing

I'm writing a script with a for loop that extracts a list of variables from each 'data_i.csv' file in a folder and then appends that list as a new row to a single 'output.csv' file.
My objective is to define the headers of the file once and then append data to the 'output.csv' container file so it functions as a backlog for a standard measurement.
The first time I run the script it will add all the files in the folder. The next time I run it, I want it to only append files that have been added since. I thought one way of doing this would be to check for duplicates, but the code I found for that so far only checks for consecutive duplicates.
Do you have suggestions?
Here's what I have so far:
import csv, os

# Find csv files
for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('.csv'):
        continue
    # Read in csv file and choose certain cells
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace='True')
    csvLines = list(csvData)
    cellID = csvLines[4][3]
    # Read in several variables...
    csvRows = [cellID]
    csvFileObj.close()
    resultFile = open("Output.csv", 'a')  # open in 'append' mode
    wr = csv.writer(resultFile)
    wr.writerows([csvRows])
    resultFile.close()
This is the final script after mgc's answer:
import csv, os

f = open('Output.csv', 'r+')
merged_files = list(csv.reader(f))
for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('_spm.txt'):
        continue
    if csvFilename in merged_files:
        continue
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace='True')
    csvLines = list(csvData)
    waferID = csvLines[4][3]
    temperature = csvLines[21][2]
    csvRows = [waferID, temperature]
    merged_files.append(csvRows)
    csvFileObj.close()
wr = csv.writer(f)
wr.writerows(merged_files)
f.close()
You can keep track of the name of each file already handled. If this log file doesn't need to be human readable, you can use pickle. At the start of your script, you can do:
import pickle

try:
    with open('merged_log', 'rb') as f:
        merged_files = pickle.load(f)
except FileNotFoundError:
    merged_files = set()
Then you can add a condition to skip files that have already been processed:
if filename in merged_files: continue
Then, when you are processing a file, you can do:
merged_files.add(filename)
And save the set at the end of your script (so it can be reused on the next run):
with open('merged_log', 'wb') as f:
    pickle.dump(merged_files, f)
(There are, however, other options for your problem: for example, you could slightly change the name of each file once it has been processed, such as changing the extension from .csv to .csv_, or move processed files into a subfolder, etc.)
Also, in the example in your question, I don't think you need to open (and close) your output file on each iteration of your for loop. Open it once before the loop, write what you have to write, then close it after you have left the loop.
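Putting the pieces together, a minimal sketch of the whole loop (the '_spm.txt' pattern and the cell indices are taken from your script; 'merged_log' is the hypothetical log name used above):
import csv
import os
import pickle

# load the set of already-processed filenames, if any
try:
    with open('merged_log', 'rb') as f:
        merged_files = pickle.load(f)
except FileNotFoundError:
    merged_files = set()

with open('Output.csv', 'a') as result_file:  # open the output once, outside the loop
    wr = csv.writer(result_file)
    for csv_filename in os.listdir('.'):
        if not csv_filename.endswith('_spm.txt'):
            continue
        if csv_filename in merged_files:
            continue                          # already handled on a previous run
        with open(csv_filename) as fp:
            lines = list(csv.reader(fp, delimiter=' ', skipinitialspace=True))
        wafer_id = lines[4][3]
        temperature = lines[21][2]
        wr.writerow([wafer_id, temperature])
        merged_files.add(csv_filename)

# remember what has been processed for the next run
with open('merged_log', 'wb') as f:
    pickle.dump(merged_files, f)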
