How to sample a very big CSV file (6 GB) - Python

There is a big CSV file (with the first line as a header). I want to sample it into 100 pieces (by line_num % 100, for example). How can I do that efficiently under a main-memory constraint?
That is, separate the file into 100 smaller ones: lines whose number is 1 mod 100 go to sub-file 1, lines whose number is 2 mod 100 go to sub-file 2, ..., and so on up to sub-file 100,
so that I get 100 files of roughly 600 MB each.
I do not want just 100 lines, or a single 1/100-size sample.
I tried this:
fi = [open('split_data//%d.csv' % i, 'w') for i in range(100)]
i = 0
with open('data//train.csv') as fin:
    first = fin.readline()  # skip the header line
    for line in fin:
        fi[i % 100].write(line)
        i = i + 1
for i in range(100):
    fi[i].close()
But the file is too big to run this comfortably with limited memory. How should I deal with that?
I would like to do it in a single pass.
(My code does work, but it takes so much time that I mistakenly thought it had crashed; sorry about that.)

To split a file into 100 parts as stated in the comments ("I want to split the file into 100 parts by taking the line number modulo 100, i.e. for range(200) that gives [0, 100], [1, 101], [2, 102], ... and yes, separate one big file into hundreds of smaller files"):
import csv

files = [open('part_{}'.format(n), 'wb') for n in xrange(100)]
csvouts = [csv.writer(f) for f in files]

with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    next(csvin, None)  # Skip header
    for rowno, row in enumerate(csvin):
        csvouts[rowno % 100].writerow(row)

for f in files:
    f.close()
You can also islice over the file with a step, instead of taking the line number modulo 100, e.g.:
import csv
from itertools import islice

with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    # Skip header, and then return every 100th row until the file ends
    for line in islice(csvin, 1, None, 100):
        pass  # do something with line
Example:
r = xrange(1000)
res = list(islice(r, 1, None, 100))
# [1, 101, 201, 301, 401, 501, 601, 701, 801, 901]
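If you need a specific one of the 100 interleaved sub-samples (say the k-th) rather than the first, only the start offset changes. A minimal sketch reusing the names above ('yourcsv' is still a placeholder):
import csv
from itertools import islice

k = 7  # which of the 100 interleaved sub-samples to take (0-99)
with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    # row index 0 is the header, so sub-sample k starts at row index k + 1
    for row in islice(csvin, k + 1, None, 100):
        pass  # do something with row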

Based on Jon Clements' answer, I would also benchmark this variation:
import csv
from itertools import islice

with open('in.csv') as fin:
    first = fin.readline()  # discard the header
    csvin = csv.reader(islice(fin, None, None, 100))  # this line is the only difference
    for row in csvin:
        print row  # do something with row
If you only want 100 samples, you can use this idea which just makes 100 reads at equally spaced locations within the file. This should work well for CSV files whose line lengths are essentially uniform.
import os
import csv

def sample100(path):
    with open(path) as fin:
        end = os.fstat(fin.fileno()).st_size
        fin.readline()                 # skip the first line (the header)
        start = fin.tell()
        step = (end - start) // 100    # integer step so seek() always gets an int
        offset = start
        while offset < end:
            fin.seek(offset)
            fin.readline()             # this might not be a complete line
            if fin.tell() < end:
                yield fin.readline()   # this is a complete non-empty line
            else:
                break                  # not really necessary...
            offset = offset + step

for row in csv.reader(sample100('in.csv')):
    pass  # do something with row

I think you can just open the same file 10 times and then read each handle independently, effectively splitting the file into sub-files without actually creating them.
Unfortunately this requires knowing in advance how many rows there are in the file, and that requires reading the whole thing once to count them. On the other hand, that pass should be relatively quick since no other processing takes place.
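If the only reason for the first pass is the row count, a plain byte-level count of newlines is usually much faster than running the csv parser over the whole file. A small sketch (it assumes no embedded newlines inside quoted fields, and that the file ends with a newline):
def count_lines(path, bufsize=1024 * 1024):
    # count newline characters by reading the file in large binary chunks
    count = 0
    with open(path, 'rb') as f:
        chunk = f.read(bufsize)
        while chunk:
            count += chunk.count(b'\n')
            chunk = f.read(bufsize)
    return count

num_rows = count_lines('mycsvfile.csv') - 1  # subtract the header line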
To illustrate and test this approach I created a simpler — only one item per row — and much smaller csv test file that looked something like this (the first line is the header row and not counted):
line_no
1
2
3
4
5
...
9995
9996
9997
9998
9999
10000
Here's the code and sample output:
from collections import deque
import csv

# count number of rows in csv file
# (this requires reading the whole file)
file_name = 'mycsvfile.csv'
with open(file_name, 'rb') as csv_file:
    for num_rows, _ in enumerate(csv.reader(csv_file)):
        pass

rows_per_section = num_rows // 10
print 'number of rows: {:,d}'.format(num_rows)
print 'rows per section: {:,d}'.format(rows_per_section)

csv_files = [open(file_name, 'rb') for _ in xrange(10)]
csv_readers = [csv.reader(f) for f in csv_files]
map(next, csv_readers)  # skip header

# position each file handle at its starting position in the file
for i in xrange(10):
    for j in xrange(i * rows_per_section):
        try:
            next(csv_readers[i])
        except StopIteration:
            pass

# read rows from each of the sections
for i in xrange(rows_per_section):
    # elements are one row from each section
    rows = [next(r) for r in csv_readers]
    print rows  # show what was read

# clean up
for i in xrange(10):
    csv_files[i].close()
Output:
number of rows: 10,000
rows per section: 1,000
[['1'], ['1001'], ['2001'], ['3001'], ['4001'], ['5001'], ['6001'], ['7001'], ['8001'], ['9001']]
[['2'], ['1002'], ['2002'], ['3002'], ['4002'], ['5002'], ['6002'], ['7002'], ['8002'], ['9002']]
...
[['998'], ['1998'], ['2998'], ['3998'], ['4998'], ['5998'], ['6998'], ['7998'], ['8998'], ['9998']]
[['999'], ['1999'], ['2999'], ['3999'], ['4999'], ['5999'], ['6999'], ['7999'], ['8999'], ['9999']]
[['1000'], ['2000'], ['3000'], ['4000'], ['5000'], ['6000'], ['7000'], ['8000'], ['9000'], ['10000']]
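Incidentally, the code above imports deque but never uses it; the standard itertools "consume" recipe built on deque could replace the inner skip loop, and is usually faster than calling next() in a Python-level loop. A sketch under the same setup (csv_readers and rows_per_section as defined above):
from collections import deque
from itertools import islice

def consume(iterator, n):
    # advance the iterator n steps ahead (the standard 'consume' recipe)
    deque(islice(iterator, n), maxlen=0)

# position each reader at the start of its section
for i in xrange(10):
    consume(csv_readers[i], i * rows_per_section)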

Related

Find and remove duplicates in a CSV file

I have a large CSV file (1.8 GB) with three columns. Each row contains two strings and a numerical value. The problem is that some rows are duplicates, just with the two strings swapped.
Example:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
The desired output would look like this:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
because the third row contains the same information as the first row.
EDIT
The data basically looks like the example above: strings in the first two columns and a numerical value in the third, 40 million lines in total.
Can you use awk? If so:
$ awk -F, '++seen[$3]==1' file
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
Explained:
$ awk -F, '      # set comma as the field delimiter
++seen[$3]==1    # count occurrences of the third field in a hash, printing only the first
' file
Update:
$ awk -F, '++seen[($1<$2?$1 FS $2:$2 FS $1)]==1' file
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
It hashes every combination of the first and second fields it meets, so that "ABC,DEF" == "DEF,ABC", counts them, and prints only the first occurrence. The expression ($1<$2 ? $1 FS $2 : $2 FS $1) means: if the first field sorts before the second, use 1st,2nd as the key, otherwise 2nd,1st.
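Since the question originally carried the python tag, a rough Python equivalent of the same idea (my own sketch, not part of the awk answer; the file names are placeholders) keeps only the normalized first-two-field key in memory, not the whole line:
seen = set()
with open("in.csv") as fin, open("out.csv", "w") as fout:
    for line in fin:
        fields = line.rstrip("\n").split(",")
        # normalize the first two fields so that ABC,DEF and DEF,ABC map to the same key
        key = (fields[0], fields[1]) if fields[0] < fields[1] else (fields[1], fields[0])
        if key not in seen:
            seen.add(key)
            fout.write(line)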
From the problem description, a line should be kept only when the combination of its first and second fields, taken in either order, has not been seen before. If so, the awk below would help:
awk -F, '{seen[$1,$2]++;seen[$2,$1]++}seen[$1,$2]==1 && seen[$2,$1]==1' filename
Sample Input
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
GHI,ABC,123
DEF,ABC,123
ABC,GHI,123
DEF,GHI,123
Sample Output
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
GHI,ABC,123
DEF,GHI,123
Note: this answer was written before the OP replaced the python tag with the awk tag.
If you don't mind the order of the elements you might do:
with open("in.csv", "r") as file:
lines = set()
for line in file:
lines.add(frozenset(line.strip("\n").split(",")))
with open("out.csv", "w") as file:
for line in lines:
file.write(",".join(line)+"\n")
Output:
Col2,COL1,Col3
EFG,454,ABC
DEF,123,ABC
Note that you might want to treat the first line (the header) in a special way so as not to lose its column order.
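One minimal way to do that (a sketch of mine, using the same file names as above): copy the header line straight through and write each kept line exactly as it appeared in the input, which also preserves the column order of every row:
with open("in.csv", "r") as fin, open("out.csv", "w") as fout:
    fout.write(fin.readline())  # copy the header through unchanged
    seen = set()
    for line in fin:
        key = frozenset(line.strip("\n").split(","))
        if key not in seen:
            seen.add(key)
            fout.write(line)  # write the original line, so the column order is kept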
But if the order matters, you could use the code from "Maintaining the order of the elements in a frozen set":
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

with open("in.csv", "r") as file:
    lines = []
    for line in file:
        lines.append(line.strip("\n").split(","))

with open("out.csv", "w") as file:
    for line in unique_everseen(lines, key=frozenset):
        file.write(",".join(line) + "\n")
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
The OP said that both snippets seem not to work on large files (1.8 GB). I think that is because both of them store the whole file in RAM, and a 1.8 GB file may take up all of the available memory.
To solve that I made a few more attempts. Sadly, I must say that all of them are extremely slow compared to the first approach. The first snippets trade RAM for speed, while the following ones trade speed, CPU and hard-drive usage for lower RAM consumption (instead of holding the whole file in RAM, they use less than 50 MB).
Since all of these examples lean heavily on the hard drive, it's advisable to keep the input and output files on different drives.
My first attempt at using less RAM uses the shelve module:
import shelve, os

with shelve.open("tmp") as db:
    with open("in.csv", "r") as file:
        for line in file:
            l = line.strip("\n").split(",")
            l.sort()
            db[",".join(l)] = l
    with open("out.csv", "w") as file:
        for v in db.values():
            file.write(",".join(v) + "\n")

# clean up the shelve files (the exact names/extensions depend on the dbm backend)
os.remove("tmp.bak")
os.remove("tmp.dat")
os.remove("tmp.dir")
Sadly, this code takes hundreds of times longer than the first two snippets, which use RAM.
Another attempt is:
with open("in.csv", "r") as fileRead:
# total = sum(1 for _ in fileRead)
# fileRead.seek(0)
# i = 0
with open("out.csv", "w") as _:
pass
with open("out.csv", "r+") as fileWrite:
for lineRead in fileRead:
# i += 1
line = lineRead.strip("\n").split(",")
lineSet = set(line)
write = True
fileWrite.seek(0)
for lineWrite in fileWrite:
if lineSet == set(lineWrite.strip("\n").split(",")):
write = False
if write:
pass
fileWrite.write(",".join(line)+"\n")
# if i / total * 100 % 1 == 0: print(f"{i / total * 100}% ({i} / {total})")
This is slightly faster but not much.
If your computer has several cores, you could try to use multiprocessing:
from multiprocessing import Process, Queue, cpu_count
from os import remove

def slave(number, qIn, qOut):
    name = f"slave-{number}.csv"
    with open(name, "w") as file:
        pass  # create/truncate this worker's temporary file
    with open(name, "r+") as file:
        while True:
            if not qIn.empty():
                get = qIn.get()
                if get == False:
                    qOut.put(name)
                    break
                else:
                    write = True
                    file.seek(0)
                    for line in file:
                        if set(line.strip("\n").split(",")) == get[1]:
                            write = False
                            break
                    if write:
                        file.write(get[0])

def master():
    qIn = Queue(1)
    qOut = Queue()
    slaves = cpu_count()
    slavesList = []

    for n in range(slaves):
        slavesList.append(Process(target=slave, daemon=True, args=(n, qIn, qOut)))
    for s in slavesList:
        s.start()

    with open("in.csv", "r") as file:
        for line in file:
            lineSet = set(line.strip("\n").split(","))
            qIn.put((line, lineSet))

    for _ in range(slaves):
        qIn.put(False)
    for s in slavesList:
        s.join()

    slavesList = []
    with open(qOut.get(), "r+") as fileMaster:
        for x in range(slaves - 1):
            file = qOut.get()
            with open(file, "r") as fileSlave:
                for lineSlave in fileSlave:
                    lineSet = set(lineSlave.strip("\n").split(","))
                    write = True
                    fileMaster.seek(0)
                    for lineMaster in fileMaster:
                        if set(lineMaster.strip("\n").split(",")) == lineSet:
                            write = False
                            break
                    if write:
                        fileMaster.write(lineSlave)
            slavesList.append(Process(target=remove, daemon=True, args=(file,)))
            slavesList[-1].start()
    for s in slavesList:
        s.join()

if __name__ == "__main__":
    master()
As you can see, I have the disappointing task of telling you that both of my attempts are really slow. I hope you find a better approach; otherwise it will take hours, if not days, to run on 1.8 GB of data (the actual time will mainly depend on the number of repeated values, which reduces it).
A new attempt: instead of storing everything in a file, this version keeps the active portion in memory and periodically writes it out to a file so the chunks are processed faster. The chunks must then be re-read using one of the methods above:
lines = set()
maxLines = 1000  # number of lines kept in RAM at a time; higher is faster but requires more RAM
perfect = True

with open("in.csv", "r") as fileRead:
    total = sum(1 for _ in fileRead)
    fileRead.seek(0)
    i = 0
    with open("tmp.csv", "w") as fileWrite:
        for line in fileRead:
            if len(lines) < maxLines:
                lines.add(frozenset(line.strip("\n").split(",")))
                i += 1
                if i / total * 100 % 1 == 0: print(f"Reading {i / total * 100}% ({i} / {total})")
            else:
                perfect = False
                j = 0
                for stored in lines:
                    j += 1
                    fileWrite.write(",".join(stored) + "\n")
                    if i / total * 100 % 1 == 0: print(f"Storing {i / total * 100}% ({i} / {total})")
                lines = set()
                # keep the line that triggered the flush so it is not lost
                lines.add(frozenset(line.strip("\n").split(",")))
                i += 1

if not perfect:
    use_one_of_the_above_methods()  # remember to read tmp.csv, not in.csv
This might boost the speed. You can set maxLines to any number you like; remember that the higher the number, the greater the speed (I'm not sure whether very large numbers have the opposite effect) but also the higher the RAM consumption.
If you want to use the csv library itself, you can use a DictReader and a DictWriter:
import csv

def main():
    """Read csv file, delete duplicates and write it."""
    with open('test.csv', 'r', newline='') as inputfile:
        with open('testout.csv', 'w', newline='') as outputfile:
            duplicatereader = csv.DictReader(inputfile, delimiter=',')
            uniquewrite = csv.DictWriter(outputfile, fieldnames=['address', 'floor', 'date', 'price'], delimiter=',')
            uniquewrite.writeheader()
            keysread = []
            for row in duplicatereader:
                key = (row['date'], row['price'])
                if key not in keysread:
                    print(row)
                    keysread.append(key)
                    uniquewrite.writerow(row)

if __name__ == '__main__':
    main()
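One detail worth changing for a large file: keysread is a list, so key not in keysread is a linear scan for every row. A set makes that membership test roughly constant time; a sketch with that single change (the file and field names are the answer's assumptions, not mine):
import csv

def main():
    """Read the csv file, delete duplicates and write it; keysread is a set for fast lookups."""
    with open('test.csv', 'r', newline='') as inputfile, \
         open('testout.csv', 'w', newline='') as outputfile:
        duplicatereader = csv.DictReader(inputfile, delimiter=',')
        uniquewrite = csv.DictWriter(outputfile, fieldnames=['address', 'floor', 'date', 'price'], delimiter=',')
        uniquewrite.writeheader()
        keysread = set()
        for row in duplicatereader:
            key = (row['date'], row['price'])
            if key not in keysread:
                keysread.add(key)
                uniquewrite.writerow(row)

if __name__ == '__main__':
    main()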

Sum up values according to a string condition in a list generated by a for loop

My code searches for particular files and calls a separate .py file to output some data. I manually appended a row with the file size of each file. I simply want to append, at the end of the iterations, the sum of all the file sizes of the files found. I guessed this would involve boolean indexing, but I could not find a good reference. I want to find all the lines that are labelled 'file size' and then sum their values.
One sample iteration (I randomly put several 'file size' lines near each other, but in the real data they would be separated by about 15 lines):
xd = """Version 3.1.5.0
GetFileName C:\\users\\trinh\\downloads\\higgi022_20150612_007_bsadig_100fm_aft_newIonTrap3.raw
GetCreatorID thermo
GetVersionNumber 64
file size 1010058
file size 200038
file size 48576986
file size 387905
misc tester
more python"""
At the end of the for loop I want to sum all the file sizes (this is very wrong, but it is my best attempt):
zd = xd.split()
for aline in zd:
    if 'file size' in aline:
        sum = 0
        for eachitem in aline[1:]:
            sum += eachitem
print(sum)
For the example data you have given, to get the total of all lines that start with file size you can do the following:
xd = """Version 3.1.5.0
GetFileName C:\\users\\trinh\\downloads\\higgi022_20150612_007_bsadig_100fm_aft_newIonTrap3.raw
GetCreatorID thermo
GetVersionNumber 64
file size 1010058
file size 200038
file size 48576986
file size 387905
misc tester
more python"""
total = 0
for line in xd.splitlines():
    if line.startswith('file size'):
        total += int(line.split()[2])
print(total)
This would display:
50174987
This first splits xd into lines and, for each line, determines whether it starts with the words file size. If it does, it uses split() to break the line into three parts. The third part contains the size as a string, so it is converted to an integer using int().
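The same total can also be written as a single generator expression (an equivalent form, not taken from the original answer):
total = sum(int(line.split()[2])
            for line in xd.splitlines()
            if line.startswith('file size'))
print(total)  # 50174987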
To extend this to work on a file, you would first need to read the file and total the necessary lines, and then open it in append mode to write the total:
with open('data.txt') as f_input:
    total = 0
    for line in f_input:
        if line.startswith('file size'):
            total += int(line.split()[2])

with open('data.txt', 'a') as f_output:
    f_output.write("\nTotal file size: {}\n".format(total))
Based on your current script, you could incorporate it as follows:
import os
import csv
from subprocess import run, PIPE

pathfile = 'C:\\users\\trinh\\downloads'
msfilepath = 'C:\\users\\trinh\\downloads\\msfilereader.py'
file_size_total = 0

with open("output.csv", "w", newline='') as csvout:
    writer = csv.writer(csvout, delimiter=',')
    for root, dirs, files in os.walk(pathfile):
        for f in files:
            if f.endswith(".raw"):
                fp = os.path.join(root, f)  # join the directory root and the file name
                p = run(['python', msfilepath, fp], stdout=PIPE)  # run MSfilereader.py on each .raw file found
                p = p.stdout.decode('utf-8')
                for aline in p.split('\r\n'):
                    header = aline.split(' ', 1)
                    writer.writerows([header])
                    if 'END SECTION' in aline and aline.endswith('###'):
                        file_size = os.stat(fp).st_size
                        file_size_total += file_size
                        lst_filsz = ['file size', str(file_size)]
                        writer.writerow(lst_filsz)
    writer.writerow(["Total file size:", file_size_total])
This would give you a total of ALL file size entries. It would also be possible to add sub-totals for each section if that was required.
Note: when using with open(...), it is not necessary to also call close() on the file; as soon as you leave the scope of the with statement, the file is closed automatically.
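A tiny illustration of that point (data.txt is just the example file from above):
with open('data.txt') as f_input:
    first_line = f_input.readline()
# leaving the with block closes the file automatically
print(f_input.closed)  # prints True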

Read only one random row from CSV file and move to another CSV

I'm having a problem reading random rows from a large CSV file and moving them to another CSV file, using pandas 0.18.1 and Python 2.7.10 on Windows.
I want to load only the randomly selected rows into memory and move them to another CSV; I don't want to load the entire content of the first CSV into memory.
This is the code I used:
import random

file_size = 100
f = open("customers.csv", 'r')
o = open("train_select.csv", 'w')
for i in range(0, 50):
    offset = random.randrange(file_size)
    f.seek(offset)
    f.readline()
    random_line = f.readline()
    o.write(random_line)
The current output looks something like this:
2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street; 96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
My problems are two-fold:
I want to see the header in the second CSV as well, not just the data rows.
A row should be selected by the random function only once.
The output should be something like this:
id;name;firstname;zip;city;birthdate;street;housenr;stateCode;state
2;flhxu-name;tum-firstname; 17520;buo-city;1966/04/24;wfyz-street; 96;GA;GEORGIA
1;jwcdf-name;fsj-firstname; 13520;oem-city;1954/02/07;amrb-street; 145;AK;ALASKA
You can do it more simply than that:
first, read the customers file fully; the title line is a special case, so keep it aside
shuffle the list of lines (that's the randomness you were looking for)
write back the title + the shuffled lines
code:
import random

with open("customers.csv", 'r') as f:
    title = f.readline()
    lines = f.readlines()

random.shuffle(lines)

with open("train_select.csv", 'w') as f:
    f.write(title)
    f.writelines(lines)
EDIT: if you don't want to hold the whole file in memory, here's an alternative. The only drawback is that you have to read the file once (but not store in memory) to compute line offsets:
import random

input_file = "customers.csv"
line_offsets = list()

with open(input_file, 'r') as f:
    # just read the title
    title = f.readline()
    # store the offset of the first line, then of each following line start
    while True:
        line_offsets.append(f.tell())
        line = f.readline()
        if line == "":
            break

    # now shuffle the offsets
    random.shuffle(line_offsets)

    # and write the output file (the input file is still open, so we can seek in it)
    with open("train_select.csv", 'w') as fw:
        fw.write(title)
        for offset in line_offsets:
            # seek to a line start
            f.seek(offset)
            fw.write(f.readline())
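Since the question only writes 50 rows, a further variation (my own tweak, not part of the answer) would be to sample 50 offsets instead of shuffling all of them. This reuses line_offsets, title and input_file from the code above:
import random

# pick 50 distinct line-start offsets at random
# (assumes the file has at least 50 data lines)
chosen = random.sample(line_offsets, 50)

with open(input_file) as f, open("train_select.csv", 'w') as fw:
    fw.write(title)
    for offset in chosen:
        f.seek(offset)          # jump to the start of the chosen line
        fw.write(f.readline())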
At the OP's request, and since my two previous implementations had to read the input file, here is a more complex implementation where the file is not read in advance.
It uses bisect to store the pairs of line offsets, and a minimum line length (to be configured) so that the list of random offsets does not grow needlessly long.
Basically, the program generates randomly ordered offsets ranging from the offset of the second line (the title line is skipped) to the end of the file, in steps of the minimum line length.
For each offset, it checks whether that line has already been read (using bisect, which is fast, though the extra checking is complicated by corner cases):
- if the line has not been read yet, seek backwards to the previous linefeed (which does read from the file; there is no way around it), write the line to the output file, and store its start/end offsets in the pair list
- if it has already been read, skip it
the code:
import random, os, bisect

input_file = "csv2.csv"
input_size = os.path.getsize(input_file)
smallest_line_len = 4
line_offsets = []

with open(input_file, 'r') as f, open("train_select.csv", 'w') as fw:
    # read title and write it back
    title = f.readline()
    fw.write(title)
    # generate offset list, starting from current pos to the end of file
    # with a step of min line len to avoid generating too many numbers
    # (this can be 1 but that will take a while)
    offset_list = list(range(f.tell(), input_size, smallest_line_len))
    # shuffle the list at random
    random.shuffle(offset_list)
    # now loop through the offsets
    for offset in offset_list:
        # look if the offset is already contained in the list of sorted tuples
        insertion_point = bisect.bisect(line_offsets, (offset, 0))
        if len(line_offsets) > 0 and insertion_point == len(line_offsets) and line_offsets[-1][1] > offset:
            # bisect tells to insert at the end: check if within last couple boundary: if so, already processed
            continue
        elif insertion_point < len(line_offsets) and (offset == line_offsets[insertion_point][0] or
                (0 < insertion_point and line_offsets[insertion_point-1][0] <= offset <= line_offsets[insertion_point-1][1])):
            # offset is already known, line has already been processed: skip
            continue
        else:
            # offset is not known: rewind until we meet an end of line
            f.seek(offset)
            while True:
                c = f.read(1)
                if c == "\n":
                    # we found the line terminator of the previous line: OK
                    break
                offset -= 1
                f.seek(offset)
            # now store the current position: start of the current line
            line_start = offset + 1
            # now read the line fully
            line = f.readline()
            # now compute line end (approx..)
            line_end = f.tell() - 1
            # and insert the "line" in the sorted list
            line_offsets.insert(insertion_point, (line_start, line_end))
            fw.write(line)

Python - How do I find line position in file, and move around that line?

I am parsing a large data file using:
import csv

reader = csv.DictReader(open('Sourcefile.txt', 'rt'), delimiter='\t')
for row in reader:
    pass  # etc.
Parsing works great but I am performing calculations on the data, which require me to directly access the line I'm on, the line before, or to skip 10 lines ahead.
I can't figure out how to get the actual line number of the file I am in, and how to move to some other line in the file (ex: "Current_Line" + 10) and start accessing data from that point forward in the file.
Is the solution to read the entire file into an array, rather than trying to move back and forth in the file? I am expecting this file to be upwards of 160MB and assumed moving back and forth in the file would be most memory efficient.
Use csvreader.next() to get to the next line. To move 10 lines forward, call it 10 times or use a for loop over a range.
Use csvreader.line_num to get the current line number.
Thanks to Steven Rumbalski for pointing out that you can only rely on this if your data contains no embedded newline characters (0x0A) inside quoted fields.
To get the line before the current line, simply cache the last row in a variable.
More information here: https://docs.python.org/2/library/csv.html
Edit
A small example:
import csv

reader = csv.DictReader(open('Sourcefile.txt', 'rt'), delimiter='\t')
last_line = None
for row in reader:
    print("Current row: %s (line %d)" % (row, reader.line_num))
    # do sth with the row
    last_line = row
    if reader.line_num % 10 == 0:
        print("Modulo 10! Skipping 5 lines")
        try:
            for i in range(5):
                last_line = reader.next()
        except:  # file is finished
            break
This does exactly the same, but in my eyes it is better code:
import csv

reader = csv.DictReader(open('Sourcefile.txt', 'rt'), delimiter='\t')
last_line = None
skip = 0
for row in reader:
    if skip > 0:
        skip -= 1
        continue
    print("Current row: %s (line %d)" % (row, reader.line_num))
    # do sth with the row
    last_line = row
    if reader.line_num % 10 == 0:
        print("Modulo 10! Skipping 5 lines")
        skip += 5
print("File is done!")
For maximal flexibility (at the cost of memory) you can copy the whole csv reader's content into a list, effectively caching the whole table.
import csv

reader = csv.DictReader(open('Sourcefile.txt', 'rt'), delimiter='|')
fn = reader.fieldnames
t = []
for k in reader:
    t.append(k)
print(fn)
print(t[0])
# You can now access any row (as a dictionary) in the list: fn holds the header keys and
# t[0] is the first data row (i.e. the second row of the file).
# fn is a list of keys that can be applied to each row in t.
# t[0][fn[0]] gives the value of the first column in the first data row.
# fn is a list, so the order of the columns is preserved.
# Each element in t is a dictionary, so to preserve the column order we use fn.

Cut a large text file into small files

I have a text file which contains 1,000,000 lines. I want to split it into files that contain 15,000 lines each, e.g. the first file contains lines 1 to 15000, the next file lines 15001 to 30000, and so on. This is what I have done:
lines = open('myfile.txt').readlines()
open('1_15000.txt', 'w').writelines(lines[0:15000])
open('15001_30000.txt', 'w').writelines(lines[15000:30000])
open('30000_45000.txt', 'w').writelines(lines[30000:45000])
open('45000_60000.txt', 'w').writelines(lines[45000:60000])
...
...
... so on till 1000000
But this code looks too long. Is there any way I can do this using a loop, so that I don't have to write separate code for each file?
lines = open('myfile.txt').readlines()
for i in range(0, 1000000, 15000):
    open('{0}_{1}.txt'.format(i+1, i+15000), 'w').writelines(lines[i:i+15000])
Hope this helps.
You could try something like:
lines = open('myfile.txt').readlines()
count = 0
incr = 15000
while count < len(lines):
    # write the slice of lines from count up to count + incr
    open(str(count) + '_' + str(count + incr) + '.txt', 'w').writelines(lines[count:count + incr])
    count += incr
Note that you can also do this with the Linux split utility (for example, split -l 15000 myfile.txt writes chunk files of 15,000 lines each). No need to reinvent the wheel!
lines = open('myfile.txt').readlines()
loads the entire file into a Python list. You won't want to do this when a file is large since it may cause your machine to run out of memory.
This splits the file into chunks of N lines. Each chunk is a list. It stops when a chunk is an empty list.
import itertools as IT

N = 15000
with open('data', 'rb') as f:
    for i, chunk in enumerate(iter(lambda: list(IT.islice(f, N)), [])):
        outfile = '{:06d}_{:06d}.txt'.format(i*N, (i+1)*N)
        with open(outfile, 'wb') as g:
            g.writelines(chunk)
If the file contains N empty lines, then the above method may end prematurely. Or if N were really large, reading N lines into a Python list may cause a MemoryError. You could avoid these problems by handling one line at a time (by calling next(f)) and catching the StopIteration exception that signals the end of the file:
import itertools as IT

N = 15000
with open('data', 'rb') as f:
    try:
        for i in IT.count():
            outfile = '{:06d}_{:06d}.txt'.format(i*N, (i+1)*N)
            with open(outfile, 'wb') as g:
                for j in range(N):
                    line = next(f)
                    g.write(line)
    except StopIteration:
        pass
