Cut a large text file into small files - python

I have a text file that contains 1000000 lines. I want to split it into files that contain 15000 lines each, e.g. the first file contains lines 1 to 15000, the next file lines 15001 to 30000, and so on. This is what I have done:
lines = open('myfile.txt').readlines()
open('1_15000.txt', 'w').writelines(lines[0:15000])
open('15001_30000.txt', 'w').writelines(lines[15000:30000])
open('30001_45000.txt', 'w').writelines(lines[30000:45000])
open('45001_60000.txt', 'w').writelines(lines[45000:60000])
...
...
... so on till 1000000
But this code looks too long. Is there a way to do this with a loop so that I don't have to write separate code for each file?

lines = open('myfile.txt').readlines()
for i in range(0, 1000000, 15000):
    open('{0}_{1}.txt'.format(i+1, i+15000), 'w').writelines(lines[i:i+15000])
Hope this helps.

You could try something like:
lines = open('myfile.txt').readlines()
count = 0
incr = 15000
while count < len(lines):
    # the slice must end at count+incr, not at incr
    open(str(count)+'_'+str(count+incr)+'.txt', 'w').writelines(lines[count:count+incr])
    count += incr

Note that you can also do this with the linux split utility. No need to reinvent the wheel!

lines = open('myfile.txt').readlines()
loads the entire file into a Python list. You won't want to do this when the file is large, since it may cause your machine to run out of memory.
The code below instead reads the file in chunks of N lines. Each chunk is a list, and the loop stops when a chunk is an empty list.
import itertools as IT
N = 15000
with open('data', 'rb') as f:
    for i, chunk in enumerate(iter(lambda: list(IT.islice(f, N)), [])):
        outfile = '{:06d}_{:06d}.txt'.format(i*N, (i+1)*N)
        with open(outfile, 'wb') as g:
            g.writelines(chunk)
Blank lines do not end the loop early: each one is read as a newline, so a chunk is the empty list only at the end of the file. However, if N were really large, reading N lines into a Python list may cause a MemoryError. You could avoid this by handling one line at a time (by calling next(f)) and catching the StopIteration exception that signals the end of the file:
import itertools as IT
N = 15000
with open('data', 'rb') as f:
    try:
        for i in IT.count():
            outfile = '{:06d}_{:06d}.txt'.format(i*N, (i+1)*N)
            with open(outfile, 'wb') as g:
                for j in range(N):
                    line = next(f)
                    g.write(line)
    except StopIteration:
        pass
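
For reference, here is a minimal sketch that wraps the line-at-a-time idea in a reusable function. The function name split_by_lines, the out_dir argument and the 1-based file naming are assumptions made for illustration, not part of the answers above. It peeks at one line before opening each output file, so no empty trailing file is created when the line count is an exact multiple of the chunk size.

```python
import itertools
from pathlib import Path

def split_by_lines(src, out_dir, lines_per_file=15000):
    """Split src into numbered files of at most lines_per_file lines each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src, 'rb') as f:
        for part in itertools.count():
            # Read one line first; an empty result means end of file.
            first = f.readline()
            if not first:
                break
            start = part * lines_per_file + 1
            # The name shows the nominal range; the last file may hold fewer lines.
            name = '{:07d}_{:07d}.txt'.format(start, start + lines_per_file - 1)
            out_path = out_dir / name
            with open(out_path, 'wb') as g:
                g.write(first)
                # Stream the rest of this chunk without building a list.
                g.writelines(itertools.islice(f, lines_per_file - 1))
            written.append(out_path)
    return written
```

Calling split_by_lines('myfile.txt', 'parts') would then write the pieces into a parts directory and return their paths.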

Related

Counting the number of lines in a gzip file using python

I'm trying to count the number of lines in a gz archive. There is only 1 json format text file per gz. But when I open the archive and count the lines the count is way off what I'd expect. The file contains 522 lines, but my code is returning 668480 lines.
import gzip
f = gzip.open(myfile, 'rb')
file_content = f.read()
for i, l in enumerate(file_content):
    pass
i += 1
print("File {1} contain {0} lines".format(i, myfile))

You are iterating over all the characters, not the lines. You can iterate over the lines in the following way:
import gzip
with gzip.open(myfile, 'rb') as f:
    for i, l in enumerate(f):
        pass
print("File {1} contain {0} lines".format(i + 1, myfile))
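
A slightly more compact variant of the same idea, as a sketch using only the standard library. The function name count_lines_gz is an assumption for illustration. Unlike the enumerate loop, it also returns 0 for an empty file instead of failing on an undefined loop variable.

```python
import gzip

def count_lines_gz(path):
    """Count the lines in a gzip file without loading it all into memory."""
    with gzip.open(path, 'rb') as f:
        # Iterating the file object yields one decompressed line at a time.
        return sum(1 for _ in f)
```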

For a performant way to count the lines in a gzip file you can use the pragzip package:
import pragzip
result = 0
with pragzip.open(myfile) as file:
    while chunk := file.read( 1024*1024 ):
        result += chunk.count(b'\n')
print(f"Number of lines: {result}")
Comparing the timing of the above with @DmitryKovriga's answer:
Number of lines: 33468793
Elapsed time is 22.373915 seconds.
File datasets/binance-futures_incremental_book_L2_2020-07-01_BTCUSDT.csv.gz contain 33468793 lines
Elapsed time is 31.278056 seconds.
A speed up of more like 10x should be possible with a suitable setup. See https://unix.stackexchange.com/a/713093/163459 for more info.

How to sample a very big CSV file (6GB)

I have a big CSV file (the first line is the header) and I want to split it into 100 pieces by line number (line_num%100, for example). How can I do that efficiently under a main-memory constraint?
In other words, separate the file into 100 smaller ones: lines 1, 101, 201, ... go to sub-file 1, lines 2, 102, 202, ... go to sub-file 2, and so on up to sub-file 100.
The goal is to get 100 files of about 600 M each,
not to get 100 lines or a single sample of 1/100 of the size.
I tried to execute like this:
fi = [open('split_data//%d.csv'%i,'w') for i in range(100)]
i = 0
with open('data//train.csv') as fin:
    first = fin.readline()
    for line in fin:
        fi[i%100].write(line)
        i = i + 1
for i in range(100):
    fi[i].close()
But the file is too big to process with limited memory. How should I deal with it?
I want to do it in a single pass.
(My code works, but it takes so long that I mistakenly thought it had crashed. Sorry about that.)

To split a file into 100 parts as stated in comments (I want to split the file to 100 parts in modulus'ing way i.e. range(200)-->| [0,100]; [1,101]; [2,102] and Yes, separate a big one to hundreds of smaller files)
import csv
# newline='' lets the csv module handle line endings itself (Python 3)
files = [open('part_{}'.format(n), 'w', newline='') for n in range(100)]
csvouts = [csv.writer(f) for f in files]
with open('yourcsv', newline='') as fin:
    csvin = csv.reader(fin)
    next(csvin, None) # Skip header
    for rowno, row in enumerate(csvin):
        csvouts[rowno % 100].writerow(row)
for f in files:
    f.close()
You can islice over the file with a step instead of modulus'ing the line number, eg:
import csv
from itertools import islice
with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    # Skip header, and then return every 100th until file ends
    for line in islice(csvin, 1, None, 100):
        pass # do something with line
Example:
r = range(1000)
res = list(islice(r, 1, None, 100))
# [1, 101, 201, 301, 401, 501, 601, 701, 801, 901]
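
If the rows never need to be parsed, a single-pass split can also copy raw lines round-robin, which avoids the cost of the csv module. The sketch below is one way to do that; split_round_robin, the part_NNN.csv naming and the choice to repeat the header in every part are assumptions for illustration. It also assumes that no quoted field contains an embedded newline, since it treats every physical line as one row.

```python
from contextlib import ExitStack
from pathlib import Path

def split_round_robin(src, out_dir, parts=100):
    """Send line k of the body to part k % parts, repeating the header."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / 'part_{:03d}.csv'.format(n) for n in range(parts)]
    # ExitStack closes every output file, even if an error occurs midway.
    with ExitStack() as stack, open(src, 'rb') as fin:
        outs = [stack.enter_context(open(p, 'wb')) for p in paths]
        header = fin.readline()
        for out in outs:
            out.write(header)
        # Only one line is held in memory at a time.
        for lineno, line in enumerate(fin):
            outs[lineno % parts].write(line)
    return paths
```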

Based on @Jon Clements's answer, I would also benchmark this variation:
import csv
from itertools import islice
with open('in.csv') as fin:
    first = fin.readline() # discard the header
    csvin = csv.reader( islice(fin, None, None, 100) ) # this line is the only difference
    for row in csvin:
        print(row) # do something with row
If you only want 100 samples, you can use this idea which just makes 100 reads at equally spaced locations within the file. This should work well for CSV files whose line lengths are essentially uniform.
import csv
import os
def sample100(path):
    with open(path) as fin:
        end = os.fstat(fin.fileno()).st_size
        fin.readline() # skip the first line
        start = fin.tell()
        step = max(1, (end - start) // 100) # seek() needs an integer offset
        offset = start
        while offset < end:
            fin.seek(offset)
            fin.readline() # this might not be a complete line
            if fin.tell() < end:
                yield fin.readline() # this is a complete non-empty line
            else:
                break # not really necessary...
            offset = offset + step
for row in csv.reader( sample100('in.csv') ):
    pass # do something with row

I think you can just open the same file 10 times and then manipulate (read) each one independently effectively splitting it up into sub-file without actually doing it.
Unfortunately this requires knowing in advance how many rows there are in the file and that requires reading the whole thing once to count them. On the other hand this should be relatively quick since no other processing takes place.
To illustrate and test this approach I created a simpler — only one item per row — and much smaller csv test file that looked something like this (the first line is the header row and not counted):
line_no
1
2
3
4
5
...
9995
9996
9997
9998
9999
10000
Here's the code and sample output:
from collections import deque
import csv
# count number of rows in csv file
# (this requires reading the whole file)
file_name = 'mycsvfile.csv'
with open(file_name, newline='') as csv_file:
    for num_rows, _ in enumerate(csv.reader(csv_file)): pass
rows_per_section = num_rows // 10
print('number of rows: {:,d}'.format(num_rows))
print('rows per section: {:,d}'.format(rows_per_section))
csv_files = [open(file_name, newline='') for _ in range(10)]
csv_readers = [csv.reader(f) for f in csv_files]
for reader in csv_readers: # skip header
    next(reader)
# position each file handle at its starting position in file
for i in range(10):
    for j in range(i * rows_per_section):
        try:
            next(csv_readers[i])
        except StopIteration:
            pass
# read rows from each of the sections
for i in range(rows_per_section):
    # elements are one row from each section
    rows = [next(r) for r in csv_readers]
    print(rows) # show what was read
# clean up
for i in range(10):
    csv_files[i].close()
Output:
number of rows: 10,000
rows per section: 1,000
[['1'], ['1001'], ['2001'], ['3001'], ['4001'], ['5001'], ['6001'], ['7001'], ['8001'], ['9001']]
[['2'], ['1002'], ['2002'], ['3002'], ['4002'], ['5002'], ['6002'], ['7002'], ['8002'], ['9002']]
...
[['998'], ['1998'], ['2998'], ['3998'], ['4998'], ['5998'], ['6998'], ['7998'], ['8998'], ['9998']]
[['999'], ['1999'], ['2999'], ['3999'], ['4999'], ['5999'], ['6999'], ['7999'], ['8999'], ['9999']]
[['1000'], ['2000'], ['3000'], ['4000'], ['5000'], ['6000'], ['7000'], ['8000'], ['9000'], ['10000']]

Split a file based on number of occurrences of 1 in position 1 of a line

I routinely use PowerShell to split larger text or csv files into smaller files for quicker processing. However, I have a few files that come over in an unusual format. These are basically print files written to a text file. Each record starts with a single line that starts with a 1, and there is nothing else on that line.
What I need to be able to do is split a file based on the number of statements. So, if I want to split the file into chunks of 3000 statements, I would go down until I see the 3001st occurrence of 1 in position 1 and copy everything before that to the new file. I can run this from Windows, Linux or OS X, so pretty much anything is open for the split.
Any ideas would be greatly appreciated.

Maybe try recognizing it by the fact that there is a '1' plus a new line?
with open(input_file, 'r') as f:
    my_string = f.read()
my_list = my_string.split('\n1\n')
This separates each record into a list element, assuming the file has the following format:
1
....
....
1
....
....
....
You can then output each element in the list to a separate file.
for x in range(len(my_list)):
    with open(str(x)+'.txt', 'w') as out:
        out.write(my_list[x])

To avoid loading the file into memory, you could define a function that generates records incrementally and then use the grouper recipe from the itertools documentation to write each group of 3000 records to a new file:
#!/usr/bin/env python3
from itertools import zip_longest
with open('input.txt') as input_file:
    files = zip_longest(*[generate_records(input_file)]*3000, fillvalue=())
    for n, records in enumerate(files):
        with open('output{n}.txt'.format(n=n), 'w') as output_file:
            output_file.writelines(''.join(lines)
                                   for r in records for lines in r)
where generate_records() yields one record at a time where a record is also an iterator over lines in the input file:
from itertools import chain
def generate_records(input_file, start='1\n', eof=[]):
    def record(yield_start=True):
        if yield_start:
            yield start
        for line in input_file:
            if line == start: # start new record
                break
            yield line
        else: # EOF
            eof.append(True)
    # the first record may include lines before the first 1\n
    yield chain(record(yield_start=False),
                record())
    while not eof:
        yield record()
generate_records() is a generator that yields generators, like itertools.groupby() does.
For performance reasons, you could read/write chunks of multiple lines at once.
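
As a simpler alternative to the nested generators, the same split can be sketched with one loop that counts record starts and switches output files when the limit is reached. This is a different technique from the grouper approach above, and the name split_records, the out_prefix argument and the output naming are assumptions for illustration. Lines that appear before the first start marker are kept in the first file.

```python
def split_records(src, out_prefix, records_per_file=3000, start='1\n'):
    """Write every records_per_file records of src to its own file."""
    paths = []
    out = None
    records_in_file = 0
    try:
        with open(src) as fin:
            for line in fin:
                is_start = (line == start)
                # Open a new output for the very first line, or when a record
                # starts and the current file already holds enough records.
                if out is None or (is_start and records_in_file == records_per_file):
                    if out is not None:
                        out.close()
                    path = '{}{}.txt'.format(out_prefix, len(paths))
                    out = open(path, 'w')
                    paths.append(path)
                    records_in_file = 0
                if is_start:
                    records_in_file += 1
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```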

How can I split a large file csv file (7GB) in Python

I have a 7GB csv file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?

You don't need Python to split a csv file. Using your shell:
$ split -l 100 data.csv
This would split data.csv into chunks of 100 lines.

I had to do a similar task, and used the pandas package:
import pandas as pd
for i,chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)

Here is a little python script I used to split a file data.csv into several CSV part files. The number of part files can be controlled with chunk_size (number of lines per part file).
The header line (column names) of the original file is copied into every part CSV file.
It works for big files because it reads one line at a time with readline() instead of loading the complete file into memory at once.
#!/usr/bin/env python3
def main():
    chunk_size = 9998 # lines
    def write_chunk(part, lines):
        with open('data_part_'+ str(part) +'.csv', 'w') as f_out:
            f_out.write(header)
            f_out.writelines(lines)
    with open('data.csv', 'r') as f:
        count = 0
        header = f.readline()
        lines = []
        for line in f:
            count += 1
            lines.append(line)
            if count % chunk_size == 0:
                write_chunk(count // chunk_size, lines)
                lines = []
        # write remainder
        if len(lines) > 0:
            write_chunk((count // chunk_size) + 1, lines)
if __name__ == '__main__':
    main()

This graph shows the runtime difference of the different approaches outlined by other posters (on an 8 core machine when splitting a 2.9 GB file with 11.8 million rows of data into ~290 files).
The shell approach is from Thomas Orozco, the Python approach is from Roberto, the Pandas approach is from Quentin Febvre, and here's the Dask snippet:
import dask.dataframe as dd
# dtypes (the column dtypes) is defined elsewhere and not shown here
ddf = dd.read_csv("../nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv", blocksize=10000000, dtype=dtypes)
ddf.to_csv("../tmp/split_csv_dask")
I'd recommend Dask for splitting files, even though it's not the fastest, because it's the most flexible solution (you can write out different file formats, perform processing operations before writing, easily modify compression formats, etc.). The Pandas approach is almost as flexible, but cannot perform processing on the entire dataset (like sorting the entire dataset before writing).
Bash / native Python filesystem operations are clearly quicker, but that's not what I'm typically looking for when I have a large CSV. I'm typically interested in splitting large CSVs into smaller Parquet files, for performant, production data analyses. I don't usually care if the actual splitting takes a couple of minutes more. I'm more interested in splitting accurately.
I wrote a blog post that discusses this in more detail. You can probably Google around and find the post.

See the Python docs on file objects (the object returned by open(filename)): you can choose to read a specified number of bytes, or use readline to work through one line at a time.
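
For the "grab a small set, maybe 250MB" part of the question, a line-at-a-time copy with a byte budget is probably enough. The sketch below is only an illustration: copy_head and its parameters are made-up names, and it simply keeps the first lines of the file (including the header) rather than drawing a random sample.

```python
def copy_head(src, dst, max_bytes=250 * 1024 * 1024):
    """Copy whole lines from the start of src to dst, up to max_bytes."""
    written = 0
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        for line in fin:
            # Always keep the first line (the header), then stop
            # before the budget would be exceeded.
            if written and written + len(line) > max_bytes:
                break
            fout.write(line)
            written += len(line)
    return written
```

The resulting file ends on a line boundary, so it can be passed directly to csv.reader or pandas.read_csv.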

Maybe something like this?
#!/usr/local/cpython-3.3/bin/python
import csv
divisor = 10
outfileno = 1
outfile = None
with open('big.csv', 'r') as infile:
    for index, row in enumerate(csv.reader(infile)):
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)
# close the last output file
if outfile is not None:
    outfile.close()

I agree with @jonrsharpe: readline should be able to read one line at a time, even for big files.
If you are dealing with big csv files might I suggest using pandas.read_csv. I often use it for the same purpose and always find it awesome (and fast). Takes a bit of time to get used to idea of DataFrames. But once you get over that it speeds up large operations like yours massively.
Hope it helps.

Here is my code, which might help:
import os
import pandas as pd
import uuid
class FileSettings(object):
    def __init__(self, file_name, row_size=100):
        self.file_name = file_name
        self.row_size = row_size
class FileSplitter(object):
    def __init__(self, file_settings):
        self.file_settings = file_settings
        if type(self.file_settings).__name__ != "FileSettings":
            raise Exception("Please pass correct instance ")
        self.df = pd.read_csv(self.file_settings.file_name,
                              chunksize=self.file_settings.row_size)
    def run(self, directory="temp"):
        try:os.makedirs(directory)
        except Exception as e:pass
        counter = 0
        while True:
            try:
                file_name = "{}/{}_{}_row_{}_{}.csv".format(
                    directory, self.file_settings.file_name.split(".")[0], counter, self.file_settings.row_size, uuid.uuid4().__str__()
                )
                df = next(self.df).to_csv(file_name)
                counter = counter + 1
            except StopIteration:
                break
            except Exception as e:
                print("Error:",e)
                break
        return True
def main():
    helper = FileSplitter(FileSettings(
        file_name='sample1.csv',
        row_size=10
    ))
    helper.run()
main()

In the case of wanting to split by rough boundaries in bytes, the newest datapoints being the bottom-most ones and wanting to put the newest datapoints in the first file:
from pathlib import Path
TEN_MB = 10000000
FIVE_MB = 5000000
def split_file_into_chunks(path, chunk_size=TEN_MB):
    path = str(path)
    output_prefix = path.rpartition('.')[0]
    output_ext = path.rpartition('.')[-1]
    with open(path, 'rb') as f:
        seek_positions = []
        for x, line in enumerate(f):
            if not x:
                header = line
            seek_positions.append(f.tell())
        part = 0
        last_seek_pos = seek_positions[-1]
        for seek_pos in reversed(seek_positions):
            if last_seek_pos-seek_pos >= chunk_size:
                with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
                    f.seek(seek_pos)
                    f_out.write(header)
                    f_out.write(f.read(last_seek_pos-seek_pos))
                last_seek_pos = seek_pos
                part += 1
        with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
            f.seek(0)
            f_out.write(f.read(last_seek_pos))
    Path(path).rename(path+'~')
    Path(f'{output_prefix}.arch.0.{output_ext}').rename(path)
    Path(path+'~').unlink()

How do I split a huge text file in python

I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise I wanted to write a program in python to do it.
What I think I want the program to do is to find the size of a file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up-to the next line-break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.
Can you help me with the key filesystem related parts: filesize, reading and writing in chunks and reading to a line-break?
I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)

Linux has a split command:
split -l 100000 file.txt
This would split the file into pieces of 100,000 lines each.

Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)

Now, there is a pypi module available that you can use to split files of any size into chunks. Check this out
https://pypi.org/project/filesplit/

As an alternative method, using the logging library:
>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt",
...     maxBytes=2**20*100, backupCount=100)
>>> # 100 MB each, up to a maximum of 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> f = open("D://biglog.txt")
>>> for line in f:  # stops at end of file
...     log.info(line.strip())
Your files will appear as follows:
filename.txt (end of file)
filename.txt.1
filename.txt.2
...
filename.txt.10 (start of file)
This is a quick and easy way to make a huge log file match your RotatingFileHandler implementation.

This generator method is a (slow) way to get a slice of lines without blowing up your memory.
import itertools
def slicefile(filename, start, end):
    lines = open(filename)
    return itertools.islice(lines, start, end)
out = open("/blah.txt", "w")
for line in slicefile("/python27/readme.txt", 10, 15):
    out.write(line)

Don't forget seek() and mmap() for random access to files.
import mmap
def getSomeChunk(filename, start, len):
    fobj = open(filename, 'r+b')
    m = mmap.mmap(fobj.fileno(), 0)
    return m[start:start+len]

While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:
def splitfile(infilepath, chunksize):
    fname, ext = infilepath.rsplit('.',1)
    i = 0
    written = False
    with open(infilepath) as infile:
        while True:
            outfilepath = "{}{}.{}".format(fname, i, ext)
            with open(outfilepath, 'w') as outfile:
                for line in (infile.readline() for _ in range(chunksize)):
                    outfile.write(line)
                written = bool(line)
            if not written:
                break
            i += 1

You can use wc and split (see the respective manpages) to get the desired effect. In bash:
split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.
produces 3 parts of the same linecount (with a rounding error in the last, of course), named filename-chunk.00 to filename-chunk.02.

I've written the program and it seems to work fine. So thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read to see if it's any quicker.
import os
def Split(inputFile,numParts,outputName):
    fileSize=os.stat(inputFile).st_size
    parts=FileSizeParts(fileSize,numParts)
    openInputFile = open(inputFile, 'r')
    outPart=1
    for part in parts:
        if openInputFile.tell()<fileSize:
            fullOutputName=outputName+os.extsep+str(outPart)
            outPart+=1
            openOutputFile=open(fullOutputName,'w')
            openOutputFile.writelines(openInputFile.readlines(part))
            openOutputFile.close()
    openInputFile.close()
    return outPart-1
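
Since FileSizeParts() is not shown, here is a self-contained sketch of the plan described in the question: take the file size, divide it into parts, copy each part in fixed-size chunks, then read up to the next line break before closing the output. It uses binary reads. The name split_on_line_breaks, the chunk_size parameter and the .001, .002 output suffixes are assumptions for illustration, not the author's implementation.

```python
import os

def split_on_line_breaks(src, num_parts, chunk_size=1024 * 1024):
    """Split src into about num_parts files, each ending on a line break."""
    file_size = os.path.getsize(src)
    target = max(1, -(-file_size // num_parts))  # ceiling division
    paths = []
    with open(src, 'rb') as fin:
        while fin.tell() < file_size:
            path = '{}.{:03d}'.format(src, len(paths) + 1)
            remaining = target
            with open(path, 'wb') as fout:
                # Copy up to `target` bytes in fixed-size chunks.
                while remaining > 0:
                    data = fin.read(min(chunk_size, remaining))
                    if not data:
                        break
                    fout.write(data)
                    remaining -= len(data)
                # Then finish the current line so no line is cut in half.
                if not data.endswith(b'\n'):
                    fout.write(fin.readline())
            paths.append(path)
    return paths
```

Because each part is extended to the next line break, the parts can be slightly larger than the target size and there may be fewer than num_parts of them.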

Usage: split.py filename splitsizeinkb
import os
import sys
def getfilesize(filename):
    with open(filename,"rb") as fr:
        fr.seek(0,2) # move to end of the file
        size=fr.tell()
        print("getfilesize: size: %s" % size)
        return fr.tell()
def splitfile(filename, splitsize):
    # Open original file in read only mode
    if not os.path.isfile(filename):
        print("No such file as: \"%s\"" % filename)
        return
    filesize=getfilesize(filename)
    with open(filename,"rb") as fr:
        counter=1
        orginalfilename = filename.split(".")
        readlimit = 5000 #read 5kb at a time
        n_splits = filesize//splitsize
        print("splitfile: No of splits required: %s" % str(n_splits))
        for i in range(n_splits+1):
            chunks_count = int(splitsize)//int(readlimit)
            data_5kb = fr.read(readlimit) # read
            # Create split files
            print("chunks_count: %d" % chunks_count)
            with open(orginalfilename[0]+"_{id}.".format(id=str(counter))+orginalfilename[1],"ab") as fw:
                fw.seek(0)
                fw.truncate()# truncate original if present
                while data_5kb:
                    fw.write(data_5kb)
                    if chunks_count:
                        chunks_count-=1
                        data_5kb = fr.read(readlimit)
                    else: break
            counter+=1
if __name__ == "__main__":
    if len(sys.argv) < 3: print("Filename or splitsize not provided: Usage: filesplit.py filename splitsizeinkb ")
    else:
        filesize = int(sys.argv[2]) * 1000 # convert kB to bytes
        filename = sys.argv[1]
        splitfile(filename, filesize)

Here is a python script you can use for splitting large files using subprocess:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
    file_path = sys.argv[1]
    # i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
    subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
                     os.path.dirname(file_path) + '/'])
    # Remove the original file once done splitting
    try:
        os.remove(file_path)
    except OSError:
        pass
You can call it externally:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
You can also import subprocess and run it directly in your program.
The issue with this approach is high memory usage: subprocess creates a fork with a memory footprint same size as your process and if your process memory is already heavy, it doubles it for the time that it runs. The same thing with os.system.
Here is another pure python way of doing this, although I haven't tested it on huge files, it's going to be slower but be leaner on memory:
import unicodecsv
CHUNK_SIZE = 5000
def yield_csv_rows(reader, chunk_size):
    """
    Opens file to ingest, reads each line to return list of rows
    Expects the header is already removed
    Replacement for ingest_csv
    :param reader: dictReader
    :param chunk_size: int, chunk size
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk
with open(local_file_path, 'rb') as f:
    # keep the header line; decoding as UTF-8 is an assumption about the file
    header = f.readline().decode('utf-8').strip().replace('"', '')
    reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
    chunks = yield_csv_rows(reader, CHUNK_SIZE)
    for chunk in chunks:
        if not chunk:
            break
        # Do something with your chunk here
Here is another example using readlines():
"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5
def yield_rows(reader, chunk_size):
    """
    Yield row chunks
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk
def batch_operation(data):
    for item in data:
        print(item)
with open('file', 'r') as f:
    chunks = yield_rows(f.readlines(), CHUNK_SIZE)
    for _chunk in chunks:
        batch_operation(_chunk)
The readlines example demonstrates how to chunk your data so that you can pass chunks to a function that expects chunks. Unfortunately, readlines loads the whole file into memory, so it's better to use the reader example for performance. However, if you can easily fit what you need into memory and need to process it in chunks, this should suffice.
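
One thing to watch in both generators above: they clear and reuse the same list with del chunk[:], so a caller that stores the chunks (for example with list(...)) ends up with several references to one list. A small sketch of an alternative that yields a fresh list each time, using itertools.islice, is shown below. The name iter_chunks is an assumption for illustration. It works on any iterable, including an open file, so readlines() is not needed.

```python
from itertools import islice

def iter_chunks(iterable, chunk_size):
    """Yield successive lists of up to chunk_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        # Each chunk is a new list, so callers may safely keep it.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

For example, list(iter_chunks(range(7), 3)) gives [[0, 1, 2], [3, 4, 5], [6]].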

You can split the contents of any file into chunks as shown below. Here CHUNK_SIZE is 500000 bytes (500 kB) and content is the file's data, for example bytes that have already been read into memory:
CHUNK_SIZE = 500000
def get_chunk(content,size):
    for i in range(0,len(content),size):
        yield content[i:i+size]
for idx,val in enumerate(get_chunk(content, CHUNK_SIZE)):
    data=val
    index=idx

This worked for me
import os
fil = "inputfile"
outfil = "outputfile"
f = open(fil,'r')
numbits = 1000000000
for i in range(0,os.stat(fil).st_size//numbits+1):
    o = open(outfil+str(i),'w')
    segment = f.readlines(numbits)
    for c in range(0,len(segment)):
        o.write(segment[c]) # lines from readlines() already end with a newline
    o.close()
f.close()

I had a requirement to split csv files for import into Dynamics CRM since the file size limit for import is 8MB and the files we receive are much larger. This program allows user to input FileNames and LinesPerFile, and then splits the specified files into the requested number of lines. I can't believe how fast it works!
# user input FileNames and LinesPerFile
FileCount = 1
FileNames = []
while True:
    FileName = input('File Name ' + str(FileCount) + ' (enter "Done" after last File):')
    FileCount = FileCount + 1
    if FileName == 'Done':
        break
    else:
        FileNames.append(FileName)
LinesPerFile = input('Lines Per File:')
LinesPerFile = int(LinesPerFile)
for FileName in FileNames:
    File = open(FileName)
    # get Header row
    for Line in File:
        Header = Line
        break
    FileCount = 0
    Linecount = 1
    for Line in File:
        #skip Header in File
        if Line == Header:
            continue
        #create NewFile with Header every [LinesPerFile] Lines
        if Linecount % LinesPerFile == 1:
            FileCount = FileCount + 1
            NewFileName = FileName[:FileName.find('.')] + '-Part' + str(FileCount) + FileName[FileName.find('.'):]
            NewFile = open(NewFileName,'w')
            NewFile.write(Header)
        NewFile.write(Line)
        Linecount = Linecount + 1
    NewFile.close()

import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)
For example, if you want 50000 lines per file and the path is /home/data, you can run the command below:
subprocess.run('split -l 50000 /home/data', shell = True)
If you are not sure how many lines to keep in each split file but know how many splits you want, you can check the total number of lines in a Jupyter Notebook/shell with the command below and then divide by the number of splits you want:
! wc -l file_path
In this case:
! wc -l /home/data
Note that the output files will not have a file extension, although their contents are in the same format as the input file. On Windows you can change the extension manually.

Or, a python version of wc and split:
lines = 0
for l in open(filename): lines += 1
Then some code to read the first lines/3 into one file, the next lines/3 into another , etc.
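
The "some code" step could look like the sketch below: one pass to count the lines, a second pass to copy them out. The name split_into_parts and the -chunk.NN output naming (loosely modelled on the split example above) are assumptions for illustration.

```python
import itertools

def split_into_parts(src, num_parts=3):
    """Split src into num_parts files with nearly equal line counts."""
    # First pass: count the lines.
    with open(src, 'rb') as f:
        total = sum(1 for _ in f)
    per_part = max(1, -(-total // num_parts))  # ceiling division
    paths = []
    # Second pass: copy per_part lines into each output file.
    with open(src, 'rb') as f:
        for n in range(num_parts):
            chunk = itertools.islice(f, per_part)
            first = next(chunk, None)
            if first is None:
                break  # fewer lines than parts
            path = '{}-chunk.{:02d}'.format(src, n)
            with open(path, 'wb') as out:
                out.write(first)
                out.writelines(chunk)
            paths.append(path)
    return paths
```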
