How can I split a large file csv file (7GB) in Python - python

I have a 7GB csv file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?

You don't need Python to split a csv file. Using your shell:
$ split -l 100 data.csv
Would split data.csv in chunks of 100 lines.

I had to do a similar task, and used the pandas package:
for i,chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
chunk.to_csv('chunk{}.csv'.format(i), index=False)

Here is a little python script I used to split a file data.csv into several CSV part files. The number of part files can be controlled with chunk_size (number of lines per part file).
The header line (column names) of the original file is copied into every part CSV file.
It works for big files because it reads one line at a time with readline() instead of loading the complete file into memory at once.
#!/usr/bin/env python3
def main():
chunk_size = 9998 # lines
def write_chunk(part, lines):
with open('data_part_'+ str(part) +'.csv', 'w') as f_out:
f_out.write(header)
f_out.writelines(lines)
with open('data.csv', 'r') as f:
count = 0
header = f.readline()
lines = []
for line in f:
count += 1
lines.append(line)
if count % chunk_size == 0:
write_chunk(count // chunk_size, lines)
lines = []
# write remainder
if len(lines) > 0:
write_chunk((count // chunk_size) + 1, lines)
if __name__ == '__main__':
main()

This graph shows the runtime difference of the different approaches outlined by other posters (on an 8 core machine when splitting a 2.9 GB file with 11.8 million rows of data into ~290 files).
The shell approach is from Thomas Orozco, Python approach s from Roberto, Pandas approach is from Quentin Febvre and here's the Dask snippet:
ddf = dd.read_csv("../nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv", blocksize=10000000, dtype=dtypes)
ddf.to_csv("../tmp/split_csv_dask")
I'd recommend Dask for splitting files, even though it's not the fastest, because it's the most flexible solution (you can write out different file formats, perform processing operations before writing, easily modify compression formats, etc.). The Pandas approach is almost as flexible, but cannot perform processing on the entire dataset (like sorting the entire dataset before writing).
Bash / native Python filesystem operations are clearly quicker, but that's not what I'm typically looking for when I have a large CSV. I'm typically interested in splitting large CSVs into smaller Parquet files, for performant, production data analyses. I don't usually care if the actually splitting takes a couple minutes more. I'm more interested in splitting accurately.
I wrote a blog post that discusses this in more detail. You can probably Google around and find the post.

See the Python docs on file objects (the object returned by open(filename) - you can choose to read a specified number of bytes, or use readline to work through one line at a time.

Maybe something like this?
#!/usr/local/cpython-3.3/bin/python
import csv
divisor = 10
outfileno = 1
outfile = None
with open('big.csv', 'r') as infile:
for index, row in enumerate(csv.reader(infile)):
if index % divisor == 0:
if outfile is not None:
outfile.close()
outfilename = 'big-{}.csv'.format(outfileno)
outfile = open(outfilename, 'w')
outfileno += 1
writer = csv.writer(outfile)
writer.writerow(row)

I agree with #jonrsharpe readline should be able to read one line at a time even for big files.
If you are dealing with big csv files might I suggest using pandas.read_csv. I often use it for the same purpose and always find it awesome (and fast). Takes a bit of time to get used to idea of DataFrames. But once you get over that it speeds up large operations like yours massively.
Hope it helps.

here is my code which might help
import os
import pandas as pd
import uuid
class FileSettings(object):
def __init__(self, file_name, row_size=100):
self.file_name = file_name
self.row_size = row_size
class FileSplitter(object):
def __init__(self, file_settings):
self.file_settings = file_settings
if type(self.file_settings).__name__ != "FileSettings":
raise Exception("Please pass correct instance ")
self.df = pd.read_csv(self.file_settings.file_name,
chunksize=self.file_settings.row_size)
def run(self, directory="temp"):
try:os.makedirs(directory)
except Exception as e:pass
counter = 0
while True:
try:
file_name = "{}/{}_{}_row_{}_{}.csv".format(
directory, self.file_settings.file_name.split(".")[0], counter, self.file_settings.row_size, uuid.uuid4().__str__()
)
df = next(self.df).to_csv(file_name)
counter = counter + 1
except StopIteration:
break
except Exception as e:
print("Error:",e)
break
return True
def main():
helper = FileSplitter(FileSettings(
file_name='sample1.csv',
row_size=10
))
helper.run()
main()

In the case of wanting to split by rough boundaries in bytes, the newest datapoints being the bottom-most ones and wanting to put the newest datapoints in the first file:
from pathlib import Path
TEN_MB = 10000000
FIVE_MB = 5000000
def split_file_into_chunks(path, chunk_size=TEN_MB):
path = str(path)
output_prefix = path.rpartition('.')[0]
output_ext = path.rpartition('.')[-1]
with open(path, 'rb') as f:
seek_positions = []
for x, line in enumerate(f):
if not x:
header = line
seek_positions.append(f.tell())
part = 0
last_seek_pos = seek_positions[-1]
for seek_pos in reversed(seek_positions):
if last_seek_pos-seek_pos >= chunk_size:
with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
f.seek(seek_pos)
f_out.write(header)
f_out.write(f.read(last_seek_pos-seek_pos))
last_seek_pos = seek_pos
part += 1
with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
f.seek(0)
f_out.write(f.read(last_seek_pos))
Path(path).rename(path+'~')
Path(f'{output_prefix}.arch.0.{output_ext}').rename(path)
Path(path+'~').unlink()

Related

Batching very large text file in python

I am trying to batch a very large text file (approximately 150 gigabytes) into several smaller text files (approximately 10 gigabytes).
My general process will be:
# iterate over file one line at a time
# accumulate batch as string
--> # given a certain count that correlates to the size of my current accumulated batch and when that size is met: (this is where I am unsure)
# write to file
# accumulate size count
I have a rough metric to calculate when to batch (when the desired batch size) but am not so clear how I should calculate how often to write to disk for a given batch. For example, if my batch size is 10 gigabytes, I assume I will need to iteratively write rather than hold the entire 10 gigbyte batch in memory. I obviously do not want to write more than I have to as this could be quite expensive.
Do ya'll have any rough calculations or tricks that you like to use to figure out when to write to disk for task such as this, e.g. size vs memory or something?
Assuming your large file is simple unstructured text, i.e. this is no good for structured text like JSON, here's an alternative to reading every single line: read large binary bites of the input file until at your chunksize then read a couple of lines, close the current output file and move on to the next.
I compared this with line-by-line using #tdelaney code adapted with the same chunksize as my code - that code took 250s to split a 12GiB input file into 6x2GiB chunks, whereas this took ~50s so maybe five times faster and looks like it's I/O bound on my SSD running >200MiB/s read and write, where the line-by-line was running 40-50MiB/s read and write.
I turned buffering off because there's not a lot of point. The size of bite and the buffering setting may be tuneable to improve performance, haven't tried any other settings as for me it seems to be I/O bound anyway.
import time
outfile_template = "outfile-{}.txt"
infile_name = "large.text"
chunksize = 2_000_000_000
MEB = 2**20 # mebibyte
bitesize = 4_000_000 # the size of the reads (and writes) working up to chunksize
count = 0
starttime = time.perf_counter()
infile = open(infile_name, "rb", buffering=0)
outfile = open(outfile_template.format(count), "wb", buffering=0)
while True:
byteswritten = 0
while byteswritten < chunksize:
bite = infile.read(bitesize)
# check for EOF
if not bite:
break
outfile.write(bite)
byteswritten += len(bite)
# check for EOF
if not bite:
break
for i in range(2):
l = infile.readline()
# check for EOF
if not l:
break
outfile.write(l)
# check for EOF
if not l:
break
outfile.close()
count += 1
print( count )
outfile = open(outfile_template.format(count), "wb", buffering=0)
outfile.close()
infile.close()
endtime = time.perf_counter()
elapsed = endtime-starttime
print( f"Elapsed= {elapsed}" )
NOTE I haven't exhaustively tested this doesn't lose data, although no evidence it does lose anything you should validate that yourself.
Might be useful to add some robustness by checking when at the end of a chunk to see how much data is left to read, so you don't end up with the last output file being 0-length (or shorter than bitesize)
HTH
barny
I used slightly modificated version of this for parsing 250GB json, I choose how many smaller files I need number_of_slices and then I find positions where to slice a file (I always look for line end). FInally i slice file with file.seek and file.read(chunk)
import os
import mmap
FULL_PATH_TO_FILE = 'full_path_to_a_big_file'
OUTPUT_PATH = 'full_path_to_a_output_dir' # where sliced files will be generated
def next_newline_finder(mmapf):
def nl_find(mmapf):
while 1:
current = hex(mmapf.read_byte())
if hex(ord('\n')) == current: # or whatever line-end symbol
return(mmapf.tell())
return nl_find(mmapf)
# find positions where to slice a file
file_info = os.stat(FULL_PATH_TO_FILE)
file_size = file_info.st_size
positions_for_file_slice = [0]
number_of_slices = 15 # say u want slice the big file to 15 smaller files
size_per_slice = file_size//number_of_slices
with open(FULL_PATH_TO_FILE, "r+b") as f:
mmapf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
slice_counter = 1
while slice_counter < number_of_slices:
pos = size_per_slice*slice_counter
mmapf.seek(pos)
newline_pos = next_newline_finder(mmapf)
positions_for_file_slice.append(newline_pos)
slice_counter += 1
# create ranges for found positions (from, to)
positions_for_file_slice = [(pos, positions_for_file_slice[i+1]) if i < (len(positions_for_file_slice)-1) else (
positions_for_file_slice[i], file_size) for i, pos in enumerate(positions_for_file_slice)]
# do actual slice of a file
with open(FULL_PATH_TO_FILE, "rb") as f:
for i, position_pair in enumerate(positions_for_file_slice):
read_from, read_to = position_pair
f.seek(read_from)
chunk = f.read(read_to-read_from)
with open(os.path.join(OUTPUT_PATH, f'dummyfile{i}.json'), 'wb') as chunk_file:
chunk_file.write(chunk)
Here is an example of line-by-line writes. Its opened in binary mode to avoid the line decode step which takes a modest amount of time but can skew character counts. For instance, utf-8 encoding may use multiple bytes on disk for a single python character.
4 Meg is a guess at buffering. The idea is to get the operating system to read more of the file at once, reducing seek times. Whether this works or the best number to use is debatable - and will be different for different operating systems. I found 4 meg makes a difference... but that was years ago and things change.
outfile_template = "outfile-{}.txt"
infile_name = "infile.txt"
chunksize = 10_000_000_000
MEB = 2**20 # mebibyte
count = 0
byteswritten = 0
infile = open(infile_name, "rb", buffering=4*MEB)
outfile = open(outfile_template.format(count), "wb", buffering=4*MEB)
try:
for line in infile:
if byteswritten > chunksize:
outfile.close()
byteswritten = 0
count += 1
outfile = open(outfile_template.format(count), "wb", buffering=4*MEB)
outfile.write(line)
byteswritten += len(line)
finally:
infile.close()
outfile.close()

Find and remove duplicates in a CSV file

I have a large CSV file (1.8 GB) with three columns. Each row contains two strings and a numerical value. The problem is that they are duplicate but swapped.
Example:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
The desired output would look like this:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
because the third row contains the same information like the first row.
EDIT
The data basically looks like this (Strings in the first two columns and a numerical value in the third, 40 Million lines):
Blockquote
Can you handle awk:
$ awk -F, '++seen[$3]==1' file
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
Explaied:
$ awk -F, ' # set comma as field delimiter
++seen[$3]==1 # count instances of the third field to hash, printing only first
' file
Update:
$ awk -F, '++seen[($1<$2?$1 FS $2:$2 FS $1)]==1' file
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
It hashes every met combination of first and second field so that "ABC,DEF"=="DEF,ABC" and counts them printing only the first. ($1<$2?$1 FS $2:$2 FS $1): if first field is less than second, hash 1st,2nd else hash 2nd,1st.
From the problem description, the mandate for a line to be NOT omitted is when
the first and second fields in either order when concatenated should be unique.
If so, below awk would help
awk -F, '{seen[$1,$2]++;seen[$2,$1]++}seen[$1,$2]==1 && seen[$2,$1]==1' filename
Sample Input
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
GHI,ABC,123
DEF,ABC,123
ABC,GHI,123
DEF,GHI,123
Sample Output
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
GHI,ABC,123
DEF,GHI,123
Note: This question was done before the OP changed the python tag for awk tag.
If you don't mind the order of the elements you might do:
with open("in.csv", "r") as file:
lines = set()
for line in file:
lines.add(frozenset(line.strip("\n").split(",")))
with open("out.csv", "w") as file:
for line in lines:
file.write(",".join(line)+"\n")
Output:
Col2,COL1,Col3
EFG,454,ABC
DEF,123,ABC
Note that you might want to treat the first line (the titles) in an special way to not loose their order.
But if the order matter you could use the code from Maintaining the order of the elements in a frozen set:
from itertools import filterfalse
def unique_everseen(iterable, key=None):
seen = set()
seen_add = seen.add
if key is None:
for element in filterfalse(seen.__contains__, iterable):
seen_add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen_add(k)
yield element
with open("in.csv", "r") as file:
lines = []
for line in file:
lines.append(line.strip("\n").split(","))
with open("out.csv", "w") as file:
for line in unique_everseen(lines, key=frozenset):
file.write(",".join(line)+"\n")
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
The OP said that both codes seem to not work on large files (1.8 Gb). I think it may be due to the fact that both codes store the file in a list using the RAM, and a file of 1.8 GB might take all the available space on memory.
In order to solve that I made a few more attempts. Sadly, I must say that all of them are extremely slow compared to the first attempt. The firsts codes sacrifice RAM consumption for speed, but the following codes sacrifice speed, CPU and hard drive for less RAM consumption (instead of consuming the whole file size in RAM they take less than 50 Mb).
Since all of this examples needs a higher hard drive usage, it's advisable to has the "input" and "output" file on different hard drives.
My first attempt using less RAM is with the shelve module:
import shelve, os
with shelve.open("tmp") as db:
with open("in.csv", "r") as file:
for line in file:
l = line.strip("\n").split(",")
l.sort()
db[",".join(l)] = l
with open("out.csv", "w") as file:
for v in db.values():
file.write(",".join(v)+"\n")
os.remove("temp.bak")
os.remove("temp.dat")
os.remove("temp.dir")
Sadly, this code takes hundred of times more than the first two codes which uses the RAM.
Another attempt is:
with open("in.csv", "r") as fileRead:
# total = sum(1 for _ in fileRead)
# fileRead.seek(0)
# i = 0
with open("out.csv", "w") as _:
pass
with open("out.csv", "r+") as fileWrite:
for lineRead in fileRead:
# i += 1
line = lineRead.strip("\n").split(",")
lineSet = set(line)
write = True
fileWrite.seek(0)
for lineWrite in fileWrite:
if lineSet == set(lineWrite.strip("\n").split(",")):
write = False
if write:
pass
fileWrite.write(",".join(line)+"\n")
# if i / total * 100 % 1 == 0: print(f"{i / total * 100}% ({i} / {total})")
This is slightly faster but not much.
If your computer has several cores, you could try to use multiprocessing:
from multiprocessing import Process, Queue, cpu_count
from os import remove
def slave(number, qIn, qOut):
name = f"slave-{number}.csv"
with open(name, "w") as file:
pass
with open(name, "r+") as file:
while True:
if not qIn.empty():
get = qIn.get()
if get == False:
qOut.put(name)
break
else:
write = True
file.seek(0)
for line in file:
if set(line.strip("\n").split(",")) == get[1]:
write = False
break
if write:
file.write(get[0])
def master():
qIn = Queue(1)
qOut = Queue()
slaves = cpu_count()
slavesList = []
for n in range(slaves):
slavesList.append(Process(target=slave, daemon=True, args=(n, qIn, qOut)))
for s in slavesList:
s.start()
with open("in.csv", "r") as file:
for line in file:
lineSet = set(line.strip("\n").split(","))
qIn.put((line, lineSet))
for _ in range(slaves):
qIn.put(False)
for s in slavesList:
s.join()
slavesList = []
with open(qOut.get(), "r+") as fileMaster:
for x in range(slaves-1):
file = qOut.get()
with open(file, "r") as fileSlave:
for lineSlave in fileSlave:
lineSet = set(lineSlave.strip("\n").split(","))
write = True
fileMaster.seek(0)
for lineMaster in fileMaster:
if set(lineMaster.strip("\n").split(",")) == lineSet:
write = False
break
if write:
fileMaster.write(lineSlave)
slavesList.append(Process(target=remove, daemon=True, args=(file,)))
slavesList[-1].start()
for s in slavesList:
s.join()
As you can see, I have the disappointing task to tell you that my both attempts work really slow. I hope you find a better approach, otherwise, it will take hours if not days to execute on 1,8 GB of data (the real time will primarily depend on the number of repeated values, which reduces time).
A new attempt: instead of storing every in file, this attempt stores the active portion on memory, and then write down on a file in order to process chunks faster. Then, the chunks must be read again by using one of the above methods:
lines = set()
maxLines = 1000 # This is the amount of lines that will be stored at the same time on RAM. Higher numbers are faster but requeires more RAM on the computer
perfect = True
with open("in.csv", "r") as fileRead:
total = sum(1 for _ in fileRead)
fileRead.seek(0)
i = 0
with open("tmp.csv", "w") as fileWrite:
for line in fileRead:
if (len(lines) < maxLines):
lines.add(frozenset(line.strip("\n").split(",")))
i += 1
if i / total * 100 % 1 == 0: print(f"Reading {i / total * 100}% ({i} / {total})")
else:
perfect = False
j = 0
for line in lines:
j += 1
fileWrite.write(",".join(line) + "\n")
if i / total * 100 % 1 == 0: print(f"Storing {i / total * 100}% ({i} / {total})")
lines = set()
if (not perfect):
use_one_of_the_above_methods() # Remember to read the tmp.csv and not the in.csv
This might boost the speed. You can change maxLines by any number you like, remember that higher the number, greater speed (not sure if really big numbers do the opposite) but higher RAM consumption.
If you want to use csv library itself:-
you can use a DictReader and DictWriter.
Import csv
def main():
"""Read csv file, delete duplicates and write it."""
with open('test.csv', 'r',newline='') as inputfile:
with open('testout.csv', 'w', newline='') as outputfile:
duplicatereader = csv.DictReader(inputfile, delimiter=',')
uniquewrite = csv.DictWriter(outputfile, fieldnames=['address', 'floor', 'date', 'price'], delimiter=',')
uniquewrite.writeheader()
keysread = []
for row in duplicatereader:
key = (row['date'], row['price'])
if key not in keysread:
print(row)
keysread.append(key)
uniquewrite.writerow(row)
if __name__ == '__main__':
main()

Divide .csv file into chunks with Python

I have a large .csv file that is well over 300 gb. I would like to chunk it into smaller files of 100,000,000 rows each (each row has approximately 55-60 bytes).
I wrote the following code:
import pandas as pd
df = pd.read_csv('/path/to/really/big.csv',header=None,chunksize=100000000)
count = 1
for chunk in df:
name = '/output/to/this/directory/file_%s.csv' %s count
chunk.to_csv(name,header=None,index=None)
print(count)
count+=1
This code works fine, and I have plenty of memory on disk to store the approximate 5.5-6 gb at a time, but it's slow.
Is there a better way?
EDIT
I have written the following iterative solution:
with open('/path/to/really/big.csv', 'r') as csvfile:
read_rows = csv.reader(csvfile)
file_count = 1
row_count = 1
f = open('/output/to/this/directory/file_%s.csv' %s count,'w')
for row in read_rows:
f.write(''.join(row))
row_count+=1
if row_count % 100000000 == 0:
f.close()
file_count += 1
f = open('/output/to/this/directory/file_%s.csv' %s count,'w')
EDIT 2
I would like to call attention to Vor's comment about using a Unix/Linux split command, this is the fastest solution I have found.
there is an existing tool for this in Unix/Linux.
split -l 100000 -d source destination
will add two digit numerical suffix to destination prefix for the chunks.
You don't really need to read all that data into a pandas DataFrame just to split the file - you don't even need to read the data all into memory at all. You could seek to the approximate offset you want to split at, then scan forward until you find a line break, and loop reading much smaller chunks from the source file into a destination file between your start and end offsets. (This approach assumes your CSV doesn't have any column values with embedded newlines.)
SMALL_CHUNK = 100000
def write_chunk(source_file, start, end, dest_name):
pos = start
source_file.seek(pos)
with open(dest_name, 'w') as dest_file:
for chunk_start in range(start, end, SMALL_CHUNK):
chunk_end = min(chunk_start + SMALL_CHUNK, end)
dest_file.write(source_file.read(chunk_end - chunk_start))
Actually, an intermediate solution could be to use the csv module - that would still parse all of the lines in the file, which isn't strictly necessary, but would avoid reading huge arrays into memory for each chunk.

Cut a large text file in small files

I have a text files which contains 1000000 lines. I want to split it in files that contains 15000 lines each. E.g, first file contains 1 to 15000 lines, next file 15001 to 30000 lines and so on. This is what I have done :
lines = open('myfile.txt').readlines()
open('1_15000.txt', 'w').writelines(lines[0:15000])
open('15001_30000.txt', 'w').writelines(lines[15000:30000])
open('30000_45000.txt', 'w').writelines(lines[30000:45000])
open('45000_60000.txt', 'w').writelines(lines[45000:60000])
...
...
... so on till 1000000
But this code looks too long. Is there any way I can do this using any loop so that I don't have to wrote separate code for each file?
lines = open('myfile.txt').readlines()
for i in range(0, 1000000, 15000):
open('{0}_{1}.txt'.format(i+1, i+15000), 'w').writelines(lines[i:i+15000])
Hope this helps.
You could try something like:
lines = open('myfile.txt').readlines()
count = 0
incr = 15000
while count<len(lines):
open(str(count)+'_'+str(count+incr)+'.txt', 'w').writelines(lines[count:incr])
count += incr
Note that you can also do this with the linux split utility. No need to reinvent the wheel!
lines = open('myfile.txt').readlines()
loads the entire file into a Python list. You won't want to do this when a file is large since it may cause your machine to run out of memory.
This splits the file into chunks of N lines. Each chunk is a list. It stops when a chunk is an empty list.
import itertools as IT
N = 15000
with open('data', 'rb') as f:
for i, chunk in enumerate(iter(lambda: list(IT.islice(f, N)), [])):
outfile = '{:06d}_{:06d}.txt'.format(i*N, (i+1)*N)
with open(outfile, 'wb') as g:
g.writelines(chunk)
If the file contains N empty lines, then the above method may end prematurely. Or if N were really large, reading N lines into a Python list may cause a MemoryError. You could avoid these problems by handling one line at a time (by calling next(f)) and catching the StopIteration exception that signals the end of the file:
import itertools as IT
N = 15000
with open('data', 'rb') as f:
try:
for i in IT.count():
outfile = '{:06d}_{:06d}.txt'.format(i*N, (i+1)*N)
with open(outfile, 'wb') as g:
for j in range(N):
line = next(f)
g.write(line)
except StopIteration:
pass

How do I split a huge text file in python

I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise I wanted to write a program in python to do it.
What I think I want the program to do is to find the size of a file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up-to the next line-break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.
Can you help me with the key filesystem related parts: filesize, reading and writing in chunks and reading to a line-break?
I'll be writing this code test-first, so there's no need to give me a complete answer, unless its a one-liner ;-)
linux has a split command
split -l 100000 file.txt
would split into files of equal 100,000 line size
Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
Now, there is a pypi module available that you can use to split files of any size into chunks. Check this out
https://pypi.org/project/filesplit/
As an alternative method, using the logging library:
>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt",
maxBytes=2**20*100, backupCount=100)
# 100 MB each, up to a maximum of 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> f = open("D://biglog.txt")
>>> while True:
... log.info(f.readline().strip())
Your files will appear as follows:
filename.txt (end of file)
filename.txt.1
filename.txt.2
...
filename.txt.10 (start of file)
This is a quick and easy way to make a huge log file match your RotatingFileHandler implementation.
This generator method is a (slow) way to get a slice of lines without blowing up your memory.
import itertools
def slicefile(filename, start, end):
lines = open(filename)
return itertools.islice(lines, start, end)
out = open("/blah.txt", "w")
for line in slicefile("/python27/readme.txt", 10, 15):
out.write(line)
don't forget seek() and mmap() for random access to files.
def getSomeChunk(filename, start, len):
fobj = open(filename, 'r+b')
m = mmap.mmap(fobj.fileno(), 0)
return m[start:start+len]
While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:
def splitfile(infilepath, chunksize):
fname, ext = infilepath.rsplit('.',1)
i = 0
written = False
with open(infilepath) as infile:
while True:
outfilepath = "{}{}.{}".format(fname, i, ext)
with open(outfilepath, 'w') as outfile:
for line in (infile.readline() for _ in range(chunksize)):
outfile.write(line)
written = bool(line)
if not written:
break
i += 1
You can use wc and split (see the respective manpages) to get the desired effect. In bash:
split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.
produces 3 parts of the same linecount (with a rounding error in the last, of course), named filename-chunk.00 to filename-chunk.02.
I've written the program and it seems to work fine. So thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read to see if its any quicker.
def Split(inputFile,numParts,outputName):
fileSize=os.stat(inputFile).st_size
parts=FileSizeParts(fileSize,numParts)
openInputFile = open(inputFile, 'r')
outPart=1
for part in parts:
if openInputFile.tell()<fileSize:
fullOutputName=outputName+os.extsep+str(outPart)
outPart+=1
openOutputFile=open(fullOutputName,'w')
openOutputFile.writelines(openInputFile.readlines(part))
openOutputFile.close()
openInputFile.close()
return outPart-1
usage - split.py filename splitsizeinkb
import os
import sys
def getfilesize(filename):
with open(filename,"rb") as fr:
fr.seek(0,2) # move to end of the file
size=fr.tell()
print("getfilesize: size: %s" % size)
return fr.tell()
def splitfile(filename, splitsize):
# Open original file in read only mode
if not os.path.isfile(filename):
print("No such file as: \"%s\"" % filename)
return
filesize=getfilesize(filename)
with open(filename,"rb") as fr:
counter=1
orginalfilename = filename.split(".")
readlimit = 5000 #read 5kb at a time
n_splits = filesize//splitsize
print("splitfile: No of splits required: %s" % str(n_splits))
for i in range(n_splits+1):
chunks_count = int(splitsize)//int(readlimit)
data_5kb = fr.read(readlimit) # read
# Create split files
print("chunks_count: %d" % chunks_count)
with open(orginalfilename[0]+"_{id}.".format(id=str(counter))+orginalfilename[1],"ab") as fw:
fw.seek(0)
fw.truncate()# truncate original if present
while data_5kb:
fw.write(data_5kb)
if chunks_count:
chunks_count-=1
data_5kb = fr.read(readlimit)
else: break
counter+=1
if __name__ == "__main__":
if len(sys.argv) < 3: print("Filename or splitsize not provided: Usage: filesplit.py filename splitsizeinkb ")
else:
filesize = int(sys.argv[2]) * 1000 #make into kb
filename = sys.argv[1]
splitfile(filename, filesize)
Here is a python script you can use for splitting large files using subprocess:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
file_path = sys.argv[1]
# i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
os.path.dirname(file_path) + '/'])
# Remove the original file once done splitting
try:
os.remove(file_path)
except OSError:
pass
You can call it externally:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
You can also import subprocess and run it directly in your program.
The issue with this approach is high memory usage: subprocess creates a fork with a memory footprint same size as your process and if your process memory is already heavy, it doubles it for the time that it runs. The same thing with os.system.
Here is another pure python way of doing this, although I haven't tested it on huge files, it's going to be slower but be leaner on memory:
CHUNK_SIZE = 5000
def yield_csv_rows(reader, chunk_size):
"""
Opens file to ingest, reads each line to return list of rows
Expects the header is already removed
Replacement for ingest_csv
:param reader: dictReader
:param chunk_size: int, chunk size
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
with open(local_file_path, 'rb') as f:
f.readline().strip().replace('"', '')
reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
chunks = yield_csv_rows(reader, CHUNK_SIZE)
for chunk in chunks:
if not chunk:
break
# Do something with your chunk here
Here is another example using readlines():
"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5
def yield_rows(reader, chunk_size):
"""
Yield row chunks
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
def batch_operation(data):
for item in data:
print(item)
with open('file', 'r') as f:
chunks = yield_rows(f.readlines(), CHUNK_SIZE)
for _chunk in chunks:
batch_operation(_chunk)
The readlines example demonstrates how to chunk your data to pass chunks to function that expects chunks. Unfortunately readlines opens the whole file in memory, its better to use the reader example for performance. Although if you can easily fit what you need into memory and need to process it in chunks this should suffice.
You can achieve splitting any file to chunks like below, here the CHUNK_SIZE is 500000 bytes(500kb) and content can be any file :
for idx,val in enumerate(get_chunk(content, CHUNK_SIZE)):
data=val
index=idx
def get_chunk(content,size):
for i in range(0,len(content),size):
yield content[i:i+size]
This worked for me
import os
fil = "inputfile"
outfil = "outputfile"
f = open(fil,'r')
numbits = 1000000000
for i in range(0,os.stat(fil).st_size/numbits+1):
o = open(outfil+str(i),'w')
segment = f.readlines(numbits)
for c in range(0,len(segment)):
o.write(segment[c]+"\n")
o.close()
I had a requirement to split csv files for import into Dynamics CRM since the file size limit for import is 8MB and the files we receive are much larger. This program allows user to input FileNames and LinesPerFile, and then splits the specified files into the requested number of lines. I can't believe how fast it works!
# user input FileNames and LinesPerFile
FileCount = 1
FileNames = []
while True:
FileName = raw_input('File Name ' + str(FileCount) + ' (enter "Done" after last File):')
FileCount = FileCount + 1
if FileName == 'Done':
break
else:
FileNames.append(FileName)
LinesPerFile = raw_input('Lines Per File:')
LinesPerFile = int(LinesPerFile)
for FileName in FileNames:
File = open(FileName)
# get Header row
for Line in File:
Header = Line
break
FileCount = 0
Linecount = 1
for Line in File:
#skip Header in File
if Line == Header:
continue
#create NewFile with Header every [LinesPerFile] Lines
if Linecount % LinesPerFile == 1:
FileCount = FileCount + 1
NewFileName = FileName[:FileName.find('.')] + '-Part' + str(FileCount) + FileName[FileName.find('.'):]
NewFile = open(NewFileName,'w')
NewFile.write(Header)
NewFile.write(Line)
Linecount = Linecount + 1
NewFile.close()
import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)
For example if you want 50000 lines in one files and path is /home/data then you can run below command
subprocess.run('split -l 50000 /home/data', shell = True)
If you are not sure how many lines to keep in split files but knows how many split you want then In Jupyter Notebook/Shell you can check total number of Lines using below command and then divide by total number of split you want
! wc -l file_path
in this case
! wc -l /home/data
And Just so you know output file will not have file extension but its same extension as input file You can change it manually if Windows
Or, a python version of wc and split:
lines = 0
for l in open(filename): lines += 1
Then some code to read the first lines/3 into one file, the next lines/3 into another , etc.

Categories