Is it possible to split a file? For example, I have a huge wordlist and I want to split it into more than one file. How can I do this?
This one splits a file up by newlines and writes it back out. You can change the delimiter easily. It also handles uneven amounts, in case you don't have a multiple of splitLen lines (20 in this example) in your input file.
splitLen = 20 # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.
# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
input = open('input.txt', 'r').read().split('\n')
at = 1
for lines in range(0, len(input), splitLen):
# First, get the list slice
outputData = input[lines:lines+splitLen]
# Now open the output file, join the new slice with newlines
# and write it out. Then close the file.
output = open(outputBase + str(at) + '.txt', 'w')
output.write('\n'.join(outputData))
output.close()
# Increment the counter
at += 1
A better loop for sli's example, which doesn't hog memory:
splitLen = 20 # 20 lines per file
outputBase = 'output' # output0.txt, output1.txt, etc.
input = open('input.txt', 'r')
count = 0
at = 0
dest = None
for line in input:
if count % splitLen == 0:
if dest: dest.close()
dest = open(outputBase + str(at) + '.txt', 'w')
at += 1
dest.write(line)
count += 1
Solution to split binary files into chapters .000, .001, etc.:
FILE = 'scons-conversion.7z'
MAX = 500*1024*1024 # 500 MB - max chapter size
BUF = 50*1024*1024 # 50 MB - memory buffer size
chapters = 0
uglybuf = ''
with open(FILE, 'rb') as src:
while True:
tgt = open(FILE + '.%03d' % chapters, 'wb')
written = 0
while written < MAX:
if len(uglybuf) > 0:
tgt.write(uglybuf)
tgt.write(src.read(min(BUF, MAX - written)))
written += min(BUF, MAX - written)
uglybuf = src.read(1)
if len(uglybuf) == 0:
break
tgt.close()
if len(uglybuf) == 0:
break
chapters += 1
def split_file(file, prefix, max_size, buffer=1024):
"""
file: the input file
prefix: prefix of the output files that will be created
max_size: maximum size of each created file in bytes
buffer: buffer size in bytes
Returns the number of parts created.
"""
    with open(file, 'rb') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'wb') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += len(data)
                    else:
                        return suffix + 1
                suffix += 1
def cat_files(infiles, outfile, buffer=1024):
"""
infiles: a list of files
outfile: the file that will be created
buffer: buffer size in bytes
"""
with open(outfile, 'w+b') as tgt:
for infile in sorted(infiles):
with open(infile, 'r+b') as src:
while True:
data = src.read(buffer)
if data:
tgt.write(data)
else:
break
Sure it's possible:
open input file
open output file 1
count = 0
for each line in file:
write to output file
count = count + 1
if count > maxlines:
close output file
open next output file
count = 0
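A minimal Python sketch of that pseudocode (the file names and the maxlines value are just placeholders):

maxlines = 20
count = 0
part = 1
outfile = open('output%d.txt' % part, 'w')
with open('input.txt') as infile:
    for line in infile:
        outfile.write(line)
        count += 1
        if count >= maxlines:   # start a new output file every maxlines lines
            outfile.close()
            part += 1
            outfile = open('output%d.txt' % part, 'w')
            count = 0
outfile.close()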
import re
PATENTS = 'patent.data'
def split_file(filename):
# Open file to read
with open(filename, "r") as r:
# Counter
n=0
# Start reading file line by line
for i, line in enumerate(r):
            # If the line matches the template <?xml, increase counter n
            if re.match(r'\<\?xml', line):
                n += 1
            # This "if" can be deleted; without it the naming starts from 1.
            # Whether to keep it depends on where "re" first finds the
            # template. In my case it was the first line.
            if i == 0:
                n = 0
# Write lines to file
with open("{}-{}".format(PATENTS, n), "a") as f:
f.write(line)
split_file(PATENTS)
As a result you will get:
patent.data-0
patent.data-1
patent.data-N
You can use the filesplit module from PyPI.
This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.
Assuming you have a (huge) file file.txt that you want to split in chunks of MAXLINES lines file_part1.txt, ..., file_partn.txt, you could do:
import itertools

MAXLINES = 3  # lines per output file
with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)
import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)
For example, if you want 50000 lines per file and the path is /home/data, then you can run the command below:
subprocess.run('split -l 50000 /home/data', shell = True)
If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines using the command below and then divide it by the number of splits you want:
! wc -l file_path
In this case:
! wc -l /home/data
Note that the output files will not have a file extension, even though the input file does; you can add one manually if you are on Windows.
All the provided answers are good and (probably) work. However, they need to load the file into memory (as a whole or in part). We know Python is not very efficient at this kind of task (or at least not as efficient as the OS-level commands).
I found the following is the most efficient way to do it:
import os
MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"
if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
print("Done:")
print(os.system(f"ls {PREFIX}??"))
else:
print("Failed!")
Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
Related
I have a large CSV file (1.8 GB) with three columns. Each row contains two strings and a numerical value. The problem is that some rows are duplicates with the first two columns swapped.
Example:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
The desired output would look like this:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
because the third row contains the same information as the first row.
EDIT
The data basically looks like this: strings in the first two columns and a numerical value in the third, 40 million lines.
If you can use awk:
$ awk -F, '++seen[$3]==1' file
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
Explained:
$ awk -F, ' # set comma as field delimiter
++seen[$3]==1 # count instances of the third field to hash, printing only first
' file
Update:
$ awk -F, '++seen[($1<$2?$1 FS $2:$2 FS $1)]==1' file
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
It hashes every combination of the first and second fields it meets, so that "ABC,DEF" == "DEF,ABC", and counts them, printing only the first. ($1<$2?$1 FS $2:$2 FS $1): if the first field is less than the second, hash 1st,2nd; otherwise hash 2nd,1st.
From the problem description, a line should be kept only when the first and second fields, concatenated in either order, have not been seen before. If so, the awk below would help:
awk -F, '{seen[$1,$2]++;seen[$2,$1]++}seen[$1,$2]==1 && seen[$2,$1]==1' filename
Sample Input
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
GHI,ABC,123
DEF,ABC,123
ABC,GHI,123
DEF,GHI,123
Sample Output
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
GHI,ABC,123
DEF,GHI,123
Note: this answer was written before the OP replaced the python tag with the awk tag.
If you don't mind the order of the elements you might do:
with open("in.csv", "r") as file:
lines = set()
for line in file:
lines.add(frozenset(line.strip("\n").split(",")))
with open("out.csv", "w") as file:
for line in lines:
file.write(",".join(line)+"\n")
Output:
Col2,COL1,Col3
EFG,454,ABC
DEF,123,ABC
Note that you might want to treat the first line (the titles) in a special way so as not to lose their order.
But if the order matters, you could use the code from Maintaining the order of the elements in a frozen set:
from itertools import filterfalse
def unique_everseen(iterable, key=None):
seen = set()
seen_add = seen.add
if key is None:
for element in filterfalse(seen.__contains__, iterable):
seen_add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen_add(k)
yield element
with open("in.csv", "r") as file:
lines = []
for line in file:
lines.append(line.strip("\n").split(","))
with open("out.csv", "w") as file:
for line in unique_everseen(lines, key=frozenset):
file.write(",".join(line)+"\n")
Output:
COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
The OP said that both codes seem not to work on large files (1.8 GB). I think this may be because both codes store the file in a list in RAM, and a 1.8 GB file might take up all the available memory.
In order to solve that I made a few more attempts. Sadly, I must say that all of them are extremely slow compared to the first attempt. The first codes sacrifice RAM consumption for speed, while the following ones sacrifice speed, CPU and hard drive for lower RAM consumption (instead of consuming the whole file size in RAM they take less than 50 MB).
Since all of these examples need heavier hard-drive usage, it's advisable to keep the input and output files on different hard drives.
My first attempt using less RAM is with the shelve module:
import shelve, os
with shelve.open("tmp") as db:
with open("in.csv", "r") as file:
for line in file:
l = line.strip("\n").split(",")
l.sort()
db[",".join(l)] = l
with open("out.csv", "w") as file:
for v in db.values():
file.write(",".join(v)+"\n")
os.remove("temp.bak")
os.remove("temp.dat")
os.remove("temp.dir")
Sadly, this code takes hundreds of times longer than the first two codes, which use the RAM.
Another attempt is:
with open("in.csv", "r") as fileRead:
# total = sum(1 for _ in fileRead)
# fileRead.seek(0)
# i = 0
with open("out.csv", "w") as _:
pass
with open("out.csv", "r+") as fileWrite:
for lineRead in fileRead:
# i += 1
line = lineRead.strip("\n").split(",")
lineSet = set(line)
write = True
fileWrite.seek(0)
for lineWrite in fileWrite:
if lineSet == set(lineWrite.strip("\n").split(",")):
write = False
            if write:
                fileWrite.write(",".join(line) + "\n")
# if i / total * 100 % 1 == 0: print(f"{i / total * 100}% ({i} / {total})")
This is slightly faster but not much.
If your computer has several cores, you could try to use multiprocessing:
from multiprocessing import Process, Queue, cpu_count
from os import remove
def slave(number, qIn, qOut):
name = f"slave-{number}.csv"
with open(name, "w") as file:
pass
with open(name, "r+") as file:
while True:
if not qIn.empty():
get = qIn.get()
if get == False:
qOut.put(name)
break
else:
write = True
file.seek(0)
for line in file:
if set(line.strip("\n").split(",")) == get[1]:
write = False
break
if write:
file.write(get[0])
def master():
qIn = Queue(1)
qOut = Queue()
slaves = cpu_count()
slavesList = []
for n in range(slaves):
slavesList.append(Process(target=slave, daemon=True, args=(n, qIn, qOut)))
for s in slavesList:
s.start()
with open("in.csv", "r") as file:
for line in file:
lineSet = set(line.strip("\n").split(","))
qIn.put((line, lineSet))
for _ in range(slaves):
qIn.put(False)
for s in slavesList:
s.join()
slavesList = []
with open(qOut.get(), "r+") as fileMaster:
for x in range(slaves-1):
file = qOut.get()
with open(file, "r") as fileSlave:
for lineSlave in fileSlave:
lineSet = set(lineSlave.strip("\n").split(","))
write = True
fileMaster.seek(0)
for lineMaster in fileMaster:
if set(lineMaster.strip("\n").split(",")) == lineSet:
write = False
break
if write:
fileMaster.write(lineSlave)
slavesList.append(Process(target=remove, daemon=True, args=(file,)))
slavesList[-1].start()
for s in slavesList:
s.join()
As you can see, I have the disappointing task of telling you that both of my attempts work really slowly. I hope you find a better approach; otherwise it will take hours if not days to execute on 1.8 GB of data (the real time will primarily depend on the number of repeated values, which reduces the time).
A new attempt: instead of storing everything in a file, this attempt stores the active portion in memory and then writes it to a file in order to process the chunks faster. Then the chunks must be read again using one of the above methods:
lines = set()
maxLines = 1000 # This is the number of lines that will be stored at the same time in RAM. Higher numbers are faster but require more RAM on the computer
perfect = True
with open("in.csv", "r") as fileRead:
total = sum(1 for _ in fileRead)
fileRead.seek(0)
i = 0
with open("tmp.csv", "w") as fileWrite:
for line in fileRead:
if (len(lines) < maxLines):
lines.add(frozenset(line.strip("\n").split(",")))
i += 1
if i / total * 100 % 1 == 0: print(f"Reading {i / total * 100}% ({i} / {total})")
else:
perfect = False
j = 0
for line in lines:
j += 1
fileWrite.write(",".join(line) + "\n")
if i / total * 100 % 1 == 0: print(f"Storing {i / total * 100}% ({i} / {total})")
lines = set()
if (not perfect):
use_one_of_the_above_methods() # Remember to read the tmp.csv and not the in.csv
This might boost the speed. You can set maxLines to any number you like; remember that the higher the number, the greater the speed (I'm not sure whether really big numbers do the opposite), but also the higher the RAM consumption.
If you want to use the csv library itself, you can use a DictReader and DictWriter.
import csv
def main():
"""Read csv file, delete duplicates and write it."""
with open('test.csv', 'r',newline='') as inputfile:
with open('testout.csv', 'w', newline='') as outputfile:
duplicatereader = csv.DictReader(inputfile, delimiter=',')
uniquewrite = csv.DictWriter(outputfile, fieldnames=['address', 'floor', 'date', 'price'], delimiter=',')
uniquewrite.writeheader()
keysread = []
for row in duplicatereader:
key = (row['date'], row['price'])
if key not in keysread:
print(row)
keysread.append(key)
uniquewrite.writerow(row)
if __name__ == '__main__':
main()
Newbie here. Ultimate mission is to learn how to take two big yaml files and split them into several hundred small files. I haven't yet figured out how to use the ID # as the filename, so one thing at a time.
First: split the big files into many. Here's a tiny bit of my test data file test-file.yml. Each post has a - delimiter on a line by itself:
-
ID: 627
more_post_meta_data_and_content
-
ID: 628
And here's my code that isn't working. So far I don't see why:
with open('test-file.yml', 'r') as myfile:
start = 0
cntr = 1
holding = ''
for i in myfile.read().split('\n'):
if (i == '-\n'):
if start==1:
with open(str(cntr) + '.md','w') as opfile:
opfile.write(op)
opfile.close()
holding=''
cntr += 1
else:
start=1
else:
if holding =='':
holding = i
else:
holding = holding + '\n' + i
myfile.close()
All hints, suggestions, pointers welcome. Thanks.
Reading the entire file into memory and then splitting the memory regions is very inefficient if the input files are large. Try this instead:
with open('test-file.yml', 'r') as myfile:
opfile = None
cntr = 1
for line in myfile:
if line == '-\n':
if opfile is not None:
opfile.close()
opfile = open('{0}.md'.format(cntr),'w')
cntr += 1
opfile.write(line)
opfile.close()
Notice also, you don't close things you have opened in a with context manager; the very purpose of the context manager is to take care of this for you.
As a newbie myself, at first glance: you're trying to write an undeclared variable op to your output. You were nearly spot on; you just need to write out the content you collected in holding (and note that split('\n') strips the newlines, so the delimiter line is '-', not '-\n'):
with open('test-file.yml', 'r') as myfile:
start = 0
cntr = 1
holding = ''
for i in myfile.read().split('\n'):
        if (i == '-'):  # split('\n') strips the newline
if start==1:
with open(str(cntr) + '.md','w') as opfile:
                    opfile.write(holding)
opfile.close()
holding=''
cntr += 1
else:
start=1
else:
if holding =='':
holding = i
else:
holding = holding + '\n' + i
myfile.close()
Hope this helps!
When you are working in a with context on an open file, the with block automatically closes the file for you when you exit it. So you don't need file.close() anywhere.
There is a function called readlines(), but it returns a list of all the lines at once. Iterating over the open file object itself reads one line at a time, which works much more efficiently than read() followed by a split(). Think about it: you are loading a massive file into memory and then asking the CPU to split that ginormous text on the \n character. Not very efficient.
You wrote opfile.write(op). Where is this op defined? Don't you want to write the content in holding that you've defined?
Try the following.
with open('test.data', 'r') as myfile:
counter = 1
content = ""
start = True
    for line in myfile:
if line == "-\n" and not start:
with open(str(counter) + '.md', 'w') as opfile:
opfile.write(content)
content = ""
counter += 1
else:
if not start:
content += line
start = False
# write the last file if test-file.yml doesn't end with a dash
if content != "":
with open(str(counter) + '.md', 'w') as opfile:
opfile.write(content)
I'm currently working my way through a free online Python school. The following template has been given to me, and I am to complete the function so that it returns the size of the file (number of bytes) and the number of newlines ("\n"). I'm completely stuck. Any help would be appreciated.
def readFile(filename):
f = open(filename)
size = 0
lines = 0
buf = f.read()
while buf!="":
buf = f.read()
f.close()
return (size, lines)
So the buf variable contains a chunk of data.
Since you are still learning I will use a very basic approach:
nl_count = 0 # Number of new line characters
tot_count = 0 # Total number of characters
for character in buf:
if character == '\n':
nl_count += 1
tot_count += 1
Now you will have to adjust this to fit your code, but it should give you something to start with.
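For reference, here is a minimal sketch of how that counting could be folded back into the original template (it assumes the same readFile signature; note that read() with no argument returns the whole file at once, so a chunk size is passed to keep memory use bounded, and in text mode len() counts characters rather than raw bytes):

def readFile(filename):
    f = open(filename)
    size = 0
    lines = 0
    buf = f.read(4096)            # read the file in 4 kB chunks
    while buf != "":
        size += len(buf)          # add this chunk's length to the total size
        lines += buf.count('\n')  # count the newlines in this chunk
        buf = f.read(4096)
    f.close()
    return (size, lines)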
You could read all the lines in at once and use the list rather than the file itself. For example,
def readFile(filename="test.txt"):
f = open(filename)
# Read all lines in at once
buf = f.readlines()
f.close()
# Each element of the list will be a new line
lines = len(buf)
    # The total size is the sum of the lengths of all lines
size = sum([len(i) for i in buf])
return (size, lines)
I have a file which I want to split into many other parts. I want to use Python code...
E.g. the data in my file is like this:
>2165320 21411 200802 8894-,...,765644-
TTCGGAGCTTACTAATTTTAAATATGAAGAATGCCAATATAAGTTTTGATTTCGAAAATACTTTTTTACTAGTTAAAAATTCATGATTTTCTACATCTATAACAATTTGTGTTTTTTTTAAACATCTTCCAGTGTCCTAAGTGTATATTTTTTAACGCAATGTTTGAATACTTTTAGGGTTTACCTTATTTAATTTGATTTTTAATGTGAGTTGTAATCACTGGTGAGCATACTGTTTTTCTTTTGTTCAGTAATATTGCATTTGTAGCTTTTGTATTGCTTAGATATATCACATTAAATCCTTTGTTCAGAAACCCATCCGACAGGGAGTCATAGGTGCCACACTAGTGGTCGAGGATCTAGGATGTCGGAAGGTCAACAATGGGGTAAAACACTAATTTTTTAATTTCTTGTATTTACCAAATTTACTGATTTTGCATTTAGTAGATGGTATATATACTCTTCTACCTTGTACAGTTGATGGTACCTGACTAAATATGTTTTATTTCCTTCTCCAGGATCTTTATGTAGTACGATTCTACAGTCGTCAAGAGGAGGGTAGAAAAGGAGAAGTAAGTTATAATATTTCTGAGCTTTTTTCTTTTTAATTGTTGTTGATAGAAAGTTGTGCCATATACATGTTTTAAGGTGGTGTA
>2165799 14641 135356 16580+,...,680341-
AAGGTAGGAGGTACTCGTGCTAATGGAGGAGCTAATGGTACACCAAACCGACGGCTGTCACTTAATGCTCATCAAAACGGAAGCAGGTCCACAACAAAAGATGGAAAAAAAGACATCAGACCAGTTGCTCCTGTGAATTATGTGGCCATATCAAAAGAAGATGCTGCTTCCCATGTTTCTGGTACCGAACCAATCCCGGCATCACCCTAATAATGAGATCTTCATTATCAACCCTACAATTTCATCTTTGTAGCATGATCAAATACTAGTTACTGCTTTAGGAATTATAATATGGAGTGACAAGTAATTAGAGAGGAACTGTTTTGAGCTGTGTATGTTCAATTTGCCATTTGGAGGTTTTCTCAATACATGTGCCCTTTAATATGAAAATATAGTGCTATTCTTGCCTTTCTCCAAACCCTGGCTCCTCCTATTCATCGGTTTCTT
>2169677 23891 1928391 1298391,…..,739483-
CTAGCTGATCGAGCTGATCGTAGTGAGCTATCGAGCTGACTACTAGCTAGTCGTGATAGCTGATCGAGCTGACTGATGTGCTAGTAGTAGTTTCATGATTTTCTACATCTATAACAATTTGTGTTTTTTTTAAACATCTTCCAGTGTCCTAAGTGTATATTTTTTAACGCAATGTTTGAATACTTTTAGGGTTTACCTTATTTAATTTGATTTTTAATGTGAGTTGTAATCACTGGTGAGCATACTGTTTTTCTTTTGTTCAGTAATATTGCATTTGTAGCTTTTGTATTGCTTAGATATATCACATTAAATCCTTTGTTCAGAAACCCATCCGACAGGGAGTCATAGGTGCCACACTAGTGGTCGAGGATCTAGGATGTCGGAAGGTCAACAATGGGGTAAAACACTAATTTTTTAATTTCTTGTATTTACCAAATTTACTGATTTTGCATTTAGTAGATGGTATATATACTCTTCTACCTTGTACAGTTGATGGTACCTGACTAAATATGTTTTATTTCCTTCTCCAGGATCTTTATGTAGTACGATTCTACAGTCGTCAAGAGGAGGGTAGAAAAGGAGAAGTAAGTTATAATATTTCTGAGCTTTTTTCTTTTTAATTGTTGTTGATAGAAAGTTGTGCCATATACATGTTTTA
And so on.
So now I want to split the file from one '>' sign to the next and store each part in a separate file.
Like 1st file will have
>2165320 21411 200802 8894-,...,765644-
TTCG…..GTA
data.
2nd file will have
>2165799 14641 135356 16580+,...,680341-
AAGG….GTTTCTT
data and so on.
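For the record-per-'>' format shown above, a minimal sketch (assuming the input is in source.txt and every record starts with a line beginning with '>'; the output naming is just a placeholder):

counter = 0
out = None
with open("source.txt") as f:
    for line in f:
        if line.startswith(">"):   # a new record starts here
            if out is not None:
                out.close()
            counter += 1
            out = open("record_%03d.txt" % counter, "w")
        if out is not None:        # ignore anything before the first '>'
            out.write(line)
if out is not None:
    out.close()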
To write each blank-line-separated group of lines to a separate file you could use itertools.groupby():
#!/usr/bin/env python
import sys
from itertools import groupby
def blank(line, mark=[0]):
if not line.strip(): # blank line
mark[0] ^= 1 # mark the start of new group
return mark[0]
for i, (_, group) in enumerate(groupby(sys.stdin, blank), start=1):
with open("group-%03d.txt" % (i,), "w") as outfile:
outfile.writelines(group)
Usage:
$ python split-on-blank.py < input_file.txt
If you work with such formats often, consider using a proper parser such as the Bio.SeqIO.parse() function from biopython.
It seems your data is just newline separated, so all you would need to do is loop over the lines and write the non-blank ones to incrementing files:
with open("source.txt") as f:
counter = 1
for line in f:
if not line.strip():
continue
with open("out_%03d.txt" % counter, 'w') as out:
out.write(line)
counter += 1
This will assume that each group is really one long line (the real format wasn't clear to me).
Because you haven't given us much explanation about the real format of this file, here is another option in case those really are newline characters between lines that should be in the same file. If "#" is a solid indicator of a new group, we can just use it to indicate a new file:
with open("source.txt") as f:
counter = 1
out = None
for line in f:
if line.lstrip().startswith("#"):
if out is not None:
out.close()
out_name = "out_%03d.txt" % counter
counter += 1
out = open(out_name, 'w')
out.write(line)
if out is not None:
out.close()
with open("source.txt") as f:
counter = 1
for line in f:
if counter % 3 == 0:
continue
with open("out_%03d.txt" % counter, 'w') as out:
out.write(line)
counter += 1
I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise I wanted to write a program in python to do it.
What I think I want the program to do is to find the size of a file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up-to the next line-break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.
Can you help me with the key filesystem related parts: filesize, reading and writing in chunks and reading to a line-break?
I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)
Linux has a split command:
split -l 100000 file.txt
would split into files of equal 100,000 line size
Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
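A minimal sketch of that idea (assuming an input file named input.txt and roughly three output parts; readlines(sizehint) returns whole lines totalling roughly sizehint bytes):

import os

infile = 'input.txt'
part_size = os.stat(infile).st_size // 3 + 1   # target size of each part in bytes

with open(infile) as src:
    part = 0
    while True:
        chunk = src.readlines(part_size)       # whole lines, roughly part_size bytes
        if not chunk:
            break
        part += 1
        with open('%s.%03d' % (infile, part), 'w') as dst:
            dst.writelines(chunk)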
Now, there is a pypi module available that you can use to split files of any size into chunks. Check this out
https://pypi.org/project/filesplit/
As an alternative method, using the logging library:
>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt",
maxBytes=2**20*100, backupCount=100)
# 100 MB each, up to a maximum of 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> f = open("D://biglog.txt")
>>> for line in f:
...     log.info(line.strip())
Your files will appear as follows:
filename.txt (end of file)
filename.txt.1
filename.txt.2
...
filename.txt.10 (start of file)
This is a quick and easy way to make a huge log file match your RotatingFileHandler implementation.
This generator method is a (slow) way to get a slice of lines without blowing up your memory.
import itertools
def slicefile(filename, start, end):
lines = open(filename)
return itertools.islice(lines, start, end)
out = open("/blah.txt", "w")
for line in slicefile("/python27/readme.txt", 10, 15):
out.write(line)
don't forget seek() and mmap() for random access to files.
import mmap

def getSomeChunk(filename, start, len):
fobj = open(filename, 'r+b')
m = mmap.mmap(fobj.fileno(), 0)
return m[start:start+len]
While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:
def splitfile(infilepath, chunksize):
fname, ext = infilepath.rsplit('.',1)
i = 0
written = False
with open(infilepath) as infile:
while True:
outfilepath = "{}{}.{}".format(fname, i, ext)
with open(outfilepath, 'w') as outfile:
for line in (infile.readline() for _ in range(chunksize)):
outfile.write(line)
written = bool(line)
if not written:
break
i += 1
You can use wc and split (see the respective manpages) to get the desired effect. In bash:
split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.
produces 3 parts of the same linecount (with a rounding error in the last, of course), named filename-chunk.00 to filename-chunk.02.
I've written the program and it seems to work fine. So thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read, to see if it's any quicker.
def Split(inputFile,numParts,outputName):
fileSize=os.stat(inputFile).st_size
parts=FileSizeParts(fileSize,numParts)
openInputFile = open(inputFile, 'r')
outPart=1
for part in parts:
if openInputFile.tell()<fileSize:
fullOutputName=outputName+os.extsep+str(outPart)
outPart+=1
openOutputFile=open(fullOutputName,'w')
openOutputFile.writelines(openInputFile.readlines(part))
openOutputFile.close()
openInputFile.close()
return outPart-1
usage - split.py filename splitsizeinkb
import os
import sys
def getfilesize(filename):
with open(filename,"rb") as fr:
fr.seek(0,2) # move to end of the file
size=fr.tell()
print("getfilesize: size: %s" % size)
return fr.tell()
def splitfile(filename, splitsize):
# Open original file in read only mode
if not os.path.isfile(filename):
print("No such file as: \"%s\"" % filename)
return
filesize=getfilesize(filename)
with open(filename,"rb") as fr:
counter=1
orginalfilename = filename.split(".")
readlimit = 5000 #read 5kb at a time
n_splits = filesize//splitsize
print("splitfile: No of splits required: %s" % str(n_splits))
for i in range(n_splits+1):
chunks_count = int(splitsize)//int(readlimit)
data_5kb = fr.read(readlimit) # read
# Create split files
print("chunks_count: %d" % chunks_count)
with open(orginalfilename[0]+"_{id}.".format(id=str(counter))+orginalfilename[1],"ab") as fw:
fw.seek(0)
fw.truncate()# truncate original if present
while data_5kb:
fw.write(data_5kb)
if chunks_count:
chunks_count-=1
data_5kb = fr.read(readlimit)
else: break
counter+=1
if __name__ == "__main__":
if len(sys.argv) < 3: print("Filename or splitsize not provided: Usage: filesplit.py filename splitsizeinkb ")
else:
        filesize = int(sys.argv[2]) * 1000 # convert kb into bytes
filename = sys.argv[1]
splitfile(filename, filesize)
Here is a python script you can use for splitting large files using subprocess:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
file_path = sys.argv[1]
# i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
os.path.dirname(file_path) + '/'])
# Remove the original file once done splitting
try:
os.remove(file_path)
except OSError:
pass
You can call it externally:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
You can also import subprocess and run it directly in your program.
The issue with this approach is memory usage: subprocess forks the calling process, and the child can momentarily require as much memory as the parent already uses, so if your process is already memory-heavy this can effectively double the requirement while it runs. The same goes for os.system.
Here is another pure Python way of doing this. Although I haven't tested it on huge files, it's going to be slower but leaner on memory:
CHUNK_SIZE = 5000
def yield_csv_rows(reader, chunk_size):
"""
Opens file to ingest, reads each line to return list of rows
Expects the header is already removed
Replacement for ingest_csv
:param reader: dictReader
:param chunk_size: int, chunk size
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
import unicodecsv

with open(local_file_path, 'rb') as f:
    header = f.readline().strip().replace('"', '')  # read the header row
    reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
chunks = yield_csv_rows(reader, CHUNK_SIZE)
for chunk in chunks:
if not chunk:
break
# Do something with your chunk here
Here is another example using readlines():
"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5
def yield_rows(reader, chunk_size):
"""
Yield row chunks
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
def batch_operation(data):
for item in data:
print(item)
with open('file', 'r') as f:
chunks = yield_rows(f.readlines(), CHUNK_SIZE)
for _chunk in chunks:
batch_operation(_chunk)
The readlines example demonstrates how to chunk your data so you can pass the chunks to a function that expects chunks. Unfortunately readlines reads the whole file into memory, so it's better to use the reader example for performance. Although, if you can easily fit what you need into memory and need to process it in chunks, this should suffice.
You can split any file into chunks like below; here CHUNK_SIZE is 500000 bytes (500 kB) and content holds the contents of the file:
CHUNK_SIZE = 500000  # 500 kB per chunk
def get_chunk(content, size):
    # Yield successive size-byte slices of content
    for i in range(0, len(content), size):
        yield content[i:i+size]
with open("input_file", "rb") as f:
    content = f.read()   # read the whole file's contents into memory
for idx, val in enumerate(get_chunk(content, CHUNK_SIZE)):
    data = val           # do something with each chunk here
    index = idx
This worked for me
import os
fil = "inputfile"
outfil = "outputfile"
f = open(fil,'r')
numbits = 1000000000
for i in range(0, os.stat(fil).st_size // numbits + 1):
    o = open(outfil + str(i), 'w')
    segment = f.readlines(numbits)   # whole lines, roughly numbits bytes
    o.writelines(segment)            # lines already include their newlines
    o.close()
I had a requirement to split csv files for import into Dynamics CRM since the file size limit for import is 8MB and the files we receive are much larger. This program allows user to input FileNames and LinesPerFile, and then splits the specified files into the requested number of lines. I can't believe how fast it works!
# user input FileNames and LinesPerFile
FileCount = 1
FileNames = []
while True:
FileName = raw_input('File Name ' + str(FileCount) + ' (enter "Done" after last File):')
FileCount = FileCount + 1
if FileName == 'Done':
break
else:
FileNames.append(FileName)
LinesPerFile = raw_input('Lines Per File:')
LinesPerFile = int(LinesPerFile)
for FileName in FileNames:
File = open(FileName)
# get Header row
for Line in File:
Header = Line
break
FileCount = 0
Linecount = 1
for Line in File:
#skip Header in File
if Line == Header:
continue
#create NewFile with Header every [LinesPerFile] Lines
if Linecount % LinesPerFile == 1:
FileCount = FileCount + 1
NewFileName = FileName[:FileName.find('.')] + '-Part' + str(FileCount) + FileName[FileName.find('.'):]
NewFile = open(NewFileName,'w')
NewFile.write(Header)
NewFile.write(Line)
Linecount = Linecount + 1
NewFile.close()
Or, a python version of wc and split:
lines = 0
for l in open(filename): lines += 1
Then some code to read the first lines/3 into one file, the next lines/3 into another, etc.
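A minimal sketch of that second half (assuming filename and the lines count from above, and three parts; the .partN output naming is just a placeholder):

parts = 3
per_part = lines // parts + 1   # lines per output file; the last one may be shorter

with open(filename) as src:
    for part in range(parts):
        with open("%s.part%d" % (filename, part), "w") as dst:
            for _ in range(per_part):
                line = src.readline()
                if not line:     # ran out of input early
                    break
                dst.write(line)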