Split a file in python - Faster way

I need to be able to split a huge file (10GB) into multiple files. The only requirement is that the header from the original file has to be copied into all of the smaller files.
I wrote a program in Python to do this, but it is painstakingly slow. Is there a way to speed it up?
from pathlib import Path
import sys
import os
import string
import glob

directoryToLoadFrom = "c:\\directory\\"
directoryToWriteTo = "C:\\outputDirectory\\"

# Set the last business day
filesToRead = directoryToLoadFrom + "output*.csv"
listNoOfOutputFiles = sorted(glob.glob(filesToRead), key=os.path.getmtime)

# For each file name
splitLen = 100000
for filename in listNoOfOutputFiles:
    print('Currently working on ')
    print(filename)
    entirePath, filenameWithExtension = os.path.split(filename)
    filenameOnly = filenameWithExtension.split(".")[0]    # Just get the filename
    extensionOnly = filenameWithExtension.split(".")[-1]  # Just get the extension
    with open(filename, 'r') as curFileContents:
        header_line = curFileContents.readline()
        filecnt = 1
        while 1:
            curlineCnt = 0
            targetFileName = directoryToWriteTo + filenameOnly + "-" + str(filecnt) + "." + extensionOnly
            print('Writing to ')
            print(targetFileName)
            outputFile = open(targetFileName, "w")
            outputFile.write(header_line)
            for line in curFileContents:
                outputFile.write(line)
                curlineCnt += 1
                if (curlineCnt > splitLen):
                    break
            filecnt += 1
            if (curlineCnt < splitLen):
                outputFile.close()
                break

You can utilize multiprocessing to complete this task more quickly. Divide the logic into smaller chunks and then execute those chunks in separate processes. For example, you can spawn a process for each new file you want to create. Read more about multiprocessing in the Python documentation.
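A minimal sketch of that idea, with one worker per input file (split_one_file is a hypothetical stand-in for the per-file header-plus-chunks logic from the question, and the directory path is just a placeholder):

import glob
import os
from multiprocessing import Pool

def split_one_file(filename):
    # ... the header + chunked-writing logic from the question goes here ...
    return filename

if __name__ == "__main__":
    files = sorted(glob.glob("c:\\directory\\output*.csv"), key=os.path.getmtime)
    # One worker per CPU by default; each call handles one whole input file.
    with Pool() as pool:
        for done in pool.imap_unordered(split_one_file, files):
            print("Finished", done)

Keep in mind that for a single 10GB file the disk is usually the bottleneck, so the gain from extra processes may be modest.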

Related

python script to merge more than 200 very large csv files into just one

I have been trying to merge several .csv files from different subfolders (all with the same name) into one. I tried with R but I ran out of memory to carry out the process (it should merge more than 20 million rows). I am now working with a Python script to try to get it done (see below). The files have many columns too, so I don't need all of them, but I also don't know whether I can choose which columns to add to the new csv:
import glob
import csv
import os

path = 'C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders'
result = glob.glob('*/certificates.csv')
#for i in result:
#    full_path = "C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders\\" + result
#    print(full_path)
os.chdir(path)
i = 0
for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        print(name)
        try:
            i += 1
            if i % 10000 == 0:
                # just to see the progress
                print(i)
            if name == 'certificates.csv':
                creader = csv.reader(open(name))
                cwriter = csv.writer(open('processed_' + name, 'w'))
                for cline in creader:
                    new_line = [val for col, val in enumerate(cline)]
                    cwriter.writerow(new_line)
        except:
            print('problem with file: ' + name)
            pass
but it doesn't work, and it doesn't return any error either, so at the moment I am completely stuck.
Your indentation is wrong, and you are overwriting the output file for each new input file. Also, you are not using the glob result for anything. If the files you want to read are all immediately in subdirectories of path, you can do away with the os.walk() call and do the glob after you os.chdir().
import glob
import csv
import os

# No real need to have a variable for this really
path = 'C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders'
os.chdir(path)

# Obviously, can't use input file name in output file name
# because there is only one output file for many input files
with open('processed.csv', 'w') as dest:
    cwriter = csv.writer(dest)
    for i, name in enumerate(glob.glob('*/certificates.csv'), 1):
        if i % 10000 == 0:
            # just to see the progress
            print(i)
        try:
            with open(name) as csvin:
                creader = csv.reader(csvin)
                for cline in creader:
                    # no need to enumerate fields
                    cwriter.writerow(cline)
        except:
            print('problem with file: ' + name)
You probably just need to keep a merged.csv file open whilst reading in each of the certificates.csv files. glob.glob() can be used to recursively find all suitable files:
import glob
import csv
import os

path = r'C:\path\to\folder\where\all\files\are-allowated-in-subfolders'
os.chdir(path)

with open('merged.csv', 'w', newline='') as f_merged:
    csv_merged = csv.writer(f_merged)

    # '**' together with recursive=True searches all subfolders
    for filename in glob.glob(os.path.join(path, '**/certificates.csv'), recursive=True):
        print(filename)

        try:
            with open(filename) as f_csv:
                csv_merged.writerows(csv.reader(f_csv))
        except:
            print('problem with file: ', filename)
An r prefix can be added to your path to avoid needing to escape each backslash. Also newline='' should be added to the open() when using a csv.writer() to stop extra blank lines being written to your output file.

Downloading Data From .txt file containing URLs with Python again

I am currently trying to extract the raw data from a .txt file of 10 URLs and write the raw data from each line (URL) of the .txt file to its own file. Then I want to repeat the process with the processed data (the raw data from the same original .txt file stripped of the HTML), using Python.
import commands
import os
import json

# RAW DATA
input = open('uri.txt', 'r')
t_1 = open('command', 'w')
counter_1 = 0

for line in input:
    counter_1 += 1
    if counter_1 < 11:
        filename = str(counter_1)
        print str(line)
        command = 'curl ' + '"' + str(line).rstrip('\n') + '"' + '> ./rawData/' + filename
        output_1 = commands.getoutput(command)
input.close()

# PROCESSED DATA
counter_2 = 0
input = open('uri.txt', 'r')
t_2 = open('command', 'w')

for line in input:
    counter_2 += 1
    if counter_2 < 11:
        filename = str(counter_2) + '-processed'
        command = 'lynx -dump -force_html ' + '"' + str(line).rstrip('\n') + '"' + '> ./processedData/' + filename
        print command
        output_2 = commands.getoutput(command)
input.close()
I am attempting to do all of this with one script. Can anyone help me refine my code so I can run it? It should loop through the code completely once for each line in the .txt file. For example, I should have 1 raw and 1 processed .txt file for every URL line in my .txt file.
Break your code up into functions. Currently the code is hard to read and debug. Make a function called get_raw() and a function called get_processed(). Then for your main loop, you can do
for line in file:
    get_raw(line)
    get_processed(line)
Or something similar. Also, you should avoid using 'magic numbers' like counter < 11. Why is it 11? Is it the number of lines in the file? If it is, you can get the number of lines with len() on the list returned by readlines().
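A rough sketch of that structure, swapping the deprecated commands module for subprocess (the ./rawData/ and ./processedData/ directories are taken from the question and must already exist, and curl/lynx must be installed):

import subprocess

def get_raw(url, index):
    # save the raw response of `url` to ./rawData/<index>
    with open("./rawData/%d" % index, "w") as out:
        subprocess.call(["curl", url], stdout=out)

def get_processed(url, index):
    # save the lynx text dump of `url` to ./processedData/<index>-processed
    with open("./processedData/%d-processed" % index, "w") as out:
        subprocess.call(["lynx", "-dump", "-force_html", url], stdout=out)

with open("uri.txt") as urls:
    for index, line in enumerate(urls, 1):
        url = line.strip()
        if url:
            get_raw(url, index)
            get_processed(url, index)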

How to run a code for multiple fastq files?

I would like to run the following code on multiple fastq files in a folder. In the folder I have different fastq files; first I have to read one file, perform the required operations, and store the results in a separate .fastq file; then read the second file, perform the same operations, and save the results in a new second .fastq file. Repeat the same procedure for all the files in the folder.
How can I do this? Can someone suggest a way to do it?
from Bio.SeqIO.QualityIO import FastqGeneralIterator

fout = open("prova_FiltraN_CE_filt.fastq", "w")
fin = open("prova_FiltraN_CE.fastq", "rU")
maxN = 0
countall = 0
countincl = 0

with open("prova_FiltraN_CE.fastq", "rU") as handle:
    for (title, sequence, quality) in FastqGeneralIterator(handle):
        countN = sequence.count("N", 0, len(sequence))
        countall += 1
        if countN == maxN:
            fout.write("#%s\n%s\n+\n%s\n" % (title, sequence, quality))
            countincl += 1

fin.close
fout.close
print countall, countincl
I think the following will do what you want. What I did was make your code into a function (and modified it to be what I think is more correct) and then called that function for every .fastq file found in the designated folder. The output file names are generated from the input files found.
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import glob
import os

def process(in_filepath, out_filepath):
    maxN = 0
    countall = 0
    countincl = 0
    with open(in_filepath, "rU") as fin:
        with open(out_filepath, "w") as fout:
            for (title, sequence, quality) in FastqGeneralIterator(fin):
                countN = sequence.count("N", 0, len(sequence))
                countall += 1
                if countN == maxN:
                    fout.write("#%s\n%s\n+\n%s\n" % (title, sequence, quality))
                    countincl += 1
    print os.path.split(in_filepath)[1], countall, countincl

folder = "/path/to/folder"  # folder to process

for in_filepath in glob.glob(os.path.join(folder, "*.fastq")):
    root, ext = os.path.splitext(in_filepath)
    if not root.endswith("_filt"):  # avoid processing existing output files
        out_filepath = root + "_filt" + ext
        process(in_filepath, out_filepath)

How do I split a huge text file in python

I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise I wanted to write a program in python to do it.
What I think I want the program to do is to find the size of a file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up to the next line-break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.
Can you help me with the key filesystem related parts: filesize, reading and writing in chunks and reading to a line-break?
I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)
Linux has a split command:
split -l 100000 file.txt
would split file.txt into files of 100,000 lines each.
Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
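For instance, a minimal sketch along those lines (the file name and part count are placeholders): os.stat() gives the total size, and readlines(sizehint) reads roughly sizehint bytes worth of complete lines, so each part ends on a line break.

import os

filename = "input.txt"
part_bytes = os.stat(filename).st_size // 3 + 1   # aim for ~3 parts

with open(filename) as src:
    part = 0
    while True:
        lines = src.readlines(part_bytes)   # whole lines, ~part_bytes per call
        if not lines:
            break
        with open("%s.%03d" % (filename, part), "w") as dst:
            dst.writelines(lines)
        part += 1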
Now, there is a pypi module available that you can use to split files of any size into chunks. Check this out
https://pypi.org/project/filesplit/
As an alternative method, using the logging library:
>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt",
...      maxBytes=2**20*100, backupCount=100)
# 100 MB each, up to a maximum of 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> f = open("D://biglog.txt")
>>> while True:
...     log.info(f.readline().strip())
Your files will appear as follows:
filename.txt (end of file)
filename.txt.1
filename.txt.2
...
filename.txt.10 (start of file)
This is a quick and easy way to make a huge log file match your RotatingFileHandler implementation.
This generator method is a (slow) way to get a slice of lines without blowing up your memory.
import itertools

def slicefile(filename, start, end):
    lines = open(filename)
    return itertools.islice(lines, start, end)

out = open("/blah.txt", "w")
for line in slicefile("/python27/readme.txt", 10, 15):
    out.write(line)
don't forget seek() and mmap() for random access to files.
import mmap

def getSomeChunk(filename, start, len):
    fobj = open(filename, 'r+b')
    m = mmap.mmap(fobj.fileno(), 0)
    return m[start:start+len]
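And a matching seek()-based sketch for when you only need a plain byte range (the file name and offsets are placeholders):

def read_range(filename, start, length):
    with open(filename, 'rb') as fobj:
        fobj.seek(start)             # jump straight to the byte offset
        return fobj.read(length)     # read at most `length` bytes from there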
While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:
def splitfile(infilepath, chunksize):
    fname, ext = infilepath.rsplit('.', 1)
    i = 0
    written = False
    with open(infilepath) as infile:
        while True:
            outfilepath = "{}{}.{}".format(fname, i, ext)
            with open(outfilepath, 'w') as outfile:
                for line in (infile.readline() for _ in range(chunksize)):
                    outfile.write(line)
                    written = bool(line)
            if not written:
                break
            i += 1
You can use wc and split (see the respective manpages) to get the desired effect. In bash:
split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.
produces 3 parts of the same linecount (with a rounding error in the last, of course), named filename-chunk.00 to filename-chunk.02.
I've written the program and it seems to work fine. So thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read to see if it's any quicker.
def Split(inputFile, numParts, outputName):
    fileSize = os.stat(inputFile).st_size
    parts = FileSizeParts(fileSize, numParts)
    openInputFile = open(inputFile, 'r')
    outPart = 1
    for part in parts:
        if openInputFile.tell() < fileSize:
            fullOutputName = outputName + os.extsep + str(outPart)
            outPart += 1
            openOutputFile = open(fullOutputName, 'w')
            openOutputFile.writelines(openInputFile.readlines(part))
            openOutputFile.close()
    openInputFile.close()
    return outPart - 1
usage - split.py filename splitsizeinkb
import os
import sys

def getfilesize(filename):
    with open(filename, "rb") as fr:
        fr.seek(0, 2)  # move to end of the file
        size = fr.tell()
        print("getfilesize: size: %s" % size)
        return fr.tell()

def splitfile(filename, splitsize):
    # Open original file in read only mode
    if not os.path.isfile(filename):
        print("No such file as: \"%s\"" % filename)
        return

    filesize = getfilesize(filename)
    with open(filename, "rb") as fr:
        counter = 1
        orginalfilename = filename.split(".")
        readlimit = 5000  # read 5kb at a time
        n_splits = filesize // splitsize
        print("splitfile: No of splits required: %s" % str(n_splits))
        for i in range(n_splits + 1):
            chunks_count = int(splitsize) // int(readlimit)
            data_5kb = fr.read(readlimit)  # read
            # Create split files
            print("chunks_count: %d" % chunks_count)
            with open(orginalfilename[0] + "_{id}.".format(id=str(counter)) + orginalfilename[1], "ab") as fw:
                fw.seek(0)
                fw.truncate()  # truncate original if present
                while data_5kb:
                    fw.write(data_5kb)
                    if chunks_count:
                        chunks_count -= 1
                        data_5kb = fr.read(readlimit)
                    else:
                        break
            counter += 1

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Filename or splitsize not provided: Usage: filesplit.py filename splitsizeinkb ")
    else:
        filesize = int(sys.argv[2]) * 1000  # make into kb
        filename = sys.argv[1]
        splitfile(filename, filesize)
Here is a python script you can use for splitting large files using subprocess:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
file_path = sys.argv[1]
# i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
os.path.dirname(file_path) + '/'])
# Remove the original file once done splitting
try:
os.remove(file_path)
except OSError:
pass
You can call it externally:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
You can also import subprocess and run it directly in your program.
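For instance, a small sketch of calling split directly with subprocess.run (Python 3.5+); local_file_path is a placeholder for the same variable used above:

import os
import subprocess

local_file_path = "t/some_file.txt"   # placeholder path

subprocess.run(
    ["split", "-a", "2", "-l", "5000", local_file_path,
     os.path.dirname(local_file_path) + "/"],
    check=True,   # raise CalledProcessError if split exits non-zero
)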
The issue with this approach is high memory usage: subprocess creates a fork with a memory footprint the same size as your process, and if your process memory is already heavy, it doubles it for the time that it runs. The same applies to os.system.
Here is another pure Python way of doing it. I haven't tested it on huge files; it's going to be slower, but leaner on memory:
import unicodecsv

CHUNK_SIZE = 5000

def yield_csv_rows(reader, chunk_size):
    """
    Opens file to ingest, reads each line to return list of rows
    Expects the header is already removed
    Replacement for ingest_csv
    :param reader: dictReader
    :param chunk_size: int, chunk size
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

with open(local_file_path, 'rb') as f:
    # strip the header line and keep it for the DictReader field names
    header = f.readline().strip().replace('"', '')
    reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
    chunks = yield_csv_rows(reader, CHUNK_SIZE)
    for chunk in chunks:
        if not chunk:
            break
        # Do something with your chunk here
Here is another example using readlines():
"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5
def yield_rows(reader, chunk_size):
"""
Yield row chunks
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
def batch_operation(data):
for item in data:
print(item)
with open('file', 'r') as f:
chunks = yield_rows(f.readlines(), CHUNK_SIZE)
for _chunk in chunks:
batch_operation(_chunk)
The readlines example demonstrates how to chunk your data so the chunks can be passed to a function that expects chunks. Unfortunately, readlines loads the whole file into memory, so it's better to use the reader example for performance. Although if you can easily fit what you need into memory and need to process it in chunks, this should suffice.
You can split any file into chunks like below, where CHUNK_SIZE is 500000 bytes (500 kb) and content can be the contents of any file:
CHUNK_SIZE = 500000

def get_chunk(content, size):
    for i in range(0, len(content), size):
        yield content[i:i + size]

for idx, val in enumerate(get_chunk(content, CHUNK_SIZE)):
    data = val
    index = idx
This worked for me
import os

fil = "inputfile"
outfil = "outputfile"
f = open(fil, 'r')
numbits = 1000000000

for i in range(0, os.stat(fil).st_size // numbits + 1):
    o = open(outfil + str(i), 'w')
    segment = f.readlines(numbits)
    for c in range(0, len(segment)):
        o.write(segment[c])  # lines from readlines() already end with "\n"
    o.close()
I had a requirement to split csv files for import into Dynamics CRM, since the file size limit for import is 8MB and the files we receive are much larger. This program allows the user to input FileNames and LinesPerFile, and then splits the specified files into the requested number of lines. I can't believe how fast it works!
# user input FileNames and LinesPerFile
FileCount = 1
FileNames = []
while True:
    FileName = raw_input('File Name ' + str(FileCount) + ' (enter "Done" after last File):')
    FileCount = FileCount + 1
    if FileName == 'Done':
        break
    else:
        FileNames.append(FileName)
LinesPerFile = raw_input('Lines Per File:')
LinesPerFile = int(LinesPerFile)

for FileName in FileNames:
    File = open(FileName)

    # get Header row
    for Line in File:
        Header = Line
        break

    FileCount = 0
    Linecount = 1
    for Line in File:
        # skip Header in File
        if Line == Header:
            continue

        # create NewFile with Header every [LinesPerFile] Lines
        if Linecount % LinesPerFile == 1:
            FileCount = FileCount + 1
            NewFileName = FileName[:FileName.find('.')] + '-Part' + str(FileCount) + FileName[FileName.find('.'):]
            NewFile = open(NewFileName, 'w')
            NewFile.write(Header)

        NewFile.write(Line)
        Linecount = Linecount + 1

    NewFile.close()
import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)
For example, if you want 50,000 lines per file and the path is /home/data, then you can run the command below:
subprocess.run('split -l 50000 /home/data', shell = True)
If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook/shell you can check the total number of lines using the command below and divide by the number of splits you want:
! wc -l file_path
in this case:
! wc -l /home/data
Just so you know, the output files will not have a file extension even though the input file does; you can add it manually, e.g. on Windows.
Or, a python version of wc and split:
lines = 0
for l in open(filename): lines += 1
Then some code to read the first lines/3 into one file, the next lines/3 into another, etc.
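A rough sketch of that second step, for 3 parts and with a placeholder file name:

filename = "input.txt"   # placeholder

lines = 0
for l in open(filename):
    lines += 1

per_part = lines // 3 + 1
with open(filename) as src:
    for part in range(3):
        with open("%s.%d" % (filename, part), "w") as dst:
            for _ in range(per_part):
                line = src.readline()
                if not line:
                    break
                dst.write(line)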

How can I split a file in python?

Is it possible to split a file? For example, say you have a huge wordlist: I want to split it so that it becomes more than one file. How is this possible?
This one splits a file up by newlines and writes it back out. You can change the delimiter easily. This can also handle uneven amounts as well, if you don't have a multiple of splitLen lines (20 in this example) in your input file.
splitLen = 20         # 20 lines per file
outputBase = 'output' # output.1.txt, output.2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
input = open('input.txt', 'r').read().split('\n')

at = 1
for lines in range(0, len(input), splitLen):
    # First, get the list slice
    outputData = input[lines:lines+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1
A better loop for sli's example, not hogging memory:
splitLen = 20         # 20 lines per file
outputBase = 'output' # output.1.txt, output.2.txt, etc.

input = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in input:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1
Solution to split binary files into chapters .000, .001, etc.:
FILE = 'scons-conversion.7z'

MAX = 500*1024*1024       # 500Mb - max chapter size
BUF = 50*1024*1024*1024   # 50GB  - memory buffer size

chapters = 0
uglybuf = ''
with open(FILE, 'rb') as src:
    while True:
        tgt = open(FILE + '.%03d' % chapters, 'wb')
        written = 0
        while written < MAX:
            if len(uglybuf) > 0:
                tgt.write(uglybuf)
            tgt.write(src.read(min(BUF, MAX - written)))
            written += min(BUF, MAX - written)
            uglybuf = src.read(1)
            if len(uglybuf) == 0:
                break
        tgt.close()
        if len(uglybuf) == 0:
            break
        chapters += 1
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the number of parts created.
    """
    with open(file, 'r+b') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'w+b') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += buffer
                    else:
                        return suffix
            suffix += 1


def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'w+b') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'r+b') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break
Sure it's possible:
open input file
open output file 1
count = 0
for each line in file:
    write to output file
    count = count + 1
    if count > maxlines:
        close output file
        open next output file
        count = 0
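That pseudocode translated into a minimal Python sketch (the file names and maxlines are placeholders):

maxlines = 100000
part = 1
count = 0
out = open("wordlist.part%d" % part, "w")
with open("wordlist.txt") as src:
    for line in src:
        out.write(line)
        count += 1
        if count > maxlines:
            out.close()
            part += 1
            out = open("wordlist.part%d" % part, "w")
            count = 0
out.close()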
import re

PATENTS = 'patent.data'

def split_file(filename):
    # Open file to read
    with open(filename, "r") as r:

        # Counter
        n = 0

        # Start reading file line by line
        for i, line in enumerate(r):

            # If the line matches the template -- <?xml -- increase counter n
            if re.match(r'\<\?xml', line):
                n += 1

                # This "if" can be deleted; without it naming will start from 1.
                # Or you can keep it. It depends on where "re" finds the
                # template for the first time. In my case it was the first line.
                if i == 0:
                    n = 0

            # Write lines to file
            with open("{}-{}".format(PATENTS, n), "a") as f:
                f.write(line)

split_file(PATENTS)
As a result you will get:
patent.data-0
patent.data-1
patent.data-N
You can use this pypi filesplit module.
This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.
Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines each (file_part1.txt, ..., file_partn.txt), you could do:
import itertools

MAXLINES = 1000  # lines per output file

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)
All the provided answers are good and (probably) work. However, they need to load the file into memory (in whole or in part). We know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).
I found the following is the most efficient way to do it:
import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    print(os.system(f"ls {PREFIX}??"))
else:
    print("Failed!")
Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
