In Python, given the name of a file, how can I write a loop that reads one character each time through the loop?
with open(filename) as f:
while True:
c = f.read(1)
if not c:
print("End of file")
break
print("Read a character:", c)
First, open a file:
with open("filename") as fileobj:
for line in fileobj:
for ch in line:
print(ch)
This goes through every line in the file and then every character in that line.
I like the accepted answer: it is straightforward and will get the job done. I would also like to offer an alternative implementation:
def chunks(filename, buffer_size=4096):
"""Reads `filename` in chunks of `buffer_size` bytes and yields each chunk
until no more characters can be read; the last chunk will most likely have
less than `buffer_size` bytes.
:param str filename: Path to the file
:param int buffer_size: Buffer size, in bytes (default is 4096)
:return: Yields chunks of `buffer_size` size until exhausting the file
:rtype: str
"""
with open(filename, "rb") as fp:
chunk = fp.read(buffer_size)
while chunk:
yield chunk
chunk = fp.read(buffer_size)
def chars(filename, buffersize=4096):
"""Yields the contents of file `filename` character-by-character. Warning:
will only work for encodings where one character is encoded as one byte.
:param str filename: Path to the file
:param int buffer_size: Buffer size for the underlying chunks,
in bytes (default is 4096)
:return: Yields the contents of `filename` character-by-character.
:rtype: char
"""
for chunk in chunks(filename, buffersize):
for char in chunk:
yield char
def main(buffersize, filenames):
"""Reads several files character by character and redirects their contents
to `/dev/null`.
"""
for filename in filenames:
with open("/dev/null", "wb") as fp:
for char in chars(filename, buffersize):
fp.write(char)
if __name__ == "__main__":
# Try reading several files varying the buffer size
import sys
buffersize = int(sys.argv[1])
filenames = sys.argv[2:]
sys.exit(main(buffersize, filenames))
The code I suggest is essentially the same idea as your accepted answer: read a given number of bytes from the file. The difference is that it first reads a good chunk of data (4006 is a good default for X86, but you may want to try 1024, or 8192; any multiple of your page size), and then it yields the characters in that chunk one by one.
The code I present may be faster for larger files. Take, for example, the entire text of War and Peace, by Tolstoy. These are my timing results (Mac Book Pro using OS X 10.7.4; so.py is the name I gave to the code I pasted):
$ time python so.py 1 2600.txt.utf-8
python so.py 1 2600.txt.utf-8 3.79s user 0.01s system 99% cpu 3.808 total
$ time python so.py 4096 2600.txt.utf-8
python so.py 4096 2600.txt.utf-8 1.31s user 0.01s system 99% cpu 1.318 total
Now: do not take the buffer size at 4096 as a universal truth; look at the results I get for different sizes (buffer size (bytes) vs wall time (sec)):
2 2.726
4 1.948
8 1.693
16 1.534
32 1.525
64 1.398
128 1.432
256 1.377
512 1.347
1024 1.442
2048 1.316
4096 1.318
As you can see, you can start seeing gains earlier on (and my timings are likely very inaccurate); the buffer size is a trade-off between performance and memory. The default of 4096 is just a reasonable choice but, as always, measure first.
Just:
myfile = open(filename)
onecharacter = myfile.read(1)
Python itself can help you with this, in interactive mode:
>>> help(file.read)
Help on method_descriptor:
read(...)
read([size]) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
I learned a new idiom for this today while watching Raymond Hettinger's Transforming Code into Beautiful, Idiomatic Python:
import functools
with open(filename) as f:
f_read_ch = functools.partial(f.read, 1)
for ch in iter(f_read_ch, ''):
print 'Read a character:', repr(ch)
Just read a single character
f.read(1)
This will also work:
with open("filename") as fileObj:
for line in fileObj:
for ch in line:
print(ch)
It goes through every line in the the file and every character in every line.
(Note that this post now looks extremely similar to a highly upvoted answer, but this was not the case at the time of writing.)
Best answer for Python 3.8+:
with open(path, encoding="utf-8") as f:
while c := f.read(1):
do_my_thing(c)
You may want to specify utf-8 and avoid the platform encoding. I've chosen to do that here.
Function – Python 3.8+:
def stream_file_chars(path: str):
with open(path) as f:
while c := f.read(1):
yield c
Function – Python<=3.7:
def stream_file_chars(path: str):
with open(path, encoding="utf-8") as f:
while True:
c = f.read(1)
if c == "":
break
yield c
Function – pathlib + documentation:
from pathlib import Path
from typing import Union, Generator
def stream_file_chars(path: Union[str, Path]) -> Generator[str, None, None]:
"""Streams characters from a file."""
with Path(path).open(encoding="utf-8") as f:
while (c := f.read(1)) != "":
yield c
You should try f.read(1), which is definitely correct and the right thing to do.
f = open('hi.txt', 'w')
f.write('0123456789abcdef')
f.close()
f = open('hej.txt', 'r')
f.seek(12)
print f.read(1) # This will read just "c"
To make a supplement,
if you are reading file that contains a line that is vvvvery huge, which might break your memory, you might consider read them into a buffer then yield the each char
def read_char(inputfile, buffersize=10240):
with open(inputfile, 'r') as f:
while True:
buf = f.read(buffersize)
if not buf:
break
for char in buf:
yield char
yield '' #handle the scene that the file is empty
if __name__ == "__main__":
for word in read_char('./very_large_file.txt'):
process(char)
os.system("stty -icanon -echo")
while True:
raw_c = sys.stdin.buffer.peek()
c = sys.stdin.read(1)
print(f"Char: {c}")
Combining qualities of some other answers, here is something that is invulnerable to long files / lines, while being more succinct and faster:
import functools as ft, itertools as it
with open(path) as f:
for c in it.chain.from_iterable(
iter(ft.partial(f.read, 4096), '')
):
print(c)
#reading out the file at once in a list and then printing one-by-one
f=open('file.txt')
for i in list(f.read()):
print(i)
Related
I need to read chunks of 64KB in loop, and process them, but stop at the end of file minus 16 bytes: the last 16 bytes are a tag metadata.
The file might be super large, so I can't read it all in RAM.
All the solutions I find are a bit clumsy and/or unpythonic.
with open('myfile', 'rb') as f:
while True:
block = f.read(65536)
if not block:
break
process_block(block)
If 16 <= len(block) < 65536, it's easy: it's the last block ever. So useful_data = block[:-16] and tag = block[-16:]
If len(block) == 65536, it could mean three things: that the full block is useful data. Or that this 64KB block is in fact the last block, so useful_data = block[:-16] and tag = block[-16:]. Or that this 64KB block is followed by another block of only a few bytes (let's say 3 bytes), so in this case: useful_data = block[:-13] and tag = block[-13:] + last_block[:3].
How to deal with this problem in a nicer way than distinguishing all these cases?
Note:
the solution should work for a file opened with open(...), but also for a io.BytesIO() object, or for a distant SFTP opened file (with pysftp).
I was thinking about getting the file object size, with
f.seek(0,2)
length = f.tell()
f.seek(0)
Then after each
block = f.read(65536)
we can know if we are far from the end with length - f.tell(), but again the full solution does not look very elegant.
you can just read in every iteration min(65536, L-f.tell()-16)
Something like this:
from pathlib import Path
L = Path('myfile').stat().st_size
with open('myfile', 'rb') as f:
while True:
to_read_length = min(65536, L-f.tell()-16)
block = f.read(to_read_length)
process_block(block)
if f.tell() == L-16
break
Did not ran this, but hope you get the gist of it.
The following method relies only on the fact that the f.read() method returns an empty bytes object upon end of stream (EOS). It thus could be adopted for sockets simply by replacing f.read() with s.recv().
def read_all_but_last16(f):
rand = random.Random() # just for testing
buf = b''
while True:
bytes_read = f.read(rand.randint(1, 40)) # just for testing
# bytes_read = f.read(65536)
buf += bytes_read
if not bytes_read:
break
process_block(buf[:-16])
buf = buf[-16:]
verify(buf[-16:])
It works by always leaving 16 bytes at the end of buf until EOS, then finally processing the last 16. Note that if there aren't at least 17 bytes in buf then buf[:-16] returns the empty bytes object.
I am trying to convert a file containing more than 1 billion bytes into integers. Obviously, my machine cannot do this at once so I need to chunk my code. I was able to decode the first 50,000,000 bytes but I am wondering how to read the integers in the file that are between 50,000,001 and 100,000,000, 150,000,000 and 200,000,000 etc. The following is what I have now;the range function is not working with this.
import struct
with open(x, "rb") as f:
this_chunk = range(50000001, 100000000)
data = f.read(this_chunk)
ints1 = struct.unpack("I" * (this_chunk //4) , data)
print(ints1)
You can use f.seek(offset) to set the file pointer to start reading from a certain offset.
In your case, you'd want to skip 5000000 bytes, so you'd call
f.seek(50000000)
At this point, you'd want to read another 50000000 bytes, so you'd call f.read(50000000).
This would be your complete code listing, implementing f.seek and reading the whole file:
with open(x, "rb") as f:
f.seek(50000000) # omit if you don't want to skip this chunk
data = f.read(50000000)
while data:
... # do something
data = f.read(50000000)
Use f.read(50000000) in a loop at it will read the file in chunks of 50000000, e.g.:
In []:
from io import StringIO
s = '''hello'''
with StringIO(s) as f:
while True:
c = f.read(2)
if not c:
break
print(c)
Out[]:
he
ll
o
I am reading some value for file and wants to write modified value into file. My file is .ktx format [binary packed format].
I am using struct.pack() but seems that something is going wrong with that:
bytes = file.read(4)
bytesAsInt = struct.unpack("l",bytes)
number=1+(bytesAsInt[0])
number=hex(number)
no=struct.pack("1",number)
outfile.write(no)
I want to write in both ways little-endian and big-endian.
no_little =struct.pack(">1",bytesAsInt)
no_big =struct.pack("<1",bytesAsInt) # i think this is default ...
again you can check the docs and see the format characters you need
https://docs.python.org/3/library/struct.html
>>> struct.unpack("l","\x05\x04\x03\03")
(50529285,)
>>> struct.pack("l",50529285)
'\x05\x04\x03\x03'
>>> struct.pack("<l",50529285)
'\x05\x04\x03\x03'
>>> struct.pack(">l",50529285)
'\x03\x03\x04\x05'
also note that it is a lowercase L , not a one (as also covered in the docs)
I haven't tested this but the following function should solve your problem. At the moment it reads the file contents completely, creates a buffer and then writes out the updated contents. You could also modify the file buffer directly using unpack_from and pack_into but it might be slower (again, not tested). I'm using the struct.Struct class since you seem to want to unpack the same number many times.
import os
import struct
from StringIO import StringIO
def modify_values(in_file, out_file, increment=1, num_code="i", endian="<"):
with open(in_file, "rb") as file_h:
content = file_h.read()
num = struct.Struct(endian + num_code)
buf = StringIO()
try:
while len(content) >= num.size:
value = num.unpack(content[:num.size])[0]
value += increment
buf.write(num.pack(value))
content = content[num.size:]
except Exception as err:
# handle
else:
buf.seek(0)
with open(out_file, "wb") as file_h:
file_h.write(buf.read())
An alternative is to use the array which makes it quite easy. I don't know how to implement endianess with an array.
def modify_values(filename, increment=1, num_code="i"):
with open(filename, "rb") as file_h:
arr = array("i", file_h.read())
for i in range(len(arr)):
arr[i] += increment
with open(filename, "wb") as file_h:
arr.tofile(file_h)
In Python, given the name of a file, how can I write a loop that reads one character each time through the loop?
with open(filename) as f:
while True:
c = f.read(1)
if not c:
print("End of file")
break
print("Read a character:", c)
First, open a file:
with open("filename") as fileobj:
for line in fileobj:
for ch in line:
print(ch)
This goes through every line in the file and then every character in that line.
I like the accepted answer: it is straightforward and will get the job done. I would also like to offer an alternative implementation:
def chunks(filename, buffer_size=4096):
"""Reads `filename` in chunks of `buffer_size` bytes and yields each chunk
until no more characters can be read; the last chunk will most likely have
less than `buffer_size` bytes.
:param str filename: Path to the file
:param int buffer_size: Buffer size, in bytes (default is 4096)
:return: Yields chunks of `buffer_size` size until exhausting the file
:rtype: str
"""
with open(filename, "rb") as fp:
chunk = fp.read(buffer_size)
while chunk:
yield chunk
chunk = fp.read(buffer_size)
def chars(filename, buffersize=4096):
"""Yields the contents of file `filename` character-by-character. Warning:
will only work for encodings where one character is encoded as one byte.
:param str filename: Path to the file
:param int buffer_size: Buffer size for the underlying chunks,
in bytes (default is 4096)
:return: Yields the contents of `filename` character-by-character.
:rtype: char
"""
for chunk in chunks(filename, buffersize):
for char in chunk:
yield char
def main(buffersize, filenames):
"""Reads several files character by character and redirects their contents
to `/dev/null`.
"""
for filename in filenames:
with open("/dev/null", "wb") as fp:
for char in chars(filename, buffersize):
fp.write(char)
if __name__ == "__main__":
# Try reading several files varying the buffer size
import sys
buffersize = int(sys.argv[1])
filenames = sys.argv[2:]
sys.exit(main(buffersize, filenames))
The code I suggest is essentially the same idea as your accepted answer: read a given number of bytes from the file. The difference is that it first reads a good chunk of data (4006 is a good default for X86, but you may want to try 1024, or 8192; any multiple of your page size), and then it yields the characters in that chunk one by one.
The code I present may be faster for larger files. Take, for example, the entire text of War and Peace, by Tolstoy. These are my timing results (Mac Book Pro using OS X 10.7.4; so.py is the name I gave to the code I pasted):
$ time python so.py 1 2600.txt.utf-8
python so.py 1 2600.txt.utf-8 3.79s user 0.01s system 99% cpu 3.808 total
$ time python so.py 4096 2600.txt.utf-8
python so.py 4096 2600.txt.utf-8 1.31s user 0.01s system 99% cpu 1.318 total
Now: do not take the buffer size at 4096 as a universal truth; look at the results I get for different sizes (buffer size (bytes) vs wall time (sec)):
2 2.726
4 1.948
8 1.693
16 1.534
32 1.525
64 1.398
128 1.432
256 1.377
512 1.347
1024 1.442
2048 1.316
4096 1.318
As you can see, you can start seeing gains earlier on (and my timings are likely very inaccurate); the buffer size is a trade-off between performance and memory. The default of 4096 is just a reasonable choice but, as always, measure first.
Just:
myfile = open(filename)
onecharacter = myfile.read(1)
Python itself can help you with this, in interactive mode:
>>> help(file.read)
Help on method_descriptor:
read(...)
read([size]) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
I learned a new idiom for this today while watching Raymond Hettinger's Transforming Code into Beautiful, Idiomatic Python:
import functools
with open(filename) as f:
f_read_ch = functools.partial(f.read, 1)
for ch in iter(f_read_ch, ''):
print 'Read a character:', repr(ch)
Just read a single character
f.read(1)
This will also work:
with open("filename") as fileObj:
for line in fileObj:
for ch in line:
print(ch)
It goes through every line in the the file and every character in every line.
(Note that this post now looks extremely similar to a highly upvoted answer, but this was not the case at the time of writing.)
Best answer for Python 3.8+:
with open(path, encoding="utf-8") as f:
while c := f.read(1):
do_my_thing(c)
You may want to specify utf-8 and avoid the platform encoding. I've chosen to do that here.
Function – Python 3.8+:
def stream_file_chars(path: str):
with open(path) as f:
while c := f.read(1):
yield c
Function – Python<=3.7:
def stream_file_chars(path: str):
with open(path, encoding="utf-8") as f:
while True:
c = f.read(1)
if c == "":
break
yield c
Function – pathlib + documentation:
from pathlib import Path
from typing import Union, Generator
def stream_file_chars(path: Union[str, Path]) -> Generator[str, None, None]:
"""Streams characters from a file."""
with Path(path).open(encoding="utf-8") as f:
while (c := f.read(1)) != "":
yield c
You should try f.read(1), which is definitely correct and the right thing to do.
f = open('hi.txt', 'w')
f.write('0123456789abcdef')
f.close()
f = open('hej.txt', 'r')
f.seek(12)
print f.read(1) # This will read just "c"
To make a supplement,
if you are reading file that contains a line that is vvvvery huge, which might break your memory, you might consider read them into a buffer then yield the each char
def read_char(inputfile, buffersize=10240):
with open(inputfile, 'r') as f:
while True:
buf = f.read(buffersize)
if not buf:
break
for char in buf:
yield char
yield '' #handle the scene that the file is empty
if __name__ == "__main__":
for word in read_char('./very_large_file.txt'):
process(char)
os.system("stty -icanon -echo")
while True:
raw_c = sys.stdin.buffer.peek()
c = sys.stdin.read(1)
print(f"Char: {c}")
Combining qualities of some other answers, here is something that is invulnerable to long files / lines, while being more succinct and faster:
import functools as ft, itertools as it
with open(path) as f:
for c in it.chain.from_iterable(
iter(ft.partial(f.read, 4096), '')
):
print(c)
#reading out the file at once in a list and then printing one-by-one
f=open('file.txt')
for i in list(f.read()):
print(i)
I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so, as an exercise I wanted to write a program in python to do it.
What I think I want the program to do is to find the size of a file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up-to the next line-break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.
Can you help me with the key filesystem related parts: filesize, reading and writing in chunks and reading to a line-break?
I'll be writing this code test-first, so there's no need to give me a complete answer, unless its a one-liner ;-)
linux has a split command
split -l 100000 file.txt
would split into files of equal 100,000 line size
Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing :)
Now, there is a pypi module available that you can use to split files of any size into chunks. Check this out
https://pypi.org/project/filesplit/
As an alternative method, using the logging library:
>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt",
maxBytes=2**20*100, backupCount=100)
# 100 MB each, up to a maximum of 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> f = open("D://biglog.txt")
>>> while True:
... log.info(f.readline().strip())
Your files will appear as follows:
filename.txt (end of file)
filename.txt.1
filename.txt.2
...
filename.txt.10 (start of file)
This is a quick and easy way to make a huge log file match your RotatingFileHandler implementation.
This generator method is a (slow) way to get a slice of lines without blowing up your memory.
import itertools
def slicefile(filename, start, end):
lines = open(filename)
return itertools.islice(lines, start, end)
out = open("/blah.txt", "w")
for line in slicefile("/python27/readme.txt", 10, 15):
out.write(line)
don't forget seek() and mmap() for random access to files.
def getSomeChunk(filename, start, len):
fobj = open(filename, 'r+b')
m = mmap.mmap(fobj.fileno(), 0)
return m[start:start+len]
While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:
def splitfile(infilepath, chunksize):
fname, ext = infilepath.rsplit('.',1)
i = 0
written = False
with open(infilepath) as infile:
while True:
outfilepath = "{}{}.{}".format(fname, i, ext)
with open(outfilepath, 'w') as outfile:
for line in (infile.readline() for _ in range(chunksize)):
outfile.write(line)
written = bool(line)
if not written:
break
i += 1
You can use wc and split (see the respective manpages) to get the desired effect. In bash:
split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.
produces 3 parts of the same linecount (with a rounding error in the last, of course), named filename-chunk.00 to filename-chunk.02.
I've written the program and it seems to work fine. So thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read to see if its any quicker.
def Split(inputFile,numParts,outputName):
fileSize=os.stat(inputFile).st_size
parts=FileSizeParts(fileSize,numParts)
openInputFile = open(inputFile, 'r')
outPart=1
for part in parts:
if openInputFile.tell()<fileSize:
fullOutputName=outputName+os.extsep+str(outPart)
outPart+=1
openOutputFile=open(fullOutputName,'w')
openOutputFile.writelines(openInputFile.readlines(part))
openOutputFile.close()
openInputFile.close()
return outPart-1
usage - split.py filename splitsizeinkb
import os
import sys
def getfilesize(filename):
with open(filename,"rb") as fr:
fr.seek(0,2) # move to end of the file
size=fr.tell()
print("getfilesize: size: %s" % size)
return fr.tell()
def splitfile(filename, splitsize):
# Open original file in read only mode
if not os.path.isfile(filename):
print("No such file as: \"%s\"" % filename)
return
filesize=getfilesize(filename)
with open(filename,"rb") as fr:
counter=1
orginalfilename = filename.split(".")
readlimit = 5000 #read 5kb at a time
n_splits = filesize//splitsize
print("splitfile: No of splits required: %s" % str(n_splits))
for i in range(n_splits+1):
chunks_count = int(splitsize)//int(readlimit)
data_5kb = fr.read(readlimit) # read
# Create split files
print("chunks_count: %d" % chunks_count)
with open(orginalfilename[0]+"_{id}.".format(id=str(counter))+orginalfilename[1],"ab") as fw:
fw.seek(0)
fw.truncate()# truncate original if present
while data_5kb:
fw.write(data_5kb)
if chunks_count:
chunks_count-=1
data_5kb = fr.read(readlimit)
else: break
counter+=1
if __name__ == "__main__":
if len(sys.argv) < 3: print("Filename or splitsize not provided: Usage: filesplit.py filename splitsizeinkb ")
else:
filesize = int(sys.argv[2]) * 1000 #make into kb
filename = sys.argv[1]
splitfile(filename, filesize)
Here is a python script you can use for splitting large files using subprocess:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
file_path = sys.argv[1]
# i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
os.path.dirname(file_path) + '/'])
# Remove the original file once done splitting
try:
os.remove(file_path)
except OSError:
pass
You can call it externally:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
You can also import subprocess and run it directly in your program.
The issue with this approach is high memory usage: subprocess creates a fork with a memory footprint same size as your process and if your process memory is already heavy, it doubles it for the time that it runs. The same thing with os.system.
Here is another pure python way of doing this, although I haven't tested it on huge files, it's going to be slower but be leaner on memory:
CHUNK_SIZE = 5000
def yield_csv_rows(reader, chunk_size):
"""
Opens file to ingest, reads each line to return list of rows
Expects the header is already removed
Replacement for ingest_csv
:param reader: dictReader
:param chunk_size: int, chunk size
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
with open(local_file_path, 'rb') as f:
f.readline().strip().replace('"', '')
reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
chunks = yield_csv_rows(reader, CHUNK_SIZE)
for chunk in chunks:
if not chunk:
break
# Do something with your chunk here
Here is another example using readlines():
"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5
def yield_rows(reader, chunk_size):
"""
Yield row chunks
"""
chunk = []
for i, row in enumerate(reader):
if i % chunk_size == 0 and i > 0:
yield chunk
del chunk[:]
chunk.append(row)
yield chunk
def batch_operation(data):
for item in data:
print(item)
with open('file', 'r') as f:
chunks = yield_rows(f.readlines(), CHUNK_SIZE)
for _chunk in chunks:
batch_operation(_chunk)
The readlines example demonstrates how to chunk your data to pass chunks to function that expects chunks. Unfortunately readlines opens the whole file in memory, its better to use the reader example for performance. Although if you can easily fit what you need into memory and need to process it in chunks this should suffice.
You can achieve splitting any file to chunks like below, here the CHUNK_SIZE is 500000 bytes(500kb) and content can be any file :
for idx,val in enumerate(get_chunk(content, CHUNK_SIZE)):
data=val
index=idx
def get_chunk(content,size):
for i in range(0,len(content),size):
yield content[i:i+size]
This worked for me
import os
fil = "inputfile"
outfil = "outputfile"
f = open(fil,'r')
numbits = 1000000000
for i in range(0,os.stat(fil).st_size/numbits+1):
o = open(outfil+str(i),'w')
segment = f.readlines(numbits)
for c in range(0,len(segment)):
o.write(segment[c]+"\n")
o.close()
I had a requirement to split csv files for import into Dynamics CRM since the file size limit for import is 8MB and the files we receive are much larger. This program allows user to input FileNames and LinesPerFile, and then splits the specified files into the requested number of lines. I can't believe how fast it works!
# user input FileNames and LinesPerFile
FileCount = 1
FileNames = []
while True:
FileName = raw_input('File Name ' + str(FileCount) + ' (enter "Done" after last File):')
FileCount = FileCount + 1
if FileName == 'Done':
break
else:
FileNames.append(FileName)
LinesPerFile = raw_input('Lines Per File:')
LinesPerFile = int(LinesPerFile)
for FileName in FileNames:
File = open(FileName)
# get Header row
for Line in File:
Header = Line
break
FileCount = 0
Linecount = 1
for Line in File:
#skip Header in File
if Line == Header:
continue
#create NewFile with Header every [LinesPerFile] Lines
if Linecount % LinesPerFile == 1:
FileCount = FileCount + 1
NewFileName = FileName[:FileName.find('.')] + '-Part' + str(FileCount) + FileName[FileName.find('.'):]
NewFile = open(NewFileName,'w')
NewFile.write(Header)
NewFile.write(Line)
Linecount = Linecount + 1
NewFile.close()
import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)
For example if you want 50000 lines in one files and path is /home/data then you can run below command
subprocess.run('split -l 50000 /home/data', shell = True)
If you are not sure how many lines to keep in split files but knows how many split you want then In Jupyter Notebook/Shell you can check total number of Lines using below command and then divide by total number of split you want
! wc -l file_path
in this case
! wc -l /home/data
And Just so you know output file will not have file extension but its same extension as input file You can change it manually if Windows
Or, a python version of wc and split:
lines = 0
for l in open(filename): lines += 1
Then some code to read the first lines/3 into one file, the next lines/3 into another , etc.