When I try to read a file larger than 2 GB into a dataframe,
I get the following error: OverflowError: signed integer is greater than maximum
This is the issue described in https://bugs.python.org/issue42853
Is there a workaround for this?
As mentioned in the bug report, read the file into a buffer in chunks. Remember that you are still loading the data into RAM, so your system needs enough memory to hold it, or you will get an out-of-memory error.
Existing code:
s3_resource = boto3.resource("s3")
s3_client = boto3.client("s3")
s3_obj = s3_resource.Object(bucket_name, filename).get()
# A single read() of more than 2 GB is what triggers the OverflowError
with io.BytesIO(s3_obj["Body"].read()) as file:
    file_as_df = pd.read_csv(file, encoding='latin1', sep='\t')
Revised code:
response = s3_client.get_object(Bucket=bucket_name, Key=filename)
#os.path.join(key, datafile) #ignore this
buf = bytearray(response['ContentLength'])
view = memoryview(buf)
pos = 0
while True:
    chunk = response['Body'].read(67108864)  # read 64 MiB at a time
    if len(chunk) == 0:
        break  # end of stream
    view[pos:pos+len(chunk)] = chunk
    pos += len(chunk)
file_as_df = pd.read_csv(io.BytesIO(bytes(view)), encoding='latin1', sep='\t')
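The chunked-read loop can be tried without S3. This is a minimal sketch, assuming any file-like object with a read(n) method stands in for response['Body']; the helper name read_in_chunks and its parameters are hypothetical and not part of boto3.

```python
import io


def read_in_chunks(stream, total_size, chunk_size=64 * 1024 * 1024):
    """Fill a preallocated buffer from `stream` using several small reads."""
    buf = bytearray(total_size)
    view = memoryview(buf)
    pos = 0
    while pos < total_size:
        # Never ask for more than the space left in the buffer
        chunk = stream.read(min(chunk_size, total_size - pos))
        if not chunk:
            break  # stream ended earlier than announced
        view[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return bytes(view[:pos])


# Stand-in for the S3 body: a small tab-separated payload
data = b"col_a\tcol_b\n1\t2\n" * 1000
result = read_in_chunks(io.BytesIO(data), len(data), chunk_size=4096)
```

The result can be wrapped in io.BytesIO and passed to pd.read_csv exactly as in the revised code.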
I have a file of around 2 GB in an S3 folder. It contains a header and a trailer of varying length, and the actual data is also of varying length. I need to copy this file to another location in S3 programmatically after removing the header and trailer. Can anyone help me with this?
File format (say file name abc.txt)=>
I tried loading the file from S3 into pandas, but it failed with a memory error, so I can't use pandas here.
I tried the boto3 library with obj.get()['Body'].read(), but how do I remove the header and trailer from this data and then write it back to a file in S3?
Is there a more effective way?
I'll assume you have functions is_header(line) and is_trailer(line) that tell you whether a line is a header or a trailer, respectively. Then here's how you could stream the file from S3 and save it back.
import boto3
s3 = boto3.client("s3")
bucket = "mybucket"
key = "path/to/abc.txt"
new_key = "path/to/def.txt"
r = s3.get_object(Bucket=bucket, Key=key)
sb = r["Body"]  # get_object returns the StreamingBody under the "Body" key
# iter_lines() strips line endings, so join with newlines to keep the line structure
content = [line for line in sb.iter_lines() if not is_header(line) and not is_trailer(line)]
content = b"\n".join(content)
r = s3.put_object(Bucket=bucket, Key=new_key, Body=content)
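The filtering step can be checked locally before pointing it at S3. This is a minimal sketch, assuming header lines start with HDR and trailer lines start with TRL; both rules and the helper strip_header_trailer are hypothetical placeholders for your own logic.

```python
import io


def is_header(line: bytes) -> bool:
    # Hypothetical rule: header lines start with b"HDR"
    return line.startswith(b"HDR")


def is_trailer(line: bytes) -> bool:
    # Hypothetical rule: trailer lines start with b"TRL"
    return line.startswith(b"TRL")


def strip_header_trailer(lines):
    """Yield only the data lines, each terminated by a newline."""
    for line in lines:
        if is_header(line) or is_trailer(line):
            continue
        yield line + b"\n"


# Stand-in for StreamingBody.iter_lines(): lines without their line endings
raw = io.BytesIO(b"HDR|2024\nrow1\nrow2\nTRL|2\n")
lines = (line.rstrip(b"\r\n") for line in raw)
cleaned = b"".join(strip_header_trailer(lines))
```

Because the function is a generator, it works on any iterable of lines and never needs the whole file at once.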
Stream Data to Avoid Out of Memory Errors
The above code assumes that the entire file can fit into memory, which I assume it can because it's only 2 GB. If not you'll need to use Multipart Uploads.
Here's one way to do that using a TransferManager
from typing import Optional
import boto3
from boto3.s3.transfer import TransferConfig
import botocore

MB = 1024*1024

class FileNoHeader:
    """Wrapper for a botocore StreamingBody to filter headers/trailers"""
    def __init__(self, stream: botocore.response.StreamingBody):
        self.stream = stream
        self.first_line = True
        self.line_generator = self.stream.iter_lines()

    def read(self, size: Optional[int] = None) -> bytes:
        """Wrap StreamingBody.iter_lines to read line-by-line while making it look like a fileobj

        size: int, optional
            How much data to read. This is a minimum amount: because we are using
            StreamingBody.iter_lines to read the file line by line, we can only return
            whole lines. If `None`, the default, read the entire file.
            This parameter is for compatibility with the read() method of a file-like object
        """
        data = []
        amt = 0
        while size is None or amt < size:
            try:
                line = next(self.line_generator)
            except StopIteration:
                break  # no more lines in the stream
            if self.is_header(line) or self.is_trailer(line):
                continue  # drop header/trailer lines
            # iter_lines() strips line endings, so add one back per line
            data.append(line + b"\n")
            amt += len(line) + 1
        return b"".join(data)

    def close(self):
        """Close the underlying StreamingBody"""
        self.stream.close()

    def is_header(self, line):
        # TODO: implement your logic
        # right now just skips the first line
        if self.first_line:
            self.first_line = False
            return True
        return self.first_line

    def is_trailer(self, line):
        # TODO: implement your logic
        return False
## Usage
config = TransferConfig(multipart_chunksize=1*MB)
s3 = boto3.client("s3")
bucket = "mybucket"
key = "path/to/abc.txt"
new_key = "path/to/abc_no_header.txt"
r = s3.get_object(Bucket=bucket, Key=key)
streaming_body = r["Body"]
data_stream = FileNoHeader(streaming_body)

def tcback(bytes_transferred):
    print(f"{bytes_transferred} bytes transferred")

# upload_fileobj reads from data_stream and performs the multipart upload
s3.upload_fileobj(data_stream, bucket, new_key, Config=config, Callback=tcback)
Sidebar: AWS Lambda
If you are using AWS Lambda functions, you can have up to 10 GB of memory. You can set the memory in the AWS Console or using the API. Here are the docs for boto3 and the AWS CLI v2.
I make a requests.post() call to a server, which replies with JSON. The JSON contains some keys and also a base64-encoded file.
This is an example of a response from the server:
'success' is the key that tells whether access with the private data was successful
'message' is the key used when success is False (in this case success == True, so the message is empty)
'data' is the dictionary key that contains the fileName and the base64-encoded file
{'success': True,
'message': '',
'data': {'fileName': 'Python_logo_and_wordmark.svg.png',
'file': 'iVBORw0KGgoAAAANSUhEUgAABLAAAA....'}} #To limit the space, I cut the very long bytes example
So the JSON response also contains the file, which I need to decode with base64.b64decode(r.json()['data']['file'])
Everything works: I can get my file and decode it correctly.
The problem is that for large files I would like to use streaming, like this:
file = r"G:\Python_logo_and_wordmark.svg.png"
if os.path.isfile(file):
    os.remove(file)

def get_chunk(chunk):
    # Try to decode the base64 file (Chunked)
    # is this a wrong approach?
    chunk = chunk.decode("ascii")
    chunk = chunk.replace('"', '')
    if "file" in chunk:
        chunk = chunk.split('file:')[1]
    elif "}}" in chunk:
        chunk = chunk.split('}}')[0]
    else:
        chunk = chunk
    chunk += "=" * ((4 - len(chunk) % 4) % 4)
    chunk_decoded = base64.b64decode(chunk)
    return chunk_decoded

r = requests.post(url=my_url, json=my_data, stream=True)
iter_content = r.iter_content(chunk_size=64)
while True:
    chunk = next(iter_content, None)
    if not chunk:
        break
    chunk_decoded = get_chunk(chunk)
    with open(file, "ab") as file_object:
        file_object.write(chunk_decoded)
iter_content chunks return this:
Decoding sometimes fails with padding errors. After a week of trying I decided to ask here, as I am afraid my whole approach to this situation is wrong.
I would like to know how to handle this situation the right way.
Based on the requirement you mentioned in the comment, here are the current issues and some likely future problems:
In your get_chunk function, you're doing this:
chunk = chunk.decode("ascii")
chunk = chunk.replace('"', '')
if "file" in chunk:
    chunk = chunk.split('file:')[1]
elif "}}" in chunk:
    chunk = chunk.split('}}')[0]
else:
    chunk = chunk
Now look at the first chunk returned by iter_content:
It falls under the condition if "file" in chunk: because fileName contains the string file. When the code then splits on file:, it gets a list with a single element, because file appeared inside fileName and not as file:. Hence the program throws the following error:
Traceback (most recent call last):
File "main.py", line 7, in <module>
chunk = chunk.split('file:')[1]
IndexError: list index out of range
Try if "file:" in chunk: instead.
Your program may also fail if the fileName contains something like "prod_file:someName". You have to check for that too.
A chunk that doesn't contain file can still contain }}, so that can also break what you're trying to achieve.
You can modify the server response and wrap the start and end of the base64-encoded file string with unique identifiers. You then receive the response as shown below and can reliably identify the start and end of the file in this streaming approach. For example:
{'success': True,
'message': '',
'data': {'fileName': 'Python_logo_and_wordmark.svg.png',
'file': '0000101100iVBORw0KGgoAAAANSUhEUgAABLAAAA....0000101101'}}
I've added 0000101100 as the starting identifier and 0000101101 as the ending one. You can trim them off while writing to the chunk/file. You can use any other unique identifier format, as long as it cannot occur inside the base64 data itself (for example, by using characters outside the base64 alphabet).
Feel free to ask if there's any further confusion.
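The marker idea can be sketched as a small generator that survives markers being split across chunk boundaries. This is a minimal sketch, assuming the chunks come from something like iter_content and that the markers never occur inside the payload; the function name extract_between is hypothetical.

```python
import base64


def extract_between(chunks, start, end):
    """Yield the bytes found between `start` and `end` in a chunked stream."""
    buf = b""
    started = False
    for chunk in chunks:
        buf += chunk
        if not started:
            i = buf.find(start)
            if i < 0:
                # Keep only a tail that could still hold a partial start marker
                buf = buf[-(len(start) - 1):] if len(start) > 1 else b""
                continue
            buf = buf[i + len(start):]
            started = True
        j = buf.find(end)
        if j >= 0:
            if j:
                yield buf[:j]
            return
        # Hold back a tail that could still hold a partial end marker
        keep = len(end) - 1
        if len(buf) > keep:
            yield buf[:len(buf) - keep]
            buf = buf[len(buf) - keep:]
    raise ValueError("stream ended before the end marker was found")


payload = base64.b64encode(b"hello streaming world")
doc = b'{"data": {"fileName": "x.png", "file": "0000101100' + payload + b'0000101101"}}'
chunks = [doc[i:i + 7] for i in range(0, len(doc), 7)]
extracted = b"".join(extract_between(chunks, b"0000101100", b"0000101101"))
```

The pieces it yields are not aligned to 4 characters, so they still need to be regrouped before base64 decoding.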
I tried to analyze your problem and can't find a better solution than the one @devReddir provided.
The reason is that it is impossible (or very difficult) to parse the data before it has been completely downloaded.
A workaround may be to save the data as-is in one big file and parse it in a separate worker. That reduces server memory usage while downloading the file and avoids losing data.
Save the file as-is:
while True:
    chunk = next(iter_content, None)
    if not chunk:
        break
    with open(file, "ab") as file_object:
        file_object.write(chunk)
Read the file in a separate worker:
import json
import base64
with open("saved_as_is.json") as json_file:
    json_object = json.load(json_file)
encoded_base64 = json_object['data']['file']
decoded = base64.b64decode(encoded_base64)
Why is parsing data on the fly so difficult?
The file separator may be split across two chunks:
b'... ... ... .., "fi'
b'le": "AAAB... ... .'
Actually, \\ is an escape symbol and you must handle it manually (and don't forget that \\ may also be split across chunks → b'...\', b'\...'):
If the file is very small, a chunk may look like:
b'"file":"SUPERTINY_BASE64_DECODED", "fileName":"Python_lo'
And chunk.split('file:')[1] won't work.
A base64 chunk must be a multiple of 4 characters long, so if your first chunk (the characters after "file":) is 3 characters long, you need to read the next chunk and move its first character to the end of the previous chunk, and keep doing that for all following iterations.
So there are tons of nuances if you try to parse the data manually.
However, if you want to go this hard way, here is how to decode base64 chunks.
And here is the list of allowed base64 characters.
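The multiple-of-4 rule can be handled by carrying the leftover characters into the next chunk. This is a minimal sketch, assuming the input already contains only base64 characters (quotes, key names and escapes removed); the class Base64StreamDecoder is a hypothetical helper, not a library API.

```python
import base64


class Base64StreamDecoder:
    """Decode base64 text that arrives in arbitrarily sized chunks."""

    def __init__(self):
        self._carry = b""  # 0 to 3 characters left over from the previous chunk

    def feed(self, chunk: bytes) -> bytes:
        data = self._carry + chunk
        usable = len(data) - (len(data) % 4)  # largest multiple of 4
        self._carry = data[usable:]
        return base64.b64decode(data[:usable])

    def finish(self) -> bytes:
        if self._carry:
            raise ValueError("base64 data length is not a multiple of 4")
        return b""


original = bytes(range(256)) * 4
encoded = base64.b64encode(original)
decoder = Base64StreamDecoder()
parts = [decoder.feed(encoded[i:i + 5]) for i in range(0, len(encoded), 5)]
parts.append(decoder.finish())
decoded = b"".join(parts)
```

Each call to feed decodes only complete 4-character groups, so padding characters are only ever seen in the final group.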
If you want to use @devReddir's solution and store the whole data in memory, I'm not sure there is any benefit from streaming at all.
Okay, here is a complete working solution:
Server side (main.py):
I added this code to run a test server that responds with JSON data containing a base64-encoded file.
I also added some randomness to the response to check that the string parsing does not depend on character position.
import base64 as b
import json as j
from fastapi import FastAPI as f
import requests as r
import random as rr
import string as s
import uvicorn as u

banana_url = 'https://upload.wikimedia.org/wikipedia/commons/c/ce/PNG_demo_Banana.png'
banana_b64 = b.encodebytes(
    r.get(banana_url, stream=True).raw.read())
banana_b64 = banana_b64.decode('ascii').replace('\n', '').encode('ascii')

def get_response(banana_file, banana_file_name):
    random_status = ''
    for i in range(rr.randint(3, 30)): random_status += rr.choice(s.ascii_letters)
    banana_response = {
        'status': random_status,
        'data': {
            'fileName': banana_file_name.split('/')[-1],
            'file': banana_file,
        }
    }
    if len(random_status) % 2 == 0:
        banana_response['data']['random_payload'] = 'hello_world'
        banana_response['random_payload'] = '%hello_world_again%'
    return banana_response

app = f()

@app.get("/")  # route path is an assumption; adjust it to the URL the client requests
async def read_root():
    resp = get_response(banana_b64, banana_url.split('/')[-1])
    print('file length:', len(resp['data']['file']))
    return resp

if __name__ == "__main__":
    u.run('main:app', host="", port=8000, reload=True, workers=1)
Client side (file downloader decoder.py):
import requests
import base64

# must be larger than len('"file":')
CHUNK_SIZE = 64  # assumed value; the original value is not shown

# iterable response
r = requests.get('', stream=True).iter_content(chunk_size=CHUNK_SIZE)

class ChunkParser:
    file = None
    total_length = 0

    def close(self):
        if self.file:
            self.file.close()

    def __init__(self, file_name) -> None:
        self.file = open(file_name, 'ab')

    def add_chunk(self, chunk):
        # remove all escape symbols if existing
        chunk = chunk.decode('ascii').replace('\\', '').encode('ascii')
        # if chunk size is not multiple of 4, return modulo to be able add it in next chunk
        modulo = b''
        if not (l := len(chunk)) % 4 == 0:
            modulo = chunk[l-(l%4):]
            chunk = chunk[:l-(l%4)]
        # decode the aligned part and append it to the output file
        self.file.write(base64.b64decode(chunk))
        self.total_length += len(chunk)
        return modulo

prev_chunk = None
cur_chunk = None
writing_started = False
last_chunk = False
parser = ChunkParser('temp_file.png')
file_found = False
while True:
    # set previous chunk on first iterations before modulo may be returned
    if cur_chunk is not None and not writing_started:
        prev_chunk = cur_chunk
    # get current chunk
    cur_chunk = next(r, None)
    # skip first iteration
    if prev_chunk is None:
        continue
    # break loop if no data
    if not cur_chunk:
        break
    # concatenate two chunks to avoid b' ... "fil', b'e": ... ' pattern
    two_chunks = prev_chunk + cur_chunk
    # if file key found get real base64 encoded data
    if not file_found and '"file":' in two_chunks.decode('ascii'):
        file_found = True
        # get part after "file" key
        two_chunks = two_chunks.decode('ascii').split('"file":')[1].encode('ascii')
    if file_found and not writing_started:
        # data should be started after first "-quote
        # so cut all data before "
        if '"' in (t := two_chunks.decode('ascii')):
            two_chunks = t[t.find('"')+1:].encode('ascii')
            writing_started = True
        # handle b' ... "file":', b'"... ' pattern
        else:
            cur_chunk = b''
    # check for last data chunk
    # "-quote means end of value
    if writing_started and '"' in (t := two_chunks.decode('ascii')):
        two_chunks = t[:t.find('"')].encode('ascii')
        last_chunk = True
    if writing_started:
        # decode and write data in file
        prev_chunk = parser.add_chunk(two_chunks)
    # end operation
    if last_chunk:
        if (l := len(prev_chunk)) > 0:
            # if last modulo length is larger than 0, the total data length is not a multiple of 4
            # probably data loss appear?
            raise ValueError(f'Bad end of data. length is {str(l)} and last characters are {prev_chunk.decode("ascii")}')
        break
parser.close()
Don't forget to compare files after download when testing this script:
# get md5 of downloaded by chunks file
$ md5 temp_file.png
MD5 (temp_file.png) = 806165d96d5f9a25cebd2778ae4a3da2
# get md5 of downloaded file using browser
$ md5 PNG_demo_Banana.png
MD5 (PNG_demo_Banana.png) = 806165d96d5f9a25cebd2778ae4a3da2
You could stream it down to a file like this (pip install base64io):
class decoder():
    def __init__(self, fh):
        self.fileh = open(fh, 'rb')
        self.closed = False
        search = ''
        start_tag = '"file": "'
        # scan up to the first 1024 bytes for the start of the base64 value
        for i in range(1024):
            search += self.fileh.read(1).decode('UTF8')
            if len(start_tag) > len(search)+1:
                continue
            if search[-len(start_tag):] == start_tag:
                break

    def read(self, chunk=1200):
        data = self.fileh.read(chunk)
        if not data:
            self.close()
            return b''
        # strip the closing '"}}' of the JSON document from the last chunk
        return data if not data.decode('UTF8').endswith('"}}') else data[:-3]

    def close(self):
        self.fileh.close()
        self.closed = True

    def closed(self):
        # note: shadowed by the instance attribute self.closed set in __init__
        return self.closed

    def flush(self):
        pass

    def write(self):
        pass

    def readable(self):
        return True
And then use the class like this:
from base64io import Base64IO
encoded_source = decoder(fh)
with open("target_file.jpg", "wb") as target, Base64IO(encoded_source) as source:
    for line in source:
        target.write(line)
But of course you need to switch from streaming from a local file to streaming from the requests raw object.
I am trying to read an image file from a Django request in chunks. The Django file handler's chunks method does not work well for me, so I created a custom one. It works, but the end result wasn't what I expected: after reading the file in chunks and putting them back together, the image somehow gets corrupted, and I don't have a solution for it.
def process_download_with_progress(self, image_file, length):
    process_recoder = ProgressRecorder(self)
    print('Upload: Task Started')
    fs = FileSystemStorage()
    buffer = io.BytesIO()
    chunk_size = 0
    for chunk in read_chunk(image_file.file, length):
        chunk_size += 1
        buffer.write(chunk)
        process_recoder.set_progress(chunk_size, length, description=f'uploaded {chunk_size*length} bytes of the file')
    image = ImageFile(buffer, name=image_file.name)
    fs.save(image_file.name, content=image)
    return 'Done'

def read_chunk(file_object, chunk_size=125):
    while True:
        file = file_object.read(chunk_size)
        if not file:
            break
        yield file
This is my code. Any help will be appreciated, thanks.
I've got a script to decompress and parse data contained in a bunch of very large bzip2 compressed files. Since it can take a while I'd like to have some way to monitor the progress. I know I can get the file size with os.path.getsize(), but bz2.BZ2File.tell() returns the position within the uncompressed data. Is there any way to get the current position within the compressed file so I can monitor the progress?
Bonus points if there's a Python equivalent to Java's ProgressMonitorInputStream.
If you only need to parse the data in the bzipped file, I think it should be possible to avoid unzipping the file before reading it. I have not tested this with bzip2, only with gzipped files, but I hope it is also possible with bzipped files.
See for instance :
How to write csv in python efficiently?.
This is the solution I came up with that seems to work.
import bz2

class SimpleBZ2File(object):
    def __init__(self, path, readsize=1024):
        self.decomp = bz2.BZ2Decompressor()
        self.rawinput = open(path, 'rb')
        self.eof = False
        self.readsize = readsize
        self.leftover = b''

    def tell(self):
        # position within the compressed file, usable for progress reporting
        return self.rawinput.tell()

    def __iter__(self):
        while not self.eof:
            rawdata = self.rawinput.read(self.readsize)
            if rawdata == b'':
                self.eof = True
                break
            data = self.decomp.decompress(rawdata)
            if not data:
                continue  # we need to supply more raw to decompress
            newlines = (self.leftover + data).splitlines(True)
            # hold back a final piece that is not yet terminated by a newline
            if newlines[-1].endswith(b'\n'):
                self.leftover = b''
            else:
                self.leftover = newlines.pop()
            for l in newlines:
                yield l
        if self.leftover:
            yield self.leftover
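An alternative to a hand-written decompressor is to open the raw file yourself and hand it to bz2.BZ2File, then ask the raw file for its position. This is a minimal sketch, assuming a single-stream .bz2 file on disk; the helper iter_with_progress is a hypothetical name. Because BZ2File reads the compressed data in blocks, the reported fraction moves in steps.

```python
import bz2
import os
import tempfile


def iter_with_progress(path):
    """Yield (line, fraction of compressed bytes read so far) for a .bz2 file."""
    size = os.path.getsize(path)
    with open(path, "rb") as raw, bz2.BZ2File(raw) as bz:
        for line in bz:
            # raw.tell() is the position in the compressed file
            yield line, (raw.tell() / size if size else 1.0)


# Demo with a temporary compressed file
lines = [b"row %d\n" % i for i in range(5000)]
with tempfile.NamedTemporaryFile(suffix=".bz2", delete=False) as tmp:
    tmp.write(bz2.compress(b"".join(lines)))
seen = list(iter_with_progress(tmp.name))
os.remove(tmp.name)
```

In a real script you would print or log the fraction every few thousand lines instead of collecting it.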
This is related to the question about zip bombs, but having gzip or bzip2 compression in mind, e.g. a web service accepting .tar.gz files.
Python provides a handy tarfile module that is convenient to use, but does not seem to provide protection against zipbombs.
In python code using the tarfile module, what would be the most elegant way to detect zip bombs, preferably without duplicating too much logic (e.g. the transparent decompression support) from the tarfile module?
And, just to make it a bit less simple: No real files are involved; the input is a file-like object (provided by the web framework, representing the file a user uploaded).
You could use the resource module to limit the resources available to your process and its children.
If you need to decompress in memory, then you could set resource.RLIMIT_AS (or RLIMIT_DATA, RLIMIT_STACK), e.g., using a context manager to automatically restore it to its previous value:
import contextlib
import resource

@contextlib.contextmanager
def limit(limit, type=resource.RLIMIT_AS):
    soft_limit, hard_limit = resource.getrlimit(type)
    resource.setrlimit(type, (limit, hard_limit)) # set soft limit
    try:
        yield
    finally:
        resource.setrlimit(type, (soft_limit, hard_limit)) # restore

with limit(1 << 30): # 1GB
    # do the thing that might try to consume all memory
    pass
If the limit is reached, MemoryError is raised.
This will determine the uncompressed size of the gzip stream, while using limited memory:
import sys
import zlib

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15+16)  # 15+16: expect a gzip header and trailer
total = 0
while True:
    buf = z.unconsumed_tail
    if buf == b"":
        buf = f.read(1024)
        if buf == b"":
            break
    got = z.decompress(buf, 4096)  # return at most 4096 uncompressed bytes
    if got == b"":
        break
    total += len(got)
print(total)
if z.unused_data != b"" or f.read(1024) != b"":
    print("warning: more input after end of gzip stream")
It will return a slight overestimate of the space required for all of the files in the tar file when extracted. The length includes those files, as well as the tar directory information.
The gzip.py code does not control the amount of data decompressed, except by virtue of the size of the input data. In gzip.py, it reads 1024 compressed bytes at a time. So you can use gzip.py if you're ok with up to about 1056768 bytes of memory usage for the uncompressed data (1032 * 1024, where 1032:1 is the maximum compression ratio of deflate). The solution here calls decompress on a zlib decompression object with the second argument, which limits the amount of uncompressed data returned. gzip.py does not.
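The same metering idea can be wrapped in a function that gives up once a size limit is crossed. This is a minimal sketch, assuming a single-member gzip stream read from a file-like object; the function name gzip_uncompressed_size and the limit parameter are hypothetical.

```python
import gzip
import io
import zlib


def gzip_uncompressed_size(stream, limit, in_chunk=1024, out_chunk=4096):
    """Count uncompressed bytes, raising ValueError once `limit` is exceeded."""
    z = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip framing
    total = 0
    while not z.eof:
        buf = z.unconsumed_tail or stream.read(in_chunk)
        out = z.decompress(buf, out_chunk)  # at most out_chunk bytes per call
        total += len(out)
        if total > limit:
            raise ValueError("uncompressed size exceeds limit")
        if not buf and not out:
            raise EOFError("truncated gzip stream")
    return total


payload = b"\0" * 5_000_000
compressed = gzip.compress(payload)
size = gzip_uncompressed_size(io.BytesIO(compressed), limit=10_000_000)

try:
    gzip_uncompressed_size(io.BytesIO(compressed), limit=1_000_000)
    rejected = False
except ValueError:
    rejected = True
```

Only out_chunk bytes of output exist at any time, so memory use stays small regardless of the compression ratio.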
This will accurately determine the total size of the extracted tar entries by decoding the tar format:
import sys
import zlib

def decompn(f, z, n):
    """Return n uncompressed bytes, or fewer if at the end of the compressed
    stream. This only decompresses as much as necessary, in order to
    avoid excessive memory usage for highly compressed input.
    """
    blk = b""
    while len(blk) < n:
        buf = z.unconsumed_tail
        if buf == b"":
            buf = f.read(1024)
        got = z.decompress(buf, n - len(blk))
        blk += got
        if got == b"":
            break
    return blk

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15+16)
total = 0
left = 0
while True:
    blk = decompn(f, z, 512)
    if len(blk) < 512:
        break
    if left == 0:
        # this block is a tar header
        if blk == b"\0"*512:
            continue
        # type flags 1-6 (links, devices, directories, FIFOs) carry no data
        if blk[156:157] in [b"1", b"2", b"3", b"4", b"5", b"6"]:
            continue
        if blk[124] == 0x80:
            # base-256 size encoding
            size = 0
            for i in range(125, 136):
                size <<= 8
                size += blk[i]
        else:
            # octal size field
            size = int(blk[124:136].split()[0].split(b"\0")[0], 8)
        # extended headers and long names do not count as extracted data
        if blk[156:157] not in [b"x", b"g", b"X", b"L", b"K"]:
            total += size
        left = (size + 511) // 512
    else:
        left -= 1
print(total)
if blk != b"":
    print("warning: partial final block")
if left != 0:
    print("warning: tar file ended in the middle of an entry")
if z.unused_data != b"" or f.read(1024) != b"":
    print("warning: more input after end of gzip stream")
You could use a variant of this to scan the tar file for bombs. This has the advantage of finding a large size in the header information before you even have to decompress that data.
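A shorter variant of the header scan can lean on the tarfile module itself. This is a minimal sketch, assuming the sizes declared in the tar headers are what you want to police and that the archive can be read as a stream; it swaps the hand-written header parsing for tarfile's stream mode and does not by itself bound the memory tarfile uses while decompressing. The function name total_member_size is hypothetical.

```python
import io
import tarfile


def total_member_size(fileobj, limit):
    """Sum the sizes declared in the tar headers, aborting above `limit`."""
    total = 0
    # "r|*" reads the archive as a non-seekable stream with any compression
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            total += member.size
            if total > limit:
                raise ValueError("archive contents exceed limit")
    return total


# Build a small .tar.gz in memory to try it out
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w:gz") as tar:
    for name, size in (("a.txt", 1000), ("b.txt", 2500)):
        info = tarfile.TarInfo(name)
        info.size = size
        tar.addfile(info, io.BytesIO(b"x" * size))

archive.seek(0)
declared = total_member_size(archive, limit=10_000)

archive.seek(0)
try:
    total_member_size(archive, limit=2_000)
    rejected = False
except ValueError:
    rejected = True
```

Nothing is extracted here, so the check can run before deciding whether to unpack the upload.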
As for .tar.bz2 archives, the Python bz2 library (at least as of 3.3) is unavoidably unsafe for bz2 bombs consuming too much memory. The bz2.decompress function does not offer a second argument like zlib.decompress does. This is made even worse by the fact that the bz2 format has a much, much higher maximum compression ratio than zlib due to run-length coding. bzip2 compresses 1 GB of zeros to 722 bytes. So you cannot meter the output of bz2.decompress by metering the input as can be done with zlib.decompress even without the second argument. The lack of a limit on the decompressed output size is a fundamental flaw in the Python interface.
I looked in the _bz2module.c in 3.3 to see if there is an undocumented way to use it to avoid this problem. There is no way around it. The decompress function in there just keeps growing the result buffer until it can decompress all of the provided input. _bz2module.c needs to be fixed.
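Later Python versions (3.5 and newer) added a max_length argument to bz2.BZ2Decompressor.decompress, which makes the same metering possible for bzip2. This is a minimal sketch, assuming a single-stream bz2 input read from a file-like object; the function name bz2_uncompressed_size and the limit parameter are hypothetical.

```python
import bz2
import io


def bz2_uncompressed_size(stream, limit, in_chunk=1024, out_chunk=65536):
    """Count uncompressed bytes, raising ValueError once `limit` is exceeded."""
    d = bz2.BZ2Decompressor()
    total = 0
    while not d.eof:
        if d.needs_input:
            raw = stream.read(in_chunk)
            if not raw:
                raise EOFError("truncated bz2 stream")
        else:
            raw = b""  # drain output still buffered inside the decompressor
        total += len(d.decompress(raw, max_length=out_chunk))
        if total > limit:
            raise ValueError("uncompressed size exceeds limit")
    return total


bomb = bz2.compress(b"\0" * 10_000_000)
size = bz2_uncompressed_size(io.BytesIO(bomb), limit=20_000_000)

try:
    bz2_uncompressed_size(io.BytesIO(bomb), limit=1_000_000)
    rejected = False
except ValueError:
    rejected = True
```

The needs_input attribute tells the loop when to feed more compressed data and when to keep draining output from what was already supplied.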
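If you develop for Linux, you can run the decompression in a separate process and use ulimit to limit its memory usage.
import subprocess
# ulimit is a shell builtin, so the command must run through a shell; -v takes the limit in KiB
subprocess.Popen("ulimit -v %d; ./decompression_script.py %s" % (LIMIT, FILE), shell=True)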
Keep in mind that decompression_script.py should decompress the whole file in memory, before writing to disk.
I guess the answer is: There is no easy, readymade solution. Here is what I use now:
class SafeUncompressor(object):
    """Small proxy class that enables external file object
    support for uncompressed, bzip2 and gzip files. Works transparently, and
    supports a maximum size to avoid zipbombs.
    """
    blocksize = 16 * 1024

    class FileTooLarge(Exception):
        pass

    def __init__(self, fileobj, maxsize=10*1024*1024):
        self.fileobj = fileobj
        self.name = getattr(self.fileobj, "name", None)
        self.maxsize = maxsize
        self.init()

    def init(self):
        import bz2
        import gzip
        self.pos = 0
        self.fileobj.seek(0)
        self.buf = b""
        self.format = "plain"
        magic = self.fileobj.read(2)
        # rewind so the decompressor sees the stream from its first byte
        self.fileobj.seek(0)
        if magic == b'\037\213':
            self.format = "gzip"
            self.gzipobj = gzip.GzipFile(fileobj=self.fileobj, mode='r')
        elif magic == b'BZ':
            raise IOError("bzip2 support in SafeUncompressor disabled, as self.bz2obj.decompress is not safe")
            self.format = "bz2"
            self.bz2obj = bz2.BZ2Decompressor()

    def read(self, size):
        b = [self.buf]
        x = len(self.buf)
        while x < size:
            if self.format == 'gzip':
                data = self.gzipobj.read(self.blocksize)
                if not data:
                    break
            elif self.format == 'bz2':
                raw = self.fileobj.read(self.blocksize)
                if not raw:
                    break
                # this can already bomb here, to some extent.
                # so disable bzip support until resolved.
                # Also monitor http://stackoverflow.com/questions/13622706/how-to-protect-myself-from-a-gzip-or-bzip2-bomb for ideas
                data = self.bz2obj.decompress(raw)
            else:
                data = self.fileobj.read(self.blocksize)
                if not data:
                    break
            b.append(data)
            x += len(data)
            if self.pos + x > self.maxsize:
                self.buf = b""
                self.pos = 0
                raise SafeUncompressor.FileTooLarge("Compressed file too large")
        self.buf = b"".join(b)
        buf = self.buf[:size]
        self.buf = self.buf[size:]
        self.pos += len(buf)
        return buf

    def seek(self, pos, whence=0):
        if whence != 0:
            raise IOError("SafeUncompressor only supports whence=0")
        if pos < self.pos:
            # cannot seek backwards in a compressed stream: restart from the beginning
            self.init()
        self.read(pos - self.pos)

    def tell(self):
        return self.pos
It does not work well for bzip2, so that part of the code is disabled. The reason is that bz2.BZ2Decompressor.decompress can already produce an unwanted large chunk of data.
I also need to handle zip bombs in uploaded zipfiles.
I do this by creating a fixed size tmpfs, and unzipping to that. If the extracted data is too large then the tmpfs will run out of space and give an error.
Here are the Linux commands to create a 200M tmpfs to unzip to.
sudo mkdir -p /mnt/ziptmpfs
echo 'tmpfs /mnt/ziptmpfs tmpfs rw,nodev,nosuid,size=200M 0 0' | sudo tee -a /etc/fstab