I've got a big (many GB) .bz2 compressed file, which I am reading in using python's bz2.open() function. I want to provide a status update about how much of the file is left to read. I can get the file size from the file system, and the number of uncompressed bytes read so far using bz2_filehandle.tell(), but how can I get the number of compressed bytes read so far?
Thanks to #ignacio-vazquez-abrams I worked it out:
with open("path/to/file.bz2", 'rb') as compressed_file:
with bz2.open(file, 'rb') as uncompressed_file:
for l in uncompressed_file:
print(compressed_file.tell(), uncompressed_file.tell())
Related
I have a very big .BIN file and I am loading it into the available RAM memory (128 GB) by using:
ice.Load_data_to_memory("global.bin", True)
(see: https://github.com/iceland2k14/secp256k1)
Now I need to read the content of the file in chunks of 10 bytes, and for that I am using:
with open('global.bin', 'rb') as bf:
while True:
data = bf.read(10)
if data = y:
do this!
This works good with the rest of the code, if the .BIN file is small, but not if the file is big. My suspection is, by writing the code this way I will open the .BIN file twice OR I won't get any result, because with open('global.bin', 'rb') as bf is not "synchronized" with ice.Load_data_to_memory("global.bin", True). Thus, I would like to find a way to directly read the chunks of 10 bytes from memory, without having to open the file with "with open('global.bin', 'rb') as bf"
I found a working approach here: LOAD FILE INTO MEMORY
This is working good with a small .BIN file containing 3 strings of 10 bytes each:
with open('0x4.bin', 'rb') as f:
# Size 0 will read the ENTIRE file into memory!
m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) #File is open read-only
# Proceed with your code here -- note the file is already in memory
# so "readine" here will be as fast as could be
data = m.read(10) # using read(10) instead of readline()
while data:
do something!
Now the point: When using a much bigger .BIN file, it will take much more time to load the whole file into the memory and the while data: part starts immediately to work, so I would need here a function delay, so that the script only starts to work AFTER the file is completely loaded into the memory...
I have a tar.gz file downloaded from s3, I load it in memory and I want to add a folder and eventually write it into another s3.
I've been trying different approaches:
from io import BytesIO
import gzip
buffer = BytesIO(zip_obj.get()["Body"].read())
im_memory_tar = tarfile.open(buffer, mode='a')
The above rises the error: ReadError: invalid header .
With the below approach:
im_memory_tar = tarfile.open(fileobj=buffer, mode='a')
im_memory_tar.add(name='code_1', arcname='code')
The content seems to be overwritten.
Do you know a good solution to append a folder into a tar.gz file?
Thanks.
very well explained in question how-to-append-a-file-to-a-tar-file-use-python-tarfile-module
Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.
First we need to consider how to append to a tar file. Let's set aside the compression for a moment.
A tar file is terminated by two 512-byte blocks of all zeros. To add more entries, you need to remove or overwrite that 1024 bytes at the end. If you then append another tar file there, or start writing a new tar file there, you will have a single tar file with all of the entries of the original two.
Now we return to the tar.gz. You can simply decompress the entire .gz file, do that append as above, and then recompress the whole thing.
Avoiding the decompression and recompression is rather more difficult, since we'd have to somehow remove that last 1024 bytes of zeros from the end of the compressed stream. It is possible, but you would need some knowledge of the internals of a deflate compressed stream.
A deflate stream consists of a series of compressed data "blocks", which are each an arbitrary number of bits long. You would need to decompress, without writing out the result, until you get to the block containing the last 1024 bytes. You would need to save the decompressed result of that and any subsequent blocks, and at what bit in the stream that block started. Then you could recompress that data, sans the last 1024 bytes, starting at that byte.
Complete the compression, and write out the gzip trailer with the 1024 zeros removed from the CRC and length. (There is a way to back out zeros from the CRC.) Now you have a complete gzip stream for the previous .tar.gz file, but with the last 1024 bytes of zeros removed.
Since the concatenation of two gzip streams is itself a valid gzip stream, you can now concatenate the second .tar.gz file directly or start writing a new .tar.gz stream there. You now have a single, valid .tar.gz stream with the entries from the two original sources.
I am working with MNIST data in ML(for digit recognistion) and I want to convert my 'mnist.pkl' to 'mnist.pkl.gz' because the turtorial I am watching uses that extension.
also if possible please tell me what are those ..... that he has before the file name('.../data/mnist.pkl.gz', 'rb') if you are familiar with it Thank You
The extension .gz indicates that the file was compressed using gzip which you can do by invoking
gzip mnist.pkl
on the command line. The command will remove the original file and replace it with a compressed version named mnist.pkl.gz.
That said, you don't have to compress/decompress the file in your particular case. Just use
f = open('../data.mnist.pkl', 'rb')
instead of
f = gzip.open('../data.mnist.pkl.gz', 'rb')
I currently try to create a module that writes a *.gz file up to a specific size. I want to use it for a custom log handler to specify the maximum size of a zipped logfile. I already made my way through the gzip documentation and also the zlib documentation.
I could use zlib right away and measure the length of my compressed bytearray, but then I would have to create and write the gzip file header by myself. The zlib-documentaion itself says: For reading and writing .gz files see the gzip module..
But I do not see any option for getting the size of the compressed file in the gzip module.
the logfile opened via logfile = gzip.open("test.gz", "ab", compresslevel=6) does have a .size parameter, but this is the size of the original file, not the compressed file.
Also os.path.getsize("test.gz") is zero until logfile is closed and is actually written to the disk.
Do you have any idea how I can use the built-in gzip module to close a compressed file once it reached a certain size? Without closing and re-opening it all the time?
Or is this even possible?
Thanks for any help on this!
Update:
It is not true that no data is written to disk until the file is closed, it just takes some time to collect some kilobytes before the filesize changes. This is good enogh for me and my usecase, so this is solved. Thanks for any input!
My test code for this:
import os
import gzip
import time
data = 'Hello world'
limit = 10000
i = 0
logfile = gzip.open("test.gz", "wb", compresslevel=6)
while i < limit:
msg = f"{data} {str(i)} \n"
logfile.write(msg.encode("utf-8"))
print(os.path.getsize("test.gz"))
print(logfile.size)
if i > 1000:
logfile.flush()
break
#time.sleep(0.03)
i += 1
logfile.close()
print(f"final size of *.gz file: {os.path.getsize('test.gz')}")
print(f"final size of logfile object file: {logfile.size}")
gzip does not actually compress the file until after you close it so it does no really make sense to ask to know the size of the compressed file beforehand. One thing you could do is look at the size of compressed files you obtain on real data from your use case and do a linear regression to have some kind of approximation of the compression ratio.
I have a large number of compressed HDF files, which I need to read.
file1.HDF.gz
file2.HDF.gz
file3.HDF.gz
...
I can read in uncompressed HDF files with the following method
from pyhdf.SD import SD, SDC
import os
os.system('gunzip < file1.HDF.gz > file1.HDF')
HDF = SD('file1.HDF')
and repeat this for each file. However, this is more time consuming than I want.
I'm thinking its possible that most of the time overhang comes from writing the compressed file to a new uncompressed version, and that I could speed it up if I simply was able to read an uncompressed version of the file into the SD function in one step.
Am I correct in this thinking? And if so, is there a way to do what I want?
According to the pyhdf package documentation, this is not possible.
__init__(self, path, mode=1)
SD constructor. Initialize an SD interface on an HDF file,
creating the file if necessary.
There is no other way to instantiate an SD object that takes a file-like object. This is likely because they are conforming to an external interface (NCSA HDF). The HDF format also normally handles massive files that are impractical to store in memory at one time.
Unzipping it as a file is likely your most performant option.
If you would like to stay in Python, use the gzip module (docs):
import gzip
import shutil
with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
sascha is correct that hdf transparent compression is more adequate than gzipping, nonetheless if you can't control how the hdf files are stored you're looking for the gzip python modulue (docs) it can get the data from these files.