I have a tar.gz file downloaded from S3. I load it into memory, want to add a folder, and eventually write the result to another S3 bucket.
I've been trying different approaches:
from io import BytesIO
import gzip
import tarfile
buffer = BytesIO(zip_obj.get()["Body"].read())
im_memory_tar = tarfile.open(buffer, mode='a')
The above raises the error: ReadError: invalid header.
With the below approach:
im_memory_tar = tarfile.open(fileobj=buffer, mode='a')
im_memory_tar.add(name='code_1', arcname='code')
The content seems to be overwritten.
Do you know a good solution to append a folder into a tar.gz file?
Thanks.
This is very well explained in the question how-to-append-a-file-to-a-tar-file-use-python-tarfile-module.
Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.
First we need to consider how to append to a tar file. Let's set aside the compression for a moment.
A tar file is terminated by two 512-byte blocks of all zeros. To add more entries, you need to remove or overwrite that 1024 bytes at the end. If you then append another tar file there, or start writing a new tar file there, you will have a single tar file with all of the entries of the original two.
Now we return to the tar.gz. You can simply decompress the entire .gz file, do that append as above, and then recompress the whole thing.
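For illustration, here is a minimal sketch of that route for the in-memory S3 case from the question. It assumes the archive fits comfortably in memory; zip_obj, s3, and the bucket/key names are placeholders for the boto3 objects, not something from the original post.
import gzip
import io
import tarfile

# 1. download and decompress the .tar.gz into a plain tar held in memory
raw = zip_obj.get()["Body"].read()              # gzip-compressed tar from S3
plain_tar = io.BytesIO(gzip.decompress(raw))

# 2. append to the uncompressed tar; mode 'a' scans to the end-of-archive
#    marker and overwrites those 1024 zero bytes with the new entries
with tarfile.open(fileobj=plain_tar, mode='a') as tar:
    tar.add(name='code_1', arcname='code')

# 3. recompress the whole thing and upload the result
s3.put_object(Bucket='target-bucket', Key='archive.tar.gz',
              Body=gzip.compress(plain_tar.getvalue()))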
Avoiding the decompression and recompression is rather more difficult, since we'd have to somehow remove that last 1024 bytes of zeros from the end of the compressed stream. It is possible, but you would need some knowledge of the internals of a deflate compressed stream.
A deflate stream consists of a series of compressed data "blocks", which are each an arbitrary number of bits long. You would need to decompress, without writing out the result, until you get to the block containing the last 1024 bytes. You would need to save the decompressed result of that and any subsequent blocks, and at what bit in the stream that block started. Then you could recompress that data, sans the last 1024 bytes, starting at that byte.
Complete the compression, and write out the gzip trailer with the 1024 zeros removed from the CRC and length. (There is a way to back out zeros from the CRC.) Now you have a complete gzip stream for the previous .tar.gz file, but with the last 1024 bytes of zeros removed.
Since the concatenation of two gzip streams is itself a valid gzip stream, you can now concatenate the second .tar.gz file directly or start writing a new .tar.gz stream there. You now have a single, valid .tar.gz stream with the entries from the two original sources.
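As a quick illustration of that concatenation property (not part of the original answer): two gzip members written back to back read back as one continuous stream.
import gzip

with open('combined.gz', 'wb') as out:
    out.write(gzip.compress(b'first part, '))
    out.write(gzip.compress(b'second part'))

with gzip.open('combined.gz', 'rb') as f:
    print(f.read())   # b'first part, second part'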
I am currently trying to create a module that writes a *.gz file up to a specific size. I want to use it for a custom log handler to specify the maximum size of a zipped logfile. I have already made my way through the gzip documentation and also the zlib documentation.
I could use zlib right away and measure the length of my compressed bytearray, but then I would have to create and write the gzip file header myself. The zlib documentation itself says: "For reading and writing .gz files see the gzip module."
But I do not see any option for getting the size of the compressed file in the gzip module.
The logfile opened via logfile = gzip.open("test.gz", "ab", compresslevel=6) does have a .size attribute, but this is the size of the original (uncompressed) data, not the compressed file.
Also, os.path.getsize("test.gz") is zero until logfile is closed and the data is actually written to disk.
Do you have any idea how I can use the built-in gzip module to close a compressed file once it reached a certain size? Without closing and re-opening it all the time?
Or is this even possible?
Thanks for any help on this!
Update:
It is not true that no data is written to disk until the file is closed; it just takes some time to collect a few kilobytes before the file size changes. This is good enough for me and my use case, so this is solved. Thanks for any input!
My test code for this:
import os
import gzip
import time

data = 'Hello world'
limit = 10000
i = 0
logfile = gzip.open("test.gz", "wb", compresslevel=6)
while i < limit:
    msg = f"{data} {str(i)} \n"
    logfile.write(msg.encode("utf-8"))
    print(os.path.getsize("test.gz"))
    print(logfile.size)
    if i > 1000:
        logfile.flush()
        break
    #time.sleep(0.03)
    i += 1
logfile.close()
print(f"final size of *.gz file: {os.path.getsize('test.gz')}")
print(f"final size of logfile object file: {logfile.size}")
gzip buffers the data you write and only flushes compressed output to disk periodically (and completely on close), so it does not really make sense to ask for the exact size of the compressed file ahead of time. One thing you could do is look at the size of compressed files you obtain on real data from your use case and do a linear regression to get an approximation of the compression ratio.
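For reference, a minimal sketch of the approach the question's update describes: flush the gzip object periodically and check the on-disk size, stopping (or rotating) once it passes a limit. The function name and the 100 KiB limit are illustrative, not from the original post, and flushing after every line costs some compression.
import gzip
import os

MAX_COMPRESSED_SIZE = 100 * 1024  # assumed limit of 100 KiB

def write_with_limit(lines, path):
    logfile = gzip.open(path, "wb", compresslevel=6)
    for line in lines:
        logfile.write(line.encode("utf-8"))
        logfile.flush()                      # push buffered, compressed data to disk
        if os.path.getsize(path) >= MAX_COMPRESSED_SIZE:
            break                            # or rotate to a new file here
    logfile.close()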
The npy files are around 5 GB each and my RAM is around 5 GB, so I cannot load both numpy arrays at once. How can I load one npy file and append its rows to the other npy file without loading that one?
An npy file is a header containing the data type (metadata) and shape, followed by the data itself.
The header ends with a '\n' (newline) character. So, open your first file in append mode, then open the second file in read mode, skip the header by readline(), then copy chunks (using read(size)) from the second file to the first.
There is only one thing left: to update the shape (length) field in the header. And here it gets a bit tricky, because if the two files had for example the shapes (700,) and (400,), the new shape needs to be (1300,) but you may not have space in the header for it. This depends on how many pad characters were in the original header--sometimes you will have space and sometimes you won't. If there is no space, you will need to write a new header into a new file and then copy the data from both source files. Still, this won't take much memory or time, just a bit of extra disk space.
You can see the code which reads and writes npy files here: https://github.com/numpy/numpy/blob/master/numpy/lib/format.py - there are some undocumented functions you may find useful in your quest.
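A hedged sketch of that fallback route (write a fresh header, then stream the data from both files) using the helpers in numpy.lib.format; it assumes C-ordered arrays with matching dtypes and trailing dimensions, and version 1.0 headers. The function name and chunk size are illustrative only.
from numpy.lib import format as npy_format

def concat_npy(path_a, path_b, path_out, chunk=64 * 1024 * 1024):
    # Stream the rows of path_b after those of path_a into path_out,
    # so neither array is ever fully loaded into memory.
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        npy_format.read_magic(fa)                              # consume magic + version
        shape_a, fortran_a, dtype_a = npy_format.read_array_header_1_0(fa)
        npy_format.read_magic(fb)
        shape_b, fortran_b, dtype_b = npy_format.read_array_header_1_0(fb)
        if dtype_a != dtype_b or shape_a[1:] != shape_b[1:] or fortran_a or fortran_b:
            raise ValueError("arrays must share dtype and trailing shape, C order only")
        new_shape = (shape_a[0] + shape_b[0],) + shape_a[1:]
        with open(path_out, 'wb') as fo:
            npy_format.write_array_header_1_0(fo, {
                'shape': new_shape,
                'fortran_order': False,
                'descr': npy_format.dtype_to_descr(dtype_a),
            })
            for src in (fa, fb):   # both file pointers now sit just past their headers
                while True:
                    buf = src.read(chunk)
                    if not buf:
                        break
                    fo.write(buf)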
I am dealing with a somewhat large binary file (717M). This binary file contains a set (unknown number!) of complete zip files.
I would like to extract all of those zip files (no need to explicitly decompress them). I am able to find the offset (start point) of each chunk thanks to the magic number ('PK'), but I fail to find a way to compute the length of each chunk (e.g. to carve those zip files out of the large binary file).
Reading some documentation (http://forensicswiki.org/wiki/ZIP), gives me the impression it is easy to parse a zip file since it contains the compressed size of each compressed file.
Is there a way for me to do that in C or Python without reinventing the wheel?
A zip entry is permitted to not contain the compressed size in the local header. There is a flag bit to have a descriptor with the compressed size, uncompressed size, and CRC follow the compressed data.
It would be more reliable to search for end-of-central-directory headers, use that to find the central directories, and use that to find the local headers and entries. This will require attention to detail, very carefully reading the PKWare appnote that describes the zip format. You will need to handle the Zip64 format as well, which has additional headers and fields.
It is possible for a zip entry to be stored, i.e. copied verbatim into that location in the zip file, and it is possible for that entry to itself be a zip file. So make sure that you handle the case of embedded zip files, extracting only the outermost zip files.
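A minimal sketch of that end-of-central-directory approach, assuming plain (non-Zip64) archives with correct offsets and the whole blob read into memory; a production version would need the extra care described above.
import struct

EOCD_SIG = b'PK\x05\x06'

def carve_zips(blob):
    # Yield (start, end) byte ranges of complete zip files inside blob.
    pos = 0
    while True:
        pos = blob.find(EOCD_SIG, pos)
        if pos == -1:
            break
        # EOCD record: 22 fixed bytes; central directory size at +12,
        # central directory offset at +16, comment length at +20.
        cd_size, cd_offset, comment_len = struct.unpack('<IIH', blob[pos + 12:pos + 22])
        end = pos + 22 + comment_len
        start = pos - cd_size - cd_offset       # where this archive should begin
        if start >= 0 and blob[start:start + 4] == b'PK\x03\x04':
            yield start, end
        pos += 4
Each carved archive is then simply blob[start:end].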
There are some standard ways to handle zip files in Python, for example, but as far as I know (not that I'm an expert) you first need to supply the actual file somehow. I suggest looking at the zip file format specification.
You should be able to find the other fields you need based on their position relative to the magic number. The magic number is the local file header signature; the CRC-32 starts 14 bytes after it, the compressed size 18 bytes after it, and the file name follows the 30-byte fixed header:
local file header signature 4 bytes (0x04034b50)
version needed to extract 2 bytes
general purpose bit flag 2 bytes
compression method 2 bytes
last mod file time 2 bytes
last mod file date 2 bytes
crc-32 4 bytes
compressed size 4 bytes
uncompressed size 4 bytes
file name length 2 bytes
extra field length 2 bytes
file name (variable size)
extra field (variable size)
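Putting the layout above into code, a hedged sketch of parsing one local file header at a known offset; it does not handle Zip64 or the case where bit 3 of the general purpose flag defers the sizes to a data descriptor, as mentioned in the other answer.
import struct

def read_local_header(blob, off):
    # Unpack the 30-byte fixed part of the local file header at offset off.
    (sig, version, flags, method, mtime, mdate, crc32,
     comp_size, uncomp_size, name_len, extra_len) = struct.unpack(
        '<IHHHHHIIIHH', blob[off:off + 30])
    if sig != 0x04034b50:
        raise ValueError("not a local file header")
    name = blob[off + 30:off + 30 + name_len].decode('cp437', 'replace')
    data_start = off + 30 + name_len + extra_len
    return name, data_start, comp_size
The entry's compressed data then runs from data_start to data_start + comp_size, which gives the chunk length the question asks about (when the sizes are actually present in the local header).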
Hope that helps a little bit at least :)
I have a simple server on my Windows PC written in python that reads files from a directory and then sends the file to the client via TCP.
Files like HTML and Javascript are received by the client correctly (sent and original file match).
The issue is that image data is truncated.
Oddly, different images are truncated at different lengths, but it's consistent per image.
For example, a specific 1MB JPG is always received as 95 bytes. Another image which should be 7KB, is received as 120 bytes.
Opening the truncated image files in notepad++, the data that is there is correct. (The only issue is that the file ends too soon).
I do not see a pattern for where the files end. The chars/bytes immediately before and after truncation are different per image.
I've tried three different ways for the server to read the files, but they all have the same result.
Here is a snippet of the reading and sending of files:
print ("Cache size=" + str(os.stat(filename).st_size))
#1st attempt, using readlines
fileobj = open(filename, "r")
cacheBuffer = fileobj.readlines()
for i in range(0, len(cacheBuffer)):
tcpCliSock.send(cacheBuffer[i])
#2nd attempt, using line, same result
with open(filename) as f:
for line in f:
tcpCliSock.send(f)
#3rd attempt, using f.read(), same result
with open(filename) as f:
tcpCliSock.send(f.read())
The script prints to the console the size of the file read, and the number of bytes matches the original image. So this proves the problem is in sending, right?
If the issue is with sending, what can I change to have the whole image sent properly?
Since you're dealing with images, which are binary files, you need to open the files in binary mode.
open(filename, 'rb')
From the Python documentation for open():
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.)
Since your server is running on Windows, as you read the file, Python is converting every \r\n it sees to \n. For text files, this is nice: You can write platform-independent code that only deals with \n characters. For binary files, this completely corrupts your data. That's why it's important to use 'b' when dealing with binary files, but also important to leave it off when dealing with text files.
Also, as TCP is a stream protocol, it's better to stream the data into the socket in smaller pieces. This avoids the need to read an entire file into memory, which will keep your memory usage down. Like this:
with open(filename, 'rb') as f:
    while True:
        data = f.read(4096)
        if len(data) == 0:
            break
        tcpCliSock.send(data)
Is it possible to append to a gzipped text file on the fly using Python?
Basically I am doing this:-
import gzip
content = "Lots of content here"
f = gzip.open('file.txt.gz', 'a', 9)
f.write(content.encode("utf-8"))
f.close()
A line is appended (note "appended") to the file every 6 seconds or so, but the resulting file is just as big as a standard uncompressed file (roughly 1MB when done).
Explicitly specifying the compression level does not seem to make a difference either.
If I gzip an existing uncompressed file afterwards, its size comes down to roughly 80 KB.
I'm guessing it's not possible to "append" to a gzip file on the fly and have it compress?
Is this a case of writing to a StringIO buffer and then flushing to a gzip file when done?
That works in the sense of creating and maintaining a valid gzip file, since the gzip format permits concatenated gzip streams.
However it doesn't work in the sense that you get lousy compression, since you are giving each instance of gzip compression so little data to work with. Compression depends on taking advantage of the history of previous data, but here gzip has been given essentially none.
You could either a) accumulate at least a few K of data, many of your lines, before invoking gzip to add another gzip stream to the file, or b) do something much more sophisticated that appends to a single gzip stream, leaving a valid gzip stream each time and permitting efficient compression of the data.
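A minimal sketch of option a): buffer lines in memory and append a whole gzip member only once a few kilobytes have accumulated. The class name and threshold are illustrative, not from the original answer.
import gzip

class BufferedGzipAppender:
    def __init__(self, path, flush_threshold=16 * 1024):
        self.path = path
        self.flush_threshold = flush_threshold
        self.pending = []
        self.pending_size = 0

    def write(self, line):
        data = line.encode("utf-8")
        self.pending.append(data)
        self.pending_size += len(data)
        if self.pending_size >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # Each flush appends one more gzip member; concatenated members
        # still form a single valid gzip file.
        with gzip.open(self.path, "ab", compresslevel=9) as f:
            f.write(b"".join(self.pending))
        self.pending = []
        self.pending_size = 0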
You can find an example of b), written in C, in gzlog.h and gzlog.c. I do not believe that Python has all of the interfaces to zlib needed to implement gzlog directly in Python, but you could interface to the C code from Python.