Extract Tar File inside Memory Filesystem - python

I have trouble using MemoryFS (https://docs.pyfilesystem.org/en/latest/reference/memoryfs.html).
I'm trying to extract a tar archive inside a MemoryFS, but I can't pass mem_fs to tarfile because it is an object, and I can't get a real filesystem path for the in-memory files...
from fs import open_fs, copy
import fs
import tarfile

mem_fs = open_fs('mem://')
print(mem_fs.isempty('.'))
fs.copy.copy_file('//TEST_FS', 'test.tar', mem_fs, 'test.tar')
print(mem_fs.listdir('/'))

with mem_fs.open('test.tar') as tar_file:
    print(tar_file.read())
    tar = tarfile.open(tar_file)  # I can't create the tar ...
    tar.extractall(mem_fs + 'Extract_Dir')  # Can't extract it either ...
Can someone help me? Is it possible to do that?

The first argument to tarfile.open is a filename. You're (a) passing it an open file object, and (b) even if you were to pass in a filename, tarfile doesn't know anything about your in-memory filesystem and so wouldn't be able to find the file.
Fortunately, tarfile.open has a fileobj argument that accepts an open file object, so you can write:
with mem_fs.open('test.tar', 'rb') as tar_file:
    tar = tarfile.open(fileobj=tar_file)
    tar.list()
Note that you need to open the file in binary mode (rb).
Of course, now you have a second problem: while you can open and read the archive, the tarfile module still doesn't know about your in-memory filesystem, so attempting to extract files will simply extract them to your local filesystem, which is probably not what you want.
To extract into your in-memory filesystem, you're going to need to read the data from the tar archive member and write it yourself. Here's one option for doing that:
import fs
import fs.copy
import os
import pathlib
import tarfile

mem_fs = fs.open_fs('mem://')

fs.copy.copy_file('/', '{}/example.tar.gz'.format(os.getcwd()),
                  mem_fs, 'example.tar.gz')

with mem_fs.open('example.tar.gz', 'rb') as fd:
    tar = tarfile.open(fileobj=fd)

    # iterate over the list of members
    for member in tar.getmembers():
        # only handle regular files
        if member.isfile():
            # create any necessary directories
            p = pathlib.Path(member.path)
            mem_fs.makedirs(str(p.parent), recreate=True)

            # open the archive member...
            with mem_fs.open(member.path, 'wb') as memfd, \
                    tar.extractfile(member.path) as tarfd:
                # ...and write its data into the memory fs
                memfd.write(tarfd.read())
The tarfile.TarFile.extractfile method returns an open file object to a tar archive member, rather than extracting the file to disk.
Note that the above isn't an optimal solution if you're working with large files (since it reads the entire archive member into memory before writing it out).
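If memory is a concern, one alternative (a sketch, not tested) is to replace the read()/write() pair with shutil.copyfileobj, which copies in fixed-size chunks so the whole member never has to sit in memory at once; this reuses the tar, member, and mem_fs names from the snippet above:

import shutil

# stream the member in 64 KiB chunks instead of one big read()
with mem_fs.open(member.path, 'wb') as memfd, \
        tar.extractfile(member.path) as tarfd:
    shutil.copyfileobj(tarfd, memfd, length=64 * 1024)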

Related

Uncompress tar.bz2 from s3 and move the files back to s3 from Python

Suppose I have a bzip2-compressed tar archive x.tar.bz2 stored in S3. I would like to decompress it and place the result back in S3. This can be achieved by:
from s3fs import S3FileSystem
import tarfile

fs = S3FileSystem()

with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    r = f.read()

with open('x.tar.bz2', mode='wb') as localfile:
    localfile.write(r)

tar = tarfile.open('x.tar.bz2', "r:bz2")
tar.extractall(path='extraction/path')
tar.close()

fs.put('extraction/path', f's3://path_to_destination/x', recursive=True)
With the solution above, I am saving the file content twice to my local disk. I have the following questions (a Python solution is expected):
Is it possible (using the tarfile module) to load data directly from S3 and also extract it there, avoiding storing data on the local drive?
Is it possible to do this job in streaming mode, without needing to have the whole x.tar.bz2 (or at least the uncompressed archive x.tar) file in memory?
tarfile.open accepts a file-like object as the fileobj argument, so you can pass to it the file object you get from S3FileSystem.open. You can then iterate through the TarInfo objects in the tar object, and open the corresponding path in S3 for writing:
with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r:bz2') as tar:
        for entry in tar:
            with fs.open(f'path_to_destination/{entry.name}', mode='wb') as writer:
                writer.write(tar.extractfile(entry).read())
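As for the second (streaming) question: tarfile also accepts pipe modes such as 'r|bz2', which read the archive strictly forward without seeking, so the whole x.tar.bz2 never has to be held in memory. A hedged, untested sketch combining that with chunked copies (paths as in the question):

import shutil
import tarfile
from s3fs import S3FileSystem

fs = S3FileSystem()

with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    # 'r|bz2' = streaming mode: members must be processed in order
    with tarfile.open(fileobj=f, mode='r|bz2') as tar:
        for entry in tar:
            if not entry.isfile():
                continue
            with fs.open(f'path_to_destination/{entry.name}', mode='wb') as writer:
                # copy each member in chunks instead of one big read()
                shutil.copyfileobj(tar.extractfile(entry), writer)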

how to gzip files in tmp folder

Using an AWS Lambda function, I download an S3 zipped file and unzip it.
For now I do it using extractall. Upon unzipping, all files are saved in the tmp/ folder.
s3.download_file('test', '10000838.zip', '/tmp/10000838.zip')
with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    lstNEW = list(filter(lambda x: not x.startswith("__MACOSX/"), zip_ref.namelist()))
    zip_ref.extractall('/tmp/', members=lstNEW)
After unzipping, I want to gzip files and place them in another S3 bucket.
Now, how can I read all the files from the tmp folder again and gzip each one, so that each file ends up as $item.csv.gz?
I see this (https://docs.python.org/3/library/gzip.html) but I am not sure which function should be used.
If it's the compress function, how exactly do I use it? I read in this answer (gzip a file in Python) that I can use gzip.open('', 'wb') to gzip a file, but I couldn't figure out how to use it in my case. In the open function, do I specify the target location or the source location? Where do I save the gzipped files so that I can later upload them to S3?
Alternative Option:
Instead of loading everything into the tmp folder, I read that I can also open an output stream, wrap the output stream in a gzip wrapper, and then copy from one stream to the other
with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    testList = []
    for i in zip_ref.namelist():
        if not i.startswith("__MACOSX/"):
            testList.append(i)
    for i in testList:
        zip_ref.open(i, 'r')
but then again I am not sure how to continue in the for loop and open the stream and convert files there
Depending on the sizes of the files, I would skip writing the .gz file(s) to disk. Perhaps something based on s3fs | boto and gzip.
import contextlib
import gzip
import s3fs

AWS_S3 = s3fs.S3FileSystem(anon=False)  # AWS env must be set up correctly

source_file_path = "/tmp/your_file.txt"
s3_file_path = "my-bucket/your_file.txt.gz"

with contextlib.ExitStack() as stack:
    source_file = stack.enter_context(open(source_file_path, mode="rb"))
    destination_file = stack.enter_context(AWS_S3.open(s3_file_path, mode="wb"))
    destination_file_gz = stack.enter_context(gzip.GzipFile(fileobj=destination_file))
    while True:
        chunk = source_file.read(1024)
        if not chunk:
            break
        destination_file_gz.write(chunk)
Note: I have not tested this so if it does not work, let me know.
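To cover the "gzip every file in /tmp" part of the question: gzip names the compressed side, so each plain file is the source you copy from. A hedged sketch along the same lines as above (untested; the my-bucket name and *.csv filter are placeholders):

import gzip
import pathlib
import shutil
import s3fs

AWS_S3 = s3fs.S3FileSystem(anon=False)

for path in pathlib.Path('/tmp').glob('*.csv'):
    with open(path, 'rb') as source, \
            AWS_S3.open(f'my-bucket/{path.name}.gz', 'wb') as destination, \
            gzip.GzipFile(fileobj=destination, mode='wb') as gz:
        # stream each file through the gzip wrapper straight into S3
        shutil.copyfileobj(source, gz)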

How to extract a specific file from the .tar archive in python?

I have created a .tar file on a Linux machine as follows:
tar cvf test.tar test_folder/
where the test_folder contains some files as shown below:
test_folder
|___ file1.jpg
|___ file2.jpg
|___ ...
I am unable to programmatically extract the individual files within the tar archive using Python. More specifically, I have tried the following:
import tarfile
with tarfile.open('test.tar', 'r:') as tar:
    img_file = tar.extractfile('test_folder/file1.jpg')
    # img_file contains the object: <ExFileObject name='test_folder/test.tar'>
Here, the img_file does not seem to contain the requested image, but rather it contains the source .tar file. I am not sure, where I am messing things up. Any suggestions would be really helpful. Thanks in advance.
You probably wanted to use the .extract() method instead of your .extractfile() method (see my other answer):
import tarfile
with tarfile.open('test.tar', 'r:') as tar:
    tar.extract('test_folder/file1.jpg')  # .extract() instead of .extractfile()
Notes:
Your extracted file will be in the (maybe newly created) folder test_folder under your current directory.
The .extract() method returns None, so there is no need to assign it (img_file = tar.extract(...))
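As a side note, .extract() also takes a path argument if you'd rather not extract into the current directory; a small sketch with a hypothetical target folder:

import tarfile

with tarfile.open('test.tar', 'r:') as tar:
    # extract under /tmp/out instead of the current directory
    tar.extract('test_folder/file1.jpg', path='/tmp/out')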
Appending 2 lines to your code will solve your problem:
import tarfile
with tarfile.open('test.tar', 'r:') as tar:
    img_file = tar.extractfile('test_folder/file1.jpg')
    # --------------------- Add this ---------------------------
    with open("img_file.jpg", "wb") as outfile:
        outfile.write(img_file.read())
The explanation:
The .extractfile() method only provides you the content of the extracted file (i.e. its data). It doesn't extract any file to the file system.
So you have to do it yourself: read the returned content (img_file.read()) and write it into a file of your choice (outfile.write(...)).
Or — to simplify your life — use the .extract() method instead. See my other answer.
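And if you never need the file on disk at all, the object returned by .extractfile() can be consumed directly; a minimal sketch:

import tarfile

with tarfile.open('test.tar', 'r:') as tar:
    img_file = tar.extractfile('test_folder/file1.jpg')
    data = img_file.read()  # raw JPEG bytes; nothing is written to disk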
This is because extractfile() returns a buffered reader (an io.BufferedReader subclass) over the member's data; it doesn't write anything to your directory, so your variable holds the reader rather than an extracted file.
What you can do is extract the file, then open it in a different context manager:
import tarfile
with tarfile.open('test.tar', 'r:') as tar:
    tar.extract('test_folder/file1.jpg')
with open('test_folder/file1.jpg', 'rb') as img:
    # do something with img. Here img is your img file
    ...

How to read gz compressed files from tar

Let's say we have a tar file which in turn contains multiple gzip-compressed files. I want to be able to read the contents of those gzip files without extracting either the tar file or the individual gzip files to disk. I'm trying to use the tarfile module in Python.
This might work, I haven't tested it, but this has the main ideas and related tools. It iterates over the files in the tar, and if they are gzipped, reads them into the file_contents variable:
import tarfile as t
import gzip as g

tar = t.open("your.gz.tar")
for member in tar.getmembers():
    fo = tar.extractfile(member)
    file_contents = g.GzipFile(fileobj=fo).read()
note: if the file is too large for memory, then consider looking into a streamed reader (chunk by chunk) as linked.
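A hedged sketch of what such a chunked reader might look like (untested; the 64 KiB chunk size is arbitrary):

import gzip
import tarfile

with tarfile.open("your.gz.tar") as tar:
    for member in tar.getmembers():
        fo = tar.extractfile(member)
        if fo is None:  # skip directories and other non-file members
            continue
        with gzip.GzipFile(fileobj=fo) as gz:
            while True:
                chunk = gz.read(64 * 1024)
                if not chunk:
                    break
                # process chunk here instead of holding the whole file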
If you have additional logic based on what the member (TarInfo) object looks like you can use these:
https://docs.python.org/2/library/tarfile.html#tarinfo-objects
see:
How can I decompress a gzip stream with zlib?
Python decompressing gzip chunk-by-chunk
reading tar file contents without untarring it, in python script

Python: Untar a single folder from a tarball

Given a tarball containing multiple directories, how do I extract just a single, specific directory?
import tarfile
tar = tarfile.open("/path/to/tarfile.tar.gz")
tar.list()
... rootdir/subdir_1/file_1.ext
... rootdir/subdir_1/file_n.ext
... rootdir/subdir_2/file_1.ext
etc.
How would I extract just the files from subdir_2?
NOTE: The entire operation is being done in memory a la...
import tarfile, urllib2, StringIO
data = urllib2.urlopen(url)
tar = tarfile.open(mode = 'r|*', fileobj = StringIO.StringIO(data.read()))
... so it's not feasible to extract all to disk and move the necessary folder.
You seem to be almost there - I think you can just use the contents of getnames() and combine it with extractfile() to process the files in memory, e.g.:
files = (name for name in tar.getnames() if name.startswith('rootdir/subdir_2/'))
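Since the archive was opened in stream mode ('r|*'), it may be safer to iterate the members in order rather than calling getnames() first; a hedged sketch that collects just the subdir_2 files into a dict in memory, reusing the tar object from the question:

contents = {}
for member in tar:
    if member.isfile() and member.name.startswith('rootdir/subdir_2/'):
        # member data is read straight from the stream; nothing touches disk
        contents[member.name] = tar.extractfile(member).read()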
