How to extract a specific file from the .tar archive in python?


I have created a .tar file on a Linux machine as follows:
tar cvf test.tar test_folder/
where the test_folder contains some files as shown below:
test_folder
|___ file1.jpg
|___ file2.jpg
|___ ...
I am unable to programmatically extract the individual files within the tar archive using Python. More specifically, I have tried the following:
import tarfile
with tarfile.open('test.tar', 'r:') as tar:
    img_file = tar.extractfile('test_folder/file1.jpg')
    # img_file contains the object: <ExFileObject name='test_folder/test.tar'>
Here, img_file does not seem to contain the requested image; instead it appears to contain the source .tar file. I am not sure where I am going wrong. Any suggestions would be really helpful. Thanks in advance.

You probably wanted to use the .extract() method instead of your .extractfile() method (see my other answer):
import tarfile
with tarfile.open('test.tar', 'r:') as tar:
    tar.extract('test_folder/file1.jpg')  # .extract() instead of .extractfile()
Notes:
Your extracted file will be in the (maybe newly created) folder test_folder under your current directory.
The .extract() method returns None, so there is no need to assign it (img_file = tar.extract(...))

Appending 2 lines to your code will solve your problem:
import tarfile
with tarfile.open('test.tar', 'r:') as tar:
    img_file = tar.extractfile('test_folder/file1.jpg')
    # --------------------- Add this ---------------------------
    with open("img_file.jpg", "wb") as outfile:
        outfile.write(img_file.read())
The explanation:
The .extractfile() method only gives you the content of the requested member (i.e. its data).
It doesn't extract any file to the file system.
So you have to do it yourself, by reading the returned content (img_file.read()) and writing it into a file of your choice (outfile.write(...)).
Or, to simplify your life, use the .extract() method instead. See my other answer.
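The difference between the two methods can be checked with a small self-contained sketch. It builds a throwaway archive in a temporary directory, reads one member with .extractfile(), and writes the same member to disk with .extract(). The names demo.tar, out and the helper build_demo_tar are made up for illustration; only the tarfile calls come from the answers above.

```python
import os
import tarfile
import tempfile


def build_demo_tar(workdir):
    """Create workdir/demo.tar containing test_folder/file1.jpg (hypothetical names)."""
    folder = os.path.join(workdir, "test_folder")
    os.makedirs(folder)
    with open(os.path.join(folder, "file1.jpg"), "wb") as f:
        f.write(b"not really a jpeg")
    tar_path = os.path.join(workdir, "demo.tar")
    with tarfile.open(tar_path, "w") as tar:
        tar.add(folder, arcname="test_folder")
    return tar_path


workdir = tempfile.mkdtemp()
tar_path = build_demo_tar(workdir)
out_dir = os.path.join(workdir, "out")

with tarfile.open(tar_path, "r:") as tar:
    # .extractfile() returns a file-like object; nothing is written to disk.
    member_data = tar.extractfile("test_folder/file1.jpg").read()
    # .extract() writes the member below out_dir and returns None.
    result = tar.extract("test_folder/file1.jpg", path=out_dir)

extracted_path = os.path.join(out_dir, "test_folder", "file1.jpg")
```

Note that the read happens while the archive is still open; the object returned by .extractfile() reads from the archive, so it should be consumed inside the with block.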

This is because extractfile() returns an io.BufferedReader object (a file-like view of the member's data) rather than writing anything to disk, so your variable holds that reader, not a file on your filesystem.
What you can do is extract the file with extract(), then open the extracted file in a separate context manager:
import tarfile
with tarfile.open('test.tar', 'r:') as tar:
    tar.extract('test_folder/file1.jpg')
with open('test_folder/file1.jpg', 'rb') as img:
    # do something with img. Here img is your image file
    data = img.read()

Related

Read RTF file using python

I am reading an RTF file using striprtf, but rtf_to_text is not able to read a URL. What changes do I need to make in the code?
Input
Get latest news update at abc#gmail.com
Output
Get latest news update at
Desired Output
Get latest news update at abc#gmail.com
Python code:
import os
from striprtf.striprtf import rtf_to_text
import pandas as pd
from os import path
path_of_the_directory= r'C:\Users\Documents\filename.rtf'
print("Files and directories in a specified path:")
for filename in os.listdir(path_of_the_directory):
    f = os.path.join(path_of_the_directory,filename)
    if os.path.isfile(f):
        print(f)
        open_rtf_file=open(f,'r')
        file_content_read=open_rtf_file.read()
        text_content=rtf_to_text(file_content_read)
        print(text_content)

It looks like you are treating a file as a directory. Your path_of_the_directory variable is actually the path to an RTF file. Without knowing what specific error you are getting at runtime, that looks to me like the problem. An easy way to fix it is to check that the path is a directory before calling os.listdir, as I do in the example below.
import os
from striprtf.striprtf import rtf_to_text
path_of_the_directory= r'C:\Users\Documents\filename.rtf' #<--- this is a file
print("Files and directories in a specified path:")
if os.path.isdir(path_of_the_directory): # check if path is directory
    for filename in os.listdir(path_of_the_directory):
        f = os.path.join(path_of_the_directory,filename)
        if os.path.isfile(f):
            print(f)
            open_rtf_file=open(f,'r')
            file_content_read=open_rtf_file.read()
            text_content=rtf_to_text(file_content_read)
            print(text_content)
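If the same variable may hold either a single file or a directory, one option is to normalize it into a list of file paths first. The following is only a sketch of that idea: the helper name rtf_paths and the demo files are made up, and the striprtf call is left out so that the example runs with the standard library alone.

```python
import os
import tempfile


def rtf_paths(path):
    """Return the .rtf files named by path, which may be a file or a directory."""
    if os.path.isdir(path):
        candidates = [os.path.join(path, name) for name in sorted(os.listdir(path))]
    else:
        candidates = [path]
    # Keep only existing regular files with an .rtf extension.
    return [p for p in candidates if os.path.isfile(p) and p.lower().endswith(".rtf")]


# Demo with throwaway files (hypothetical names).
workdir = tempfile.mkdtemp()
for name in ("a.rtf", "b.txt"):
    with open(os.path.join(workdir, name), "w") as f:
        f.write("{\\rtf1 demo}")

from_directory = rtf_paths(workdir)
from_file = rtf_paths(os.path.join(workdir, "a.rtf"))
```

Each returned path could then be opened and passed to rtf_to_text as in the question.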

Extract only jpg files from a .tar.gz file using python

Problem Summary:
In one of my folders I have a .tar.gz file, and I need to extract all the images (.jpg and .png) from it. However, I have to locate the archive by its .tar.gz extension (using the path to the directory) rather than the usual way of passing the input file name. I need this in one part of a GUI (Tkinter) for an image classification project.
Code I'm trying:
import os
import tarfile
def extractfile():
    os.chdir('GUI_Tkinter/PMC_downloads')
    with tarfile.open(os.path.join(os.environ['GUI_Tkinter/PMC_downloads'], f'Backup_{self.batch_id}.tar.gz'), "r:gz") as so:
        so.extractall(path=os.environ['GUI_Tkinter/PMC_downloads'])
The code is not giving any error, but it's not working. Please suggest how to do the same in some other way, locating the .tar.gz file by its extension.

I think you can use this code.
import tarfile
import os
t = tarfile.open('example.tar.gz', 'r')
for member in t.getmembers():
    if ".jpg" in member.name:
        t.extract(member, "outdir")
print(os.listdir('outdir'))
Hope this helps. Thanks.

Here is a generic way to extract from one or more .tar.gz files present in a folder without specifying the file names: the archives are found by their extension and the path (location) of the folder. You can extract any type of file (.pdf, .nxml, .xml, .gif, etc.) from the archive just by putting the required extension into the member-name check in this code. (The code below handles .tar.gz only; zip archives would need the zipfile module instead of tarfile.) I needed all the images from the .tar.gz file extracted into one folder, so in the code below I check for the extensions .jpg and .png and extract all the images into a folder named "Extracted_Images" under the current directory. If you want, you can change the directory the files are extracted to by changing the path argument,
for example "C:/Users/dell/project/histo_images" instead of "Extracted_Images".
import tarfile
import os
import glob
path = glob.glob("*.tar.gz")
for file in path:
    t = tarfile.open(file, 'r')
    for member in t.getmembers():
        if ".jpg" in member.name:
            t.extract(member, "Extracted_Images")
        elif ".png" in member.name:
            t.extract(member, "Extracted_Images")
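A substring test such as ".jpg" in member.name also matches names like notes.jpg.txt and misses upper-case extensions. A slightly stricter variant, sketched below, compares the real extension case-insensitively and skips anything that is not a regular file. The helper names, the extension set and the demo archive are assumptions made for illustration.

```python
import io
import os
import tarfile
import tempfile

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set; adjust as needed


def is_image_member(member):
    """True for regular files whose name ends in one of IMAGE_EXTENSIONS."""
    ext = os.path.splitext(member.name)[1].lower()
    return member.isfile() and ext in IMAGE_EXTENSIONS


def extract_images(archive_path, out_dir):
    """Extract only image members; return the list of extracted member names."""
    extracted = []
    with tarfile.open(archive_path, "r:*") as tar:  # "r:*" auto-detects compression
        for member in tar.getmembers():
            if is_image_member(member):
                tar.extract(member, out_dir)
                extracted.append(member.name)
    return extracted


# Demo with a throwaway archive (hypothetical member names).
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "demo.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    for name in ("pics/a.jpg", "pics/b.PNG", "pics/notes.jpg.txt"):
        payload = b"demo"
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

out_dir = os.path.join(workdir, "Extracted_Images")
extracted = extract_images(archive_path, out_dir)
```

To process every archive in a folder, extract_images could be called once per path returned by glob.glob("*.tar.gz"), as in the answer above.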

Extract Tar File inside Memory Filesystem

I have trouble using memoryfs:
https://docs.pyfilesystem.org/en/latest/reference/memoryfs.html
I'm trying to extract a tar archive inside a MemoryFS, but I can't use mem_fs as a path because it is an object, and I can't get the real / memory path...
from fs import open_fs, copy
import fs
import tarfile
mem_fs = open_fs('mem://')
print(mem_fs.isempty('.'))
fs.copy.copy_file('//TEST_FS', 'test.tar', mem_fs, 'test.tar')
print(mem_fs.listdir('/'))
with mem_fs.open('test.tar') as tar_file:
    print(tar_file.read())
    tar = tarfile.open(tar_file)  # I can't create the tar ...
    tar.extractall(mem_fs + 'Extract_Dir')  # Can't extract it either ...
Can someone help me? Is it possible to do that?

The first argument to tarfile.open is a filename. You're (a) passing it an open file object, and (b) even if you were to pass in a filename, tarfile doesn't know anything about your in-memory filesystem and so wouldn't be able to find the file.
Fortunately, tarfile.open has a fileobj argument that accepts an open file object, so you can write:
with mem_fs.open('test.tar', 'rb') as tar_file:
    tar = tarfile.open(fileobj=tar_file)
    tar.list()
Note that you need to open the file in binary mode (rb).
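The fileobj argument accepts any binary file-like object, which can be tried out without an in-memory filesystem library at all. The sketch below uses io.BytesIO as a stand-in for the file returned by mem_fs.open(); the member name docs/hello.txt is made up.

```python
import io
import tarfile

# Build a small tar archive entirely in memory (hypothetical member name).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo("docs/hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Rewind, then read the archive back through the fileobj argument.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    names = tar.getnames()
    content = tar.extractfile("docs/hello.txt").read()
```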
Of course, now you have a second problem: while you can open and read the archive, the tarfile module still doesn't know about your in-memory filesystem, so attempting to extract files will simply extract them to your local filesystem, which is probably not what you want.
To extract into your in-memory filesystem, you're going to need to read the data from the tar archive member and write it yourself. Here's one option for doing that:
import fs
import os
import pathlib
import tarfile
mem_fs = fs.open_fs('mem://')
fs.copy.copy_file('/', '{}/example.tar.gz'.format(os.getcwd()),
                  mem_fs, 'example.tar.gz')
with mem_fs.open('example.tar.gz', 'rb') as fd:
    tar = tarfile.open(fileobj=fd)
    # iterate over list of members
    for member in tar.getmembers():
        # if the member is a file
        if member.isfile():
            # create any necessary directories
            p = pathlib.Path(member.path)
            mem_fs.makedirs(str(p.parent), recreate=True)
            # open the archive member
            with mem_fs.open(member.path, 'wb') as memfd, \
                    tar.extractfile(member.path) as tarfd:
                # and write the data into the memory fs
                memfd.write(tarfd.read())
The tarfile.TarFile.extractfile method returns an open file object to a tar archive member, rather than extracting the file to disk.
Note that the above isn't an optimal solution if you're working with large files (since it reads the entire archive member into memory before writing it out).
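For large members, one possible alternative is to copy in chunks with shutil.copyfileobj instead of calling read() once. This is a sketch only: the helper copy_member and the 64 KiB chunk size are assumptions, and io.BytesIO stands in for the destination that mem_fs.open(..., 'wb') would return.

```python
import io
import shutil
import tarfile


def copy_member(tar, member, dest, chunk_size=64 * 1024):
    """Stream one regular-file member into dest (any writable binary file object)."""
    with tar.extractfile(member) as src:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(src, dest, chunk_size)


# Demo: an in-memory archive with one 200000-byte member (hypothetical name).
payload = b"x" * 200_000
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    info = tarfile.TarInfo("big.bin")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

archive.seek(0)
dest = io.BytesIO()  # stand-in for a file opened in the memory filesystem
with tarfile.open(fileobj=archive, mode="r") as tar:
    copy_member(tar, tar.getmember("big.bin"), dest)
```

Only one chunk of the member is held in memory at a time, at the cost of more read and write calls.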

Extracting all file names in python

I have an application that converts from one photo format to another when I enter the following in cmd.exe: "AppConverter.exe" "file.tiff" "file.jpeg"
But since I don't want to type this every time I want a photo converted, I would like a script that converts all files in the folder. So far I have this:
def start(self):
    for root, dirs, files in os.walk("C:\\Users\\x\\Desktop\\converter"):
        for file in files:
            if file.endswith(".tiff"):
                subprocess.run(['AppConverter.exe', '.tiff', '.jpeg'])
So how do I get the names of all the files and pass them to subprocess? I am thinking of taking the basename (no extension) of every file and appending .tiff and .jpeg to it, but I'm at a loss as to how to do it.

I think the fastest way would be to use the glob module to match the file names:
import glob
import subprocess
for file in glob.glob("*.tiff"):
    subprocess.run(['AppConverter.exe', file, file[:-5] + '.jpeg'])
    # file will be like 'test.tiff'
    # file[:-5] will be 'test' (we remove the last 5 characters, so '.tiff')
    # we add '.jpeg' to our extension-less string
All this information is in the post I linked in the comments on your original question.

You could try looking into os.path.splitext(). That allows you to split the file name into a tuple containing the basename and extension. That might help...
https://docs.python.org/3/library/os.path.html
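As a rough sketch of that suggestion, the output name can be built with os.path.splitext instead of slicing off a fixed number of characters. The helpers jpeg_name_for and build_commands are made-up names; AppConverter.exe and its argument order are taken from the question.

```python
import os


def jpeg_name_for(tiff_path):
    """Return tiff_path with its extension replaced by .jpeg."""
    base, _ext = os.path.splitext(tiff_path)
    return base + ".jpeg"


def build_commands(filenames, converter="AppConverter.exe"):
    """Build one argument list per .tiff file, ready for subprocess.run()."""
    return [
        [converter, name, jpeg_name_for(name)]
        for name in filenames
        if name.lower().endswith(".tiff")
    ]


commands = build_commands(["a.tiff", "b.txt", "my.holiday.tiff"])
```

Each entry of commands could then be passed to subprocess.run(). Only the last extension is replaced, so dots earlier in the file name are preserved.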

Python. Container file for different multimedia

This is my problem.
I need to combine text, picture and video (any codec) into one file.
I know there are binary files. How would I go about packaging and reading the file?
For example, in one file I store the text, then the png and then the video.
In another Python file I extract the files again and display as I please.
Regards,
Renier Engelbrecht

You could use the zipfile module - it creates a single file from arbitrary components.
Sample usage (Python 3):
import zipfile
# Write zip file
with zipfile.ZipFile("combined_file.zip", mode='w', compression=zipfile.ZIP_STORED) as archive:
    archive.write("file_1.ext")
    archive.write("file_2.ext")
# Extract contents later
with zipfile.ZipFile("combined_file.zip", mode='r') as archive:
    archive.extractall()
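If the reading program only needs the data and not files on disk, the members can also be read straight out of the archive. The sketch below assumes made-up member names and placeholder contents, and writes the archive to a temporary directory so that it is self-contained.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "combined_file.zip")

# Pack a text and a (placeholder) image into one container file.
with zipfile.ZipFile(archive_path, mode="w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("text/notes.txt", "some text")
    archive.writestr("images/picture.png", b"\x89PNG placeholder bytes")

# Later, possibly in another program: list and read members without extracting.
with zipfile.ZipFile(archive_path, mode="r") as archive:
    names = archive.namelist()
    notes = archive.read("text/notes.txt").decode("utf-8")
    picture_bytes = archive.read("images/picture.png")
```

ZIP_STORED skips compression, which is usually reasonable for png and video data that is already compressed.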
