Extracting compressed files - python

The following code allows me to extract .tgz files. However, it stops extracting after about two levels down; there are other subfolders that have .tgz files that need extracting. Additionally, when I extract a file, I have to manually move it to another path or it will get overwritten by other .tgz files that I extract to that location (all .tgz that I'm using have the same file structure/folder names once extracted). Any help is appreciated. Thanks!
import os, sys, tarfile

def extract(tar_url, extract_path='.'):
    print tar_url
    tar = tarfile.open(tar_url, 'r')
    for item in tar:
        tar.extract(item, extract_path)
        if item.name.find(".tgz") != -1 or item.name.find(".tar") != -1:
            extract(item.name, "./" + item.name[:item.name.rfind('/')])
try:
    extract(sys.argv[1] + '.tgz')
    print 'Done.'
except:
    name = os.path.basename(sys.argv[0])
    print name[:name.rfind('.')], '<filename>'

If I have understood your question correctly, this is what you want to do:
1. Extract a .tgz file that may contain more .tgz files, which in turn need extracting (and so on).
2. While extracting, take care not to replace a directory that already exists in the folder.
Here is what my code does:
1. Extracts every .tgz file (recursively) into a separate folder with the same name as the .tgz file (without its extension), in the same directory.
2. While extracting, makes sure that it does not overwrite or replace any existing files or folders.
So if this is the directory structure of the .tgz file -
parent/
    xyz.tgz/
        a
        b
        c
        d.tgz/
            x
            y
            z
        a.tgz/ # note if I extract this directly, it will replace/overwrite contents of the folder 'a'
            m
            n
            o
            p
After extraction, the directory structure will be -
parent/
    xyz.tgz
    xyz/
        a
        b
        c
        d/
            x
            y
            z
        a 1/ # it extracts 'a.tgz' to the folder 'a 1' as folder 'a' already exists in the same folder.
            m
            n
            o
            p
Although the code below is documented in detail, here is a brief outline of the program. These are the functions I have defined:
FileExtension --> returns the extension of a file
AppropriateFolderName --> prevents overwriting or replacing existing folders (you will see how in the program)
Extract --> extracts a .tgz file (safely)
WalkTreeAndExtract --> walks down a directory (passed as a parameter) and extracts all .tgz files (recursively) on the way down
I cannot suggest changes to your code, as my approach is a bit different. I have used the extractall method of the tarfile module instead of the slightly more complicated extract method that you used. (Have a glance at http://docs.python.org/library/tarfile.html#tarfile.TarFile.extractall and read the warning associated with the extractall method. I don't think we will run into that problem in general, but keep it in mind.)
So here is the code that worked for me.
(I tried it on .tar files nested 5 levels deep (i.e. .tar within .tar within .tar ... 5 times), but it should work for any depth* and also for .tgz files.)
# extracting_nested_tars.py
import os
import re
import tarfile

file_extensions = ('tar', 'tgz')
# Edit this according to the archive types you want to extract. Keep in
# mind that these should be extractable by the tarfile module.

def FileExtension(file_name):
    """Return the file extension of file_name.

    'file_name' should be a string. It can be either the full path of
    the file or just its name (or any string as long as it contains
    the file extension).

    Examples:
    input (file_name) --> 'abc.tar'
    return value      --> 'tar'
    """
    match = re.compile(r"^.*[.](?P<ext>\w+)$",
                       re.VERBOSE|re.IGNORECASE).match(file_name)
    if match:  # if match != None:
        ext = match.group('ext')
        return ext
    else:
        return ''  # there is no file extension to file_name

def AppropriateFolderName(folder_name, parent_fullpath):
    """Return a folder name such that it can be safely created in
    parent_fullpath without replacing any existing folder in it.

    Check if a folder named folder_name exists in parent_fullpath. If no,
    return folder_name (without changing, because it can be safely created
    without replacing any already existing folder). If yes, append an
    appropriate number to the folder_name such that this new folder_name
    can be safely created in the folder parent_fullpath.

    Examples:
    folder_name  = 'untitled folder'
    return value = 'untitled folder' (if no such folder already exists
                                      in parent_fullpath.)
    folder_name  = 'untitled folder'
    return value = 'untitled folder 1' (if a folder named 'untitled folder'
                                        already exists but no folder named
                                        'untitled folder 1' exists in
                                        parent_fullpath.)
    folder_name  = 'untitled folder'
    return value = 'untitled folder 2' (if folders named 'untitled folder'
                                        and 'untitled folder 1' both
                                        already exist but no folder named
                                        'untitled folder 2' exists in
                                        parent_fullpath.)
    """
    if os.path.exists(os.path.join(parent_fullpath, folder_name)):
        match = re.compile(r'^(?P<name>.*)[ ](?P<num>\d+)$').match(folder_name)
        if match:  # if match != None:
            name = match.group('name')
            number = match.group('num')
            new_folder_name = '%s %d' % (name, int(number)+1)
            return AppropriateFolderName(new_folder_name,
                                         parent_fullpath)
            # Recursively call itself so that it can be checked whether a
            # folder named new_folder_name already exists in parent_fullpath
            # or not.
        else:
            new_folder_name = '%s 1' % folder_name
            return AppropriateFolderName(new_folder_name, parent_fullpath)
            # Recursively call itself so that it can be checked whether a
            # folder named new_folder_name already exists in parent_fullpath
            # or not.
    else:
        return folder_name

def Extract(tarfile_fullpath, delete_tar_file=True):
    """Extract the tarfile_fullpath to an appropriate* folder of the same
    name as the tar file (without an extension) and return the name
    of this folder.

    If delete_tar_file is True, it will delete the tar file after
    its extraction; if False, it won't. Default value is True as you
    would normally want to delete the (nested) tar files after
    extraction. Pass False if you don't want to delete the
    tar file (after its extraction) you are passing.
    """
    tarfile_name = os.path.basename(tarfile_fullpath)
    parent_dir = os.path.dirname(tarfile_fullpath)
    extract_folder_name = AppropriateFolderName(tarfile_name[:\
        -1*len(FileExtension(tarfile_name))-1], parent_dir)
    # (the slicing is to remove the extension (.tar) from the file name.)
    # Get a folder name (from the function AppropriateFolderName)
    # in which the contents of the tar file can be extracted,
    # so that it doesn't replace an already existing folder.
    extract_folder_fullpath = os.path.join(parent_dir,
                                           extract_folder_name)
    # The full path to this new folder.
    try:
        tar = tarfile.open(tarfile_fullpath)
        tar.extractall(extract_folder_fullpath)
        tar.close()
        if delete_tar_file:
            os.remove(tarfile_fullpath)
        return extract_folder_name
    except Exception as e:
        # Exceptions can occur while opening a damaged tar file.
        print('Error occurred while extracting %s\n'
              'Reason: %s' % (tarfile_fullpath, e))
        return

def WalkTreeAndExtract(parent_dir):
    """Recursively descend the directory tree rooted at parent_dir
    and extract each tar file on the way down (recursively).
    """
    try:
        dir_contents = os.listdir(parent_dir)
    except OSError as e:
        # Exception can occur if trying to open some folder that this
        # program does not have permission to read.
        print('Error occurred. Could not open folder %s\n'
              'Reason: %s' % (parent_dir, e))
        return
    for content in dir_contents:
        content_fullpath = os.path.join(parent_dir, content)
        if os.path.isdir(content_fullpath):
            # If content is a folder, walk it down completely.
            WalkTreeAndExtract(content_fullpath)
        elif os.path.isfile(content_fullpath):
            # If content is a file, check if it is a tar file.
            # If so, extract its contents to a new folder.
            if FileExtension(content_fullpath) in file_extensions:
                extract_folder_name = Extract(content_fullpath)
                if extract_folder_name:  # if extract_folder_name != None:
                    dir_contents.append(extract_folder_name)
                    # Append the newly extracted folder to dir_contents
                    # so that it can be later searched for more tar files
                    # to extract.
        else:
            # Unknown file type.
            print('Skipping %s. <Neither file nor folder>' % content_fullpath)

if __name__ == '__main__':
    tarfile_fullpath = 'fullpath_path_of_your_tarfile'  # pass the path of your tar file here.
    extract_folder_name = Extract(tarfile_fullpath, False)
    # tarfile_fullpath is extracted to extract_folder_name. Now descend
    # down its directory structure and extract all other tar files
    # (recursively).
    extract_folder_fullpath = os.path.join(os.path.dirname(tarfile_fullpath),
                                           extract_folder_name)
    WalkTreeAndExtract(extract_folder_fullpath)
    # If you want to extract all tar files in a dir, just execute the above
    # line and nothing else.
I have not added a command line interface to it. I guess you can add it if you find it useful.
Here is a slightly better version of the above program -
http://guanidene.blogspot.com/2011/06/nested-tar-archives-extractor.html
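For readers on Python 3, the same idea (extract each archive into its own non-clashing folder, then look inside for more archives) can be sketched more compactly with pathlib. This is only an illustration under stated assumptions, not the program from the answer: the names extract_nested, unique_dir, strip_suffix and ARCHIVE_SUFFIXES are made up for this sketch, and the 'data' extraction filter is used only when the running Python version provides it.

```python
import tarfile
from pathlib import Path

# Assumed set of archive suffixes; adjust to the types you expect.
ARCHIVE_SUFFIXES = (".tar", ".tgz", ".tar.gz")


def unique_dir(base: Path) -> Path:
    """Return base, or 'base 1', 'base 2', ... if base already exists."""
    candidate, n = base, 0
    while candidate.exists():
        n += 1
        candidate = base.with_name(f"{base.name} {n}")
    return candidate


def strip_suffix(name: str) -> str:
    """Drop a known archive suffix (longest first) from a file name."""
    for suffix in sorted(ARCHIVE_SUFFIXES, key=len, reverse=True):
        if name.lower().endswith(suffix):
            return name[: -len(suffix)]
    return name


def extract_nested(archive: Path, delete_archive: bool = False) -> Path:
    """Extract archive next to itself, then extract any archives found inside."""
    target = unique_dir(archive.with_name(strip_suffix(archive.name)))
    # Use the safer 'data' filter where this Python version offers it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive) as tar:
        tar.extractall(target, **kwargs)
    if delete_archive:
        archive.unlink()
    # sorted() takes a snapshot, so folders created below are not revisited here;
    # each recursive call takes care of its own extracted folder.
    for inner in sorted(target.rglob("*")):
        if inner.is_file() and inner.name.lower().endswith(ARCHIVE_SUFFIXES):
            extract_nested(inner, delete_archive=True)
    return target
```

Calling extract_nested(Path('xyz.tgz')) would leave xyz.tgz in place and create xyz/ beside it, with a nested a.tgz ending up in 'a 1' when a folder named 'a' already exists.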

Related

Is there a simpler function or one liner to check if folder exists if not create it and paste a specific file into it?

I am aiming to create a function that does the following:
1. Declare a path to a file, not just a folder, e.g. 'C:/Users/Lampard/Desktop/Folder1/File.py'
2. Create a folder called 'Archive' in the same folder as the declared file.
3. Cut the file and paste it into the new folder just created.
4. If the folder 'Archive' already exists, simply cut and paste the file into it.
I have spent approx. 15-20min going through these:
https://www.programiz.com/python-programming/directory
Join all except last x in list
https://docs.python.org/3/library/pathlib.html#operators
And here is what I got to:
import os
from pathlib import Path, PurePath
from shutil import copy

#This path will change every time - just trying to get function right first
path = 'C:/Users/Lampard/Desktop/Folder1/File.py'
#Used to allow suffix function
p = PurePath(path)
#Check if directory is a file not a folder
if not p.suffix:
    print("Not an extension")
#If it is a file
else:
    #Create new folder before last file
    #Change working directory
    split = path.split('/')
    new_directory = '/'.join(split[:-1])
    apply_new_directory = os.chdir(new_directory)
    #If folder does not exist create it
    try:
        os.mkdir('Archive')#Create new folder
    #If not, continue process to copy file and paste it into Archive
    except FileExistsError:
        copy(path, new_directory + '/Archive/' + split[-1])
Is this code okay? - does anyone know a simpler method?
List the folders in the current directory:
print([name for name in os.listdir(".") if os.path.isdir(name)])
Create a path:
import os
# define the name of the directory to be created
path = "/tmp/year"
try:
    os.mkdir(path)
except OSError:
    print("Creation of the directory %s failed" % path)
else:
    print("Successfully created the directory %s " % path)
To move and cut files you can use this library
As you're already using pathlib, there's no need to use shutil:
from pathlib import Path
path = 'C:/Users/Lampard/Desktop/Folder1/File.py' # or whatever
p = Path(path)
target = Path(p.with_name('Archive')) # replace the filename with 'Archive'
target.mkdir() # create target directory
p.rename(target.joinpath(p.name)) # move the file to the target directory
Feel free to add appropriate try…except statements to handle any errors.
Update: you might find this version more readable:
target = p.parent / 'Archive'
target.mkdir()
p.rename(target / p.name)
This works because pathlib overloads the / operator to join path components.
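Since the question also asks what happens when 'Archive' already exists, here is a minimal sketch that wraps the same pathlib calls in a function. The name archive_file is made up for illustration. It assumes that overwriting an identically named file already in 'Archive' is acceptable, which is what Path.replace does.

```python
from pathlib import Path


def archive_file(file_path):
    """Move file_path into an 'Archive' folder beside it and return the new path."""
    source = Path(file_path)
    if not source.is_file():
        raise FileNotFoundError(f"{source} is not an existing file")
    target_dir = source.parent / "Archive"
    # exist_ok=True means no error is raised if 'Archive' is already there.
    target_dir.mkdir(exist_ok=True)
    destination = target_dir / source.name
    # replace() moves the file and overwrites a same-named file in 'Archive'.
    source.replace(destination)
    return destination
```

Because the folder is created with exist_ok=True, the function can be called repeatedly for different files in the same folder without a try/except around mkdir.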

How to save files with same name in folder in python?

I have two folders with images, A and B. A contains 100 files and B contains only 80. The files in both folders have the same names. I want to save into folder C only the 80 files from A that have a counterpart in B.
Here is a part of my code. However, it is throwing error :
Required argument 'img' (pos 2) not found.
path1= '/home/vplab/Kitty/Saliency Dataset/PiCANet-Implementation/TrainSet/images'
path_mask= '/home/vplab/Kitty/Saliency Dataset/PiCANet-Implementation/TrainSet/masks'
save_path = '/home/vplab/Kitty/Saliency Dataset/PiCANet-Implementation/TrainSet/exp'
for file in os.listdir(path1):
    for file1 in os.listdir(path_mask):
        img_name = file[:-4]
        mask_name =file1[:-4]
        if img_name == mask_name:
            cv2.imwrite(os.path.join(save_path,img_name))
Your issue here is that cv2.imwrite takes two arguments, the output file name and the image data, and you are only passing the file name in cv2.imwrite(os.path.join(save_path,img_name)); that's what the error is telling you.
However, your current approach includes a nested for loop which will give poor performance. If you only want to know the files that the directories have in common, you can create a set of the file names in each directory and find the intersection. Then you just need to iterate through the common files and copy them over (as said in the comments, there's no need for cv2 here - they may be images but they're just regular files that can be copied).
import os
from shutil import copyfile

dir_1 = 'A'
dir_2 = 'B'
output_dir = 'C'
files_1 = os.listdir(dir_1)
files_2 = os.listdir(dir_2)
# Find the common files between both
common_files = set(files_1).intersection(files_2)
# Copy the common files over.
for file in common_files:
    copyfile(os.path.join(dir_1, file),
             os.path.join(output_dir, file))
If the reason that you are stripping the last characters from the files in os.listdir is because the files have the same name but different extensions, you only need to make two small modifications (where here I'm assuming the extension is .png that needs to be added back later):
files_1 = [item[:-4] for item in os.listdir(dir_1)]
files_2 = [item[:-4] for item in os.listdir(dir_2)]
And:
for file in common_files:
    file = file + '.png'  # Add the extension back on to the file name
    copyfile(os.path.join(dir_1, file),
             os.path.join(output_dir, file))
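If the extensions are not all the same length, or differ between the two folders, slicing off the last four characters becomes fragile. A hedged alternative is to compare base names with os.path.splitext. The function name copy_matching below is made up for this sketch, and it assumes the output folder may need to be created.

```python
import os
from shutil import copyfile


def copy_matching(dir_1, dir_2, output_dir):
    """Copy files from dir_1 whose base name (without extension) also occurs in dir_2."""
    # Base names present in dir_2, e.g. 'img01' for 'img01.png'.
    stems_2 = {os.path.splitext(name)[0] for name in os.listdir(dir_2)}
    os.makedirs(output_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(dir_1)):
        stem, _ext = os.path.splitext(name)
        source = os.path.join(dir_1, name)
        if stem in stems_2 and os.path.isfile(source):
            # Keep the original name (and extension) from dir_1.
            copyfile(source, os.path.join(output_dir, name))
            copied.append(name)
    return copied
```

The set lookup keeps the work proportional to the number of files, as with the intersection approach, and the returned list makes it easy to check which files were copied.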
The built-in any() function returns True if any element of an iterable is true; otherwise it returns False. shutil.copy copies the file src to the file or directory dst.
import os
import shutil

def read_file(folderName,folderPath):
    ''' Return list of file names '''
    path = folderPath+folderName
    return [file for file in os.listdir(path)]

def save_file(sourceFolderName,destFolderName,folderPath,fileName):
    ''' Save file in destination folder'''
    try:
        source_path = folderPath+sourceFolderName+"/"+fileName
        dest_path = folderPath+destFolderName+"/"+fileName
        shutil.copy(source_path, dest_path)
    except Exception as e:
        print(e)

base_path = '/home/vplab/Kitty/Saliency Dataset/PiCANet-Implementation/TrainSet/'
folder_images_files = read_file('images',base_path)
folder_masks_file = read_file('masks',base_path)
for file_1 in folder_images_files:
    #Check whether the file from folder A exists in folder B
    if any(file_1 == file_2 for file_2 in folder_masks_file):
        save_file("images","exp",base_path,file_1)

Find, renaming, and replacing files

I need to update an existing directory with files that are provided in a Patch directory.
This is what I'm starting with. All commented out by me and then I try to build each line.
# $SourceDirectory = Patch folder that has files in any number of sub folders
# $DestDirectory = Application folder that has the files that need patching
# $UnMatchedFilesFolder = A Folder where SourceFiles go that don't have a match in $DestDirectory
# import os.path
# import os.listdir
#
# Create list1 of files from $SourceDirectory
# For each file (excluding directory names) in List1 (including subfolders), search for it in $DestDirectory and its subfolders;
# If you find the file by the same name, then create a backup of that file with .old;
# move $DestDirectoryPathAndFile to $DestDirectoryPathAndFile.old;
# print "Creating backup of file";
# After the backup is made, then copy the file from the $SourceDirectory to the;
# exact same location where it was found in the $DestDirectory. ;
# Else;
# move file to UnmatchedFilesDirectory.;
# If the number of files in $UnMatchedFilesDirectory =/ 0;
# Create list3 from $UnmatchedFilesDirectory
# print "The following files in $UnMatchedFilesDirectory will need to be installed individually";
# Print "Automated Patching completed.";
# Print "Script completed";
As mentioned in the previous post, I am skeptical of the course you are following based on the information given. Based on the document given, there are far better sites/tutorials available for free to help you learn Python/programming. That said, Stack Overflow is a friendly place, and so I hope to provide you with information which will help you on your way:
import os

source_dir =r"D:\temp"
dest_dir=r"D:\temp2"
for root, dirs, files in os.walk(source_dir):
    # os.walk 'root' steps through subdirectories as we iterate
    # this allows us to join 'root' and 'file' without missing any sub-directories
    for file in files:
        exist_path = os.path.join(root, file)
        # expected_file represents the fullpath of a file we are looking to create/replace
        expected_file = exist_path.replace(source_dir, dest_dir)
        current = os.path.join(root, file)
        if os.path.exists(expected_file):
            print("The file %s exists, os.rename with '.old' before copying %s" % (current, exist_path))
            # .. note:: we should rename to .bkp here, then we would correctly copy the file below without conflict
        print("Now %s doesn't exist, we are free to copy %s" % (expected_file, exist_path))
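To make the outline from the question concrete, here is one possible sketch of the whole patching flow. It is an illustration only: the function name apply_patch is hypothetical, it matches files purely by name anywhere under the destination (as the pseudocode describes), it patches every match if a name occurs more than once, and it assumes the unmatched folder lies outside both the source and destination trees.

```python
import os
import shutil


def apply_patch(source_dir, dest_dir, unmatched_dir):
    """Back up and replace files under dest_dir with same-named files from source_dir."""
    # Index every file under dest_dir by its bare file name.
    dest_index = {}
    for root, _dirs, files in os.walk(dest_dir):
        for name in files:
            dest_index.setdefault(name, []).append(os.path.join(root, name))

    unmatched = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            patch_file = os.path.join(root, name)
            targets = dest_index.get(name)
            if targets:
                for target in targets:
                    print("Creating backup of file %s" % target)
                    # Rename the existing file to <name>.old, then copy the patch in.
                    os.replace(target, target + ".old")
                    shutil.copy2(patch_file, target)
            else:
                # No file of this name in dest_dir: set it aside for manual install.
                os.makedirs(unmatched_dir, exist_ok=True)
                shutil.move(patch_file, os.path.join(unmatched_dir, name))
                unmatched.append(name)

    if unmatched:
        print("The following files in %s will need to be installed individually:"
              % unmatched_dir)
        for name in unmatched:
            print("  " + name)
    print("Automated patching completed.")
    return unmatched
```

Building the name index once avoids walking the destination tree again for every patch file.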

Python: folder creation when copying files

I'm trying to create a shell script that will copy files from one computer (employee's old computer) to another (employee's new computer). I have it to the point where I can copy files over, thanks to the lovely people here, but I'm running into a problem - if I'm going from, say, this directory that has 2 files:
C:\Users\specificuser\Documents\Test Folder
....to this directory...
C:\Users\specificuser\Desktop
...I see the files show up on the Desktop, but the folder those files were in (Test Folder) isn't created.
Here is the copy function I'm using:
#copy function
def dir_copy(srcpath, dstpath):
    #if the destination path doesn't exist, create it
    if not os.path.exists(dstpath):
        os.makedir(dstpath)
    #tag each file to the source path to create the file path
    for file in os.listdir(srcpath):
        srcfile = os.path.join(srcpath, file)
        dstfile = os.path.join(dstpath, file)
        #if the source file path is a directory, copy the directory
        if os.path.isdir(srcfile):
            dir_copy(srcfile, dstfile)
        else: #if the source file path is just a file, copy the file
            shutil.copyfile(srcfile, dstfile)
I know I need to create the directory on the destination, I'm just not quite sure how to do it.
Edit: I found that I had a typo (os.makedir instead of os.mkdir). I tested it, and it creates directories like it's supposed to. HOWEVER, I'd like it to create the directory one level up from where it's starting. For example, in Test Folder there is Sub Test Folder. It has created Sub Test Folder but won't create Test Folder, because Test Folder is not part of the dstpath. Does that make sense?
You might want to look at shutil.copytree(). It performs the recursive copy functionality, including directories, that you're looking for. So, for a basic recursive copy, you could just run:
shutil.copytree(srcpath, dstpath)
However, to accomplish your goal of copying the source directory to the destination directory, creating the source directory inside of the destination directory in the process, you could use something like this:
import os
import shutil

def dir_copy(srcpath, dstdir):
    dirname = os.path.basename(srcpath)
    dstpath = os.path.join(dstdir, dirname)
    shutil.copytree(srcpath, dstpath)
Note that your srcpath must not contain a slash at the end for this to work. Also, the result of joining the destination directory and the source directory name must not already exist, or copytree will fail.
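Both caveats can be relaxed. As a sketch, assuming Python 3.8 or later (for the dirs_exist_ok parameter of shutil.copytree), normalizing the source path first removes the trailing-slash restriction, and dirs_exist_ok=True lets the copy merge into a destination folder that already exists:

```python
import os
import shutil


def dir_copy(srcpath, dstdir):
    """Copy the folder srcpath into dstdir, keeping the folder's own name."""
    # normpath drops a trailing slash, so basename does not come back empty.
    dirname = os.path.basename(os.path.normpath(srcpath))
    dstpath = os.path.join(dstdir, dirname)
    # dirs_exist_ok (Python 3.8+) merges into an existing folder instead of failing.
    shutil.copytree(srcpath, dstpath, dirs_exist_ok=True)
    return dstpath
```

Note that merging overwrites files of the same name in the destination, so this variant suits repeated migrations better than one-off copies where a clash should be treated as an error.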
This is a common problem with file copy... do you intend to just copy the contents of the folder or do you want the folder itself copied. Copy utilities typically have a flag for this and you can too. I use os.makedirs so that any intermediate directories are created also.
import os
import shutil

#copy function
def dir_copy(srcpath, dstpath, include_directory=False):
    if include_directory:
        dstpath = os.path.join(dstpath, os.path.basename(srcpath))
    os.makedirs(dstpath, exist_ok=True)
    #tag each file to the source path to create the file path
    for file in os.listdir(srcpath):
        srcfile = os.path.join(srcpath, file)
        dstfile = os.path.join(dstpath, file)
        #if the source file path is a directory, copy the directory
        if os.path.isdir(srcfile):
            dir_copy(srcfile, dstfile)
        else: #if the source file path is just a file, copy the file
            shutil.copyfile(srcfile, dstfile)
import shutil
import os

def dir_copy(srcpath, dstpath):
    try:
        shutil.copytree(srcpath, dstpath)
    except shutil.Error as e:
        print('Directory not copied. Error: %s' % e)
    except OSError as e:
        print('Directory not copied. Error: %s' % e)

dir_copy('/home/sergey/test1', '/home/sergey/test2')
I use this script to backup (copy) my working folder. It will skip large files, keep folder structure (hierarchy) and create destination folders if they don't exist.
import os
import shutil

# the_folder_copy_from must be set to the path of the folder to back up.
for root, dirs, files in os.walk(the_folder_copy_from):
    for name in files:
        if os.path.getsize(os.path.join(root, name))<10*1024*1024:
            target=os.path.join("backup", os.path.relpath(os.path.join(root, name),start=the_folder_copy_from))
            print(target)
            os.makedirs(os.path.dirname(target),exist_ok=True)
            shutil.copy(src=os.path.join(root, name),dst=target)
print("Done")

Backup File Script

I am writing a script to back up files from one directory (Master) to another directory (Clone). The script will monitor the two directories. If a file is missing from Clone, the script will copy the missing file from Master to Clone. Now I have a problem creating the missing folder. I have read the documentation and understood that shutil.copyfile will create a directory if it doesn't exist, but I am getting an IOError saying that the destination directory does not exist. Below is the code.
import os,shutil,hashlib

master="C:\Users\Will Yan\Desktop\Master"
client="D:\Clone"
if(os.path.exists(client)):
    print "PATH EXISTS"
else:
    print "PATH Doesn't exists copying"
    shutil.copytree(master,client)

def walkLocation(location,option):
    aList = []
    for(path,dirs,files) in os.walk(location):
        for i in files:
            if option == "path":
                aList.append(path+"/"+i)
            else:
                aList.append(i)
    return aList

def getPaths(location):
    paths=[]
    files=[]
    result =[]
    paths = walkLocation(location,'path')
    files = walkLocation(location,'files')
    result.append(paths)
    result.append(files)
    return result

ma=walkLocation(master,"path")
cl=walkLocation(client,"path")
maf=walkLocation(master,"a")
clf=walkLocation(client,"a")
for i in range(len(ma)):
    count = 0
    for j in range(len(cl)):
        if maf[i]==clf[j]:
            break
        else:
            count= count+1
    if count==len(cl):
        dirStep1=ma[i][ma[i].find("Master")::]
        dirStep2=dirStep1.replace("Master",client)
        shutil.copyfile(ma[i],dirStep2)
Can anyone tell me what I did wrong?
Thanks
Sorry, but the documentation doesn't say that. Here's a reproduction of the full documentation for the function:
shutil.copyfile(src, dst)
Copy the contents (no metadata) of the file named src to a file named dst. dst must be the complete target file name; look at copy() for a copy that accepts a target directory path. If src and dst are the same files, Error is raised. The destination location must be writable; otherwise, an IOError exception will be raised. If dst already exists, it will be replaced. Special files such as character or block devices and pipes cannot be copied with this function. src and dst are path names given as strings.
So you have to create the directory yourself.
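A minimal sketch of doing that, assuming Python 3 (for the exist_ok parameter of os.makedirs): create the parent folders of the destination before calling shutil.copyfile. The helper names copy_with_parents and copy_missing are made up for illustration. The second function also shows one way to find files missing from the clone by comparing paths relative to the master, instead of searching for the string "Master" in the path.

```python
import os
import shutil


def copy_with_parents(src, dst):
    """Copy src to dst, creating dst's parent folders first."""
    parent = os.path.dirname(dst)
    if parent:
        os.makedirs(parent, exist_ok=True)  # no error if it already exists
    shutil.copyfile(src, dst)
    return dst


def copy_missing(master, clone):
    """Copy every file under master that has no counterpart under clone."""
    copied = []
    for root, _dirs, files in os.walk(master):
        for name in files:
            src = os.path.join(root, name)
            # Same relative location, but rooted at the clone folder.
            dst = os.path.join(clone, os.path.relpath(src, master))
            if not os.path.exists(dst):
                copy_with_parents(src, dst)
                copied.append(dst)
    return copied
```

Running copy_missing a second time copies nothing, because every file from the master then already has a counterpart in the clone.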
