I'm trying to read all the files in a .zip archive named data1.zip using glob.glob().
import glob
from zipfile import ZipFile
archive = ZipFile('data1.zip','r')
files = archive.read(glob.glob('*.jpg'))
Error Message:
TypeError: unhashable type: 'list'
The solution to the problem I'm using is:
files = [archive.read(str(i+1)+'.jpg') for i in range(100)]
This is bad because I'm assuming my files are named 1.jpg, 2.jpg, etc.
Is there a better way to do this, following Python best practices? It doesn't necessarily need to use glob().
glob doesn't look inside your archive; it just gives you a list of the .jpg files in your current working directory.
ZipFile already has methods for returning information about the files in the archive: namelist() returns names, and infolist() returns ZipInfo objects that include metadata as well.
Are you just looking for:
archive = ZipFile('data1.zip', 'r')
files = archive.namelist()
Or if you only want .jpg files:
files = [name for name in archive.namelist() if name.endswith('.jpg')]
Or if you want to read all the contents of each file:
files = [archive.read(name) for name in archive.namelist()]
Although I'd probably rather make a dict mapping names to contents:
files = {name: archive.read(name) for name in archive.namelist()}
That way you can access contents like so:
files['1.jpg']
Or get a list of the files present using files.keys(), etc.
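Putting the answer together as a reusable helper (a sketch; the function name read_jpgs is my own), you can filter and read the .jpg members in one pass:

```python
from zipfile import ZipFile

def read_jpgs(zip_path):
    """Map each .jpg member name in the archive to its raw bytes."""
    with ZipFile(zip_path, 'r') as archive:
        return {name: archive.read(name)
                for name in archive.namelist()
                if name.endswith('.jpg')}
```

With this, files = read_jpgs('data1.zip') gives you files['1.jpg'] as before, without assuming any particular naming scheme.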
I want to access some .jp2 image files inside a zip file and create a list of their paths. The zip file contains a directory folder named S2A_MSIL2A_20170420T103021_N0204_R108_T32UNB_20170420T103454.SAFE and I am currently reading the files using glob, after having extracted the folder.
I don't want to have to extract the contents of the zip file first. I read that I cannot use glob within a zip directory, nor can I use wildcards to access files within it, so I am wondering what my options are, apart from extracting to a temporary directory.
The way I am currently getting the list is this:
dirr = r'C:\path-to-folder\S2A_MSIL2A_20170420T103021_N0204_R108_T32UNB_20170420T103454.SAFE'
jp2_files = glob.glob(dirr + '/**/IMG_DATA/**/R60m/*B??_??m.jp2', recursive=True)
There are additional different .jp2 files in the directory, for which reason I am using the glob wildcards to filter the ones I need.
I am hoping to make this work so that I can automate it for many different zip directories. Any help is highly appreciated.
I made it work with zipfile and fnmatch:
from zipfile import ZipFile
import fnmatch

zip_path = 'path_to_zip.zip'
pattern = '*/R60m/*B???60m.jp2'

filtered_list = []
with ZipFile(zip_path, 'r') as zipObj:
    # namelist() gives every member path inside the archive
    file_list = zipObj.namelist()
    for file in file_list:
        if fnmatch.fnmatch(file, pattern):
            filtered_list.append(file)
I've got two folders, each with a different CSV file inside (both have the same format).
I've written some Python code to search within the "C:/Users/Documents" directory for CSV files which begin with the word "File":
import glob, os

inputfile = []
for root, dirs, files in os.walk("C:/Users/Documents/"):
    for datafile in files:
        if datafile.startswith("File") and datafile.endswith(".csv"):
            inputfile.append([os.path.join(root, datafile)])

print(inputfile)
That almost worked as it returns:
[['C:/Users/Documents/Test A\\File 1.csv'], ['C:/Users/Documents/Test B\\File 2.csv']]
Is there any way I can get it to return this instead (no sublists, and / instead of \)?
['C:/Users/Documents/Test A/File 1.csv', 'C:/Users/Documents/Test B/File 2.csv']
The idea is so I can then read both CSV files at once later, but I believe I need to get the list in the format above first.
Okay, I will paste an option here. I made use of os.path.abspath on the joined path. Have a look and see if it works.
import os

filelist = []
for folder, subfolders, files in os.walk("C:/Users/Documents/"):
    for datafile in files:
        if datafile.startswith("File") and datafile.endswith(".csv"):
            filePath = os.path.abspath(os.path.join(folder, datafile))
            filelist.append(filePath)

print(filelist)
Result:
['C:/Users/Documents/Test A/File 1.csv','C:/Users/Documents/Test B/File 2.csv']
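One caveat: on Windows, os.path.abspath produces backslash separators, so the result may not exactly match the forward-slash form you asked for. A hedged alternative using pathlib (the helper name find_csvs is my own; Path.as_posix() always yields forward slashes):

```python
from pathlib import Path

def find_csvs(rootdir):
    """Return forward-slash paths of 'File*.csv' files anywhere under rootdir."""
    return [p.as_posix() for p in Path(rootdir).rglob('File*.csv')]
```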
I'm trying to iterate through a directory of multiple YAML files. My goal is to merge them all together, or rather, append them. My current solution requires me to load a 'placeholder.yml' file, which I then populate with the files in the directory. Here's the code:
import os
import yaml
from yaml import Loader

def yaml_loader(filepath):
    # Load a YAML file and return the parsed data
    with open(filepath, 'r') as file_descriptor:
        data = yaml.load(file_descriptor, Loader)
    return data

rootdir = './yaml_files'
generated_yaml = yaml_loader('./yaml_files/placeholder.yml')

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        data = yaml_loader(os.path.join(subdir, file))
        generated_yaml.update(data)

print(generated_yaml)
This solution is not satisfactory, as the placeholder.yml must exist and hold at least one value.
Is there a way to generate an empty YAML object for me to populate with the data I collect in my directory?
Also, if you know of any library that would suit this requirement, please let me know.
Thanks in advance.
yaml.load doesn't produce a "yaml object" as such; it just produces output that corresponds to the YAML in your file (or your string, if you load from a string). So it might produce a string, a list, or a dict. Given that you're using update, I'm guessing your YAML produces a dict, so you should be able to simply create an empty dict instead of using your placeholder:
rootdir = './yaml_files'
generated_yaml = dict()

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        data = yaml_loader(os.path.join(subdir, file))
        generated_yaml.update(data)
I've decided to go with the !include functionality in PyYAML.
This required a little bit of restructuring, but suits my needs better.
The update function I used before will overwrite configurations; as I only wanted to append files, using includes is way simpler.
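For reference, PyYAML does not ship an !include tag out of the box; it is usually implemented with a custom constructor, roughly like this (a sketch; the IncludeLoader class and _include helper names are my own):

```python
import os
import yaml

class IncludeLoader(yaml.SafeLoader):
    """Loader that resolves !include tags relative to the including file."""
    def __init__(self, stream):
        # Remember the directory of the including file for relative paths
        self._root = os.path.dirname(getattr(stream, 'name', '.'))
        super().__init__(stream)

def _include(loader, node):
    # Load the referenced file with the same loader, so includes can nest
    path = os.path.join(loader._root, loader.construct_scalar(node))
    with open(path, 'r') as f:
        return yaml.load(f, IncludeLoader)

IncludeLoader.add_constructor('!include', _include)
```

A document can then say child: !include child.yml, and the child file's contents are spliced in at load time.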
I am trying to extract zip files using the zipfile module's extractall method.
My code snippet is
import zipfile

file_path = '/something/airway.zip'
dir_path = 'something/'

with zipfile.ZipFile(file_path, "r") as zip_ref:
    zip_ref.extractall(dir_path)
I have two zip files: test.zip (1.1 MB) and airway.zip (520 MB).
For test.zip the target folder contains all the files, but for airway.zip it creates another folder inside my target folder, named Airway, and then extracts all the files there. Even after renaming airway.zip to a garbage name, the result was the same.
Is there some workaround to get only the files extracted into my target folder? This is critical for me, as I'm running this extraction automatically from Django.
Python version: 3.9.6;
Django version: 2.2
I ran your code and it seems to be a problem with the zip file itself, not the code. If you create a zip file by selecting only the files, you get the result you got with test.zip. If you create it by selecting the folder holding the files, the folder will be there when you extract the archive again, no matter what you name your zip file.
I have two articles related to this:
https://www.kite.com/python/docs/zipfile.ZipFile.extractall
https://www.geeksforgeeks.org/working-zip-files-python/
Even if neither of these articles solves your problem, I think that instead of zipping the files inside the folder you zipped the folder itself, so try zipping the files inside the folder instead.
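If you cannot control how the archives are created, one workaround is to strip a lone top-level folder while extracting (a sketch; the function name extract_flat and the single-top-folder heuristic are my own):

```python
import os
import shutil
import zipfile

def extract_flat(zip_path, dest_dir):
    """Extract a zip; if every member sits under one top-level folder,
    strip that folder so the files land directly in dest_dir."""
    with zipfile.ZipFile(zip_path, 'r') as zf:
        names = [n for n in zf.namelist() if not n.endswith('/')]
        tops = {n.split('/', 1)[0] for n in names}
        # Only strip when all files share a single top-level folder
        strip = len(tops) == 1 and all('/' in n for n in names)
        for name in names:
            target = name.split('/', 1)[1] if strip else name
            out_path = os.path.join(dest_dir, target)
            parent = os.path.dirname(out_path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            with zf.open(name) as src, open(out_path, 'wb') as dst:
                shutil.copyfileobj(src, dst)
```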
How can I unzip a .zip file with Python into some directory output_dir and fetch a list of all the directories made by the unzipping as a result? For example, if I have:
unzip('myzip.zip', 'outdir')
outdir is a directory that might have other files/directories in it. When I unzip myzip.zip into it, I'd like unzip to return all the directories made in outdir/ as a result of the unzipping. Here is my code so far:
import zipfile

def unzip(zip_file, outdir):
    """
    Unzip a given 'zip_file' into the output directory 'outdir'.
    """
    zf = zipfile.ZipFile(zip_file, "r")
    zf.extractall(outdir)
How can I make unzip return the dirs it creates in outdir? Thanks.
Edit: the solution that makes the most sense to me is to get ONLY the top-level directories in the zip file and then recursively walk through them, which would guarantee that I get all the files made by the zip. Is this possible? The inconsistent behavior of namelist across archives makes it virtually impossible to rely on.
You can read the contents of the zip file with the namelist() method. Directories will have a trailing path separator:
>>> import zipfile
>>> zip = zipfile.ZipFile('test.zip')
>>> zip.namelist()
['dir2/', 'file1']
You can do this before or after extracting contents.
Whether namelist() lists directories explicitly depends on the tool that created the archive: some tools store separate entries for directories (with a trailing separator), while others store only file paths and leave the directories implied.
namelist() returns a complete listing of the zip archive's contents, with directories marked by a trailing path separator. For instance, a zip archive of the following file structure:
./file1
./dir2
./dir2/dir21
./dir3
./dir3/file3
./dir3/dir31
./dir3/dir31/file31
results in the following list being returned by zipfile.ZipFile.namelist():
[ 'file1',
'dir2/',
'dir2/dir21/',
'dir3/',
'dir3/file3',
'dir3/dir31/',
'dir3/dir31/file31' ]
ZipFile.namelist will return a list of the names of the items in an archive. However, these names will only be the full names of files, including their directory path. (A zip archive need not contain explicit directory entries, so directories may only be implied by the member names.) To determine the directories created, you need a list of every directory created implicitly by each file.
The dirs_in_zip() function below will do this and collect all dir names into a set.
import zipfile
import os

def parent_dirs(pathname, subdirs=None):
    """Return the set of all individual directories contained in a pathname

    For example, if 'a/b/c.ext' is the path to the file 'c.ext':
    a/b/c.ext -> set(['a', 'a/b'])
    """
    if subdirs is None:
        subdirs = set()
    parent = os.path.dirname(pathname)
    if parent:
        subdirs.add(parent)
        parent_dirs(parent, subdirs)
    return subdirs

def dirs_in_zip(zf):
    """Return the set of directories that would be created by the ZipFile zf"""
    alldirs = set()
    for fn in zf.namelist():
        alldirs.update(parent_dirs(fn))
    return alldirs

zf = zipfile.ZipFile(zipfilename, 'r')
print(dirs_in_zip(zf))
Let it finish and then read the contents of the directory.
Assuming no one else will be writing the target directory at the same time, walk the directory recursively prior to unzipping, then afterwards, and compare the results.
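That before-and-after comparison can be sketched as follows (the function name unzip_and_diff is my own; it assumes nothing else writes to outdir during extraction):

```python
import os
import zipfile

def unzip_and_diff(zip_file, outdir):
    """Unzip into outdir and return the set of paths created by extraction."""
    def snapshot(root):
        # Record every file and directory currently under root
        seen = set()
        for folder, dirs, files in os.walk(root):
            for name in dirs + files:
                seen.add(os.path.join(folder, name))
        return seen
    before = snapshot(outdir)
    with zipfile.ZipFile(zip_file, 'r') as zf:
        zf.extractall(outdir)
    return snapshot(outdir) - before
```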