How to read text files in a zipped folder in Python

How to read text files in a zipped folder in Python - python

I have a compressed data file (all in a folder, then zipped). I want to read each file without unzipping. I tried several methods but nothing works for entering the folder in the zip file. How should I achieve that?
Without folder in the zip file:
with zipfile.ZipFile('data.zip') as z:
for filename in z.namelist():
data = filename.readlines()
With one folder:
with zipfile.ZipFile('data.zip') as z:
for filename in z.namelist():
if filename.endswith('/'):
# Here is what I was stucked

namelist() returns a list of all items in an archive recursively.
You can check whether an item is a directory by calling os.path.isdir():
import os
import zipfile
with zipfile.ZipFile('archive.zip') as z:
for filename in z.namelist():
if not os.path.isdir(filename):
# read the file
with z.open(filename) as f:
for line in f:
print line
Hope that helps.

I got Alec's code to work. I made some minor edits: (note, this won't work with password-protected zipfiles)
import os
import sys
import zipfile
z = zipfile.ZipFile(sys.argv[1]) # Flexibility with regard to zipfile
for filename in z.namelist():
if not os.path.isdir(filename):
# read the file
for line in z.open(filename):
print line
z.close() # Close the file after opening it
del z # Cleanup (in case there's further work after this)

I got RichS' code to work. I made some minor edits:
import os
import sys
import zipfile
archive = sys.argv[1] # assuming launched with `python my_script.py archive.zip`
with zipfile.ZipFile(archive) as z:
for filename in z.namelist():
if not os.path.isdir(filename):
# read the file
for line in z.open(filename):
print(line.decode('utf-8'))
As you can see the edits are minor. I've switched to Python 3, the ZipFile class has a capital F, and the output is converted from b-strings to unicode strings. Only decode if you are trying to unzip a text file.
PS I'm not dissing RichS at all. I just thought it would be hilarious. Both useful and a mild shitpost.
PPS You can get file from an archive with a password: ZipFile.open(name, mode='r', pwd=None, *, force_zip64=False) or ZipFile.read(name, pwd=None). If you use .read then there's no context manager so you would simply do
# read the file
print(z.read(filename).decode('utf-8'))

Related

Filter Directory using Regex and output filtered files to another directory

I am simply trying to create a python 3 program that runs through all .sql files in a specific directory and then apply my regex that adds ; after a certain instance and write the changes made to the file to a separate directory with their respective file names as the same.
So, if I had file1.sql and file2.sql in "/home/files" directory, after I run the program, the output should write those two files to "/home/new_files" without changes the content of the original files.
Here is my code:
import glob
import re
folder_path = "/home/files/d_d"
file_pattern = "/*sql"
folder_contents = glob.glob(folder_path + file_pattern)
for file in folder_contents:
print("Checking", file)
for file in folder_contents:
read_file = open(file, 'rt',encoding='latin-1').read()
#words=read_file.split()
with open(read_file,"w") as output:
output.write(re.sub(r'(TBLPROPERTIES \(.*?\))', r'\1;', f, flags=re.DOTALL))
I receive an error of File name too long:"CREATE EXTERNAL TABLe" and also I am not too sure where I would put my output path (/home/files/new_dd)in my code.
Any ideas or suggestions?

With read_file = open(file, 'rt',encoding='latin-1').read() the whole content of the file was being used as the file descriptor. The code provided here iterate over the files names found with glob.glob pattern open to read, process data, and open to write (assuming that a folder newfile_sqls already exist,
if not, an error would rise FileNotFoundError: [Errno 2] No such file or directory).
import glob
import os
import re
folder_path = "original_sqls"
#original_sqls\file1.sql, original_sqls\file2.sql, original_sqls\file3.sql
file_pattern = "*sql"
# new/modified files folder
output_path = "newfile_sqls"
folder_contents = glob.glob(os.path.join(folder_path,file_pattern))
# iterate over file names
for file_ in [os.path.basename(f) for f in folder_contents]:
# open to read
with open(os.path.join(folder_path,file_), "r") as inputf:
read_file = inputf.read()
# use variable 'read_file' here
tmp = re.sub(r'(TBLPROPERTIES \(.*?\))', r'\1;', read_file, flags=re.DOTALL)
# open to write to (previouly created) new folder
with open(os.path.join(output_path,file_), "w") as output:
output.writelines(tmp)

Replace/Add a class file to subfolder in jar using python zipfile command

I have a jar file and I have a path which represents a location inside Jar file.
Using this location I need to replace class file inside jar(Add a class file in some cases).I have class file inside another folder which is present where jar is present(This class file i have to move to Jar).
Code which I am trying to achieve above objective :
import zipfile
import os
zf = zipfile.ZipFile(os.path.normpath('D:\mystuff\test.jar'),mode='a')
try:
print('adding testclass.class')
zf.write(os.path.normpath('D:\mystuff\testclass.class'))
finally:
print('closing')
zf.close()
After executing above code when I saw jar below mentioned format:
Jar
|----META-INF
|----com.XYZ
|----Mystuff
|--testclass.class
Actual Output I need is -
Jar
|----META-INF
|----com.XYZ
|--ABC
|-testclass.class
How can achieve this using zipfile.write command or any other way in python?
I didn't find any params in write command where i can provide destination file location inside Jar/Zip file.
ZipFile.write(filename, arcname=None, compress_type=None)

Specify arcname to change the name of the file in the archive.
import zipfile
import os
zf = zipfile.ZipFile(os.path.normpath(r'D:\mystuff\test.jar'),mode='a')
try:
print('adding testclass.class')
zf.write(os.path.normpath(r'D:\mystuff\testclass.class'),arcname="com.XYZ/ABC/testclass.class")
finally:
print('closing')
zf.close()
Note: I doubt test.jar is your real jar name, since you didn't protect your string against special chars and the jar file opened would have been 'D:\mystuff\<TAB>est.jar' (well, it doesn't work :))
EDIT: if you want to add the new file but remove the old one, you have to do differently: you cannot delete from a zipfile, you have to rebuild another one (inspired by Delete file from zipfile with the ZipFile Module)
import zipfile
import os
infile = os.path.normpath(r'D:\mystuff\test.jar')
outfile = os.path.normpath(r'D:\mystuff\test_new.jar')
zin = zipfile.ZipFile(infile,mode='r')
zout = zipfile.ZipFile(outfile,mode='w')
for item in zin.infolist():
if os.path.basename(item.filename)=="testclass.class":
pass # skip item
else:
# write the item to the new archive
buffer = zin.read(item.filename)
zout.writestr(item, buffer)
print('adding testclass.class')
zout.write(os.path.normpath(r'D:\mystuff\testclass.class'),arcname="com.XYZ/ABC/testclass.class")
zout.close()
zin.close()
os.remove(infile)
os.rename(outfile,infile)

Unable to remove zipped file after unzipping

I'm attempting to remove a zipped file after unzipping the contents on windows. The contents can be stored in a folder structure in the zip. I'm using the with statement and thought this would close the file-like object (source var) and zip file. I've removed lines of code relating to saving the source file.
import zipfile
import os
zipped_file = r'D:\test.zip'
with zipfile.ZipFile(zipped_file) as zip_file:
for member in zip_file.namelist():
filename = os.path.basename(member)
if not filename:
continue
source = zip_file.open(member)
os.remove(zipped_file)
The error returned is:
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'D:\\test.zip'
I've tried:
looping over the os.remove line in case it's a slight timing issue
Using close explicitly instead of the with statment
Attempted on local C drive and mapped D Drive

instead of passing in a string to the ZipFile constructor, you can pass it a file like object:
import zipfile
import os
zipped_file = r'D:\test.zip'
with open(zipped_file, mode="r") as file:
zip_file = zipfile.ZipFile(file)
for member in zip_file.namelist():
filename = os.path.basename(member)
if not filename:
continue
source = zip_file.open(member)
os.remove(zipped_file)

You are opening files inside the zip... which create a file lock on the whole zip file. close the inner file open first... via source.close() at the end of your loop
import zipfile
import os
zipped_file = r'D:\test.zip'
with zipfile.ZipFile(zipped_file) as zip_file:
for member in zip_file.namelist():
filename = os.path.basename(member)
if not filename:
continue
source = zip_file.open(member)
source.close()
os.remove(zipped_file)

Try to close the zipfile before removing.

you can do also like this, which works pretty good:
import os, shutil, zipfile
fpath= 'C:/Users/dest_folder'
path = os.getcwd()
for file in os.listdir(path):
if file.endswith(".zip"):
dirs = os.path.join(path, file)
if os.path.exists(fpath):
shutil.rmtree(fpath)
_ = os.mkdir(fpath)
with open(dirs, 'rb') as fileobj:
z = zipfile.ZipFile(fileobj)
z.extractall(fpath)
z.close()
os.remove(dirs)

How to extract a file within a folder within a zip?

I need to extract a file called Preview.pdf from a folder called QuickLooks inside of a zip file.
Right now my code looks a little like this:
with ZipFile(newName, 'r') as newName:
newName.extract(\QuickLooks\Preview.pdf)
newName.close()
(In this case, newName has been set equal to the full path to the zip).
It's important to note that the backslash is correct in this case because I'm on Windows.
The code doesn't work; here's the error it gives:
Traceback (most recent call last):
File "C:\Users\Asit\Documents\Evam\Python_Scripts\pageszip.py", line 18, in <module>
ZF.extract("""QuickLooks\Preview.pdf""")
File "C:\Python33\lib\zipfile.py", line 1019, in extract
member = self.getinfo(member)
File "C:\Python33\lib\zipfile.py", line 905, in getinfo
'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'QuickLook/Preview.pdf' in the archive"
I'm running the Python script from inside Notepad++, and taking the output from its console.
How can I accomplish this?
Alternatively, how could I extract the whole QuickLooks folder, move out Preview.pdf, and then delete the folder and the rest of it's contents?
Just for context, here's the rest of the script. It's a script to get a PDF of a .pages file. I know there are bonified converters out there; I'm just doing this as an excercise with some sort of real-world application.
import os.path
import zipfile
from zipfile import *
import sys
file = raw_input('Enter the full path to the .pages file in question. Please note that file and directory names cannot contain any spaces.')
dir = os.path.abspath(os.path.join(file, os.pardir))
fileName, fileExtension = os.path.splitext(file)
if fileExtension == ".pages":
os.chdir(dir)
print (dir)
fileExtension = ".zip"
os.rename (file, fileName + ".zip")
newName = fileName + ".zip" #for debugging purposes
print (newName) #for debugging purposes
with ZipFile(newName, 'w') as ZF:
print("I'm about to list names!")
print(ZF.namelist()) #for debugging purposes
ZF.extract("QuickLook/Preview.pdf")
os.rename('Preview.pdf', fileName + '.pdf')
finalPDF = fileName + ".pdf"
print ("Check out the PDF! It's located at" + dir + finalPDF + ".")
else:
print ("Sorry, this is not a valid .pages file.")
sys.exit
I'm not sure if the import of Zipfile is redundant; I read on another SO post that it was better to use from zipfile import * than import zipfile. I wasn't sure, so I used both. =)
EDIT: I've changed the code to reflect the changes suggested by Blckknght.

Here's something that seems to work. There were several issues with your code. As I mentioned in a comment, the zipfile must be opened with mode 'r' in order to read it. Another is that zip archive member names always use forward slash / characters in their path names as separators (see section 4.4.17.1 of the PKZIP Application Note). It's important to be aware that there's no way to extract a nested archive member to a different subdirectory with Python's currentzipfilemodule. You can control the root directory, but nothing below it (i.e. any subfolders within the zip).
Lastly, since it's not necessary to rename the .pages file to .zip — the filename you passZipFile() can have any extension — I removed all that from the code. However, to overcome the limitation on extracting members to a different subdirectory, I had to add code to first extract the target member to a temporary directory, and then copy that to the final destination. Afterwards, of course, this temporary folder needs to deleted. So I'm not sure the net result is much simpler...
import os.path
import shutil
import sys
import tempfile
from zipfile import ZipFile
PREVIEW_PATH = 'QuickLooks/Preview.pdf' # archive member path
pages_file = input('Enter the path to the .pages file in question: ')
#pages_file = r'C:\Stack Overflow\extract_test.pages' # hardcode for testing
pages_file = os.path.abspath(pages_file)
filename, file_extension = os.path.splitext(pages_file)
if file_extension == ".pages":
tempdir = tempfile.gettempdir()
temp_filename = os.path.join(tempdir, PREVIEW_PATH)
with ZipFile(pages_file, 'r') as zipfile:
zipfile.extract(PREVIEW_PATH, tempdir)
if not os.path.isfile(temp_filename): # extract failure?
sys.exit('unable to extract {} from {}'.format(PREVIEW_PATH, pages_file))
final_PDF = filename + '.pdf'
shutil.copy2(temp_filename, final_PDF) # copy and rename extracted file
# delete the temporary subdirectory created (along with pdf file in it)
shutil.rmtree(os.path.join(tempdir, os.path.split(PREVIEW_PATH)[0]))
print('Check out the PDF! It\'s located at "{}".'.format(final_PDF))
#view_file(final_PDF) # see Bonus below
else:
sys.exit('Sorry, that isn\'t a .pages file.')
Bonus: If you'd like to actually view the final pdf file from the script, you can add the following function and use it on the final pdf created (assuming you have a PDF viewer application installed on your system):
import subprocess
def view_file(filepath):
subprocess.Popen(filepath, shell=True).wait()

Python File Concatenation

I have a data folder, with subfolders for each subject that ran through a program. So, for example, in the data folder, there are folders for Bob, Fred, and Tom. Each one of those folders contains a variety of files and subfolders. However, I am only interested in the 'summary.log' file contained in each subject's folder.
I want to concatenate the 'summary.log' file from Bob, Fred, and Tom into a single log file in the data folder. In addition, I want to add a column to each log file that will list the subject number.
Is this possible to do in Python? Or is there an easier way to do it? I have tried a number of different batches of code, but none of them get the job done. For example,
#!/usr/bin/python
import sys, string, glob, os
fls = glob.glob(r'/Users/slevclab/Desktop/Acceptability Judgement Task/data/*');
outfile = open('summary.log','w');
for x in fls:
file=open(x,'r');
data=file.read();
file.close();
outfile.write(data);
outfile.close();
Gives me the error,
Traceback (most recent call last):
File "fileconcat.py", line 8, in <module>
file=open(x,'r');
IOError: [Errno 21] Is a directory
I think this has to do with the fact that the data folder contains subfolders, but I don't know how to work around it. I also tried this, but to no avail:
from glob import iglob
import shutil
import os
PATH = r'/Users/slevclab/Desktop/Acceptability Judgement Task/data/*'
destination = open('summary.log', 'wb')
for filename in iglob(os.path.join(PATH, '*.log'))
shutil.copyfileobj(open(filename, 'rb'), destination)
destination.close()
This gives me an "invalid syntax" error at the "for filename" line, but I'm not sure what to change.

The syntax is not related to the use of glob.
You forget the ":" at the end of the for statement:
for filename in iglob(os.path.join(PATH, '*.log')):
^--- missing
But the following pattern works :
PATH = r'/Users/slevclab/Desktop/Acceptability Judgement Task/data/*/*.log'
destination = open('summary.log', 'wb')
for filename in iglob(PATH):
shutil.copyfileobj(open(filename, 'rb'), destination)
destination.close()

The colon (:) is missing in the for line.
Besides you should use with because it handles closing the file (your code is not exception safe).
from glob import iglob
import shutil
import os
PATH = r'/Users/slevclab/Desktop/Acceptability Judgement Task/data/*'
with open('summary.log', 'wb') as destination:
for filename in iglob(os.path.join(PATH, '*.log')):
with open(filename, 'rb') as in_:
shutil.copyfileobj(in_, destination)

In your first example:
import sys, string, glob, os
you are not using sys, string or os, so there is no need to import those.
fls = glob.glob(r'/Users/slevclab/Desktop/Acceptability Judgement Task/data/*');
here, you are selecting the subject folders. Since you are interested in summary.log files within these folders, you may change the pattern as follows:
fls = glob.glob('/Users/slevclab/Desktop/Acceptability Judgement Task/data/*/summary.log')
In Python, there is no need to end lines with semicolons.
outfile = open('summary.log','w')
for x in fls:
file = open(x, 'r')
data = file.read()
file.close()
outfile.write(data)
outfile.close()

As VGE's answer shows, your second solution works once you've fixed the syntax error. But note that a more general solution is to use os.walk:
>>> import os
>>> for i in os.walk('foo'):
... print i
...
('foo', ['bar', 'baz'], ['oof.txt'])
('foo/bar', [], ['rab.txt'])
('foo/baz', [], ['zab.txt'])
This goes through all the directories in the tree above the start directory and maintains a nice separation between directories and files.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to read text files in a zipped folder in Python - python

Related

Filter Directory using Regex and output filtered files to another directory

Replace/Add a class file to subfolder in jar using python zipfile command

Unable to remove zipped file after unzipping

How to extract a file within a folder within a zip?

Python File Concatenation

Categories

Resources