The following Python code extracts images from a PDF file and saves them as jp2 files. The files are named im1.jp2 and im2.jp2, and they seem to be overwritten by the images from the next PDF file in the path.
How can I give the jp2 files a specific name when writing them with write()? E.g. pathname_im1.jp2? Or is it possible to rename them directly afterwards?
from PyPDF2 import PdfReader
from pathlib import Path
pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')
for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for image in page.images:
            with open(image.name, "wb") as fp:
                fp.write(image.data)
There are several good ways to do this. As was pointed out in the comments, you could use enumerate to prefix each file name with an index (note that the index restarts on every page):
from PyPDF2 import PdfReader
from pathlib import Path
pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')
for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for index, image in enumerate(page.images):
            filename = f'{index}{image.name}'
            with open(filename, "wb") as fp:
                fp.write(image.data)
Or you could prepend a timestamp (which I think is better and more reliable across different runs, if you don't have anything better to use).
from datetime import datetime
from PyPDF2 import PdfReader
from pathlib import Path
pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')
# Note this will use the same datetime for all images
date = datetime.now().strftime("%Y%m%d_%H%M%S")
for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for image in page.images:
            filename = f'{date}_{image.name}'
            with open(filename, "wb") as fp:
                fp.write(image.data)
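You can then adapt this to the exact naming scheme you want, for example by adding path.stem (the PDF's file name without its extension) so that images from different PDFs in the same run do not collide.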
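To get names in the pathname_im1.jp2 style the question asks for, one option is to build the file name from the PDF's stem plus a page and image counter. The helper below is only a sketch: build_image_filename and its parameters are hypothetical names, not part of PyPDF2, and the exact layout of the name is an assumption you can change freely.

```python
from pathlib import Path


def build_image_filename(pdf_path, page_number, image_index, image_name):
    """Combine the PDF's stem with page/image counters and the image name.

    Hypothetical helper: the naming layout is an assumption, not a convention
    required by PyPDF2.
    """
    stem = Path(pdf_path).stem  # 'docs/report.pdf' -> 'report'
    return f"{stem}_p{page_number}_{image_index}_{image_name}"


# Example with a made-up path and image name
print(build_image_filename("docs/report.pdf", 1, 0, "im1.jp2"))
```

Inside the loops above, this could be called as build_image_filename(path, page_number, index, image.name), with page_number and index taken from enumerate(reader.pages) and enumerate(page.images).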
I'm looking to retrieve a list of CSV files, and use these names as variables to open and retrieve their content. Something like this:
import csv
import os
files = os.listdir('C:/csvs')
with open(files[0], 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for line in csv_reader:
        if line[1]=="**STAFF**":
            pass
        else:
            print(line)
If I print files[0], I do get the correct content, but when I try the above code it does not work.
os.listdir(directory_path) returns only the names of the files inside the folder. To actually open a file you need its full path (absolute or relative). You can build it by joining each file's name to directory_path like this:
import os
files = os.listdir(directory_path)
full_file_path = os.path.join(directory_path, files[0])
You can also use glob to save the trouble of joining the paths.
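As a rough sketch of the glob variant: glob.glob returns paths that already include the directory, so they can be passed to open directly. The function name read_csv_rows and the skip_value parameter are made up for this illustration; the "**STAFF**" check is taken from the question.

```python
import csv
import glob
import os


def read_csv_rows(directory_path, skip_value="**STAFF**"):
    """Collect rows from every CSV in directory_path, skipping marked rows."""
    rows = []
    # glob returns paths that include directory_path, so no join is needed later
    for csv_path in sorted(glob.glob(os.path.join(directory_path, "*.csv"))):
        with open(csv_path, newline="") as csv_file:
            for line in csv.reader(csv_file):
                # skip rows whose second column holds the marker value
                if len(line) > 1 and line[1] == skip_value:
                    continue
                rows.append(line)
    return rows
```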
The code I have written does not work for some reason.
import pandas as pd
import glob
import zipfile
path = r"C:/Users/nano/Documents/Project" # use your path
all_files = glob.glob(path + "/*.gz")
for folder in all_files:
    with zipfile.ZipFile(folder,"r") as zip_ref:
        zip_ref.extractall(path)
First, you are using the zip library on gzip files, so you need to switch to the right library. Below is a working example of the code.
import glob
import os
import gzip
path = r"C:/Temp/Unzip" # use your path
all_files = glob.glob(path + "/*.gz")
print(all_files)
for file in all_files:
    path, filename = os.path.split(file)
    filename = os.path.splitext(filename)[0]  # drop the trailing ".gz"
    with gzip.open(file,"rb") as gz:
        with open('{0}/{1}.csv'.format(path, filename), 'wb') as cv:
            cv.write(gz.read())  # write() takes bytes; writelines() would fail here
gzip (.gz) and zip (.zip) are two different things. For gzip, you can use gzip:
import glob
import gzip
import shutil
path = r"C:/Users/shedez/Documents/Project" # use your path
all_files = glob.glob(path + "/*.gz")
for folder in all_files:
    dst=folder[:-3] # destination file name
    with gzip.open(folder, 'rb') as f_in, open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
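The same idea can be wrapped in a small reusable function. This is a minimal sketch, assuming the output should sit next to the input with the .gz suffix dropped; gunzip_file is a hypothetical name, and pathlib is used here instead of string slicing.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress one .gz file next to the original and return the new path."""
    gz_path = Path(gz_path)
    dst = gz_path.with_suffix("")  # 'data.csv.gz' -> 'data.csv'
    with gzip.open(gz_path, "rb") as f_in, open(dst, "wb") as f_out:
        # stream the data in chunks instead of loading it all into memory
        shutil.copyfileobj(f_in, f_out)
    return dst
```

It could then be applied to every match, e.g. for gz in Path(path).glob("*.gz"): gunzip_file(gz).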
If your files are in gz (gzip) format, you might want to look at the gzip package. I'm not aware of an extract method, but you can do something like the following using only pandas, which I find more convenient:
for folder in all_files:
    c = pd.read_csv(folder, compression='gzip')
    c.to_csv(folder[:-2] + "csv")  # folder already contains the full path
The [:-2] cuts off the "gz". You might want to change the parameters of read_csv (for example, how the header row is handled) or the arguments of to_csv (for example index=False, to keep pandas from writing the row index as an extra column).
Alternatively, you could decompress it with gzip and shutil:
import gzip
import shutil
# folder is the path of one .gz file, as in the loop above
with gzip.open(folder, 'rb') as f_in, open(folder[:-2]+"csv", 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
Try out this code:
import os, zipfile
dir_name = 'C:\\Users\\shedez\\Documents\\Project' # ZIP location
extract_dir_name = 'C:\\Users\\shedez\\Documents\\Project\\Unziped' # CSV location after unzip
extension = ".zip" # you might have to change this
os.chdir(dir_name) # change directory from working dir to dir with files
for item in os.listdir(dir_name): # loop through items in dir
    if item.endswith(extension): # check for ".zip" extension
        file_name = os.path.abspath(item) # get full path of files
        zip_ref = zipfile.ZipFile(file_name) # create zipfile object
        zip_ref.extractall(extract_dir_name) # extract file to dir
        zip_ref.close() # close file
If you want to learn more about zipfile, see the documentation for Python's zipfile module.
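A variant of the same loop that avoids os.chdir and closes each archive through a with block might look like this. It is a sketch only: extract_all_zips is a hypothetical name, and it assumes the archives really are .zip files (for .gz files the gzip module is needed instead).

```python
import zipfile
from pathlib import Path


def extract_all_zips(src_dir, dest_dir):
    """Extract every .zip in src_dir into dest_dir; return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if missing
    extracted = []
    for zip_path in sorted(Path(src_dir).glob("*.zip")):
        # the with block closes the archive even if extraction fails
        with zipfile.ZipFile(zip_path) as zip_ref:
            zip_ref.extractall(dest)
            extracted.extend(zip_ref.namelist())
    return extracted
```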
I have several sub-folders, each containing zipped Twitter files. I want Python to iterate through these sub-folders and turn the files into regular JSON files.
I have more than 300 sub-folders, each containing about 1000 or more of these zipped files.
A sample file name looks like this:
00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D
Thanks in advance
I have tried the code below, just to see whether I can extract one of those files, but none of the attempts worked.
import zipfile
zip_ref = zipfile.ZipFile('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0', 'r')
zip_ref.extractall('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D')
zip_ref.close()
I have also tried:
import tarfile
tar = tarfile.open('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D')
tar.extractall()
tar.close()
Here is my third attempt (also without luck):
import gzip
import json
with gzip.open('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D',
               'rb') as f:
    d = json.loads(f.read().decode("utf-8"))
There is another very similar thread on Stack Overflow, but my question is different in that my zipped file is originally JSON, and when I use this last method I get this error:
Exception has occurred: json.decoder.JSONDecodeError
Expecting value: line 1 column 1 (char 0)
A simple script that answers the question: it walks the directory tree, checks whether each file (fname) is a gzip file (via the magic number, because I'm cynical) and decompresses it.
import json
import gzip
import binascii
import os
def is_gz_file(filepath):
    # gzip files start with the two magic bytes 0x1f 0x8b
    with open(filepath, 'rb') as test_f:
        return binascii.hexlify(test_f.read(2)) == b'1f8b'
rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        filepath = os.path.join(dirName,fname)
        if is_gz_file(filepath):
            with gzip.open(filepath, 'rb') as f:
                json_content = json.loads(f.read())
            print(json_content)
Tested and it works.
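The script above prints the parsed content. If the goal is to write regular JSON files to disk, as the question describes, a sketch along the following lines could work. The names output_name and gunzip_tree are hypothetical, and cutting the file name after ".json" is an assumption based on the sample name in the question. The data is copied as-is without being parsed, so it does not depend on json.loads accepting the content.

```python
import gzip
import os
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gz_file(filepath):
    with open(filepath, "rb") as test_f:
        return test_f.read(2) == GZIP_MAGIC


def output_name(fname):
    """Keep the name up to and including '.json' (assumed naming scheme)."""
    head, sep, _ = fname.partition(".json")
    candidate = head + sep
    # fall back to a distinct name so the source file is never overwritten
    return candidate if sep and candidate != fname else fname + ".out"


def gunzip_tree(root_dir):
    """Decompress every gzip file under root_dir next to its source."""
    written = []
    for dir_name, _subdirs, file_list in os.walk(root_dir):
        for fname in file_list:
            src = os.path.join(dir_name, fname)
            if not is_gz_file(src):
                continue
            dst = os.path.join(dir_name, output_name(fname))
            with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
            written.append(dst)
    return written
```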
I have this password protected zip folder:
folder_1\1.zip
When I extract this it gives me
1\image.png
How can I extract this to another folder without its folder name? Just the contents of it: image.png
So far I have tried every Stack Overflow solution I could find, and I have spent 11 hours straight trying to solve this.
import zipfile
zip = zipfile.ZipFile('C:\\Users\\Desktop\\folder_1\\1.zip', 'r')
zip.setpassword(b"virus")
zip.extractall('C:\\Users\\Desktop')  # target dir to extract all contents
zip.close()
EDIT:
This code worked for me. (Now I want to extract from many paths at once; any ideas?)
import os
import shutil
import zipfile
my_dir = r"C:\Users\Desktop"
my_zip = r"C:\Users\Desktop\test\folder_1\1.zip"
with zipfile.ZipFile(my_zip) as zip_file:
    zip_file.setpassword(b"virus")
    for member in zip_file.namelist():
        filename = os.path.basename(member)
        # skip directories
        if not filename:
            continue
        # copy file (taken from zipfile's extract)
        source = zip_file.open(member)
        target = open(os.path.join(my_dir, filename), "wb")  # file() only exists in Python 2
        with source, target:
            shutil.copyfileobj(source, target)
You can use the ZipFile.read() method to read the specific file in the archive, open your target file for writing by joining the target directory with the base name of the source file, and then write what you read to it:
import zipfile
import os
zip = zipfile.ZipFile('C:\\Users\\Desktop\\folder_1\\1.zip', 'r')
zip.setpassword(b"virus")
for name in zip.namelist():
    if not name.endswith(('/', '\\')):
        with open(os.path.join('C:\\Users\\Desktop', os.path.basename(name)), 'wb') as f:
            f.write(zip.read(name))
zip.close()
And if you have several paths containing 1.zip for extraction:
import zipfile
import os
for path in 'C:\\Users\\Desktop\\folder_1', 'C:\\Users\\Desktop\\folder_2', 'C:\\Users\\Desktop\\folder_3':
    zip = zipfile.ZipFile(os.path.join(path, '1.zip'), 'r')
    zip.setpassword(b"virus")
    for name in zip.namelist():
        if not name.endswith(('/', '\\')):
            with open(os.path.join('C:\\Users\\Desktop', os.path.basename(name)), 'wb') as f:
                f.write(zip.read(name))
    zip.close()
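To avoid repeating the loop body for every archive, the flattening step could be moved into a function. The following is a sketch under a few assumptions: extract_flat is a hypothetical name, files with the same base name in different archive folders will overwrite each other, and the password (if any) is one that the standard zipfile module can handle.

```python
import os
import shutil
import zipfile


def extract_flat(zip_path, target_dir, password=None):
    """Extract all files from zip_path into target_dir, dropping folder names."""
    os.makedirs(target_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        if password is not None:
            zf.setpassword(password)  # must be bytes, e.g. b"virus"
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries such as '1/'
            dst = os.path.join(target_dir, os.path.basename(info.filename))
            with zf.open(info) as source, open(dst, "wb") as target:
                shutil.copyfileobj(source, target)
            written.append(dst)
    return written
```

With that in place, several archives can be handled with a plain loop, for example calling extract_flat(os.path.join(path, '1.zip'), 'C:\\Users\\Desktop', b"virus") for each path.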
import glob2
from datetime import datetime
filenames = glob2.glob("*.txt")
with open(datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")+".txt", 'w') as file:
    for filename in filenames:
        with open(filename, "r") as f:
            file.write(f.read() + "\n")
I was working in Python and came across the name glob. I googled it and couldn't find an answer. What does glob do, and what is it used for?
From the glob docs:
"The glob module finds all the pathnames matching a specified pattern(...)"
Skipping the imports (import glob2 and from datetime import datetime), the script works as follows.
Get the names of all files in the current directory that have any name and the extension .txt:
filenames = glob2.glob("*.txt")
Open a new file whose name is the current datetime, in the format given to strftime, with write access, as the variable 'file':
with open(datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")+".txt", 'w') as file:
For each of the found files, whose names / paths are stored in the filenames variable:
    for filename in filenames:
Open that file for read access as f:
        with open(filename, "r") as f:
Write the entire content of f into file and add \n at the end (\n = new line):
            file.write(f.read() + "\n")
I also saw the "glob2" module used in a Kaggle notebook and looked into how it differs from "glob".
All features of "glob2" are now in the "glob" implementation included with current Python.
So there is no reason to use "glob2" anymore.
As for what glob does in general, BlueTomato already provided a nice link and description.
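For illustration, the script from the question can be written with the standard glob module alone. This is a sketch: concatenate_files and its parameters are made-up names. The recursive flag relies on glob's support for "**" patterns, which has been in the standard library since Python 3.5.

```python
import glob
import os


def concatenate_files(pattern, out_path, recursive=False):
    """Write the content of every file matching pattern into out_path."""
    filenames = sorted(glob.glob(pattern, recursive=recursive))
    # never read the output file itself, in case it also matches the pattern
    out_abs = os.path.abspath(out_path)
    filenames = [name for name in filenames if os.path.abspath(name) != out_abs]
    with open(out_path, "w") as out_file:
        for filename in filenames:
            with open(filename, "r") as f:
                out_file.write(f.read() + "\n")
    return filenames
```

A call such as concatenate_files("**/*.txt", "merged.txt", recursive=True) would also pick up text files in sub-folders.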