I am compiling a script that adds custom properties to PDF files using PdfMerger() from PyPDF2. It worked fine for almost all of the files, but a few fail with an error raised somewhere inside PdfMerger. I don't understand what exactly is causing this error or how to fix it. Here is the entire program - not sure if a snippet alone would be helpful.
import os
import pandas as pd
from PyPDF2 import PdfReader, PdfMerger

df = pd.read_excel('S:\\USERS\\VINTON\\F001A - Item Master (Stock and Cost)- 270001.xlsx')
folder_path = "U:\\BMP"
pdf_files = [os.path.splitext(f)[0] for f in os.listdir(folder_path) if f.endswith('.pdf')]

for EachFile in pdf_files:
    search_value = EachFile
    print(EachFile)
    search_result = df[df['Item Number 02'] == search_value]
    # Find the corresponding value in the "Name" column of the same row
    if not search_result.empty:
        print("Found in JDE")
        Revision = search_result['Rev'].values[0]
        Description = search_result['Item Description 01'].values[0]
        FileName = "U:\\BMP\\" + search_value + ".pdf"
        # Get the file from the BMP folder
        file_in = open(FileName, 'rb')
        pdf_reader = PdfReader(file_in)
        if pdf_reader.is_encrypted:
            print("Encrypted")
            continue
        metadata = pdf_reader.metadata
        # Add the entire existing file to the newly created file
        pdf_merger = PdfMerger()
        pdf_merger.append(file_in)
        pdf_merger.add_metadata({
            '/Revision': Revision,
            '/Description': Description
        })
        file_out = open("S:\\USERS\\VINTON\\BMP-Rev\\" + search_value ".pdf", 'wb')
        pdf_merger.write(file_out)
        file_in.close()
        file_out.close()

print("All Done!!")
I cannot figure out how to overcome the assertion errors, because the traceback shows them occurring several layers below this simplified syntax.
There is a "+" sign missing in this line, before ".pdf":
file_out = open("S:\\USERS\\VINTON\\BMP-Rev\\" + search_value ".pdf", 'wb')
Try this:
file_out = open("S:\\USERS\\VINTON\\BMP-Rev\\" + search_value + ".pdf", 'wb')
Hope it works.
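As an aside, f-strings (Python 3.6+) make these path concatenations harder to get wrong, because there is no "+" to forget - a minimal sketch with the same variable:

file_out = open(f"S:\\USERS\\VINTON\\BMP-Rev\\{search_value}.pdf", 'wb')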
Use try and except statements when reading or merging PDF files, so that failures surface their exception messages instead of crashing the whole run. It is always good practice to surface errors and exceptions when working with files or memory, at least during development.
import os
import pandas as pd
from PyPDF2 import PdfReader, PdfMerger

df = pd.read_excel('S:\\USERS\\VINTON\\F001A - Item Master (Stock and Cost)- 270001.xlsx')
folder_path = "U:\\BMP"
pdf_files = [os.path.splitext(f)[0] for f in os.listdir(folder_path) if f.endswith('.pdf')]

for EachFile in pdf_files:
    search_value = EachFile
    print(EachFile)
    search_result = df[df['Item Number 02'] == search_value]
    # Find the corresponding value in the "Name" column of the same row
    if not search_result.empty:
        print("Found in JDE")
        Revision = search_result['Rev'].values[0]
        Description = search_result['Item Description 01'].values[0]
        FileName = "U:\\BMP\\" + search_value + ".pdf"
        # Get the file from the BMP folder
        file_in = open(FileName, 'rb')
        try:
            pdf_reader = PdfReader(file_in)
            if pdf_reader.is_encrypted:
                print("Encrypted")
                file_in.close()
                continue
            metadata = pdf_reader.metadata
            # Add the entire existing file to the newly created file
            pdf_merger = PdfMerger()
            pdf_merger.append(file_in)
            pdf_merger.add_metadata({
                '/Revision': Revision,
                '/Description': Description
            })
        except Exception as e:
            # report the failing file and skip it rather than writing a bad output
            print(e)
            file_in.close()
            continue
        file_out = open("S:\\USERS\\VINTON\\BMP-Rev\\" + search_value + ".pdf", 'wb')
        pdf_merger.write(file_out)
        file_in.close()
        file_out.close()

print("All Done!!")
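With the continue in the except branch, a corrupt or unreadable PDF is reported and skipped instead of crashing the whole batch or producing a half-written output file, and the printed exception tells you which files to inspect afterwards.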
Related
I'm writing a simple script that loops over some text files and uses a function to replace certain strings, looking them up in a .csv file (every row holds the word to replace and the word I want in its place).
Here is my simple code:
import os
import re
import csv
from colorama import Fore  # needed for the Fore.GREEN / Fore.YELLOW prints below

def substitute_tips(table, tree_content):
    count = 0
    for l in table:
        print("element of the table", l[1])
        reg_tree = re.search(l[1], tree_content)
        if reg_tree is not None:
            #print("match in the tree: ", reg_tree.group())
            tree_content = tree_content.replace(reg_tree.group(), l[0])
            count = count + 1
        else:
            print("Not found: ", l[1])
    print("Substitutions done: ", count)
    return tree_content

path = os.getcwd()
table_name = "162_table.csv"
table = open(table_name)
csv_table = csv.reader(table, delimiter='\t')
for root, dirs, files in os.walk(path, topdown=True):
    for name in files:
        if name.endswith(".tree"):
            print(Fore.GREEN + "Working on treefile", name)
            my_tree = open(name, "r")
            my_tree_content = my_tree.read()
            output_tree = substitute_tips(csv_table, my_tree_content)
            output_file = open(name.rstrip("tree") + "SPECIES_NAME.tre", "w")
            output_file.write(output_tree)
            output_file.close()
        else:
            print(Fore.YELLOW + name, Fore.RED + "doesn't end in .tree")
It's probably very easy, but I'm a newbie.
Thanks!
The files list returned by os.walk contains only the file names rather than the full path names. You should join root with the file names instead to be able to open them:
Change:
my_tree = open(name, "r")
...
output_file = open(name.rstrip("tree") + "SPECIES_NAME.tre", "w")
to:
my_tree = open(os.path.join(root, name), "r")
...
output_file = open(os.path.join(root, name.rstrip("tree") + "SPECIES_NAME.tre"), "w")
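For what it's worth, pathlib expresses the same joins a little more readably - a minimal sketch reusing root, name, and substitute_tips from the loop above (assuming Python 3.5+ for read_text/write_text):

from pathlib import Path

tree_path = Path(root) / name  # full path to the input .tree file
my_tree_content = tree_path.read_text()
output_tree = substitute_tips(csv_table, my_tree_content)
# write the result next to the input, as name.SPECIES_NAME.tre
tree_path.with_suffix(".SPECIES_NAME.tre").write_text(output_tree)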
I need to generate data and save it to a file in a directory, both created at run time, but I get a "File Not Found" error.
I have some data which is created with the method below:
log = AnalyzeLog()
I then need to save that data to a .csv file in a directory. Both the directory and the file are supposed to be created at run time by the code below, but I have not been able to create either...
plot_data_path = "E:\\Malicious_TLS_Detection-master\\M_TLS_Detection\\dataset\\data_model"
dir_name = dataset0

for dir_name in normal_folder_path:
    path_to_single = normal_path + "\\" + dir_name
    __PrintManager__.single_folder_header(path_to_single)
    log.evaluate_features(path_to_single)
    __PrintManager__.succ_single_folder_header()
    log.create_plot_data(plot_data_path, dir_name)
def create_plot_data(self, path, filename):
    __PrintManager__.evaluate_creating_plot()
    self.create_dataset(path, filename)
    __PrintManager__.succ_evaluate_data()

def create_dataset(self, path, filename):
    index = 0
    ssl_flow = 0
    all_flow = 0
    malicious = 0
    normal = 0
    # file header: label feature
    header = [
        'label',
        'avg_domain_name_length',
        'std_domain_name_length',
        'avg_IPs_in_DNS']
    with open(
            path + "\\dataset-" + filename + ".csv", 'w+',
            newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for key in self.conn_tuple:
            label_feature = [
                str(self.conn_tuple[key].is_malicious()),
                str(self.conn_tuple[key].avg_domain_name_length()),
                str(self.conn_tuple[key].std_domain_name_length()),
                str(self.conn_tuple[key].avg_IPs_in_DNS())]
            writer.writerow(label_feature)
    print("<<< dataset file dataset-%s.csv successfully created !" % filename)
The code just breaks at:
with open(
        path + "\\dataset-" + filename + ".csv", 'w+',
        newline='') as f:
with path = "E:\\Malicious_TLS_Detection-master\\M_TLS_Detection\\dataset\\data_model" and filename = "dataset0".
The data should be written to the file in CSV format, but the following error arises:
"No such file or directory: 'E:\Malicious_TLS_Detection-master\M_TLS_Detection\dataset\data_model\dataset-dataset0.csv'"
I have 3 JSON files that need to be parsed by Python.
file1.json
file2.json
file3.json
I have intentionally sabotaged the format in file3.json so it doesn't actually contain correct json formatting.
my code:
import os, json, shutil

fileRoot = 'C:/root/python/'
inputFiles = fileRoot + 'input/'
processed_folder = fileRoot + 'processed/'
error_folder = fileRoot + 'error/'

print("processFiles")
print('inputfiles = ' + inputFiles)

if any(File.endswith(".json") for File in os.listdir(inputFiles)):
    json_files = [pos_json for pos_json in os.listdir(inputFiles) if pos_json.endswith('.json')]
    print('--------------------FILES IN DIRECTORY----------------------')
    print(json_files)
    print('--------------------FILE LOOPING----------------------------')
    for eachfile in json_files:
        print(eachfile)
        with open((inputFiles + eachfile), 'r') as f:
            try:
                data = json.load(f)
            except:
                shutil.move((inputFiles + eachfile), error_folder)
The idea is that if it doesn't parse the JSON, the file should be moved to another folder called 'error'
However, I keep getting errors such as:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:/Python/input/file3.json' -> 'C:/root/Python/input/file3.json'
Why is this happening?
You are opening the files, and they stay open until the with block exits.
As a work-around you can store the files that you want to move in a list:
move_to_error = []
move_to_valid = []
for eachfile in json_files:
    print(eachfile)
    with open((inputFiles + eachfile), 'r') as f:
        try:
            data = json.load(f)
            # if we have an exception in the previous line,
            # the file will not be appended to move_to_valid
            move_to_valid.append(eachfile)
        except:
            move_to_error.append(eachfile)
for eachfile in move_to_error:
    shutil.move((inputFiles + eachfile), error_folder)
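An alternative along the same lines: since the file is closed as soon as the with block exits, you can also move it immediately after the block, inside the same loop - a minimal sketch:

for eachfile in json_files:
    parse_failed = False
    with open(inputFiles + eachfile, 'r') as f:
        try:
            data = json.load(f)
        except ValueError:  # covers malformed JSON
            parse_failed = True
    # the handle is closed here, so the move no longer hits WinError 32
    if parse_failed:
        shutil.move(inputFiles + eachfile, error_folder)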
I am trying to write a script that tracks changes made to the directories/files under several file paths created by an installer. I found Thomas Sileo's DirTools project on git and modified it, but am now running into some issues when writing/reading JSON:
1) First, I believe that I am writing the JSON incorrectly: my create_state() function only writes the last path I index.
2) If I do get it writing, I am unable to read/parse the file like I was before; I usually get ValueError: Extra data errors.
Code below:
import os
import json
import getpass

files = []
subdirs = []
USER = getpass.getuser()
pathMac = ['/Applications/',
           '/Users/' + USER + '/Documents/']

def create_dir_index(path):
    files = []
    subdirs = []
    for root, dirs, filenames in os.walk(path):
        for subdir in dirs:
            subdirs.append(os.path.relpath(os.path.join(root, subdir), path))
        for f in filenames:
            files.append(os.path.relpath(os.path.join(root, f), path))
    return dict(files=files, subdirs=subdirs)
def create_state():
    for count in xrange(len(pathMac)):
        dir_state = create_dir_index(pathMac[count])
        out_file = open("Manifest.json", "w")
        json.dump(dir_state, out_file)
        out_file.close()
def compare_states(dir_base, dir_cmp):
    '''
    return a comparison of two manifest json files
    '''
    data = {}
    data['deleted'] = list(set(dir_cmp['files']) - set(dir_base['files']))
    data['created'] = list(set(dir_base['files']) - set(dir_cmp['files']))
    data['deleted_dirs'] = list(set(dir_cmp['subdirs']) - set(dir_base['subdirs']))
    data['created_dirs'] = list(set(dir_base['subdirs']) - set(dir_cmp['subdirs']))
    return data
if __name__ == '__main__':
    response = raw_input("Would you like to Compare or Create? ")
    if response == "Create":
        # CREATE MANIFEST json file
        create_state()
        print "Manifest file created."
    elif response == "Compare":
        # create the CURRENT state of all indexes in pathMac and write to json file
        for count in xrange(len(pathMac)):
            dir_state = create_dir_index(pathMac[count])
            out_file = open("CurrentState.json", "w")
            json.dump(dir_state, out_file)
            out_file.close()
        # Open and Load the contents from the file into dictionaries
        manifest = json.load(open("Manifest.json", "r"))
        current = json.load(open("CurrentState.json", "r"))
        print compare_states(current, manifest)
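Both symptoms point the same way: opening Manifest.json with "w" on every iteration of the loop overwrites the previous dump, so only the last path survives, and a file holding several JSON documents back to back is exactly what makes json.load() raise ValueError: Extra data. One way out is to collect all the indexes into a single object and dump it once - a minimal sketch reusing create_dir_index and pathMac from above:

def create_state():
    # one dict keyed by path, dumped once, keeps Manifest.json a single JSON document
    states = dict((path, create_dir_index(path)) for path in pathMac)
    with open("Manifest.json", "w") as out_file:
        json.dump(states, out_file)

compare_states would then be applied per path, e.g. compare_states(current[path], manifest[path]).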
I wrote a script to read PDF metadata to ease a task at work. The current working version is not very usable in the long run:
from pyPdf import PdfFileReader

BASEDIR = ''
PDFFiles = []

def extractor():
    output = open('windoutput.txt', 'r+')
    for file in PDFFiles:
        try:
            pdf_toread = PdfFileReader(open(BASEDIR + file, 'r'))
            pdf_info = pdf_toread.getDocumentInfo()
            #print str(pdf_info)  # print full metadata if you want
            x = file + "~" + pdf_info['/Title'] + " ~ " + pdf_info['/Subject']
            print x
            output.write(x + '\n')
        except:
            x = file + '~' + ' ERROR: Data missing or corrupt'
            print x
            output.write(x + '\n')
            pass
    output.close()

if __name__ == "__main__":
    extractor()
Currently, as you can see, I have to manually input the working directory and manually populate the list of PDF files. It also just prints out the data in the terminal in a format that I can copy/paste/separate into a spreadsheet.
I'd like the script to work automatically in whichever directory I throw it in and populate a CSV file for easier use. So far:
from pyPdf import PdfFileReader
import csv
import os

def extractor():
    basedir = os.getcwd()
    extension = '.pdf'
    pdffiles = [filter(lambda x: x.endswith('.pdf'), os.listdir(basedir))]
    with open('pdfmetadata.csv', 'wb') as csvfile:
        for f in pdffiles:
            try:
                pdf_to_read = PdfFileReader(open(f, 'r'))
                pdf_info = pdf_to_read.getDocumentInfo()
                title = pdf_info['/Title']
                subject = pdf_info['/Subject']
                csvfile.writerow([file, title, subject])
                print 'Metadata for %s written successfully.' % (f)
            except:
                print 'ERROR reading file %s.' % (f)
                #output.writerow(x + '\n')
                pass

if __name__ == "__main__":
    extractor()
In its current state it seems to just print a single error message (the one from my exception handler, not an error raised by Python) and then stop. I've been staring at it for a while and I'm not really sure where to go from here. Can anyone point me in the right direction?
writerow([file, title, subject]) should be writerow([f, title, subject])
You can use sys.exc_info() to print the details of your error
http://docs.python.org/2/library/sys.html#sys.exc_info
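For example, a minimal sketch of that applied to the existing except block (the parenthesized print works on Python 2 as well):

import sys

try:
    pdf_to_read = PdfFileReader(open(f, 'r'))
except:
    exc_type, exc_value = sys.exc_info()[:2]
    print('ERROR reading file %s: %s %s' % (f, exc_type, exc_value))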
Did you check that the pdffiles variable contains what you think it does? I was getting a list inside a list... so maybe try:
for files in pdffiles:
    for f in files:
        # do stuff with f
I personally like glob. Notice I add * before the .pdf in the extension variable:
import os
import glob
basedir = os.getcwd()
extension = '*.pdf'
pdffiles = glob.glob(os.path.join(basedir, extension))
Figured it out. The script I used to download the files was saving the files with '\r\n' trailing after the file name, which I didn't notice until I actually ls'd the directory to see what was up. Thanks for everyone's help.