grab next .zip file in folder (iterate through zip directory) - python

Below is my most recent attempt; but alas, when I print current_file it is always the same (first) .zip file in my directory. Why is that, and how can I iterate to get to the next file in my zip directory?
My DIRECTORY_LOCATION has 4 zip files in it.
def find_file(cls):
    listOfFiles = os.listdir(config.DIRECTORY_LOCATION)
    total_files = 0
    for entry in listOfFiles:
        total_files += 1
        # if fnmatch.fnmatch(entry, pattern):
        current_file = entry
        print(current_file)
        """Finds the excel file to process"""
        archive = ZipFile(config.DIRECTORY_LOCATION + "/" + current_file)
        for file in archive.filelist:
            if file.filename.__contains__('Contact Frog'):
                return archive.extract(file.filename, config.UNZIP_LOCATION)
        return FileNotFoundError
find_file usage:
excel_data = pandas.read_excel(self.find_file())
Update:
I just tried changing return to yield at:

yield archive.extract(file.filename, config.UNZIP_LOCATION)

and now I'm getting the error below at my find_file line:

ValueError: Invalid file path or buffer object type: <class 'generator'>

I then altered it to use the generator object as suggested in the comments, i.e.:

generator = self.find_file(); excel_data = pandas.read_excel(generator())

and now I'm getting this error:
generator = self.find_file(); excel_data = pandas.read_excel(generator())
TypeError: 'generator' object is not callable
Here is my /main.py in case it helps:
"""Start Point"""
from data.find_pending_records import FindPendingRecords
from vital.vital_entry import VitalEntry
import sys
import os
import config
import datetime
# from csv import DictWriter
if __name__ == "__main__":
try:
for file in os.listdir(config.DIRECTORY_LOCATION):
if 'VCCS' in file:
PENDING_RECORDS = FindPendingRecords().get_excel_data()
# Do operations on PENDING_RECORDS
# Reads excel to map data from excel to vital
MAP_DATA = FindPendingRecords().get_mapping_data()
# Configures Driver
VITAL_ENTRY = VitalEntry()
# Start chrome and navigate to vital website
VITAL_ENTRY.instantiate_chrome()
# Begin processing Records
VITAL_ENTRY.process_records(PENDING_RECORDS, MAP_DATA)
except:
print("exception occured")
raise

It is not tested.
def find_file(cls):
    listOfFiles = os.listdir(config.DIRECTORY_LOCATION)
    total_files = 0
    for entry in listOfFiles:
        total_files += 1
        # if fnmatch.fnmatch(entry, pattern):
        current_file = entry
        print(current_file)
        """Finds the excel file to process"""
        archive = ZipFile(config.DIRECTORY_LOCATION + "/" + current_file)
        for file in archive.filelist:
            if file.filename.__contains__('Contact Frog'):
                yield archive.extract(file.filename, config.UNZIP_LOCATION)
This is just your function rewritten with yield instead of return.
I think it should be used in the following way:

for extracted_archive in self.find_file():
    excel_data = pandas.read_excel(extracted_archive)
    # do whatever you want to do with excel_data here
self.find_file() is a generator and should be used like an iterator (read this answer for more details).
Try to integrate the previous loop into your main script. On each iteration of the loop it will read a different file into excel_data, so in the body of the loop you should also do whatever you need to do with the data.
Not sure what you mean by:
just one each time the script is executed
Even with yield, if you execute the script multiple times, you will always start from the beginning (and always get the first file). You should read all of the files in the same execution.
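For instance, here is a minimal sketch of what the integrated loop could look like in your main script (it assumes find_file is a method on the FindPendingRecords class that main.py already imports, and that pandas is imported as in the question; that wiring is a guess):

for extracted_path in FindPendingRecords().find_file():
    excel_data = pandas.read_excel(extracted_path)
    # one extracted excel file is processed per loop iteration
    print(excel_data.head())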


Is it possible to create a python script that looks for files in a directory on a given time daily?

So basically, I'm creating a directory that users can put CSV files into. I want a Python script that looks in that folder every day at a given time (let's say noon) and picks up the latest file placed there, provided it's not over a day old. I'm not sure if that's possible.
It's this chunk of code that I would like to run when the app finds a new file in the desired directory:
def better_Match(results, best_percent = "Will need to get the match %"):
    result = {}
    result_list = [{item.name:item.text for item in result} for result in results]
    if result_list:
        score_list = [float(item['score']) for item in result_list]
        match_index = max(enumerate(score_list),key=lambda x: x[1])[0]
        logger.debug('MRCs:{}, Chosen MRC:{}'.format(score_list,score_list[match_index]))
        logger.debug(result_list[match_index])
        above_threshold = float(result_list[match_index]['score']) >= float(best_percent)
        if above_threshold:
            result = result_list[match_index]
    return result

def clean_plate_code(platecode):
    return str(platecode).lstrip('0').zfill(5)[:5]

def re_ch(file_path, orig_data, return_columns = ['ex_opbin']):
    list_of_chunk_files = list(file_path.glob('*.csv'))
    cb_ch = [pd.read_csv(f, sep=None, dtype=object, engine='python') for f in tqdm(list_of_chunk_files, desc='Combining ch', unit='chunk')]
    cb_ch = pd.concat(cb_ch)
    shared_columns = [column_name.replace('req_','') for column_name in cb_ch.columns if column_name.startswith('req_')]
    cb_ch.columns = cb_ch.columns.str.replace("req_", "")
    return_columns = return_columns + shared_columns
    cb_ch = cb_ch[return_columns]
    for column in shared_columns:
        cb_ch[column] = cb_ch[column].astype(str)
        orig_data[column] = orig_data[column].astype(str)
    final = orig_data.merge(cb_ch, how='left', on=shared_columns)
    return final
For running the script at a certain time: on Linux you can use cron; on Windows you can use the Task Scheduler.
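For example, a crontab entry that runs the script every day at noon could look like this (the interpreter and script paths are hypothetical):

0 12 * * * /usr/bin/python3 /path/to/check_folder.py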
Here is an example of getting the latest file in a directory:
files = os.listdir(output_folder)
files = [os.path.join(output_folder, file) for file in files]
files = [file for file in files if os.path.isfile(file)]
latest_file = max(files, key=os.path.getctime)
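To also honour the "not over a day old" requirement, here is a minimal sketch built on the snippet above (it assumes latest_file from the previous lines):

import os
import time

age_seconds = time.time() - os.path.getctime(latest_file)
if age_seconds <= 24 * 60 * 60:
    print('Processing', latest_file)  # file is less than a day old
else:
    print('Latest file is older than a day, skipping')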
This will do the job!
import os
import time
import threading
import pandas as pd

DIR_PATH = 'DIR_PATH_HERE'

def create_csv_file():
    # create a files.csv file that will contain all the current files
    # This will run one time only
    if not os.path.exists('files.csv'):
        list_of_files = os.listdir(DIR_PATH)
        list_of_files.append('files.csv')
        pd.DataFrame({'files': list_of_files}).to_csv('files.csv')

def check_for_new_files():
    create_csv_file()
    files = pd.read_csv('files.csv')
    list_of_files = os.listdir(DIR_PATH)
    if len(files.files) != len(list_of_files):
        print('New file added')
        # do what you want
        # save your excel with the name sample.xlsx
        # append your excel to the list of files, then take the set so
        # sample.xlsx is not counted twice if this runs again
        list_of_files.append('sample.xlsx')
        list_of_files = list(set(list_of_files))
        # save the current list of files again
        pd.DataFrame({'files': list_of_files}).to_csv('files.csv')
    print('Finished for the day!')

ticker = threading.Event()
# Run the check every 86400 seconds = 24h
while not ticker.wait(86400):
    check_for_new_files()
It basically uses threading to check for new files every 86400 seconds (24 hours). It saves all the current files of the directory into a files.csv stored next to the .py file, then each day checks for files that do not yet exist in the CSV and appends them to files.csv.

How to fix the error displayed on the Python shell?

I have Python code that allows the user to select the path of a folder containing PDF files and converts them to text files.
The system works perfectly when the content is not Arabic.
The error displayed:

Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 32, in <module>
    path=list[i]
IndexError: list index out of range
code:
import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path

print "\n"
folder = check_path("Provide absolute path for the folder: ")

list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            list.append(t)

m=len(list)
i=0
while i<=len(list):
    path=list[i]
    head,tail=os.path.split(path)
    var="\\"
    tail=tail.replace(".pdf",".txt")
    name=head+var+tail
    content = ""
    # Load PDF into pyPDF
    ##pdf = pyPdf.PdfFileReader(file(path, "rb"))
    pdf = pyPdf.PdfFileReader(codecs.open(path, "rb", encoding='UTF-8'))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f=open(name,'w')
    f.decode(content.encode('UTF-8'))
    ## f.write(content.encode("UTF-8"))
    f.write(content)
    f.close
The error can probably be solved by just changing

while i<=len(list):

to:

while i<len(list):

because in Python the allowed indices for a list with N elements are 0, 1, ..., N-1; trying to access element N gives an IndexError.
If a list's last index is n, then the len of the list is n+1.
This means you do NOT want to access list[len(list)], i.e. element n+1, as it does not exist!
I believe the only wrong line in your code is the while; it should be:

while i < len(list):

and not

while i <= len(list):

You do not want i to take the value len(list).
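More idiomatically, you can avoid the index bookkeeping entirely by iterating over the list directly. Here is a sketch using the question's own names (this also sidesteps the separate problem that the inner for i in range(...) loop reuses i):

for pdf_path in list:
    head, tail = os.path.split(pdf_path)
    name = head + "\\" + tail.replace(".pdf", ".txt")
    # ... convert one PDF per iteration, as in the original loop body ...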

Can't get text to append to file in python

OK, so this snippet of code is an HTTP response inside of a Flask server. I don't think this information will be of any use, but it's there if you need to know it.
This code is supposed to read in the name from the POST request and write it to a file.
Then it checks a file called saved.txt, which is stored in the FILES dictionary.
If we do not find our filename in the saved.txt file, we append the filename to the saved file.
The APIResponse function is just a JSON dump.
At the moment it doesn't seem to be appending at all. The file is written just fine, but the append doesn't go through.
Also, by the way, this is being run on Linino, which is just a distribution of Linux.
def post(self):
    try:
        ## Create the filepath so we can use this for multiple schedules
        filename = request.form["name"] + ".txt"
        path = "/mnt/sda1/arduino/www/"
        filename_path = path + filename
        # Get the data from the request
        schedule = request.form["schedule"]
        replacement_value = schedule
        # write the schedule to the file
        writefile(filename_path, replacement_value)
        # append the file base name to the saved file
        append = True
        schedule_names = readfile(FILES['saved']).split(" ")
        for item in schedule_names:
            if item == filename:
                append = False
        if append:
            append_to = FILES['saved']
            filename_with_space = filename + " "
            append(append_to, filename_with_space)
        return APIResponse({
            'success': "Successfully modified the mode."
        })
    except:
        return APIResponse({
            'error': "Failed to modify the mode"
        })
Here are the requested functions
def writefile(filename, data):
    # Opens a file.
    sdwrite = open(filename, 'w')
    # Writes to the file.
    sdwrite.write(data)
    # Close the file.
    sdwrite.close()
    return

def readfile(filename):
    # Opens a file.
    sdread = open(filename, 'r')
    # Reads the file's contents.
    blah = sdread.readline()
    # Close the file.
    sdread.close()
    return blah

def append(filename, data):
    ## use mode a for appending
    sdwrite = open(filename, 'a')
    ## append the data to the file
    sdwrite.write(data)
    sdwrite.close()
Could it be that the bool object append and the function name append are the same? When I tried it, Python complained with "TypeError: 'bool' object is not callable"
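A minimal sketch of that fix: rename the flag so the name append still refers to the function when it is called:

should_append = True
schedule_names = readfile(FILES['saved']).split(" ")
for item in schedule_names:
    if item == filename:
        should_append = False
if should_append:
    append(FILES['saved'], filename + " ")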

Script that reads PDF metadata and writes to CSV

I wrote a script to read PDF metadata to ease a task at work. The current working version is not very usable in the long run:
from pyPdf import PdfFileReader

BASEDIR = ''
PDFFiles = []

def extractor():
    output = open('windoutput.txt', 'r+')
    for file in PDFFiles:
        try:
            pdf_toread = PdfFileReader(open(BASEDIR + file, 'r'))
            pdf_info = pdf_toread.getDocumentInfo()
            #print str(pdf_info)  # print full metadata if you want
            x = file + "~" + pdf_info['/Title'] + " ~ " + pdf_info['/Subject']
            print x
            output.write(x + '\n')
        except:
            x = file + '~' + ' ERROR: Data missing or corrupt'
            print x
            output.write(x + '\n')
            pass
    output.close()

if __name__ == "__main__":
    extractor()
Currently, as you can see, I have to manually input the working directory and manually populate the list of PDF files. It also just prints out the data in the terminal in a format that I can copy/paste/separate into a spreadsheet.
I'd like the script to work automatically in whichever directory I throw it in and populate a CSV file for easier use. So far:
from pyPdf import PdfFileReader
import csv
import os

def extractor():
    basedir = os.getcwd()
    extension = '.pdf'
    pdffiles = [filter(lambda x: x.endswith('.pdf'), os.listdir(basedir))]
    with open('pdfmetadata.csv', 'wb') as csvfile:
        for f in pdffiles:
            try:
                pdf_to_read = PdfFileReader(open(f, 'r'))
                pdf_info = pdf_to_read.getDocumentInfo()
                title = pdf_info['/Title']
                subject = pdf_info['/Subject']
                csvfile.writerow([file, title, subject])
                print 'Metadata for %s written successfully.' % (f)
            except:
                print 'ERROR reading file %s.' % (f)
                #output.writerow(x + '\n')
                pass

if __name__ == "__main__":
    extractor()
In its current state it seems to just print a single error message (the message from my exception handler, not an error raised by Python) and then stop. I've been staring at it for a while and I'm not really sure where to go from here. Can anyone point me in the right direction?
writerow([file, title, subject]) should be writerow([f, title, subject])
You can use sys.exc_info() to print the details of your error
http://docs.python.org/2/library/sys.html#sys.exc_info
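For example (a minimal Python 2-style sketch matching the script above; pdf_to_read and f are the names from the question):

import sys

try:
    pdf_info = pdf_to_read.getDocumentInfo()
except:
    exc_type, exc_value, exc_tb = sys.exc_info()
    print 'ERROR reading file %s: %s: %s' % (f, exc_type.__name__, exc_value)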
Did you check that the pdffiles variable contains what you think it does? I was getting a list inside a list... so maybe try:

for files in pdffiles:
    for f in files:
        # do stuff with f
I personally like glob. Notice I add * before the .pdf in the extension variable:

import os
import glob

basedir = os.getcwd()
extension = '*.pdf'
pdffiles = glob.glob(os.path.join(basedir, extension))
Figured it out. The script I used to download the files was saving the files with '\r\n' trailing after the file name, which I didn't notice until I actually ls'd the directory to see what was up. Thanks for everyone's help.
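For anyone hitting the same thing, a defensive one-liner that strips stray whitespace from the listed names (assuming pdffiles is a flat list of name strings):

pdffiles = [name.strip() for name in pdffiles]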

File naming problem with Python

I am trying to iterate through a number .rtf files and for each file: read the file, perform some operations, and then write new files into a sub-directory as plain text files with the same name as the original file, but with .txt extensions. The problem I am having is with the file naming.
If a file is named foo.rtf, I want the new file in the subdirectory to be foo.txt. Here is my code:
import glob
import os
import numpy as np

dir_path = '/Users/me/Desktop/test/'
file_suffix = '*.rtf'
output_dir = os.mkdir('sub_dir')

for item in glob.iglob(dir_path + file_suffix):
    with open(item, "r") as infile:
        reader = infile.readlines()
    matrix = []
    for row in reader:
        row = str(row)
        row = row.split()
        row = [int(value) for value in row]
        matrix.append(row)
    np_matrix = np.array(matrix)
    inv_matrix = np.transpose(np_matrix)
    new_file_name = item.replace('*.rtf', '*.txt')  # i think this line is the problem?
    os.chdir(output_dir)
    with open(new_file_name, mode="w") as outfile:
        outfile.write(inv_matrix)
When I run this code, I get a TypeError:

TypeError: coercing to Unicode: need string or buffer, NoneType found
How can I fix my code to write new files into a subdirectory and change the file extensions from .rtf to .txt? Thanks for the help.
Instead of item.replace, check out some of the functions in the os.path module (http://docs.python.org/library/os.path.html). They're made for splitting up and recombining parts of filenames. For instance, os.path.splitext will split a pathname into everything up to the extension, plus the extension itself.
Let's say you have a file /tmp/foo.rtf and you want to move it to /tmp/foo.txt:
old_file = '/tmp/foo.rtf'
(file,ext) = os.path.splitext(old_file)
print 'File=%s Extension=%s' % (file,ext)
new_file = '%s%s' % (file,'.txt')
print 'New file = %s' % (new_file)
Or if you want the one line version:
old_file = '/tmp/foo.rtf'
new_file = '%s%s' % (os.path.splitext(old_file)[0],'.txt')
I've never used glob, but here's an alternative way without using a module:
You can easily strip the suffix using
name = name[:name.rfind('.')]
and then add the new suffix:
name = name + '.txt'
Why not use a function?

def change_suffix(string, new_suffix):
    i = string.rfind('.')
    if i < 0:
        raise ValueError, 'string does not have a suffix'
    if not new_suffix[0] == '.':
        new_suffix = '.' + new_suffix
    return string[:i] + new_suffix
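A quick usage check (Python 2 prints, matching the rest of the thread):

print change_suffix('foo.rtf', 'txt')   # foo.txt
print change_suffix('foo.rtf', '.txt')  # foo.txt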
glob.iglob() yields pathnames without the character '*'; therefore your line should be:

new_file_name = item.replace('.rtf', '.txt')
consider working with clearer names (reserve 'filename' for a file name and use 'path' for a complete path to a file; use 'path_original' instead of 'item'), os.extsep ('.' in Windows) and os.path.splitext():
path_txt = os.extsep.join([os.path.splitext(path_original)[0], 'txt'])
now the best hint of all:
numpy can probably read your file directly:
data = np.genfromtxt(filename, unpack=True)
(see also here)
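A minimal sketch of that approach, assuming each .rtf file really contains plain whitespace-separated integers:

import numpy as np

data = np.genfromtxt('foo.rtf', dtype=int)  # read the matrix
np.savetxt('foo.txt', data.T, fmt='%d')     # write its transpose as text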
To better understand where your TypeError comes from, wrap your code in the following try/except block:

try:
    (your code)
except:
    import traceback
    traceback.print_exc()
