I have a Python script that lets the user select a folder containing PDF files and converts each PDF to a text file.
The script works perfectly when the content is not Arabic.
Error displayed:
Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 32, in <module>
    path=list[i]
IndexError: list index out of range
code:
import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime
def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path
print "\n"
folder = check_path("Provide absolute path for the folder: ")
list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            list.append(t)
m=len(list)
i=0
while i<=len(list):
    path=list[i]
    head,tail=os.path.split(path)
    var="\\"
    tail=tail.replace(".pdf",".txt")
    name=head+var+tail
    content = ""
    # Load PDF into pyPDF
    ##pdf = pyPdf.PdfFileReader(file(path, "rb"))
    pdf = pyPdf.PdfFileReader(codecs.open(path, "rb", encoding='UTF-8'))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f=open(name,'w')
    f.decode(content.encode('UTF-8'))
    ## f.write(content.encode("UTF-8"))
    f.write(content)
    f.close
The error can probably be solved by changing
while i<=len(list):
to:
while i<len(list):
because in Python the valid indices for a list with N elements are
0,1,...,N-1
and trying to access element N raises an IndexError.
If a list's last index is n, then the length of the list is n+1.
This means you do NOT want to access list[len(list)], i.e. index n+1, because that element does not exist!
I believe the only wrong line in your code is the while line. It should be:
while i < len(list):
and not
while i <= len(list):
You do not want i to take the value len(list).
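To make the off-by-one concrete, here is a minimal sketch written for Python 3. The list contents and the helper names txt_names and last_valid_index are made up for illustration and are not part of the original script. It shows that index len(list) is out of range, and that a plain for loop avoids the manual index bookkeeping altogether:

```python
import os

# Hypothetical file names, used only for illustration.
pdf_paths = ["a.pdf", "b.pdf", "c.pdf"]


def txt_names(paths):
    """Return the .txt name for every .pdf path, without manual indexing."""
    names = []
    for pdf_path in paths:  # iterates over exactly len(paths) items
        base, _ext = os.path.splitext(pdf_path)
        names.append(base + ".txt")
    return names


def last_valid_index(items):
    """The largest index that can be used with a non-empty list."""
    return len(items) - 1


try:
    pdf_paths[len(pdf_paths)]  # index N does not exist for N elements
except IndexError as exc:
    error_message = str(exc)

print(txt_names(pdf_paths))
print(error_message)
```

Iterating directly over the list also sidesteps the problem of reusing i for both the outer and the inner loop.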
Below is my most recent attempt. When I print current_file, it is always the same (first) .zip file in my directory.
Why is that, and how can I iterate so that I get to the next file in my zip directory?
My DIRECTORY_LOCATION has 4 zip files in it.
def find_file(cls):
    listOfFiles = os.listdir(config.DIRECTORY_LOCATION)
    total_files = 0
    for entry in listOfFiles:
        total_files += 1
        # if fnmatch.fnmatch(entry, pattern):
        current_file = entry
        print (current_file)
        """"Finds the excel file to process"""
        archive = ZipFile(config.DIRECTORY_LOCATION + "/" + current_file)
        for file in archive.filelist:
            if file.filename.__contains__('Contact Frog'):
                return archive.extract(file.filename, config.UNZIP_LOCATION)
    return FileNotFoundError
find_file usage:
excel_data = pandas.read_excel(self.find_file())
Update:
I tried changing return to yield at:
yield archive.extract(file.filename, config.UNZIP_LOCATION)
and now I get the error below at the line that calls find_file:
ValueError: Invalid file path or buffer object type: <class 'generator'>
Then I changed the call to use the generator object, as suggested in the comments, i.e.:
generator = self.find_file(); excel_data = pandas.read_excel(generator())
and now I get this error:
generator = self.find_file(); excel_data = pandas.read_excel(generator())
TypeError: 'generator' object is not callable
Here is my /main.py, if helpful:
"""Start Point"""
from data.find_pending_records import FindPendingRecords
from vital.vital_entry import VitalEntry
import sys
import os
import config
import datetime
# from csv import DictWriter
if __name__ == "__main__":
    try:
        for file in os.listdir(config.DIRECTORY_LOCATION):
            if 'VCCS' in file:
                PENDING_RECORDS = FindPendingRecords().get_excel_data()
                # Do operations on PENDING_RECORDS
                # Reads excel to map data from excel to vital
                MAP_DATA = FindPendingRecords().get_mapping_data()
                # Configures Driver
                VITAL_ENTRY = VitalEntry()
                # Start chrome and navigate to vital website
                VITAL_ENTRY.instantiate_chrome()
                # Begin processing Records
                VITAL_ENTRY.process_records(PENDING_RECORDS, MAP_DATA)
    except:
        print("exception occured")
        raise
It is not tested.
def find_file(cls):
    listOfFiles = os.listdir(config.DIRECTORY_LOCATION)
    total_files = 0
    for entry in listOfFiles:
        total_files += 1
        # if fnmatch.fnmatch(entry, pattern):
        current_file = entry
        print (current_file)
        """"Finds the excel file to process"""
        archive = ZipFile(config.DIRECTORY_LOCATION + "/" + current_file)
        for file in archive.filelist:
            if file.filename.__contains__('Contact Frog'):
                yield archive.extract(file.filename, config.UNZIP_LOCATION)
This is just your function rewritten with yield instead of return.
I think it should be used in the following way:
for extracted_archive in self.find_file():
    excel_data = pandas.read_excel(extracted_archive)
    # do whatever you want to do with excel_data here
self.find_file() returns a generator, which should be used like an iterator (read this answer for more details).
Try to integrate the loop above into your main script. On each iteration it reads a different file into excel_data, so the body of the loop should also do whatever you need to do with the data.
Not sure what you mean by:
just one each time the script is executed
Even with yield, if you execute the script multiple times, you will always start from the beginning (and always get the first file). You should read all of the files in the same execution.
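The generator approach can be tried in isolation with a small sketch like the one below (Python 3). Everything about the setup is an assumption for illustration: find_files takes its directories and keyword as parameters instead of reading config, make_demo_zips is a hypothetical helper that creates four throwaway archives, and plain text members stand in for the Excel files so that pandas is not needed.

```python
import os
import tempfile
import zipfile


def find_files(zip_dir, unzip_dir, keyword):
    """Yield the extracted path of every archive member whose name contains keyword."""
    for entry in sorted(os.listdir(zip_dir)):
        if not entry.endswith(".zip"):
            continue
        with zipfile.ZipFile(os.path.join(zip_dir, entry)) as archive:
            for member in archive.namelist():
                if keyword in member:
                    # Execution pauses here and resumes on the next iteration.
                    yield archive.extract(member, unzip_dir)


def make_demo_zips(zip_dir, count):
    """Create `count` small archives so the sketch can run on its own."""
    for n in range(count):
        with zipfile.ZipFile(os.path.join(zip_dir, "batch%d.zip" % n), "w") as archive:
            archive.writestr("Contact Frog %d.txt" % n, "data %d" % n)
            archive.writestr("other %d.txt" % n, "ignored")


zip_dir = tempfile.mkdtemp()
unzip_dir = tempfile.mkdtemp()
make_demo_zips(zip_dir, 4)

# The generator is consumed by iterating over it, not by calling it.
extracted = [os.path.basename(p) for p in find_files(zip_dir, unzip_dir, "Contact Frog")]
print(extracted)
```

Each pass through the loop resumes the function right after the yield, which is why all four archives are visited in a single run.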
I have a folder that contains several log files that I will parse with Python.
I would like to show the list of files contained in the folder like this:
[1] FileName1.log
[2] FileName2.log
The user can then choose a file by typing its number in the list.
For instance, to parse the file "FileName2.log" the user presses 2.
In my script I can show the list of files, but I don't know how to pick a file from the list by index.
This is my script:
import os
import sys
items = os.listdir("D:/Logs")
fileList = []
for names in items:
    if names.endswith(".log"):
        fileList.append(names)
cnt = 0
for fileName in fileList:
    sys.stdout.write( "[%d] %s\n\r" %(cnt, fileName) )
    cnt = cnt + 1
fileName = raw_input("\n\rSelect log file [0 -" + str(cnt) + " ]: ")
Thanks for the help!
import os
import sys
items = os.listdir("D:/Logs")
fileList = [name for name in items if name.endswith(".log")]
for cnt, fileName in enumerate(fileList, 1):
    sys.stdout.write("[%d] %s\n\r" % (cnt, fileName))
choice = int(input("Select log file[1-%s]: " % len(fileList)))
print(fileList[choice - 1]) # the menu starts at 1, the list at 0
This is your own code with a few modifications; I hope it serves your purpose.
If you have the names in a list like this:
fileList = ['FileName1.log','FileName2.log']
you can pull them out by index (remember that lists are 0-indexed), so fileList[0] would be 'FileName1.log'.
When you ask the user to input a number (e.g. 0, 1, 2), you then use that number, converted to an int, to get the file you want, like this:
fileToRead=fileList[userInput]
If you asked for 1, 2, 3 you would need to use userInput-1 so that it is correctly 0-indexed.
Then you open the file you now have:
f=open(fileToRead, 'r')
You can read more about open here
If fileList is a list of files and fileName is the user input, you can reference the file the user chose with the following (raw_input returns a string, so convert it to an int first):
fileList[int(fileName)]
import glob
import os
dirpath = r"D:\Logs" # the directory that contains the log files
prefix = "FileName"
fpaths = glob.glob(os.path.join(dirpath, "{}*.log".format(prefix))) # get all the log files
fpaths.sort(key=lambda fpath: int(os.path.basename(fpath).split('.',1)[0][len(prefix):])) # sort the log files by number (taken from the base name, not the full path)
print("Select a file to view:")
for i,fpath in enumerate(fpaths, 1):
    print("[{}]: {}".format(i, os.path.basename(fpath)))
choice = int(input("Enter a selection number: ")) # assuming valid inputs
choice -= 1 # correcting for python's 0-indexing
print("You have chosen", os.path.basename(fpaths[choice]))
Just add in the end something like this...
sys.stdout.write(fileList[int(fileName)])
Indexing in python as in many other languages starts from 0. Try this:
import os
import sys
items = os.listdir("D:/Logs")
fileList = []
for names in items:
    if names.endswith(".log"):
        fileList.append(names)
cnt = 0
for fileName in fileList:
    sys.stdout.write( "[%d] %s\n\r" %(cnt, fileName) )
    cnt = cnt + 1
fileName = int(raw_input("\n\rSelect log file [0 - " + str(cnt - 1) + "]: "))
print(fileList[fileName])
You need to cast input from raw_input() to int. And then you can use the obtained number as index for your list. 0 is the first file, 1 is the second file etc.
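The answers above all assume the user types a valid number. As a hedged sketch (Python 3; pick_by_number and format_menu are hypothetical helper names, not from any of the answers), the conversion and the range check can be kept in one small function that returns None for anything unusable:

```python
def pick_by_number(items, raw_choice):
    """Map a 1-based menu choice (as typed text) to an item, or return None if invalid."""
    try:
        number = int(raw_choice)
    except ValueError:
        return None
    if 1 <= number <= len(items):
        return items[number - 1]  # menu is 1-based, list is 0-based
    return None


def format_menu(items):
    """Build the numbered menu lines shown to the user."""
    return ["[%d] %s" % (n, name) for n, name in enumerate(items, 1)]


file_list = ["FileName1.log", "FileName2.log"]
print("\n".join(format_menu(file_list)))
print(pick_by_number(file_list, "2"))
```

In a real script the second argument would come from input() (or raw_input() on Python 2), and the prompt could be repeated while the function returns None.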
I need to iterate over a folder tree. I have to check each subfolder, which looks like this:
moduleA-111-date
moduleA-112-date
moduleA-113-date
moduleB-111-date
moduleB-112-date
etc.
I figured out how to iterate over a folder tree. I can also use stat with mtime to get the date of the folder, which seems easier than parsing the date out of the name.
How do I single out modules with the same prefix (such as "moduleA") and compare their mtimes so I can delete the oldest?
Since you have no code, I assume that you're looking for design help. I'd lead my students to something like:
Make a list of the names
From each name, find the prefix, such as "moduleA". Put those in a set.
For each prefix in the set
    Find all names with that prefix; put these in a temporary list
    Sort this list.
    For each file in this list *except* the last (newest)
        delete the file
Does this get you moving?
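The outline above could be sketched roughly as follows (Python 3). This is one possible reading, not the answerer's code: it assumes names of the form "prefix-number-date", sorts by modification time rather than by name, and the function name prune_old_folders and the throwaway demo directory are made up for illustration.

```python
import os
import shutil
import tempfile
from collections import defaultdict


def prune_old_folders(root):
    """For each name prefix, delete every subfolder except the most recently modified one."""
    groups = defaultdict(list)
    for name in os.listdir(root):
        full = os.path.join(root, name)
        if os.path.isdir(full):
            prefix = name.split("-", 1)[0]  # assumes "prefix-number-date" names
            groups[prefix].append(full)
    removed = []
    for paths in groups.values():
        paths.sort(key=os.path.getmtime)  # oldest first, newest last
        for old in paths[:-1]:
            shutil.rmtree(old)
            removed.append(os.path.basename(old))
    return sorted(removed)


# Demo on a throwaway directory with explicit modification times.
root = tempfile.mkdtemp()
for age, name in enumerate(["moduleA-111-date", "moduleA-112-date",
                            "moduleA-113-date", "moduleB-111-date",
                            "moduleB-112-date"]):
    folder = os.path.join(root, name)
    os.mkdir(folder)
    os.utime(folder, (1000 + age, 1000 + age))  # later names are newer

removed = prune_old_folders(root)
remaining = sorted(os.listdir(root))
print(removed, remaining)
```

Because shutil.rmtree deletes irreversibly, it may be worth printing the candidates first and only deleting once the grouping looks right.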
I'm posting the code (answer) here. I suppose my question wasn't clear, since it is getting downvoted, but anyway the solution wasn't as straightforward as I thought. I'm sure the code could use some fine tuning, but it gets the job done.
#!/usr/bin/python
import os
import sys
import fnmatch
import glob
import re
import shutil
##########################################################################################################
#Remove the directory
def remove(path):
    try:
        shutil.rmtree(path)
        print "Deleted : %s" % path
    except OSError as e:
        print e
        print "Unable to remove folder: %s" % path
##########################################################################################################
#This function will look for the .sh files in a given path and returns them as a list.
def searchTreeForSh(path):
    full_path = path+'*.sh'
    listOfFolders = glob.glob(full_path)
    return listOfFolders
##########################################################################################################
#Gets the full path to files containing .sh and returns a list of folder names (prefix) to be acted upon.
#listOfScripts is a list of full paths to .sh file
#dirname is the value that holds the root directory where listOfScripts is operating in
def getFolderNames(listOfScripts):
    listOfFolders = []
    folderNames = []
    for foldername in listOfScripts:
        listOfFolders.append(os.path.splitext(foldername)[0])
    for folders in listOfFolders:
        folder = folders.split('/')
        foldersLen=len(folder)
        folderNames.append(folder[foldersLen-1])
    folderNames.sort()
    return folderNames
##########################################################################################################
def minmax(items):
    return max(items)
##########################################################################################################
#This function will check the latest entry in the tuple provided, and will then send "everything" to the remove function except that last entry
def sortBeforeDelete(statDir, t):
    count = 0
    tuple(statDir)
    timeNotToDelete = minmax(statDir)
    for ff in t:
        if t[count][1] == timeNotToDelete:
            count += 1
            continue
        else:
            remove(t[count][0])
            count += 1
##########################################################################################################
#A loop to run over the full path, which is broken into items (see os.listdir below); it eliminates the .sh and the .txt files, leaves only folder names, then matches each to one of the
#names in the "folderNames" variable
def coolFunction(folderNames, path):
    localPath = os.listdir(path)
    for folder in folderNames:
        t = () # a tuple to act as sort of a dict, it will hold the folder name and its equivalent st_mtime
        statDir = [] # a list that will hold the st_mtime for all the folder names in subDirList
        for item in localPath:
            if os.path.isdir(path + item) == True:
                if re.search(folder, item):
                    mtime = os.stat(path + '/' + item)
                    statDir.append(mtime.st_mtime)
                    t = t + ((path + item,mtime.st_mtime),) # the trailing "," makes t a tuple of tuples instead of adding the elements one after the other
        if t == ():continue
        sortBeforeDelete(statDir, t)
##########################################################################################################
def main(path):
    dirs = os.listdir(path)
    for component in dirs:
        # test the full path; a bare name is only found when path is the working directory
        if os.path.isdir(os.path.join(path, component)):
            newPath = path + '/' + component + '/'
            listOfFolders= searchTreeForSh(newPath)
            folderNames = getFolderNames(listOfFolders)
            coolFunction(folderNames, newPath)
##########################################################################################################
if __name__ == "__main__":
    main(sys.argv[1])
I can't figure out what's wrong. I've used rename before without any problems, and can't find a solution in other similar questions.
import os
import random
directory = "C:\\whatever"
string = ""
alphabet = "abcdefghijklmnopqrstuvwxyz"
listDir = os.listdir(directory)
for item in listDir:
    path = os.path.join(directory, item)
    for x in random.sample(alphabet, random.randint(5,15)):
        string += x
    string += path[-4:] #adds file extension
    os.rename(path, string)
    string= ""
There are a few strange things in your code. For example, the source you pass to rename is the full path to the file, but the destination is just a filename, so the files will end up in whatever the working directory is, which is probably not what you wanted.
You have no protection from two randomly generated filenames being the same, so you could destroy some of your data this way.
Try this out, which should help you identify any problems. This will only rename files, and skip subdirectories.
import os
import random
import string
directory = "C:\\whatever"
alphabet = string.ascii_lowercase
for item in os.listdir(directory):
    old_fn = os.path.join(directory, item)
    new_fn = ''.join(random.sample(alphabet, random.randint(5,15)))
    new_fn += os.path.splitext(old_fn)[1] #adds file extension
    new_path = os.path.join(directory, new_fn) #full destination path
    if os.path.isfile(old_fn) and not os.path.exists(new_path):
        os.rename(old_fn, new_path)
    else:
        print 'error renaming {} -> {}'.format(old_fn, new_fn)
If you want to save back to the same directory you will need to add a path to your 'string' variable. Currently it is just creating a filename and os.rename requires a path.
for item in listDir:
    path = os.path.join(directory, item)
    for x in random.sample(alphabet, random.randint(5,15)):
        string += x
    string += path[-4:] #adds file extension
    string = os.path.join(directory,string)
    os.rename(path, string)
    string= ""
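Putting the two points from the answers together (build a full destination path, and guard against two random names colliding), a minimal Python 3 sketch might look like this. The helper names random_name and rename_randomly are hypothetical, the demo runs in a temporary directory rather than "C:\\whatever", and unlike random.sample it allows repeated letters in a name:

```python
import os
import random
import string
import tempfile


def random_name(extension, rng):
    """Return 5 to 15 random lowercase letters followed by the extension."""
    length = rng.randint(5, 15)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length)) + extension


def rename_randomly(directory, rng=random):
    """Rename every regular file in directory; return {old name: new name}."""
    renamed = {}
    for item in os.listdir(directory):
        old_path = os.path.join(directory, item)
        if not os.path.isfile(old_path):
            continue  # skip subdirectories
        extension = os.path.splitext(item)[1]
        new_path = os.path.join(directory, random_name(extension, rng))
        while os.path.exists(new_path):  # retry on a name collision
            new_path = os.path.join(directory, random_name(extension, rng))
        os.rename(old_path, new_path)
        renamed[item] = os.path.basename(new_path)
    return renamed


demo_dir = tempfile.mkdtemp()
for name in ("one.txt", "two.log"):
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(name)

result = rename_randomly(demo_dir)
print(result)
```

Retrying instead of skipping means every file ends up renamed, at the cost of a loop that in practice almost never runs more than once.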
I am trying to iterate through a number of .rtf files and, for each file, read the file, perform some operations, and then write a new file into a sub-directory as a plain text file with the same name as the original file, but with a .txt extension. The problem I am having is with the file naming.
If a file is named foo.rtf, I want the new file in the subdirectory to be foo.txt. Here is my code:
import glob
import os
import numpy as np
dir_path = '/Users/me/Desktop/test/'
file_suffix = '*.rtf'
output_dir = os.mkdir('sub_dir')
for item in glob.iglob(dir_path + file_suffix):
    with open(item, "r") as infile:
        reader = infile.readlines()
        matrix = []
        for row in reader:
            row = str(row)
            row = row.split()
            row = [int(value) for value in row]
            matrix.append(row)
        np_matrix = np.array(matrix)
        inv_matrix = np.transpose(np_matrix)
        new_file_name = item.replace('*.rtf', '*.txt') # i think this line is the problem?
        os.chdir(output_dir)
        with open(new_file_name, mode="w") as outfile:
            outfile.write(inv_matrix)
When I run this code, I get a TypeError:
TypeError: coercing to Unicode: need string or buffer, NoneType found
How can I fix my code to write new files into a subdirectory and change the file extensions from .rtf to .txt? Thanks for the help.
Instead of item.replace, check out some of the functions in the os.path module (http://docs.python.org/library/os.path.html). They're made for splitting up and recombining parts of filenames. For instance, os.path.splitext will split a filename into a file path and a file extension.
Let's say you have a file /tmp/foo.rtf and you want to move it to /tmp/foo.txt:
old_file = '/tmp/foo.rtf'
(file,ext) = os.path.splitext(old_file)
print 'File=%s Extension=%s' % (file,ext)
new_file = '%s%s' % (file,'.txt')
print 'New file = %s' % (new_file)
Or if you want the one line version:
old_file = '/tmp/foo.rtf'
new_file = '%s%s' % (os.path.splitext(old_file)[0],'.txt')
I've never used glob, but here's an alternative way without using a module:
You can easily strip the suffix using
name = name[:name.rfind('.')]
and then add the new suffix:
name = name + '.txt'
Why not use a function?
def change_suffix(string, new_suffix):
    i = string.rfind('.')
    if i < 0:
        raise ValueError('string does not have a suffix')
    if not new_suffix[0] == '.':
        new_suffix = '.' + new_suffix
    return string[:i] + new_suffix
glob.iglob() yields pathnames that do not contain the '*' character.
Therefore your line should be:
new_file_name = item.replace('.rtf', '.txt')
Consider working with clearer names (reserve 'filename' for a file name and use 'path' for a complete path to a file; use 'path_original' instead of 'item'), together with os.extsep ('.' on Windows) and os.path.splitext():
path_txt = os.extsep.join([os.path.splitext(path_original)[0], 'txt'])
Now the best hint of all:
numpy can probably read your file directly:
data = np.genfromtxt(filename, unpack=True)
(see also here)
To better understand where your TypeError comes from, wrap your code in the following try/except block:
try:
    (your code)
except:
    import traceback
    traceback.print_exc()
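One likely source of the TypeError that the answers above do not spell out: os.mkdir() returns None, so output_dir is None and os.chdir(output_dir) fails. The sketch below (Python 3) keeps the directory path in its own variable and builds the .txt path with os.path.splitext. It is only an illustration under stated assumptions: the input is taken to be whitespace-separated integers, plain lists replace NumPy to keep it self-contained, and transpose_to_txt and the temporary demo directory are hypothetical names.

```python
import glob
import os
import tempfile


def transpose_to_txt(src_dir, sub_dir_name="sub_dir"):
    """Transpose each whitespace-separated integer matrix in *.rtf and save it as .txt."""
    output_dir = os.path.join(src_dir, sub_dir_name)  # keep the path ourselves
    if not os.path.isdir(output_dir):
        os.mkdir(output_dir)  # returns None, so do not assign its result
    written = []
    for path_original in sorted(glob.glob(os.path.join(src_dir, "*.rtf"))):
        with open(path_original) as infile:
            matrix = [[int(value) for value in line.split()]
                      for line in infile if line.strip()]
        transposed = zip(*matrix)  # rows become columns
        base = os.path.splitext(os.path.basename(path_original))[0]
        path_txt = os.path.join(output_dir, base + ".txt")
        with open(path_txt, "w") as outfile:
            for row in transposed:
                outfile.write(" ".join(str(value) for value in row) + "\n")
        written.append(path_txt)
    return written


demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "foo.rtf"), "w") as handle:
    handle.write("1 2 3\n4 5 6\n")

paths = transpose_to_txt(demo_dir)
with open(paths[0]) as handle:
    content = handle.read()
print(paths, content)
```

Writing to a full output path also removes the need for os.chdir, which would otherwise change the working directory on every loop iteration.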