So I've written something to pull a certain string (the beneficiary) out of PDFs and rename each file based on that string, but the problem is duplicates: is there any way to add a +1 counter behind the name?
My inefficient code is as follows; I appreciate any help!
for filename in os.listdir(input_dir):
    if filename.endswith('.pdf'):
        input_path = os.path.join(input_dir, filename)
        pdf_array = (glob.glob(input_dir + '*.pdf'))
        for current_pdf in pdf_array:
            with pdfplumber.open(current_pdf) as pdf:
                page = pdf.pages[0]
                text = page.extract_text()
                keyword = text.split('\n')[2]
                try:
                    if 'attention' in keyword:
                        pdf_to_att = text.split('\n')[2]
                        start_to_att = 'For the attention of: '
                        to_att = pdf_to_att.removeprefix(start_to_att)
                        pdf.close()
                        result = to_att
                        os.rename(current_pdf, result + '.pdf')
                    else:
                        pdf_to_ben = text.split('\n')[1]
                        start_to_ben = 'Beneficiary Name : '
                        end_to_ben = pdf_to_ben.rsplit(' ', 1)[1]
                        to_ben = pdf_to_ben.removeprefix(start_to_ben).removesuffix(end_to_ben).rstrip()
                        pdf.close()
                        result = to_ben
                        os.rename(current_pdf, result + '.pdf')
                except Exception:
                    pass
messagebox.showinfo("A Title", "Done!")
Edit: the desired output should be
AAA.pdf
AAA_2.pdf
BBB.pdf
CCC.pdf
CCC_2.pdf
I would use a dict to record the occurrence count of each filename.
dict.get(key, default) returns the value for key if key is in the dictionary, else default; if default is not given, it defaults to None.
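For example, a quick illustration of that default behaviour:

counts = {}
print(counts.get('AAA', 0))  # 0 - 'AAA' is not in the dict yet, so the default is returned
counts['AAA'] = 1
print(counts.get('AAA', 0))  # 1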
pdf_name_count = {}
for current_pdf in pdf_array:
    with pdfplumber.open(current_pdf) as pdf:
        page = pdf.pages[0]
        text = page.extract_text()
        keyword = text.split('\n')[2]
        try:
            if 'attention' in keyword:
                ...
                result = to_att
            else:
                ...
                result = to_ben
            filename_count = pdf_name_count.get(result, 0)
            if filename_count >= 1:
                filename = f'{result}_{filename_count+1}.pdf'
            else:
                filename = result + '.pdf'
            os.rename(current_pdf, filename)
            # increase the name occurrence by 1
            pdf_name_count[result] = filename_count + 1
        except Exception:
            pass
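To see just the naming scheme in isolation, here is a dry-run sketch with made-up names and no PDFs involved:

pdf_name_count = {}
for result in ['AAA', 'AAA', 'BBB', 'CCC', 'CCC']:
    filename_count = pdf_name_count.get(result, 0)
    if filename_count >= 1:
        filename = f'{result}_{filename_count + 1}.pdf'
    else:
        filename = result + '.pdf'
    pdf_name_count[result] = filename_count + 1
    print(filename)
# prints: AAA.pdf, AAA_2.pdf, BBB.pdf, CCC.pdf, CCC_2.pdf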
What you want is to build a string for the filename that includes a counter; let's call it cnt. Python has the f-string syntax for this exact purpose: it lets you interpolate a variable into a string.
Initialize your counter before the for loop:
cnt = 0
Replace
os.rename(current_pdf, result + '.pdf')
with
os.rename(current_pdf, f'{result}_{cnt}.pdf')
cnt += 1
The f before the opening quote introduces the f-string, and the curly braces {} let you include any Python expression; in your case we just substitute the values of the two variables result and cnt. Then we increment the counter, of course.
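For example, a quick interpolation demo:

result = 'AAA'
cnt = 0
print(f'{result}_{cnt}.pdf')  # AAA_0.pdf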
os.path.isfile can meet your needs.
import os

def get_new_name(result):
    file_name = result + '{}.pdf'
    file_number = 0
    if os.path.isfile(file_name.format('')):  # AAA.pdf
        file_number = 2
        while os.path.isfile(file_name.format('_{}'.format(file_number))):
            file_number += 1
    if file_number:
        pdf_name = file_name.format('_{}'.format(file_number))
    else:
        pdf_name = file_name.format('')
    return pdf_name
I updated the code for your output format; it should work.
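For reference, this helper could be plugged into your rename step roughly like this (a sketch; result is the extracted name from your original loop):

# instead of: os.rename(current_pdf, result + '.pdf')
os.rename(current_pdf, get_new_name(result))  # yields AAA.pdf, then AAA_2.pdf, AAA_3.pdf, ...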
Related
I'm writing a program that turns music albums into files you can search through, and for that I need a string in the file to hold a specific value that is only produced after the list is complete. Can you go back into that list and change a blank string to a new value?
I have searched online and found something called words.replace, but it doesn't work; I get an AttributeError.
def create_album():
    global idnumber, current_information
    file_information = []
    if current_information[0] != 'N/A':
        save()
    file_information.append(idnumber)
    idnumber += 1
    print('Type c at any point to abort creation')
    for i in creation_list:
        value = input('\t' + i)
        if value.upper == 'C':
            menu()
        else:
            file_information.append('')  # -1
            file_information.append(value)
    file_information.append('Album created - ' + file_information[2] + '\nSongs:')
    file_information = [w.replace(file_information[1], str(file_information[0]) + '-' + file_information[2]) for w in file_information]  # -2
    current_information = file_information
    save_name = open(save_path + str(file_information[0]) + '-' + str(file_information[2]) + '.txt', 'w')
    for i in file_information:
        save_name.write(str(i) + '\n')
    current_files_ = open(information_file + 'files.txt', 'w')
    filenames.append(file_information[0])
    for i in filenames:
        current_files_.write(str(i) + '\n')
    id_file = open(information_file + 'albumid.txt', 'w')
    id_file.write(str(idnumber))
-1 is where I have put aside a blank row.
-2 is where I try to replace row 1 in the list with the value of row 0 and row 2.
The error message I receive is 'int' object has no attribute 'replace'.
Did you try this?
file_information = [w.replace(str(file_information[1]), str(file_information[0]) + '-' + file_information[2]) for w in file_information]
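If it still fails, note that the list holds an integer at index 0 (idnumber), and only strings have .replace; also, replacing an empty search string inserts the new text between every character. Since you already know which slot holds the blank entry, a simpler alternative (just a sketch of the idea, not your full program) is to assign to it directly:

# overwrite the blank row (index 1) with the id and album name
file_information[1] = str(file_information[0]) + '-' + str(file_information[2])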
I have not used Python in years and am trying to get back into it. I have an input file (.csv) that I want to parse and store the output of in an output.csv or .txt file.
I have managed to parse the .csv file using the code below, and for the most part it works, but I can't get it to save to a file (issue 1) without getting the error below (error 1).
import csv
import re
import itertools

file_name = 'PhoneCallData1.txt'
try:
    lol = list(csv.reader(open(file_name, 'r'), delimiter=' '))
    count = 0
except:
    print('File cannot be opened:', file_name)
    exit()
try:
    fout = open('output.txt', 'w')
except:
    Print("File cannot be written to:", "OutputFile")
    exit()
d = dict()
for item in itertools.chain(lol):  # Lists all items (field) in the CSV file.
    count += 1  # counter to keep track of row im looping through
    if lol[count][3] is None:
        print("value is not blank")
        count += 1
    else:
        try:
            check_date = re.search(r'(\d+/\d+/\d+)', lol[count][3])  # check to determine if date is a date
        except:
            continue
        check_cost = re.compile(r'($+\d*)', lol[count][9])  # check to determine if value is a cost
        if check_date == TRUE:
            try:
                key = lol[count][3]  # If is a date value, store key
            except ValueError:
                continue
        if check_cost == TRUE:
            value = lol[count][9]  # if is a cost ($) store value
            d[key] = value
            print(d[key])
            # fout.write((d[key])
# What if there is no value in the cell?
# I keep getting "IndexError: list index out of range", anyone know why?
# Is there a better way to do this?
# I only want to store the destination and the charge
And now comes the complicated part: the file I need to parse has a number of irrelevant rows of data before and in between the required data.
Data Format
What I want to do:
I want to iterate over two columns of data and only store the rows that have a date or cost in them, discarding the rest of the data.
import csv
import re
import itertools

lol = list(csv.reader(open('PhoneCallData1.txt', 'r'), delimiter=' '))
count = 0
d = dict()
for item in itertools.chain(lol):  # Lists all items (field) in the CSV file.
    count += 1  # counter to keep track of row im looping through
    check_date = re.search(r'(\d+/\d+/\d+)', lol[count][3])  # check to determine
    check_cost = re.compile(r'($+\d*)', lol[count][9])  # check to determine if value is a cost
    if check_date == TRUE:
        key = lol[count][3]  # If is a date value, store key
    if check_cost == TRUE:
        value = lol[count][9]  # if is a cost ($) store value
        d[key] = value
        print(d[key])
# What if there is no value in the cell?
# I keep getting "IndexError: list index out of range", anyone know why?
# Is there a better way to do this?
# I only want to store the destination and the charges
What I have tried:
I tried to index the data after I loaded it, but that didn't seem to work.
I created this to only look at rows that were longer than a certain length, but it's terrible code. I was hoping for something more practical and reusable.
import re

with open('PhoneCallData1.txt', 'r') as f, open('sample_output.txt', 'w') as fnew:
    for line in f:
        if len(line) > 50:
            print(line)
            fnew.write(line + '\n')
import csv

lol = list(csv.reader(open('PhoneCallData1.txt', 'rb'), delimiter='\t'))
# d = dict()
# key = lol[5][0]    # cell A7
# value = lol[5][3]  # cell D7
# d[key] = value     # add the entry to the dictionary
I keep getting index-out-of-bounds errors.
import re
import csv
match=re.search(r'(\d+/\d+/\d+)','testing date 11/12/2017')
print match.group(1)
I am trying to use a regex to search for the date in the first column of data.
NOTE: I wanted to try Pandas, but I feel I need to start here. Any help would be awesome.
Whether the next record needs to be parsed has to be decided by something specific in the data. I have answered a similar question in the same way: a finite-state machine may help.
The main code is:
state = 'init'
output = []
# for line loop:
if state == 'init':      # seek for start parsing
    # check if start parsing
    state = 'start'
elif state == 'start':   # start parsing now
    # parsing
    # check if need to end parsing
    state = 'init'
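As a concrete (hypothetical) illustration of that skeleton: the real trigger conditions depend on your file, so the 'Date' header check and the blank-line check below are only placeholders:

state = 'init'
output = []

with open('PhoneCallData1.txt') as f:
    for line in f:
        line = line.rstrip('\n')
        if state == 'init':
            # start collecting once we hit the header row of the call table
            if line.startswith('Date'):
                state = 'start'
        elif state == 'start':
            if not line.strip():
                # a blank line ends the table; go back to skipping
                state = 'init'
            else:
                output.append(line.split())

print(output)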
import csv
import re
import itertools
import timeit

start_time = timeit.default_timer()
# code you want to evaluate
file_name = 'PhoneCallData.txt'
try:
    lol = list(csv.reader(open(file_name, 'r'), delimiter=' '))
except:
    print('File cannot be opened:', file_name)
    exit()
try:
    fout = open('output.txt', 'w')
except:
    print("File cannot be written to:", "OutputFile")
    exit()

# I could assign key/value pairs and store them in a dictionary, then print, search, etc. on the dictionary. Version 2
# d = dict()

count = 0
total = 0
for row in lol:  # Lists all items (fields) in the CSV file.
    # print(len(row))
    count += 1  # counter to keep track of the row I'm looping through
    if len(row) == 8:
        if row[2].isdigit():
            # Remove the $ and convert to float
            cost = re.sub('[$]', '', row[7])
            # Assign total value
            try:
                # Calculate total for verification purposes
                total = total + float(cost)
                total = round(total, 2)
            except:
                continue
            string = str(row[2] + " : " + (row[7]) + " : " + str(total) + "\n")
            print(string)
            fout.write(string)
    if len(row) == 9:
        if row[2].isdigit():
            # Remove the $ and convert to float
            cost = re.sub('[$]', '', row[8])
            # Assign total value
            try:
                # Calculate total for verification purposes
                total = total + float(cost)
                total = round(total, 2)
            except:
                continue
            string = str(row[2] + " : " + row[8] + " : " + str(total) + "\n")
            print(string)
            fout.write(string)
    if len(row) == 10:
        # print(row[2] + ":" + row[9])
        # Remove the $ and convert to float
        cost = re.sub('[$]', '', row[9])
        # Assign total value
        try:
            # Calculate total for verification purposes
            total = total + float(cost)
            total = round(total, 2)
        except:
            continue
        string = str(row[2] + " : " + row[9] + " : " + str(total) + "\n")
        print(string)
        fout.write(string)

# Convert to string so I can print and store in file
count_string = str(count)
total_string = str(total)
total_string.split('.', 2)

# Write to screen
print(total_string + " Total\n")
print("Rows parsed: " + count_string)

# write to file
fout.write(count_string + " Rows were parsed\n")
fout.write(total_string + " Total")

# Calculate time spent on task
elapsed = timeit.default_timer() - start_time
round_elapsed = round(elapsed, 2)
string_elapsed = str(round_elapsed)
fout.write(string_elapsed)
print(string_elapsed + " seconds")
fout.close()
I have the lines below in one of my configuration files:
appPackage_name = sqlncli
appPackage_version = 11.3.6538.0
The left side is the key and the right side is the value.
Now I want to be able to replace the value part with something else, given a key, in Python.
import re

Filepath = r"C:\Users\bhatsubh\Desktop\Everything\Codes\Python\OO_CONF.conf"
key = "appPackage_name"
value = "Subhayan"
searchstr = re.escape(key) + " = [\da-zA-Z]+"
replacestr = re.escape(key) + " = " + re.escape(value)
filedata = ""
with open(Filepath, 'r') as File:
    filedata = File.read()
File.close()
print("Before change:", filedata)
re.sub(searchstr, replacestr, filedata)
print("After change:", filedata)
I assume there is something wrong with the regex I am using, but I am not able to figure out what. Can someone please help me?
Use the following fix:
import re
#Filepath = r"C:\Users\bhatsubh\Desktop\Everything\Codes\Python\OO_CONF.conf"
key = "appPackage_name"
value = "Subhayan"
#searchstr = re.escape(key) + " = [\da-zA-Z]+"
#replacestr = re.escape(key) + " = " + re.escape(value)
searchstr = r"({} *= *)[\da-zA-Z.]+".format(re.escape(key))
replacestr = r"\1{}".format(value)
filedata = "appPackage_name = sqlncli"
#with open(Filepath,'r') as File:
# filedata = File.read()
#File.close()
print ("Before change:",filedata)
filedata = re.sub(searchstr,replacestr,filedata)
print ("After change:",filedata)
There are several issues. You should not escape the replacement pattern, only the literal user-defined values inside the regex pattern. You can use a capturing group (a pair of unescaped (...)) and a backreference (here \1, since the group is the only one in the pattern) to restore the part of the matched string you need to keep, rather than build that replacement string dynamically. As the version value contains dots, you should add a . to the character class: [\da-zA-Z.]. Finally, re.sub does not modify the string in place, so you need to assign its return value back to filedata to actually change it.
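A tiny standalone demo of the group-and-backreference idea; note that when the replacement itself starts with a digit, the unambiguous \g<1> form of the backreference is safer than \1:

import re

line = "appPackage_version = 11.3.6538.0"
# group 1 captures the "key = " part, which is restored in the replacement
print(re.sub(r"(appPackage_version *= *)[\da-zA-Z.]+", r"\g<1>12.0.0.0", line))
# appPackage_version = 12.0.0.0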
I have a file named config, and its fields are separated by a space (" ").
cat config
/home/user1 *.log,*.txt 30
/home/user2 *.trm,*.doc,*.jpeg 10
I want to read this file, parse each line, and print each field from each line.
Ex:-
Dir = /home/user1
Fileext = *.log,*.txt
days=30
I couldn't get further than the code below.
def dir():
    file = open('config', 'r+')
    cont = file.readlines()
    print "file contents are %s" % cont
    for i in range(len(cont)):
        j = cont[i].split(' ')
dir()
Any pointers on how to move further?
Your code is fine; you are just missing the last step of processing each element of the split string. Try this:
def dir():
    file = open('config', 'r+')
    cont = file.readlines()
    print "file contents are %s" % cont + '\n'
    elements = []
    for i in range(len(cont)):
        rowElems = cont[i].split(' ')
        elements.append({'dir': rowElems[0], 'ext': rowElems[1], 'days': rowElems[2]})
    for e in elements:
        print "Dir = " + e['dir']
        print "Fileext = " + e['ext']
        print "days = " + e['days']

dir()
At the end of this code, you will have all the rows processed and stored in a list of dictionaries that you can easily access later.
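For example, once the rows sit in such a list of dicts they can be filtered or reused later; a small self-contained sketch with the two rows from the question:

rows = [{'dir': '/home/user1', 'ext': '*.log,*.txt', 'days': '30'},
        {'dir': '/home/user2', 'ext': '*.trm,*.doc,*.jpeg', 'days': '10'}]
for row in rows:
    if int(row['days']) > 20:           # e.g. keep only entries with more than 20 days
        print("Dir = " + row['dir'])    # prints: Dir = /home/user1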
You can write a custom function to parse each line, and then use the map function to apply that function against each line in file.readlines():
def parseLine(line):
    # function to split and parse each line,
    # and return the formatted string
    Dir, FileExt, Days = line.split(' ')[:3]
    return 'Dir = {}\nFileext = {}\nDays = {}'.format(Dir, FileExt, Days)

def dir():
    with open('config', 'r+') as file:
        print 'file contents are\n' + '\n'.join(map(parseLine, file.readlines()))
Results:
>>> dir()
file contents are
Dir = /home/user1
Fileext = *.log,*.txt
Days = 30
Dir = /home/user2
Fileext = *.trm,*.doc,*.jpeg
Days = 10
This is what I am doing:
import csv

output = open('output.txt', 'wb')

# this function returns the min for num.txt
def get_min(num):
    return int(open('%s.txt' % num, 'r+').readlines()[0])

# temporary variables
last_line = ''
input_list = []

# iterate over input.txt and sort the input into a list of tuples
for i, line in enumerate(open('input.txt', 'r+').readlines()):
    if i % 2 == 0:
        last_line = line
    else:
        input_list.append((last_line, line))

filtered = [(header, data[:get_min(header[-2])] + '\n') for (header, data) in input_list]
[output.write(''.join(data)) for data in filtered]
output.close()
In this code, input.txt is something like this:
>012|013|0|3|M
AFDSFASDFASDFA
>005|5|67|0|6
ACCTCTGACC
>029|032|4|5|S
GGCAGGGAGCAGGCCTGTA
and num.txt is something like this:
M 4
P 10
For each record in input.txt, I want to look at the last column of its header line (which matches an entry in num.txt), take the number for that entry from num.txt, and cut the record's characters according to that value.
I think the error in my code is that it only accepts an integer text file, whereas it should also accept files that contain letters.
The totally revised version, after a long chat with the OP:
import os
import re

# Fetch all hashes and counts
file_c = open('num.txt')
file_c = file_c.read()
lines = re.findall(r'\w+\.txt \d+', file_c)

numbers = {}
for line in lines:
    line_split = line.split('.txt ')
    hash_name = line_split[0]
    count = line_split[1]
    numbers[hash_name] = count
# print(numbers)

# The input file
file_i = open('input.txt')
file_i = file_i.read()

for hash_name, count in numbers.iteritems():
    regex = '(' + hash_name.strip() + ')'
    result = re.findall(r'>.*\|(' + regex + ')(.*?)>', file_i, re.S)
    if len(result) > 0:
        data_original = result[0][2]
        stripped_data = result[0][2][int(count):]
        file_i = file_i.replace(data_original, '\n' + stripped_data)
        # print(data_original)
        # print(stripped_data)
# print(file_i)

# Write the input file to new input_new.txt
f = open('input_new.txt', 'wt')
f.write(file_i)
You can do it like so:
import re

min_count = 4               # this variable will contain that count integer from where to start removing
str_to_match = 'EOG6CC67M'  # this variable will contain the filename you read
input = ''                  # The file input (input.txt) will go in here
counter = 0

def callback_f(e):
    global min_count
    global counter
    counter += 1
    # Check your input
    print(str(counter) + ' >>> ' + e.group())
    # Only replace the value with nothing (remove it) after a certain count
    if counter > min_count:
        return ''        # replace with nothing
    return e.group()     # otherwise keep the match unchanged (re.sub expects a string back)

result = re.sub(r'' + str_to_match, callback_f, input)
With this tactic you can keep count with a global counter, and there is no need for complicated line-by-line loops with nested structures.
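A quick standalone illustration of the same callback idea, using a made-up 'foo' token and counts as placeholders:

import re

min_count = 2
counter = 0

def strip_after(e):
    global counter
    counter += 1
    # keep the first min_count matches, remove the rest
    return '' if counter > min_count else e.group()

sample = 'foo foo foo foo foo'
print(re.sub(r'foo', strip_after, sample))  # keeps the first 2 matches, removes the other 3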
Update
A more detailed version with file access:
import os
import re

def callback_f(e):
    global counter
    counter += 1
    # Check your input
    print(str(counter) + ' >>> ' + e.group())

# Fetch all hash-file names and their content (count)
num_files = os.listdir('./num_files')
numbers = {}
for file in num_files:
    if file[0] != '.':
        file_c = open('./num_files/' + file)
        file_c = file_c.read()
        numbers[file.split('.')[0]] = file_c

# Now the CSV files
csv_files = os.listdir('./csv_files')
for file in csv_files:
    if file[0] != '.':
        for hash_name, min_count in numbers.iteritems():
            file_c = open('./csv_files/' + file)
            file_c = file_c.read()
            counter = 0
            result = re.sub(r'' + hash_name, callback_f, file_c)
            # Write the replaced content back to the file here
The assumed directory/file structure:
+ Projects
+ Project_folder
+ csv_files
- input1.csv
- input2.csv
~ etc.
+ num_files
- EOG6CC67M.txt
- EOG62JQZP.txt
~ etc.
- python_file.py
The CSV files contain the big chunks of text you show in your original question.
The num files contain the hash files, each with an integer in them.
What happens in this script:
Collect all hash files (into a dictionary) along with each one's count number.
Loop through all CSV files.
Sub-loop through the collected numbers for each CSV file.
Replace/remove (based on what you do in callback_f()) hashes after a certain count.
Write the output back; it is the last comment in the script and would contain the file.write() functionality (a sketch of that step follows below).
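A minimal sketch of that last write-back step, assuming result holds the replaced content for the current file as in the loop above:

# inside the csv_files loop, after computing result
with open('./csv_files/' + file, 'w') as out:
    out.write(result)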