How to increment output filename in Python - python

I have a script that works, but when I run it a second time it fails because it keeps saving to the same output filename. I'm very new to Python and programming in general, so dumb your answers down...and then dumb them down some more. :)
arcpy.gp.Spline_sa("Observation_RegionalClip_Clip", "observatio", "C:/Users/moshell/Documents/ArcGIS/Default.gdb/Spline_shp16", "514.404", "REGULARIZED", "0.1", "12")
Where Spline_shp16 is the output filename, I would like it to save as Spline_shp17 the next time I run the script, and then Spline_shp18 the time after that, etc.

If you want to use numbers in the file names, you can check what files with similar names already exist in that directory, take the largest one, and increment it by one. Then pass this new number as a variable in the string for the filename.
For example:
import glob
import re
import arcpy
# get the numeric suffixes of the appropriate files
file_suffixes = []
for file in glob.glob("./Spline_shp*"):
    regex_match = re.match(r".*Spline_shp(\d+)", file)
    if regex_match:
        file_suffix = regex_match.groups()[0]
        file_suffix_int = int(file_suffix)
        file_suffixes.append(file_suffix_int)
# get max and increment by one (default=0 gives 1 when nothing matched yet)
new_suffix = max(file_suffixes, default=0) + 1
new_file = f"C:/Users/moshell/Documents/ArcGIS/Default.gdb/Spline_shp{new_suffix}" # format new file name
arcpy.gp.Spline_sa(
    "Observation_RegionalClip_Clip",
    "observatio",
    new_file,
    "514.404",
    "REGULARIZED",
    "0.1",
    "12",
)
Alternatively, if you are just interested in creating unique filenames so that nothing gets overwritten, you can append a timestamp to the end of the filename. You would then have files with names like "Spline_shp-1551375142", for example:
import time
# int() drops the fractional seconds, so the name contains no "."
timestamp = str(int(time.time()))
filename = "C:/Users/moshell/Documents/ArcGIS/Default.gdb/Spline_shp-" + timestamp
arcpy.gp.Spline_sa(
    "Observation_RegionalClip_Clip",
    "observatio",
    filename,
    "514.404",
    "REGULARIZED",
    "0.1",
    "12",
)
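If a raw epoch number is hard to read, a formatted date and time may work as the suffix instead. The following is only a sketch: the helper name timestamped_name and the underscore-separated layout are assumptions, not part of the original answer.

```python
from datetime import datetime

def timestamped_name(prefix, now=None):
    """Return prefix plus a sortable date-time suffix (hypothetical helper)."""
    now = now or datetime.now()
    # %Y%m%d_%H%M%S sorts chronologically and uses only digits and underscores
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}"

# A fixed datetime is passed here only to make the example reproducible
print(timestamped_name("Spline_shp", datetime(2019, 2, 28, 17, 32, 22)))
# Spline_shp_20190228_173222
```

The result could then be appended to the geodatabase path in the same way as the timestamp above.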

Related

Using Python and regex to find strings and combine them into the new filename of a .pdf: the rename fails when using more than one group

I have several thousand pdfs which I need to re-name based on the content. The layouts of the pdfs are inconsistent. To re-name them I need to locate a specific string "MEMBER". I need the value after the string "MEMBER" and the values from the two lines above MEMBER, which are Time and Date values respectively.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+) which matches all of the values I need. But it puts them across four groups with group 3 just showing a carriage return.
What I have so far looks like this:
import fitz
from os import DirEntry, curdir, chdir, getcwd, rename
from glob import glob as glob
import re
failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""
get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')
for pdf in pdf_list:
    with fitz.open(pdf) as pdf_obj:
        for page in pdf_obj:
            text += page.get_text()
    new_file_name = re.search(pdf_regex, text).group().strip().replace(":","").replace("-","") + '.pdf'
    text = ""
    #clean_new_file_name = new_file_name.translate({(":"): None})
    print(new_file_name)
    # Tries to rename a pdf. If the filename doesn't already exist
    # then rename. If it does exist then throw an error and add to failed list
    try:
        rename(pdf, new_file_name )
    except WindowsError:
        count += 1
        failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search call, for instance group 4, which contains the MEMBER ##### value, then the file renames successfully with just that value. Similarly, group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the printed value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex or with the re.search portion? Or everything?
I have tried re-doing the Regular Expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read the page's text word by word, sorted into reading order.
If we then find "MEMBER", the word following it is the member number (######), and the two preceding words must be the time and the date, respectively.
found = False
for page in pdf_obj:
    words = page.get_text("words", sort=True)
    # all space-delimited strings on the page, sorted vertically,
    # then horizontally; item 4 of each entry is the word itself
    for i, word in enumerate(words):
        if word[4] == "MEMBER":
            hashes = words[i+1][4] # this must be the word after MEMBER!
            time_string = words[i-1][4] # the time
            date_string = words[i-2][4] # the date
            found = True
            break
    if found: # no need to look in other pages
        break
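Once the three values are available, they still have to be joined into a single-line name, because Windows rejects file names containing line breaks, "/" or ":". The sketch below shows one way to do that. The helper name build_pdf_name and the underscore-separated layout are assumptions chosen for illustration.

```python
import re

def build_pdf_name(date_string, time_string, member_id):
    """Join the three values into one single-line, Windows-safe file name.

    Hypothetical helper: the name layout is an assumption, not the asker's spec.
    """
    raw = f"{date_string.strip()}_{time_string.strip()}_MEMBER_{member_id.strip()}"
    # Replace characters Windows forbids in file names, plus any whitespace
    safe = re.sub(r'[\\/:*?"<>|\s]+', "-", raw)
    return safe + ".pdf"

print(build_pdf_name("01/02/23", "14:05:09", "123456"))
# 01-02-23_14-05-09_MEMBER_123456.pdf
```

The returned string can be passed to rename() in place of the multi-line match.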

How to get the latest folder in a directory using Python

I need to retrieve the directory of the most recently created folder. I am using a program that will output a new run## folder each time it is executed (e.g. run01, run02, run03 and so on). Within any one run## folder resides a data file that I want to analyze (file-i-want.txt).
import os
run_numb = 'run01'
dir = os.path.dirname(__file__)
filepath = os.path.join(dir, r'..\data\directory', run_numb, 'file-i-want.txt')
In short, I want to skip having to hardcode run## and just get the path of the file within the most recently created run## folder.
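You can get the creation date with os.stat:
import os
path = '/a/b/c'
# st_birthtime exists on macOS/BSD (and on Windows from Python 3.12);
# on older Windows Pythons st_ctime holds the creation time, and on Linux
# st_mtime is the usual substitute
# newest
newest = max(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime)
# all files sorted, newest first
sorted_files = sorted(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime, reverse=True)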
pathlib is recommended over os for filesystem-related tasks.
You can try:
from pathlib import Path
filepath = Path(__file__).parent / 'data/directory'
# every matching file below filepath, most recently modified first
fnames = sorted(filepath.rglob('file-i-want.txt'), key=lambda p: p.stat().st_mtime, reverse=True)
filepath = str(fnames[0])
filepath
glob.glob('run*') returns the files/directories that match the pattern, but in arbitrary order, so sort the result by name first.
So if you want the latest run, your code will be:
import glob
print(sorted(glob.glob('run*'))[-1]) # raises IndexError if there are no runs
IMPORTANT: the names are compared as strings when ordered by name, so 'run21' will come AFTER 'run100'. You will need to zero-pad the number to a high enough number of digits to avoid this problem, or just count the matched folders and rebuild the folder name from that count.
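If the folder names cannot be zero-padded, one option is to compare the numeric part as an integer instead of as text. This is a minimal sketch that assumes every folder is named "run" followed only by digits. The helper name latest_run is hypothetical.

```python
import re

def latest_run(names):
    """Return the name with the highest numeric suffix, e.g. 'run100' over 'run21'."""
    numbered = []
    for name in names:
        match = re.fullmatch(r"run(\d+)", name)
        if match:
            # store the suffix as an int so 100 compares greater than 21
            numbered.append((int(match.group(1)), name))
    if not numbered:
        raise ValueError("no run## folders found")
    return max(numbered)[1]

print(latest_run(["run01", "run21", "run100", "notes"]))
# run100
```

In practice the list of names could come from os.listdir() on the data directory.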
You can use glob to count the files that share the same name pattern:
import glob
n = len(glob.glob('run*')) # number of files which name starts with 'run'
new_run_name = 'run' + str(n)
Note: with this code the run numbers start from 0; if you want to start from 1, just add 1 to n.
If you always want a two-digit run number (00, 01, 02), use 'str(n).zfill(2)' instead of 'str(n)'.
example:
import glob
n = len(glob.glob('run*')) # number of files which name starts with 'run'
new_run_name = 'run' + str(n + 1).zfill(2)

Python insert array into string

I'm trying to create new files based on my store_array list. If the name doesn't exist yet in directory then create a new one, then another, then another. I have 300 files I need to create.
store_array = ["1234567", "987654", "1919103039"]
for store_number in store_array:
    if store_number == "1":
        continue
    print(store_number, file=open(r'C:\Users\hank\Desktop\project\json_' + [store_number] + '".json', 'w'))
TypeError: must be str, not list
I can get output with a simple print(store_number) but I need to concat the text from the array into my filename.
Thanks for the help!
The TypeError comes from + [store_number] +, which tries to concatenate a list to a string; use store_number itself (it is already a str). If the problem is also the lack of an existing file, the following example may demonstrate a solution. It writes a blank JSON file for each number in a list.
import json, os
placeholder_data = {}
store_array = ["1","2","3"]
for store_number in store_array:
    # raw string, so the backslashes in the Windows path are not treated as escapes
    filename = os.path.join(r'C:\users\csind\documents\pscripts','test{}.json'.format(store_number))
    with open(filename,'w') as file:
        json.dump(placeholder_data,file)
    print(store_number, filename)
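The question also asks to create a file only when that name does not exist yet. A sketch of that check follows. The function name create_missing_json_files is hypothetical, and a temporary folder stands in for the real project directory.

```python
import json
import tempfile
from pathlib import Path

def create_missing_json_files(folder, store_numbers):
    """Create json_<number>.json for each number; skip names that already exist."""
    created = []
    for store_number in store_numbers:
        # f-string puts the str straight into the name (no list concatenation)
        target = Path(folder) / f"json_{store_number}.json"
        if target.exists():
            continue
        target.write_text(json.dumps({}))
        created.append(target.name)
    return created

# Demo in a throwaway directory (assumption: swap in the real folder when using this)
with tempfile.TemporaryDirectory() as tmp:
    print(create_missing_json_files(tmp, ["1234567", "987654"]))
    print(create_missing_json_files(tmp, ["1234567", "1919103039"]))
```

The second call creates only json_1919103039.json, because json_1234567.json already exists from the first call.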

Can't get unique word/phrase counter to work - Python

I'm having trouble getting anything to write to my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document, and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter
# path to folder containing all the files
str_dir_folder = '../data'
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):
        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()
        # add data to list
        list_file_data.append(str_file_data)
# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)
# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)
# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)
# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)
    # populates output list
    list_output_data.append(str_out_line)
# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
It would be helpful to have more information, such as what you've tried and what error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
list_file_data is assigned to '../data/phrases.txt', but there is then a loop through all files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
You named your file 'phrases.txt' but then check whether the file names end with 'data'. I've removed this logic.
You've placed the data set into a list, when findall works just fine with strings and ignores the special characters that you removed manually. Test at https://regex101.com/ to make sure.
Changed 'w+' to '\w+' (check out the above link).
Converting to a list outside of the output loop isn't necessary: your dict_word_count is a Counter object, which has an 'items' method ('iteritems' in Python 2) to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of manually generating CSVs, I've imported csv and used the writerow method (and quoting options).
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall
# name and location of output file
str_output_file = '../data/word_count.txt'
# the file that holds all the phrases
list_file_data = '../data/phrases.txt'
if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))
with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()
# find all the words and put them into a list
list_all_words = findall(r'\w+', str_file_data)
# Counter with the number of times each word has been used
counter_word_count = Counter(list_all_words)
# newline='' lets the csv module control the line endings (Python 3)
with open(str_output_file, 'w', newline='') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    # items() in Python 3 (iteritems() in Python 2)
    for key, value in counter_word_count.items():
        output_row = [key, value]
        writer.writerow(output_row)
Something like this?
from collections import Counter
from glob import glob
def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()
# add up one Counter per line, over every .data file
tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())
# most frequent words first
for k in sorted(tally, key=tally.get, reverse=True):
    print(k, tally[k])
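For a quick check without any files, the same counting idea can be tried on an in-memory string. This is only a sketch: the count_words helper and the sample text are made up for illustration, and lowercasing the text is an assumption about what should count as the same word.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter of the lowercase words found in text."""
    # \w+ matches runs of letters, digits and underscores,
    # so tabs and newlines act as separators
    return Counter(re.findall(r"\w+", text.lower()))

sample = "the cat sat\non the mat\tThe end"
counts = count_words(sample)
print(counts["the"])  # 3
for word, freq in counts.most_common(2):
    print(word, freq)
```

Counter.most_common(n) returns the n most frequent entries, which avoids sorting by hand.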

Python : Rename files

I read the category out of each XML file, then rename and save the file with the category and the year.
So, file "XYZ.xml" is now "News_2014.xml".
The problem is that there are several XML files with the category "News" from 2014. With my code, every rename overwrites the previous file, so I end up with only 1 file.
What can I do so that every file is saved? For example, if there are 2 files with the category "News" and the year 2014, their file names should be "News_2014_01.xml" and "News_2014_02.xml".
Since there are other categories, I cannot simply use one increasing integer, i.e. another file with the category "History" should still have the name "History_2014_01.xml" (and not ...03.xml).
Currently, I have the following code:
for text, key in enumerate(d):
    #print key, d[key]
    name = d[key][(d[key].find("__")+2):(d[key].rfind("__"))]
    year = d[key][(d[key].find("*")+1):(d[key].rfind("*"))]
    cat = d[key][(d[key].rfind("*")+1):]
    os.rename(name, cat+"_"+year+'.xml')
Once you have figured out the “correct” name for the file, e.g. News_2014.xml, you could make a loop that checks whether the file exists and adds an incrementing suffix to it while that’s the case:
import os
fileName = 'News_2014.xml'
# splitext keeps the dot in the extension: ('News_2014', '.xml')
baseName, extension = os.path.splitext(fileName)
suffix = 0
while os.path.exists(os.path.join(directory, fileName)):
    suffix += 1
    fileName = '{}_{:02}{}'.format(baseName, suffix, extension)
print(fileName)
os.rename(originalName, fileName)
You can put that into a function, so it’s easier to use:
def getIncrementedFileName (fileName):
    # splitext keeps the dot in the extension: ('News_2014', '.xml')
    baseName, extension = os.path.splitext(fileName)
    suffix = 0
    while os.path.exists(os.path.join(directory, fileName)):
        suffix += 1
        fileName = '{}_{:02}{}'.format(baseName, suffix, extension)
    return fileName
And then use that in your code:
for key, item in d.items():
    name = item[item.find("__")+2:item.rfind("__")]
    year = item[item.find("*")+1:item.rfind("*")]
    cat = item[item.rfind("*")+1:]
    fileName = getIncrementedFileName(cat + '_' + year + '.xml')
    os.rename(name, fileName)
[EDIT] @poke's solution is much more elegant, and he also posted it earlier.
You can check whether the target filename already exists and, if it does, modify the filename.
The easiest solution for me would be to always start by adding a counter to the file name, so you start with News_2014_000.xml (it may be better to be prepared for more than 100 files).
Then you loop until you find a filename that does not exist:
def versioned_filename(candidate):
    target = candidate
    while os.path.exists(target):
        fname, ext = target.rsplit('.', 1)
        head, tail = fname.rsplit('_', 1)
        count = int(tail)
        #:03d formats as 3-digit with leading zero
        target = "{}_{:03d}.{}".format(head, count+1, ext)
    return target
So, if you want to save as a 'News_2014_###.xml' file, you can create the name as usual, but call os.rename(sourcename, versioned_filename(targetname)).
If you want a more efficient solution, you can parse the output of glob.glob() to find the highest count. This saves the repeated calls to os.path.exists, but it only makes sense if you expect hundreds or thousands of files.
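The glob-based variant could look roughly like the sketch below. The function name next_versioned_name and its parameters are assumptions, and it uses pathlib's glob rather than glob.glob(). It scans the directory once and returns the name that is one above the highest existing count for that category and year.

```python
import re
import tempfile
from pathlib import Path

def next_versioned_name(directory, head, ext="xml", width=3):
    """Return '<head>_<NNN>.<ext>' with NNN one above the highest existing count."""
    pattern = re.compile(rf"{re.escape(head)}_(\d+)\.{re.escape(ext)}$")
    counts = []
    for path in Path(directory).glob(f"{head}_*.{ext}"):
        match = pattern.match(path.name)
        if match:
            counts.append(int(match.group(1)))
    # default=0 means the first file of a category gets count 1
    next_count = max(counts, default=0) + 1
    return f"{head}_{next_count:0{width}d}.{ext}"

# Demo in a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("News_2014_001.xml", "News_2014_007.xml", "History_2014_001.xml"):
        (Path(tmp) / name).touch()
    print(next_versioned_name(tmp, "News_2014"))     # News_2014_008.xml
    print(next_versioned_name(tmp, "History_2014"))  # History_2014_002.xml
```

Because each category has its own prefix, the counts for "News" and "History" stay independent.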
You could use a dictionary to keep track of the count. That way, there is no need to modify file names after you've renamed them. The downside is that every filename will have a number in it, even if the max number for that category ends up being 1.
cat_count = {}
for text, key in enumerate(d):
    name = d[key][(d[key].find("__")+2):(d[key].rfind("__"))]
    year = d[key][(d[key].find("*")+1):(d[key].rfind("*"))]
    cat = d[key][(d[key].rfind("*")+1):]
    cat_count[cat] = cat_count[cat] + 1 if cat in cat_count else 1
    os.rename(name, "%s_%s_%02d.xml" % (cat, year, cat_count[cat]))
