I need to visually separate photos (JPEGs) in a folder by placing black placeholder pictures between series of files with identical names (only the last two digits of the file names differ). The folder typically contains single (stand-alone) photos named something like 03-12345-randomfilename.jpg, and series named 03-12345-file01.jpg, 03-12345-file02.jpg, ..03, ..04, etc.
The singles should be left alone, but I need to place a black picture before and after every series.
I have the following Python script (originally written by someone else) that is intermittently failing for no apparent reason. It usually works, but sometimes it will overwrite files in the middle of a series, or more typically, it will fail to place a black picture after the last photo in a series. I've spent hours trying to figure out what's going on, but I'm stuck.
Any suggestions most appreciated.
import os
import re
import shutil

def blackJPG(directory):
    # iterate over every file name in the directory
    blackJPG = '/Users/username/black.jpg'
    filelist = {}
    for file_name in os.listdir(directory):
        filename, file_extension = os.path.splitext(file_name)
        stringmatch = re.compile(r'(\d{2})(.*?)(\d+)(.*?)(([A-Za-z]+))(.*?)(\d+)')
        m = stringmatch.search(file_name)
        # Create search table
        if m:
            sequence = int(m.group(8))
            filename_without_sequence = "{0}{1}{2}{3}{4}{5}".format(
                m.group(1), m.group(2), m.group(3), m.group(4), m.group(6), m.group(7))
            filelist.update({filename_without_sequence: sequence})
    for key, value in filelist.iteritems():
        if value > 1:
            newJPG = "{0}/{1}00.jpg".format(directory, key)
            if value >= 10:
                lastJPG = "{0}/{1}{2}.jpg".format(directory, key, value + 1)
            else:
                lastJPG = "{0}/{1}0{2}.jpg".format(directory, key, value + 1)
            # Create first blackJPG
            shutil.copyfile(blackJPG, newJPG)
            # Create last blackJPG
            shutil.copyfile(blackJPG, lastJPG)
    return "Done"
If the variation is always the last two characters, then you can grab the part that doesn't change (the prefix), count the files per prefix, and create the black files for those prefixes with more than one file:
import os
import shutil

black_jpg_location = '/Users/username/black.jpg'  # path to your black placeholder

def add_black_jpg(directory):
    series_count = {}
    for file in os.listdir(directory):
        name, ext = os.path.splitext(file)
        prefix = name[:-2]
        count = series_count.get(prefix, 0)
        series_count[prefix] = count + 1
    for prefix, count in series_count.items():
        if count > 1:
            # black picture before the series (sequence 00)
            shutil.copyfile(black_jpg_location, os.path.join(directory, f"{prefix}00.jpg"))
            # black picture after the series (last sequence + 1); assumes the
            # series is numbered 01, 02, ... without gaps
            shutil.copyfile(black_jpg_location, os.path.join(directory, f"{prefix}{count + 1:02}.jpg"))
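For instance, with the sample names from the question, the prefix counting step behaves like this (a quick sketch using the naming scheme given above):

```python
import os

names = ["03-12345-file01.jpg", "03-12345-file02.jpg",
         "03-12345-file03.jpg", "03-12345-randomfilename.jpg"]

series_count = {}
for file in names:
    name, ext = os.path.splitext(file)
    prefix = name[:-2]  # drop the two-digit sequence suffix
    series_count[prefix] = series_count.get(prefix, 0) + 1

print(series_count)
# the series prefix is counted 3 times; the stand-alone name only once
```

A stand-alone file also loses its last two characters, but since that truncated prefix is unique its count stays at 1 and it is left alone.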
I have several thousand PDFs which I need to rename based on their content. The layouts of the PDFs are inconsistent. To rename them I need to locate a specific string, "MEMBER". I need the value after the string "MEMBER" and the values from the two lines above it, which are the date and time values respectively.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+) which matches all of the values I need. But it puts them across four groups with group 3 just showing a carriage return.
What I have so far looks like this:
import fitz
from os import DirEntry, curdir, chdir, getcwd, rename
from glob import glob as glob
import re

failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""

get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')

for pdf in pdf_list:
    with fitz.open(pdf) as pdf_obj:
        for page in pdf_obj:
            text += page.get_text()
    new_file_name = re.search(pdf_regex, text).group().strip().replace(":", "").replace("-", "") + '.pdf'
    text = ""
    #clean_new_file_name = new_file_name.translate({(":"): None})
    print(new_file_name)

    # Tries to rename a pdf. If the filename doesn't already exist
    # then rename. If it does exist then throw an error and add to failed list
    try:
        rename(pdf, new_file_name)
    except WindowsError:
        count += 1
        failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search portion (for instance group 4, which contains the MEMBER ##### value), then the file renames successfully with just that value. Similarly, group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the printed value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex or with the re.search portion? Or everything?
I have tried re-doing the Regular Expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read each page's text word by word and sort the words adequately.
If we then find "MEMBER", the word following it holds the hashes ####, and the two preceding words are the time and the date respectively.
found = False
for page in pdf_obj:
    # all space-delimited strings on the page, sorted vertically,
    # then horizontally
    words = page.get_text("words", sort=True)
    for i, word in enumerate(words):
        if word[4] == "MEMBER":
            hashes = words[i + 1][4]       # this must be the word after MEMBER!
            time_string = words[i - 1][4]  # the time
            date_string = words[i - 2][4]  # the date
            found = True
            break
    if found:  # no need to look in other pages
        break
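Once the three words are extracted, the new file name can be assembled much as the original script did (a sketch with placeholder values; note the extra replace for "/", since slashes from the date are not legal in file names):

```python
# placeholder values standing in for words[i-2][4], words[i-1][4], words[i+1][4]
date_string, time_string, hashes = "01/02/23", "12:34:56", "123456"

new_file_name = "{} {} MEMBER {}".format(date_string, time_string, hashes)
# strip characters that are illegal or unwanted in file names
new_file_name = new_file_name.replace(":", "").replace("/", "").replace("-", "") + ".pdf"
print(new_file_name)  # 010223 123456 MEMBER 123456.pdf
```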
I would like to save different positions of a file name in different pandas df columns.
For example my file names look like this:
001015io.png
position 0-2 in column 'y position' in this case '001'
position 3-5 in column 'x position' in this case '015'
position 6-7 in column 'status' in this case 'io'
My folder contains about 400 of these picture files. I'm a beginner in programming, so I don't know how I should start to solve this.
If the parts of the file names that you need are consistent (same position and length in all files), you can use string slicing to create new columns from the pieces of the file name like this:
import pandas as pd
df = pd.DataFrame({'file_name': ['001015io.png']})
df['y position'] = df['file_name'].str[0:3]
df['x position'] = df['file_name'].str[3:6]
df['status'] = df['file_name'].str[6:8]
This results in the dataframe:
file_name y position x position status
0 001015io.png 001 015 io
Note that when you slice a string you give a start position and a stop position like [0:3]. The start position is inclusive, but the stop position is not, so [0:3] gives you the substring from 0-2.
You can do this with slicing. A string is essentially a sequence of characters, so you can slice it into the parts you need. See the example below.
filename = '001015io.png'
x = filename[0:3]
y = filename[3:6]
status = filename[6:8]
print(x, y, status)
output
001 015 io
As for getting the list of files, there's an absurdly complete answer for that here.
I have this function below in my personal library which I reuse whenever I need to generate a list of files.
def get_files_from_path(path: str = ".", ext=None) -> list:
    """Find files in path and return them as a list.

    Gets all files in folders and subfolders.

    See the answer on the link below for a ridiculously
    complete answer for this. I tend to use this one.
    Note that it also goes into subdirs of the path.
    https://stackoverflow.com/a/41447012/9267296

    Args:
        path (str, optional): Which path to start on.
            Defaults to '.'.
        ext (str/list, optional): Optional file extension.
            Defaults to None.

    Returns:
        list: list of full file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            filepath = f"{subdir}{os.sep}{fname}"
            if ext is None:
                result.append(filepath)
            elif isinstance(ext, str) and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif isinstance(ext, list):
                for item in ext:
                    if fname.lower().endswith(item.lower()):
                        result.append(filepath)
    return result
There's one thing you need to take into account here: this function gives the full file path, e.g. path/to/file/001015io.png.
You can use the code below to get just the filename:
import os
print(os.path.basename('path/to/file/001015io.png'))
output
001015io.png
Use what Bill the Lizard said to turn it into a df.
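Putting the pieces together (a sketch; the paths are made up, and in practice the file-listing function above would supply them):

```python
import os
import pandas as pd

paths = ["path/to/file/001015io.png", "path/to/file/002016io.png"]  # hypothetical

# strip the directories, then slice the name into its fixed-position parts
df = pd.DataFrame({"file_name": [os.path.basename(p) for p in paths]})
df["y position"] = df["file_name"].str[0:3]
df["x position"] = df["file_name"].str[3:6]
df["status"] = df["file_name"].str[6:8]
print(df)
```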
I'm iterating through a folder with files and adding each of the file's path to a list. The folder contains files with incrementing file names, such as 00-0.txt, 00-1.txt, 00-2.txt, 01-0.txt, 01-1.txt, 01-2.txt and so on.
The number of files is not fixed and always varies. Also, sometimes a file could be missing. This means that I will sometimes get this list instead:
00-0.txt, 00-1.txt, 01-0.txt, 01-1.txt, 01-2.txt.
However, in my final list, I should always have groups of 9 (so 00-0, 00-1, 00-2 and so on until 00-8 is one group). If a file is missing, then I will append the 'is missing' string to the new list instead.
What I was thinking of doing is the following:
Get the last character of the filename (for ex. '3')
Check if its value is the same as the previous index + 1.
If it's not, then append the 'is missing' string
If it's the same, then append the file name
In pseudo-code (please don't mind the syntax errors, I'm mainly looking for high-level advice), it would be something like this:
empty_list = []
list_with_paths = glob.glob("/path/to/dir/*.txt")
for index, item in enumerate(list_with_paths):
    basename = os.path.basename(item)
    filename = os.path.splitext(basename)[0]
    if index == 0 and int(filename[-1]) != 0:
        empty_list.append('is missing')
    elif filename[-1] != empty_list[index - 1] + 1:
        empty_list.append('is missing')
    else:
        empty_list.append(filename)
I'm sure there is a more optimal solution in order to achieve this.
Once you have the set of actual paths, just iterate over the expected paths until you have accounted for all of the actual paths.
from itertools import count

list_with_paths = set(glob.glob("/path/to/dir/*.txt"))
groups = count()
results = []
for g in groups:
    if not list_with_paths:
        break
    for i in range(0, 9):
        expected = "{:02}-{}.txt".format(g, i)
        path = "/path/to/dir/" + expected
        if path in list_with_paths:
            # remove the full path, since that is what the set contains
            list_with_paths.remove(path)
        else:
            expected = "is missing"
        results.append(expected)
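Exercised against the sample from the question (plain names instead of full paths, to keep the sketch self-contained):

```python
from itertools import count

available = {"00-0.txt", "00-1.txt", "01-0.txt", "01-1.txt", "01-2.txt"}
results = []
for g in count():
    if not available:
        break
    for i in range(9):
        expected = "{:02}-{}.txt".format(g, i)
        if expected in available:
            available.remove(expected)
            results.append(expected)
        else:
            results.append("is missing")

# two groups of 9: 00-0, 00-1, 7 x "is missing", then 01-0 ... 01-2, 6 x "is missing"
print(len(results))  # 18
```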
I'm running .txt files through a for loop which should slice out keywords and .append() them into lists. For some reason my regex statements are returning really odd results.
My first statement, which iterates through the full file names and slices out the keyword, works well.
# Creates a workflow list of file names within target directory for further iteration
stack = os.listdir(
    "/Users/me/Documents/software_development/my_python_code/random/countries"
)

# declares list, to be filled, and their associated regular expression, to be used,
# in the primary loop
names = []
name_pattern = r"-\s(.*)\.txt"

# PRIMARY LOOP
for entry in stack:
    if entry == ".DS_Store":
        continue
    # extraction of country name from file name into `names` list
    name_match = re.search(name_pattern, entry)
    name = name_match.group(1)
    names.append(name)
This works fine and creates the list that I expect
However, once I move on to a similar process with the actual contents of files, it no longer works.
religions = []
reli_pattern = r"religion\s=\s(.+)."

# PRIMARY LOOP
for entry in stack:
    if entry == ".DS_Store":
        continue
    # opens and reads file within `contents` variable
    file_path = (
        "/Users/me/Documents/software_development/my_python_code/random/countries"
        + "/" + entry
    )
    selection = open(file_path, "rb")
    contents = str(selection.read())
    # extraction of religion type and placement into `religions` list
    reli_match = re.search(reli_pattern, contents)
    religion = reli_match.group(1)
    religions.append(religion)
The results should be something like: "theravada", "catholic", "sunni", etc.
Instead I'm getting seemingly random pieces of text from the documents which have nothing to do with my regex, like ruler names and stat values that do not contain the word "religion".
To try and figure this out I isolated some of the code in the following way:
contents = "religion = catholic"
reli_pattern = r"religion\s=\s(.*)\s"
reli_match = re.search(reli_pattern, contents)
print(reli_match)
And None is printed to the console so I am assuming the problem is with my REGEX. What silly mistake am I making which is causing this?
Your regular expression (religion\s=\s(.*)\s) requires a trailing whitespace character (the last \s there). Since your string doesn't have one, the search finds nothing and re.search returns None.
You should either:
Change your regex to be r"religion\s=\s(.*)" or
Change the string you're searching to have a trailing whitespace (i.e 'religion = catholic' to 'religion = catholic ')
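With the first option, the isolated test from the question then matches (a quick check):

```python
import re

contents = "religion = catholic"
reli_match = re.search(r"religion\s=\s(.*)", contents)
print(reli_match.group(1))  # catholic
```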
I read the category out of each XML file, and I rename and save the file with the year.
So, file "XYZ.xml" is now "News_2014.xml".
The problem is that there are several XML files with the category "News" from 2014. With my code, I overwrite all other files and can save only one file.
What can I do so that every file is saved? For example, if there are two files with the category "News" and the year 2014, their file names should be: "News_2014_01.xml" and "News_2014_02.xml".
Since there are other categories, I cannot simply implement an increasing integer, i.e. another file with the category "History" should still have the name "History_2014_01.xml" (and not ..._03.xml).
Currently, I have the following code:
for text, key in enumerate(d):
    #print key, d[key]
    name = d[key][(d[key].find("__")+2):(d[key].rfind("__"))]
    year = d[key][(d[key].find("*")+1):(d[key].rfind("*"))]
    cat = d[key][(d[key].rfind("*")+1):]
    os.rename(name, cat+"_"+year+'.xml')
Once you have figured out the “correct” name for the file, e.g. News_2014.xml, you could make a loop that checks whether the file exists and adds an incrementing suffix to it while that’s the case:
import os

fileName = 'News_2014.xml'
baseName, extension = os.path.splitext(fileName)
suffix = 0
while os.path.exists(os.path.join(directory, fileName)):
    suffix += 1
    # note: `extension` from splitext already includes the leading dot
    fileName = '{}_{:02}{}'.format(baseName, suffix, extension)
print(fileName)
os.rename(originalName, fileName)
You can put that into a function, so it’s easier to use:
def getIncrementedFileName(fileName):
    baseName, extension = os.path.splitext(fileName)
    suffix = 0
    while os.path.exists(os.path.join(directory, fileName)):
        suffix += 1
        # `extension` already includes the leading dot
        fileName = '{}_{:02}{}'.format(baseName, suffix, extension)
    return fileName
And then use that in your code:
for key, item in d.items():
    name = item[item.find("__")+2:item.rfind("__")]
    year = item[item.find("*")+1:item.rfind("*")]
    cat = item[item.rfind("*")+1:]

    fileName = getIncrementedFileName(cat + '_' + year + '.xml')
    os.rename(name, fileName)
[EDIT] poke's solution is much more elegant, not to mention he posted it earlier.
You can check if target filename already exists, and if it does, modify filename.
The easiest solution for me would be to always start by adding a counter to the file name, so you start with News_2014_000.xml (maybe better to be prepared for more than 100 files?).
Later you loop until you find filename, that does not exist:
def versioned_filename(candidate):
    target = candidate
    while os.path.exists(target):
        fname, ext = target.rsplit('.', 1)
        head, tail = fname.rsplit('_', 1)
        count = int(tail)
        # :03d formats as 3-digit with leading zero
        target = "{}_{:03d}.{}".format(head, count + 1, ext)
    return target
So, if you want to save as a 'News_2014_###.xml' file, you can create the name as usual, but call os.rename(sourcename, versioned_filename(targetname)).
If you want a more efficient solution, you can parse the output of glob.glob() to find the highest count; you will save on multiple calls to os.path.exists, but it only makes sense if you expect hundreds or thousands of files.
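That glob-based variant could look roughly like this (a sketch; it assumes names end in _NNN right before the extension, and the helper name is made up):

```python
import glob
import os
import re

def next_versioned_filename(base, ext, directory="."):
    """Return base_NNN.ext with NNN one higher than the highest existing counter."""
    highest = -1
    for path in glob.glob(os.path.join(directory, "{}_*{}".format(base, ext))):
        # pull the numeric counter out of names like base_007.xml
        m = re.search(r"_(\d+)" + re.escape(ext) + r"$", path)
        if m:
            highest = max(highest, int(m.group(1)))
    return "{}_{:03d}{}".format(base, highest + 1, ext)
```

This scans the directory once per rename instead of calling os.path.exists in a loop.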
You could use a dictionary to keep track of the count. That way, there is no need to modify file names after you've renamed them. The downside is that every filename will have a number in it, even if the max number for that category ends up being 1.
cat_count = {}
for text, key in enumerate(d):
    name = d[key][(d[key].find("__")+2):(d[key].rfind("__"))]
    year = d[key][(d[key].find("*")+1):(d[key].rfind("*"))]
    cat = d[key][(d[key].rfind("*")+1):]

    cat_count[cat] = cat_count[cat] + 1 if cat in cat_count else 1
    os.rename(name, "%s_%s_%02d.xml" % (cat, year, cat_count[cat]))
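The counting line can be checked in isolation (a small sketch with made-up categories):

```python
cat_count = {}
for cat in ["News", "News", "History"]:
    # first occurrence starts at 1, later occurrences increment
    cat_count[cat] = cat_count[cat] + 1 if cat in cat_count else 1

print(cat_count)  # {'News': 2, 'History': 1}
```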