Sorry; I know there are a thousand 'make unique list' threads. I've tried to solve this on my own, and to adapt other "make unique list" solutions, but I've been unsuccessful with my limited Python skills.
I have a list of video file names (these are shots in a movie). For any given shot I want to remove duplicates, keyed on the shot folder in the path (e.g. de05_001); only the file with the highest tk_ number should end up in the final list.
E.g. for shot de05_001, only the tk_3 entry should end up in the list.
Input (with duplicates):
raw_list = ['D:\\de05\\de05_001\\postvis\\tk_2\\blasts\\tb205_de05_001.POSTVIS.mov',
'D:\\de05\\de05_001\\postvis\\tk_3\\blasts\\tb205_de05_001.POSTVIS.mov',
'D:\\de05\\de05_002\\postvis\\tk_1\\blasts\\tb205_de05_002.POSTVIS.mov',
'D:\\de05\\de05_017\\postvis\\tk_2\\blasts\\tb205_de05_017.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_2\\blasts\\tb205_de05_019.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_3\\blasts\\tb205_de05_019.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_4\\blasts\\tb205_de05_019.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_1\\blasts\\tb205_de05_019.POSTVIS.mov', ]
Output (duplicates removed, only highest tk_ numbers remain):
outputList = ['D:\\de05\\de05_001\\postvis\\tk_3\\blasts\\tb205_de05_001.POSTVIS.mov',
'D:\\de05\\de05_002\\postvis\\tk_1\\blasts\\tb205_de05_002.POSTVIS.mov',
'D:\\de05\\de05_017\\postvis\\tk_2\\blasts\\tb205_de05_017.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_4\\blasts\\tb205_de05_019.POSTVIS.mov', ]
Any help would be great. Thank you.
One way would be to create a dictionary keyed by shot and keep reassigning the values, so each key ends up holding only the winning take:
import os
raw_list1 = [
    'D:\\de05\\de05_019\\postvis\\tk_2\\blasts\\tb205_de05_019.POSTVIS.mov',
    'D:\\de05\\de05_019\\postvis\\tk_3\\blasts\\tb205_de05_019.POSTVIS.mov',
    'D:\\de05\\de05_019\\postvis\\tk_4\\blasts\\tb205_de05_019.POSTVIS.mov',
    'D:\\de05\\de05_019\\postvis\\tk_1\\blasts\\tb205_de05_019.POSTVIS.mov',
    'D:\\tw05\\tw05_036\\postvis\\tk_9\\blasts\\tb205_tw05_036.POSTVIS.mov',
    'D:\\tw05\\tw05_036\\postvis\\tk_13\\blasts\\tb205_tw05_036.POSTVIS.mov'
]
raw_list2 = [
'D:\\de05\\de05_001\\postvis\\tk_2\\blasts\\tb205_de05_001.POSTVIS.mov',
'D:\\de05\\de05_001\\postvis\\tk_3\\blasts\\tb205_de05_001.POSTVIS.mov',
'D:\\de05\\de05_002\\postvis\\tk_1\\blasts\\tb205_de05_002.POSTVIS.mov',
'D:\\de05\\de05_017\\postvis\\tk_2\\blasts\\tb205_de05_017.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_2\\blasts\\tb205_de05_019.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_3\\blasts\\tb205_de05_019.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_4\\blasts\\tb205_de05_019.POSTVIS.mov',
'D:\\de05\\de05_019\\postvis\\tk_1\\blasts\\tb205_de05_019.POSTVIS.mov',
]
def path_split(p, folders=None):
    folders = folders or []
    head, tail = os.path.split(p)
    if not tail:
        return folders
    return path_split(head, [tail] + folders)
for raw_list in (raw_list1, raw_list2):
    results = {}
    for p in raw_list:
        # Split the path into its components.
        # For something simple you could just do p.split('\\'), but since
        # we're working with paths, we might as well use os.path.split.
        shot1, shot2, folder1, take, folder2, file_name = path_split(p)
        # If something like 'de05_019' defines your shot, make that the key
        key = shot2
        # Extract the take number as an integer
        new_take_num = int(take.split('_')[-1])
        # Look up the take already stored for this shot; the default take
        # number of -1 guarantees any real take wins the first comparison
        existing_take_num, existing_path = results.get(key, (-1, None))
        # Keep whichever take has the bigger number; the key function means
        # only the take numbers are compared, not the paths
        value = max((existing_take_num, existing_path), (new_take_num, p),
                    key=lambda take_num_and_path: take_num_and_path[0])
        # Assign the value (which is either the existing take, or the new take)
        results[key] = value
    for res in sorted(results.values()):
        print(res)
    print('*' * 80)
This outputs (you could also print just res[1] to print only the path):
(4, 'D:\\de05\\de05_019\\postvis\\tk_4\\blasts\\tb205_de05_019.POSTVIS.mov')
(13, 'D:\\tw05\\tw05_036\\postvis\\tk_13\\blasts\\tb205_tw05_036.POSTVIS.mov')
********************************************************************************
(1, 'D:\\de05\\de05_002\\postvis\\tk_1\\blasts\\tb205_de05_002.POSTVIS.mov')
(2, 'D:\\de05\\de05_017\\postvis\\tk_2\\blasts\\tb205_de05_017.POSTVIS.mov')
(3, 'D:\\de05\\de05_001\\postvis\\tk_3\\blasts\\tb205_de05_001.POSTVIS.mov')
(4, 'D:\\de05\\de05_019\\postvis\\tk_4\\blasts\\tb205_de05_019.POSTVIS.mov')
********************************************************************************
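For comparison, the same idea can be sketched more compactly in Python 3 with regexes instead of path splitting. This is an alternative sketch, not the code above; it assumes the shot folder always sits directly before postvis and the take folder is always tk_<number>:

```python
import re

raw_list = [
    'D:\\de05\\de05_001\\postvis\\tk_2\\blasts\\tb205_de05_001.POSTVIS.mov',
    'D:\\de05\\de05_001\\postvis\\tk_3\\blasts\\tb205_de05_001.POSTVIS.mov',
    'D:\\de05\\de05_002\\postvis\\tk_1\\blasts\\tb205_de05_002.POSTVIS.mov',
]

best = {}  # shot name -> (take number, path)
for path in raw_list:
    # Both patterns are assumptions about the path layout shown above.
    shot = re.search(r'\\(\w+)\\postvis\\', path).group(1)
    take = int(re.search(r'\\tk_(\d+)\\', path).group(1))
    if take > best.get(shot, (-1, ''))[0]:
        best[shot] = (take, path)

output_list = [best[shot][1] for shot in sorted(best)]
print(output_list)
```

The dict does the deduplication; the comparison against the stored take number keeps only the highest one per shot.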
I am reading a cfg file, and receive a dictionary for each section. So, for example:
Config-File:
[General]
parameter1="Param1"
parameter2="Param2"
[FileList]
file001="file1.txt"
file002="file2.txt"
...
I have the FileList section stored in a dictionary called section. In this example, I can access "file1.txt" as test = section["file001"], so test == "file1.txt". To access every file of FileList one after the other, I could try the following:
for i in range(1, (number_of_files + 1)):
    access_key = str("file00" + str(i))
    print(section[access_key])
This is my current solution, but I don't like it at all. First of all, it looks kind of messy in python, but I will also face problems when more than 9 files are listed in the config.
I could also do it like:
for i in range(1, (number_of_files + 1)):
    if (i <= 9):
        access_key = str("file00" + str(i))
    elif (i > 9 and i < 100):
        access_key = str("file0" + str(i))
    print(section[access_key])
But I don't want to start with that because it becomes even worse. So my question is: What would be a proper and relatively clean way to go through all the file names in order? I definitely need the loop because I need to perform some actions with every file.
Use zero padding to generate the file number (e.g. see this SO answer: https://stackoverflow.com/a/339013/3775361). That way you don't have to write the digit-rollover logic yourself; built-in Python functionality does it for you. If you're using Python 3, I'd also recommend trying out f-strings (one of the suggested solutions at the link above). They're awesome!
If we can assume the file number has three digits, then any of the following achieves the zero padding. Each of them returns "015".
i = 15
str(i).zfill(3)
# or
"%03d" % i
# or
"{:0>3}".format(i)
# or
f"{i:0>3}"
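Applied to the loop from the question, that might look like this (a sketch; the section dict here is made up to stand in for the parsed config, and uses more than 9 entries to exercise the digit rollover):

```python
# Hypothetical stand-in for the parsed [FileList] section.
section = {f"file{i:03d}": f"file{i}.txt" for i in range(1, 13)}

number_of_files = len(section)
files = []
for i in range(1, number_of_files + 1):
    access_key = f"file{i:03d}"  # file001, file002, ..., file012
    files.append(section[access_key])
```

A single format spec replaces both branches of the if/elif version, and keeps working past 99 if you widen the padding.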
Start by looking at the keys you actually have instead of guessing what they might be. You need to filter out the ones that match your pattern, and sort according to the numerical portion.
keys = [key for key in section.keys() if key.startswith('file') and key[4:].isdigit()]
You can add additional conditions, like len(key) > 4, or drop the conditions entirely. You might also consider learning regular expressions to make the checking more elegant.
To sort the names without having to account for padding, you can do something like
keys = sorted(keys, key=lambda s: int(s[4:]))
You can also try a library like natsort, which will handle the custom sort key much more generally.
Now you can iterate over the keys and do whatever you want:
for key in sorted((k for k in section if k.startswith('file') and k[4:].isdigit()), key=lambda s: int(s[4:])):
    print(section[key])
Here is what a solution equipped with re and natsort might look like:
import re
from natsort import natsorted
pattern = re.compile(r'file\d+')
for key in natsorted(k for k in section if pattern.fullmatch(k)):
    print(section[key])
I might not be searching with the best terms, but so far nothing I've found solves my problem, and I really don't know where to start or even which mechanisms to investigate.
I have a large list of image files in various locations on my hard drive, and I'm trying to clean it up by removing the duplicates. Most of these are easy to find using hash codes, but I have a lot of corrupted or edited versions which aren't so easy to find. I know I'll need some user interaction to identify and delete (archive) the unwanted files. I'll also be doing further processing to make sure metadata such as dates and geotagging are correct (also used to potentially match files), and then display similar images with all known data through a simple HTML interface.
One of the steps I've identified is grouping similarly named files, or files which have part of another filename in their name. Sometimes these can be completely unrelated, so user interaction will be required.
Below is a sample of files, what I would like is to group them into filenames which are similar, disregarding path and file extension.
[
"/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
"/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
"/Users/stu/Photos/2013/IMAG0097.jpg",
"/Users/stu/Photos/2014/IMAG0097.jpg",
"/Users/stu/Photos/2013/IMAG0126.jpg",
"/Users/stu/Photos/Holidays/IMAG0132.jpg",
"/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg",
"/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg",
"/Users/stu/Downloads/Photos/IMG_20140412_195105.png",
"/Users/stu/Photos/2014/IMG_20140412_195110.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png",
"/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg",
"/Users/stu/Photos/2013/IMAG0126-edited.jpg",
"/Users/stu/Photos/2013/IMAG0126546.jpg"
]
The list of files above should output something like this:
{
"IMG_20140413_072335": [
"/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
"/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
"/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg"
],
"IMAG0097": [
"/Users/stu/Photos/2013/IMAG0097.jpg",
"/Users/stu/Photos/2014/IMAG0097.jpg"
],
"IMAG0126": [
"/Users/stu/Photos/2013/IMAG0126.jpg",
"/Users/stu/Photos/2013/IMAG0126-edited.jpg",
"/Users/stu/Photos/2013/IMAG0126546.jpg"
],
"IMAG0132": [
"/Users/stu/Photos/Holidays/IMAG0132.jpg"
],
"IMG_20140322_142557": [
"/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg"
],
"IMG_20140330_200132": [
"/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg"
],
"IMG_20140412_195105": [
"/Users/stu/Downloads/Photos/IMG_20140412_195105.png"
],
"IMG_20140412_195110": [
"/Users/stu/Photos/2014/IMG_20140412_195110.png"
],
"IMG_20140413_143245": [
"/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png"
]
}
Any ideas how to do this in Python3?
Thanks
Edit: I just added a few more examples to the sample set of filenames.
The following worked for me:
import os
from pprint import pprint

# t is the list of file paths from the question
d = dict()
for i in t:
    # take the basename and drop the extension (the part after the ".")
    tmp = os.path.basename(i).split(".")[0]
    # "(..)" is Windows' typical signature for duplicate names,
    # so split on it and keep the name before it
    k = tmp.split("(")[0]
    d.setdefault(k, [])  # ensures the key exists exactly once
    if k in tmp:
        d[k].append(i)

# SENTINEL: every input path must have landed in exactly one group
if sum([len(v) for v in d.values()]) != len(t):
    raise ValueError("The sanity check wasn't successful!")

pprint(d)
RESULT:
{'IMAG0097': ['/Users/stu/Photos/2013/IMAG0097.jpg',
'/Users/stu/Photos/2014/IMAG0097.jpg'],
'IMAG0126': ['/Users/stu/Photos/2013/IMAG0126.jpg'],
'IMAG0132': ['/Users/stu/Photos/Holidays/IMAG0132.jpg'],
'IMG_20140322_142557-edited': ['/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg'],
'IMG_20140330_200132': ['/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg'],
'IMG_20140412_195105': ['/Users/stu/Downloads/Photos/IMG_20140412_195105.png'],
'IMG_20140412_195110': ['/Users/stu/Photos/2014/IMG_20140412_195110.png'],
'IMG_20140413_072335': ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg',
'/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg',
'/Users/stu/Photos/2014/IMG_20140413_072335.jpg',
'/Users/stu/Documents/Backup/IMG_20140413_072335(4).png',
'/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg',
'/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg',
'/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg',
'/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg'],
'IMG_20140413_143245': ['/Users/stu/Photos/2014/IMG_20140413_143245(6).png',
'/Users/stu/Photos/2014/IMG_20140413_143245(7).png',
'/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245.png',
'/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg',
'/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg',
'/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png']}
It appears that your pictures are identified by what is between the last 'G' (from 'IMG' or 'IMAG') and the next '.' or '(' or '-'.
Using that portion of the strings as a key, we can easily group filenames into a dict of lists.
files = ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335.jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(4).png', '/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg', '/Users/stu/Photos/2013/IMAG0097.jpg', '/Users/stu/Photos/2014/IMAG0097.jpg', '/Users/stu/Photos/2013/IMAG0126.jpg', '/Users/stu/Photos/Holidays/IMAG0132.jpg', '/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg', '/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg', '/Users/stu/Downloads/Photos/IMG_20140412_195105.png', '/Users/stu/Photos/2014/IMG_20140412_195110.png', '/Users/stu/Photos/2014/IMG_20140413_143245(6).png', '/Users/stu/Photos/2014/IMG_20140413_143245(7).png', '/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245.png', '/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png', '/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png', '/Users/stu/Photos/2013/IMG_20140413_072335.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg', 
'/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg', '/Users/stu/Photos/2013/IMAG0126-edited.jpg', '/Users/stu/Photos/2013/IMAG0126546.jpg']
def photo_id(filename):
    i = filename.rfind('G') + 1
    j1 = filename.find('.', i)
    j2 = filename.find('(', i)
    j3 = filename.find('-', i)
    j = min(j for j in (j1, j2, j3, len(filename)) if j > -1)
    return filename[i:j]

photos = {}
for filename in files:
    photos.setdefault(photo_id(filename), []).append(filename)

print(photos)
# {'_20140413_072335': ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335.jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(4).png', '/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg', '/Users/stu/Photos/2013/IMG_20140413_072335.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg'],
# '0097': ['/Users/stu/Photos/2013/IMAG0097.jpg', '/Users/stu/Photos/2014/IMAG0097.jpg'],
# '0126': ['/Users/stu/Photos/2013/IMAG0126.jpg', '/Users/stu/Photos/2013/IMAG0126-edited.jpg'],
# '0132': ['/Users/stu/Photos/Holidays/IMAG0132.jpg'],
# '_20140322_142557': ['/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg'],
# '_20140330_200132': ['/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg'],
# '_20140412_195105': ['/Users/stu/Downloads/Photos/IMG_20140412_195105.png'],
# '_20140412_195110': ['/Users/stu/Photos/2014/IMG_20140412_195110.png'],
# '_20140413_143245': ['/Users/stu/Photos/2014/IMG_20140413_143245(6).png', '/Users/stu/Photos/2014/IMG_20140413_143245(7).png', '/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245.png', '/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png', '/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png'],
# '_20140413_072335_01': ['/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg'],
# '_20140413_072335_9352': ['/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg'],
# '0126546': ['/Users/stu/Photos/2013/IMAG0126546.jpg']}
So I worked out a solution that gives me what I'm after. I'm not sure it's the very best way to solve this, but it certainly does the job.
First, I created a dict with the full path as the key and the filename minus extension as the value. This is then sorted by value length, so that as I iterate through I start with shorter values and work up. Then I simply iterate through, checking each entry against all longer ones, looking for a match within the string, and grouping matches together. I've also allowed for small filenames by comparing value lengths and only matching if a threshold is reached (0.5 in the example below).
import os
from pprint import pprint
files = [
"/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
"/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
"/Users/stu/Photos/2013/IMAG0097.jpg",
"/Users/stu/Photos/2014/IMAG0097.jpg",
"/Users/stu/Photos/2013/IMAG0126.jpg",
"/Users/stu/Photos/Holidays/IMAG0132.jpg",
"/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg",
"/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg",
"/Users/stu/Downloads/Photos/IMG_20140412_195105.png",
"/Users/stu/Photos/2014/IMG_20140412_195110.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png",
"/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg",
"/Users/stu/Photos/2013/IMAG0126-edited.jpg",
"/Users/stu/Photos/20.png",
"/Users/stu/Photos/203.png",
"/Users/stu/Photos/2021.png",
"/Users/stu/Photos/2021q.png",
"/User/2.jpg"
]
relevanceFactor = 0.5
rawFiles = {}
for file in files:
    rawFiles[file] = os.path.splitext(os.path.basename(file))[0]
sortedFiles = sorted(rawFiles.items(), key=lambda kv: (len(kv[1]), kv[0]))
alreadyGrouped = []
groupedFiles = {}
for i, file in enumerate(sortedFiles):
    fullPath = file[0]
    cluster = file[1]
    if cluster not in alreadyGrouped:
        groupedFiles[cluster] = [fullPath]
        for compareFile in sortedFiles[i+1:]:
            compareFullPath = compareFile[0]
            compareCluster = compareFile[1]
            if len(cluster)/len(compareCluster) < relevanceFactor:
                break
            if (compareCluster not in alreadyGrouped
                    and cluster in compareCluster):
                alreadyGrouped.append(compareCluster)
                groupedFiles[cluster].append(compareFullPath)
        if cluster not in alreadyGrouped:
            alreadyGrouped.append(cluster)

pprint(groupedFiles)
I have a list of strings that contain commands separated by a dot . like this:
DeviceA.CommandA.1.Hello,
DeviceA.CommandA.2.Hello,
DeviceA.CommandA.11.Hello,
DeviceA.CommandA.3.Hello,
DeviceA.CommandB.1.Hello,
DeviceA.CommandB.1.Bye,
DeviceB.CommandB.What,
DeviceA.SubdeviceA.CommandB.1.Hello,
DeviceA.SubdeviceA.CommandB.2.Hello,
DeviceA.SubdeviceB.CommandA.1.What
And I would want to order them in natural order:
The order must prioritize fields from left to right (e.g. commands that start with DeviceA always go before DeviceB, etc.)
String fields are ordered alphabetically
Numeric fields are sorted numerically in ascending order
Therefore, the sorted output should be:
DeviceA.CommandA.1.Hello,
DeviceA.CommandA.2.Hello,
DeviceA.CommandA.3.Hello,
DeviceA.CommandA.11.Hello,
DeviceA.CommandB.1.Bye,
DeviceA.CommandB.1.Hello,
DeviceA.SubdeviceA.CommandB.1.Hello,
DeviceA.SubdeviceA.CommandB.2.Hello,
DeviceA.SubdeviceB.CommandA.1.What,
DeviceB.CommandB.What
Also note that the length of the command fields is dynamic, the number of fields separated by dot can be any size.
So far I tried this without luck (the numbers are ordered alphabetically; for example, 11 goes before 5):
list = [
"DeviceA.CommandA.1.Hello",
"DeviceA.CommandA.2.Hello",
"DeviceA.CommandA.11.Hello",
"DeviceA.CommandA.3.Hello",
"DeviceA.CommandB.1.Hello",
"DeviceA.CommandB.1.Bye",
"DeviceB.CommandB.What",
"DeviceA.SubdeviceA.CommandB.1.Hello",
"DeviceA.SubdeviceA.CommandB.2.Hello",
"DeviceA.SubdeviceB.CommandA.1.What"
]
sorted_list = sorted(list, key=lambda x: x.split('.'))
EDIT: Corrected typo error.
Something like this should get you going.
from pprint import pprint
data_list = [
"DeviceA.CommandA.1.Hello",
"DeviceA.CommandA.2.Hello",
"DeviceA.CommandA.3.Hello",
"DeviceA.CommandB.1.Hello",
"DeviceA.CommandB.1.Bye",
"DeviceB.CommandB.What",
"DeviceA.SubdeviceA.CommandB.1.Hello",
"DeviceA.SubdeviceA.CommandB.15.Hello", # added test case to ensure numbers are sorted numerically
"DeviceA.SubdeviceA.CommandB.2.Hello",
"DeviceA.SubdeviceB.CommandA.1.What",
]
def get_sort_key(s):
    # Turning the pieces into integers would fail some comparisons (1 vs "What"),
    # so instead pad them on the left to a suitably long string
    return [
        bit.rjust(30, "0") if bit.isdigit() else bit
        for bit in s.split(".")
    ]

# Note the key function must be passed as a kwarg.
sorted_list = sorted(data_list, key=get_sort_key)
pprint(sorted_list)
The output is
['DeviceA.CommandA.1.Hello',
'DeviceA.CommandA.2.Hello',
'DeviceA.CommandA.3.Hello',
'DeviceA.CommandB.1.Bye',
'DeviceA.CommandB.1.Hello',
'DeviceA.SubdeviceA.CommandB.1.Hello',
'DeviceA.SubdeviceA.CommandB.2.Hello',
'DeviceA.SubdeviceA.CommandB.15.Hello',
'DeviceA.SubdeviceB.CommandA.1.What',
'DeviceB.CommandB.What']
Specifying a key in sorted seems to achieve what you want:
import re

def my_key(s):
    n = re.search(r"\d+", s)  # raw string avoids the invalid-escape warning
    return (s[:n.span()[0]], int(n[0])) if n else (s,)

print(sorted(l, key=my_key))
Output:
['DeviceA.CommandA.1.Hello', 'DeviceA.CommandA.2.Hello', 'DeviceA.CommandA.3.Hello', 'DeviceA.CommandA.11.Hello', 'DeviceA.CommandB.1.Hello', 'DeviceA.CommandB.1.Bye', 'DeviceA.SubdeviceA.CommandB.1.Hello', 'DeviceA.SubdeviceA.CommandB.2.Hello', 'DeviceA.SubdeviceB.CommandA.1.What', 'DeviceB.CommandB.What']
There are many ways to achieve this. Here's one that doesn't rely on importing any additional modules:
LOS = ['DeviceA.CommandA.1.Hello',
'DeviceA.CommandA.2.Hello',
'DeviceA.CommandA.11.Hello',
'DeviceA.CommandA.3.Hello',
'DeviceA.CommandB.1.Hello',
'DeviceA.CommandB.1.Bye',
'DeviceB.CommandB.What',
'DeviceA.SubdeviceA.CommandB.1.Hello',
'DeviceA.SubdeviceA.CommandB.2.Hello',
'DeviceA.SubdeviceB.CommandA.1.What']
def func(s):
    tokens = s.split('.')
    for i, token in enumerate(tokens):
        try:
            v = int(token)
            return ('.'.join(tokens[0:i]), v)
        except ValueError:
            pass
    return (s, 0)

print(sorted(LOS, key=func))
I have this task that I've been working on, but I'm having extreme misgivings about my methodology.
The problem is that I have a ton of Excel files that are formatted strangely (and not consistently), and I need to extract certain fields from each entry. An example data set was attached to the original post as an image.
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years
#import the data csv
import sys
import re
import csv
def cleancommas(x):
    toggle=False
    for i,j in enumerate(x):
        if j=="\"":
            toggle=not toggle
        if toggle==True:
            if j==",":
                x=x[:i]+" "+x[i+1:]
    return x
def districtatize(x):
    #list indexes of entries starting with "for" or "to" of length >5
    indices=[1]
    for i,j in enumerate(x):
        if len(j)>2:
            if j[:2]=="to":
                indices.append(i)
        if len(j)>3:
            if j[:3]==" to" or j[:3]=="for":
                indices.append(i)
        if len(j)>5:
            if j[:5]==" \"for" or j[:5]==" \'for":
                indices.append(i)
        if len(j)>4:
            if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
                indices.append(i)
    if len(indices)==1:
        return [x[0],x[1:len(x)-1]]
    new=[x[0],x[1:indices[1]+1]]
    z=1
    while z<len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z+=1
    return new
#should return a list of lists. First entry will be county
#each successive element in list will be list by district
def splitforstos(string):
    for itemind,item in enumerate(string): # take all exception cases that didn't get processed
        splitfor=re.split('(?<=\d)\s\s(?=for)',item) # correctly and split them up so that the for begins
        splitto=re.split('(?<=\d)\s\s(?=to)',item)   # a cell
        if len(splitfor)>1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind,splitfor[0])
            string.insert(itemind+1,splitfor[1])
        elif len(splitto)>1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind,splitto[0])
            string.insert(itemind+1,splitto[1])
def analyze(x):
    #input should be a string of content
    #target values are nomills,levytype,term,yearcom,yeardue
    clean=cleancommas(x)
    countylist=clean.split(',')
    emptystrip=filter(lambda a: a != '',countylist)
    empt2strip=filter(lambda a: a != ' ', emptystrip)
    singstrip=filter(lambda a: a != '\' \'',empt2strip)
    quotestrip=filter(lambda a: a !='\" \"',singstrip)
    splitforstos(quotestrip)
    distd=districtatize(quotestrip)
    print '\n\ndistrictized\n\n',distd
    county = distd[0]
    for x in distd[1:]:
        if len(x)>8:
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            numyears=x[6]
            yearcom=x[8]
            yeardue=x[10]
            reason=x[11]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data",data
        else:
            print "x\n\n",x
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            special=x[5]
            splitspec=special.split(' ')
            try:
                forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
                numyears=splitspec[forind+1]
                yearcom=splitspec[forind+6]
            except:
                forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
                numyears=None
                yearcom=splitspec[forind+2]
            yeardue=str(x[6])[-4:]
            reason=x[7]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data other", data
        openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)
# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1]  # the file is the first argument
f = open(filename, 'r')
contents = f.read()  # entire csv as string
# find index of every instance of the word COUNTY
separators = [m.start() for m in re.finditer('\w+\sCOUNTY', contents)]  # alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x, y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
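As a sanity check, the separator regex and the section slicing above can be exercised in isolation; the sample text below is invented, not real data:

```python
import re

# Invented sample text standing in for the real file contents.
contents = "ADAMS COUNTY data one, BROWN COUNTY data two, CLARK COUNTY data three"

# Same separator logic as the script: start offsets of every
# "<word> COUNTY" occurrence, then slice section by section.
separators = [m.start() for m in re.finditer(r'\w+\sCOUNTY', contents)]
sections = []
for x, y in enumerate(separators):
    try:
        sections.append(contents[y:separators[x + 1]])
    except IndexError:
        sections.append(contents[y:])

print(sections)
```

Each element of `sections` starts at one county header and runs up to the next, with the last section running to the end of the string.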
Is there a more robust method of approaching this problem than simple string processing?
Not really.
What I had in mind was more of a fuzzy-logic approach for trying to pin down which field an item belongs to, one that could handle somewhat arbitrary inputs. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district = x[0]
vote1 = x[1]
votemil = x[2]
spaceindex = [m.start() for m in re.finditer(' ', votemil)][-1]
vote2 = votemil[:spaceindex]
mills = votemil[spaceindex+1:]
votetype = x[4]
numyears = x[6]
yearcom = x[8]
yeardue = x[10]
reason = x[11]
data = [filename, county, district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data", data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district=x[0],
    vote1=x[1],
    # ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
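A minimal sketch of that suggestion using collections.namedtuple; the record name LevyRecord and the sample values are invented for illustration, while the field names mirror the loose variables in the question's code:

```python
from collections import namedtuple

# "LevyRecord" and the sample values below are illustrative only;
# the field names come from the variables in the original analyze().
LevyRecord = namedtuple(
    "LevyRecord",
    ["filename", "county", "district", "vote1", "vote2", "mills",
     "votetype", "numyears", "yearcom", "yeardue", "reason"],
)

rec = LevyRecord(
    filename="2007May8Tax.csv", county="ADAMS COUNTY", district="Adams LSD",
    vote1="1234", vote2="567", mills="2.5", votetype="Renewal",
    numyears="5", yearcom="2007", yeardue="2008", reason="current expenses",
)

print(rec.district)  # fields are read by name instead of by index
```

A named tuple still indexes and unpacks like the original flat list, so a downstream csv.writer call would keep working unchanged.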
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in (some, list, of, functions):
    match = p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
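A toy version of that dispatch, assuming two invented matcher functions whose "patterns" are just length checks; the real rules would inspect the row contents, but the len(row) > 8 threshold echoes the branch in the question:

```python
from collections import namedtuple

# Invented record shapes and matchers for illustration only.
Long = namedtuple("Long", ["district", "votetype", "numyears"])
Short = namedtuple("Short", ["district", "votetype"])

def match_long(row):
    if len(row) > 8:
        return Long(district=row[0], votetype=row[4], numyears=row[6])
    return None  # this matcher did not like the row

def match_short(row):
    if len(row) > 4:
        return Short(district=row[0], votetype=row[4])
    return None

def classify(row):
    # try the most specific pattern first, fall through to the next
    for p in (match_long, match_short):
        match = p(row)
        if match:
            return match
    return None

print(classify(list("abcdefghij")))
```

Adding a new row layout then means writing one new matcher and appending it to the tuple, rather than growing the if/else tree inside analyze.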
I am working in Python and I need to match strings across several data files. First I use pickle to unpack my files, and then I place them into lists. I only want to match strings that have the same conditions; these conditions are indicated at the end of each string.
My working script looks approximately like this:
import pickle

f = open("data_a.dat")
list_a = pickle.load(f)
f.close()

f = open("data_b.dat")
list_b = pickle.load(f)
f.close()

f = open("data_c.dat")
list_c = pickle.load(f)
f.close()

f = open("data_d.dat")
list_d = pickle.load(f)
f.close()

for a in list_a:
    for b in list_b:
        for c in list_c:
            for d in list_d:
                if a.GetName()[12:] in b.GetName():
                    if a.GetName()[12:] in c.GetName():
                        if a.GetName()[12:] in d.GetName():
                            "do whatever"
This seems to work fine for these lists. The problems begin when I try to add 8 or 9 more data files for which I also need to match the same conditions. The script simply won't finish and gets stuck. I appreciate your help.
Edit: Each of the lists contains histograms named after the parameters that were used to create them. The names of the histograms contain these parameters and their values at the end of the string. In the example I did it for 2 data sets; now I would like to do it for 9 data sets without using multiple loops.
Edit 2: I just expanded the code to reflect more accurately what I want to do. Now if I try to do that for 9 lists, it not only looks horrible, it also doesn't work.
Off the top of my head:
import pickle

files = ["file_a", "file_b", "file_c"]
sets = []
for fname in files:
    f = open(fname)
    sets.append(set(pickle.load(f)))
    f.close()

intersection = sets[0].intersection(*sets[1:])
EDIT: Well, I overlooked your mapping to x.GetName()[12:], but you should still be able to reduce your problem to set logic.
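One way to fold that GetName()[12:] mapping into set logic is to index each list by its condition suffix first. In this sketch plain strings stand in for the histogram objects, and the invented names are built so that the first 12 characters play the role of the per-sample prefix:

```python
# Invented names standing in for histogram objects; characters from
# position 12 onward are the condition suffix that must match everywhere.
lists = [
    ["histo_par1=1-cond=A", "histo_par1=1-cond=B"],
    ["other_par1=2-cond=A", "other_par1=2-cond=C"],
]

# one dict per list: condition suffix -> full name
indexed = [{name[12:]: name for name in lst} for lst in lists]

# conditions present in every list, however many lists there are
common = set(indexed[0]).intersection(*indexed[1:])

for key in sorted(common):
    matched = [d[key] for d in indexed]  # one matching name per input list
    print(key, matched)
```

This replaces the nested loops with one pass per list plus a set intersection, so adding a ninth list adds one dict rather than another level of nesting.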
Here is a small piece of code you can take inspiration from. The main idea is a match function that steps through every combination of positions.
For simplicity's sake, I assume the data are already loaded into lists, but you can load them from files beforehand:
data_files = [
'data_a.dat',
'data_b.dat',
'data_c.dat',
'data_d.dat',
'data_e.dat',
]
lists = [pickle.load(open(f)) for f in data_files]
And because I don't really get the details of what you need to do, my goal here is to find matches on the first four characters:
def do_whatever(string):
    print "I have matched the string '%s'" % string

lists = [
    ["hello", "world", "how", "grown", "you", "today", "?"],
    ["growl", "is", "a", "now", "on", "appstore", "too bad"],
    ["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]

positions = [0 for i in range(len(lists))]

def recursive_match(positions, lists):
    strings = map(lambda p, l: l[p], positions, lists)
    match = True
    searched_string = strings.pop(0)[:4]
    for string in strings:
        if searched_string not in string:
            match = False
            break
    if match:
        do_whatever(searched_string)
    # increment positions:
    new_positions = positions[:]
    lists_len = len(lists)
    for i, l in enumerate(reversed(lists)):
        max_position = len(l) - 1
        list_index = lists_len - i - 1
        current_position = positions[list_index]
        if max_position > current_position:
            new_positions[list_index] += 1
            break
        else:
            new_positions[list_index] = 0
            continue
    return new_positions, not any(new_positions)

search_is_finished = False
while not search_is_finished:
    positions, search_is_finished = recursive_match(positions, lists)
Of course you can optimize a lot of things here; this is draft code. But take a look at how recursive_match steps through the positions, as that is the major concept.
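For what it's worth, the hand-rolled position stepping is the same traversal that itertools.product performs in one line. A trimmed-down sketch with invented word lists:

```python
import itertools

# shortened, invented word lists in the spirit of the example above
lists = [
    ["grown", "hello"],
    ["growl", "is"],
]

matches = []
# every combination of one string per list, in the same odometer order
# as the manual position bookkeeping
for combo in itertools.product(*lists):
    prefix = combo[0][:4]
    if all(prefix in s for s in combo[1:]):
        matches.append(prefix)

print(matches)  # ['grow']
```

Using product removes the new_positions bookkeeping entirely, at the cost of materializing each combination as a tuple.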
In the end I used the built-in map function. I realize now I should have been even more explicit than I was (which I will be in the future).
My data files are histograms with 5 parameters, some with 3 or 4. Something like this,
par1=["list with some values"]
par2=["list with some values"]
par3=["list with some values"]
par4=["list with some values"]
par5=["list with some values"]
I need to examine the behavior of the quantity plotted for each possible combination of the values of the parameters. In the end, I get a data file with ~300 histograms each identified in their name with the corresponding values of the parameters and the sample name. It looks something like,
datasample1-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample1-"permutation of the above values"
...
datasample9-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample9-"permutation of the above values"
So I get 300 histograms for each of the 9 data files, but luckily all of these histograms are created in the same order. Hence I can pair them all just using the built-in map function. I unpack the data files, put each one into a list, and then use map to pair each histogram with its corresponding configuration in the other data samples.
for lst in map(None, data1_histosli, data2_histosli, ...data9_histosli):
    do_something(lst)
This solves my problem. Thank you to all for your help!
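A side note on portability: map with None as the function only exists in Python 2. Under Python 3 the equivalent pairing is itertools.zip_longest, which also pads the shorter lists with None; the list names below are invented stand-ins for two of the histogram lists:

```python
from itertools import zip_longest

# invented stand-ins for two of the histogram lists
data1_histosli = ["h1", "h2", "h3"]
data2_histosli = ["g1", "g2"]

# pads the shorter list with None, like Python 2's map(None, ...)
pairs = list(zip_longest(data1_histosli, data2_histosli))
print(pairs)  # [('h1', 'g1'), ('h2', 'g2'), ('h3', None)]
```

Plain zip would instead stop at the shortest list, which is fine here since all 9 lists are stated to have the same length.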