Sorting a nested list - Python

I'm trying to sort my list, which contains 3 nested lists: paths, file names, and file creation times. I want to sort them so I can get the latest files.
I've seen people use lambda for this, but I don't feel comfortable with lambdas and don't quite get how sorting with them works.
I think the best way is just to swap the list components, but this does not work:
class FILE:
    PATH = 0
    NAME = 1
    DATE = 2

mayaFiles = [[], [], []]
mayaFiles[FILE.DATE] = [0, 56, 3, 12, 7, 35, 16]

doSwitch = True
while doSwitch:
    for ma in range(0, len(mayaFiles[FILE.DATE]) - 1):
        doSwitch = False
        doSwitch = mayaFiles[FILE.DATE][ma] > mayaFiles[FILE.DATE][ma + 1]
        hi = mayaFiles[FILE.DATE][ma]
        lo = mayaFiles[FILE.DATE][ma + 1]
        if doSwitch:
            mayaFiles[FILE.DATE][ma] = lo
            mayaFiles[FILE.DATE][ma + 1] = hi
        else:
            break
print mayaFiles[FILE.DATE]
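For reference, the loop above breaks out of the for loop on the first pair that is already in order, so later out-of-order pairs are never reached. A corrected bubble-sort sketch over just the dates list (the parallel path and name lists would still need to be swapped in step):

```python
# bubble sort with a "did we swap?" flag; repeat passes until
# a full pass makes no swaps
dates = [0, 56, 3, 12, 7, 35, 16]
swapped = True
while swapped:
    swapped = False
    for ma in range(len(dates) - 1):
        if dates[ma] > dates[ma + 1]:
            dates[ma], dates[ma + 1] = dates[ma + 1], dates[ma]
            swapped = True
```

The answer below avoids keeping three lists in step entirely by zipping them into tuples first, which is the better approach.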

Assuming these lists are already aligned, you'll have a much easier time by combining the three separate lists into a list of tuples arranged in your sort order. The namedtuple construct in the collections module is great for this sort of thing. I'm assuming you can get your data into three lists: paths, dates, and names. I'm supplying some dummy data here so you can see what I'm assuming.
names = "a.ma", "b.ma", "c.ma", "d.ma"
paths = "c:/test", "c/test", "c:/other", "d:/extra"
dates = "17-01-01", "16-01-01", "17-02-01", "17-06-30"
# this creates a namedtuple, which is a
# mini-class with named fields that otherwise
# works like a tuple
from collections import namedtuple
record = namedtuple("filerecord", "date name path")
# in real use this should be a list comp
# but this is easier to read:
records = []
for date, name, path in zip(dates, names, paths):
    records.append(record(date, name, path))
records.sort(reverse=True)
for item in records:
    print item
# filerecord(date='17-06-30', name='d.ma', path='d:/extra')
# filerecord(date='17-02-01', name='c.ma', path='c:/other')
# filerecord(date='17-01-01', name='a.ma', path='c:/test')
# filerecord(date='16-01-01', name='b.ma', path='c/test')
You could sort on other fields using the 'key' argument to sort():
records.sort(key=lambda k: k.name)
for item in records:
    print item
# filerecord(date='17-01-01', name='a.ma', path='c:/test')
# filerecord(date='16-01-01', name='b.ma', path='c/test')
# filerecord(date='17-02-01', name='c.ma', path='c:/other')
# filerecord(date='17-06-30', name='d.ma', path='d:/extra')
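If lambda is the sticking point, operator.attrgetter builds the same kind of key function from field names alone; a small sketch using the same dummy records:

```python
from collections import namedtuple
from operator import attrgetter

record = namedtuple("filerecord", "date name path")
records = [
    record("17-01-01", "a.ma", "c:/test"),
    record("16-01-01", "b.ma", "c/test"),
    record("17-02-01", "c.ma", "c:/other"),
]
# newest first, no lambda required
records.sort(key=attrgetter("date"), reverse=True)
```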

Related

How to sort array in Python based on particular data set

I am using Python and have a list of directories and files, as shown below:
file_list = []
for file in file_list:
    print(file)
The output is shown below in no particular order.
Testcase/result.log
Testcase/system/system.log
Testcase/data/database.log
Testcase/mem/mem.log
Testcase/cashe/cashe.log
Now, I have another Python string called target_str that can have random values, for example 'mem' or 'database'. If target_str matches part of a file_list entry, the order of the file_list array should change so that the matched value comes first.
For example:
target_str = 'mem'
Testcase/mem/mem.log [Note: this entry moves to the first position since it matches 'target_str']
Testcase/result.log
Testcase/system/system.log
Testcase/data/database.log
Testcase/cashe/cashe.log
Wondering how to sort the file_list entry based on the value of target_str?
You can use the function sorted() and provide it a lambda key. In this case, we'll keep them the same order, but move the ones containing target_str to the top:
file_list = [
    'Testcase/result.log',
    'Testcase/system/system.log',
    'Testcase/data/database.log',
    'Testcase/mem/mem.log',
    'Testcase/cashe/cashe.log',
]
print(
    sorted(file_list, key=lambda f: [target_str not in f, file_list.index(f)])
)
# [
# 'Testcase/mem/mem.log',
# 'Testcase/result.log',
# 'Testcase/system/system.log',
# 'Testcase/data/database.log',
# 'Testcase/cashe/cashe.log'
# ]
If the key function for sorted() returns a list or tuple, then the elements take sort priority in that order. First, sort by target_str not in f (False sorts before True, so if the filename does contain the target string, it'll come first). Then, in case of a tie, sort by the index of the filename.
If you have a very large list of files, then you might want to sort enumerate(file_list) instead, to get the index of each file without having to call .index() every time. .index() is expensive.
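That enumerate variant might look like this (a sketch; the output order is the same as above):

```python
target_str = 'mem'
file_list = [
    'Testcase/result.log',
    'Testcase/system/system.log',
    'Testcase/data/database.log',
    'Testcase/mem/mem.log',
    'Testcase/cashe/cashe.log',
]
# pair each path with its original index up front, so the tie-breaker
# is a precomputed integer rather than a linear .index() scan
ordered = [f for _, f in sorted(
    enumerate(file_list),
    key=lambda pair: (target_str not in pair[1], pair[0]))]
```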
Assuming you just want to move the paths containing target_dir to the top, you could filter the paths that include target_dir and combine them with the rest:
data = [
    'Testcase/result.log',
    'Testcase/system/system.log',
    'Testcase/data/database.log',
    'Testcase/mem/mem.log',
    'Testcase/cashe/cashe.log',
]
target_dir = 'mem'
target_dir = 'mem'
target = [item for item in data if target_dir in item]
rest = [item for item in data if item not in target]
res = [*target, *rest]
print(res)
Sorting would not be a good idea here.
Instead, use an OrderedDict.
from collections import OrderedDict

file_list = ["Testcase/result.log",
             "Testcase/system/system.log",
             "Testcase/data/database.log",
             "Testcase/mem/mem.log",
             "Testcase/cashe/cashe.log"]

# use the <text> part of <text>.log as the key
# (note: str.strip(".log") removes characters, not a suffix,
# so slice the extension off with rsplit instead)
keys = [f.split("/")[-1].rsplit(".", 1)[0] for f in file_list]
d = OrderedDict(zip(keys, file_list))
text = "mem"
d.move_to_end(text, last=False)
print(d)
You can then use list(d.values()) once you complete your process.

How to read and print a list in a specific order/format based on the content in the list for python?

I'm new to Python. For this example list:
lst = ['<name>bob</name>', '<job>doctor</job>', '<gender>male</gender>', '<name>susan</name>', '<job>teacher</job>', '<gender>female</gender>', '<name>john</name>', '<gender>male</gender>']
There are 3 categories: name, job, and gender. I want those 3 categories to be on the same line, which would look like:
<name>bob</name>, <job>doctor</job>, <gender>male</gender>
My actual list is really big, with 10 categories that I want on the same line. I am also trying to figure out a way where, if one of the categories is not in the list, it prints something like N/A to indicate that it is missing.
For example, I would want it to look like:
<name>bob</name>, <job>doctor</job>, <gender>male</gender>
<name>susan</name>, <job>teacher</job>, <gender>female</gender>
<name>john</name>, N/A, <gender>male</gender>
What would be the best way to do this?
This is one way to do it. It handles a list of any length and guarantees grouping, as long as the items are in the correct order.
Updated to build dicts, so you can test for key existence.
lst = ['<name>bob</name>', '<job>doctor</job>', '<gender>male</gender>', '<name>susan</name>', '<job>teacher</job>', '<gender>female</gender>', '<name>john</name>', '<gender>male</gender>']
newlst = []
tmplist = {}
for item in lst:
    value = item.split('>')[1].split('<')[0]
    key = item.split('<')[1].split('>')[0]
    if '<name>' in item:
        if tmplist:
            newlst.append(tmplist)
            tmplist = {}
    tmplist[key] = value

# handle the remaining items left over in the list
if tmplist:
    newlst.append(tmplist)

print(newlst)

# test for existence
for each in newlst:
    print(each.get('job', 'N/A'))
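To get the full one-line-per-record output from the question (including N/A placeholders), you can rebuild the tags from the grouped dicts. A sketch, where the record dicts are made-up examples of what the grouping loop above produces:

```python
categories = ['name', 'job', 'gender']
records = [
    {'name': 'bob', 'job': 'doctor', 'gender': 'male'},
    {'name': 'john', 'gender': 'male'},   # no 'job' key
]
for rec in records:
    # reconstruct <tag>value</tag> for present keys, N/A otherwise
    print(', '.join(
        '<{0}>{1}</{0}>'.format(c, rec[c]) if c in rec else 'N/A'
        for c in categories))
```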

How can I combine separate dictionary outputs from a function in one dictionary?

For our python project we have to solve multiple questions. We are however stuck at this one:
"Write a function that, given a FASTA file name, returns a dictionary with the sequence IDs as keys, and a tuple as value. The value denotes the minimum and maximum molecular weight for the sequence (sequences can be ambiguous)."
import collections
from Bio import Seq
from itertools import product

def ListMW(file_name):
    seq_records = SeqIO.parse(file_name, 'fasta', alphabet=generic_dna)
    for record in seq_records:
        dictionary = Seq.IUPAC.IUPACData.ambiguous_dna_values
        result = []
        for i in product(*[dictionary[j] for j in record]):
            result.append("".join(i))
        molw = []
        for sequence in result:
            molw.append(SeqUtils.molecular_weight(sequence))
        tuple = (min(molw), max(molw))
        if min(molw) == max(molw):
            dict = {record.id: molw}
        else:
            dict = {record.id: (min(molw), max(molw))}
        print(dict)
Using this code we manage to get this output:
{'seq_7009': (6236.9764, 6367.049999999999)}
{'seq_418': (3716.3642000000004, 3796.4124000000006)}
{'seq_9143_unamb': [4631.958999999999]}
{'seq_2888': (5219.3359, 5365.4089)}
{'seq_1101': (4287.7417, 4422.8254)}
{'seq_107': (5825.695099999999, 5972.8073)}
{'seq_6946': (5179.3118, 5364.420900000001)}
{'seq_6162': (5531.503199999999, 5645.577399999999)}
{'seq_504': (4556.920899999999, 4631.959)}
{'seq_3535': (3396.1715999999997, 3446.1969999999997)}
{'seq_4077': (4551.9108, 4754.0073)}
{'seq_1626_unamb': [3724.3894999999998]}
As you can see, this is not one dictionary but multiple dictionaries printed one under another. Is there any way to change our code, or an extra command we can add, to get it in this format:
{'seq_7009': (6236.9764, 6367.049999999999),
'seq_418': (3716.3642000000004, 3796.4124000000006),
'seq_9143_unamb': (4631.958999999999),
'seq_2888': (5219.3359, 5365.4089),
'seq_1101': (4287.7417, 4422.8254),
'seq_107': (5825.695099999999, 5972.8073),
'seq_6946': (5179.3118, 5364.420900000001),
'seq_6162': (5531.503199999999, 5645.577399999999),
'seq_504': (4556.920899999999, 4631.959),
'seq_3535': (3396.1715999999997, 3446.1969999999997),
'seq_4077': (4551.9108, 4754.0073),
'seq_1626_unamb': (3724.3894999999998)}
Or in some way make clear that it should use the seq ID as the key and the molecular weight as the value of one dictionary?
Set up a dictionary right before your for loop, then update it during the loop, such as:
import collections
from Bio import Seq
from itertools import product
# also assumes, as in the question:
#   from Bio import SeqIO, SeqUtils
#   from Bio.Alphabet import generic_dna

def ListMW(file_name):
    seq_records = SeqIO.parse(file_name, 'fasta', alphabet=generic_dna)
    retDict = {}
    for record in seq_records:
        dictionary = Seq.IUPAC.IUPACData.ambiguous_dna_values
        result = []
        for i in product(*[dictionary[j] for j in record]):
            result.append("".join(i))
        molw = []
        for sequence in result:
            molw.append(SeqUtils.molecular_weight(sequence))
        if min(molw) == max(molw):
            retDict[record.id] = molw
        else:
            retDict[record.id] = (min(molw), max(molw))
    # instead of printing inside the loop, return the whole
    # dictionary and print it at the end of your script
    return retDict
Right now, you're creating a new dict on each turn of your loop and printing it, so it's just normal behaviour of your code to print lots and lots of dicts.
You're creating a dictionary with one entry at each iteration.
You want to:
define a dct variable (better than dict, to avoid shadowing the built-in type name) before your loop
rewrite the assignment to the dictionary in the loop
So before the loop:
dct = {}
and in the loop (instead of your if + dict = ... code), a ternary expression with min and max computed only once:
minval = min(molw)
maxval = max(molw)
dct[record.id] = molw if minval == maxval else (minval, maxval)
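The same fix can be seen in isolation with dummy data standing in for the Bio parsing (the IDs and weights below are invented):

```python
def summarize(records):
    # records: iterable of (seq_id, weights) pairs -- a stand-in
    # for the parsed FASTA data in the real function
    dct = {}
    for seq_id, molw in records:
        minval, maxval = min(molw), max(molw)
        # one dict, updated in place, mirroring the answer above
        dct[seq_id] = molw if minval == maxval else (minval, maxval)
    return dct

result = summarize([('seq_a', [1.0, 2.0]), ('seq_b', [3.0, 3.0])])
```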

Python : Find tuples from a list of tuples having duplicate data in the 0th element(of the tuple)

I have a list of tuples containing filenames and filepaths.
I want to find duplicate filenames, i.e. tuples whose filename is the same but whose filepath may be different.
Example of a list of tuples:
file_info = [('foo1.txt','/home/fold1'), ('foo2.txt','/home/fold2'), ('foo1.txt','/home/fold3')]
I want to find the duplicate filename, i.e. file_info[2] (in the above case), print it, and delete it.
I could check iteratively, like:
count = 0
for (filename, filepath) in file_info:
    count = count + 1
    for (filename1, filepath1) in file_info[count:]:
        if filename == filename1:
            print filename1, filepath1
            file_info.remove((filename1, filepath1))
But is there a more efficient/shorter/more Pythonic way of accomplishing the same task?
Thank you.
Using a set lets you avoid a double loop; add items you haven't seen yet to a new list, to avoid altering the list you are looping over (which would lead to skipped items):
seen = set()
keep = []
for filename, filepath in file_info:
    if filename in seen:
        print filename, filepath
    else:
        seen.add(filename)
        keep.append((filename, filepath))
file_info = keep
If order doesn't matter and you don't have to print the items you removed, then another approach is to use a dictionary:
file_info = dict(reversed(file_info)).items()
Reversing the input list assures that the first entry is kept rather than the last.
If you needed all the full paths for files with duplicates, I'd build a dictionary with lists as values, then remove anything that has only one element:
filename_to_paths = {}
for filename, filepath in file_info:
    filename_to_paths.setdefault(filename, []).append(filepath)

duplicates = {filename: paths for filename, paths in filename_to_paths.iteritems() if len(paths) > 1}
The duplicates dictionary now only contains filenames where you have more than 1 path in the file_info list.
You can build a mapping with defaultdict and use it to see which paths hold the same filename (the filename is the key). A mapping's keys already form a set, and it lets you keep the files for printing and deleting (they will be the mapping's values):
from collections import defaultdict

file_info = [
    ('foo1.txt', '/home/fold1'),
    ('foo2.txt', '/home/fold2'),
    ('foo1.txt', '/home/fold3')]

# create a mapping that defaults to an empty list
path_by_file_name = defaultdict(list)

# populate the mapping
for name, path in file_info:
    path_by_file_name[name].append(path)

# find duplicates (lists with more than one item)
duplicates = filter(lambda kv: len(kv[1]) > 1, path_by_file_name.items())
print duplicates  # [('foo1.txt', ['/home/fold1', '/home/fold3'])]
This is basically as fast as the set solution, but keeps more state (all the paths mapped to each filename), which can come in handy later. If you have millions of files, that could become a problem, but it probably won't.
def get_unique(unique_index, initial_data):
    seen = set()
    for item in initial_data:
        if item[unique_index] not in seen:
            seen.add(item[unique_index])
            yield item

print list(get_unique(0, my_list_of_file_info))
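If only the names of the duplicated files matter, collections.Counter does the counting in one pass; a sketch:

```python
from collections import Counter

file_info = [('foo1.txt', '/home/fold1'),
             ('foo2.txt', '/home/fold2'),
             ('foo1.txt', '/home/fold3')]
# count occurrences of each filename, then keep names seen more than once
counts = Counter(name for name, _ in file_info)
dup_names = {name for name, n in counts.items() if n > 1}
```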

Setting up an iterative item query in Python

I am trying to set up a function that will query an item for its sub-components if those exist and return those, else return the item itself.
Imagine an object that can contain more objects within it. To access those objects I would call object.GetSubComponentIds(); if the object contains sub-objects, it returns a list of them, or an empty list if there are none. If there are sub-objects, I want to keep going: for each sub-object I check SubObject.GetSubComponentIds() for further sub-objects. When an object contains nothing further, I want to return it while maintaining the nested structure of the objects it came from.
object1(contains 3 sub objects)
object2(contains 3 sub object and each sub object contains one more sub object)
object3(does not contain sub objects)
inputlist = [object1, object2, object3]
outputlist = [[obj1sub1, obj1sub2, obj1sub3],[[obj2sub1sub1],[obj2sub2sub1],[obj2sub3sub1]],[obj3]]
I am interested in maintaining the nested list structure so that I can always trace back the origin of a sub-object. Again, the method to get a sub-object list is object.GetSubComponentIds(), and it returns either a list or an empty list.
Can anyone help me set up an iterative function to retrieve them? Keep in mind that I do not know whether there are any sub-objects contained within an object, or how many levels deep they go. Basically, if it returns a list, I need to check every item on that list for more sub-objects.
Thank you in advance
Here's my humble first try:
#unwrap all elements to use with API
elements = []
for i in IN[0]:
    elements.append(UnwrapElement(i))

#create element set from python list
elementSet = Autodesk.Revit.DB.ElementSet()
for i in elements:
    elementSet.Insert(i)

#convert element set to List[Element]
setForCheck = List[Autodesk.Revit.DB.Element]()
elemIter = elementSet.ForwardIterator()
elemIter.Reset()
while elemIter.MoveNext():
    curElem = elemIter.Current
    setForCheck.Add(curElem)

#iterate through all elements to extract nested elements
setLoop = List[Autodesk.Revit.DB.Element]()
elemSet = List[Autodesk.Revit.DB.Element]()
itemOut = []
counter = 0

while setForCheck.Count >= 1:
    setLoop.Clear()
    for i in setForCheck:
        itemOut.append(i)
        if i.GetSubComponentIds().Count >= 1:
            elem = Autodesk.Revit.DB.ElementSet()
            for j in i.GetSubComponentIds():
                elem.Insert(doc.GetElement(j))
            elemIterA = elem.ForwardIterator()
            elemIterA.Reset()
            while elemIterA.MoveNext():
                curElemA = elemIterA.Current
                setLoop.Add(curElemA)
    setForCheck.Clear()
    elemIterB = setLoop.GetEnumerator()
    elemIterB.Reset()
    while elemIterB.MoveNext():
        curElemB = elemIterB.Current
        setForCheck.Add(curElemB)
    counter += 1
    if counter > 1000:
        break

#Assign your output to the OUT variable
OUT = itemOut
You're using some specific libraries, like Autodesk, that I'm not familiar with. Let me answer your question in terms of an abstract example.
Suppose we're dealing with Thing objects, where Thing is defined as:
class Thing(object):
    def __init__(self, name):
        self.name = name
        self.inside = []
We can make Things and put other things inside of them. The example you give in your post can be written:
ob1 = Thing("ob1")
ob1.inside.extend([Thing("ob1sub1"), Thing("ob1sub2"), Thing("ob1sub3")])

ob2 = Thing("ob2")
for i in xrange(1, 4):
    name = "ob2sub{}".format(i)
    thing = Thing(name)
    thing.inside.append(Thing(name + "sub1"))
    ob2.inside.append(thing)

ob3 = Thing("ob3")

things = [ob1, ob2, ob3]
This makes a sort of tree. Now we'd like to return a nested list of all of the leaf nodes in the tree:
def search_things(things):
    names = []
    for thing in things:
        if not thing.inside:
            names.append(thing.name)
        else:
            names.append(search_things(thing.inside))
    return names
A test:
>>> search_things(things)
[['ob1sub1', 'ob1sub2', 'ob1sub3'],
[['ob2sub1sub1'], ['ob2sub2sub1'], ['ob2sub3sub1']],
'ob3']
I'll let you transform this to your specific problem, but this is the general idea. Note that the algorithm is recursive, not iterative. You said you wanted an iterative algorithm, and the above can be written iteratively, but this gives you the idea.
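For completeness, here is one way the same search can be written iteratively, with an explicit stack instead of recursion (the Thing class is repeated so the sketch is self-contained; it produces the same nested structure):

```python
class Thing(object):
    def __init__(self, name):
        self.name = name
        self.inside = []

def search_things_iter(things):
    result = []
    # each stack entry is (iterator over one level's things,
    # the output list that level appends to)
    stack = [(iter(things), result)]
    while stack:
        it, out = stack[-1]
        thing = next(it, None)
        if thing is None:
            stack.pop()             # this level is exhausted
        elif not thing.inside:
            out.append(thing.name)  # leaf: record its name
        else:
            sub = []                # container: descend into it
            out.append(sub)
            stack.append((iter(thing.inside), sub))
    return result

# same demo data as the recursive version
ob1 = Thing("ob1")
ob1.inside.extend([Thing("ob1sub1"), Thing("ob1sub2"), Thing("ob1sub3")])
ob2 = Thing("ob2")
for i in range(1, 4):
    name = "ob2sub{}".format(i)
    thing = Thing(name)
    thing.inside.append(Thing(name + "sub1"))
    ob2.inside.append(thing)
ob3 = Thing("ob3")

nested = search_things_iter([ob1, ob2, ob3])
```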
