Let's say I have such a list:
['word_4_0_w_7',
'word_4_0_w_6',
'word_3_0_w_10',
'word_3_0_w_2']
and I want to sort them by the number that comes after "word", and then by the number after "w".
It will look like this:
['word_3_0_w_2',
'word_3_0_w_10',
'word_4_0_w_6',
'word_4_0_w_7']
What comes to mind is to create a bunch of lists, stuff each with the strings that share the number after "word" (sorted by the number after "w"), and then merge them.
Is there a cleverer way to do this in Python?
Use Python's key functionality, in conjunction with other answers:
def mykey(value):
    ls = value.split("_")
    return int(ls[1]), int(ls[-1])
newlist = sorted(firstlist, key=mykey)
## or, if you want it in place:
firstlist.sort(key=mykey)
Python will be more efficient with key than with cmp: the key function is called once per element, while a cmp function is called once per comparison. (In Python 3, the cmp parameter was removed entirely.)
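For comparison, here is the same ordering expressed with a cmp function (a sketch; cmp and the cmp= argument exist only in Python 2, where key still wins on efficiency):

def mycmp(a, b):
    # cmp() returns -1, 0, or 1; delegate to the same key extraction
    return cmp(mykey(a), mykey(b))

firstlist.sort(cmp=mycmp)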
You can provide a function to the sort() method of list objects:
l = ['word_4_0_w_7',
'word_4_0_w_6',
'word_3_0_w_10',
'word_3_0_w_2']
def my_key_func(x):
    xx = x.split("_")
    return (int(xx[1]), int(xx[-1]))
l.sort(key=my_key_func)
Output:
print l
['word_3_0_w_2', 'word_3_0_w_10', 'word_4_0_w_6', 'word_4_0_w_7']
edit: changed the code according to a comment by @dwanderson; more info on key functions can be found in the Python documentation on sorting.
You can use a function to extract the relevant parts of your string and then use those parts to sort:
a = ['word_4_0_w_7', 'word_4_0_w_6', 'word_3_0_w_10', 'word_3_0_w_2']
def sort_func(x):
    parts = x.split('_')
    sort_key = parts[1] + parts[2] + "%02d" % int(parts[4])
    return sort_key

a_sorted = sorted(a, key=sort_func)
The expression "%02d" % int(x.split('_')[4]) adds a leading zero in front of the second number; otherwise "10" would sort before "2" in a plain string comparison. You may have to do the same with the number extracted by x.split('_')[2].
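If the separators ever vary, a regular expression can make the key extraction more robust. A minimal sketch, assuming the numbers to sort by are the first and last integers appearing in each string:

import re

def num_key(s):
    # collect every integer in the string; compare first and last numerically
    nums = [int(n) for n in re.findall(r'\d+', s)]
    return (nums[0], nums[-1])

a_sorted = sorted(a, key=num_key)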
My program, written in Python 3, has many places where it starts with a (very large) table-like numeric data structure and adds columns to it following a certain algorithm. (The algorithm is different in every place.)
I am trying to convert this to a purely functional approach, since I keep running into problems with the imperative one (hard to reuse, hard to memoize interim steps, hard to achieve "lazy" computation, bug-prone due to reliance on state, etc.).
The Table class is implemented as a dictionary of dictionaries: the outer dictionary contains rows, indexed by row_id; the inner contains values within a row, indexed by column_title. The table's methods are very simple:
# return the value at the specified row_id, column_title
get_value(self, row_id, column_title)
# return the inner dictionary representing row given by row_id
get_row(self, row_id)
# add a column new_column_title, defined by func
# func signature must be: take a row and return a value
add_column(self, new_column_title, func)
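For reference, a minimal sketch of the current Table under this layout (dictionary of dictionaries, methods as above):

class Table(object):
    def __init__(self, rows):
        self.rows = rows  # {row_id: {column_title: value}}

    def get_value(self, row_id, column_title):
        return self.rows[row_id][column_title]

    def get_row(self, row_id):
        return self.rows[row_id]

    def add_column(self, new_column_title, func):
        # func takes a row (inner dict) and returns a value
        for row in self.rows.values():
            row[new_column_title] = func(row)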
Until now, I simply added columns to the original table, and each function took the whole table as an argument. As I'm moving to pure functions, I'll have to make all arguments immutable. So, the initial table becomes immutable. Any additional columns will be created as standalone columns and passed only to those functions that need them. A typical function would take the initial table, and a few columns that are already created, and return a new column.
The problem I run into is how to implement the standalone column (Column)?
I could make each of them a dictionary, but it seems very expensive. Indeed, if I ever need to perform an operation on, say, 10 fields in each logical row, I'll need to do 10 dictionary lookups. And on top of that, each column will contain both the key and the value, doubling its size.
I could make Column a simple list, and store in it a reference to the mapping from row_id to array index. The benefit is that this mapping could be shared between all columns that correspond to the same initial table, and once a row_id is looked up, the resulting index works for every column. But does this create any other problems?
If I do this, can I go further, and actually store the mapping inside the initial table itself? And can I place references from the Column objects back to the initial table from which they were created? It seems very different from how I imagined a functional approach to work, but I cannot see what problems it would cause, since everything is immutable.
In general does functional approach frown on keeping a reference in the return value to one of the arguments? It doesn't seem like it would break anything (like optimization or lazy evaluation), since the argument was already known anyway. But maybe I'm missing something.
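For concreteness, here is a minimal sketch of the list-backed Column I have in mind (names are hypothetical; index_of is the shared row_id -> position mapping):

class Column(object):
    def __init__(self, index_of, values):
        self.index_of = index_of  # shared {row_id: position} mapping
        self.values = values      # one value per logical row

    def get(self, row_id):
        return self.values[self.index_of[row_id]]

# a typical derived-column function: takes an existing column and
# returns a new Column sharing the same mapping
def scaled(col, factor):
    return Column(col.index_of, [v * factor for v in col.values])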
Here is how I would do it:
Derive your table class from a frozenset.
Each row should be a subclass of tuple.

Now you can't modify the table -> immutability, great! The next step could be to consider each function a mutation which you apply to the table to produce a new one:

f T -> T'

That should be read as "apply the function f on the table T to produce a new table T'". You may also try to objectify the actual processing of the table data and see it as an Action which you apply or add to the table:

add(T, A) -> T'

The great thing here is that add could be subtract instead, giving you an easy way to model undo. When you get into this mindset, your code becomes very easy to reason about because you have no state that can screw things up.

Below is an example of how one could implement and process a table structure in a purely functional way in Python. Imho, Python is not the best language to learn FP in, because it makes it too easy to program imperatively. Haskell, F# or Erlang are better choices, I think.
class Table(frozenset):
    def __new__(cls, names, rows):
        return frozenset.__new__(cls, rows)

    def __init__(self, names, rows):
        frozenset.__init__(self, rows)
        self.names = names

def add_column(rows, func):
    return [row + (func(row, idx),) for (idx, row) in enumerate(rows)]

def table_process(t, (name, func)):
    return Table(
        t.names + (name,),
        add_column(t, lambda row, idx: func(row))
    )

def table_filter(t, (name, func)):
    names = t.names
    idx = names.index(name)
    return Table(
        names,
        [row for row in t if func(row[idx])]
    )

def table_rank(t, name):
    names = t.names
    idx = names.index(name)
    rows = sorted(t, key = lambda row: row[idx])
    return Table(
        names + ('rank',),
        add_column(rows, lambda row, idx: idx)
    )

def table_print(t):
    format_row = lambda r: ' '.join('%15s' % c for c in r)
    print format_row(t.names)
    print '\n'.join(format_row(row) for row in t)

if __name__ == '__main__':
    from random import randint
    cols = ('c1', 'c2', 'c3')
    T = Table(
        cols,
        [tuple(randint(0, 9) for x in cols) for x in range(10)]
    )
    table_print(T)

    # Columns to add to the table; this is a perfect fit for a
    # reduce. I'd honestly use a boring for loop instead, but reduce
    # is a perfect example of how in FP data and code "become one."
    # In fact, this whole program could have been written as just one
    # big reduce.
    actions = [
        ('max', max),
        ('min', min),
        ('sum', sum),
        ('avg', lambda r: sum(r) / float(len(r)))
    ]
    T = reduce(table_process, actions, T)
    table_print(T)

    # Ranking is different because it requires an ordering, which a
    # table does not have.
    T2 = table_rank(T, 'sum')
    table_print(T2)

    # Simple where filter: select * from T2 where c2 < 5.
    T3 = table_filter(T2, ('c2', lambda c: c < 5))
    table_print(T3)
I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command','f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','Scheduled Task','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
    if res[0] in unwanted_resource_types:
        result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
    for type in unwanted_resource_types:
        if res[0] == type:
            result.append(res[1])
also to no avail. Is there something I'm missing? I believe this would be the right place for a list comprehension, but that concept is still a grey area for me (the Python docs are a bit too succinct on it in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.
Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]
The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
The problem is that your triples contain strings and your unwanted resources contain numbers, change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ...), you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue, else it's unimportant).
The one-liner:
result = map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types)
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.
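For readability, the same result can be produced in two explicit steps (a sketch equivalent to the one-liner):

# build an {id: name} lookup once, then select the unwanted names in order
id_to_name = dict((int(i), name) for (i, name, prefix) in resource_types)
result = [id_to_name[i] for i in unwanted_resource_types]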
I want to compare a large set of data in the form of 2 dictionaries of varying lengths.
(edit)
post = {0: [0.96180319786071777, 0.37529754638671875],
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881], ...}

pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122], ...}
The answer we need to get is 5065: [0.80861318111419678, 0.76381617784500122]. This is based on the fact that we are only comparing the values and not the indices at all.
I am using this key value pair only to remember the sequence of data. The data type can be replaced with a list/set if need be. I need to find out the key:value (index and value) pairs of the elements that are not in common to the dictionaries.
The code that I am using is very simple:
new = {}
found = []
for i in range(0, len(post)):
    found = []
    for j in range(0, len(pre)):
        if post[i] not in pre.values():
            if post[i] not in new:
                new[i] = post[i]
            found.append(j)
            break
    if found:
        for f in found: pre.pop(f)
The dictionary new contains the elements I need.
The problem I am facing is that this process is too slow: it sometimes takes over an hour, and the data can be much larger at times. I need it to be faster.
Is there an efficient way of doing what I am trying to achieve? I would prefer not to depend on external packages beyond those bundled with Python 2.5 (64 bit) unless absolutely necessary.
Thank you all.
This is basically what sets are designed for (computing differences in sets of items). The only gotcha is that the things you put into a set need to be hashable, and lists aren't. However, tuples are, so if you convert to that, you can put those into a set:
post_set = set(tuple(x) for x in post.itervalues())
pre_set = set(tuple(x) for x in pre.itervalues())
items_in_only_one_set = post_set ^ pre_set
For more about sets: http://docs.python.org/library/stdtypes.html#set
To get the original indices after you've computed the differences, what you'd probably want is to generate reverse lookup tables:
post_indices = dict((tuple(v),k) for k,v in post.iteritems())
pre_indices = dict((tuple(v),k) for k,v in pre.iteritems())
Then you can just take a given tuple and look up its index via the dictionaries:
index = post_indices.get(a_tuple, pre_indices.get(a_tuple))
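Putting it together, a sketch that uses the names above to rebuild the index: value pairs the question asks for:

result = {}
for t in items_in_only_one_set:
    # each tuple exists in exactly one of the two dicts, so one lookup succeeds
    idx = post_indices.get(t, pre_indices.get(t))
    result[idx] = list(t)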
Your problem is likely the nested for loops combined with the use of range(), which creates a new list each time and can be slow. You will probably get some automatic speedups by iterating over pre and post directly, and by avoiding the nested loops.
post = {0: [0.96180319786071777, 0.37529754638671875],
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881]}

pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122]}

# Create sets of values, independent of dict key, for O(1) lookup
post_set = set(map(tuple, post.values()))
pre_set = set(map(tuple, pre.values()))

# Iterate through each structure only once, filtering out items that are
# found in the sets we created earlier, updating new_diff
from itertools import ifilterfalse
new_diff = dict(ifilterfalse(lambda x: tuple(x[1]) in pre_set, post.items()))
new_diff.update(ifilterfalse(lambda x: tuple(x[1]) in post_set, pre.items()))
new_diff is now a dict in which each value is found in only one of post and pre, with the original index preserved.
>>> print new_diff
{5065: [0.80861318111419678, 0.76381617784500122]}