Python: reduce list but keep details

Say I have a list of items, some of which are similar up to a point
but then differ by a number after a dot:
['abc.1',
'abc.2',
'abc.3',
'abc.7',
'xyz.1',
'xyz.3',
'xyz.11',
'ghj.1',
'thj.1']
I want to produce from this list a new list which collapses the multiples but preserves some of their data, namely the number suffixes,
so the above list should produce a new list:
[('abc',('1','2','3','7')),
('xyz',('1','3','11')),
('ghj',('1',)),
('thj',('1',))]
What I have thought of is that the first list can be split by the dot into pairs,
but then how do I group the pairs by the first part without losing the second?
I'm sorry if this question is noobish, and thanks in advance.
...
Wow, I didn't expect so many great answers so fast, thanks.

from collections import defaultdict
d = defaultdict(list)
for el in elements:  # elements is the input list of 'prefix.number' strings
    key, nr = el.split(".")
    d[key].append(nr)
# convert the dict back to a list of (key, numbers) pairs
newlist = list(d.items())

Map the list with a separator function, use itertools.groupby with a key that takes the first element, and collect the second element into the result.
from itertools import groupby, imap  # Python 2; imap was removed in Python 3
list1 = ["abc.1", "abc.2", "abc.3", "abc.7", "xyz.1", "xyz.3", "xyz.11", "ghj.1", "thj.1"]
def break_up(s):
    a, b = s.split(".")
    return a, int(b)
def prefix(broken_up): return broken_up[0]
def suffix(broken_up): return broken_up[1]
result = []
for key, sub in groupby(imap(break_up, list1), prefix):
    result.append((key, tuple(imap(suffix, sub))))
print result
Output:
[('abc', (1, 2, 3, 7)), ('xyz', (1, 3, 11)), ('ghj', (1,)), ('thj', (1,))]
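One caveat with the groupby approach: itertools.groupby only merges adjacent items, so it relies on equal prefixes already sitting next to each other, as they do in the sample list. A minimal Python 3 sketch for unordered input follows. The names items and split_item are made up for illustration, and the suffixes are kept as strings as in the question:

```python
from itertools import groupby

# Hypothetical input where equal prefixes are NOT adjacent.
items = ["abc.1", "xyz.1", "abc.2", "ghj.1", "xyz.3", "abc.7"]

def split_item(s):
    """Split 'prefix.number' into (prefix, number-as-string)."""
    prefix, _, suffix = s.partition(".")
    return prefix, suffix

# Sort by prefix only; sorted() is stable, so suffixes keep their input order.
pairs = sorted(map(split_item, items), key=lambda p: p[0])

result = [
    (prefix, tuple(suffix for _, suffix in group))
    for prefix, group in groupby(pairs, key=lambda p: p[0])
]
print(result)
```

Sorting costs O(n log n); if the original order of the prefixes matters, the defaultdict approach avoids the sort.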

Related

Assistance with Python 'sort(key=None)'

I'm having a hard time understanding why my function is not returning the reversed version of my list. I've spent a long time trying to understand why and I've hit a wall: it only returns my list in ascending order.
letters = 'abcdefghijk'
numbers = '123456'
dict1 = {}
def reverseOrder(listing):
    lst2 = []
    lst2.append(listing)
    lst2.sort(reverse=True)
    return lst2
for l, n in zip(letters, numbers):
    dict1.update({l:n})
lst1 = list(dict1) + list(dict1.values())
lst1.sort(key=reverseOrder)
print(lst1)
The key function passed to list.sort has a very specific purpose:
key specifies a function of one argument that is used to extract a comparison key from each list element (for example, key=str.lower). The key corresponding to each item in the list is calculated once and then used for the entire sorting process. The default value of None means that list items are sorted directly without calculating a separate key value.
So the function is supposed to take in a single list element, and then return a key that determines its sorting compared to the other elements.
For example, if you wanted to sort a list by the length of its items, you could do it like this:
def lengthOfItem(item):
    return len(item)
lst.sort(key=lengthOfItem)
Since the function only takes a single item, it is unsuitable for orderings where you actually need to compare two elements with each other to decide which comes first. Such comparison-based sorts are less efficient, so avoid them where a key will do.
In your case, it seems like you want to reverse your list. In that case you can just use list.reverse(), or lst.sort(reverse=True) if you want the elements sorted in descending order.
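To make the difference between a key function, reverse=True and list.reverse() concrete, here is a small sketch. The list words is a made-up example, not the asker's data:

```python
words = ["pear", "fig", "banana", "kiwi"]  # hypothetical sample data

# key receives ONE element and returns the value to sort by.
by_length = sorted(words, key=len)

# reverse=True sorts in descending order.
descending = sorted(words, reverse=True)

# list.reverse() only flips the current order; it does not sort.
flipped = list(words)
flipped.reverse()

print(by_length)   # ['fig', 'pear', 'kiwi', 'banana']
print(descending)  # ['pear', 'kiwi', 'fig', 'banana']
print(flipped)     # ['kiwi', 'banana', 'fig', 'pear']
```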
You are using the sort function in an invalid way.
Here is the definition of the sort function (from builtins.py):
def sort(self, key=None, reverse=False): # real signature unknown; restored from __doc__
    """ L.sort(key=None, reverse=False) -> None -- stable sort *IN PLACE* """
    pass
The key argument is needed when there is 'ambiguity' about how items have to be sorted, e.g. when the items are tuples, dictionaries, etc.
Example:
lst = [(1, 2), (2, 1)]
lst.sort(key=lambda x: x[0]) # lst = [(1, 2), (2, 1)]
lst.sort(key=lambda x: x[1]) # lst = [(2, 1), (1, 2)]
I'm not quite sure what you want with this part though:
for l, n in zip(letters, numbers):
    dict1.update({l:n})
lst1 = list(dict1) + list(dict1.values())
It seems like you want a list of all the numbers and letters, but you are doing it in an odd way.
Edit: I have updated the answer.

Comparing a 3-tuple to a list of 3-tuples using only the first two parts of the tuple

I have a list of 3-tuples in a Python program that I'm building while looking through a file (so one at a time), with the following setup:
(feature,combination,durationOfTheCombination),
such that if a unique combination of feature and combination is found, it will be added to the list. The list itself holds a similar setup, but durationOfTheCombination is the sum of all durations that share the unique combination of (feature,combination). Therefore, when deciding whether a tuple should be added to the list, I need to compare only the first two parts of the tuple, and if a match is found, the duration is added to the corresponding list item.
Here's an example for clarity. If the input is
(ABC,123,10);(ABC,123,10);(DEF,123,5);(ABC,123,30);(EFG,456,30)
The output will be (ABC,123,50);(DEF,123,5);(EFG,456,30).
Is there any way to do this comparison?
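You can do this with Counter, provided the repeated entries are identical (the duration is multiplied by the number of occurrences):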
In [42]: from collections import Counter
In [43]: lst = [('ABC',123,10),('ABC',123,10),('DEF',123,5)]
In [44]: [(i[0],i[1],i[2]*j) for i,j in Counter(lst).items()]
Out[44]: [('DEF', 123, 5), ('ABC', 123, 20)]
As the OP points out, if entries for the same key have different durations, use groupby from itertools instead:
In [26]: lst = [('ABC',123,10),('ABC',123,10),('ABC',123,25),('DEF',123,5)]
In [27]: [tuple(list(n)+[sum([i[2] for i in g])]) for n,g in groupby(sorted(lst,key = lambda x:x[:2]), key = lambda x:x[:2])]
Out[27]: [('ABC', 123, 45), ('DEF', 123, 5)]
If you don't want to use Counter, you can use a dict instead.
setOf3Tuples = dict()
def add3TupleToSet(a):
    key = a[0:2]
    if key in setOf3Tuples:
        setOf3Tuples[key] += a[2]
    else:
        setOf3Tuples[key] = a[2]
def getRaw3Tuple():
    for k in setOf3Tuples:
        yield k + (setOf3Tuples[k],)
if __name__ == "__main__":
    add3TupleToSet(("ABC",123,10))
    add3TupleToSet(("ABC",123,10))
    add3TupleToSet(("DEF",123,5))
    print([i for i in getRaw3Tuple()])
It seems a dict is more suited than a list here, with the first 2 fields as key. And to avoid checking each time if the key is already here you can use a defaultdict.
from collections import defaultdict
d = defaultdict(int)
for t in your_list:
    d[t[:2]] += t[-1]
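As a sketch of how the defaultdict approach plays out on the example input from the question, including the step back to a list of 3-tuples (the names records, totals and summed are illustrative only):

```python
from collections import defaultdict

records = [("ABC", 123, 10), ("ABC", 123, 10), ("DEF", 123, 5),
           ("ABC", 123, 30), ("EFG", 456, 30)]

totals = defaultdict(int)  # missing keys start at 0
for feature, combination, duration in records:
    totals[(feature, combination)] += duration

# Flatten back to 3-tuples; dicts keep insertion order on Python 3.7+.
summed = [key + (total,) for key, total in totals.items()]
print(summed)  # [('ABC', 123, 50), ('DEF', 123, 5), ('EFG', 456, 30)]
```

Because the dictionary is keyed on the first two fields, each incoming tuple is matched in constant time on average instead of scanning the list.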
Assuming your input is collected in a list as below, you can use pandas groupby to accomplish this quickly:
import pandas as pd
input = [('ABC',123,10),('ABC',123,10),('DEF',123,5),('ABC',123,30),('EFG',456,30)]
output = [tuple(x) for x in pd.DataFrame(input).groupby([0,1])[2].sum().reset_index().values]

Creating a list by iterating over a dictionary

I defined a dictionary like this (list is a list of integers):
my_dictionary = {'list_name' : list, 'another_list_name': another_list}
Now, I want to create a new list by iterating over this dictionary. In the end, I want it to look like this:
my_list = [list_name_list_item1, list_name_list_item2,
list_name_list_item3, another_list_name_another_list_item1]
And so on.
So my question is: How can I realize this?
I tried
for key in my_dictionary.keys():
    k = my_dictionary[key]
    for value in my_dictionary.values():
        v = my_dictionary[value]
        v = str(v)
        my_list.append(k + '_' + v)
But instead of the desired output I receive a TypeError (unhashable type: 'list') in line 4 of this example.
You're trying to get a dictionary item by its value, whereas you already have your value.
Do it in one line using a list comprehension:
my_dictionary = {'list_name' : [1,4,5], 'another_list_name': [6,7,8]}
my_list = [k+"_"+str(v) for k,lv in my_dictionary.items() for v in lv]
print(my_list)
result:
['another_list_name_6', 'another_list_name_7', 'another_list_name_8', 'list_name_1', 'list_name_4', 'list_name_5']
Note that since the order in your dictionary is not guaranteed (before Python 3.7), the order of the list isn't either. You could fix the order by sorting the items according to keys:
my_list = [k+"_"+str(v) for k,lv in sorted(my_dictionary.items()) for v in lv]
Try this:
my_list = []
for key in my_dictionary:
    for item in my_dictionary[key]:
        my_list.append(str(key) + '_' + str(item))
Hope this helps.
Your immediate problem is that dict.values() iterates over the values in the dictionary, not the keys, so when you attempt to do a lookup on line 4, it fails (in this case) because the values in the dictionary can't be used as keys. In another case, say {1:2, 3:4}, it would fail with a KeyError, and {1:2, 2:1} would not raise an error, but would likely give confusing behaviour.
As for your actual question, lists do not attribute any names to data, like dictionaries do; they simply store items by position.
def f():
    a = 1
    b = 2
    c = 3
    l = [a, b, c]
    return l
Calling f() will return [1, 2, 3], with any concept of a, b, and c being lost entirely.
If you want to simply concatenate the lists in your dictionary, making a copy of the first, then calling .extend() on it will suffice:
my_list = my_dictionary['list_name'][:]
my_list.extend(my_dictionary['another_list_name'])
If you're looking to keep the order of the lists' items, while still referring to them by name, look into the OrderedDict class in collections.
You've written an outer loop over keys, then an inner loop over values, and tried to use each value as a key, which is where the program failed. Simply use the dictionary's items method to iterate over key,value pairs instead:
["{}_{}".format(k,v) for k,v in d.items()]
Oops, failed to parse the format desired; we were to produce each item in the inner list. Not to worry...
import itertools
d={1:[1,2,3],2:[4,5,6]}
list(itertools.chain(*(
    ["{}_{}".format(k,i) for i in l]
    for (k,l) in d.items() )))
This is a little more complex. We again take key,value pairs from the dictionary, then make an inner loop over the list that was the value and format those into strings. This produces inner sequences, so we flatten it using chain and *, and finally save the result as one list.
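Edit: Turns out Python 3.4.3 gives surprising results when both levels are generator expressions. chain(*...) consumes the whole outer generator before any inner one runs, and an inner generator expression looks up k only when it is consumed, so every inner one sees the last key. I had to turn the inner one into a list to get the formatting done while k was still current.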
Edit again: As someone posted in a since deleted answer (which confuses me), I'm overcomplicating things. You can do the flattened nesting in a chained comprehension:
["{}_{}".format(k,v) for k,l in d.items() for v in l]
That method was also posted by Jean-François Fabre.
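If a lazy, chain-based version is still wanted, one option is itertools.chain.from_iterable, which does not unpack the outer generator up front. This is a minimal sketch rather than a recommendation over the plain comprehension. The variable names flat and values are illustrative:

```python
from itertools import chain

d = {1: [1, 2, 3], 2: [4, 5, 6]}

# from_iterable pulls one inner generator at a time and exhausts it
# before advancing the outer loop, so k is still current when it is read.
flat = list(chain.from_iterable(
    ("{}_{}".format(k, i) for i in values)
    for k, values in d.items()
))
print(flat)  # ['1_1', '1_2', '1_3', '2_4', '2_5', '2_6'] on Python 3.7+
```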
Use list comprehensions like this
d = {"test1":[1,2,3,],"test2":[4,5,6],"test3":[7,8,9]}
new_list = [str(item[0])+'_'+str(v) for item in d.items() for v in item[1]]
Output:
new_list:
['test1_1',
'test1_2',
'test1_3',
'test3_7',
'test3_8',
'test3_9',
'test2_4',
'test2_5',
'test2_6']
Let's initialize our data
In [1]: l0 = [1, 2, 3, 4]
In [2]: l1 = [10, 20, 30, 40]
In [3]: d = {'name0': l0, 'name1': l1}
Note that in my example, unlike yours, the lists' contents are not strings; lists are heterogeneous containers, after all.
That said, you cannot simply join the keys and the lists' items; you'd better convert these values to strings using the str(...) builtin.
Now comes the solution to your problem. I use a list comprehension
with two loops: the outer loop comes first and runs over the items (i.e., key-value pairs) in the dictionary; the inner loop comes second and runs over the items in the corresponding list.
In [4]: res = ['_'.join((str(k), str(i))) for k, l in d.items() for i in l]
In [5]: print(res)
['name0_1', 'name0_2', 'name0_3', 'name0_4', 'name1_10', 'name1_20', 'name1_30', 'name1_40']
In [6]:
In your case, using str(k)+'_'+str(i) would be fine as well, but the current idiom for joining strings with a fixed 'text' is the 'text'.join(...) method. Note that .join takes a SINGLE argument, an iterable, and hence in the list comprehension I used join((..., ...))
to collect the joinands in a single argument.

Indexing a list with a unique index

I have a list say l = [10,10,20,15,10,20]. I want to assign each unique value a certain "index" to get [1,1,2,3,1,2].
This is my code:
a = list(set(l))
res = [a.index(x) for x in l]
Which turns out to be very slow.
l has 1M elements, and 100K unique elements. I have also tried map with lambda and sorting, which did not help. What is the ideal way to do this?
You can do this in O(N) time using a defaultdict and a list comprehension:
>>> from itertools import count
>>> from collections import defaultdict
>>> lst = [10, 10, 20, 15, 10, 20]
>>> d = defaultdict(count(1).next)
>>> [d[k] for k in lst]
[1, 1, 2, 3, 1, 2]
In Python 3 use __next__ instead of next.
If you're wondering how it works:
The default_factory (i.e. count(1).next in this case) passed to defaultdict is called only when Python encounters a missing key. So for the first 10 the value is going to be 1; for the next 10 the key is no longer missing, hence the previously calculated 1 is used. 20 is again a missing key, so Python calls the default_factory again to get its value, and so on.
d at the end will look like this:
>>> d
defaultdict(<method-wrapper 'next' of itertools.count object at 0x1057c83b0>,
{10: 1, 20: 2, 15: 3})
The slowness of your code arises because a.index(x) performs a linear search and you perform that linear search for each of the elements in l. So for each of the 1M items you perform (up to) 100K comparisons.
The fastest way to transform one value to another is looking it up in a map. You'll need to create the map and fill in the relationship between the original values and the values you want. Then retrieve the value from the map when you encounter another of the same value in your list.
Here is an example that makes a single pass through l. There may be room for further optimization to eliminate the need to repeatedly reallocate res when appending to it.
res = []
conversion = {}
i = 0
for x in l:
    if x not in conversion:
        value = conversion[x] = i
        i += 1
    else:
        value = conversion[x]
    res.append(value)
Well I guess it depends on if you want it to return the indexes in that specific order or not. If you want the example to return:
[1,1,2,3,1,2]
then you can look at the other answers submitted. However if you only care about getting a unique index for each unique number then I have a fast solution for you
import numpy as np
l = [10,10,20,15,10,20]
a = np.array(l)
x,y = np.unique(a,return_inverse = True)
and for this example the output of y is:
y = [0,0,2,1,0,2]
I tested this for 1,000,000 entries and it was done essentially immediately.
Your solution is slow because its complexity is O(nm) with m being the number of unique elements in l: a.index() is O(m) and you call it for every element in l.
To make it O(n), get rid of index() and store indexes in a dictionary:
>>> idx, indexes = 1, {}
>>> for x in l:
...     if x not in indexes:
...         indexes[x] = idx
...         idx += 1
...
>>> [indexes[x] for x in l]
[1, 1, 2, 3, 1, 2]
If l contains only integers in a known range, you could also store indexes in a list instead of a dictionary for faster lookups.
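A rough sketch of that list-based variant, assuming the values are non-negative integers no larger than a known bound. max_value is a hypothetical parameter, not something given in the question:

```python
l = [10, 10, 20, 15, 10, 20]
max_value = 20  # assumed upper bound on the values in l (hypothetical)

slots = [0] * (max_value + 1)  # slots[x] == 0 means x has not been seen yet
next_index = 1
res = []
for x in l:
    if slots[x] == 0:
        slots[x] = next_index
        next_index += 1
    res.append(slots[x])
print(res)  # [1, 1, 2, 3, 1, 2]
```

This trades memory proportional to the value range for avoiding hashing, so it only makes sense when the range is small relative to the data.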
You can use collections.OrderedDict() to preserve the unique items in order, then loop over the enumerate of these ordered unique items to get a dict mapping each item to its index (based on that order). Finally, pass this dictionary, together with the main list, to operator.itemgetter() to get the corresponding index for each item:
>>> from collections import OrderedDict
>>> from operator import itemgetter
>>> itemgetter(*lst)({j:i for i,j in enumerate(OrderedDict.fromkeys(lst),1)})
(1, 1, 2, 3, 1, 2)
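On Python 3.7 and later, where plain dicts preserve insertion order, the same idea can be sketched without OrderedDict or itemgetter. The names index_of and indexed are illustrative:

```python
lst = [10, 10, 20, 15, 10, 20]

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
index_of = {value: i for i, value in enumerate(dict.fromkeys(lst), 1)}
indexed = [index_of[value] for value in lst]
print(indexed)  # [1, 1, 2, 3, 1, 2]
```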
For completeness, you can also do it eagerly:
from itertools import count
wordid = dict(zip(set(list_), count(1)))
This uses a set to obtain the unique words in list_, pairs each of those unique words with the next value from count() (which counts upwards), and constructs a dictionary from the results. Note that a set is unordered, so the indices do not follow the order of first appearance.
Original answer, written by nneonneo.

iterating quickly through list of tuples

I wonder whether there's a quicker and less time consuming way to iterate over a list of tuples, finding the right match. What I do is:
# this is a very long list.
my_list = [ (old1, new1), (old2, new2), (old3, new3), ... (oldN, newN)]
# go through entire list and look for match
for j in my_list:
    if j[0] == VALUE:
        PAIR_FOUND = True
        MATCHING_VALUE = j[1]
        break
This code can take quite some time to execute, depending on the number of items in the list. I'm sure there's a better way of doing this.
I think that you can use
for j,k in my_list:
    [ ... stuff ... ]
Assuming a bit more memory usage is not a problem and if the first item of your tuple is hashable, you can create a dict out of your list of tuples and then looking up the value is as simple as looking up a key from the dict. Something like:
dct = dict(tuples)
val = dct.get(key) # None if item not found else the corresponding value
EDIT: To create a reverse mapping, use something like:
revDct = dict((val, key) for (key, val) in tuples)
The question is dead, but knowing one more way doesn't hurt:
my_list = [ (old1, new1), (old2, new2), (old3, new3), ... (oldN, newN)]
for first, *args in my_list:  # Python 3 extended unpacking; args is a list
    if first == VALUE:
        PAIR_FOUND = True
        MATCHING_VALUE = args
        break
The code can be cleaned up, but if you are using a list to store your tuples, any such lookup will be O(N).
If lookup speed is important, you should use a dict to store your tuples. The key should be the 0th element of your tuples, since that's what you're searching on. You can easily create a dict from your list:
my_dict = dict(my_list)
Then, (VALUE, my_dict[VALUE]) will give you your matching tuple (assuming VALUE exists).
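A small sketch of the dict-based lookup, using made-up string pairs in place of the real (old, new) values and reusing the question's flag names:

```python
# Hypothetical stand-in for the long list of (old, new) pairs.
my_list = [("old1", "new1"), ("old2", "new2"), ("old3", "new3")]

lookup = dict(my_list)  # built once, O(N)

VALUE = "old2"
PAIR_FOUND = VALUE in lookup        # average O(1) membership test
MATCHING_VALUE = lookup.get(VALUE)  # None when VALUE is absent
print(PAIR_FOUND, MATCHING_VALUE)   # True new2
```

One difference from the loop: if the same first element appears more than once, dict() keeps the last pair, whereas the loop with break returns the first match.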
I wonder whether the below method is what you want.
You can use defaultdict.
>>> from collections import defaultdict
>>> s = [('red',1), ('blue',2), ('red',3), ('blue',4), ('red',1), ('blue',4)]
>>> d = defaultdict(list)
>>> for k, v in s:
...     d[k].append(v)
...
>>> sorted(d.items())
[('blue', [2, 4, 4]), ('red', [1, 3, 1])]
