I want to subtract one list from another while respecting repetitions:
>>> a = ['a', 'b', 'c','c', 'c', 'c', 'd', 'e', 'e']
>>> b = ['a', 'c', 'e', 'f','c']
>>> a - b
['b', 'c','c', 'd', 'e']
Order of elements does not matter.
There is a question with answers here, but it ignores repetitions. The solutions there would give:
>>> a - b
['b', 'd']
One solution considers duplicates, but it alters one of the original lists:
[i for i in a if not i in b or b.remove(i)]
I wrote this solution:
a_sub_b = list(a)
b_sub_a = list(b)
for e in a:
    if e in b_sub_a:
        a_sub_b.remove(e)
        b_sub_a.remove(e)
print a_sub_b  # a - b
print b_sub_a  # b - a
That works for me, but is there a better solution, simpler or more efficient?
If order doesn't matter, use collections.Counter:
c = list((Counter(a) - Counter(b)).elements())
Counter(a) - Counter(b) builds a Counter with the count of an element x equal to the number of times x appears in a minus the number of times x appears in b. elements() creates an iterator that yields each element a number of times equal to its count, and list turns that into a list. The whole thing takes O(len(a)+len(b)) time.
Note that depending on what you're doing, it might be best to not work in terms of lists and just keep a, b, and c represented as Counters.
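For concreteness, here is the same one-liner run against the lists from the question:

```python
from collections import Counter

a = ['a', 'b', 'c', 'c', 'c', 'c', 'd', 'e', 'e']
b = ['a', 'c', 'e', 'f', 'c']

# Counts that would go negative are dropped entirely, so the extra 'f'
# in b simply disappears from the result.
c = list((Counter(a) - Counter(b)).elements())
print(sorted(c))  # ['b', 'c', 'c', 'd', 'e']
```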
This is going to search every element of b for each element of a. It's also going to do a linear remove on each list for each element that matches. So, your algorithm takes quadratic time—O(max(N, M)^2) where N is the length of a and M is the length of b.
If you just copy b into a set instead of a list, that solves the problem. Now you're just doing a constant-time set lookup for each element in a, and a constant-time set remove instead of a list remove. But you still have the problem of the linear-time (and potentially incorrect) removing from the copy of a. And you can't just copy a into a set, because that loses duplicates.
On top of that, a_sub_b.remove(e) removes an element matching e. That isn't necessarily the same element as the element you just looked up. It's going to be an equal element, and if identity doesn't matter at all, that's fine… but if it does, then remove may do the wrong thing.
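A tiny illustration of that equality-versus-identity point (contrived values, just to show the behavior):

```python
x = [1.0, 1]
x.remove(1)        # removes the first element *equal* to 1: the float 1.0
print(x)           # [1]
print(type(x[0]))  # <class 'int'>
```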
At any rate, performance is already a good enough reason not to use remove. Once you've solved the problems above, this is the only thing making your algorithm quadratic instead of linear.
The easiest way to solve this problem is to build up a new list, rather than copying the list and removing from it.
Solving both problems gives you O(N + M) time, which is linear.
So, putting the two together:
b_set = set(b)
new_a = []
for element in a:
    if element in b_set:
        b_set.remove(element)
    else:
        new_a.append(element)
However, this still may have a problem. You haven't stated things very clearly, so it's hard to be sure, but can b contain duplicates, and, if so, does that mean the duplicated elements should be removed from a multiple times? If so, you need a multi-set, not a set. The easiest way to do that in Python is with a Counter:
from collections import Counter

b_counts = Counter(b)
new_a = []
for element in a:
    if b_counts[element]:
        b_counts[element] -= 1
    else:
        new_a.append(element)
On the other hand, if the order of neither a nor b matters, this just reduces to multiset difference, which makes it even easier:
new_a = list((Counter(a) - Counter(b)).elements())
But really, if the order of both is meaningless, you probably should have been using a Counter or other multiset representation in the first place, not a list…
The following uses the standard library only:
a = ['a', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e']
b = ['a', 'c', 'e', 'f','c']
a_set = set(a)
b_set = set(b)
only_in_a = list(a_set - b_set)
diff_list = list()
for _o in only_in_a:
    # elements that never appear in b keep their full count from a
    diff_list.extend([_o] * a.count(_o))
for _b in b_set:
    # shared elements keep only the surplus count
    # (a negative count multiplies to an empty list)
    diff_list.extend([_b] * (a.count(_b) - b.count(_b)))
print diff_list
And gives:
['b', 'b', 'd', 'd', 'd', 'c', 'c', 'e']
as expected.
Related
my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']
The common items are:
c = ['a', 'b', 'a']
The code:
for e in my_list:
    if e in my_list_2:
        c.append(e)
...
If the my_list is long, this would be very inefficient. If I convert both lists into two sets, then use set's intersection() function to get the common items, I will lose the duplicates in my_list.
How to deal with this efficiently?
dict is already a hashmap, so lookups are practically as efficient as a set's, and you may not need any extra work collecting the values. If that weren't the case, you could pack the values into a set to check before checking the dict.
However, a large improvement may be to make a generator for the values, rather than creating a new intermediate list, to iterate over where you actually want the values
def foo(src_dict, check_list):
    for value in check_list:
        if value in src_dict:
            yield value
With the edit, you may find you're better off packing all the inputs into a set
def foo(src_list, check_list):
    hashmap = set(src_list)
    for value in check_list:
        if value in hashmap:
            yield value
If you know a lot about the inputs, you can do better, but that's an unusual case. For example, if the lists are sorted you could bisect; or if you have a huge verifying list, or very few values to check against it, you may find some efficiency in the ordering and in building a set.
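For the sorted-input case, here is a sketch using the standard bisect module (the function name and structure are my own, not from the original answer):

```python
import bisect

def common_sorted(sorted_src, check_list):
    """Yield each item of check_list that also occurs in the sorted list sorted_src."""
    for value in check_list:
        i = bisect.bisect_left(sorted_src, value)  # O(log n) search
        if i < len(sorted_src) and sorted_src[i] == value:
            yield value

src = ['a', 'b', 'd', 'e', 'f']  # must already be sorted
print(list(common_sorted(src, ['a', 'b', 'c', 'a'])))  # ['a', 'b', 'a']
```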
I am not sure about time efficiency, but, personally, a list comprehension is always more appealing to me:
[x for x in my_list if x in my_list_2]
Output
['a', 'b', 'a']
First, utilize the set.intersection() method to get the intersecting values of the lists. Then use a nested list comprehension to include the duplicates, based on each value's count in the original list:
my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']
c = [x for x in set(my_list).intersection(set(my_list_2)) for _ in range(my_list.count(x))]
print(c)
The above may be slower than just
my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']
c = []
for e in my_list:
if e in my_list_2:
c.append(e)
print(c)
But when the lists are significantly larger, the code block utilizing the set.intersection() method will be significantly more efficient (faster).
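A third option worth sketching: build a set from my_list_2 once and filter my_list against it. This keeps my_list's duplicates and order while paying only O(1) per membership test:

```python
my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']

lookup = set(my_list_2)  # one-time O(len(my_list_2)) cost
c = [x for x in my_list if x in lookup]
print(c)  # ['a', 'b', 'a']
```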
Sorry for not reading the post carefully, and now it is not possible to delete. However, here is an attempt at a solution.
c = lambda my_list, my_list_2: (my_list, my_list_2, list(set(my_list).intersection(set(my_list_2))))
print("(list_1,list_2,duplicate_items) ->", c(my_list, my_list_2))
Output:
(list_1,list_2,duplicate_items) -> (['a', 'b', 'a', 'd', 'e', 'f'], ['a', 'b', 'c'], ['b', 'a'])
Or it can be:
[i for i in my_list if i in my_list_2]
output:
['a', 'b', 'a']
Assume the following set of symbols: [A, B, C]
Assume the following result set size of 4.
I would like to produce a list of all permutations, with repeated symbols, but without "functionally identical" items.
Ie: [A, A, A, A] is "functionally identical" to [B, B, B, B] and thus [B, B, B, B] should not be in the final result, etc.
I tried generating the full 3^4 possibilities, then "rotating the symbols", checking one by one if there is a dupe and removing them but I realized this doesn't catch cases where "two symbols swap", and of course, when increasing the number of symbols and set size, there are plenty of other "symbol swap" cases I'm not accounting for. Plus, it seems like "generate the worst case and then prune" is a terrible algorithm, I'm sure there is a much better way.
Here is a manually generated result of the expected output:
['A', 'A', 'A', 'A']
['A', 'A', 'A', 'B']
['A', 'A', 'B', 'A']
['A', 'A', 'B', 'B']
['A', 'A', 'B', 'C']
['A', 'B', 'A', 'A']
['A', 'B', 'A', 'B']
['A', 'B', 'A', 'C']
['A', 'B', 'B', 'A']
['A', 'B', 'B', 'B']
['A', 'B', 'B', 'C']
['A', 'B', 'C', 'A']
['A', 'B', 'C', 'B']
['A', 'B', 'C', 'C']
(And of course, at some point I want to extend the symbol set size and the result set size to see what results I get.)
(Preferred language is python but I'm not picky, I'm just trying to understand the algorithm)
Edit: Let me clarify my definition of "functionally identical". Essentially all that matters is the "topology" of the result. For example, let's say that random colors are assigned to the symbols once we have the sets generated.
[A A A A] simply means "All the items are the same color", thus, [B B B B] would be functionally identical. There is no difference between the two because we don't know what random color is going to be assigned to A or B, all we know is that they are all the same color.
Another example:
[A A A B] is functionally identical to [B B B C], because again, we don't know what colors will be assigned to what symbols, all we know is "The last color is different from the first three."
The order matters though!
[A A A B] != [B A A A]. In the first example, all items are the same color except the LAST item. In the second example, all items are the same color except the FIRST item.
This is absolutely a mathematical construct, more advanced than simple permutation, I just don't know the name for it.
Here's a recursive algorithm that does it. The key idea is that to break symmetry between the different letters, we are only allowed to add a letter that has already been used, or the first unused letter.
Given a partial solution t:
If t has the required length, yield it.
Otherwise:
For each distinct letter already in t, recurse by extending t with that letter.
If t doesn't use each possible letter, recurse by extending t with the first unused letter.
Here's a Python implementation, as a recursive generator function:
def gen_seqs(letters, n):
    def helper(used, t):
        if len(t) == n:
            yield t
        else:
            for i in range(used):
                yield from helper(used, t + letters[i])
            if used < len(letters):
                yield from helper(used + 1, t + letters[used])
    return helper(0, '')
Example:
>>> for t in gen_seqs('ABC', 4):
... print(t)
...
AAAA
AAAB
AABA
AABB
AABC
ABAA
ABAB
ABAC
ABBA
ABBB
ABBC
ABCA
ABCB
ABCC
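If I'm not mistaken, these are the restricted growth strings of length 4 over at most 3 distinct symbols, so the count should be the sum of Stirling numbers of the second kind S(4,1) + S(4,2) + S(4,3) = 1 + 7 + 6 = 14, matching the hand-built list. A quick sanity check (restating the generator so the snippet is self-contained):

```python
def gen_seqs(letters, n):
    # same algorithm as the answer above: extend with a letter already
    # in use, or with the first unused letter
    def helper(used, t):
        if len(t) == n:
            yield t
        else:
            for i in range(used):
                yield from helper(used, t + letters[i])
            if used < len(letters):
                yield from helper(used + 1, t + letters[used])
    return helper(0, '')

print(len(list(gen_seqs('ABC', 4))))  # 14, matching the hand-built list
```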
I'm trying to create a list of lists from a single list. I'm able to do this when the sublists all have the same number of elements, but that will not always be the case.
As said earlier, the function below works when the sublists have the same number of elements.
I've tried using regular expressions to determine whether an element matches a pattern, using
pattern2=re.compile(r'\d\d\d\d\d\d') because the first value of each new sublist will always be 6 digits and will be the only one that follows that format. However, I'm not sure of the syntax for making it stop at the next match and start another list.
def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]
The code above works if the list of lists will contain the same number of elements
Below is what I expect.
OldList=[111111,a,b,c,d,222222,a,b,c,333333,a,d,e,f]
DesiredList=[[111111,a,b,c,d],[222222,a,b,c],[333333,a,d,e,f]]
Many thanks indeed.
Cheers
There is likely a much more efficient way to do this (with fewer loops), but here is one approach: find the indexes of the breakpoints, then slice the list from index to index, appending None to the end of the indexes list to capture the remaining items. If your 6-digit numbers are really strings, you could eliminate the str() inside re.match().
import re
d = [111111,'a','b','c','d',222222,'a','b','c',333333,'a','d','e','f']
indexes = [i for i, x in enumerate(d) if re.match(r'\d{6}', str(x))]
groups = [d[s:e] for s, e in zip(indexes, indexes[1:] + [None])]
print(groups)
# [[111111, 'a', 'b', 'c', 'd'], [222222, 'a', 'b', 'c'], [333333, 'a', 'd', 'e', 'f']]
You can use a fold.
First, define a function to locate the start flag:
>>> def is_start_flag(v):
... return len(v) == 6 and v.isdigit()
That will be useful if the flags are not exactly what you expected them to be, or to exclude some false positives, or even if you need a regex.
Then use functools.reduce:
>>> L = ['111111', 'a', 'b', 'c', 'd', '222222', 'a', 'b', 'c', '333333', 'a', 'd', 'e', 'f']
>>> import functools
>>> functools.reduce(lambda acc, x: acc+[[x]] if is_start_flag(x) else acc[:-1]+[acc[-1]+[x]], L, [])
[['111111', 'a', 'b', 'c', 'd'], ['222222', 'a', 'b', 'c'], ['333333', 'a', 'd', 'e', 'f']]
If the next element x is a start flag, append a new list [x] to the accumulator. Otherwise, add the element to the current list, i.e. the last list of the accumulator.
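One caveat: rebuilding the accumulator with acc[:-1] + [acc[-1] + [x]] copies it on every step, making the reduce quadratic overall. A plain loop performs the same fold in linear time (a sketch assuming, as in the data above, that the input starts with a flag):

```python
def is_start_flag(v):
    # same predicate as above: a 6-character string of digits
    return len(v) == 6 and v.isdigit()

L = ['111111', 'a', 'b', 'c', 'd', '222222', 'a', 'b', 'c', '333333', 'a', 'd', 'e', 'f']

groups = []
for x in L:
    if is_start_flag(x):
        groups.append([x])    # a flag starts a new group
    else:
        groups[-1].append(x)  # otherwise extend the current (last) group
print(groups)
```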
When looping through a list, you can work with the current item of the list. For example, if you want to replace certain items with others, you can use:
a=['a','b','c','d','e']
b=[]
for i in a:
    if i == 'b':
        b.append('replacement')
    else:
        b.append(i)
print b
['a', 'replacement', 'c', 'd', 'e']
However, I wish to replace certain values based not on index i, but on index i+1. I've been trying for ages and I can't seem to make it work. I would like something like this:
c=['a','b','c','d','e']
d=[]
for i in c:
    if i+1 == 'b':
        d.append('replacement')
    else:
        d.append(i)
print d
d=['replacement','b','c','d','e']
Is there any way to achieve this?
Use a list comprehension along with enumerate
>>> ['replacement' if a[i+1]=='b' else v for i,v in enumerate(a[:-1])]+[a[-1]]
['replacement', 'b', 'c', 'd', 'e']
The code replaces all those elements whose next element is b. However, to take care of the last index and prevent an IndexError, we loop only to the penultimate element and then append the last element unchanged.
Without a list comprehension
a=['a','b','c','d','e']
d=[]
for i, v in enumerate(a[:-1]):
    if a[i+1] == 'b':
        d.append('replacement')
    else:
        d.append(v)
d.append(a[-1])
print d
It's generally better style to not iterate over indices in Python. A common way to approach a problem like this is to use zip (or the similar izip_longest in itertools) to see multiple values at once:
In [32]: from itertools import izip_longest
In [33]: a=['a','b','c','d','e']
In [34]: b = []
In [35]: for c, next in izip_longest(a, a[1:]):
   ....:     if next == 'd':
   ....:         b.append("replacement")
   ....:     else:
   ....:         b.append(c)
   ....:
In [36]: b
Out[36]: ['a', 'b', 'replacement', 'd', 'e']
I think there's a confusion in your post between the list indices and list elements. In the loop as you have written it i will be the actual element (e.g. 'b') and not the index, thus i+1 is meaningless and will throw a TypeError exception.
I think one of the smallest set of changes you can do to your example to make it work is:
c = ['a', 'b', 'c', 'd', 'e']
d = []
for i, el in enumerate(c[:-1]):
    if c[i + 1] == 'b':
        d.append('replacement')
    else:
        d.append(el)
print d
# Output...
# ['replacement', 'b', 'c', 'd']
Additionally it's undefined how you should deal with the boundaries. Particularly when i points to the last element 'e', what should i+1 point to? There are many possible answers here. In the example above I've chosen one option, which is to end the iteration one element early (so we never point to the last element e).
If I was doing this I would do something similar to a combination of the other answers:
c = ['a', 'b', 'c', 'd', 'e']
d = ['replacement' if next == 'b' else current
     for current, next in zip(c[:-1], c[1:])]
print d
# Output...
# ['replacement', 'b', 'c', 'd']
where I have used a list comprehension to avoid the loop, and zip on the list and a shifted list to avoid the explicit indices.
Try using the index of the current element to check the next element in the list.
Replace
if i+1=='b':
with
if c[c.index(i)+1]=='b':
Note that c.index(i) returns the index of the first occurrence of i, so this approach misbehaves when the list contains duplicates, and it performs a linear search on every iteration.
Let's say I have a Python list that looks like this:
list = [ a, b, c, d]
I am looking for the most efficient way, performance-wise, to get this:
list = [ a, a, a, a, b, b, b, c, c, d ]
So if the list is N elements long, then the first element is cloned N-1 times, the second element N-2 times, and so forth; the last element is cloned N-N times, i.e. 0 times. Any suggestions on how to do this efficiently on large lists?
Note that I am testing speed, not correctness. If someone wants to edit in a unit test, I'll get around to it.
pyfunc_fastest: 152.58769989 usecs
pyfunc_local_extend: 154.679298401 usecs
pyfunc_iadd: 158.183312416 usecs
pyfunc_xrange: 162.234091759 usecs
pyfunc: 166.495800018 usecs
Ignacio: 238.87629509 usecs
Ishpeck: 311.713695526 usecs
FabrizioM: 456.708812714 usecs
JohnKugleman: 519.239497185 usecs
Bwmat: 1309.29429531 usecs
Test code here. The second revision is trash because I was rushing to get everybody tested that posted after my first batch of tests. These timings are for the fifth revision of the code.
Here's the fastest version that I was able to get.
def pyfunc_fastest(x):
    t = []
    lenList = len(x)
    extend = t.extend
    for l in xrange(0, lenList):
        extend([x[l]] * (lenList - l))
    return t
Oddly, a version that I modified to avoid indexing into the list by using enumerate ran slower than the original.
>>> items = ['a', 'b', 'c', 'd']
>>> [item for i, item in enumerate(items) for j in xrange(len(items) - i)]
['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd']
First we use enumerate to pull out both indexes and values at the same time. Then we use a nested for loop to iterate over each item a decreasing number of times. (Notice that the variable j is never used. It is junk.)
This should be near optimal, with minimal memory usage thanks to the use of the enumerate and xrange generators.
How about this? A simple one:
>>> x = ['a', 'b', 'c', 'd']
>>> t = []
>>> lenList = len(x)
>>> for l in range(0, lenList):
...     t.extend([x[l]] * (lenList - l))
...
>>> t
['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd']
>>>
Lazy mode:
import itertools

l = ['foo', 'bar', 'baz', 'quux']
for x in itertools.chain.from_iterable(itertools.repeat(e, len(l) - i)
                                       for i, e in enumerate(l)):
    print x
Just shove it through list() if you really do need a list instead.
list(itertools.chain.from_iterable(itertools.repeat(e, len(l) - i)
for i, e in enumerate(l)))
My first instinct..
l = ['a', 'b', 'c', 'd']
nl = []
i = 0
while len(l[i:]) > 0:
    nl.extend([l[i]] * len(l[i:]))
    i += 1
print nl
The trick is in using repeat from itertools
from itertools import repeat
alist = "a b c d".split()
print [ x for idx, value in enumerate(alist) for x in repeat(value, len(alist) - idx) ]
>>>['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd']
Use a generator: it's O(1) memory and O(N^2) cpu, unlike any solution that produces the final list which uses O(N^2) memory and cpu. This means it'll be massively faster as soon as the input list is large enough that the constructed list fills memory and swapping starts. It's unlikely you need to have the final list in memory unless this is homework.
def triangle(seq):
    for i, x in enumerate(seq):
        # element i appears len(seq) - i times in the expanded sequence
        for _ in xrange(len(seq) - i):
            yield x
Creating that new list requires O(n^2) time and space, since for an n-element input the expanded list contains n + (n-1) + ... + 1 = n(n+1)/2 elements. aaronasterling's answer builds it in time proportional to the output size, which is optimal if you actually need the list.
You could cheat and just not create the new list. Simply take the index value as input and map it back to an index in the original list. Note that because each element's block of copies is one shorter than the previous one, a plain division won't do; you have to step through (or invert) the triangular block sizes.
In pseudocode:
function getElement(int i)
{
    int k = 0, blockLen = n;   // the first element occupies n slots
    while (i >= blockLen) {    // skip whole blocks
        i -= blockLen;
        blockLen--;
        k++;
    }
    return list[k];            // index into the original array
}
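A runnable Python sketch of that on-demand lookup idea, accounting for the fact that element k occupies a shrinking block of len(lst) - k slots (names are mine; a closed-form inversion of the triangular numbers is also possible):

```python
def virtual_element(lst, i):
    """Return the element at flat index i of the expanded sequence
    without materializing it: element k occupies len(lst) - k slots."""
    block_len = len(lst)
    k = 0
    while i >= block_len:  # skip whole blocks
        i -= block_len
        block_len -= 1
        k += 1
    return lst[k]

lst = ['a', 'b', 'c', 'd']
print([virtual_element(lst, i) for i in range(10)])
# ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd']
```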
fwiw:
>>> lst = list('abcd')
>>> [i for i, j in zip(lst, range(len(lst), 0, -1)) for _ in range(j)]
['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd']
def gen_indices(list_length):
    for index in range(list_length):
        for _ in range(list_length - index):
            yield index

new_list = [list[i] for i in gen_indices(len(list))]
untested but I think it'll work