generating list of every combination without duplicates - python

I would like to generate a list of combinations. I will try to simplify my problem to make it understandable.
We have 3 variables :
x : number of letters
k : number of groups
n : number of letters per group
I would like to use Python to generate a list of every possible combination, without any duplicates, given that I don't care about the order of the groups or the order of the letters within a group.
As an example, with x = 4, k = 2, n = 2 :
# we start with 4 letters, we want to make 2 groups of 2 letters
letters = ['A','B','C','D']
# here would be a code that generate the list
# Here is the result that is very simple, only 3 combinations exist.
combos = [ ['AB', 'CD'], ['AC', 'BD'], ['AD', 'BC'] ]
Since I don't care about the order of the groups or the order of the letters within a group, ['AB', 'CD'] and ['DC', 'BA'] are duplicates.
This is a simplification of my real problem, which has those values : x = 12, k = 4, n = 3. I tried to use some functions from itertools, but with that many letters my computer freezes because it's too many combinations.
Another way of seeing the problem : you have 12 players, you want to make 4 teams of 3 players. What are all the possibilities ?
Could anyone help me to find an optimized solution to generate this list?

There will certainly be more sophisticated/efficient ways of doing this, but here's an approach that works in a reasonable amount of time for your example and should be easy enough to adapt for other cases.
It generates unique teams and unique combinations thereof, as per your specifications.
from itertools import combinations
# this assumes that team_size * team_num == len(players) is a given
team_size = 3
team_num = 4
players = list('ABCDEFGHIJKL')
unique_teams = [set(c) for c in combinations(players, team_size)]
def duplicate_player(combo):
    """Returns True if a player occurs in more than one team"""
    return len(set.union(*combo)) < len(players)
result = (combo for combo in combinations(unique_teams, team_num) if not duplicate_player(combo))
result is a generator that can be iterated or turned into a list with list(result). On kaggle.com, it takes a minute or so to generate the whole list of all possible combinations (a total of 15400, in line with the computations by @beaker and @John Coleman in the comments). Each combination is a tuple of sets, so the list looks like this:
[({'A', 'B', 'C'}, {'D', 'E', 'F'}, {'G', 'H', 'I'}, {'J', 'K', 'L'}),
({'A', 'B', 'C'}, {'D', 'E', 'F'}, {'G', 'H', 'J'}, {'I', 'K', 'L'}),
({'A', 'B', 'C'}, {'D', 'E', 'F'}, {'G', 'H', 'K'}, {'I', 'J', 'L'}),
...
]
If you want, you can turn each team into a string by calling ''.join(sorted(team)) on it (sets are unordered, so sorting gives a stable letter order).
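The total of 15400 can be cross-checked without enumerating anything. The following is a minimal sketch (the function name count_team_splits is made up for this illustration): line up all players, cut the line into consecutive teams, then divide out the orderings inside each team and the ordering of the teams themselves.

```python
from math import factorial

def count_team_splits(num_players, team_size):
    """Number of ways to split players into unordered teams of equal size."""
    if num_players % team_size != 0:
        raise ValueError("team_size must divide num_players")
    num_teams = num_players // team_size
    # n! orderings of the players, divided by the orderings within each team
    # and by the orderings of the teams themselves
    return factorial(num_players) // (
        factorial(team_size) ** num_teams * factorial(num_teams)
    )

print(count_team_splits(4, 2))   # 3
print(count_team_splits(12, 3))  # 15400
```

The first call reproduces the 3 combinations from the question's small example, the second the 15400 mentioned above.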

Another solution (players are numbered 0, 1, ...):
import itertools
def equipartitions(base_count: int, group_size: int):
    if base_count % group_size != 0:
        raise ValueError("group_size must divide base_count")
    return set(_equipartitions(frozenset(range(base_count)), group_size))
def _equipartitions(base_set: frozenset, group_size: int):
    if not base_set:
        yield frozenset()
    for combo in itertools.combinations(base_set, group_size):
        for rest in _equipartitions(base_set.difference(frozenset(combo)), group_size):
            yield frozenset({frozenset(combo), *rest})
all_combinations = [
    [tuple(team) for team in combo]
    for combo in equipartitions(12, 3)
]
print(all_combinations)
print(len(all_combinations))
And another:
import itertools
from typing import Iterable
def equipartitions(players: Iterable, team_size: int):
    if len(players) % team_size != 0:
        raise ValueError("team_size must divide the number of players")
    return _equipartitions(set(players), team_size)
def _equipartitions(players: set, team_size: int):
    if not players:
        yield []
        return
    # fixing the first player's team avoids generating the same split twice
    first_player, *other_players = players
    for other_team_members in itertools.combinations(other_players, team_size-1):
        first_team = {first_player, *other_team_members}
        for other_teams in _equipartitions(set(other_players) - set(first_team), team_size):
            yield [first_team, *other_teams]
all_combinations = [
    {''.join(sorted(team)) for team in combo} for combo in equipartitions(players='ABCDEFGHIJKL', team_size=3)
]
print(all_combinations)
print(len(all_combinations))

Firstly, you can use a list comprehension to give you all of the possible combinations (regardless of the duplicates):
comb = [(a,b) for a in letters for b in letters if a != b]
And, afterwards, you can use the sorted function to sort the tuples. After that, to remove the duplicates, you can convert all of the items to a set and then back to a list.
var = [tuple(sorted(sub)) for sub in comb]
var = list(set(var))
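As a quick sanity check, here is a small sketch (assuming letters is the four-letter list from the question) showing that the sort-and-deduplicate step above ends up with the same pairs that itertools.combinations produces directly:

```python
from itertools import combinations

letters = ['A', 'B', 'C', 'D']  # assumed input, taken from the question

# all ordered pairs of distinct letters: n * (n - 1) of them
comb = [(a, b) for a in letters for b in letters if a != b]

# sort each pair, then deduplicate via a set
pairs = sorted(set(tuple(sorted(sub)) for sub in comb))

# combinations() yields each unordered pair exactly once
direct = list(combinations(letters, 2))

print(pairs == direct)  # True
```

This only builds the pairs themselves; they still have to be grouped into complete splits afterwards.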

You could use the list comprehension approach, which runs n(n-1) iterations, or a more verbose loop that visits each unordered pair only once, i.e. n(n-1)/2 iterations. Both are O(n^2), but the second does half the work and needs no deduplication afterwards:
comb = []
for first_letter_idx, _ in enumerate(letters):
    for sec_letter_idx in range(first_letter_idx + 1, len(letters)):
        comb.append(letters[first_letter_idx] + letters[sec_letter_idx])
print(comb)
comb2 = []
for first_letter_idx, _ in enumerate(comb):
    for sec_letter_idx in range(first_letter_idx + 1, len(comb)):
        if (comb[first_letter_idx][0] not in comb[sec_letter_idx]
                and comb[first_letter_idx][1] not in comb[sec_letter_idx]):
            comb2.append([comb[first_letter_idx], comb[sec_letter_idx]])
print(comb2)
This algorithm needs more work to handle dynamic inputs. Maybe with recursion.
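One way the recursive generalization could look is sketched below. This is not the code above extended, but a hypothetical helper (group_letters is a made-up name) that assumes the letters are distinct single characters and that n divides their number. It always puts the first remaining letter into the next group, so no split is produced twice.

```python
from itertools import combinations

def group_letters(letters, n):
    """Yield every way to split `letters` into unordered groups of size n."""
    if not letters:
        yield []
        return
    first, rest = letters[0], letters[1:]
    # the first remaining letter always goes into the next group,
    # which rules out reordered duplicates of the same split
    for mates in combinations(rest, n - 1):
        group = first + "".join(mates)
        remaining = [c for c in rest if c not in mates]
        for others in group_letters(remaining, n):
            yield [group] + others

print(list(group_letters(list("ABCD"), 2)))
# [['AB', 'CD'], ['AC', 'BD'], ['AD', 'BC']]
```

For 12 letters in groups of 3 this yields 15400 splits, and because it is a generator the results can be consumed one at a time.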

Use combinations from itertools:
from itertools import combinations
x = list(combinations(['A','B','C','D'],2))
t = []
for i in x:
    t.append(i[0]+i[1]) # concatenating the strings and adding in a list
g = []
# with 4 letters, combinations() is in lexicographic order, so the pair at
# index i is complemented by the pair at index -1-i (e.g. 'AB' and 'CD')
for i in range(len(t)//2):
    g.append([t[i],t[-1-i]])
print(g)

Related

Counting triplets in a DNA-sequence

I want to write code that counts all triplets in a sequence. I've read plenty of posts so far, but none of them could help me.
This is my code:
def cnt(seq):
    mydict = {}
    if len(seq) % 3 == 0:
        a = [x for x in seq]
        for i in range(len(seq)//3):
            b = ''.join(a[(0+3*i):(3+3*i)])
            for base1 in ['A', 'T', 'G', 'C']:
                for base2 in ['A', 'T', 'G', 'C']:
                    for base3 in ['A', 'T', 'G', 'C']:
                        triplet = base1 + base2 + base3
                        if b == triplet:
                            mydict[b] = 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")
Does Biopython provide a function to solve this problem?
EDIT:
Note that, for instance, in the sequence 'ATGAAG', 'TGA' or 'GAA' are not "valid" triplets, only 'ATG' and 'AAG', because in biology and bioinformatics we read it as 'ATG' and 'AAG'; that's the information we need to translate it or do anything else with it.
You can imagine it as a sequence of words, for example "Hello world". The way we read it is "Hello" and "world", not "Hello", "ello ", "llo w",...
It took me a while to understand that you do not want to count the number of codons but the frequency of each codon. Your title is a bit misleading in this respect. Anyway, you can employ collections.Counter for your task:
from collections import Counter
def cnt(seq):
    if len(seq) % 3 == 0:
        #split list into codons of three
        codons = [seq[i:i+3] for i in range(0, len(seq), 3)]
        #create Counter dictionary for it
        codon_freq = Counter(codons)
        #determine number of codons, should be len(seq) // 3
        n = sum(codon_freq.values())
        #print out all entries in an appealing form
        for key in sorted(codon_freq):
            print("{}: {} = {:5.2f}%".format(key, codon_freq[key], codon_freq[key] * 100 / n))
        #or just the dictionary
        #print(codon_freq)
    else:
        print("Error")
seq = "ATCGCAGAAATCCGCAGAATC"
cnt(seq)
Sample output:
AGA: 1 = 14.29%
ATC: 3 = 42.86%
CGC: 1 = 14.29%
GAA: 1 = 14.29%
GCA: 1 = 14.29%
You can use clever techniques, as suggested in the other answers, but I will build a solution starting from your code, which is almost working: Your problem is that every time you do mydict[b] = 1, you reset the count of b to 1.
A minimal fix
You could solve this by testing if the key is present, if not, create the entry in the dict, then increment the value, but there are more convenient tools in python.
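For reference, the explicit "create the entry if it is missing, then increment" variant can be sketched with a plain dict as follows. The function name count_codons_plain is made up for this illustration, and the sketch returns the counts instead of printing them.

```python
def count_codons_plain(seq):
    """Count in-frame triplets with a plain dict (no defaultdict or Counter)."""
    counts = {}
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        # get() returns 0 for a codon seen for the first time,
        # so the count is incremented rather than reset to 1
        counts[codon] = counts.get(codon, 0) + 1
    return counts

print(count_codons_plain("ACTGGCACT"))  # {'ACT': 2, 'GGC': 1}
```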
A minimal change to your code would be to use a defaultdict(int) instead of a dict. Whenever a new key is encountered, it is assumed to have the associated default value for an int: 0. So you can increment the value instead of resetting:
from collections import defaultdict
def cnt(seq):
    # instantiate a defaultdict that creates ints when necessary
    mydict = defaultdict(int)
    if len(seq) % 3 == 0:
        a = [x for x in seq]
        for i in range(len(seq)//3):
            b = ''.join(a[(0+3*i):(3+3*i)])
            for base1 in ['A', 'T', 'G', 'C']:
                for base2 in ['A', 'T', 'G', 'C']:
                    for base3 in ['A', 'T', 'G', 'C']:
                        triplet = base1 + base2 + base3
                        if b == triplet:
                            # increment the existing count (or the default 0 value)
                            mydict[b] += 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")
It works as desired:
cnt("ACTGGCACT")
ACT: 2
GGC: 1
Some possible improvements
Now let's try to improve your code a bit.
First, as I wrote in the comments, let's avoid the unnecessary conversion of your sequence to a list, and use a better variable name for the currently counted codon:
from collections import defaultdict
def cnt(seq):
    mydict = defaultdict(int)
    if len(seq) % 3 == 0:
        for i in range(len(seq)//3):
            # slice the string directly, no intermediate list needed
            codon = seq[(0+3*i):(3+3*i)]
            for base1 in ['A', 'T', 'G', 'C']:
                for base2 in ['A', 'T', 'G', 'C']:
                    for base3 in ['A', 'T', 'G', 'C']:
                        triplet = base1 + base2 + base3
                        if codon == triplet:
                            mydict[codon] += 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")
Now let's simplify the nested loop part, which tries all possible codons, by generating the set of possible codons in advance:
from collections import defaultdict
from itertools import product
codons = {
    "".join((base1, base2, base3))
    for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
def cnt(seq):
    mydict = defaultdict(int)
    if len(seq) % 3 == 0:
        for i in range(len(seq)//3):
            codon = seq[(0+3*i):(3+3*i)]
            if codon in codons:
                mydict[codon] += 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")
Now, your code simply ignores the triplets that are not valid codons. Maybe you should instead issue a warning:
from collections import defaultdict
from itertools import product
codons = {
    "".join((base1, base2, base3))
    for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
def cnt(seq):
    mydict = defaultdict(int)
    if len(seq) % 3 == 0:
        for i in range(len(seq)//3):
            codon = seq[(0+3*i):(3+3*i)]
            # We count even invalid triplets
            mydict[codon] += 1
        # We display counts only for valid triplets
        for codon in sorted(codons):
            print("%s: %s" % (codon, mydict[codon]))
        # We compute the set of invalid triplets:
        # the keys that are not codons.
        invalid = mydict.keys() - codons
        # An empty set has value False in a test.
        # We issue a warning if the set is not empty.
        if invalid:
            print("Warning! There are invalid triplets:")
            print(", ".join(sorted(invalid)))
    else:
        print("Error")
A more fancy solution
Now a more fancy solution, using cytoolz (probably needs to be installed because it is not part of usual python distributions: pip3 install cytoolz, if you are using pip):
from collections import Counter
from itertools import product, repeat
from cytoolz import groupby, keymap, partition
# To make strings out of lists of strings
CAT = "".join
# The star "extracts" the elements from the result of repeat,
# so that product has 3 arguments, and not a single one
codons = {CAT(bases) for bases in product(*repeat("ACGT", 3))}
def cnt(seq):
    # keymap(CAT, ...) transforms the keys (that are tuples of letters)
    # into strings
    # if len(seq) is not a multiple of 3, pad="-" will append "-"
    # to complete the last triplet (which will be an invalid one)
    codon_counts = keymap(CAT, Counter(partition(3, seq, pad="-")))
    # separate encountered codons into valids and invalids
    codons_by_validity = groupby(codons.__contains__, codon_counts.keys())
    # get allows to provide a default value,
    # in case one of the categories is not present
    valids = codons_by_validity.get(True, [])
    invalids = codons_by_validity.get(False, [])
    # We display counts only for valid triplets
    for codon in sorted(valids):
        print("%s: %s" % (codon, codon_counts[codon]))
    # We issue a warning if there are invalid codons.
    if invalids:
        print("Warning! There are invalid triplets:")
        print(", ".join(sorted(invalids)))
Hope this helps.
You could do something like this:
from itertools import product
seq = 'ATGATG'
all_triplets = [seq[i:i+3] for i in range(len(seq)) if i <= len(seq)-3]
# this gives ['ATG', 'TGA', 'GAT', 'ATG']
# add more valid_triplets here
valid_triplets = ['ATG']
len([(i, j) for i, j in product(valid_triplets, all_triplets) if i==j])
Output:
2
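Note that the snippet above slides over the sequence one base at a time, so it also sees out-of-frame triplets such as 'TGA'. The question's edit asks for in-frame triplets only. A small sketch of the difference, using the 'ATGAAG' example from the question (the variable names sliding and in_frame are chosen for this illustration):

```python
seq = "ATGAAG"

# sliding window: every start position, triplets overlap
sliding = [seq[i:i + 3] for i in range(len(seq) - 2)]

# reading frame: start positions 0, 3, 6, ... so triplets do not overlap
in_frame = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

print(sliding)   # ['ATG', 'TGA', 'GAA', 'AAG']
print(in_frame)  # ['ATG', 'AAG']
```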
It is unclear what output is expected. Here we use one of many grouping functions from more_itertools to build adjacent triplets or "codons".
import more_itertools as mit
seq = "ATGATG"
# current more_itertools releases take the iterable first, then the group size
codons = ["".join(w) for w in mit.grouper(seq, 3)]
codons
# ['ATG', 'ATG']
Count the number of codons by calling len.
len(codons)
# 2
For more detailed analysis, consider splitting the problem into smaller functions that (1) extract codons and (2) compute occurrences.
Code
import collections as ct
import more_itertools as mit
def split_codons(seq):
    "Return codons from a sequence; raise for bad sequences."
    for w in mit.windowed(seq, n=3, step=3, fillvalue=""):
        part = "".join(w)
        if len(part) < 3:
            raise ValueError(f"Sequence not divisible by 3. Got extra '{part}'.")
        yield part
def count_codons(codons):
    """Return dictionary of codon occurrences."""
    dd = ct.defaultdict(int)
    for i, c in enumerate(codons, 1):
        dd[c] += 1
    return {k: (v, 100 * v/i) for k, v in dd.items()}
Demo
>>> seq = "ATCGCAGAAATCCGCAGAATC"
>>> bad_seq = "ATCGCAGAAATCCGCAGAATCA"
>>> list(split_codons(seq))
['ATC', 'GCA', 'GAA', 'ATC', 'CGC', 'AGA', 'ATC']
>>> list(split_codons(bad_seq))
ValueError: Sequence not divisible by 3. Got extra 'A'.
>>> count_codons(split_codons(seq))
{'ATC': (3, 42.857142857142854),
'GCA': (1, 14.285714285714286),
'GAA': (1, 14.285714285714286),
'CGC': (1, 14.285714285714286),
'AGA': (1, 14.285714285714286)}

All Unequal Subsets of Sorted String in Python 3

I would like to find all subsets of a sorted string, disregarding order and which characters are next to each other. I think the best way to explain this is through an example. The results should also be ordered from longest to shortest.
These are the results for bell.
bell
bel
bll
ell
be
bl
el
ll
b
e
l
I have thought of ways to do this, but none for any length of input.
Thank you!
There are generally two ways to approach such things: generate "everything" and weed out duplicates later, or create custom algorithms to avoid generating duplicates to begin with. The former is almost always easier, so that's what I'll show here:
def gensubsets(s):
    import itertools
    for n in reversed(range(1, len(s)+1)):
        seen = set()
        for x in itertools.combinations(s, n):
            if x not in seen:
                seen.add(x)
                yield "".join(x)
for x in gensubsets("bell"):
    print(x)
That prints precisely what you said you wanted, and how it does so should be more-than-less obvious.
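For comparison, the second approach (never generating a duplicate in the first place) could be sketched as follows. This is an illustration rather than part of the answer above: distinct_subsets is a made-up name, and the sketch assumes a sorted input string on Python 3.7+, where Counter keeps letters in order of first appearance. For each distinct letter it chooses how many copies to keep, from zero up to the number available.

```python
from collections import Counter
from itertools import product

def distinct_subsets(s):
    """Return the distinct sub-multisets of the characters of s, longest first."""
    counts = Counter(s)          # e.g. {'b': 1, 'e': 1, 'l': 2} for "bell"
    letters = list(counts)
    results = []
    # choose 0..count copies of each distinct letter
    for picks in product(*(range(counts[c] + 1) for c in letters)):
        if any(picks):           # skip the empty selection
            results.append("".join(c * k for c, k in zip(letters, picks)))
    # sorted() is stable, so only the length ordering is imposed here
    return sorted(results, key=len, reverse=True)

for sub in distinct_subsets("bell"):
    print(sub)
```

For "bell" this yields the same 11 strings as listed in the question, although the order among strings of equal length differs.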
Here is one way using itertools.combinations.
If the order for strings of the same length is important, see @TimPeters' answer.
from itertools import combinations
mystr = 'bell'
res = sorted({''.join(sorted(x, key=lambda j: mystr.index(j)))
              for i in range(1, len(mystr)+1) for x in combinations(mystr, i)},
             key=lambda k: -len(k))
# ['bell', 'ell', 'bel', 'bll', 'be', 'll', 'bl', 'el', 'l', 'e', 'b']
Explanation
Find all combinations of length in range(1, len(mystr)+1).
Sort by original string via key argument of sorted. This step may be omitted if not required.
Use set of ''.join on elements for unique strings.
Outer sorted call to go from largest to smallest.
You can try it in one line:
import itertools
data='bell'
print(set("".join(i) for t in range(1, len(data)+1) for i in itertools.combinations(data, r=t)))
output (a set, so the order is arbitrary):
{'bell', 'bel', 'bll', 'ell', 'el', 'be', 'bl', 'e', 'b', 'l', 'll'}

Python irregular loop

I'm looking for an elegant 'Pythonic' way to loop through input in groups of a certain size. But the size of the groups varies, and you don't know the size of a particular group until you start parsing.
Actually I only care about groups of two sizes. I have a large sequence of input which comes in groups of mostly size X but occasionally size Y. For this example let's say X=3 and Y=4 (my actual problem is X=7 and Y=8). I don't know until midway through the group of elements whether it's going to be a group of size X or of size Y.
In my problem I'm dealing with lines of input but to simplify I'll illustrate with characters.
If it's a group of a particular size, I know I'll always get input in the same sequence. For example, if it's a size X group I'll be getting elements of type [a,a,b], but if it's a size Y group I'll be getting elements of type [a,a,c,b]. If it's something of type 'a' I'll want to process it in a certain way, 'b' in another way, etc.
Obviously I have to test an element at some point to determine if it's of type one group or the other. As demonstrated above I cannot just check the type of every element because there may be two of the same in sequence. Also the groups may be the same pattern at the start and only differ near the end. In this example the earliest I can check if I'm in a size X or size Y group is by testing the 3rd element (to see if it's of type 'c' or type 'b').
I have a solution with a for loop with an exposed index and two extra variables, but I'm wondering if there is a more elegant solution.
Here is my code. I've put pass statements in place of where I would do the actual parsing depending on what type it is:
counter = 0
group = 3
for index, x in enumerate("aabaabaacbaabaacbaabaab"):
    column = index - counter
    print(str(index) + ", " + x + ", " + str(column))
    if column == 0:
        pass
    elif column == 1:
        pass
    elif column == 2:
        if x == 'b':
            pass
        elif x == 'c':
            group = 4
    elif column == 3:
        pass
    if column + 1 == group:
        counter += group
        group = 3
In the code example the input stream is aabaabaacbaabaacbaabaab so that is groups of:
aab (3)
aab (3)
aacb (4)
aab (3)
aacb (4)
aab (3)
aab (3)
I would use a generator that collects these groups, determines the size of each, and then ultimately yields each group (shown here with groups abc and abdc rather than the question's aab and aacb):
def getGroups (iterable):
    group = []
    for item in iterable:
        group.append(item)
        if len(group) == 3 and group[2] == 'c':
            yield group
            group = []
        elif len(group) == 4 and group[2] == 'd':
            yield group
            group = []
for group in getGroups('abcabcabdcabcabdcabcabc'):
    print(group)
['a', 'b', 'c']
['a', 'b', 'c']
['a', 'b', 'd', 'c']
['a', 'b', 'c']
['a', 'b', 'd', 'c']
['a', 'b', 'c']
['a', 'b', 'c']
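The same generator idea, adapted to the letters used in the question (aab and aacb), might look like the sketch below. The function name get_groups and its parameters marker, short and long are assumptions made for this illustration. The group size is decided when the third element arrives, and a trailing incomplete group is silently dropped.

```python
def get_groups(iterable, marker="c", short=3, long=4):
    """Yield groups of `short` items, or `long` items when the
    item at position `short` (1-based) equals `marker`."""
    group = []
    size = short
    for item in iterable:
        group.append(item)
        # the third element tells us which kind of group this is
        if len(group) == short and item == marker:
            size = long
        if len(group) == size:
            yield group
            group = []
            size = short

for group in get_groups("aabaabaacbaabaacbaabaab"):
    print("".join(group))
```

With the input stream from the question this prints aab, aab, aacb, aab, aacb, aab, aab, one per line.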
Looks like you need a simple automaton with backtracking, for example:
def parse(tokens, patterns):
    backtrack = False
    i = 0
    while tokens:
        head, tail = tokens[:i+1], tokens[i+1:]
        candidates = [p for p in patterns if p.startswith(head)]
        match = any(p == head for p in candidates)
        if match and (backtrack or len(candidates) == 1 or not tail):
            yield head
            tokens = tail
            backtrack = False
            i = 0
        elif not candidates:
            if not i or backtrack:
                raise SyntaxError(head)
            else:
                backtrack = True
                i -= 1
        elif tail:
            i += 1
        else:
            raise SyntaxError(head)
tokens = 'aabaabcaabaabcxaabxxyzaabx'
patterns = ['aab', 'aabc', 'aabcx', 'x', 'xyz']
for p in parse(tokens, patterns):
    print(p)
With your particular example, you could use a regex:
>>> import re
>>> s="aabaabaacbaabaacbaabaab"
>>> re.findall(r'aac?b', s)
['aab', 'aab', 'aacb', 'aab', 'aacb', 'aab', 'aab']
If you want a parser that does the same thing, you can do:
def parser(string, patterns):
    # try the longest patterns first
    patterns=sorted(patterns, key=len, reverse=True)
    i=0
    error=False
    while i<len(string) and not error:
        for pat in patterns:
            j=len(pat)
            if string[i:i+j]==pat:
                i+=j
                yield pat
                break
        else:
            # no pattern matched at position i
            error=True
    if error or i<len(string):
        raise SyntaxWarning("Did not match the entire string")
>>> list(parser(s, ['aab', 'aacb']))
['aab', 'aab', 'aacb', 'aab', 'aacb', 'aab', 'aab']

Sort a dictionary alphabetically, and print it by frequency

I am running python 2.7.2 on a mac.
I have a simple dictionary:
dictionary= {a,b,c,a,a,b,b,b,b,c,a,w,w,p,r}
I want it to be printed and have the output like this:
Dictionary in alphabetical order:
a 4
b 5
c 2
p 1
r 1
w 2
But what I'm getting is something like this...
a 1
a 1
a 1
a 1
b 1
.
.
.
w 1
This is the code I am using.
new_dict = []
for word in dictionary.keys():
    value = dictionary[word]
    string_val = str(value)
    new_dict.append(word + ": " + string_val)
sorted_dictionary = sorted(new_dict)
for entry in sorted_dictionary:
    print entry
Can you please tell me where is the mistake?
(By the way, I'm not a programmer but a linguist, so please go easy on me.)
What you're using is not a dictionary, it's a set! :)
And sets don't allow duplicates.
What you probably need is not dictionaries, but lists.
A little explanation
Dictionaries have keys, and each unique key has its own value:
my_dict = {1:'a', 2:'b', 3:'c'}
You retrieve values by using the keys:
>>> my_dict[1]
'a'
On the other hand, a list doesn't have keys.
my_list = ['a','b','c']
And you retrieve the values using their index:
>>> my_list[1]
'b'
Keep in mind that indices start counting from zero, not 1.
Solving The Problem
Now, for your problem. First, store the characters as a list:
l = ['a', 'b', 'c', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'a', 'w', 'w', 'p', 'r']
Next, we'll need to know what items are in this list:
items = []
for item in l:
    if item not in items:
        items.append(item)
This is pretty much equivalent to items = set(l); the only difference is that the result is a list, which keeps the order of first appearance. I spelled it out so that it is clear what the code does.
Here is the content of items:
>>> items
['a', 'b', 'c', 'w', 'p', 'r']
With that done, we will use the list.count() method to count each character's occurrences in your list, and the built-in function sorted() to sort the items:
for item in sorted(items): #iterates through the sorted items.
    print item, l.count(item)
Result:
a 4
b 5
c 2
p 1
r 1
w 2
Hope this helps!!
Let's start with the obvious, this:
dictionary= {a,b,c,a,a,b,b,b,b,c,a,w,w,p,r}
is not a dictionary. It is a set, and sets do not preserve duplicates. You probably meant to declare that as a list or a tuple.
Now, onto the meat of your problem: you need to implement something to count the items of your collection. Your implementation doesn't really do that. You could roll your own, but really you should use a Counter:
my_list = ['a','b','c','a','a','b','b','b','b','c','a','w','w','p','r']
from collections import Counter
c = Counter(my_list)
c
Out[19]: Counter({'b': 5, 'a': 4, 'c': 2, 'w': 2, 'p': 1, 'r': 1})
Now on to your next problem: dictionaries (including Counter objects) do not keep their keys in sorted order, and in Python 2.7 the order is arbitrary. You need to call sorted on the dict's items(), which gives a list of tuples, then iterate over that to do your printing.
for k,v in sorted(c.items()):
    print('{}: {}'.format(k,v))
a: 4
b: 5
c: 2
p: 1
r: 1
w: 2
A dictionary looks like {key1: content1, key2: content2, ...}, and every key in a dictionary is unique. By contrast, a = {1,2,3,4,5,5,4,5,6} is a set; when you print it, you will notice that
print a
set([1,2,3,4,5,6])
duplicates are eliminated.
In your case, a better data structure is a list, which can hold duplicates.
If you want to count how often each element occurs, a better option is collections.Counter, for instance:
import collections as c
cnt = c.Counter()
letters = ['a','b','c','a','a','b','b','b','b','c','a','w','w','p','r']
for item in letters:
    cnt[item]+=1
print cnt
The result would be:
Counter({'b': 5, 'a': 4, 'c': 2, 'w': 2, 'p': 1, 'r': 1})
As you can see, the result behaves like a dictionary.
So by using:
for key in cnt.keys():
    print key, cnt[key]
you can access each key and its count:
a 4
c 2
b 5
p 1
r 1
w 2
You can achieve what you want by modifying this a little bit. Hope this is helpful.
A dictionary cannot be defined as {'a','b'}. Written that way it is a set, which cannot contain duplicates.
If you are defining a character, put it in quotes, unless it is a variable that has already been declared.
You can't loop with for word in dictionary.keys():, since here dictionary is not a dictionary.
If you would like to write the code without using any builtin function, try this:
letters=['a','b','c','a','a','b','b','b','b','c','a','w','w','p','r']
counts={}
for x in letters:
    if x in counts:
        counts[x]=counts[x]+1
    else:
        counts[x]=1
for k in counts.keys():
    print k, counts[k]
First, a dictionary in Python 2.7 is an unordered collection (i.e., it has no guaranteed order of its keys).
Second, each dict key must be unique.
Though you could count the frequency of characters using a dict, there's a better solution. The Counter class in Python's collections module is based on a dict and is specifically designed for a task like tallying frequency.
from collections import Counter
letters = ['a', 'b', 'c', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'a', 'w', 'w', 'p', 'r']
cnt = Counter(letters)
print cnt
The contents of the counter are now:
Counter({'b': 5, 'a': 4, 'c': 2, 'w': 2, 'p': 1, 'r': 1})
You can print these conveniently:
for char, freq in sorted(cnt.items()):
    print char, freq
which gives:
a 4
b 5
c 2
p 1
r 1
w 2
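Since the question title also mentions printing by frequency, here is a small sketch of both orderings. It is written in Python 3 syntax (an assumption; the question uses Python 2.7, where the print lines would need adjusting), and the names by_letter and by_freq are made up for this illustration.

```python
from collections import Counter

letters = ['a', 'b', 'c', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'a', 'w', 'w', 'p', 'r']
cnt = Counter(letters)

# alphabetical order: sort the (letter, count) pairs by letter
by_letter = sorted(cnt.items())

# frequency order: highest count first, ties broken alphabetically
by_freq = sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0]))

for char, freq in by_freq:
    print(char, freq)
```

With the list above, the frequency ordering is b 5, a 4, c 2, w 2, p 1, r 1.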

check if a number already exist in a list in python

I am writing a python program where I will be appending numbers into a list, but I don't want the numbers in the list to repeat. So how do I check if a number is already in the list before I do list.append()?
You could do
if item not in mylist:
    mylist.append(item)
But you should really use a set, like this:
myset = set()
myset.add(item)
EDIT: If order is important but your list is very big, you should probably use both a list and a set, like so:
mylist = []
myset = set()
for item in ...:
    if item not in myset:
        mylist.append(item)
        myset.add(item)
This way, you get fast lookup for element existence, but you keep your ordering. If you use the naive solution, every lookup is O(n), and that can be bad if your list is big.
Or, as @larsman pointed out, you can use OrderedDict to the same effect:
from collections import OrderedDict
mydict = OrderedDict()
for item in ...:
    mydict[item] = True
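On Python 3.7 and later, a plain dict also preserves insertion order, so the same effect can be had without OrderedDict. A minimal sketch (the sample numbers are made up for this illustration):

```python
items = [3, 1, 3, 2, 1]  # sample input with repeats

# dict keys are unique and keep insertion order (Python 3.7+)
unique = list(dict.fromkeys(items))

print(unique)  # [3, 1, 2]
```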
If you want to have unique elements in your list, then why not use a set, provided of course that order does not matter to you:
>>> s = set()
>>> s.add(2)
>>> s.add(4)
>>> s.add(5)
>>> s.add(2)
>>> s
set([2, 4, 5])
If order is a concern, then you can use:
>>> def addUnique(l, num):
...     if num not in l:
...         l.append(num)
...
...     return l
You can also find an OrderedSet recipe, which is referred to in the Python documentation.
If you want your numbers in ascending order you can add them into a set and then sort the set into an ascending list.
s = set()
if number1 not in s:
    s.add(number1)
if number2 not in s:
    s.add(number2)
...
s = sorted(s) #Now a list in ascending order
You could probably use a set object instead. Just add numbers to the set; a set never holds duplicates.
To check if a number is in a list one can use the in keyword.
Let's create a list
>>> exampleList = [1, 2, 3, 4, 5]
Now let's see if it contains the number 4:
>>> contains = 4 in exampleList
>>> print(contains)
True
As you want to append when an element is not in a list, not in can also help:
>>> exampleList2 = ["a", "b", "c", "d", "e"]
>>> notcontain = "e" not in exampleList2
>>> print(notcontain)
False
But, as others have mentioned, you may want to consider using a different data structure, more specifically, set. See examples below (Source):
>>> basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
>>> print(basket) # show that duplicates have been removed
{'orange', 'banana', 'pear', 'apple'}
>>> 'orange' in basket # fast membership testing
True
>>> 'crabgrass' in basket
False
>>> # Demonstrate set operations on unique letters from two words
...
>>> a = set('abracadabra')
>>> b = set('alacazam')
>>> a # unique letters in a
{'a', 'r', 'b', 'c', 'd'}
>>> a - b # letters in a but not in b
{'r', 'd', 'b'}
>>> a | b # letters in a or b or both
{'a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'}
>>> a & b # letters in both a and b
{'a', 'c'}
>>> a ^ b # letters in a or b but not both
{'r', 'd', 'b', 'm', 'z', 'l'}
