I am new to Python. This is what I am trying to achieve:
letter = 4
word = "Demo Deer Deep Deck Cere Reep Creep Creeps"
split_word = word.split()
I am trying to find the groups of words that can be formed from any 4 common letters, for example:
Deer Deep Reep [these can be formed by 4 letters d, e, r & p]
Creep Cere Reep [these can be formed by 4 letters c, r, e, p]
Is there an easy way to do this in Python without using regex?
You'll need to do this in a few steps, I think.
First, you need to figure out what your set of letters is. You could use the whole alphabet, but I'd recommend against that if you can avoid it. I'd try using a set:
letter_pool = set([ltr.lower() for ltr in word if ltr != " "])
Next, you need to iterate through all combinations of four letters in your pool and check which words can be formed with them. This is why it's probably better not to use the whole alphabet; this is a lot of combinations. In the example below, I'm storing the results in a dictionary keyed by the combination of letters, but you could modify that to suit your needs.
import itertools
results = {}
for combination in itertools.combinations(letter_pool, letter):  # in this case, letter=4
    results[combination] = []
    for wrd in split_word:
        for character in wrd:
            if character.lower() not in combination:
                break
        else:
            results[combination].append(wrd)
    if len(results[combination]) == 0:
        del results[combination]
Note the for-else syntax; it means if the loop does not break, the code in the else clause executes. Basically, for a given combination of letters, this code checks each word and determines if it is composed of just those letters. If it is, it stores that information. If a given combination doesn't form any words, its entry in the dictionary is removed (to save on memory). Note that this is a pretty naive solution and will not scale well.
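To make the for-else behaviour concrete, here is a minimal sketch of the same check pulled out into a helper. The name only_uses is hypothetical and is not part of the code above.

```python
def only_uses(word, allowed):
    """Return True if every character of word is in allowed."""
    for character in word.lower():
        if character not in allowed:
            break          # a disallowed letter was found; the else clause is skipped
    else:
        return True        # the loop finished without hitting break
    return False


print(only_uses("Deer", {"d", "e", "r", "p"}))   # True
print(only_uses("Deck", {"d", "e", "r", "p"}))   # False
```

The else branch belongs to the for loop, not to the if, which is why it only runs when no character triggered the break.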
If you want to print out the results, you can do:
for key in results:
    print(", ".join(results[key]) + " [formed by " + str(key) + "]")
You can achieve this using sets and itertools.combinations:
word = "Demo Deer Deep Deck Cere Reep Creep Creeps"
split_word = word.split()
from itertools import combinations
letters = {s for it in split_word for s in it.lower()}
out = dict()
for n in range(len(letters)):
    out[n] = {''.join(letters_subset): [word
                                        for word in split_word
                                        if set(word.lower()).issubset(letters_subset)]
              for letters_subset in combinations(letters, n)}
    out[n] = {k: v for k, v in out[n].items() if len(v) > 0}
# Print output
for n, d in out.items():
    for k, v in d.items():
        print('{}:\t{}\t{}'.format(n, k, v))
For example, if I have 'QOFTHEA' on my rack, I would like to generate every possible arrangement of its letters, of length 2 to 7, and compare them against a separate word list. How can I do this in Python?
You can use itertools.permutations and itertools.chain.from_iterable:
from itertools import chain, permutations
rack = 'QOFTHEA'
lo, hi = 2, 7
for perm in chain.from_iterable(permutations(rack, i) for i in range(lo, hi + 1)):
    print(perm)
If you wanted a string instead of a tuple of characters, you can do ''.join(perm).
Instead of checking all the permutations against a long list (13,692 arrangements for 7 distinct letters is fine, but almost 10 million for 10 letters gets slow), I would suggest checking the list against the letters: the cost now depends on the length of the list, which is typically far shorter than the set of all permutations. This also lets you avoid loading the whole dictionary into memory, as it can be quite large:
from collections import Counter
rack = Counter('QOFTHEA')
rack_size = sum(rack.values())  # total tiles; len(rack) would count distinct letters only
with open('words.txt') as f:
    for word in (line.rstrip('\n') for line in f):
        if len(word) > rack_size:
            continue
        word_counter = Counter(word)
        for l, c in word_counter.items():
            if l not in rack:
                break
            else:
                if c > rack[l]:
                    break
        else:
            print(word)
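The per-letter loop above can also be expressed with Counter subtraction. This is only a sketch: the helper name can_build and the sample candidates are made up for illustration, and it assumes the word list uses the same letter case as the rack.

```python
from collections import Counter


def can_build(word, rack):
    """True if word can be spelled using each rack tile at most once."""
    # Counter subtraction drops zero and negative counts, so the result is
    # empty exactly when the rack covers every letter of the word.
    return not (Counter(word) - Counter(rack))


rack = 'QOFTHEA'
candidates = ['OAF', 'HEAT', 'FOOT', 'QUOTE']  # hypothetical sample words
print([w for w in candidates if can_build(w, rack)])  # ['OAF', 'HEAT']
```

'FOOT' is rejected because the rack holds a single O, and 'QUOTE' because there is no U.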
If for some reason you don't want to use the standard library:
def Counter(iterable):
    dic = {}
    for i in iterable:
        if i not in dic:
            dic[i] = 0  # start at 0 so the first occurrence counts as 1
        dic[i] += 1
    return dic
Good scrabbling !
Say I have a list of words. If a word's last letter is the same as another word's first letter, we can connect them. We don't connect a word with itself. The input elements are distinct.
Example: apple - elephant - tower - rank
I implemented it as this.
def transform(lst):
    graph = []
    for picked in lst:
        link = []
        i = lst.index(picked)
        rest = lst[:i]+lst[i+1:]
        for compare in rest:
            if picked[-1] == compare[0]:
                link.append(compare)
        if len(link) != 0:
            graph.append(link)
    return graph
I don't know if I can still improve it.
=======================================================================
I think I should change
    if len(link) != 0:
        graph.append(link)
to
    graph.append(link)
Otherwise, the order of the adjacency lists will be mixed up.
You should start by identifying the two things you're grouping by here: ending letters and starting letters. Drop all the words into two dicts, one keyed by each, and you'll end up with much faster lookups. list.index is a killer for efficiency, with each lookup costing O(n).
from collections import defaultdict
startswith, endswith = defaultdict(list), defaultdict(list)
wordlist = ['apple','elephant','tower','rank']
for word in wordlist:
    startswith[word[0]].append(word)
    endswith[word[-1]].append(word)
Then it should be a fairly simple graph traversal problem.
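As a rough sketch of what that traversal could look like (Python 3): build_graph and longest_chain are hypothetical names, and the depth-first search is exhaustive, so it is only meant for small inputs.

```python
from collections import defaultdict


def build_graph(words):
    """Map each word to the words that can follow it."""
    startswith = defaultdict(list)
    for word in words:
        startswith[word[0]].append(word)
    # A word links to every other word that starts with its last letter.
    return {w: [n for n in startswith[w[-1]] if n != w] for w in words}


def longest_chain(graph):
    """Depth-first search for the longest chain that uses each word once."""
    best = []

    def visit(path, seen):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                path.append(nxt)
                visit(path, seen)
                path.pop()
                seen.remove(nxt)

    for start in graph:
        visit([start], {start})
    return best


wordlist = ['apple', 'elephant', 'tower', 'rank']
graph = build_graph(wordlist)
print(graph)
print(longest_chain(graph))  # ['apple', 'elephant', 'tower', 'rank']
```

Building the adjacency lists takes one pass over the words plus one dict lookup per word, instead of a list.index call and a full scan for each one.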
funny!
I hope I understand what you're talking about.
def transform(lst):
    graph = []
    for picked in lst:
        if len(graph) == 0:
            graph.append(picked)
        else:
            ch = graph[-1][-1]  # last letter of the last word in the chain
            world_ch = [i for i in lst if (i not in graph) and (i[0] == ch)]
            if len(world_ch) == 0:
                break
            else:
                graph.append(world_ch[0])
    return graph
lista=['apple','carloh','horse','apple','elephant','tower','rank']
print(str(transform(lista)))
['apple', 'elephant', 'tower', 'rank']
print(str([[i] for i in transform(lista)]))
[['apple'], ['elephant'], ['tower'], ['rank']]
From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one occurrence of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating from each letter).
I have no clue how to do that in Python 2.7. I tried regular expressions, but they did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. It may not be as fast or as neat as chrisz's answer, but it may be a little simpler for beginners to read and understand.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
    letters=['A','G','T','C']
    j=i
    seq=[]
    while len(letters)>0 and j<(len(DNA)):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:
            # this letter was already seen in the current window
            pass
        j+=1
    if len(letters)==0:
        toSave.append(''.join(seq))  # store the window as a string
print(toSave)
Since the substring you are looking for may be of almost any length, a FIFO queue (a sliding window) seems to work. Append one letter at a time and check whether there is at least one of each letter. If so, yield the substring, then remove letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield("".join(cur_str))
            cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
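For long sequences, the repeated cur_str.count calls and pop(0) can become costly. Below is a sketch of the same sliding-window idea that keeps running counts instead. The function name find_agtc_windows is hypothetical, and it assumes the input contains only the characters A, C, G and T.

```python
from collections import Counter, deque


def find_agtc_windows(seq, chars='ACGT'):
    """Yield each window that contains all chars, shrinking from the left."""
    window = deque()
    counts = Counter()
    for ch in seq:
        window.append(ch)
        counts[ch] += 1
        # Shrink from the left while the window still holds every letter.
        while all(counts[c] > 0 for c in chars):
            yield ''.join(window)
            counts[window.popleft()] -= 1


print(list(find_agtc_windows('AAGTCCTAG')))
# ['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
```

deque.popleft runs in constant time, and the counts are updated incrementally, so each character is added and removed at most once.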
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters that must all appear in a substring (d = ACGT).
The while loop examines successively longer prefixes of s for as long as c is no greater than the length of s.
It does so by increasing c by 1 on each iteration of the while loop.
If every character of d (ACGT) exists in the prefix, we print it, reset c to its initial value and drop the first character of s.
The loop ends once s is too short to supply a prefix of length c.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (character, number of consecutive repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']
I'm trying to write an algorithm that, given a bunch of letters, returns all the words that can be constructed from them; for instance, given 'car' it should return a list containing [arc, car, a, etc...], and then pick the best Scrabble word from that list. The problem is building the list of all the words.
I've got a giant line-delimited txt file dictionary, and I've tried this so far:
from collections import Counter
def find_optimal(bunch_of_letters: str):
    words_to_check = []
    c1 = Counter(bunch_of_letters.lower())
    for word in load_words():
        c2 = Counter(word.lower())
        if c2 & c1 == c2:
            words_to_check.append(word)
    max_word = max_word_value(words_to_check)
    return max_word,calc_word_value(max_word)
max_word_value - returns the word with the maximum value in the given list.
calc_word_value - returns the word's Scrabble score.
load_words - returns the dictionary as a list.
I'm currently using Counters to do the trick, but each search takes about 2.5 seconds and I don't know how to optimize this. Any thoughts?
Try this:
def find_optimal(bunch_of_letters):
    bunch_of_letters = ''.join(sorted(bunch_of_letters))
    words_to_check = [word for word in load_words() if ''.join(sorted(word)) in bunch_of_letters]
    max_word = max_word_value(words_to_check)
    return max_word, calc_word_value(max_word)
I've just used a list comprehension. words_to_check will be the list of words from your text file whose sorted letters appear as a contiguous run in the sorted bunch_of_letters. Note that this substring test can miss valid candidates: for 'car' the sorted letters are 'acr', which does not contain 'ar', so that candidate would be rejected even though both letters are available.
On a side note, if you don't want to use a gigantic text file for the words, check out enchant!
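One way to avoid scanning the whole dictionary on every search is to index it once by sorted letters and then look up every subset of the rack. This is a sketch under assumptions: build_index, playable_words and sample_words are hypothetical names, and sample_words stands in for whatever load_words() returns.

```python
from collections import defaultdict
from itertools import combinations


def build_index(words):
    """Group words by their sorted letters (an anagram signature)."""
    index = defaultdict(list)
    for word in words:
        index[''.join(sorted(word.lower()))].append(word)
    return index


def playable_words(rack, index):
    """Every indexed word that can be built from the rack letters."""
    letters = sorted(rack.lower())
    found = set()
    for size in range(1, len(letters) + 1):
        # Combinations of a sorted sequence are themselves sorted,
        # so each one can be used directly as a signature.
        for combo in set(combinations(letters, size)):
            found.update(index.get(''.join(combo), []))
    return found


sample_words = ['arc', 'car', 'a', 'cart', 'race']  # stand-in for load_words()
index = build_index(sample_words)
print(sorted(playable_words('car', index)))  # ['a', 'arc', 'car']
```

The index is built once; after that a 7-tile rack needs at most 127 dictionary lookups per search, regardless of how large the word list is.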
from itertools import permutations
theword = 'car' # or we can use input('Type in a word: ')
mylist = [permutations(theword, i) for i in range(1, len(theword)+1)]
for generator in mylist:
    for word in generator:
        print(''.join(word))
        # instead of .join just print (word) for tuple
Output:
c
a
r
ca
cr
...
ar rc ra car cra acr arc rca rac
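This will give us all the possible arrangements (i.e. permutations) of the letters of a word.
If you want to check whether a generated word is an actual word in the English dictionary, we can use the enchant library:
import enchant
d = enchant.Dict("en_US")
# regenerate the permutations: the generators in mylist are already exhausted
for i in range(1, len(theword) + 1):
    for perm in permutations(theword, i):
        candidate = ''.join(perm)
        print(d.check(candidate), candidate)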
Conclusion:
If we want to generate all the 4-letter permutations of the word, we use this code:
from itertools import permutations
word = 'word' # or we can use input('Type in a word: ')
solution = permutations(word, 4)
for i in solution:
    print(''.join(i)) # just print(i) if you want a tuple
On the back of a block calendar I found the following riddle:
How many common English words of 4 letters or more can you make from the letters
of the word 'textbook' (each letter can only be used once).
My first solution that I came up with was:
from itertools import permutations
with open('/usr/share/dict/words') as f:
    words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
found_words = []
ps = (permutations(given_word, i) for i in range(4, len(given_word)+1))
for p in ps:
    for word in map(''.join, p):
        if word in words and word != given_word:
            found_words.append(word)
print set(found_words)
This gives the result set(['tote', 'oboe', 'text', 'boot', 'took', 'toot', 'book', 'toke', 'betook']) but took more than 7 minutes on my machine.
My next iteration was:
with open('/usr/share/dict/words') as f:
    words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
print [word for word in words if len(word) >= 4 and sorted(filter(lambda letter: letter in word, given_word)) == sorted(word) and word != given_word]
This returns an answer almost immediately, but the answer it gives is: ['book', 'oboe', 'text', 'toot']
What is the fastest, correct and most pythonic solution to this problem?
(edit: added my earlier permutations solution and its different output).
I thought I'd share this slightly interesting trick, although it takes a good deal more code than the other solutions and isn't really "pythonic". It should still be rather quick compared with the timings the others report.
We do a bit of preprocessing to speed up the computation. The basic approach is the following: we assign every letter in the alphabet a prime number, e.g. A = 2, B = 3, and so on. We then compute a hash for every word in the dictionary, which is simply the product of the primes for every character in the word. We then store every word in a dictionary indexed by its hash.
Now if we want to find out which words are equivalent to, say, textbook, we only have to compute the same hash for the word and look it up in our dictionary. Usually (say in C++) we'd have to worry about overflows, but in Python it's even simpler than that: integers have arbitrary precision, so every word in the list with the same index will contain exactly the same characters.
Here's the code, with the slight optimization that in our case we only have to worry about characters also appearing in the given word, which means we can get by with a much smaller prime table than otherwise. (The obvious optimization would be to assign a value only to characters that appear in the word; it was fast enough anyhow so I didn't bother, and this way we can preprocess once and reuse the table for several words.) A prime generator is useful often enough that you should have one yourself anyhow ;)
from collections import defaultdict
from itertools import permutations
PRIMES = list(gen_primes(256)) # some arbitrary prime generator
def get_dict(path):
    res = defaultdict(list)
    with open(path, "r") as file:
        for line in file.readlines():
            word = line.strip().upper()
            hash = compute_hash(word)
            res[hash].append(word)
    return res
def compute_hash(word):
    hash = 1
    for char in word:
        try:
            hash *= PRIMES[ord(char) - ord(' ')]
        except IndexError:
            # contains some character out of range - always 0 for our purposes
            return 0
    return hash
def get_result(path, given_word):
    words = get_dict(path)
    given_word = given_word.upper()
    result = set()
    powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
    for word in (word for word in powerset(given_word) if len(word) >= 4):
        hash = compute_hash(word)
        for equiv in words[hash]:
            result.add(equiv)
    return result
if __name__ == '__main__':
    path = "dict.txt"
    given_word = "textbook"
    result = get_result(path, given_word)
    print(result)
Runs on my Ubuntu word list (98k words) rather quickly, but it's not what I'd call pythonic since it's basically a port of a C++ algorithm. Useful if you want to compare more than one word that way.
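Since gen_primes is not shown above, here is a self-contained sketch of the same prime-product idea, restricted to the letters a-z. The names PRIME_OF, prime_hash and find_words and the sample word list are assumptions for illustration, not the code from this answer.

```python
from collections import defaultdict
from itertools import combinations
import string

# First 26 primes, one per letter (the letter-to-prime pairing is arbitrary).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_OF = dict(zip(string.ascii_lowercase, PRIMES))


def prime_hash(word):
    """Product of letter primes; None if the word has a non a-z character."""
    product = 1
    for char in word.lower():
        if char not in PRIME_OF:
            return None
        product *= PRIME_OF[char]
    return product


def find_words(given_word, dictionary_words, min_len=4):
    """Dictionary words buildable from a subset of given_word's letters."""
    table = defaultdict(set)
    for word in dictionary_words:
        key = prime_hash(word)
        if key is not None:
            table[key].add(word)
    result = set()
    for size in range(min_len, len(given_word) + 1):
        for combo in combinations(given_word, size):
            result |= table.get(prime_hash(''.join(combo)), set())
    return result


sample = ['book', 'boot', 'text', 'oboe', 'betook', 'books', 'tab']  # hypothetical
print(sorted(find_words('textbook', sample)))
# ['betook', 'book', 'boot', 'oboe', 'text']
```

Because prime factorizations are unique, two words share a hash exactly when they use the same letters the same number of times, so no collision handling is needed.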
How about this?
from itertools import permutations, chain
with open('/usr/share/dict/words') as fp:
    words = set(fp.read().split())
given_word = 'textbook'
perms = (permutations(given_word, i) for i in range(4, len(given_word)+1))
pwords = (''.join(p) for p in chain(*perms))
matches = words.intersection(pwords)
print matches
which gives
>>> print matches
set(['textbook', 'keto', 'obex', 'tote', 'oboe', 'text', 'boot', 'toto', 'took', 'koto', 'bott', 'tobe', 'boke', 'toot', 'book', 'bote', 'otto', 'toke', 'toko', 'oket'])
There is a generator itertools.permutations with which you can gather all permutations of a sequence with a specified length. That makes it easier:
from itertools import permutations
GIVEN_WORD = 'textbook'
with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]
print len(filter(lambda x: ''.join(x) in words, permutations(GIVEN_WORD, 4)))
Edit #1: Oh! It says "4 or more" ;) Forget what I said!
Edit #2: This is the second version I came up with:
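GIVEN_WORD = 'textbook'
LETTERS = set(GIVEN_WORD)
with open('/usr/share/dict/words') as f:
    WORDS = filter(lambda x: len(x) >= 4, [l.strip() for l in f])
# each letter may be used at most as often as it occurs in GIVEN_WORD ('t' and 'o' occur twice)
matching = filter(lambda x: set(x).issubset(LETTERS) and all([x.count(c) <= GIVEN_WORD.count(c) for c in x]), WORDS)
print len(matching)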
Create the whole power set, then check whether the dictionary word is in the set (order of the letters doesn't matter):
powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
pw = map(lambda x: sorted(x), powerset(given_word))
filter(lambda x: sorted(x) in pw, words)
The following just checks each word in the dictionary to see if it is of the appropriate length, and then whether it can be built from the letters of 'textbook'. I borrowed the permutation check from "Checking if two strings are permutations of each other in Python" but changed it slightly.
given_word = 'textbook'
with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]
matches = []
for word in words:
    if word != given_word and 4 <= len(word) <= len(given_word):
        if all(word.count(char) <= given_word.count(char) for char in word):
            matches.append(word)
print sorted(matches)
This finishes almost immediately and gives the correct result.
Permutations get very big for longer words. Try counterrevolutionary, for example.
I would filter the dictionary for words of length 4 to len(word) (8 for textbook).
Then I would filter with a regular expression: "oboe".matches("[textbook]+").
I would then sort the remaining words and compare each with a sorted version of your word ("beoo", "bekoottx"), jumping to the next index of a matching character to find mismatching character counts:
("beoo", "bekoottx")
("eoo", "ekoottx")
("oo", "koottx")
("oo", "oottx")
("o", "ottx")
("", "ttx") => matched
("bbo", "bekoottx")
("bo", "ekoottx") => mismatch
Since I don't speak Python, I leave the implementation as an exercise for the audience.
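For what it's worth, the sorted comparison traced above could be sketched in Python roughly as follows. The function name fits is hypothetical, and the sketch covers only the sorted-comparison step, not the length or regex filters.

```python
def fits(word, given_word):
    """True if the letters of word are a sub-multiset of given_word."""
    need = sorted(word)
    have = sorted(given_word)
    i = 0  # index into have
    for letter in need:
        # Jump to the next position in have that could match letter.
        while i < len(have) and have[i] < letter:
            i += 1
        if i == len(have) or have[i] != letter:
            return False  # letter is missing or used too often
        i += 1            # consume the matched letter
    return True


print(fits('oboe', 'textbook'))  # True
print(fits('bob', 'textbook'))   # False
```

Each call walks both sorted strings once, so the cost per dictionary word is dominated by the sort.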