I have a giant corpus (a list of words) and a particular word w. I know the index of every occurrence of w in the corpus. I want to look at a window of size n around every occurrence of w and build a dictionary of the other words that occur within that window. The dictionary maps int to list[str]: the key is the position relative to the target word, negative to the left and positive to the right, and the value is the list of words found at that offset.
For example, with the corpus ["I", "made", "burgers", "Jack", "made", "sushi"], the word "made", and a window of size 1, I ultimately want to return {-1: ["I", "Jack"], 1: ["burgers", "sushi"]}.
Two problems can occur: the window may go out of bounds (a window of size 2 in the example above), and I may encounter the target word itself inside the window; both cases should be ignored. I have written the following code, which seems to work, but I want to make it cleaner.
def find_neighbor(word: str, corpus: list[str], n: int = 1) -> dict[int, list[str]]:
    mapping = {k: [] for k in range(-n, n + 1) if k != 0}
    idxs = [k for k, v in enumerate(corpus) if v == word]
    for idx in idxs:
        for i in [x for x in range(-n, n + 1) if x != 0]:
            if idx + i < 0:  # negative indices wrap around instead of raising IndexError
                continue
            try:
                item = corpus[idx + i]
                if item != word:
                    mapping[i].append(item)
            except IndexError:
                continue
    return mapping
Is there a way to incorporate options and pattern matching so that I can remove the try block and have something like this?

match corpus[idx + i]:
    case None: continue  # if it doesn't exist (out of bounds), continue (I could also break)
    case word: continue  # if it is the word itself, continue
    case _: mapping[i].append(item)  # otherwise, add it to the dictionary
Introduce a helper function that returns corpus[i] if i is a legal index and None otherwise (note the lower bound in the check: a bare corpus[i] silently accepts negative indices):

corpus = ["foo", "bar", "baz"]

def get(i):
    return corpus[i] if 0 <= i < len(corpus) else None

print([get(0), get(1), get(2), get(3)])

The result of the above is:
['foo', 'bar', 'baz', None]
Now you can write:

match get(idx + i):
    case None: ...            # out of bounds
    case w if w == word: ...  # a bare `case word:` would be a capture pattern
                              # that matches anything, so use a guard
    case w: ...               # any other word
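Putting the two together, a minimal sketch of the full rewrite (this needs Python 3.10+ for match; the inner get helper comes from the answer above, while the offsets name and the overall assembly are my own illustration):

def find_neighbor(word: str, corpus: list[str], n: int = 1) -> dict[int, list[str]]:
    def get(i):
        # None for out-of-range and negative indices (which would otherwise wrap)
        return corpus[i] if 0 <= i < len(corpus) else None

    offsets = [k for k in range(-n, n + 1) if k != 0]
    mapping = {k: [] for k in offsets}
    for idx in (k for k, v in enumerate(corpus) if v == word):
        for i in offsets:
            match get(idx + i):
                case None:            # out of bounds
                    continue
                case w if w == word:  # the target word itself
                    continue
                case w:
                    mapping[i].append(w)
    return mapping

print(find_neighbor("made", ["I", "made", "burgers", "Jack", "made", "sushi"]))
# {-1: ['I', 'Jack'], 1: ['burgers', 'sushi']}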
I have two lists at hand, each consisting of tuples of the form (lemma, PoS, token). List A is generated by a lemmatizer from input text; List B is from a pre-existing database.
The objective is to check whether each token from List A is found in List B, then:
if found, return the corresponding lemmas from both lists and check whether they are the same; write YES if so, NO if not;
if not found, return the corresponding lemma from List A.
Ideally, the output should be something like:

Token from List A | Lemma from List A | Token found in B? | Lemma from List B | Same lemma?
sé                | ég                | YES               | ég                | YES
skór              | skó               | YES               | skór              | NO
Keplerur          | Kepler            | NO                | n/a               | n/a
So, I tried doing that with for loops, because we haven't really learned much else yet:
for l, pos, t in list_a:
    for seq in list_b:
        if t == seq[2]:
            if pos == seq[1]:
                if l == seq[0]:
                    comparison.append((t, l, "YES", seq[0], "YES"))
                else:
                    comparison.append((t, l, "YES", seq[0], "NO"))
            else:
                comparison.append((t, l, "NO", "n/a", "n/a"))
        else:
            comparison.append((t, l, "NO", "n/a", "n/a"))
You see, whilst List A is quite short (~120 tuples from the text I am testing on), List B is pre-existing and has over 6M elements. Looping through all of it for each item of List A is not going to be efficient, I guess. My laptop can't even finish executing this code, so I can't test it.
What could I do? I have a feeling a fundamentally different approach is needed here.
UPD: I have, after about an hour of trial and error, come up with this solution:
for l, pos, t in lemtok:
    r = next((a for a, b in enumerate(binlistfinal) if b[2] == t), None)
    if r is None:
        comparison.append((t, l, "NO", "n/a", "n/a"))
    else:
        if pos == binlistfinal[r][1]:
            if l == binlistfinal[r][0]:
                comparison.append((t, l, "YES", binlistfinal[r][0], "YES"))
            else:
                comparison.append((t, l, "YES", binlistfinal[r][0], "NO"))
        else:
            comparison.append((t, l, "NO", "n/a", "n/a"))
print(comparison)
where lemtok is List A and binlistfinal is List B.
So your current code is O(NM), where N is the size of List A and M is the size of List B.
Assuming List B is neither sorted nor guaranteed to contain unique tokens, you will have to iterate through all of its elements at least once, which makes the lower bound O(M).
If we can get lookups into List A down to O(1), the total runtime stays O(M). I will assume all the tokens in A are unique, since the list is quite short; if they are not, build a dictionary mapping each token to the list of values associated with it and adjust from there.
Sorting List B will not actually help, because that alone is O(M log M), which is worse.
Now your code might look like the following:
old_a = ...  # List A: [(lemma, pos, token), ...]
B = ...      # List B, same tuple shape

A = {token: (lemma, pos) for lemma, pos, token in old_a}
B_tokens_in_A = set()  # the B tokens found in A, which equal the A tokens found in B

for lemma, pos, token in B:
    if token in A:
        B_tokens_in_A.add(token)
        if A[token][0] == lemma:
            ...  # return both lemmas and write YES
        else:
            ...  # return both lemmas and write NO

# now we need to return the lemmas from List A whose tokens were not in List B
for token, val in A.items():
    if token not in B_tokens_in_A:
        ...  # return the lemma from A
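For concreteness, here is a runnable version of that sketch which emits rows in the comparison format used above. The compare function, the sample data, and the PoS tags are my own illustration, not from the question:

def compare(list_a, list_b):
    # Collect the tokens of the small List A: O(N) memory, O(1) average lookups.
    a_tokens = {token for lemma, pos, token in list_a}
    found = {}  # token -> lemma from B, for tokens that also occur in A
    for b_lemma, b_pos, b_token in list_b:  # single O(M) pass over the big list
        if b_token in a_tokens and b_token not in found:
            found[b_token] = b_lemma
    comparison = []
    for lemma, pos, token in list_a:
        if token in found:
            b_lemma = found[token]
            comparison.append((token, lemma, "YES", b_lemma,
                               "YES" if lemma == b_lemma else "NO"))
        else:
            comparison.append((token, lemma, "NO", "n/a", "n/a"))
    return comparison

list_a = [("ég", "fn", "sé"), ("skó", "no", "skór"), ("Kepler", "no", "Keplerur")]
list_b = [("ég", "fn", "sé"), ("skór", "no", "skór")]
print(compare(list_a, list_b))
# [('sé', 'ég', 'YES', 'ég', 'YES'),
#  ('skór', 'skó', 'YES', 'skór', 'NO'),
#  ('Keplerur', 'Kepler', 'NO', 'n/a', 'n/a')]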
I've been trying to create a function that gives a hint to a user playing hangman.
The idea is that I have a list of 5000+ words and need to filter it by several conditions: a word must be the same length as the pattern and match it (if the pattern is a___le, only words fitting that shape qualify), and if the user has already guessed some wrong letters, words containing those letters must not be considered.
I'm aware I didn't do this in the most Pythonic or elegant way, but can someone tell me what is going wrong here? I always get either an empty list, or a list of words of the right length where the other conditions are constantly ignored.
def filter_words_list(words, pattern, wrong_guess_lst):
    """
    :param words: the words received from the main function
    :param pattern: the pattern of the word being searched for, such as p__pl_
    :param wrong_guess_lst: the set of wrong letters the user has already guessed
    :return: the words that match the conditions
    """
    wrong_guess_lst = list(wrong_guess_lst)  # received as a set, so convert it to a list
    words_suggestions = []  # the list of suggested words
    for i in range(0, len(words)):  # first, match the length of the pattern and the word
        if len(words[i]) == len(pattern):
            for j in range(0, len(pattern)):
                if pattern[j] != '_':
                    if pattern[j] == words[i][j]:  # check that the revealed letters match
                        for t in range(0, len(wrong_guess_lst)):
                            if wrong_guess_lst[t] != words[i][j]:  # the same, but against the wrong guesses
                                words_suggestions.append(words[i])
    return words_suggestions
I think this is what you are looking for (explanation in code comments):
def get_suggestions(words: list, pattern: str, exclude: list) -> list:
    """Finds all words matching the pattern."""
    # get the length of the pattern for filtering
    length = len(pattern)
    # create a filtered generator so that no extra memory is taken up;
    # it only yields the items from the word list that match the
    # conditions, i.e. the same length as the pattern and not in the excluded words
    filter_words = (word for word in words
                    if len(word) == length and word not in exclude)
    # create a mapping from each revealed letter's index in the pattern to the letter
    mapping = {i: letter for i, letter in enumerate(pattern) if letter != '_'}
    # return the words from the filtered generator for which every revealed
    # letter is in the same place and has the same value as in the mapping
    return [word for word in filter_words
            if all(word[i] == v for i, v in mapping.items())]
word_list = [
    'peach', 'peace', 'great', 'good', 'food',
    'reach', 'race', 'face', 'competent', 'completed'
]
exclude_list = ['good']
word_pattern = 'pe___'
suggestions = get_suggestions(word_list, word_pattern, exclude_list)
print(suggestions)
# output:
# ['peach', 'peace']

# a bit of testing
# the order of items in each answer list is important:
# it should be the same as in word_list
patterns_and_answers = {
    '_oo_': ['food'],  # 'good' is in the excluded words
    '_omp_____': ['competent', 'completed'],
    '__ce': ['race', 'face'],
    'gr_a_': ['great'],
    '_a_e': ['race', 'face'],
    '_ea__': ['peach', 'peace', 'reach']
}
for p, correct in patterns_and_answers.items():
    assert get_suggestions(word_list, p, exclude_list) == correct
print('all test cases successful')
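One thing to note: exclude here is a list of whole words, while the question's wrong_guess_lst holds single letters. To exclude any word containing a wrongly guessed letter instead, the length filter can be extended with a set intersection; a minimal sketch (the get_suggestions_by_letters name and the sample wrong_letters are mine, for illustration):

def get_suggestions_by_letters(words: list, pattern: str, wrong_letters: set) -> list:
    length = len(pattern)
    # exclude any word that shares a letter with the wrong guesses
    filter_words = (word for word in words
                    if len(word) == length and not wrong_letters & set(word))
    mapping = {i: letter for i, letter in enumerate(pattern) if letter != '_'}
    return [word for word in filter_words
            if all(word[i] == v for i, v in mapping.items())]

print(get_suggestions_by_letters(['peach', 'peace', 'reach'], '_ea__', {'r'}))
# ['peach', 'peace']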
Write a function called word_freq(text) which takes one string argument. This string will not have any punctuation. Perform a count of the number of n-character words in this string and return a list of tuples of the form [(n, count), (n-1, count), ...] in descending order of word length. For example:
Example: word_freq('a aaa a aaaa')
Result: [(4, 1), (3, 1), (1, 2)]
Note: this does not show anything for the 2-character words.
I tried this:

text = 'a aaa a aaaa'

def word_freq(str):
    tuple = ()
    count = {}
    for x in str:
        if x in count.keys():
            count[x] += 1
        else:
            count[x] = 1
    print(count)

def count_letters(word):
    char = "a"
    count = 0
    for c in word:
        if char == c:
            count += 1
    return count

word_freq(text)
The code below does what you want; here is how it works. Before anything else, we make a dictionary called wc which will hold the count of each n-character word in the sentence. First the code receives a string from the user. Using split(), it turns the string into a list of words. For each word it checks the length: if it is 2, the word is ignored; otherwise it adds 1 to the count for that word length in the dictionary.
After every word is checked, wc.items() turns the dictionary into a list of tuples. Each tuple has two elements: the first is a word length, the second is the number of times words of that length occurred in the sentence. With that out of the way, all that is left is to sort this list by character count in reverse (from high to low). We do that with sorted(), using key=lambda x: x[0], i.e. the first element of each tuple, which is the character count. Finally, we return this list of tuples, and you can print it.
If anything is unclear, let me know. You can also put print() statements after every line to better understand what is happening.
Here's the code, I hope it helps:
inp = input("Enter your text: ")

def word_count(inp_str):
    wc = {}
    for item in inp_str.strip().split():
        if len(item) == 2:
            continue
        wc[len(item)] = wc.get(len(item), 0) + 1
    return sorted(wc.items(), key=lambda x: x[0], reverse=True)

print(word_count(inp))
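As a side note, the same counting can be done with collections.Counter from the standard library; a minimal equivalent sketch (not part of the answer above):

from collections import Counter

def word_freq(text):
    # count word lengths, skipping 2-character words
    counts = Counter(len(w) for w in text.split() if len(w) != 2)
    # sort by word length, descending
    return sorted(counts.items(), key=lambda kv: kv[0], reverse=True)

print(word_freq('a aaa a aaaa'))  # [(4, 1), (3, 1), (1, 2)]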
From any *.fasta DNA sequence (only 'ACTG' characters) I must find all substrings that contain at least one occurrence of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each letter).
I have no clue how to do that in Python 2.7. I was trying with regular expressions, but that did not find every variant.
How can I achieve that?
You could generate all substrings of length 4+, then keep only those that contain one of each letter and are minimal on the right (dropping the last character would lose a letter):

s = 'AAGTCCTAG'

def get_shortest(s):
    l, b = len(s), set('ATCG')
    # all substrings of length at least 4
    options = [s[i:j+1] for i in range(l) for j in range(i, l) if (j+1) - i > 3]
    # keep substrings that contain all four letters and whose last character is needed
    return [i for i in options if len(set(i) & b) == 4 and set(i) != set(i[:-1])]

print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz's answer, but maybe a little simpler for beginners to read and understand.
DNA = 'AAGTCCTAG'
toSave = []
for i in range(len(DNA)):
    letters = ['A', 'G', 'T', 'C']  # letters still missing from the current window
    j = i
    seq = []
    while len(letters) > 0 and j < len(DNA):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:  # this letter was already seen
            pass
        j += 1
    if len(letters) == 0:
        toSave.append(''.join(seq))  # join so we save a string, not a list of chars
print(toSave)
Since the substring you are looking for may be of almost any length, a sliding window (a FIFO queue) works: append one letter at a time and check whether the window contains at least one of each letter. If it does, yield it, then remove letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count, chars)):
            yield "".join(cur_str)
            cur_str.pop(0)

seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
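A small efficiency note on this approach: list.pop(0) shifts the whole list and cur_str.count rescans the window on every check, so for long sequences a deque plus a running Counter may be preferable. A sketch of that variant (my own, not from the answer above):

from collections import Counter, deque

def find_agtc_seq(seq_in, chars='AGTC'):
    window, counts = deque(), Counter()
    for ch in seq_in:
        window.append(ch)
        counts[ch] += 1
        # shrink from the left while every required letter is still present
        while all(counts[c] > 0 for c in chars):
            yield ''.join(window)
            counts[window.popleft()] -= 1

for substr in find_agtc_seq('AAGTCCTAG'):
    print(substr)  # same six substrings as above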
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)

while c <= len(s):
    x, c = s[:c], c + 1
    if all(l in x for l in d):
        print(x)
        s, c = s[1:], len(d)
It works as follows:
c is set to the length of the string of characters we require to exist in the substring (d = 'ACGT').
The while loop iterates over each prefix of s, as long as c is no larger than the length of s; this works by increasing c by 1 on each iteration.
If every character of d exists in the prefix, we print the result, reset c to its default value, and slice one character off the start of s.
The loop continues until s is shorter than d.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c, r = len(d), []

while c <= len(s):
    x, c = s[:c], c + 1
    if all(l in x for l in d):
        r.append(x)
        s, c = s[1:], len(d)

print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find items containing runs of repeated characters.
from itertools import groupby
import numpy as np

def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # create an itertools.groupby object over the characters
        groups = groupby(str(item))
        # build a list of tuples: (character, length of the run)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # extract just the run lengths
        char_lens = np.array([x[1] for x in result])
        # flag the item if any run is long enough
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    return flagged_items

#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2)  # returns ['aatcg', 'ctagg']
I solved the problem using this (horribly inefficient) method:

import itertools

def createList(word, wordList):
    # Use a set, because permutations() treats equal letters in different
    # positions as distinct and so returns duplicate strings.
    # Returns all permutations of word that appear in wordList.
    return set(''.join(item) for item in itertools.permutations(word) if ''.join(item) in wordList)

def main():
    a = open('C:\\Python32\\megalist.txt', 'r+')
    wordList = [line.strip() for line in a]
    maximum = 0
    maxwords = ""
    for words in wordList:
        permList = createList(words, wordList)
        length = len(permList)
        if length > maximum:
            maximum = length
            maxwords = permList
    print(maximum, maxwords)
It took something like 10 minutes to find the five-letter word that has the most dictionary-valid anagrams. I want to run this on words without the length constraint, but it would take a ludicrous amount of time. Is there any way to optimize this?
The following seems to work OK on a smallish dictionary. By sorting the letters in each word, it becomes easy to test whether two words are anagrams of each other. From that starting point, it's just a matter of accumulating the words in some way. It wouldn't be hard to modify this to report all matches rather than just the first one.
If you do need to add constraints on the number of letters, the use of iterators is a convenient way to filter out some words.
import collections

def wordIterator(dictionaryFilename):
    with open(dictionaryFilename, 'r') as f:
        for line in f:
            word = line.strip()
            yield word

def largestAnagram(words):
    d = collections.defaultdict(list)
    for word in words:
        sortedWord = str(sorted(word))
        d[hash(sortedWord)].append(word)
    maxKey = max(d.keys(), key=lambda k: len(d[k]))
    return d[maxKey]

words = wordIterator('C:\\Python32\\megalist.txt')
# words = (word for word in words if len(word) == 5)
print(largestAnagram(words))
Edit:
In response to the comment: hash(sortedWord) is a space-saving optimization, possibly premature in this case, that reduces sortedWord back to an integer. We don't really care what the key is, so long as we can always uniquely recover all the relevant anagrams; it would have been equally valid to use sortedWord itself as the key.
The key keyword argument to max lets you find the maximum element in a collection based on a predicate. So maxKey = max(d.keys(), key=lambda k: len(d[k])) is a succinct Python expression answering the query: given the keys in the dictionary, which key has the associated value of maximum length? That call to max could have been written (much more verbosely) as valueWithMaximumLength(d), with that function defined as:
def valueWithMaximumLength(dictionary):
    maxKey = None
    for k, v in dictionary.items():
        if maxKey is None or len(dictionary[maxKey]) < len(v):
            maxKey = k
    return maxKey
wordList should be a set.
Testing membership in a list requires scanning through all of its elements, checking each one against the word you have generated. Testing membership in a set does not (on average) depend on the size of the set.
The next obvious optimisation: once you have tested a word, you can remove all of its permutations from wordList, since they would produce exactly the same set in createList. This is a very easy operation if everything is done with sets; indeed, you simply use the binary minus (set difference) operator, as in the sketch below.
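A minimal sketch combining both suggestions (set membership plus removing handled permutations); the file path is the one from the question, and the loop structure is my own illustration rather than the answerer's code:

import itertools

def createList(word, wordSet):
    # wordSet is a set, so each membership test is O(1) on average
    return {''.join(p) for p in itertools.permutations(word)} & wordSet

def main():
    with open('C:\\Python32\\megalist.txt') as f:
        remaining = {line.strip() for line in f}
    maximum, maxwords = 0, set()
    while remaining:
        word = next(iter(remaining))
        permList = createList(word, remaining)
        remaining -= permList  # binary minus: drop every permutation just handled
        if len(permList) > maximum:
            maximum, maxwords = len(permList), permList
    print(maximum, maxwords)

main()

Note that createList still generates every permutation of each word, so this removes redundant work and speeds up lookups, but it does not cure the factorial blow-up for long words.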
There is no need to generate ALL permutations; that's a waste of memory and CPU.
So, first of all, your dictionary should be kept in a binary tree, like this:
e.g. Dict = ['alex', 'noise', 'mother', 'xeal', 'apple', 'google', 'question']
Step 1: find the "middle" word of the dictionary, e.g. "mother", because "m" is somewhere in the middle of the English alphabet (this step is not necessary, but it helps to balance the tree a bit).
Step 2: build the binary tree:
Step 2: Build the binary tree:
mother
/ \
/ \
alex noise
\ / \
\ / \
apple question xeal
\
\
google
Step 3: start looking for anagrams by permutations:
alex: 1. "a"... (going deeper into the binary tree, we find the word 'apple')
   1.1 # of course we should omit 'apple', because len('apple') != len('alex'),
       # but it's enough for an example:
   2. Check if we can build the word "pple" from "lex" ("a" is already reserved!)
      -- there is no "p" in "lex", so skip it, GOTO 1
   ...
   1. "l"... - nothing begins with "l"...
   1. "e"... - nothing begins with "e"...
   1. "x" - going deeper, "xeal" found
   2. Check if "eal" can be built from "ale" ("x" is already reserved):
      for letter in "eal":
          if not letter in "ale":
              return False
      return True
That's it :) This algorithm should work much faster.
EDIT:
Check out the bintrees package to avoid spending time on your own binary tree implementation.
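The tree's job in this scheme is prefix lookup ("which words begin with 'a'?"). As a rough stand-in that avoids writing a tree at all, the stdlib bisect module over a sorted word list gives the same logarithmic descent; this sketch is my own illustration, not part of the answer:

import bisect

words = sorted(['alex', 'noise', 'mother', 'xeal', 'apple', 'google', 'question'])

def words_with_prefix(prefix):
    # locate the first word >= prefix, then collect while the prefix still matches
    i = bisect.bisect_left(words, prefix)
    out = []
    while i < len(words) and words[i].startswith(prefix):
        out.append(words[i])
        i += 1
    return out

print(words_with_prefix('a'))  # ['alex', 'apple']
print(words_with_prefix('x'))  # ['xeal']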