Finding the index of a word and its distance - python

hello there, I've been trying to code a script that finds the distance from one occurrence of a word to the next occurrence in a text, but the code somehow doesn't work well...
import re
#dummy text
words_list = ['one']
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
striped_long_text = re.sub(' +', ' ',(long_string.replace('\n',' '))).strip()
length = []
index_items = {}
for item in words_list:
    text_split = striped_long_text.split('{}'.format(item))[:-1]
    for space in text_split:
        if space:
            length.append(space.count(' ')-1)
print(length)
Dummy text:
are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one
So what I'm trying to do is find the exact distance (in words) from each occurrence of the word one to the next occurrence in the text. As you can see, this code works with some exceptions: if the word one first appears only after 2-3 words, the counting starts from the beginning of the text and that spoils the output, but if, for example, I add the word one at the very start of the text, the code works fine.
Output of the code vs. what it should be:
result = [2, 9, 5, 13]
expected result = [9, 5, 13]
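For reference, a minimal sketch of one way the split-based approach could be patched (reusing words_list and striped_long_text from the code above, and keeping its substring-style matching): the chunk before the first occurrence is simply dropped, since it is not a gap between two occurrences.
length = []
for item in words_list:
    chunks = striped_long_text.split(item)
    if not striped_long_text.startswith(item):
        chunks = chunks[1:]  # text before the first match is not a gap
    for gap in chunks[:-1]:  # the last chunk is the text after the final match
        if gap:
            length.append(gap.count(' ') - 1)
print(length)  # [9, 5, 13]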

Sorry, I can't ask questions in comments yet, but if you want exactly to use regular expressions, then this can help you:
import re
def forOneWord(word):
    length = []
    lastIndex = -1
    for i in range(len(words)):
        if words[i].lower() == word:
            if lastIndex != -1:
                length.append(i - lastIndex - 1)
            lastIndex = i
    return length

def forWordsList(words_list):
    for i in range(len(words_list)): # Make all words lowercase
        words_list[i] = words_list[i].lower()
    length = [[] for i in range(len(words_list))]
    lastIndex = [-1 for i in range(len(words_list))]
    for i in range(len(words)): # For all words in string
        for j in range(len(words_list)): # For all words in list
            if words[i].lower() == words_list[j]:
                if lastIndex[j] != -1: # This means that this is not the first match
                    length[j].append(i - lastIndex[j] - 1)
                lastIndex[j] = i
    return length
words_list = ['one' , 'the', 'a']
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
words = re.findall(r"(\w[\w!-]*)+", long_string)
print(forOneWord('one'))
print(forWordsList(words_list))
There are two functions: one just for single-word search and another to search by a list of words. Also, if it is not necessary to use RegEx, I can improve this solution in terms of performance if you like.
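For example, a non-regex variant might look roughly like this (a sketch of what such an improvement could be, not necessarily the one meant above; it reuses long_string from the snippet and strips basic punctuation while splitting on whitespace):
def distances(word, text):
    word = word.lower()
    tokens = [t.strip('().,!?').lower() for t in text.split()]
    gaps, last = [], -1
    for i, token in enumerate(tokens):
        if token == word:
            if last != -1:
                gaps.append(i - last - 1)
            last = i
    return gaps

print(distances('one', long_string))  # [9, 5, 13]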

Another solution, using itertools.groupby:
from itertools import groupby
words_list = ["one"]
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
for w in words_list:
    out = [
        (v, sum(1 for _ in g))
        for v, g in groupby(long_string.split(), lambda k: k == w)
    ]
    # check first and last occurrence:
    if out and out[0][0] is False:
        out = out[1:]
    if out and out[-1][0] is False:
        out = out[:-1]
    out = [length for is_word, length in out if not is_word]
    print(out)
Prints:
[9, 5, 13]


How to stop over counting of duplicate letters in a list of strings

I'm trying to count the number of times a duplicate letter shows up in each list element.
For example, given
arr = ['capps','hat','haaah']
I should output a list and get [1, 0, 1].
def myfunc(words):
    counter = 0 # counts duplicate letters in a word
    len_ = len(words)-1
    for i in range(len_):
        if words[i] == words[i+1]: # if the letter ahead is the same, add one
            counter += 1
    return counter

def minimalOperations(arr):
    return [*map(myfunc, arr)] # map applies myfunc to each element of arr
But my code would output [1,0,2]
I'm not sure why I am over counting.
Can anyone help me resolve this, thank you in advance.
A more efficient solution using a regular expression:
import re
def myfunc(words):
    reg_str = r"(\w)\1{1,}"
    return len(re.findall(reg_str, words))
This function will find the number of substrings of length 2 or more containing the same letter. Thus 'aaa' in your example will only be counted once.
For a string like
'hhhhfafaahggaa'
the output will be 4, since there are 4 maximal substrings of the same letter occurring at least twice: 'hhhh', 'aa', 'gg', 'aa'.
You aren't accounting for situations where you have greater than 2 identical characters in succession. To do this, you can look back as well as forward:
if (words[i] == words[i+1]) and (words[i] != words[i-1] if i != 0 else True):
    # as before
The ternary expression handles the first iteration of the loop, so that the first letter is not compared with the last one (words[-1] wraps around to the end of the string).
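Dropped back into the original function, the adjusted check might look like this (a sketch that keeps the original names):
def myfunc(words):
    counter = 0  # counts groups of duplicated letters
    for i in range(len(words) - 1):
        # only count the first pair of each run of identical letters
        if words[i] == words[i+1] and (words[i] != words[i-1] if i != 0 else True):
            counter += 1
    return counter

print([myfunc(w) for w in ['capps', 'hat', 'haaah']])  # [1, 0, 1]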
Another solution is to use itertools.groupby and count the number of instances where a group has a length greater than 1:
arr = ['capps','hat','haaah']
from itertools import groupby
res = [sum(1 for _, j in groupby(el) if sum(1 for _ in j) > 1) for el in arr]
print(res)
[1, 0, 1]
The sum(1 for _ in j) part is used to count the number of items in a generator. It's also possible to use len(list(j)), though this requires list construction.
Well, your code counts the number of duplications, so what you observe is quite logical:
your input is arr = ['capps','hat','haaah']
in 'capps', the letter p is duplicated 1 time => myfunc() returns 1
in 'hat', there is no duplicated letter => myfunc() returns 0
in 'haaah', the letter a is duplicated 2 times => myfunc() returns 2
So finally you get [1,0,2].
For your purpose, I suggest using a regex to match and count the groups of duplicated letters in each word. I also replaced the usage of map() with a list comprehension, which I find more readable:
import re
def myfunc(words):
    return len(re.findall(r'(\w)\1+', words))

def minimalOperations(arr):
    return [myfunc(a) for a in arr]
arr = ['capps','hat','haaah']
print(minimalOperations(arr)) # [1,0,1]
arr = ['cappsuul','hatppprrrrtyyy','haaah']
print(minimalOperations(arr)) # [2,3,1]
You need to keep track of a little more state, specifically if you're looking at duplicates now.
def myfunc(words):
    counter = 0 # counts groups of duplicated letters in a word
    seen = None
    len_ = len(words)-1
    for i in range(len_):
        if words[i] == words[i+1] and words[i+1] != seen: # the next letter is the same and this run has not been counted yet
            counter += 1
        seen = words[i]
    return counter
This gives you the following output
>>> arr = ['capps','hat','haaah']
>>> list(map(myfunc, arr))
[1, 0, 1]
As others have pointed out, you could use a regular expression and trade clarity for performance. The key is to find a regular expression that means "two or more repeated characters", and it may depend on what you consider to be characters (e.g. how do you treat duplicate punctuation?).
Note: the "regex" used for this is technically an extension on regular expressions because it requires memory.
The form will be len(re.findall(regex, words))
I would break this kind of problem into smaller chunks. Starting by grouping duplicates.
The documentation for itertools has groupby and recipes for this kind of things.
A slightly edited version of unique_justseen would look like this:
import itertools
duplicates = (sum(1 for _ in group) for _key, group in itertools.groupby("haaah"))
and yields the values 1, 3, 1. As soon as any of these values is greater than 1 you have a duplicate. So just count them:
sum(n > 1 for n in duplicates)
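Put together, a small sketch of that groupby-based count (the function name here is just illustrative):
from itertools import groupby

def count_duplicate_groups(word):
    run_lengths = (sum(1 for _ in group) for _key, group in groupby(word))
    return sum(n > 1 for n in run_lengths)  # runs longer than 1 are duplicate groups

print([count_duplicate_groups(w) for w in ['capps', 'hat', 'haaah']])  # [1, 0, 1]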
Use re.findall for matches of 2 or more letters
>>> arr = ['capps','hat','haaah']
>>> [len(re.findall(r'(.)\1+', w)) for w in arr]
[1, 0, 1]

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating on each letter).
I have no clue how to do that in Python 2.7. I was trying with regular expressions, but that did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz's answer, but maybe a little simpler to read and understand for beginners.
DNA = 'AAGTCCTAG'
toSave = []
for i in range(len(DNA)):
    letters = ['A','G','T','C']
    j = i
    seq = []
    while len(letters) > 0 and j < len(DNA):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except:
            pass
        j += 1
    if len(letters) == 0:
        toSave.append(''.join(seq))  # join the collected letters into a substring
print(toSave)
Since the substring you are looking for may be of about any length, a FIFO queue seems to work. Append one letter at a time and check whether the queue contains at least one of each letter. If it does, yield it, then remove letters from the front and keep checking until it is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count, chars)):
            yield("".join(cur_str))
            cur_str.pop(0)

seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x, c = s[:c], c+1
    if all(l in x for l in d):
        print(x)
        s, c = s[1:], len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop runs as long as c does not exceed the length of s, looking at the prefix s[:c] on each pass.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exists in the prefix, we print the result, reset c to its default value and slice one character off the start of s.
The loop continues until the remaining string s is shorter than d.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c, r = len(d), []
while c <= len(s):
    x, c = s[:c], c+1
    if all(l in x for l in d):
        r.append(x)
        s, c = s[1:], len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (letter, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just the number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

Python: How to get each letter put into a word?

When a name is given, for example Aberdeen Scotland, I need to get the result Adbnearldteoecns.
Leave the first word as it is, but reverse the last word and put its letters in between the letters of the first word.
I have done so far:
coordinatesf = "Aberdeen Scotland"
for line in coordinatesf:
    separate = line.split()
    for i in separate[0:-1]:
        lastw = separate[1][::-1]
        print(i)
A bit dirty but it works:
coordinatesf = "Aberdeen Scotland"
new_word=[]
#split the two words
words = coordinatesf.split(" ")
#reverse the second and put to lowercase
words[1]=words[1][::-1].lower()
#populate the new string
for index in range(0,len(words[0])):
new_word.insert(2*index,words[0][index])
for index in range(0,len(words[1])):
new_word.insert(2*index+1,words[1][index])
outstring = ''.join(new_word)
print outstring
Note that what you want to do is only well-defined if the input string is composed of two words of the same length.
I use assertions to make sure that is true but you can leave them out.
def scramble(s):
    words = s.split(" ")
    assert len(words) == 2
    assert len(words[0]) == len(words[1])
    scrambledLetters = zip(words[0], reversed(words[1]))
    return "".join(x[0] + x[1] for x in scrambledLetters)
>>> print(scramble("Aberdeen Scotland"))
AdbnearldteoecnS
You could replace the x[0] + x[1] part with sum() but I think that makes it less readable.
This splits the input, zips the first word with the reversed second word, joins the pairs, then joins the list of pairs.
coordinatesf = "Aberdeen Scotland"
a,b = coordinatesf.split()
print(''.join(map(''.join, zip(a,b[::-1]))))

For each word in the text file, extract surrounding 5 words

For each occurrence of a certain word, I need to display the context by showing about 5 words preceding and following the occurrence of the word.
Example output for the word 'stranger' in a text file of content when you enter occurs('stranger', 'movie.txt'):
My code so far:
def occurs(word, filename):
    infile = open(filename,'r')
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ''.join(lines)
    words = wordsString.split()
    print(words)
    for i in range(len(words)):
        if words[i].find(word):
            #stuck here
I'd suggest slicing words depending on i:
print(words[i-5:i+6])
(This would go where your comment is)
Alternatively, to print as shown in your example:
print("...", " ".join(words[i-5:i+6]), "...")
To account for the word being in the first 5:
if i > 5:
    print("...", " ".join(words[i-5:i+6]), "...")
else:
    print("...", " ".join(words[0:i+6]), "...")
Additionally, find is not doing what you think it is. If find() doesn't find the string, it returns -1, which evaluates to True when used in an if statement. Try:
if word in words[i].lower():
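Putting those suggestions together, one possible version of the function (a sketch, using the membership test above and clamping the window at the start of the file):
def occurs(word, filename):
    with open(filename, 'r') as infile:
        words = infile.read().split()
    for i, w in enumerate(words):
        if word in w.lower():
            start = max(i - 5, 0)
            print("...", " ".join(words[start:i+6]), "...")

occurs('stranger', 'movie.txt')  # assumes movie.txt exists, as in the question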
This retrieves the index of every occurrence of the word in words, which is a list of all words in the file. Then slicing is used to get a list of the matched word and the 5 words before and after.
def occurs(word, filename):
    infile = open(filename,'r')
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ''.join(lines)
    words = wordsString.split()
    matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
    for m in matches:
        l = " ".join(words[m-5:m+6])
        print(f"... {l} ...")
Consider the more_itertools.adjacent tool.
Given
import more_itertools as mit
s = """\
But we did not answer him, for he was a stranger and we were not used to, strangers and were shy of them.
We were simple folk, in our village, and when a stranger was a pleasant person we were soon friends.
"""
word, distance = "stranger", 5
words = s.splitlines()[0].split()
Demo
neighbors = list(mit.adjacent(lambda x: x == word, words, distance))
" ".join(word for bool_, word in neighbors if bool_)
# 'him, for he was a stranger and we were not used'
Details
more_itertools.adjacent returns an iterable of tuples, e.g. (bool, item) pairs. A True boolean is returned for words in the string that satisfy the predicate. Example:
>>> neighbors
[(False, 'But'),
...
(True, 'a'),
(True, 'stranger'),
(True, 'and'),
...
(False, 'to,')]
Neighboring words are filtered from the results given a distance from the target word.
Note: more_itertools is a third-party library. Install by pip install more_itertools.
Whenever I see rolling views of files, I think collections.deque
import collections
def occurs(needle, fname):
    with open(fname) as f:
        lines = f.readlines()
    words = iter(''.join(lines).split())
    view = collections.deque(maxlen=11)
    # prime the deque
    for _ in range(10): # leaves an 11-length deque with 10 elements
        view.append(next(words, ""))
    for w in words:
        view.append(w)
        if view[5] == needle:
            yield list(view.copy())
Note that this approach intentionally does not handle any edge cases where the needle appears in the first 5 words or the last 5 words of the file. The question is ambiguous as to whether matching the third word should give the first through ninth words, or something different.
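For example, the generator could be consumed like this (assuming a movie.txt file, as in the question):
for window in occurs("stranger", "movie.txt"):
    print(" ".join(window))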

How many common English words of 4 letters or more can you make from the letters of a given word (each letter can only be used once)

On the back of a block calendar I found the following riddle:
How many common English words of 4 letters or more can you make from the letters
of the word 'textbook' (each letter can only be used once).
My first solution that I came up with was:
from itertools import permutations
with open('/usr/share/dict/words') as f:
    words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
found_words = []
ps = (permutations(given_word, i) for i in range(4, len(given_word)+1))
for p in ps:
    for word in map(''.join, p):
        if word in words and word != given_word:
            found_words.append(word)
print set(found_words)
This gives the result set(['tote', 'oboe', 'text', 'boot', 'took', 'toot', 'book', 'toke', 'betook']) but took more than 7 minutes on my machine.
My next iteration was:
with open('/usr/share/dict/words') as f:
    words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
print [word for word in words if len(word) >= 4 and sorted(filter(lambda letter: letter in word, given_word)) == sorted(word) and word != given_word]
Which returned an answer almost immediately, but gave as answer: ['book', 'oboe', 'text', 'toot']
What is the fastest, correct and most pythonic solution to this problem?
(edit: added my earlier permutations solution and its different output).
I thought I'd share this slightly interesting trick, although it takes a good bit more code than the rest and isn't really "pythonic". Judging by the timings the other solutions need, though, it should be rather quick.
We do a bit of preprocessing to speed up the computations. The basic approach is the following: we assign every letter of the alphabet a prime number, e.g. A = 2, B = 3, and so on. We then compute a hash for every word in the word list, which is simply the product of the prime representations of the characters in the word, and store every word in a dictionary indexed by that hash.
Now if we want to find out which words are equivalent to, say, textbook, we only have to compute the same hash for the word and look it up in our dictionary. Usually (say in C++) we'd have to worry about overflows, but in Python it's even simpler than that: every word stored under the same hash contains exactly the same characters.
Here's the code, with the slight optimization that in our case we only have to worry about characters that also appear in the given word, which means we can get by with a much smaller prime table than otherwise (the obvious optimization would be to assign values only to characters that appear in the word - it was fast enough anyhow, so I didn't bother, and this way we can preprocess once and reuse it for several words). A prime generator is useful often enough that you should have one yourself anyhow ;)
from collections import defaultdict
from itertools import permutations
PRIMES = list(gen_primes(256)) # some arbitrary prime generator
def get_dict(path):
    res = defaultdict(list)
    with open(path, "r") as file:
        for line in file.readlines():
            word = line.strip().upper()
            hash = compute_hash(word)
            res[hash].append(word)
    return res

def compute_hash(word):
    hash = 1
    for char in word:
        try:
            hash *= PRIMES[ord(char) - ord(' ')]
        except IndexError:
            # contains some character out of range - always 0 for our purposes
            return 0
    return hash

def get_result(path, given_word):
    words = get_dict(path)
    given_word = given_word.upper()
    result = set()
    powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
    for word in (word for word in powerset(given_word) if len(word) >= 4):
        hash = compute_hash(word)
        for equiv in words[hash]:
            result.add(equiv)
    return result

if __name__ == '__main__':
    path = "dict.txt"
    given_word = "textbook"
    result = get_result(path, given_word)
    print(result)
Runs on my Ubuntu word list (98k words) rather quickly, but it's not what I'd call pythonic since it's basically a port of a C++ algorithm. Useful if you want to compare more than one word that way.
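For completeness, a possible stand-in for gen_primes, reading it as "the first n primes" so that every printable ASCII character gets a prime (this is an assumption; the answer leaves the generator unspecified):
def gen_primes(count):
    # yield the first `count` primes by simple trial division
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
            yield candidate
        candidate += 1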
How about this?
from itertools import permutations, chain
with open('/usr/share/dict/words') as fp:
    words = set(fp.read().split())
given_word = 'textbook'
perms = (permutations(given_word, i) for i in range(4, len(given_word)+1))
pwords = (''.join(p) for p in chain(*perms))
matches = words.intersection(pwords)
print matches
which gives
>>> print matches
set(['textbook', 'keto', 'obex', 'tote', 'oboe', 'text', 'boot', 'toto', 'took', 'koto', 'bott', 'tobe', 'boke', 'toot', 'book', 'bote', 'otto', 'toke', 'toko', 'oket'])
There is a generator itertools.permutations with which you can gather all permutations of a sequence with a specified length. That makes it easier:
from itertools import permutations
GIVEN_WORD = 'textbook'
with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]
print len(filter(lambda x: ''.join(x) in words, permutations(GIVEN_WORD, 4)))
Edit #1: Oh! It says "4 or more" ;) Forget what I said!
Edit #2: This is the second version I came up with:
LETTERS = set('textbook')
with open('/usr/share/dict/words') as f:
    WORDS = filter(lambda x: len(x) >= 4, [l.strip() for l in f])
matching = filter(lambda x: set(x).issubset(LETTERS) and all([x.count(c) == 1 for c in x]), WORDS)
print len(matching)
Create the whole power set, then check whether the dictionary word is in the set (order of the letters doesn't matter):
powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
pw = map(lambda x: sorted(x), powerset(given_word))
filter(lambda x: sorted(x) in pw, words)
The following just checks each word in the dictionary to see if it is of the appropriate length, and then if it is a permutation of 'textbook'. I borrowed the permutation check from
Checking if two strings are permutations of each other in Python
but changed it slightly.
given_word = 'textbook'
with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]
matches = []
for word in words:
    if word != given_word and 4 <= len(word) <= len(given_word):
        if all(word.count(char) <= given_word.count(char) for char in word):
            matches.append(word)
print sorted(matches)
This finishes almost immediately and gives the correct result.
Permutations get very big for longer words. Try counterrevolutionary for example.
I would filter the dict for words from 4 to len(word) (8 for textbook).
Then I would filter with the regular expression "oboe".matches("[textbook]+").
The remaining words I would sort and compare with a sorted version of your word ("beoo" vs. "bekoottx"), jumping to the next index of a matching character to find mismatching numbers of characters:
("beoo", "bekoottx")
("eoo", "ekoottx")
("oo", "koottx")
("oo", "oottx")
("o", "ottx")
("", "ttx") => matched
("bbo", "bekoottx")
("bo", "ekoottx") => mismatch
Since I don't talk python, I leave the implementation as an exercise to the audience.
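For the curious, a rough Python rendering of that recipe (a sketch; the function name and the subsequence check are illustrative, while the regex and the sort-and-skip merge follow the description above):
import re

def find_words(given_word, words):
    allowed = re.compile("[%s]+$" % given_word)  # e.g. "[textbook]+$"
    target = sorted(given_word)                  # e.g. the letters of "bekoottx"
    matches = []
    for w in words:
        if 4 <= len(w) <= len(given_word) and allowed.match(w):
            # walk the sorted target, skipping ahead to each matching character;
            # this fails if some character occurs more often than in given_word
            it = iter(target)
            if all(c in it for c in sorted(w)):
                matches.append(w)
    return matches

print(find_words('textbook', ['oboe', 'book', 'bobby', 'text', 'toot']))
# ['oboe', 'book', 'text', 'toot']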
