How to pick words from a string that match a criterion - Python

I have one string containing many words, and I want to store all the words that start with 'a' in a list.

It'll be easier if we first turn the string into a list of words:
words = s.split()
We'll build a new list called a_words containing all the words that start with 'a':
a_words = []
for word in words:
    if word[0].lower() == 'a':
        a_words.append(word)
And now we're done, but this can be simplified:
words = s.split()
a_words = list(filter(lambda word: word[0].lower() == 'a', words))
(In Python 3, filter() returns a lazy iterator, so it is wrapped in list() here.)

yourlist = [word for word in yourstring.split() if word.startswith("a")]
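If the match should also catch capitalized words (an assumption on my part; the one-liner above only matches a lowercase "a"), lowercasing each word before the test works:

```python
# hypothetical input; lowercase each word before testing its first letter
yourstring = "An apple and a Banana"
yourlist = [word for word in yourstring.split() if word.lower().startswith("a")]
print(yourlist)
# -> ['An', 'apple', 'and', 'a']
```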

Maybe a little bit more complicated than the other answers, but this works as well.
text = "Happy Ape comes around"
result = []  # renamed from `list` to avoid shadowing the built-in
for i in text.split():
    x = i.lower()
    if x.startswith("a"):
        result.append(i)
print(result)
Output: ['Ape', 'around']

You should have a look at the Regular Expressions (re) module.
The documentation is here
The method re.findall(...) will find all the expressions matching the regex and return a list of them.
import re
string = "Ah, this is an absolutely amazing example"
re.findall(r'(?:\s|^)((?=a|A)\w*)', string)
# -> ["Ah", "an", "absolutely", "amazing"]
Here you select any word starting with 'a' or 'A'.
Explanation:
(?:\s|^)
    (?: ... )   non-capturing group
    \s          any whitespace character (space or tab)
    |           or
    ^           start of string
((?=a|A)\w*)
    (?=a|A)     lookahead for 'a' or 'A'
    \w*         any word character, zero or more times
    ( ... )     capturing group; this is the returned string
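A simpler pattern (my own sketch, not part of the original answer) gets the same result with a word boundary instead of the whitespace alternation:

```python
import re

string = "Ah, this is an absolutely amazing example"
# \b matches the empty boundary before a word character, so no capture
# group or (?:\s|^) prefix is needed
print(re.findall(r'\b[aA]\w*', string))
# -> ['Ah', 'an', 'absolutely', 'amazing']
```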
--EDIT--
After a quick benchmark test, the RegEx method actually seems slower than the Filter method proposed by #Botahamec:
from timeit import default_timer as timer  # assumed timer for the benchmark
from datetime import timedelta
def regex_method(s):
    start = timer()
    re.findall(r'(?:\s|^)((?=a|A)\w*)', s)
    end = timer()
    return timedelta(seconds=end - start)
def filter_method(s):
    start = timer()
    words = s.split()
    # note: filter() is lazy in Python 3, so this timing mostly measures
    # s.split(); wrap the filter in list() to force the filtering work too
    a_words = filter(lambda word: word[0].lower() == 'a', words)
    end = timer()
    return timedelta(seconds=end - start)
len(s)
# -> 27000000
print(regex_method(s))
# -> 0:00:01.231664
print(filter_method(s))
# -> 0:00:00.759951

Related

finding the index of word and its distance

Hello there, I've been trying to code a script that finds the distance from one occurrence of the word one to the next in a text, but the code somehow doesn't work well...
import re
# dummy text
words_list = ['one']
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
striped_long_text = re.sub(' +', ' ', (long_string.replace('\n', ' '))).strip()
length = []
index_items = {}
for item in words_list:
    text_split = striped_long_text.split('{}'.format(item))[:-1]
    for space in text_split:
        if space:
            length.append(space.count(' ') - 1)
print(length)
Dummy text:
are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one
So what I'm trying to do is find the exact distance from each occurrence of the word one to the next one in the text. As you can see, this code works well apart from some exceptions: if the text has the word one within the first 2-3 words, the counting starts from the beginning of the text and the output gets an extra leading value; but if, for example, I add the word one at the very start of the text, the code works fine...
Output of the code vs what it should be :
result = [2, 9, 5, 13]
expected result = [ 9, 5, 13]
Sorry, I can't ask questions in comments yet, but if you want exactly to use regular expressions, then this can help you:
import re
def forOneWord(word):
    length = []
    lastIndex = -1
    for i in range(len(words)):
        if words[i].lower() == word:
            if lastIndex != -1:
                length.append(i - lastIndex - 1)
            lastIndex = i
    return length
def forWordsList(words_list):
    for i in range(len(words_list)):  # Make all words lowercase
        words_list[i] = words_list[i].lower()
    length = [[] for i in range(len(words_list))]
    lastIndex = [-1 for i in range(len(words_list))]
    for i in range(len(words)):  # For all words in string
        for j in range(len(words_list)):  # For all words in list
            if words[i].lower() == words_list[j]:
                if lastIndex[j] != -1:  # This means that this is not the first match
                    length[j].append(i - lastIndex[j] - 1)
                lastIndex[j] = i
    return length
words_list = ['one', 'the', 'a']
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
words = re.findall(r"(\w[\w!-]*)+", long_string)
print(forOneWord('one'))
print(forWordsList(words_list))
There are two functions: one just for single-word search and another to search by list. Also, if it is not necessary to use RegEx, I can improve this solution in terms of performance if you want.
Another solution, using itertools.groupby:
from itertools import groupby
words_list = ["one"]
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
for w in words_list:
    out = [
        (v, sum(1 for _ in g))
        for v, g in groupby(long_string.split(), lambda k: k == w)
    ]
    # check first and last occurrence:
    if out and out[0][0] is False:
        out = out[1:]
    if out and out[-1][0] is False:
        out = out[:-1]
    out = [length for is_word, length in out if not is_word]
    print(out)
Prints:
[9, 5, 13]
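If exact token matches after a plain split() are good enough (an assumption on my part; in this text every "one" happens to be a clean token with no punctuation attached), the gaps can also be computed directly from the indices:

```python
long_string = ("are marked by one the ()meta-characters. two They group together "
               "the expressions contained one inside them, and you can one repeat "
               "the contents of a group with a repeating qualifier, such as there one")

words = long_string.split()
# positions of every exact occurrence of the target word
idxs = [i for i, w in enumerate(words) if w == "one"]
# number of words strictly between consecutive occurrences
gaps = [b - a - 1 for a, b in zip(idxs, idxs[1:])]
print(gaps)
# -> [9, 5, 13]
```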

How to count exact words in Python [duplicate]

This question already has answers here:
Finding occurrences of a word in a string in python 3
I want to search a text and count how many times selected words occur. For simplicity, I'll say the text is "Does it fit?" and the words I want to count are "it" and "fit".
I've written the following code:
mystring = 'Does it fit?'
search_words = 'it', 'fit'
for sw in search_words:
    frequency = {}
    count = mystring.count(sw.strip())
    output = (sw + ',{}'.format(count))
    print(output)
The output is
it,2
fit,1
because the code counts the 'it' in 'fit' towards the total for 'it'.
The output I want is
it,1
fit,1
I've tried changing line 5 to count = mystring.count('\\b'+sw+'\\b'.strip()) but the count is then zero for each word. How can I get this to work?
That list syntax is off; here's a way to do it, though:
bad_chars = [';', ':', '!', "*", "?", "."]
res = {}
for word in ["it", "fit"]:
    res[word] = 0
    string = ''.join(filter(lambda i: i not in bad_chars, "does it fit?"))
    for i in string.split(" "):
        if word == i:
            res[word] += 1
print(res)
With a plain substring count you were checking whether that string appeared anywhere inside another string; in this case 'it' appears inside 'fit', so you were getting 2 occurrences of 'it'.
Here it directly compares whole words after removing punctuation/special characters!
output:
{'it': 1, 'fit': 1}
The issue with the regex pattern that you have tried implementing in your original post is with str.count() rather than the pattern itself.
str.count() (docs) returns the count of non-overlapping occurrences of the str passed as a parameter within the str that the method is applied to - so 'lots of love'.count('lo') will return 2 - however, str.count() is for substring identification using string literals only and will not work with regular expression patterns.
The below solution using your original pattern and the built in re module should work nicely for you.
import re
mystring = 'Does it fit?'
search_words = 'it', 'fit'
results = dict()
for sw in search_words:
    count = re.findall(rf'\b{sw}\b', mystring)
    results[sw] = 0 if not count else len(count)
for k, v in results.items():
    print(f'{k}, {v}')
If you want to get matches from search_words regardless of their case - e.g for each occurrence of the substrings 'Fit', 'FIT', 'fIt' etc. present in mystring to be included in the count stored in results['fit'] - you can achieve this by changing the line:
count = re.findall(rf'\b{sw}\b', mystring)
to
count = re.findall(rf'\b{sw}\b', mystring, re.IGNORECASE)
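As an alternative sketch (not one of the original answers): tokenize once with re.findall and let collections.Counter hold every word's frequency, then look up each search word:

```python
import re
from collections import Counter

mystring = 'Does it fit?'
search_words = ('it', 'fit')

# one tokenizing pass; Counter maps each lowercased word to its count
counts = Counter(re.findall(r'\w+', mystring.lower()))
for sw in search_words:
    print(f'{sw}, {counts[sw]}')
# -> it, 1
# -> fit, 1
```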
Try this:
def count_words(string, *args):
    words = string.split()
    search_words = args
    frequency_dict = {}
    for i in range(len(words)):
        if words[i][-1] == '?':
            words[i] = words[i][:-1]
    for word in search_words:
        frequency_dict[word] = words.count(word)
    for word, count in frequency_dict.items():
        print(f'{word}, {count}')
You can do,
count_words('Does it it it fit fit it?', 'it', 'fit')
And the output is,
it, 4
fit, 2

Remove punctuation items from end of string

I have a seemingly simple problem which I cannot seem to solve. Given a string containing a DOI, I need to keep removing the last character while it is a punctuation mark, until the last character is a letter or number.
For example, if the string was:
sampleDoi = "10.1097/JHM-D-18-00044.',"
I want the following output:
"10.1097/JHM-D-18-00044"
ie. remove .',
I wrote the following script to do this:
import string

invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i - 1
    else:
        print(a)
        break
However, this produces 10.1097/JHM-D-18-00 but I would like it to produce 10.1097/JHM-D-18-00044. Why is the 44 removed from the end?
The string function rstrip() is designed to do exactly this:
>>> sampleDoi = "10.1097/JHM-D-18-00044.',"
>>> sampleDoi.rstrip(",.'")
'10.1097/JHM-D-18-00044'
Corrected code:
import string
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i  # Really, this line can just be removed altogether.
    else:
        print(a)
        break
This gives the output you want, while keeping the original code mostly the same.
This is one way using next and str.isalnum with a generator expression utilizing enumerate / reversed.
sampleDoi = "10.1097/JHM-D-18-00044.',"
idx = next((i for i, j in enumerate(reversed(sampleDoi)) if j.isalnum()), len(sampleDoi))
res = sampleDoi[:len(sampleDoi) - idx]
print(res)
'10.1097/JHM-D-18-00044'
The default parameter len(sampleDoi) is used so that, if no alphanumeric character is found, an empty string is returned. Slicing with len(sampleDoi) - idx instead of -idx also keeps the string intact when it already ends with an alphanumeric character (idx == 0), since sampleDoi[:-0] would otherwise be an empty string.
If you don't want to use regex:
import string
the_str = "10.1097/JHM-D-18-00044.',"
while the_str and the_str[-1] in string.punctuation:
    the_str = the_str[:-1]
This removes the last character until it's no longer a punctuation character; the extra truthiness check guards against an all-punctuation string.
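For completeness, a regex version (my own sketch, not one of the original answers) that trims the whole run of trailing punctuation in a single substitution:

```python
import re

sampleDoi = "10.1097/JHM-D-18-00044.',"
# \W+$ matches trailing characters that are not letters, digits, or
# underscore - matching the asker's punctuation set, which excluded "_"
print(re.sub(r'\W+$', '', sampleDoi))
# -> 10.1097/JHM-D-18-00044
```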

How to split a word in list of two characters

I have a Word: HAPPY
I want to split the word HAPPY like this {"HA", "AP", "PP", "PY"} using python.
I tried the function:
itertools.combinations("HAPPY", 2)
This finds me all the possible combinations from the word HAPPY, which I don't want. All I want is to find all the transitions between the characters.
I would appreciate any suggestions. Thank you in advance!
You may use a regex:
import re
s = 'HAPPY'
print(re.findall(r'(?=(..))', s))
# => ['HA', 'AP', 'PP', 'PY']
See the Python demo
The (?=(..)) pattern finds a location followed with any 2 chars other than line break chars and captures these 2 chars. Then, the regex engine steps forward to the next location and grabs two more chars, and so on.
As for performance, if you compile the regex the performance difference is not that big, but comprehension should be a bit faster:
import re
import time
s = 'HAPPY'
rx = re.compile(r'(?=(..))', re.DOTALL)
def test_regex():
    return rx.findall(s)
def test_comprehension():
    return [s[i:i+2] for i in range(0, len(s)-1)]
n = 10000
t0 = time.time()
for i in range(n): test_regex()
t1 = time.time()
print('regex: {}'.format(t1-t0))
t0 = time.time()
for i in range(n): test_comprehension()
t1 = time.time()
print('comprehension: {}'.format(t1-t0))
# => regex: 0.00773191452026
# => comprehension: 0.00626182556152
See the online test
Quick and dirty list comprehension
[("HAPPY")[i:i+2] for i in range(0,len("HAPPY")-1)]
You could do something like this:
word = 'HAPPY'
combos = [word[i:i+2] for i in range(len(word) - 1)]
Use a list comprehension to take all the two character slices in the string.
string = "HAPPY"
[string[idx:idx+2] for idx in range(len(string) - 1)]  # stop at len-1 so the last slice still has two characters

How many common English words of 4 letters or more can you make from the letters of a given word (each letter can only be used once)

On the back of a block calendar I found the following riddle:
How many common English words of 4 letters or more can you make from the letters
of the word 'textbook' (each letter can only be used once).
My first solution that I came up with was:
from itertools import permutations
with open('/usr/share/dict/words') as f:
    words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
found_words = []
ps = (permutations(given_word, i) for i in range(4, len(given_word)+1))
for p in ps:
    for word in map(''.join, p):
        if word in words and word != given_word:
            found_words.append(word)
print set(found_words)
This gives the result set(['tote', 'oboe', 'text', 'boot', 'took', 'toot', 'book', 'toke', 'betook']) but took more than 7 minutes on my machine.
My next iteration was:
with open('/usr/share/dict/words') as f:
    words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
print [word for word in words if len(word) >= 4 and sorted(filter(lambda letter: letter in word, given_word)) == sorted(word) and word != given_word]
Which returns an answer almost immediately, but gave as the answer: ['book', 'oboe', 'text', 'toot']
What is the fastest, correct and most pythonic solution to this problem?
(edit: added my earlier permutations solution and its different output).
I thought I'd share this slightly interesting trick, although it takes a good bit more code than the rest and isn't really "pythonic". It should be rather quick, though, judging by the timings the others need.
We do a bit of preprocessing to speed up the computations. The basic approach is the following: we assign every letter in the alphabet a prime number, e.g. A = 2, B = 3, and so on. We then compute a hash for every word in the dictionary, which is simply the product of the prime representations of each character in the word, and store every word in a dictionary indexed by that hash.
Now if we want to find out which words are equivalent to, say, textbook, we only have to compute the same hash for the word and look it up in our dictionary. Usually (say, in C++) we'd have to worry about overflows, but in Python it's even simpler than that: every word in the list under the same hash will contain exactly the same characters.
Here's the code, with the slight optimization that in our case we only have to worry about characters that also appear in the given word, which means we can get by with a much smaller prime table than otherwise (the obvious optimization would be to assign values only to the characters that appear in the word at all - it was fast enough anyhow, so I didn't bother, and this way we preprocess only once and can reuse it for several words). A prime-generating algorithm is useful often enough that you should have one lying around anyhow ;)
from collections import defaultdict
from itertools import permutations
PRIMES = list(gen_primes(256))  # some arbitrary prime generator
def get_dict(path):
    res = defaultdict(list)
    with open(path, "r") as file:
        for line in file.readlines():
            word = line.strip().upper()
            hash = compute_hash(word)
            res[hash].append(word)
    return res
def compute_hash(word):
    hash = 1
    for char in word:
        try:
            hash *= PRIMES[ord(char) - ord(' ')]
        except IndexError:
            # contains some character out of range - always 0 for our purposes
            return 0
    return hash
def get_result(path, given_word):
    words = get_dict(path)
    given_word = given_word.upper()
    result = set()
    powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
    for word in (word for word in powerset(given_word) if len(word) >= 4):
        hash = compute_hash(word)
        for equiv in words[hash]:
            result.add(equiv)
    return result
if __name__ == '__main__':
    path = "dict.txt"
    given_word = "textbook"
    result = get_result(path, given_word)
    print(result)
Runs on my Ubuntu word list (98k words) rather quickly, but it's not what I'd call pythonic since it's basically a port of a C++ algorithm. Useful if you want to compare more than one word that way.
How about this?
from itertools import permutations, chain
with open('/usr/share/dict/words') as fp:
    words = set(fp.read().split())
given_word = 'textbook'
perms = (permutations(given_word, i) for i in range(4, len(given_word)+1))
pwords = (''.join(p) for p in chain(*perms))
matches = words.intersection(pwords)
print matches
which gives
>>> print matches
set(['textbook', 'keto', 'obex', 'tote', 'oboe', 'text', 'boot', 'toto', 'took', 'koto', 'bott', 'tobe', 'boke', 'toot', 'book', 'bote', 'otto', 'toke', 'toko', 'oket'])
There is a generator itertools.permutations with which you can gather all permutations of a sequence with a specified length. That makes it easier:
from itertools import permutations
GIVEN_WORD = 'textbook'
with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]
print len(filter(lambda x: ''.join(x) in words, permutations(GIVEN_WORD, 4)))
Edit #1: Oh! It says "4 or more" ;) Forget what I said!
Edit #2: This is the second version I came up with:
LETTERS = set('textbook')
with open('/usr/share/dict/words') as f:
    WORDS = filter(lambda x: len(x) >= 4, [l.strip() for l in f])
matching = filter(lambda x: set(x).issubset(LETTERS) and all([x.count(c) == 1 for c in x]), WORDS)
print len(matching)
Create the whole power set, then check whether the dictionary word is in the set (order of the letters doesn't matter):
powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
pw = map(lambda x: sorted(x), powerset(given_word))
filter(lambda x: sorted(x) in pw, words)
The following just checks each word in the dictionary to see if it is of the appropriate length, and then if it is a permutation of 'textbook'. I borrowed the permutation check from
Checking if two strings are permutations of each other in Python
but changed it slightly.
given_word = 'textbook'
with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]
matches = []
for word in words:
    if word != given_word and 4 <= len(word) <= len(given_word):
        if all(word.count(char) <= given_word.count(char) for char in word):
            matches.append(word)
print sorted(matches)
This finishes almost immediately and gives the correct result.
Permutations get very big for longer words. Try counterrevolutionary, for example.
I would filter the dict for words of length 4 to len(word) (8 for textbook).
Then I would filter with the regular expression "oboe".matches("[textbook]+").
The remaining words I would sort and compare with a sorted version of your word, ("beoo", "bekoottx"), jumping to the next index of a matching character to find mismatching numbers of characters:
("beoo", "bekoottx")
("eoo", "ekoottx")
("oo", "koottx")
("oo", "oottx")
("o", "ottx")
("", "ttx") => matched
("bbo", "bekoottx")
("bo", "ekoottx") => mismatch
Since I don't talk python, I leave the implementation as an exercise to the audience.
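Since the implementation was left as an exercise, here is one possible Python sketch of that sorted two-pointer comparison (my reading of the algorithm described above, not the original author's code):

```python
def fits(word, given):
    """True if `word` can be spelled from the letters of `given`,
    using each letter at most once."""
    w, g = sorted(word), sorted(given)
    j = 0
    for ch in w:
        # skip unused letters of `given` smaller than the current letter
        while j < len(g) and g[j] < ch:
            j += 1
        if j >= len(g) or g[j] != ch:
            return False  # this character is missing or over-used
        j += 1  # consume the matched letter
    return True

print(fits("oboe", "textbook"))  # -> True
print(fits("bbo", "textbook"))   # -> False
```

Combined with a length filter (4 <= len(word) <= len(given)), this reproduces the check described in the steps above.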
