I want to count occurrences of list elements in a text with Python. I know that I can use .count(), but I have read that this can hurt performance. Also, an element in the list can have more than one word.
my_list = ["largest", "biggest", "greatest", "the best"]
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"
I can do this:
num = 0
for i in my_list:
    num += my_text.lower().count(i.lower())
print(num)
This way works, but what if my list has 500 elements and my string is 3,000 words? In that case performance is very poor.
Is there a way to do this with good/fast performance?
Since my_list contains strings with more than one word, splitting my_text on spaces won't do; you'll have to look at its n-grams to find matches. Also note that your approach is not advisable: for every single string in my_list, you traverse the whole of my_text with count. A better way is to define the n-grams you'll be looking for beforehand.
Here's one approach using nltk's ngram.
I've added another string in my_list to better illustrate the process:
from nltk import ngrams
from collections import Counter, defaultdict
my_list = ["largest", "biggest", "greatest", "the best", 'My friend is the best']
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"
The first step is to define a dictionary containing the different lengths of the n-grams that we'll be looking up:
d = defaultdict(list)
for i in my_list:
    k = i.split()
    d[len(k)].append(tuple(k))
print(d)
defaultdict(list,
            {1: [('largest',), ('biggest',), ('greatest',)],
             2: [('the', 'best')],
             5: [('My', 'friend', 'is', 'the', 'best')]})
Then split my_text into a list, and for each key in d find the corresponding n-grams and build a Counter from the result. Then for each value in that specific key in d, update with the counts from the Counter:
my_text_split = my_text.replace('.', '').split()
match_counts = dict()
for n, v in d.items():
    c = Counter(ngrams(my_text_split, n))
    for k in v:
        if k in c:
            match_counts[k] = c[k]
Which will give:
print(match_counts)
{('largest',): 2,
('biggest',): 2,
('greatest',): 1,
('the', 'best'): 1,
('My', 'friend', 'is', 'the', 'best'): 1}
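If what you ultimately want is the single total from your original snippet, you can simply sum the per-phrase counts (continuing from the match_counts built above):
total = sum(match_counts.values())
print(total)  # 7 for the example above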
Could you please help me check whether any strings in a list that are next to each other have the same first letter? I'm pretty new to Python, and my approach was to first tokenise and lowercase the list. Then I create a nested list:
import nltk
myStrings = "Bob build a house"
myStrings_words = nltk.word_tokenize(myStrings)
myStrings_words_lower = [word.lower() for word in myStrings_words]
nested_list = [list(x) for x in myStrings_words_lower]
Now I'm not sure how to compare each word's first letter with the others while making sure the words are next to each other in the list. Maybe a for loop, accessing the first letters with myStrings_words_lower[x][0]?
The output should be the words that start with the same letter, so in this case 'bob' and 'build'.
Thank you in advance,
Paul
You can use itertools.groupby to help you with this. Let's assume you have your list of lowercase words:
import nltk
myStrings = "Bob build a house"
myStrings_words = nltk.word_tokenize(myStrings)
myStrings_words_lower = [word.lower() for word in myStrings_words]
To group them into any neighbours that share a first letter, you can do:
import itertools
# define a grouping helper
first_letter = lambda x: x[0]
# get the groups
grouped_words = itertools.groupby(myStrings_words_lower, key=first_letter)
print(f"The number of words is {len(myStrings_words_lower)} and the number of groups is {len(list(grouped_words))}")
If the number of groups is equal to the number of words, then no consecutive words share a starting letter. If the number is not equal, then you know there are neighbouring entries that share a starting letter.
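If you also want to print which words those are (like the expected "bob build" output), you can re-run the grouping and keep only groups with more than one word; a small sketch along those lines:
import itertools

words = ['bob', 'build', 'a', 'house']

for letter, group in itertools.groupby(words, key=lambda w: w[0]):
    group = list(group)
    if len(group) > 1:
        # consecutive words that share the same first letter
        print(*group)
# prints: bob build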
Another approach:
In [6]: myString = "Bob build an aeroplane, boat and a haunted house"
In [7]: my_words = [word.lower() for word in myString.split()]
In [8]: my_words
Out[8]: ['bob', 'build', 'an', 'aeroplane,', 'boat', 'and', 'a', 'haunted', 'house']
# Iterate over the words and while iterating, check if present word and
# the next word has the same first letter. (We use len(my_words) - 1 as
# we are using i+1 in the loop and so should stop at the penultimate word)
In [9]: for i in range(len(my_words) - 1):
...: if my_words[i][0] == my_words[i+1][0]:
...: print(my_words[i], my_words[i+1])
...:
bob build
an aeroplane,
and a
haunted house
Cheers!
I'm trying to give a general score to different sentences. Some words have different values, and I need a general score for each of the sentences:
sentences = ['This is so awesome , fr!', 'Ngl Im bad', 'awesome ! still bad tho' ]
values = {'awesome' : 4, 'bad' : -1 }
This is the for loop I used to get the sentences split:
split = []
for i in range(len(sentences)):
    split.append(sentences[i].lower().split(' '))
So far so good, but now I cannot get the general score of each of the sentences into a list. I've tried using for loops but never got the desired output; this was my first attempt:
scores = [0, 0, 0]
for list_ in split:
    for word in list_:
        if word in values:
            for i in range(len(punctuations)):
                scores[i] += values[word]
        else:
            continue
And this was my second attempt:
count = []
scores = []
for list_ in split:
    for word in list_:
        if word in values:
            count.append(values[word])
        else:
            continue
    scores.append(sum(count))
I've tried some other ways but I cannot find a solution, could you point me in the right direction?
Python is actually a really clean language for writing solutions to these types of problems:
sentences = ['This is so awesome , fr!', 'Ngl Im bad', 'awesome ! still bad tho' ]
values = {'awesome': 4, 'bad': -1}
ratings = [sum(values[word] if word in values else 0 for word in sentence.split()) for sentence in sentences]
print(ratings)
Result:
[4, -1, 3]
A bit of explanation:
the whole expression assigned to ratings is a list comprehension, which constructs a list from an internal generator; a simple example is x = [i * 2 for i in range(3)]
the generator loops over every sentence in sentences and processes it
the processing consists of computing the sum of the values from a second generator
the second generator loops over the words in the sentence (obtained by splitting it on whitespace with .split()) and looks up each word's value in values, using 0 if the word isn't in values
Perhaps a more readable version of the same code:
ratings = [
sum(
values[word] if word in values else 0
for word in sentence.split()
)
for sentence in sentences
]
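As a side note, dict.get with a default value does the same lookup a bit more compactly; this is just an equivalent variant, not a different algorithm:
ratings = [
    sum(values.get(word, 0) for word in sentence.split())
    for sentence in sentences
]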
I am now studying Python, and I am trying to solve the following exercise:
Assuming there is a list of words in a text file,
My goal is to print the longest N words in this list.
There are several important points:
The print order does not matter
Words that appear later in the file are given priority when several words have the same length (I added an example of this below)
Assume that each row in the file contains only a single word
Is there a simple and easy solution for a short list of words, as opposed to a more complex solution for a situation where the list contains several thousand words?
I have attached an example of my starting code, which finds a single word with the maximum length,
and an example of the output for N = 4, to illustrate my question.
Thanks for your advice,
word_list1 = open('WORDS.txt', 'r')
def find_longest_word(word_list):
    longest_word = ''
    for word in word_list:
        if len(word) > len(longest_word):
            longest_word = word
    print(longest_word)
find_longest_word(word_list1)
example(N=4):
WORDS.TXT
---------
Mother
Dad
Cat
Bicycle
House
Hat
The result will be (as I said before, print order doesn't matter):
Hat
House
Bicycle
Mother
thanks in advance!
One alternative is to use a heap to maintain the top-n elements:
import heapq
from operator import itemgetter
def top(lst, n=4):
    heap = [(0, i, '') for i in range(n)]
    heapq.heapify(heap)
    for i, word in enumerate(lst):
        item = (len(word), i, word)
        if item > heap[0]:
            heapq.heapreplace(heap, item)
    return list(map(itemgetter(2), heap))
words = ['Mother', 'Dad', 'Cat', 'Bicycle', 'House', 'Hat']
print(top(words))
Output
['Hat', 'House', 'Bicycle', 'Mother']
In the heap we keep items that correspond to length and position, so in case of ties the last one to appear gets selected.
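To run it against the WORDS.txt file from the question, something like this should work (file name taken from the question, reusing top from above):
with open('WORDS.txt') as f:
    print(top([line.strip() for line in f]))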
Sort the word_list based on the length of the words and then on a counter variable, so that words occurring later get higher priority:
>>> from itertools import count
>>> cnt = count()
>>> n = 4
>>> sorted(word_list, key=lambda word:(len(word), next(cnt)), reverse=True)[:n]
['Bicycle', 'Mother', 'House', 'Hat']
You can use sorted with a custom tuple key and then list slicing.
from io import StringIO
x = StringIO("""Mother
Dad
Cat
Bicycle
House
Hat
Brother""")
def find_longest_word(word_list, n):
    idx, words = zip(*sorted(enumerate(word_list), key=lambda x: (-len(x[1]), -x[0]))[:n])
    return words
res = find_longest_word(map(str.strip, x.readlines()), 4)
print(*res, sep='\n')
# Brother
# Bicycle
# Mother
# House
I have a text file and two lists of strings.
The first list is the keyword list
k = ["hi", "bob"]
The second list is the words I want to replace the keywords with
r = ["ok", "bye"]
I want to take the text file as input so that wherever an element of k appears it's replaced with the corresponding element of r; thus, "hi, how are you bob" would be changed to "ok, how are you bye".
Let's say you have already parsed your sentence:
sentence = ['hi', 'how', 'are', 'you', 'bob']
What you want to do is to check whether each word in this sentence is present in k. If yes, replace it by the corresponding element in r; else, use the actual word. In other words:
if word in k:
    word_index = k.index(word)
    new_word = r[word_index]
This can be written in a more concise way:
new_word = r[k.index(word)] if word in k else word
Using list comprehensions, here's how you go about processing the whole sentence:
new_sentence = [r[k.index(word)] if word in k else word for word in sentence]
new_sentence is now equal to ['ok', 'how', 'are', 'you', 'bye'] (which is what you want).
Note that in the code above we perform two equivalent search operations: word in k and k.index(word). This is inefficient. These two operations can be reduced to one by catching exceptions from the index method:
def get_new_word(word, k, r):
    try:
        word_index = k.index(word)  # raises ValueError if word is not in k
        return r[word_index]
    except ValueError:
        return word
new_sentence = [get_new_word(word, k, r) for word in sentence]
Now, you should also note that looking a word up in k is an O(n) operation (where n is the number of keywords). Thus the complexity of this algorithm is O(n·m) (where m is the sentence length). You can reduce this complexity to O(m) by using a more appropriate data structure, as suggested by the other comments. This is left as an exercise :-p
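For the curious, here is a minimal sketch of that dict-based variant (just an illustration of the idea, using the lists from the question):
k = ['hi', 'bob']
r = ['ok', 'bye']
sentence = ['hi', 'how', 'are', 'you', 'bob']

# Build the keyword -> replacement mapping once, then each lookup is O(1).
replacements = dict(zip(k, r))
new_sentence = [replacements.get(word, word) for word in sentence]
print(new_sentence)  # ['ok', 'how', 'are', 'you', 'bye']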
I'll assume you've got the "reading string from file" part covered, so about that "replacing multiple strings" part: First, as suggested by Martijn, you can create a dictionary mapping keys to replacements, using dict and zip.
>>> k = ["hi", "bob"]
>>> r = ["ok", "bye"]
>>> d = dict(zip(k, r))
Now, one way to replace all those keys at once would be to use a regular expression, being a disjunction of all those keys, i.e. "hi|bob" in your example, and using re.sub with a replacement function, looking up the respective key in that dictionary.
>>> import re
>>> re.sub('|'.join(k), lambda m: d[m.group()], "hi, how are you bob")
'ok, how are you bye'
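One caveat worth mentioning: if any keyword could contain regex metacharacters, escape the keys before building the pattern, for example:
import re

k = ["hi", "bob"]
r = ["ok", "bye"]
d = dict(zip(k, r))

pattern = '|'.join(map(re.escape, k))  # escape keys so they match literally
print(re.sub(pattern, lambda m: d[m.group()], "hi, how are you bob"))
# ok, how are you bye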
Alternatively, you can just use a loop to replace each key-replacement pair one after the other:
s = "hi, how are you bob"
for (x, y) in zip(k, r):
s = s.replace(x, y)
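Printing s afterwards gives the expected result:
print(s)  # ok, how are you bye
Note that the pairs are applied one after the other, so if a replacement string itself happens to contain a later keyword it will be replaced again; the single-pass regex version above avoids that.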
I want to generate an ordered list of the least common words within a large body of text, with the least common word appearing first along with a value indicating how many times it appears in the text.
I scraped the text from some online journal articles, then simply assigned and split:
article_one = """ large body of text """.split()
=> ("large","body", "of", "text")
Seems like a regex would be appropriate for the next steps, but being new to programming I'm not well versed-
If the best answer includes a regex, could someone point me to a good regex tutorial other than pydoc?
How about a shorter/simpler version with a defaultdict? Counter is nice but needs Python 2.7; this works from 2.5 and up :)
import collections
counter = collections.defaultdict(int)
article_one = """ large body of text """
for word in article_one.split():
    counter[word] += 1
print sorted(counter.iteritems(), key=lambda x: x[::-1])
Finding the least common elements in a list: according to the Counter class in the collections module,
c.most_common()[:-n-1:-1] # n least common elements
So the code for the least common element in a list is:
from collections import Counter
Counter( mylist ).most_common()[:-2:-1]
And the two least common elements are:
from collections import Counter
Counter( mylist ).most_common()[:-3:-1]
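As a quick sanity check with a made-up list (the list is only for illustration):
from collections import Counter

mylist = ['a', 'a', 'a', 'b', 'b', 'c']

# most_common() returns [('a', 3), ('b', 2), ('c', 1)];
# the reversed slice takes items from the least-common end.
print(Counter(mylist).most_common()[:-3:-1])  # [('c', 1), ('b', 2)]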
This uses a slightly different approach but it appears to suit your needs. Uses code from this answer.
#!/usr/bin/env python
import operator
import string
article_one = """A, a b, a b c, a b c d, a b c d efg.""".split()
wordbank = {}
for word in article_one:
    # Strip word of punctuation and capitalization
    word = word.lower().strip(string.punctuation)
    if word not in wordbank:
        # Create a new dict key if necessary
        wordbank[word] = 1
    else:
        # Otherwise, increment the existing key's value
        wordbank[word] += 1
# Sort dict by value
sortedwords = sorted(wordbank.iteritems(), key=operator.itemgetter(1))
for word in sortedwords:
    print word[1], word[0]
Outputs:
1 efg
2 d
3 c
4 b
5 a
Works in Python >= 2.4, and Python 3+ if you parenthesize the print statement at the bottom and change iteritems to items.
Ready-made answer from the mothership.
# From the official documentation ->>
>>> # Tally occurrences of words in a list
>>> from collections import Counter
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
... cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
## ^^^^--- from the standard documentation.
>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
>>> import heapq
>>> from operator import itemgetter
>>> def least_common(adict, n=None):
.....:     if n is None:
.....:         return sorted(adict.iteritems(), key=itemgetter(1), reverse=False)
.....:     return heapq.nsmallest(n, adict.iteritems(), key=itemgetter(1))
Obviously adapt to suit :D
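Since iteritems is Python 2 only, here is a self-contained Python 3 sketch of the same idea (my adaptation, not part of the quoted documentation):
import heapq
from collections import Counter
from operator import itemgetter

def least_common(counts, n=None):
    # All items sorted by ascending count, or just the n smallest.
    if n is None:
        return sorted(counts.items(), key=itemgetter(1))
    return heapq.nsmallest(n, counts.items(), key=itemgetter(1))

cnt = Counter(['red', 'blue', 'red', 'green', 'blue', 'blue'])
print(least_common(cnt, 2))  # [('green', 1), ('red', 2)]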
If you need a fixed number of least-common words, e.g., the 10 least common, you probably want a solution using a counter dict and a heapq, as suggested by sotapme's answer (with WoLpH's suggestion) or WoLpH's answer:
wordcounter = collections.Counter(article_one)
leastcommon = heapq.nsmallest(10, wordcounter.items(), key=operator.itemgetter(1))
However, if you need an unbounded number of them, e.g., all words with fewer than 5 appearances, which could be 6 in one run and 69105 in the next, you might be better off just sorting the list:
wordcounter = collections.Counter(article_one)
allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
leastcommon = itertools.takewhile(lambda x: x[1] < 5, allwords)
Sorting takes longer than heapifying, but extracting the first M elements is a lot faster with a list than a heap. Algorithmically, the difference is just some log N factors, so the constants are going to be important here. So the best thing to do is test.
Taking my code at pastebin, and a file made by just doing cat reut2* >reut2.sgm on the Reuters-21578 corpus (without processing it to extract the text, so this is obviously not very good for serious work, but should be fine for benchmarking, because none of the SGML tags are going to be in the least common…):
$ python leastwords.py reut2.sgm # Apple 2.7.2 64-bit
heap: 32.5963380337
sort: 22.9287009239
$ python3 leastwords.py reut2.sgm # python.org 3.3.0 64-bit
heap: 32.47026552911848
sort: 25.855643508024514
$ pypy leastwords.py reut2.sgm # 1.9.0/2.7.2 64-bit
heap: 23.95291996
sort: 16.1843900681
I tried various ways to speed up each of them (including: takewhile around a genexp instead of a loop around yield in the heap version, popping optimistic batches with nsmallest and throwing away any excess, making a list and sorting in place, decorate-sort-undecorate instead of a key, partial instead of lambda, etc.), but none of them made more than 5% improvement (and some made things significantly slower).
At any rate, these are closer than I expected, so I'd probably go with whichever one is simpler and more readable. But I think sort beats heap there, as well, so…
Once again: If you just need the N least common, for reasonable N, I'm willing to bet without even testing that the heap implementation will win.