Count frequency of words in a string - python

I need to count the number of words in a sentence. I do it with
word_matrix[i][j] = sentences[i].count([*words_dict][j])
But it also counts when a word is contained in another word, for example 'in' is contained in 'interactive'. How can I avoid this?

You could use collections.Counter for this:
from collections import Counter
s = 'This is a sentence'
Counter(s.lower().split())
# Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})
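Because split() compares whole tokens, 'in' is never matched inside 'interactive'. To rebuild the question's word_matrix from these counters, here is a small sketch with hypothetical data:
from collections import Counter

sentences = ['this is interactive', 'is this in here']  # hypothetical examples
words_dict = {'in': 0, 'is': 0, 'this': 0}              # stands in for the question's dict

counts = [Counter(s.lower().split()) for s in sentences]
# a Counter returns 0 for missing keys, so no special-casing is needed
word_matrix = [[c[w] for w in words_dict] for c in counts]
print(word_matrix)  # [[0, 1, 1], [1, 1, 1]]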

You can just do this:
sentence = 'this is a test sentence'
word_count = len(sentence.split(' '))
In this case, word_count would be 5.
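One caveat, since the snippet splits on a literal space: repeated spaces produce empty strings, which inflate the count, while split() with no argument collapses any run of whitespace. A quick illustration:
sentence = 'this  is a  test sentence'
print(len(sentence.split(' ')))  # 7 - empty strings from the doubled spaces are counted
print(len(sentence.split()))     # 5 - whitespace runs are collapsed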

Use split() to tokenise the words of the statement; then, if a word already exists in the dict, increment its count by one, otherwise add the word with a count of one:
paragraph = 'Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been'
words = paragraph.split()
word_count = {}
for i in words:
    if i in word_count:
        word_count[i] += 1
    else:
        word_count[i] = 1
print(word_count)
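The same logic can be written more compactly with dict.get, which returns a default for unseen keys; a small equivalent sketch:
word_count = {}
for i in words:
    word_count[i] = word_count.get(i, 0) + 1  # 0 is the default for unseen words
print(word_count)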

Depending on the situation, the most efficient solution would be collections.Counter, but you will miss all the words that have a symbol attached:
i.e. in will be different from interactive (as you want), but it will also be different from in:.
An alternative solution that handles this case could be counting the matches of a RegEx:
import re
my_count = re.findall(r"(?:\s|^)({0})(?:[\s.,;:]|$)".format([*words_dict][j]), sentences[i])
print(len(my_count))
What is the RegEx doing?
For a given word, you match:
the same word preceded by a space or the start of the line ((?:\s|^))
and followed by a space, a dot, a comma, any other symbol in the square brackets, or the end of the line ((?:[\s.,;:]|$))
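As a quick check of the pattern, take a hypothetical sentence where 'in' appears on its own, inside 'interactive', and before a colon:
import re

sentence = 'in the interactive session, type in: exit'
# matches 'in' alone or followed by punctuation, but not inside 'interactive'
matches = re.findall(r"(?:\s|^)({0})(?:[\s.,;:]|$)".format('in'), sentence)
print(len(matches))  # 2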

Related

How to correctly count the occurrences of a given word in a string without counting the word that is a substring of a different word in Python?

I want to calculate the occurrences of a given word in an article. I tried to use the split method to cut the article into pieces and calculate the length, like this:
def get_occur(str, word):
    lst = str.split(word)
    return len(lst) - 1
But the problem is that I will over-count whenever the word is a substring of another word. For example, I only want to count the occurrences of "sad" in the sentence "I am very sad and she is a saddist". The count should be one, but because "sad" is part of "saddist", I count it accidentally. If I use " sad ", I will miss words at the start and end of sentences. Plus, I am dealing with a huge number of articles, so it is most desirable that I don't have to compare each word individually. How can I address this? Much appreciated.
You can use regular expressions:
import re
def count(text, pattern):
    return len(re.findall(rf"\b{pattern}\b", text, flags=re.IGNORECASE))
\b marks word boundaries and the passed flag makes the matching case insensitive:
>>> count("Sadly, the SAD man is sad.", "sad")
2
If you want to only count lower-case occurrences, just omit the flag.
As mentioned by @schwobaseggl in the comments, this will miss a word that sits right before a comma, and there may be other such cases, so I have updated the answer to tokenize properly:
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
This will give you a list of words. Now use the code below:
count = 0
for word in words:
    if word.lower() == 'sad':  # .lower() makes the comparison case-insensitive
        count += 1
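If you need counts for every word rather than just 'sad', the same token list feeds collections.Counter directly; a small sketch building on the snippet above:
from collections import Counter
from nltk.tokenize import word_tokenize

counts = Counter(w.lower() for w in word_tokenize(text))
print(counts['sad'])  # same case-insensitive count as the loop above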

Creating a mapper that finds the capitalized words in a text

Implement filescounter, which takes an arbitrary string and returns the number of capitalized words in that string, including the first and last ones.
def filescounter(s):
    sr = 0
    for words in text:
        #...
    return sr
I'm stuck on how to go about this.
Split the text on whitespace then iterate through the words:
def countCapitalized(text):
    count = 0
    for word in text.split():
        if word.isupper():
            count += 1
    return count
If, by capitalized, you mean only the first letter needs to be capitalized, then you can replace word.isupper() with word[0].isupper().
Use this:
def count_upper_words(text):
    return sum(1 for word in text.split() if word.isupper())
Explanation:
split() chops text into words on any whitespace, including spaces and newlines
summing over the generator expression is more concise than an explicit for-loop and reads nicer
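To see how the two predicates from the answers above differ, here is a small hypothetical comparison:
text = "The NASA rover landed"
print(count_upper_words(text))                         # 1 -> only 'NASA' is fully upper-case
print(sum(1 for w in text.split() if w[0].isupper()))  # 2 -> 'The' and 'NASA'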

Regex to find the names in every sentence using python

Hi, I am new to regex and stuck on this question.
Q: Identify all the words that look like names in the sentence, i.e. those which are capitalized but aren't the first word of a sentence.
sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
Here's what I did, but I am not getting any output (the aim is to exclude the text from the beginning until the first capitalized word that is a name):
p = re.compile(r'[^A-Z]\w+[A-Z]\w+')
m = p.finditer(sentence)
for m in m:
    print(m)
Assuming there's always only one space after a dot before another sentence begins, you can use a negative lookbehind pattern to exclude names that are preceded by a dot and a space, and another negative lookbehind pattern to exclude the beginning of the string. Also use \b to ensure that a capital letter is matched at a word boundary:
re.findall(r'(?<!\. )(?<!^)\b[A-Z]\w*', sentence)
This returns:
['Harry', 'Susy']
You can use a positive lookbehind to look for a capitalization pattern on a word that is not at the beginning of a sentence.
Like so:
>>> sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
>>> re.findall(r'(?<=[a-z,][ ])([A-Z][a-z]*)', sentence)
['Harry', 'Susy']
Imo best done with nltk:
from nltk import sent_tokenize, word_tokenize
sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
for sent in sent_tokenize(sentence):
    words = word_tokenize(sent)
    possible_names = [word for word in words[1:] if word[0].isupper()]
    print(possible_names)
Or - if you're into comprehensions:
names = [word
         for sent in sent_tokenize(sentence)
         for word in word_tokenize(sent)[1:]
         if word[0].isupper()]
Which will yield
['Harry', 'Susy']
You're overwriting your m variable. Try this:
p = re.compile(r'[^A-Z]\w+[A-Z]\w+')
for m in p.finditer(sentence):
    print(m)

Python add space

We have recurring words like Mr and Mrs in a text. We would like to add a space before and after the keywords Mr and Mrs. But the pattern for Mr also matches inside Mrs. Please assist in solving the query:
Input:
Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, what is your call about? Mrs.Pamela, I have a question for you.
import re

s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
words = ("Mr", "Mrs")

def add_spaces(string, words):
    for word in words:
        # pattern to match any non-space char before the word
        patt1 = re.compile(r'\S{}'.format(word))
        matches = re.findall(patt1, string)
        for match in matches:
            non_space_char = match[0]
            string = string.replace(match, '{} {}'.format(non_space_char, word))
        # pattern to match any non-space char after the word
        patt2 = re.compile(r'{}\S'.format(word))
        matches = re.findall(patt2, string)
        for match in matches:
            non_space_char = match[-1]
            string = string.replace(match, '{} {}'.format(word, non_space_char))
    return string

print(add_spaces(s, words))
Present Output:
Hi This is Mr .Sam. Hello, this is Mr sPamela. Mr .Sam, what is your call about? Mr s.Pamela, I have a question for you.
Expected Output:
Hi This is Mr .Sam. Hello, this is Mrs Pamela. Mr .Sam, what is your call about? Mrs .Pamela, I have a question for you.
You didn't constrain what comes after the letter 'r', so your pattern will happily match the 'Mr' at the start of 'Mrs' (the \S in the second pattern then matches the 's'); that's why your code adds a space in the middle of Mrs.
A better pattern would be r'\bMr\b'
\b matches word boundaries; see the docs for further explanation: https://docs.python.org/3/library/re.html
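Building on that advice, here is a minimal sketch of the whole fix, assuming the exact input from the question. Only the leading \b is anchored, because there is no word boundary inside 'MrsPamela'; listing Mrs before Mr in the alternation keeps the longer title from being split:
import re

s = "Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, what is your call about? Mrs.Pamela, I have a question for you."
# Mrs is tried before Mr, so 'MrsPamela' is not broken into 'Mr sPamela'
spaced = re.sub(r'\b(Mrs|Mr)', r' \1 ', s)
print(re.sub(r' +', ' ', spaced).strip())
# Hi This is Mr .Sam. Hello, this is Mrs Pamela. Mr .Sam, what is your call about? Mrs .Pamela, I have a question for you.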
I do not have very extensive knowledge of the re module, but I came up with a solution which is extendable to any number of words and strings and which works correctly (tested in Python 3), although it is verbose and you may find something more optimized and much more concise.
On the other hand, the procedure is not very difficult to understand:
To begin with, the program orders the words list by descending length.
Then, it finds the matches of the longer words first and takes note of the sections where matches were already made, in order not to change them again. (Note that this introduces a limitation, but a necessary one: the program cannot know whether you want to allow a word in words to be contained in another. It does not affect your case, anyway.)
When it has taken note of all matches (in the non-blocked parts of the string) for a word, it adds the corresponding spaces and corrects the blocked indexes (they have moved due to the insertion of the spaces).
Finally, it trims multiple spaces down to single ones.
Note: I used a list for the variable words instead of a tuple
import re

def add_spaces(string, words):
    # Get the length of the longest word
    max_length = 0
    for word in words:
        if len(word) > max_length:
            max_length = len(word)
    print("max_length =", max_length)
    # Order words by descending length
    ordered_words = []
    i = max_length
    while i > 0:
        for word in words:
            if len(word) == i:
                ordered_words.append(word)
        i -= 1
    print("ordered_words =", ordered_words)
    # Iterate over words, adding spaces at each match and "blocking" the
    # matched section so it is not modified again
    blocked_sections = []
    for word in ordered_words:
        matches = [match.start() for match in re.finditer(word, string)]
        print("matches of", word, "are:", matches)
        spaces_position_to_add = []
        for match in matches:
            blocked = False
            for blocked_section in blocked_sections:
                if match >= blocked_section[0] and match <= blocked_section[1]:
                    blocked = True
            if not blocked:
                # Block the section and store the position to modify later
                blocked_sections.append([match, match + len(word)])
                spaces_position_to_add.append([match, match + len(word) + 1])
        # Add the spaces and update the existing blocked_sections
        spaces_added = 0
        for new_space in spaces_position_to_add:
            # Add a space before and after the word
            string = string[:new_space[0] + spaces_added] + " " + string[new_space[0] + spaces_added:]
            spaces_added += 1
            string = string[:new_space[1] + spaces_added] + " " + string[new_space[1] + spaces_added:]
            spaces_added += 1
            # Update the existing blocked_sections
            for blocked_section in blocked_sections:
                if new_space[0] < blocked_section[0]:
                    blocked_section[0] += 2
                    blocked_section[1] += 2
    # Trim extra spaces
    string = re.sub(' +', ' ', string)
    return string

### MAIN ###
if __name__ == '__main__':
    s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
    words = ["Mr", "Mrs"]
    print(s)
    print(add_spaces(s, words))

Search through a list of strings for a word that has a variable character

Basically, I start with the word "brand", replace a single character in it with an underscore, and try to find all words that match the remaining characters. For example:
"b_and" would return: "band", "brand", "bland" .... etc.
I started with using re.sub to substitute the underscore for a character. But I'm really lost on where to go next. I only want words that differ by that single underscore position, either with the underscore removed or with it replaced by a letter. For example, if the word "under" were run through the list, I wouldn't want it to return "understood" or "thunder", just single-character differences. Any ideas would be great!
I tried replacing the character with every letter in the alphabet first, then checking whether the resulting word is in the dictionary, but that took such a long time that I really want to know if there's a faster way:
from itertools import chain
import re, string

dictionary = open("Scrabble.txt").read().split('\n')

# after replacing a character of the word with "_", find the dictionary words that match the pattern
new = []
for letter in string.ascii_lowercase:
    underscore = re.sub('_', letter, word)
    if underscore in dictionary:
        new.append(underscore)
if new == []:
    pass
else:
    return new
IIUC this should do it. I'm doing it outside a function so you have a working example, but it's straightforward to do it inside a function.
import re

string = 'band brand bland cat dand bant bramd branding blandisher'
word = 'brand'
new = []
for n, letter in enumerate(word):
    pattern = word[:n] + r'\w?' + word[n+1:]
    new.extend(re.findall(pattern, string))
new = list(set(new))
Output:
['bland', 'brand', 'bramd', 'band']
Explanation:
We're using regex to do what you're looking for. In every iteration we take one letter out of "brand" and make the pattern match any word with any (or no) character in that position. So it'll look for:
_rand, b_and, br_nd, bra_d, bran_
For the case of "b_and" the pattern is b\w?and, which means: find a word with b, then any character may or may not appear, and then 'and'.
Then it adds to the list all words that match.
Finally I remove duplicates with list(set(new))
Edit: forgot to add the string variable.
Here's a version of Juan C's answer that's a bit more Pythonic
import re
dictionary = open("Scrabble.txt").read().split('\n')
pattern = "b_and" # change to what you need
pattern = pattern.replace('_', '.?')
pattern += '\\b'
matching_words = [word for word in dictionary if re.match(pattern, word)]
Edit: fixed the regex according to your comment, quick explanation:
pattern = "b_and"
pattern = pattern.replace('_', '.?') # pattern is now b.?and, .? matches any one character (or none at all)
pattern += '\\b' # \b prevents matching with words like "bandit" or words longer than "b_and"
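On the performance concern raised in the question: in dictionary scans the whole list on every lookup, so the letter-substitution approach is slow mainly because of the data structure, not the idea. A sketch of the set-based variant, assuming Scrabble.txt holds one word per line (fill_blank is a hypothetical helper name):
import string

words = set(open("Scrabble.txt").read().split('\n'))  # set membership is O(1)

def fill_blank(word):
    # try each letter, plus dropping the underscore entirely
    candidates = (word.replace('_', c) for c in list(string.ascii_lowercase) + [''])
    return [w for w in candidates if w in words]

print(fill_blank("b_and"))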
