Implement filescounter, which takes a string in any variety and returns the number of capitalized words in that string, inclusive of the last and first character.
def filescounter(s):
    sr=0
    for words in text:
        #...
    return sr
I'm stuck on how to go about this.
Split the text on whitespace then iterate through the words:
def countCapitalized(text):
    count = 0
    for word in text.split():
        if word.isupper():
            count += 1
    return count
If, by capitalized, you mean only the first letter needs to be capitalized, then you can replace word.isupper() with word[0].isupper().
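To make the two readings of "capitalized" concrete, here is a minimal sketch that supports both. The function name count_capitalized and the whole_word parameter are made up for illustration and are not part of the question.

```python
def count_capitalized(text, whole_word=False):
    """Count words that start with an uppercase letter.

    With whole_word=True, count only words written entirely in uppercase.
    """
    count = 0
    for word in text.split():
        if whole_word:
            if word.isupper():
                count += 1
        elif word[0].isupper():
            # split() never yields empty strings, so word[0] is safe
            count += 1
    return count


print(count_capitalized("Hello World from NASA"))        # 3
print(count_capitalized("Hello World from NASA", True))  # 1
```

Indexing word[0] is safe here only because split() without arguments drops empty strings.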
Use this:
def count_upper_words(text):
    return sum(1 for word in text.split() if word.isupper())
Explanation:
split() with no arguments splits the text into words on any run of whitespace (spaces, tabs, or newlines)
the generator expression passed to sum() is more compact than an explicit for-loop and avoids building an intermediate list
Related
I have a list, list_words_punc, which holds all the words in an input() obtained with split(). I then have another list, list_words, which should hold the same words but without their punctuation (i.e. .,?!). sentence is the input(). I want the program to go through every word in list_words_punc, keep only the characters that are letters, and append the result to my new list list_words; any punctuation is discarded. The problem is that if I use for s in l: if s.isalpha() and then append that to my new list, the letters get appended as separate entries instead of the words from sentence without punctuation. Is there any way to append the whole words?
list_words_punc=sentence.split()
list_words=[]
for l in list_words:
    for s in l:
        if s.isalpha():
            list_words.append(s)  # appends one letter at a time
Example just if I was unclear:
sentence="How, are you?"
list_words_punc=sentence.split()
list_words=[]
for l in list_words:
    for s in l:
        if s.isalpha():
            list_words.append(s)  # appends one letter at a time
I get:
["H","o","w"," ",...]
You could use regex to achieve this
re.findall() returns all non-overlapping matches of pattern in string, as a list of strings.
\w represents a single word character
\w+ means one or more of a word character
Hope you understood.
Example:
Code
import re
sentence="How, are you?"
list_words = re.findall(r'\w+', sentence)
print(list_words)
Output
['How', 'are', 'you']
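One thing to keep in mind is that \w also matches digits and the underscore, not only letters. A small sketch of the difference, using a made-up sentence and assuming only ASCII letters should count:

```python
import re

sentence = "How, are_you 2day?"

# \w+ keeps digits and underscores together with the letters
word_chars = re.findall(r"\w+", sentence)
print(word_chars)  # ['How', 'are_you', '2day']

# [A-Za-z]+ keeps only runs of ASCII letters
letters_only = re.findall(r"[A-Za-z]+", sentence)
print(letters_only)  # ['How', 'are', 'you', 'day']
```

Which pattern is right depends on whether tokens such as "2day" should survive as a single word.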
Basically, every call to append() adds a new element at the end of the list, so appending inside the character loop stores each letter as its own entry. Instead, build up each word first and append it once, something like this:
sentence="How, are you?"
list_words_punc=sentence.split()
new_words=[]
for l in list_words_punc:
    word=''
    for s in l:
        if s.isalpha():
            word+=s
        elif word!='':
            # a non-letter ends the current word
            new_words.append(word)
            word=''
    # store whatever is left once the token ends
    if word!='':
        new_words.append(word)
print(new_words)
So instead of appending character by character, you build each word in a temporary string and append it once it is complete. It is not the most efficient approach, but it works: new_words ends up holding the words without punctuation.
First of all, you are iterating over list_words, which is still empty at that point; you need to iterate over list_words_punc.
To answer the question: build each word by adding its letters to the end of a string, then append that string to the list.
sentence="How, are you?"
list_words_punc=sentence.split()
list_word=[]
for word in list_words_punc:
    s=""
    for c in word:
        if c.isalpha():s=s+c
    list_word.append(s)
print (list_word)
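The same idea can be written more compactly with str.join. This is only a sketch; the function name strip_punctuation is made up, and dropping tokens that contain no letters at all is an assumption about the desired behaviour.

```python
def strip_punctuation(sentence):
    """Split on whitespace and keep only the alphabetic characters of each word."""
    cleaned = ("".join(c for c in word if c.isalpha()) for word in sentence.split())
    # Drop tokens that consisted only of punctuation (they become empty strings)
    return [word for word in cleaned if word]


print(strip_punctuation("How, are you?"))  # ['How', 'are', 'you']
```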
I want to calculate the occurrences of a given word in an article. I tried to use split method to cut the articles into n pieces and calculate the length like this.
def get_occur(str, word):
    lst = str.split(word)
    return len(lst) - 1
But the problem is that I will also count the word if it is a substring of another word. For example, I only want to count the number of "sad" in this sentence "I am very sad and she is a saddist". It should be one, but because "sad" is part of "saddist", I will count it accidentally. If I use " sad ", I will omit words that are at the start and end of sentences. Plus, I am dealing with a huge number of articles, so ideally I don't want to compare each word. How can I address this? Much appreciated.
You can use regular expressions:
import re
def count(text, pattern):
    return len(re.findall(rf"\b{pattern}\b", text, flags=re.IGNORECASE))
\b marks word boundaries and the passed flag makes the matching case insensitive:
>>> count("Sadly, the SAD man is sad.", "sad")
2
If you want the match to be case-sensitive (for example, to count only lower-case "sad"), just omit the flag.
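Since the question mentions a huge number of articles, it may be worth compiling the pattern once and reusing it. Below is a rough sketch of that idea; make_counter and the articles list are made up for illustration, and re.escape is added on the assumption that the word should always be treated as literal text.

```python
import re


def make_counter(word, ignore_case=True):
    """Return a function that counts whole-word occurrences of `word`."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps characters such as "." or "+" in `word` literal
    compiled = re.compile(rf"\b{re.escape(word)}\b", flags)
    return lambda text: len(compiled.findall(text))


count_sad = make_counter("sad")
articles = ["I am very sad and she is a saddist", "Sadly, the SAD man is sad."]
print([count_sad(article) for article in articles])  # [1, 2]
```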
As mentioned by @schwobaseggl in the comments, this will miss the word before the comma, and there may be other such cases, so I have updated the answer.
from nltk.tokenize import word_tokenize
text = word_tokenize(text)
This will give you a list of words. Now use the below code
count = 0
for word in text:
    if (word.lower() == 'sad'): # .lower to make it case-insensitive
        count += 1
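If several different words need to be counted, tallying all of them in one pass may be simpler than looping once per word. Here is a sketch with collections.Counter that uses a plain regex instead of nltk; the pattern [A-Za-z']+ is a simplifying assumption about what counts as a word, and word_counts is a made-up name.

```python
import re
from collections import Counter


def word_counts(text):
    """Count every lower-cased word in `text` in a single pass."""
    # [A-Za-z']+ is a simplifying assumption about what counts as a word
    return Counter(re.findall(r"[A-Za-z']+", text.lower()))


counts = word_counts("I am very sad and she is a saddist. Sad, but true.")
print(counts["sad"])      # 2
print(counts["saddist"])  # 1
```

A Counter returns 0 for words it has never seen, so no extra check is needed before the lookup.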
Basically, I start with inserting the word "brand" where I replace a single character in the word with an underscore and try and find all words that match the remaining characters. For example:
"b_and" would return: "band", "brand", "bland" .... etc.
I started by using re.sub to substitute the underscore in the word, but I'm really lost on where to go next. I only want words that differ at this underscore, either by dropping the underscore or by replacing it with a letter. For example, if the word "under" were run through the list, I wouldn't want it to return "understood" or "thunder", just words with a single character difference. Any ideas would be great!
I tried replacing the character with every letter in the alphabet first, then checking whether that word is in the dictionary, but that took such a long time that I really want to know if there's a faster way.
import re,string
dictionary=open("Scrabble.txt").read().split('\n')

# find_matches is a placeholder name; the original snippet omitted its def line
def find_matches(word):
    #after replacing the word with "_", we find words in the dictionary that match the pattern
    new=[]
    for letter in string.ascii_lowercase:
        underscore=re.sub('_', letter, word)
        if underscore in dictionary:
            new.append(underscore)
    if new == []:
        pass
    else:
        return new
IIUC this should do it. I'm doing it outside a function so you have a working example, but it's straightforward to do it inside a function.
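import re

string = 'band brand bland cat dand bant bramd branding blandisher'
word='brand'
new=[]
for n,letter in enumerate(word):
    # make the n-th letter an optional word character; \b limits matches to whole words
    pattern=r'\b'+word[:n]+r'\w?'+word[n+1:]+r'\b'
    new.extend(re.findall(pattern,string))
new=list(set(new))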
Output:
['bland', 'brand', 'bramd', 'band']
Explanation:
We're using regex to do what you're looking for. In every iteration we take one letter out of "brand" and look for any word that matches. So it'll look for:
_rand, b_and, br_nd, bra_d, bran_
For the case of "b_and" the pattern is b\w?and, which means: find a word with b, then any character may or may not appear, and then 'and'.
Then it adds to the list all words that match.
Finally I remove duplicates with list(set(new))
Edit: forgot to add the string variable.
Here's a version of Juan C's answer that's a bit more Pythonic
import re
dictionary = open("Scrabble.txt").read().split('\n')
pattern = "b_and" # change to what you need
pattern = pattern.replace('_', '.?')
pattern += '\\b'
matching_words = [word for word in dictionary if re.match(pattern, word)]
Edit: fixed the regex according to your comment, quick explanation:
pattern = "b_and"
pattern = pattern.replace('_', '.?') # pattern is now b.?and, .? matches any one character (or none at all)
pattern += '\\b' # \b prevents matching with words like "bandit" or words longer than "b_and"
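If speed is the main concern, one alternative to scanning the whole dictionary with a regex is to generate the 27 possible candidates and intersect them with a set of dictionary words. This is a sketch of that swapped-in technique, not the approach of either answer above; match_blank and sample_words are made up, and the short list stands in for the contents of Scrabble.txt.

```python
import string


def match_blank(pattern, words):
    """Return dictionary words that fit `pattern`, where "_" is one blank.

    The blank may be filled by any lowercase letter or dropped entirely.
    """
    # Build the 27 possible candidates, then intersect with the word set
    candidates = {pattern.replace("_", letter, 1) for letter in string.ascii_lowercase}
    candidates.add(pattern.replace("_", "", 1))
    return sorted(candidates & set(words))


# Stand-in for the Scrabble word list, which is not available here
sample_words = ["band", "brand", "bland", "bandit", "branding", "cat"]
print(match_blank("b_and", sample_words))  # ['band', 'bland', 'brand']
```

For repeated lookups it would make sense to build the set once outside the function, since membership tests on a set do not scan the whole word list the way `in` on a list does.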
I have tried to build up my first iterator for words in a text:
def words(text):
    regex = re.compile(r"""(\w(?:[\w']*\w)?|\S)""", re.VERBOSE)
    for line in text:
        words = regex.findall(line)
        if words:
            for word in words:
                yield word
If I only use the line words = regex.findall(line), I get a list of all the words, but if I use the function and call next(), it returns the text character by character.
Any idea what I am doing wrong?
I believe that you are passing a string to text because that is the only way it would result in all characters. So, given that, I updated the code to accommodate a string (all I did was remove one of the loops):
import re

def words(text):
    regex = re.compile(r"""(\w(?:[\w']*\w)?|\S)""", re.VERBOSE)
    words = regex.findall(text)
    for word in words:
        yield word

print(list(words("I like to test strings")))
Is text a list of strings? If it's one string (even one containing newlines), that explains the result...
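If the function should cope with both cases, one option is to wrap a single string in a list before looping. A minimal sketch, with the made-up name iter_words and the same regex as in the question:

```python
import re

WORD_RE = re.compile(r"\w(?:[\w']*\w)?|\S")


def iter_words(text):
    """Yield tokens from a single string or from an iterable of lines."""
    # Iterating a bare string yields characters, so wrap it in a list first
    lines = [text] if isinstance(text, str) else text
    for line in lines:
        for match in WORD_RE.finditer(line):
            yield match.group()


print(list(iter_words("I like to test strings")))
# ['I', 'like', 'to', 'test', 'strings']
print(list(iter_words(["don't stop", "now!"])))
# ["don't", 'stop', 'now', '!']
```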
So I want to write a regex that also matches a word that is one character shorter than the pattern. For example:
wordList = ['inherit', 'inherent']
for word in wordList:
    if re.match('^inhe....', word):
        print(word)
In theory, it would print both inherit and inherent, but I can only get it to print inherent. So how can I match a word that is one letter short without just erasing one of the dots (.)?
(Edited)
For matching only inherent, you could use .{4}:
re.match('^inhe.{4}', word)
Or ....$:
re.match('^inhe....$', word)
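If the goal is instead to match both words, as the question asks, a bounded quantifier such as .{3,4} is one way to allow "one letter short". A small sketch; the two extra entries in word_list are added only to show what gets rejected:

```python
import re

word_list = ["inherit", "inherent", "inhere", "inheritance"]

# "inhe" followed by three or four more characters, then end of string
pattern = re.compile(r"^inhe.{3,4}$")
matches = [word for word in word_list if pattern.match(word)]
print(matches)  # ['inherit', 'inherent']
```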
A regex may not be the best tool here, if you just want to know if word Y starts with the first N-1 letters of word X, do this:
if Y.startswith( X[:-1] ):
    # Do whatever you were trying to do.
X[:-1] gets all but the last character of X (or the empty string if X is the empty string).
Y.startswith( 'blah' ) returns True if Y starts with 'blah'.
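Wrapped in a small helper, the check might look like the sketch below. The name matches_all_but_last and the sample words are made up for illustration.

```python
def matches_all_but_last(x, y):
    """Return True if `y` starts with every character of `x` except the last."""
    return y.startswith(x[:-1])


print(matches_all_but_last("brand", "branch"))  # True
print(matches_all_but_last("brand", "bread"))   # False
print(matches_all_but_last("", "anything"))     # True
```

The last call shows the edge case mentioned above: an empty X gives an empty prefix, and every string starts with the empty string.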