How to count only the words that I want? - python

I want to count only words of a dictionary.
For example :
There is a text :
Children can bye (paid) by credit card.
I want to count just paid.
But my code counts (paid).
import re, sys
d = {}
m = "children can bye (paid) by credit card."
n = m.split()
for i in n:
    d[i] = 0
for j in n:
    d[j] = d[j] + 1
Is there any advice ?

You can split the string with the following regex to split by nonword chars:
import re
n = re.split(r'\W+', m)
# Note: because m ends with '.', the last element of n is an empty string.
You can check the syntax here.

You just need to remove the punctuation from your individual tokens. Assuming you want to remove all the punctuation, take a look at the string module. Then (for example), you can go through each token and remove the punctuation. You can do this with one list comprehension:
import string
words = [''.join(ch for ch in token if ch not in string.punctuation)
         for token in m.split()]
All this code does is run through each character (ch) in each token (the results of m.split()). It allows all characters except it'll strip out any characters in string.punctuation. Of course if you want a different set of characters (say, maybe you want to allow apostrophes), you can just define that set of characters and use that instead.
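Putting the pieces together, a minimal sketch of the full count might look like the following. It assumes every character in string.punctuation should be dropped, and it uses collections.Counter in place of the hand-rolled dictionary from the question. The function name count_words is made up for illustration.

```python
import string
from collections import Counter

def count_words(text):
    """Count whitespace-separated tokens after stripping punctuation."""
    # Assumption: every character in string.punctuation should be removed.
    table = str.maketrans('', '', string.punctuation)
    tokens = (token.translate(table) for token in text.split())
    # Skip tokens that consisted of nothing but punctuation.
    return Counter(token for token in tokens if token)

m = "children can bye (paid) by credit card."
counts = count_words(m)
print(counts["paid"])    # the bare word is counted
print(counts["(paid)"])  # the parenthesised form no longer exists as a key
```

A Counter returns 0 for keys it has never seen, so looking up a word that does not occur is safe.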

Related

Python - Strip Punctuation from list of words using re.sub and string.punctuation

I am trying to remove punctuation from the list of string.punctuation in a list of words. The issue is that I do not know where to strip the punctuation as I am dealing with a dictionary within a dictionary. My code is below
from collections import Counter
import re
import string
comments = []
ar_lst = []
for review in reviews:
    ar_dict = {}
    ar_dict["Comments"] = review["Content"]
    ar_dict["Author"] = review["Author"]
    ar_lst.append(ar_dict)
for review in ar_lst:
    # TODO: (1) Get the number of words in the current review variable.
    punc = string.punctuation
    comments = review['Comments'].lower()
    author = review['Author']
    unique_words_count = set()
    all_words = comments.split(" ")
    for word in all_words:
        unique_words_count.add(word)
    # (2) Print the author's name and the number of (unique) words in his/her review
    print(f'{author} used {len(unique_words_count)} unique words.')
The output I am getting is below
But I need the output to look like this
The reason the # of words is off is due to the fact that I can't figure out where to insert the re.sub() expression. I tried putting it into the second 'for-loop' as
comments = re.sub(punc, '', review['Comments']).lower()
But this did not work. Any help would be greatly appreciated!
Also, this is a snippet of what the dictionary looks like
You can either strip the punctuation from comments before you split it into words (preferable), or strip it from each word inside the for word in all_words: loop. string.punctuation is a plain string, !"#$%&'..., but re.sub() needs a character class. Escape the string first, because ], \, ^ and - are all special inside a class, and then wrap it in brackets:
punc = '[%s]' % re.escape(string.punctuation)
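As a rough sketch of the first option (cleaning the comment text before splitting it), something like the following could work. The reviews list here is invented sample data standing in for the structure in the question, and unique_word_count is a hypothetical helper name.

```python
import re
import string

# Hypothetical sample data standing in for the `reviews` list in the question.
reviews = [
    {"Author": "Ann", "Content": "Great hotel. Great staff, great view!"},
    {"Author": "Bob", "Content": "Too noisy; would not stay again."},
]

# Character class that matches any single punctuation character.
punc = "[%s]" % re.escape(string.punctuation)

def unique_word_count(text):
    """Lower-case the text, drop punctuation, and count distinct words."""
    cleaned = re.sub(punc, "", text.lower())
    # split() with no argument also collapses repeated whitespace.
    return len(set(cleaned.split()))

for review in reviews:
    print(f"{review['Author']} used {unique_word_count(review['Content'])} unique words.")
```

Cleaning once per review keeps the inner loop free of regex calls, which is why stripping before splitting is usually the simpler choice.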

How to make shortcut of first letters of any text?

I need to write a function that returns the first letters (and make it uppercase) of any text like:
shortened = shorten("Don't repeat yourself")
print(shortened)
Expected output:
DRY
and:
shortened = shorten("All terrain armoured transport")
print(shortened)
Expected output:
ATAT
Use list comprehension and join
shortened = "".join([x[0] for x in text.title().split(' ') if x])
Using regex you can match all characters except the first letter of each word, replace them with an empty string to remove them, then capitalize the resulting string:
import re
def shorten(sentence):
    return re.sub(r"\B[\S]+\s*","",sentence).upper()
print(shorten("Don't repeat yourself"))
Output:
DRY
text = 'this is a test'
output = ''.join(char[0] for char in text.title().split(' '))
print(output)
TIAT
Let me explain how this works.
My first step is to capitalize the first letter of each word
text.title()
Now I want to be able to separate each word by the space in between, this will become a list
text.title().split(' ')
With that I'd end up with 'This','Is','A','Test' so now I obviously only want the first character of each word in the list
for word in text.title().split(' '):
    print(word[0]) # T I A T
Now I can lump all that into something called list comprehension
output = [char[0] for char in text.title().split(' ')]
# ['T','I','A','T']
I can use ''.join() to combine them together, I don't need the [] brackets anymore because it doesn't need to be a list
output = ''.join(char[0] for char in text.title().split(' '))
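For completeness, here is one way the steps above could be wrapped into the shorten function the question asks for. This is only a sketch: it upper-cases the joined initials instead of calling title() first, and it uses split() with no argument so that repeated spaces do not produce empty strings.

```python
def shorten(text):
    """Return the upper-cased first character of each whitespace-separated word."""
    # split() with no argument ignores leading, trailing and repeated
    # whitespace, so every item has at least one character to index.
    return "".join(word[0] for word in text.split()).upper()

print(shorten("Don't repeat yourself"))
print(shorten("All terrain armoured transport"))
```

The two calls print DRY and ATAT, matching the expected output in the question.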

Find semordnilap(reverse anagram) of words in a string

I'm trying to take a string input, like a sentence, and find all the words that have their reverse words in the sentence. I have this so far:
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
def semordnilap(s):
    s = s.lower()
    b = "!@#$,"
    for char in b:
        s = s.replace(char,"")
    s = s.split(' ')
    dict = {}
    index=0
    for i in range(0,len(s)):
        originalfirst = s[index]
        sortedfirst = ''.join(sorted(str(s[index])))
        for j in range(index+1,len(s)):
            next = ''.join(sorted(str(s[j])))
            if sortedfirst == next:
                dict.update({originalfirst:s[j]})
        index+=1
    print (dict)
semordnilap(s)
So this works for the most part, but if you run it, you can see that it's also pairing "he" and "he" as an anagram, but it's not what I am looking for. Any suggestions on how to fix it, and also if it's possible to make the run time faster, if I was to input a large text file instead.
You could split the string into a list of words and then compare lowercase versions of all combinations where one of the pair is reversed. Following example uses re.findall() to split the string into a list of words and itertools.combinations() to compare them:
import itertools
import re
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
words = re.findall(r'\w+', s)
pairs = [(a, b) for a, b in itertools.combinations(words, 2) if a.lower() == b.lower()[::-1]]
print(pairs)
# OUTPUT
# [('was', 'saw'), ('stressed', 'desserts'), ('stop', 'pots')]
EDIT: I still prefer the solution above, but per your comment regarding doing this without importing any packages, see below. However, note that str.translate() used this way may have unintended consequences depending on the nature of your text (like stripping @ from email addresses). In other words, you may need to deal with punctuation more carefully than this. Also, I would typically import string and use string.punctuation rather than the literal string of punctuation characters used below, but avoided that in keeping with your request to do this without imports.
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
# Python 3: str.maketrans('', '', chars) builds a table that deletes chars
words = s.translate(str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')).split()
length = len(words)
pairs = []
for i in range(length - 1):
    for j in range(i + 1, length):
        if words[i].lower() == words[j].lower()[::-1]:
            pairs.append((words[i], words[j]))
print(pairs)
# OUTPUT
# [('was', 'saw'), ('stressed', 'desserts'), ('stop', 'pots')]
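On the run-time part of the question: both versions above compare every pair of words, which grows quadratically with the length of the text. A possible alternative, sketched here under the assumption that each pair only needs to be reported once and that palindromes should be skipped, is to put the words in a set and look up each word's reverse directly. The name find_semordnilaps is hypothetical.

```python
import re

def find_semordnilaps(text):
    """Return sorted (word, reversed_word) pairs where both occur in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    pairs = set()
    for word in words:
        reverse = word[::-1]
        # word < reverse keeps one ordering per pair and excludes palindromes,
        # so a repeated word such as "he" is never paired with itself.
        if reverse in words and word < reverse:
            pairs.add((word, reverse))
    return sorted(pairs)

s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
print(find_semordnilaps(s))
```

Set membership tests take roughly constant time, so the work grows about linearly with the number of distinct words.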

Counting punctuation in text using Python and regex

I am trying to count the number of times punctuation characters appear in a novel. For example, I want to find the occurrences of question marks and periods along with all the other non alphanumeric characters. Then I want to insert them into a csv file. I am not sure how to do the regex because I don't have that much experience with python. Can someone help me out?
texts=string.punctuation
counts=dict(Counter(w.lower() for w in re.findall(r"\w+", open(cwd+"/"+book).read())))
writer = csv.writer(open("author.csv", 'a'))
writer.writerow([counts.get(fieldname,0) for fieldname in texts])
In [1]: from string import punctuation
In [2]: from collections import Counter
In [3]: counts = Counter(open('novel.txt').read())
In [4]: punctuation_counts = {k:v for k, v in counts.items() if k in punctuation}
from string import punctuation
from collections import Counter
with open('novel.txt') as f: # closes the file for you which is important!
    c = Counter(c for line in f for c in line if c in punctuation)
This also avoids loading the whole novel into memory at once.
Btw this is what string.punctuation looks like:
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You may want to add or detract symbols from here depending on your needs.
Also, Counter defines a __missing__ method which simply returns 0. So instead of converting it into a dictionary and then calling .get(x, 0), just leave it as a Counter and access it like c[x]; if the key doesn't exist, its count is 0. I'm not sure why everybody has the sudden urge to downgrade all their Counters into dicts just because of the scary looking Counter([...]) you see when you print one, when in fact Counters are dictionaries too and deserve respect.
writer.writerow([counts.get(c, 0) for c in punctuation])
If you leave your counter you can just do this:
writer.writerow([counts[c] for c in punctuation])
and that was much easier.
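To tie the counting and the CSV step together, here is a small sketch. It uses in-memory io.StringIO objects in place of the novel and the output file so that it runs as is; in practice you would pass real file handles. The sample text and the helper names count_punctuation and write_counts are made up for illustration.

```python
import csv
import io
from collections import Counter
from string import punctuation

def count_punctuation(lines):
    """Count punctuation characters across an iterable of text lines."""
    return Counter(ch for line in lines for ch in line if ch in punctuation)

def write_counts(counts, out_file):
    """Write a header row of punctuation characters, then one row of counts."""
    writer = csv.writer(out_file)
    writer.writerow(list(punctuation))
    # Indexing a Counter with a missing key gives 0, so no .get() is needed.
    writer.writerow([counts[ch] for ch in punctuation])

# In-memory stand-ins for the novel and the CSV file (hypothetical sample text).
novel = io.StringIO("Really? Yes, really.\nWell... maybe not!\n")
counts = count_punctuation(novel)
out = io.StringIO()
write_counts(counts, out)
print(counts["?"], counts["."], counts[";"])
```

The header row is an addition of this sketch; drop the first writerow call if you only want the counts, as in the original snippet.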
import re
def count_puncts(x):
    # Replace each punctuation character with '' and count the replacements.
    # [^\w\s] is used rather than \W, because \W would also count whitespace.
    new_str, count = re.subn(r'[^\w\s]', '', x)
    return count
The code you have is very close to what you'd need if you were counting words. If you were trying to count words, the only modification you'd have to make would probably be to change the last line to this:
writer.writerows(counts.items())
Unfortunately, you're not trying to count words here. If you're looking for counts of single characters, I'd avoid using regular expressions and go straight to count. Your code might look like this:
book_text = open(cwd+"/"+book).read()
counts = {}
for character in texts:
    counts[character] = book_text.count(character)
writer.writerows(counts.items())
As you might be able to tell, this makes a dictionary with the characters as keys and the number of times that character appears in the text as the value. Then we write it as we would have done for counting words.
Using curses:
import curses.ascii
str1 = "real, and? or, and? what."
t = (c for c in str1 if curses.ascii.ispunct(c))
d = dict()
for p in t:
    d[p] = 1 if p not in d else d[p] + 1

How do I calculate the number of times a word occurs in a sentence?

So I've been learning Python for some months now and was wondering how I would go about writing a function that will count the number of times a word occurs in a sentence. I would appreciate if someone could please give me a step-by-step method for doing this.
Quick answer:
def count_occurrences(word, sentence):
    return sentence.lower().split().count(word)
'some string'.split() will split the string on whitespace (spaces, tabs and linefeeds) into a list of word-ish things. Then ['some', 'string'].count(item) returns the number of times item occurs in the list.
That doesn't handle removing punctuation. You could do that using str.maketrans and str.translate (this is the Python 3 form).
# Make collection of chars to keep (everything else is deleted)
import string
keep = string.ascii_lowercase + string.digits + string.whitespace
delete = ''.join(set(string.printable) - set(keep))
table = str.maketrans('', '', delete)
def count_occurrences(word, sentence):
    return sentence.lower().translate(table).split().count(word)
The key here is that we've constructed the string delete so that it contains all the ASCII characters except letters, numbers and spaces. Passing it as the third argument to str.maketrans builds a table that maps each of those characters to None, so str.translate strips them out and leaves everything else unchanged.
wilberforce has the quick, correct answer, and I'll give the long winded 'how to get to that conclusion' answer.
First, here are some tools to get you started, and some questions you need to ask yourself.
You need to read the section on Sequence Types, in the python docs, because it is your best friend for solving this problem. Seriously, read it. Once you have read that, you should have some ideas. For example you can take a long string and break it up using the split() function. To be explicit:
mystring = "This sentence is a simple sentence."
result = mystring.split()
print(result)
print("The total number of words is: " + str(len(result)))
print("The word 'sentence' occurs: " + str(result.count("sentence")))
This takes the input string, splits it on any whitespace, and gives you:
['This', 'sentence', 'is', 'a', 'simple', 'sentence.']
The total number of words is: 6
The word 'sentence' occurs: 1
Now note here that you do have the period still at the end of the second 'sentence'. This is a problem because 'sentence' is not the same as 'sentence.'. If you are going to go over your list and count words, you need to make sure that the strings are identical. You may need to find and remove some punctuation.
A naive approach to this might be:
no_period_string = mystring.replace(".", "")
print(no_period_string)
This gives me a period-less sentence:
"This sentence is a simple sentence"
You also need to decide if your input is going to be just a single sentence, or maybe a paragraph of text. If you have many sentences in your input, you might want to find a way to break them up into individual sentences, and find the periods (or question marks, or exclamation marks, or other punctuation that ends a sentence). Once you find out where in the string the 'sentence terminator' is, you could split up the string at that point, or something like that.
You should give this a try yourself - hopefully I've peppered in enough hints to get you to look at some specific functions in the documentation.
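If you want something to compare your own attempt against, one possible route (not the only one) is to let a regular expression handle the punctuation problem. The sketch below assumes matching should be case-insensitive and should only count whole words; count_word is a hypothetical name.

```python
import re

def count_word(word, text):
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    # \b marks a word boundary, so "sentence." matches but "sentences" does not.
    # re.escape guards against regex metacharacters in the search word.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

mystring = "This sentence is a simple sentence."
print(count_word("sentence", mystring))
```

Unlike the split() version, this counts the final "sentence." as well, because the trailing period sits outside the word boundary.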
Simplest way:
def count_occurrences(word, sentence):
    return sentence.count(word)
text=input("Enter your sentence:")
print("'the' appears", text.count("the"),"times")
simplest way to do it
The problem with using the count() method is that it does not always give the correct number of occurrences when matches overlap, for example:
print('banana'.count('ana'))
output
1
but 'ana' occurs twice in 'banana'
To solve this issue, I used
def total_occurrence(string,word):
    count = 0
    tempsting = string
    while(word in tempsting):
        count +=1
        tempsting = tempsting[tempsting.index(word)+1:]
    return count
You can do it like this:
def countWord(word):
    numWord = 0
    for i in range(1, len(word)-1):
        if word[i-1:i+3] == 'word':
            numWord += 1
    print('Number of times "word" occurs is:', numWord)
then calling the function:
countWord('wordetcetcetcetcetcetcetcword')
will print: Number of times "word" occurs is: 2
def check_Search_WordCount(mySearchStr, mySentence):
    len_mySentence = len(mySentence)
    len_Sentence_without_Find_Word = len(mySentence.replace(mySearchStr,""))
    len_Remaining_Sentence = len_mySentence - len_Sentence_without_Find_Word
    count = len_Remaining_Sentence/len(mySearchStr)
    return (int(count))
I assume that you just know about python string and for loop.
def count_occurences(s,word):
    count = 0
    for i in range(len(s)):
        if s[i:i+len(word)] == word:
            count += 1
    return count
mystring = "This sentence is a simple sentence."
myword = "sentence"
print(count_occurences(mystring,myword))
explanation:
s[i:i+len(word)]: slicing the string s to extract a word having the same length with the word (argument)
count += 1 : increase the counter whenever matched.
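The slicing loop above counts overlapping matches, while str.count() does not. If you prefer not to write the loop by hand, a regex lookahead is one alternative. This is a sketch only: count_overlapping is a hypothetical name, and treating an empty search string as zero matches is an assumption of this example.

```python
import re

def count_overlapping(text, word):
    """Count occurrences of `word` in `text`, including overlapping ones."""
    if not word:
        return 0  # assumption: an empty search string counts as zero matches
    # A lookahead matches without consuming characters, so the scan resumes
    # one position later and overlapping occurrences are all found.
    return len(re.findall("(?=" + re.escape(word) + ")", text))

print("banana".count("ana"))               # non-overlapping count
print(count_overlapping("banana", "ana"))  # overlapping count
```

Which of the two counts is "correct" depends on whether overlapping matches should be counted for your use case.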
