How to find and manipulate words in sentences in python?

I am trying to identify words within sentences that are only made up of numbers. Once I find a word only made up of numbers, I have a certain manipulation I would like to do to it. I am able to do this manipulation to a single string of numbers, but I am absolutely at a loss of how to do so if the strings are randomly positioned across a sentence.
To do so to one string, I confirmed it was only numbers and iterated through its characters so that I skipped the first number, changed the rest to certain letter values and added a new character to the end. These specifics aren't necessarily what is important. I am trying to find a way of treating each random "word" of numbers in a sentence the same way. Is this possible?
I am not supposed to use any advanced functions. Only loops, enumerate, if chains, string functions etc. I feel like I am just overthinking something!
NUM_BRAILLE="*"
digits='1234567890'
decade="abcdefhij"
def numstuff(s):
    if len(s)==1 and s.isdigit():
        s=s+NUM_BRAILLE
    elif " " not in s and s.isdigit():
        start_s=s[:1]
        s=s[1:]
        for i in s:
            if i in digits:
                s=s.replace(i,decade[int(i)-1])
        s=start_s+s+NUM_BRAILLE
    else:
        #if sentence contains many " " (spaces) how to find "words" of numbers and treat them using method above?

You can do something like this to extract numeric values from a sentence and pass the values to your function.
sentence = "This is 234 some text 888 with few words in 33343 numeric"
words = sentence.split(" ")
values= [int(word) if word.isdigit() else 0 for word in words]
print values
Output:
[0, 0, 234, 0, 0, 888, 0, 0, 0, 0, 33343, 0]
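Note that the list comprehension above converts the numeric words to integers and replaces every other word with 0, so the sentence itself is lost. If the goal is to rewrite each numeric word in place using only loops and string methods, a minimal sketch could look like this. The names convert_number and convert_sentence are made up for illustration, and DECADE assumes the intended mapping is 1 to a through 9 to i, with 0 mapped to j (the string in the question is missing a g).

```python
NUM_BRAILLE = "*"
DECADE = "abcdefghij"  # assumption: 1->a ... 9->i, 0->j (via index -1)

def convert_number(word):
    """Keep the first digit, map the remaining digits to letters, append the marker."""
    result = word[0]
    for ch in word[1:]:
        result += DECADE[int(ch) - 1]
    return result + NUM_BRAILLE

def convert_sentence(sentence):
    """Apply convert_number to every all-digit word and leave other words untouched."""
    converted = []
    for word in sentence.split(" "):
        if word.isdigit():
            converted.append(convert_number(word))
        else:
            converted.append(word)
    return " ".join(converted)

print(convert_sentence("I have 234 apples and 5 pears"))
```

Splitting on a single space and joining with a single space keeps the original spacing, because empty strings produced by repeated spaces are not digits and pass through unchanged.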

Related

How can you use Python to count the unique words (without special characters/ cases interfering) in a text document

I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:
Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average words in a sentence
Find common used phrases (a phrase of 3 or more words used over 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line
So far I have this Python program to print total word count:
with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    words = p.split()
    wordCount = len(words)
    print "The total word count is:", wordCount
So far I have this Python program to print unique words and their frequency: (it's not in order and sees words such as: dog, dog., "dog, and dog, as different words)
file=open("/Users/name/Desktop/20words.txt", "r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print k, v
Thank you for any help you can give!

Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles, that have a dot followed by an upper case letter. For words, too, you can use a simple regex, instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.
This should help you get started:
import re, collections
text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
# most_common() yields (item, count) pairs
for s, n in sentences.most_common():
    print s, n
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():
    print w, n
For "more power", you could use some natural language toolkit, but this might be a bit much for this task.

If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.
word = word.strip().strip("'").strip('"')...
This will remove the occurrence of these characters on the extremities of the word.
This probably isn't as efficient as using some NLP library, but it can get the job done.
str.strip Docs
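To tie the strip idea to the original question, here is a rough sketch that treats dog, dog., "dog and DOG as the same word. It assumes that stripping string.punctuation from both ends and lowercasing is a good enough definition of a word; word_frequencies is just an illustrative name.

```python
import string
from collections import Counter

def word_frequencies(text):
    """Count words, ignoring case and punctuation at the edges of each word."""
    counts = Counter()
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:  # skip tokens that were nothing but punctuation
            counts[word] += 1
    return counts

freq = word_frequencies('The dog, "dog and dog. saw a DOG')
print(len(freq))            # number of unique words
print(freq.most_common())   # words in descending order of frequency
```

Punctuation inside a word (as in don't) is left alone, since strip only touches the ends.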

How do I get a program to print the number of words in a sentence and each word in order

I need to print how many characters there are in a sentence the user specifies, print how many words there are in a sentence the user specifies and print each word, the number of letters in the word, and the first and last letter in the word. Can this be done?

Take your time to understand what is going on in the code below. I also suggest you read these resources.
http://docs.python.org/3/library/re.html
http://docs.python.org/3/library/functions.html#len
http://docs.python.org/3/library/functions.html
http://docs.python.org/3/library/stdtypes.html#str.split
import re
def count_letter(word):
    """(str) -> int
    Return the number of letters in a word.
    >>> count_letter('cat')
    3
    >>> count_letter('cat1')
    3
    """
    return len(re.findall('[a-zA-Z]', word))
if __name__ == '__main__':
    sentence = input('Please enter your sentence: ')
    # Replace every non-word character with a space, then split on whitespace.
    words = re.sub(r"[^\w]", " ", sentence).split()
    # The number of characters in the sentence.
    print(len(sentence))
    # The number of words in the sentence.
    print(len(words))
    # Print all the words in the sentence, the number of letters, the first
    # and last letter.
    for i in words:
        print(i, count_letter(i), i[0], i[-1])
Please enter your sentence: hello user
10
2
hello 5 h o
user 4 u r

Please read Python's string documentation, it is self explanatory. Here is a short explanation of the different parts with some comments.
We know that a sentence is composed of words, each of which is composed of letters. What we have to do first is to split the sentence into words. Each entry in this list is a word, and each word is stored in a form of a succession of characters and we can get each of them.
sentence = "This is my sentence"
# split the sentence
words = sentence.split()
# use len() to obtain the number of elements (words) in the list words
print('There are {} words in the given sentence'.format(len(words)))
# go through each word
for word in words:
    # len() counts the number of elements again,
    # but this time it's the chars in the string
    print('There are {} characters in the word "{}"'.format(len(word), word))
    # python is a 0-based language, in the sense that the first element is indexed at 0
    # you can go backward in an array too using negative indices.
    #
    # However, notice that the last element is at -1 and second to last is -2,
    # it can be a little bit confusing at the beginning when we know that the second
    # element from the start is indexed at 1 and not 2.
    print('The first being "{}" and the last "{}"'.format(word[0], word[-1]))

We don't do your homework for you on stack overflow... but I will get you started.
The most important method you will need is one of these two (depending on the version of python):
Python3.X - input([prompt]),.. If the prompt argument is present, it is written
to standard output without a trailing newline. The function then
reads a line from input, converts it to a string (stripping a
trailing newline), and returns that. When EOF is read, EOFError is
raised. http://docs.python.org/3/library/functions.html#input
Python2.X raw_input([prompt]),... If the prompt argument is
present, it is written to standard output without a trailing newline.
The function then reads a line from input, converts it to a string
(stripping a trailing newline), and returns that. When EOF is read,
EOFError is raised. http://docs.python.org/2.7/library/functions.html#raw_input
You can use them like
>>> my_sentance = raw_input("Do you want us to do your homework?\n")
Do you want us to do your homework?
yes
>>> my_sentance
'yes'
As you can see, the text that was typed is stored in the my_sentance variable.
To get the number of characters in a string, you need to understand that a string is a sequence of characters, much like a list. So if you want to know the number of characters you can use:
len(s),... Return the length (the number of items) of an object.
The argument may be a sequence (string, tuple or list) or a mapping
(dictionary). http://docs.python.org/3/library/functions.html#len
I'll let you figure out how to use it.
Finally you're going to need to use a built in function for a string:
str.split([sep[, maxsplit]]),...Return a list of the words in the
string, using sep as the delimiter string. If maxsplit is given, at
most maxsplit splits are done (thus, the list will have at most
maxsplit+1 elements). If maxsplit is not specified or -1, then there
is no limit on the number of splits (all possible splits are made).
http://docs.python.org/2/library/stdtypes.html#str.split
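To make the quoted documentation a bit more concrete, here is how len and str.split behave on a literal string in Python 3. The sentence is an arbitrary example, and this only shows the building blocks, not a full solution.

```python
sentence = "The quick brown fox"

# len() on a string counts characters, spaces included.
char_count = len(sentence)

# split() with no arguments splits on any run of whitespace.
words = sentence.split()

# With a separator and maxsplit=1, at most one split is made,
# so the list has at most two elements.
first_and_rest = sentence.split(" ", 1)

print(char_count, len(words), first_and_rest)
```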

Can't convert 'list'object to str implicitly Python

I am trying to import the alphabet but split it so that each character is in one array but not one string. splitting it works but when I try to use it to find how many characters are in an inputted word I get the error 'TypeError: Can't convert 'list' object to str implicitly'. Does anyone know how I would go around solving this? Any help appreciated. The code is below.
import string
alphabet = string.ascii_letters
print (alphabet)
splitalphabet = list(alphabet)
print (splitalphabet)
x = 1
j = year3wordlist[x].find(splitalphabet)
k = year3studentwordlist[x].find(splitalphabet)
print (j)
EDIT: Sorry, my explanation is kinda bad, I was in a rush. What I am wanting to do is count each individual letter of a word because I am coding a spelling bee program. For example, if the correct word is 'because', and the user who is taking part in the spelling bee has entered 'becuase', I want the program to count the characters and location of the characters of the correct word AND the user's inputted word and compare them to give the student a mark - possibly by using some kind of point system. The problem I have is that I can't simply say if it is right or wrong, I have to award 1 mark if the word is close to being right, which is what I am trying to do. What I have tried to do in the code above is split the alphabet and then use this to try and find which characters have been used in the inputted word (the one in year3studentwordlist) versus the correct word (year3wordlist).

There is a much simpler solution if you use the in keyword. You don't even need to split the alphabet in order to check if a given character is in it:
import string
year3wordlist = ['asdf123', 'dsfgsdfg435']
total_sum = 0
for word in year3wordlist:
    word_sum = 0
    for char in word:
        if char in string.ascii_letters:
            word_sum += 1
    total_sum += word_sum
# Number of ASCII letters across all words:
# total_sum == 12
# Total number of characters in all words:
# sum([len(w) for w in year3wordlist]) == 18
EDIT:
Since the OP comments he is trying to create a spelling bee contest, let me try to answer more specifically. The distance between a correctly spelled word and a similar string can be measured in many different ways. One of the most common ways is called 'edit distance' or 'Levenshtein distance'. This represents the number of insertions, deletions or substitutions that would be needed to rewrite the input string into the 'correct' one.
You can find that distance implemented in the Python-Levenshtein package. You can install it via pip:
$ sudo pip install python-Levenshtein
And then use it like this:
from __future__ import division
import Levenshtein
correct = 'because'
student = 'becuase'
distance = Levenshtein.distance(correct, student) # distance == 2
mark = ( 1 - distance / len(correct)) * 10 # mark == 7.14
The last line is just a suggestion on how you could derive a grade from the distance between the student's input and the correct answer.
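If installing a third-party package is not an option, the same distance can be computed with a short dynamic-programming function. This is a plain-Python sketch of the standard Levenshtein recurrence, not the code used by the package, and edit_distance is a name chosen here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row of the DP table at a time."""
    # previous[j] is the distance between the first i-1 chars of a and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

correct = "because"
student = "becuase"
distance = edit_distance(correct, student)
mark = (1 - distance / len(correct)) * 10
print(distance, round(mark, 2))
```

Swapping two adjacent letters counts as two substitutions here, which is why becuase is at distance 2 from because.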

I think what you need is join:
>>> "".join(splitalphabet)
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

join is a method of str, so you can call it on the separator string or through the class itself:
''.join(splitalphabet)
or
str.join('', splitalphabet)

To convert the list splitalphabet to a string, so you can use it with the find() function you can use separator.join(iterable):
"".join(splitalphabet)
Using it in your code:
j = year3wordlist[x].find("".join(splitalphabet))

I don't know why half the answers are telling you how to put the split alphabet back together...
To count the number of characters in a word that appear in the splitalphabet, do it the functional way:
count = len([c for c in word if c in splitalphabet])

import string
# making letters a set makes "ch in letters" very fast
letters = set(string.ascii_letters)
def letters_in_word(word):
    return sum(ch in letters for ch in word)
Edit: it sounds like you should look at Levenshtein edit distance:
from Levenshtein import distance
distance("because", "becuase") # => 2

While join creates the string from the split, you would not have to do that, as you can issue the find on the original string (alphabet). However, I do not think that is what you are trying to do. Note that the find you are attempting looks for the whole of splitalphabet (actually alphabet) within year3wordlist[x], which will always fail (-1 result).
If what you are trying to do is to get the indices of all the letters of the word list within the alphabet, then you would need to handle it as
for each letter in the word of the word list, determine the index within alphabet.
j = []
for c in word:
    j.append(alphabet.find(c))
print j
On the other hand if you are attempting to find the index of each character within the alphabet within the word, then you need to loop over splitalphabet to get an individual character to find within the word. That is
l = []
for c in splitalphabet:
    j = word.find(c)
    if j != -1:
        l.append((c, j))
print l
This gives the list of tuples showing those characters found and the index.
I just saw that you talk about counting the number of letters. I am not sure what you mean by this as len(word) gives the number of characters in each word while len(set(word)) gives the number of unique characters. On the other hand, are you saying that your word might have non-ascii characters in it and you want to count the number of ascii characters in that word? I think that you need to be more specific in what you want to determine.
If what you are doing is attempting to determine if the characters are all alphabetic, then all you need to do is use the isalpha() method on the word. You can either say word.isalpha() and get True or False or check each character of word to be isalpha()

How do I calculate the number of times a word occurs in a sentence?

So I've been learning Python for some months now and was wondering how I would go about writing a function that will count the number of times a word occurs in a sentence. I would appreciate if someone could please give me a step-by-step method for doing this.

Quick answer:
def count_occurrences(word, sentence):
    return sentence.lower().split().count(word)
'some string'.split() will split the string on whitespace (spaces, tabs and linefeeds) into a list of word-ish things. Then ['some', 'string'].count(item) returns the number of times item occurs in the list.
That doesn't handle removing punctuation. You could do that using string.maketrans and str.translate.
# Make collection of chars to keep (don't translate them)
import string
keep = string.lowercase + string.digits + string.whitespace
table = string.maketrans(keep, keep)
delete = ''.join(set(string.printable) - set(keep))
def count_occurrences(word, sentence):
    return sentence.lower().translate(table, delete).split().count(word)
The key here is that we've constructed the string delete so that it contains all the ascii characters except letters, numbers and spaces. Then str.translate in this case takes a translation table that doesn't change the string, but also a string of chars to strip out.
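The snippet above is Python 2 only: string.maketrans and string.lowercase are gone in Python 3, and str.translate no longer takes a second argument. A rough Python 3 equivalent, assuming it is enough to delete ASCII punctuation, could be written with str.maketrans. The table name _STRIP_PUNCT is made up for this example.

```python
import string

# Translation table that maps nothing and deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def count_occurrences(word, sentence):
    """Count whole-word, case-insensitive matches after removing punctuation."""
    cleaned = sentence.lower().translate(_STRIP_PUNCT)
    return cleaned.split().count(word.lower())

print(count_occurrences("sentence", "This sentence is a simple sentence."))
```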

wilberforce has the quick, correct answer, and I'll give the long winded 'how to get to that conclusion' answer.
First, here are some tools to get you started, and some questions you need to ask yourself.
You need to read the section on Sequence Types, in the python docs, because it is your best friend for solving this problem. Seriously, read it. Once you have read that, you should have some ideas. For example you can take a long string and break it up using the split() function. To be explicit:
mystring = "This sentence is a simple sentence."
result = mystring.split()
print result
print "The total number of words is: " + str(len(result))
print "The word 'sentence' occurs: " + str(result.count("sentence"))
Takes the input string and splits it on any whitespace, and will give you:
["This", "sentence", "is", "a", "simple", "sentence."]
The total number of words is 6
The word 'sentence' occurs: 1
Now note here that you do have the period still at the end of the second 'sentence'. This is a problem because 'sentence' is not the same as 'sentence.'. If you are going to go over your list and count words, you need to make sure that the strings are identical. You may need to find and remove some punctuation.
A naive approach to this might be:
no_period_string = mystring.replace(".", " ")
print no_period_string
To get me a period-less sentence:
"This sentence is a simple sentence"
You also need to decide if your input is going to be just a single sentence, or maybe a paragraph of text. If you have many sentences in your input, you might want to find a way to break them up into individual sentences, and find the periods (or question marks, or exclamation marks, or other punctuation that ends a sentence). Once you find out where in the string the 'sentence terminator' is, you could split up the string at that point.
You should give this a try yourself - hopefully I've peppered in enough hints to get you to look at some specific functions in the documentation.

Simplest way:
def count_occurrences(word, sentence):
    return sentence.count(word)

text=input("Enter your sentence:")
print("'the' appears", text.count("the"),"times")
simplest way to do it
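One caveat with str.count on the raw sentence: it counts substrings, so "the" is also found inside "other" and "there". If whole words are what you want, one possible sketch uses a regular expression with word boundaries. The function name count_whole_word is hypothetical, and the case-insensitive matching is a choice made for this example.

```python
import re

def count_whole_word(word, sentence):
    """Count occurrences of word that are delimited by word boundaries."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, sentence, flags=re.IGNORECASE))

text = "The other there, the end."
print(text.count("the"))              # substring matches, case-sensitive
print(count_whole_word("the", text))  # whole-word matches, ignoring case
```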

The problem with the count() method is that it does not count overlapping occurrences, for example
print('banana'.count('ana'))
output
1
but 'ana' occurs twice in 'banana'
To solve this issue, I used
def total_occurrence(string,word):
    count = 0
    tempsting = string
    while(word in tempsting):
        count +=1
        # restart the search one character past the start of the last match
        tempsting = tempsting[tempsting.index(word)+1:]
    return count
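The same idea can be written without re-slicing the string on every match, by passing a start position to str.find. This is only an alternative sketch; count_overlapping is an illustrative name, and returning 0 for an empty pattern is an assumption made here to avoid an endless loop.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing matches to overlap."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one character after the start of the previous match.
        start = text.find(pattern, start + 1)
    return count

print(count_overlapping("banana", "ana"))
```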

You can do it like this:
def countWord(word):
    numWord = 0
    for i in range(1, len(word)-1):
        if word[i-1:i+3] == 'word':
            numWord += 1
    print 'Number of times "word" occurs is:', numWord
Then calling the function on a string:
countWord('wordetcetcetcetcetcetcetcword')
will print: Number of times "word" occurs is: 2

def check_Search_WordCount(mySearchStr, mySentence):
    len_mySentence = len(mySentence)
    len_Sentence_without_Find_Word = len(mySentence.replace(mySearchStr,""))
    len_Remaining_Sentence = len_mySentence - len_Sentence_without_Find_Word
    count = len_Remaining_Sentence/len(mySearchStr)
    return (int(count))

I assume that you just know about python string and for loop.
def count_occurences(s,word):
    count = 0
    for i in range(len(s)):
        if s[i:i+len(word)] == word:
            count += 1
    return count
mystring = "This sentence is a simple sentence."
myword = "sentence"
print(count_occurences(mystring,myword))
explanation:
s[i:i+len(word)]: slicing the string s to extract a word having the same length with the word (argument)
count += 1 : increase the counter whenever matched.

Limit the number of sentences in a string

A beginner's Python question:
I have a string with x number of sentences. How do I extract the first 2 sentences (each may end with . or ? or !)?

Ignoring considerations such as when a . constitutes the end of sentence:
import re
' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1])
EDIT: Another approach that just occurred to me is this:
re.match(r'(.*?[.?!](?:\s+.*?[.?!]){0,1})', phrase).group(1)
Notes:
Whereas the first solution lets you replace the 2 with some other number to choose a different number of sentences, in the second solution, you change the 1 in {0,1} to one less than the number of sentences you want to extract.
The second solution isn't quite as robust in handling, e.g., empty strings, or strings with no punctuation. It could be made so, but the regex would be even more complex than it is already, and I would favour the slightly less efficient first solution over an unreadable mess.
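For reference, the first approach can be wrapped in a small function. One detail to watch: the [:-1] slice drops a real sentence when the input has two sentences or fewer, so this sketch slices with [:n] instead. The function name first_sentences and the n parameter are illustrative, and the code still treats every ., ? or ! followed by whitespace as a sentence end.

```python
import re

def first_sentences(text, n=2):
    """Return the first n sentences of text, joined by single spaces."""
    # Split after sentence-ending punctuation, at most n times,
    # so everything beyond the n-th sentence stays in one trailing piece.
    parts = re.split(r"(?<=[.?!])\s+", text.strip(), maxsplit=n)
    return " ".join(parts[:n])

phrase = "Sentence 1. Sentence 2? Sentence 3! Sentence 4."
print(first_sentences(phrase))
```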

I solved it like this: Separating sentences, though a comment on that post also points to NLTK, though I don't know how to find the sentence segmenter on their site...

Here's how you could do it:
text = "Sentence one? Sentence two. Sentence three? Sentence four. Sentence five."
sentences = text.split(".")
allSentences = []
for sentence in sentences:
    allSentences.extend(sentence.split("?"))
# the question asks for the first two sentences
print allSentences[0:2]
There are probably better ways, I look forward to seeing them.

Here is a step by step explanation of how to disassemble, choose the first two sentences, and reassemble it. As noted by others, this does not take into account that not all dot/question/exclamation characters are really sentence separators.
import re
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
# split the first two sentences by the dot/question/exclamation.
sentences = re.split('([.?!])', testline, 2)
print "result of split: ", sentences
# toss everything else (the last item in the list)
firstTwo = sentences[:-1]
print firstTwo
# put the first two sentences back together
finalLine = ''.join(firstTwo)
print finalLine

Generator alternative using my utility function returning piece of string until any item in search sequence:
from itertools import islice
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
def multis(search_sequence,text,start=0):
    """ multisearch by given search sequence values from text, starting from position start
    yielding tuples of text before found item and found sequence item"""
    x=''
    for ch in text[start:]:
        if ch in search_sequence:
            if x: yield (x,ch)
            else: yield ch
            x=''
        else:
            x+=ch
    else:
        if x: yield x
# split the first two sentences by the dot/question/exclamation.
two_sentences = list(islice(multis('.?!',testline),2)) ## must save the result of generation
print "result of split: ", two_sentences
print '\n'.join(sentence.strip()+sep for sentence,sep in two_sentences)
