Missing last word in a sentence when using regular expression - python

Code:
import re

def main():
    a = ['the mississippi is well worth reading about',
         ' it is not a commonplace river, but on the contrary is in all ways remarkable']
    b = word_find(a)
    print(b)

def word_find(sentence_list):
    word_list = []
    word_reg = re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")
    for i in range(len(sentence_list)):
        words = re.findall(word_reg, sentence_list[i])
        word_list.append(words)
    return word_list

main()
What I need is to break every word out into its own element of a list.
Right now the output looks like this:
[['the', 'mississippi', 'is', 'well', 'worth', 'reading'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways']]
The last word of the first sentence, 'about', and the last word of the second sentence, 'remarkable', are missing.
It might be some problem in my regular expression:
word_reg=re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")
But if I add a question mark after the last character class of this regular expression, like this:
word_reg = re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?")
the result becomes many single letters instead of words. What can I do about it?
Edit:
The reason why I didn't use string.split is that there are many ways people might break words apart.
For example: when people input a--b there is no space, but we still have to break it into 'a', 'b'.
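A minimal sketch of that requirement with re.split, assuming a delimiter set (the exact set is up to you):
import re

# Assumed delimiters: whitespace, common punctuation, and runs of two or
# more dashes, so 'a--b' splits even without spaces around the dashes.
print(re.split(r"--+|[\s,;:'\"(){}\[\]]+", "when people input a--b, we break it into a, b"))
# ['when', 'people', 'input', 'a', 'b', 'we', 'break', 'it', 'into', 'a', 'b']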

Using the right tools is always the winning strategy. In your case, the right tool is the NLTK word tokenizer, because it was designed to do just that: break sentences into words.
import nltk

a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
nltk.word_tokenize(a[1])
# ['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but',
#  'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
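Note that the tokenizer also returns punctuation as separate tokens (the ',' above). If you only want the words, one option is to drop tokens that contain no alphanumeric characters:
words = [t for t in nltk.word_tokenize(a[1]) if any(c.isalnum() for c in t)]
# ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but',
#  'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']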

A simpler solution:
b = [re.split(r"[\W_]+", s) for s in a]
The character class [\W_] matches any single non-word character (anything that is not a letter, digit, or underscore) plus the underscore itself, which is practically enough; the + splits on whole runs of them. Note that re.split operates on one string at a time, hence the loop over the sentences.
Your current regex also requires that each word be followed by one of the characters in your list, but never by the end of the string, which can be matched with $.
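For illustration, a sketch of that $ fix, using a simplified version of the original pattern (the | characters inside the original character classes were literal and are dropped here; the delimiter set is otherwise an assumption):
import re

# The trailing delimiter may now be a separator character, a run of
# dashes, or the end of the string ($). This is a simplified rewrite
# of the question's pattern, not a drop-in replacement.
word_reg = re.compile(r"[()\[\]{},'\":; \t]?(.+?)(?:[()\[\]{},'\":; \t]|--+|$)")

print(word_reg.findall('the mississippi is well worth reading about'))
# ['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about']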

You can use re.split and filter:
filter(None, re.split(r"[, \-!?:]+", sentence))
Where I have put the pattern "[, \-!?:]+", you should put whatever characters are your delimiters. filter will remove the empty strings produced by leading/trailing separators.
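For example (a Python 3 demo of the idea; filter returns an iterator there, hence the list() call):
import re

a = [' it is not a commonplace river, but on the contrary is in all ways remarkable']
print(list(filter(None, re.split(r"[, \-!?:]+", a[0]))))
# ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on',
#  'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']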

You can either find what you don't want and split on that:
>>> a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
>>> [re.split(r'\W+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['', 'it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
(You may need to filter the '' elements produced by re.split)
Or capture what you do want with re.findall and keep those elements:
>>> [re.findall(r'\b\w+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]

Thanks, everybody.
From the other answers, the solution is to use re.split(),
and there is a SUPER STAR, NLTK, in the top answer.
def word_find(sentence_list):
    word_list = []
    for i in range(len(sentence_list)):
        word_list.append(re.split(r'\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;', sentence_list[i]))
    return word_list

Related

How can you combine a list of tokens and characters (including punctuation and symbols) into a single sentence string in python?

I'm trying to join a list of words and characters, such as the list (ls) below, and convert it into a single, correctly formatted sentence string (sentence), for a collection of lists.
ls = ['"', 'Time', '"', 'magazine', 'said' 'the', 'film', 'was',
'"', 'a', 'multimillion', 'dollar', 'improvisation', 'that',
'does', 'everything', 'but', 'what', 'the', 'title', 'promises',
'"', 'and', 'suggested', 'that', '"', 'writer', 'George',
'Axelrod', '(', '"', 'The', 'Seven', 'Year', 'Itch', '"', ')',
'and', 'director', 'Richard', 'Quine', 'should', 'have', 'taken',
'a', 'hint', 'from', 'Holden', "'s", 'character', 'Richard',
'Benson', 'who', 'writes', 'his', 'movie', ',', 'takes', 'a',
'long', 'sober', 'look', 'at', 'what', 'he', 'has', 'wrought',
',', 'and', 'burns', 'it', '.', '"']
sentence = '"Time" magazine said the film was "a multimillion dollar improvisation that does everything but what the title promises" and suggested that "writer George Axelrod ("The Seven Year Itch") and director Richard Quine should have taken a hint from Holden's character Richard Benson who writes his movie, takes a long sober look at what he has wrought, and burns it."'
I've tried a rule-based approach that adds a space after an element depending on the contents of the next element, but my method ended up as a really long piece of code with rules for as many cases as I could think of, such as those for parentheses or quotations. Is there a way to join this list into a correctly formatted sentence more efficiently?
I think a simple for loop should do the trick:
sentence = ""
for word in ls:
    if (word == ',' or word == '.') and sentence != '':
        sentence = sentence[:-1]  # remove the space added after the previous word
    sentence += word
    if word != '"' and word != '(':
        sentence += ' '  # add a space after each word
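If you'd rather not hand-roll these rules, NLTK ships a detokenizer that encodes many of them. A minimal sketch (quote pairing in particular may not reproduce the target string exactly):
from nltk.tokenize.treebank import TreebankWordDetokenizer

sentence = TreebankWordDetokenizer().detokenize(ls)
print(sentence)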

Joining only a part of a list of strings depending on its value in python

I have a list of strings, for example:
['this', 'is', 'an', 'example', 'of', 'list', 'of', 'strings']
I want to extract only some of these words and join them together, but only the part that has quotes around it. For example, if my list was like...
['I', 'only', 'want', '"this', 'part"', 'of', 'the', 'list']
it should return "this part", because that is the part with quotes around it.
I tried using str.find("\"") but it only finds the first quotation mark, so I couldn't get very far with that. Does anyone have any idea how to do this? I appreciate all the help :))
You can use the following pattern with re.findall:
import re

l = ['I', 'only', 'want', '"this', 'part"', 'of', 'the', 'list', '"another', 'example"']
re.findall(r'\"(.*?)\"', ' '.join(l))
# ['this part', 'another example']
str has a method rfind which returns the index of the last occurrence (or -1 if not found), so you might do:
elements = ['I', 'only', 'want', '"this', 'part"', 'of', 'the', 'list']
txt = ' '.join(elements)
part = txt[txt.find('"')+1:txt.rfind('"')]
print(part) # this part
The +1 is required due to the inclusive-exclusive nature of slicing in Python.
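One caveat worth guarding against: if the text contains no quotes (or only one), find and rfind return -1 and the slice silently returns the wrong thing. A small defensive sketch:
start, end = txt.find('"'), txt.rfind('"')
part = txt[start + 1:end] if 0 <= start < end else None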

discord.py - Dividing a string into a list [duplicate]

How do I split a sentence and store each word in a list? For example, given a string like "these are words", how do I get a list like ["these", "are", "words"]?
To split on other delimiters, see Split a string by a delimiter in python.
To split into individual characters, see How do I split a string into a list of characters?.
Given a string sentence, this stores each word in a list called words:
words = sentence.split()
To split the string text on any consecutive runs of whitespace:
words = text.split()
To split the string text on a custom delimiter such as ",":
words = text.split(",")
The words variable will be a list and contain the words from text split on the delimiter.
Use str.split():
Return a list of the words in the string, using sep as the delimiter
... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:
import nltk
words = nltk.word_tokenize(raw_sentence)
This has the added benefit of splitting out punctuation.
Example:
>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',',
'waking', 'it', '.']
This allows you to filter out any punctuation you don't want and use only words.
Please note that the other solutions using string.split() are better if you don't plan on doing any complex manipulation of the sentence.
[Edited]
How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.
>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"
>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]
>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
I want my python function to split a sentence (input) and store each word in a list
The str.split() method does this: it takes a string and splits it into a list:
>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3
If you want all the chars of a word/sentence in a list, do this:
print(list("word"))
# ['w', 'o', 'r', 'd']
print(list("some sentence"))
# ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:
>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']
NB: it works well for Unix-like command line strings. It doesn't work for natural-language processing.
Split the words without harming apostrophes inside words
See input_1 and input_2 below; the regex keeps the apostrophe inside words like Moore's.
import re

def split_into_words(line):
    word_regex_improved = r"(\w[\w']*\w|\w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)
#Example 1
input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)
# output
['computational', 'power', 'see', "Moore's", 'law', 'and']
#Example 2
input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""
split_into_words(input_2)
#output
['Oh',
'you',
"can't",
'help',
'that',
'said',
'the',
'Cat',
"we're",
'all',
'mad',
'here',
"I'm",
'mad',
"You're",
'mad']

getting a grammar to read more than one keyword in the text

I still consider myself a newbie to pyparsing. I threw together two quick grammars and neither succeeds at what I am trying to do. I am trying to come up with a grammar that seems really simple to write but turns out to be (at least for me) not so trivial. The language has one basic rule: it's broken down into keywords and body text. Bodies can span multiple lines. Keywords are found at the beginning of a line, within the first 20 characters or so, and are terminated with a ';' (no quotes). So I threw together a quick demo program so I could test with a couple of grammars. However, when I try to use them, they always get the first keyword but none after that.
I've attached the source code as an example, along with the output it produces. Even though this is just test code, out of habit I documented it. In the example below the two keywords are NOW; and lastly;. Ideally I wouldn't want the semicolon included in the keyword.
Any ideas what I should do to make this work?
from pyparsing import *
def testString(text, grammar):
    """
    #summary: perform a test of a grammar
    #type text: text
    #param text: text buffer for input (a message to be parsed)
    #type grammar: MatchFirst or equivalent pyparsing construct
    #param grammar: some grammar defined somewhere else
    #status: 20130802 CODED
    """
    print 'Input Text is %s' % text
    print 'Grammar is %s' % grammar
    tokens = grammar.parseString(text)
    print 'After parse string: %s' % tokens
    tokens.dump()
    tokens.keys()
    return tokens
def getText(msgIndex):
    """
    #summary: make a text string suitable for parsing
    #returns: returns a text buffer
    #type msgIndex: int
    #param msgIndex: a number corresponding to a text buffer to retrieve
    #status: 20130802 CODED
    """
    msg = ["""NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
""",
           '',
           ]
    return msg[msgIndex]
def getGrammar(grammarIndex):
    """
    #summary: make a grammar given an index
    #type grammarIndex: int
    #param grammarIndex: a number corresponding to the grammar to be retrieved
    #Note: a good run will return 2 keys: NOW; and lastly; and each key will have
        an associated body. The body is all words and text up to the next keyword
        or eof, whichever is first.
    """
    kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
    kw.setDebug(True)
    body1 = delimitedList(OneOrMore(Word(alphas + nums)) + ~kw)('Body')
    body1.setDebug(True)
    g1 = OneOrMore(Group(kw + body1))
    # ok, start defining a new grammar (borrow kw from the first grammar)
    body2 = SkipTo(~kw, include=False)('BODY')
    body2.setDebug(True)
    g2 = OneOrMore(Group(kw + body2))
    grammar = [g1,
               g2,
               ]
    return grammar[grammarIndex]
if __name__ == '__main__':
    # list indices [text, grammar]
    tests = {1: [0, 0],
             2: [0, 1],
             }
    check = tests.keys()
    check.sort()
    for testno in check:
        print 'STARTING Test %d' % testno
        text = getText(tests[testno][0])
        grammar = getGrammar(tests[testno][1])
        tokens = testString(text, grammar)
        print 'Tokens found %s' % tokens
        print 'ENDING Test %d' % testno
The output looks like this (using Python 2.7 and pyparsing 2.0.1):
STARTING Test 1
Input Text is NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
Grammar is {Group:({Combine:({W:(abcd...) ";"}) {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]...})}...
Match Combine:({W:(abcd...) ";"}) at loc 0(1,1)
Matched Combine:({W:(abcd...) ";"}) -> ['NOW;']
Match {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... at loc 4(1,5)
Match Combine:({W:(abcd...) ";"}) at loc 161(4,20)
Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20)
Matched {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... -> ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
Match Combine:({W:(abcd...) ";"}) at loc 161(4,20)
Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20)
After parse string: [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']]
Tokens found [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']]
ENDING Test 1
STARTING Test 2
Input Text is NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
Grammar is {Group:({Combine:({W:(abcd...) ";"}) SkipTo:(~{Combine:({W:(abcd...) ";"})})})}...
Match Combine:({W:(abcd...) ";"}) at loc 0(1,1)
Matched Combine:({W:(abcd...) ";"}) -> ['NOW;']
Match SkipTo:(~{Combine:({W:(abcd...) ";"})}) at loc 4(1,5)
Match Combine:({W:(abcd...) ";"}) at loc 4(1,5)
Exception raised:Expected ";" (at char 7), (line:1, col:8)
Matched SkipTo:(~{Combine:({W:(abcd...) ";"})}) -> ['']
Match Combine:({W:(abcd...) ";"}) at loc 5(1,6)
Exception raised:Expected ";" (at char 7), (line:1, col:8)
After parse string: [['NOW;', '']]
Tokens found [['NOW;', '']]
ENDING Test 2
Process finished with exit code 0
I'm good with TDD, but here your whole testing and alternative-selecting infrastructure really gets in the way of seeing just where the grammar is and what's going on with it. If I strip away all the extra machinery, I see your grammar is just:
kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
g1 = OneOrMore(Group(kw + body1))
The first issue I see is your definition of body1:
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
You are on the right track with a negative lookahead, but for it to work in pyparsing, you have to put it at the beginning of the expression, not at the end. Think of it as "before I match another valid word, I will first rule out that it is a keyword.":
body1 = delimitedList(OneOrMore(~kw + Word(alphas + nums)))('Body')
(Why is this a delimitedList, by the way? delimitedList is usually reserved for true lists with delimiters, such as comma-delimited arguments to a program function. All this does is accept any commas that might be mixed into the body, which should be handled more straightforwardly using a list of punctuation.)
Here is my test version of your code:
from pyparsing import *

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))

msg = ["""NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
""",
       '',
       ][0]

result = g1.parseString(msg)

# we expect multiple groups, each containing "KEY" and "Body" names,
# so iterate over the groups and dump the contents of each
for res in result:
    print res.dump()
I still get the same results as you: only the first keyword matches. So to see where the disconnect is happening, I use scanString, which returns not only the matched tokens, but also the start and end locations of the match:
result, start, end = next(g1.scanString(msg))
print len(msg),end
Which gives me:
320 161
So I see that we are ending at location 161 in a string whose total length is 320, so I'll add one more print statement:
print msg[end:end+10]
and I get:
.
lastly;
The trailing period in your body text is the culprit. If I remove it from the message and try parseString again, I now get:
['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- Body: ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- KEY: NOW;
['lastly;', 'another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- Body: ['another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- KEY: lastly;
If you want to handle punctuation, I suggest you add something like:
PUNC = oneOf(". , ? ! : & $")
and add it to body1:
body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
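Putting the pieces together, here is a minimal runnable sketch of the corrected grammar (written in Python 3 syntax, though the question used Python 2; the punctuation set is the one suggested above):
from pyparsing import (Combine, Group, Literal, OneOrMore, Word,
                       alphas, nums, oneOf)

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
PUNC = oneOf(". , ? ! : & $")
# negative lookahead first: before matching a body word, rule out a keyword
body = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
grammar = OneOrMore(Group(kw + body))

msg = """NOW; is the time for a few good ones to come to the aid
when least expected.
lastly; another day progresses and
find our happiness perhaps its closer than 1 or 2 years or not so
"""

for group in grammar.parseString(msg):
    print(group.dump())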

Technique to remove common words(and their plural versions) from a string

I am attempting to find tags (keywords) for a recipe by parsing a long string of text. The text contains the recipe ingredients, directions, and a short blurb.
What do you think would be the most efficient way to remove common words from the tag list?
By common words, I mean words like: 'the', 'at', 'there', 'their' etc.
I have two methodologies I could use. Which do you think is more efficient in terms of speed, and do you know of a more efficient way to do this?
Methodology 1:
- Determine the number of times each word occurs (using Counter from the collections library)
- Have a list of common words and remove them all from the Counter object by attempting to delete each key if it exists
- Therefore the speed will be determined by the length of the list delims

from collections import Counter

delims = ['there', "there's", 'theres', 'they', "they're"]
# the above will end up being a really long list!
word_freq = Counter(recipe_str.lower().split())
for delim in set(delims):
    del word_freq[delim]
return word_freq.most_common()
Methodology 2:
- For common words that can be plural, look at each word in the recipe string and check if it partially contains the non-plural version of a common word. E.g., for the string "There's a test", check each word to see if it contains "there" and delete it if it does.

delims = ['this', 'at', 'them']      # words that can't be plural
partial_delims = ['there', 'they']   # words that could occur in many forms
word_freq = Counter(recipe_str.lower().split())
for delim in set(delims):
    del word_freq[delim]
# really slow
for delim in set(partial_delims):
    for word in list(word_freq):     # iterate over a copy, since we delete while iterating
        if word.find(delim) != -1:
            del word_freq[word]
return word_freq.most_common()
I'd just do something like this:
from nltk.corpus import stopwords

s = set(stopwords.words('english'))
txt = "a long string of text about him and her"
print([w for w in txt.split() if w not in s])
which prints
['long', 'string', 'text']
and in terms of complexity it should be O(n) in the number of words in the string, if you believe the hashed set lookup is O(1).
FWIW, my version of NLTK defines 127 stopwords:
'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'
Obviously you can provide your own set; I'm in agreement with the comment on your question that it's probably easiest (and fastest) to just provide all the variations you want to eliminate up front, unless you want to eliminate many more words than this, at which point it becomes more a question of spotting interesting words than of eliminating spurious ones.
Your problem domain is "Natural Language Processing".
If you don't want to reinvent the wheel, use NLTK, search for stemming in the docs.
Given that NLP is one of the hardest subjects in computer science, reinventing this wheel is a lot of work...
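As a minimal sketch of that suggestion, NLTK's PorterStemmer (one of several stemmers it provides) maps plural and inflected forms onto a common stem, so 'recipe' and 'recipes' collapse to one key:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["recipes", "recipe", "tomatoes", "baking"]])
# roughly: ['recip', 'recip', 'tomato', 'bake']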
You ask about speed, but you should be more concerned with accuracy. Both your suggestions will make a lot of mistakes, removing either too much or too little (for example, there are a lot of words that contain the substring "at"). I second the suggestion to look into the nltk module. In fact, one of the early examples in the NLTK book involves removing common words until the most common remaining ones reveal something about the genre. You'll get not only tools, but instruction on how to go about it.
Anyway, you'll spend much longer writing your program than your computer will spend executing it, so focus on doing it well.
