I am processing a large text file and as output I have a list of words:
['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December', ...]
What I want to achieve next is to transform everything to lowercase, remove all the words that belong to a stopset (commonly used words) and remove punctuation. I can do it by doing 3 iterations:
lower=[word.lower() for word in mywords]
removepunc=[word for word in lower if word not in string.punctuation]
final=[word for word in removepunc if word not in stopset]
I tried to use
[word for word in lower if word not in string.punctuation or word not in stopset]
to achieve what last 2 lines of code are supposed to do but it's not working. Where is my error and is there any faster way to achieve this than to iterate through the list 3 times?
If your code is working as intended, I don't think combining it into one line is a good idea. As it is now it is quite readable and can easily be modified with additional processing. One-liners are good for getting more upvotes on SO, but you'll have a hard time understanding their logic some time later.
You can replace the intermediate steps with generators instead of lists, so that the work is done in a single pass and no intermediate lists are built:
lower = (word.lower() for word in mywords)
removepunc = (word for word in lower if word not in string.punctuation)
final = [word for word in removepunc if word not in stopset]
You can certainly compress the logic:
final = [word for word in map(str.lower, mywords)
if word not in string.punctuation and word not in stopset]
For example, if I define stopset = ['is'] I get:
['today', 'cold', 'outside', '2013', 'december']
Here is the equivalent single list comprehension, although I agree with alko that what you already have is clearer:
final = [lower_word for word in mywords for lower_word in (word.lower(),) if lower_word not in string.punctuation and lower_word not in stopset]
Note that list comprehensions are not the best way to go when it comes to large files, as the entire file would have to be loaded into memory.
Instead, do something like the approach in Read large text files in Python, line by line without loading it in to memory:
with open("log.txt") as infile:
    for line in infile:
        # your filtering clause goes here
        ...
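For instance, a minimal sketch (with an assumed example stopset; log.txt stands in for the real file) that applies the lowercase/punctuation/stopword cleaning one line at a time:
import string

stopset = {'is', 'the', 'a'}  # assumed example stop words
avoid = set(string.punctuation) | stopset

cleaned = []
with open("log.txt") as infile:
    for line in infile:
        # process each line as it is read, instead of loading the whole file
        cleaned.extend(w for w in line.lower().split() if w not in avoid)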
I'd guess the fastest approach is to try to move as much of the computation as possible from Python to C. First precompute the set of forbidden strings; this needs to be done just once.
avoid = set(string.punctuation) | set(x.lower() for x in stopset)
then let the set subtraction operation do as much of the filtering as possible:
final = set(x.lower() for x in mywords) - avoid
Converting the whole source of words at once to lowercase before starting would probably improve speed too. In that case the code would be
final = set(mywords) - avoid
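Putting the pieces together, a minimal sketch of the whole set-based approach (keep in mind that a set drops duplicates and does not preserve word order):
import string

mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}  # assumed example stop set

# precompute the forbidden strings once
avoid = set(string.punctuation) | set(x.lower() for x in stopset)

# let the C-level set difference do the filtering
final = set(x.lower() for x in mywords) - avoid
print(final)  # e.g. {'today', 'cold', 'outside', '2013', 'december'}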
You can use map to fold in the .lower() processing:
final = [word for word in map(str.lower, mywords) if word not in string.punctuation and word not in stopset]
You can simply add string.punctuation to stopset, then it becomes
final = [word for word in map(str.lower, mywords) if word not in stopset]
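For that to work, the punctuation characters have to be folded into the stop set first; a minimal sketch with an assumed example stopset:
import string

mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}  # assumed example stop words
stopset = stopset | set(string.punctuation)  # fold the punctuation into the stop set

final = [word for word in map(str.lower, mywords) if word not in stopset]
print(final)  # ['today', 'cold', 'outside', '2013', 'december']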
Are you sure you don't want to preserve the case of the words in the output, though?
is there any faster way to achieve this than to iterate through the list 3 times?
Turn johnsharpe's code into a generator. This may drastically speed things up and lower memory use as well.
import string
stopset = ['is']
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
final = (word.lower() for word in mywords if (word not in string.punctuation and
word not in stopset))
print "final", list(final)
To display results outside of an iterator for debugging, use list as in this example
If you use filter you only need one list comprehension, and it is easier to read.
final = filter(lambda s: s not in string.punctuation and s not in stopset, [word.lower() for word in mywords])
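Note that in Python 3 filter returns a lazy iterator rather than a list, so wrap it in list() if you need the materialized result; a small sketch with assumed example data:
import string

mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}  # assumed example stop words

final = list(filter(lambda s: s not in string.punctuation and s not in stopset,
                    (word.lower() for word in mywords)))
print(final)  # ['today', 'cold', 'outside', '2013', 'december']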
Related
I need some Python advice to implement an algorithm.
What I need is to detect which words from text 1 are in text 2:
Text 1: "Mary had a dog. The dog's name was Ethan. He used to run down
the meadow, enjoying the flower's scent."
Text 2: "Mary had a cat. The cat's name was Coco. He used to run down
the street, enjoying the blue sky."
I'm thinking I could use some pandas datatype to check repetitions, but I'm not sure.
Any ideas on how to implement this would be very helpful. Thank you very much in advance.
Since you do not show any work of your own, I'll just give an overall algorithm.
First, split each text into its words. This can be done in several ways. You could remove any punctuation then split on spaces. You need to decide if an apostrophe as in dog's is part of the word--you probably want to leave apostrophes in. But remove periods, commas, and so forth.
Second, place the words for each text into a set.
Third, use the built-in set operations to find which words are in both sets.
This will answer your actual question. If you want a different question that involves the counts or positions of the words, you should make that clear.
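A minimal sketch of that algorithm, using a hypothetical words_of helper (here surrounding punctuation is stripped but apostrophes are kept):
import string

text1 = "Mary had a dog. The dog's name was Ethan."
text2 = "Mary had a cat. The cat's name was Coco."

def words_of(text):
    # split on whitespace, strip surrounding punctuation, keep apostrophes inside words
    return {w.strip(string.punctuation.replace("'", "")) for w in text.split()}

common = words_of(text1) & words_of(text2)  # built-in set intersection
print(common)  # e.g. {'Mary', 'had', 'a', 'The', 'name', 'was'}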
You can use a dictionary to first store the words from the first text and then simply look them up while iterating over the second text. But this will take space.
So the best way is to use regular expressions.
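A minimal regex-based sketch of that idea (assuming \w+ with an optional apostrophe part is an acceptable definition of a word here):
import re

text1 = "Mary had a dog. The dog's name was Ethan."
text2 = "Mary had a cat. The cat's name was Coco."

word_re = re.compile(r"\w+(?:'\w+)?")
words1 = set(word_re.findall(text1))
words2 = set(word_re.findall(text2))
print(words1 & words2)  # the words appearing in both texts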
First extract the words from both strings into lists. I assume you would want to ignore any trailing periods or commas. Add one of the lists to a set (for expected constant-time lookup). For each word in the other list, check if it's also present in the set; that gets you the words common to both texts. I assumed that duplicate elements are counted only once. The following is the code for doing this:
def get_words(text):
    words = text.split()
    for i in range(len(words)):
        words[i] = words[i].strip('.,')
    return words

def common_words(text1, text2):
    words1 = get_words(text1)
    words2 = set(get_words(text2))
    common = set()
    for word in words1:
        if word in words2:
            common.add(word)
    return common
For your example, it would return:
{'enjoying', 'had', 'to', 'Mary', 'used', 'the', 'The', 'was', 'down', 'name', 'He', 'run', 'a'}
Note that the words "the" and "The" are counted as distinct. If you don't want that, you can convert all words to lower case: words[i] = words[i].strip('.,').lower()
I have a method to remove punctuation from every word in an array of words and I want to use it in a list comprehension. All I can think of with my basic Python knowledge is:
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

def letters_only(astr):
    return astr.isalpha()

def clean_text(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(' '.join([lemmatizer.lemmatize(remove_punctuation(word.lower()))
                                      for word in doc.split()
                                      if letters_only(word)
                                      and remove_punctuation(word) not in all_names
                                      and remove_punctuation(word) not in all_names_lower]))
    return cleaned_docs
As you can see I am using the "remove_punctuation" method in many places. Is there any way to use it only once, or more efficiently?
Thanks!
*letters_only - it is from some tutorial and unfortunately it sees a word like "best!" with an exclamation mark at the end and removes the whole word - but I am trying to make it remove only the exclamation mark.
Since you provided the definitions for letters_only and remove_punctuation we can now say that your code is equivalent to:
[lemmatizer.lemmatize(word.lower())
for word in doc.split()
if letters_only(word) and word.lower() not in all_names_lower]
So all the calls to remove_punctuation are useless, because they are made only when letters_only(word) is true, which means word does not contain any punctuation.
Not really. The best you can do is zip together the original list with a generator that removes punctuation:
original_words = doc.split()
no_punct_words = map(remove_punctuation, original_words)
cleaned_docs.append(' '.join([lemmatizer.lemmatize(no_punct_word.lower())
for word, no_punct_word in zip(original_words, no_punct_words) if letters_only(word)
and no_punct_word not in all_names
and no_punct_word not in all_names_lower]))
Anyway, your conditions do not make much sense. If the letters_only(word) condition is true, I'd expect remove_punctuation to do nothing to word, so you could drop it.
Also: the two conditions:
no_punct_word not in all_names and no_punct_word not in all_names_lower
could probably become just:
no_punct_word.lower() not in all_names_lower
As an aside: if the conditions you want to apply should always be applied to remove_punctuation(word) then you can do better: you can just map that function:
no_punct_words = map(remove_punctuation, doc.split())
# ...
[lemmatizer.lemmatize(word.lower())
for word in no_punct_words if letters_only(word)
and word.lower() not in all_names_lower]
And maybe you can do the same with .lower():
lower_no_punct_words = map(str.lower, map(remove_punctuation, doc.split()))
# ...
[lemmatizer.lemmatize(word)
for word in lower_no_punct_words if letters_only(word)
and word not in all_names_lower]
Trying to guess the intention (the code seems to have a few bugs), I'd say you should be good with something like the below. Note the laziness of the whole thing; it should make the code less greedy on memory consumption.
def normalized_words_of(doc):
    for word in doc.split():
        if letters_only(word):
            yield remove_punctuation(word.lower())

def clean_text(docs):
    for doc in docs:
        yield ' '.join(word for word in normalized_words_of(doc) if word not in all_names_lower)
print(list(clean_text(['hi there, you', 'good bye - till next time'])))
I would like to do a stopword removal.
I have a list which consists of about 15,000 strings. Those strings are little texts. My code is the following:
h = []
for w in clean.split():
    if w not in cachedStopWords:
        h.append(w)
    if w in cachedStopWords:
        h.append(" ")
print(h)
I understand that .split() is necessary so that it is not each whole string that gets compared to the list of stopwords. But it does not seem to work, because you cannot split a list. (Without any kind of splitting, h equals clean, because obviously nothing matches.)
Does anyone have an idea how else I could split the different strings in the list while still preserving the different cases?
A very minimal example:
stops = {'remove', 'these', 'words'}
strings = ['please do not remove these words', 'removal is not cool', 'please please these are the bees\' knees', 'there are no stopwords here']
strings_cleaned = [' '.join(word for word in s.split() if word not in stops) for s in strings]
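For the sample strings above this should yield:
['please do not', 'removal is not cool', "please please are the bees' knees", 'there are no stopwords here']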
Or you could do:
strings_cleaned = []
for s in strings:
    word_list = []
    for word in s.split():
        if word not in stops:
            word_list.append(word)
    s_string = ' '.join(word_list)
    strings_cleaned.append(s_string)
This is a lot uglier (I think) than the one-liner before it, but perhaps more intuitive.
Make sure you're converting your container of stopwords to a set (a hash-based container which makes lookups O(1), instead of a list, whose lookups are O(n)).
Edit: This is just a general, very straightforward example of how to remove stopwords. Your use case might be a little different, but since you haven't provided a sample of your data, we can't help any further.
I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:
Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average words in a sentence
Find common used phrases (a phrase of 3 or more words used over 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line
So far I have this Python program to print total word count:
with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()

words = p.split()
wordCount = len(words)
print "The total word count is:", wordCount
So far I have this Python program to print unique words and their frequency (it's not in order, and it treats words such as dog, dog., "dog, and dog, as different words):
file=open("/Users/name/Desktop/20words.txt", "r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

for k, v in wordcount.items():
    print k, v
Thank you for any help you can give!
Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles, that have a dot followed by an upper case letter. For words, too, you can use a simple regex, instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.
This should help you get started:
import re, collections
text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
for s, n in sentences.most_common():
    print n, s
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():
    print n, w
For "more power", you could use some natural language toolkit, but this might be a bit much for this task.
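If you do reach for NLTK, a minimal sketch might look like this (assuming the punkt tokenizer data has already been fetched with nltk.download('punkt')):
import collections
import nltk

text = "Sentences start with a capital letter. Do they always end with a dot? No!"

sentences = nltk.sent_tokenize(text)                  # sentence splitting
words = [w.lower() for w in nltk.word_tokenize(text)  # word tokens, lower-cased
         if w.isalnum()]                              # drop punctuation tokens

print(len(sentences))                                 # number of sentences
print(collections.Counter(words).most_common())       # words by descending frequency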
If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.
word = word.strip().strip("'").strip('"')...
This will remove the occurrence of these characters on the extremities of the word.
This probably isn't as efficient as using some NLP library, but it can get the job done.
str.strip Docs
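Since str.strip accepts a whole string of characters to remove, the chained calls above can also be collapsed into a single call; a small sketch:
word = "'dog.',"
print(word.strip(".,'\""))  # -> dog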
#!/usr/bin/python
#this looks for words in dictionary that begin with 'in' and the suffix is a real word
wordlist = [line.strip() for line in open('/usr/share/dict/words')]
newlist = []
for word in wordlist:
    if word.startswith("in"):
        newlist.append(word)

for word in newlist:
    word = word.split('in')

print newlist
How would I get the program to remove the string "in" from all the words that start with it? Right now it does not work.
#!/usr/bin/env python
# Look for all words beginning with 'in'
# such that the rest of the word is also
# a valid word.
# load the dictionary:
with open('/usr/share/dict/words') as inf:
    allWords = set(word.strip() for word in inf)  # one word per line
Using 'with' ensures the file is always properly closed.
I make allWords a set; this makes searching it an O(1) operation.
then we can do
# get the remainder of all words beginning with 'in'
inWords = [word[2:] for word in allWords if word.startswith("in")]
# filter to get just those which are valid words
inWords = [word for word in inWords if word in allWords]
or run it into a single statement, like
inWords = [word for word in (word[2:] for word in allWords if word.startswith("in")) if word in allWords]
Doing it the second way also lets us use a generator for the inside loop, reducing memory requirements.
split() returns a list of the segments obtained by splitting. Furthermore,
word = word.split('in')
doesn't modify your list, it just modifies the variable being iterated.
Try replacing your second loop with this:
for i in range(len(newlist)):
    word = newlist[i].split('in', 1)
    newlist[i] = word[1]
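As a quick illustration of why indexing with [1] works here, splitting on the leading "in" produces an empty first segment:
word = 'inside'.split('in', 1)
print(word)     # ['', 'side']
print(word[1])  # 'side'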
It's difficult to tell from your question what you want in newlist. If you just want words that start with "in" but with the "in" removed, then you can use a slice:
newlist = [word[2:] for word in wordlist if word.startswith('in')]
If you want words that start with "in" and are still in wordlist once they've had the "in" removed (is that what you meant by "real" in your comment?), then you need something a little different:
newlist = [word for word in wordlist if word.startswith('in') and word[2:] in wordlist]
Note that in Python we use a list, not an "array".
Suppose that wordlist is the list of words. The following code should do the trick:
for i in range(len(wordlist)):
    if wordlist[i].startswith("in"):
        wordlist[i] = wordlist[i][2:]
It is better to use a while loop if the number of words in the list is quite big.