I have a method to remove punctuation from every word in an array of words and I want to use it in a list comprehension. All I can think of with my basic Python knowledge is:
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

def letters_only(astr):
    return astr.isalpha()

def clean_text(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(' '.join([lemmatizer.lemmatize(remove_punctuation(word.lower()))
                                      for word in doc.split()
                                      if letters_only(word)
                                      and remove_punctuation(word) not in all_names
                                      and remove_punctuation(word) not in all_names_lower]))
    return cleaned_docs
As you can see, I am calling the "remove_punctuation" method in many places. Is there any way to call it only once, or to do this more efficiently?
Thanks!
*letters_only comes from a tutorial. Unfortunately, it sees a word like "best!" with an exclamation mark at the end and removes the whole word, but I am trying to make it remove only the exclamation mark.
Since you provided the definitions for letters_only and remove_punctuation we can now say that your code is equivalent to:
[lemmatizer.lemmatize(word.lower())
for word in doc.split()
if letters_only(word) and word.lower() not in all_names_lower]
So all the calls to remove_punctuation are useless, because they are done only if letters_only(word) which means word does not have any punctuation.
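A quick check makes the point concrete. This is a minimal sketch that reuses the two helpers from the question; the sample words are made up for illustration.

```python
import string

def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

def letters_only(astr):
    return astr.isalpha()

# Sample words chosen only for illustration.
samples = ["best", "best!", "don't", "Hello"]

# Words that pass letters_only contain no punctuation,
# so remove_punctuation returns them unchanged.
passing = [w for w in samples if letters_only(w)]
unchanged = all(remove_punctuation(w) == w for w in passing)
print(passing, unchanged)
```

Note that "best!" never reaches remove_punctuation at all, because letters_only rejects it first.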
Not really. The best you can do is zip together the original list with a generator that removes punctuation:
original_words = doc.split()
no_punct_words = map(remove_punctuation, original_words)
cleaned_docs.append(' '.join([lemmatizer.lemmatize(no_punct_word.lower())
for word, no_punct_word in zip(original_words, no_punct_words) if letters_only(word)
and no_punct_word not in all_names
and no_punct_word not in all_names_lower]))
Anyway your conditions do not make much sense. If the if letters_only(word) condition is true I'd expect remove_punctuation to do nothing to word and so you could remove it.
Also: the two conditions:
no_punct_word not in all_names and no_punct_word not in all_names_lower
Could probably become just:
no_punct_word.lower() not in all_names_lower
As an aside: if the conditions you want to apply should always be applied to remove_punctuation(word) then you can do better: you can just map that function:
no_punct_words = map(remove_punctuation, doc.split())
# ...
[lemmatizer.lemmatize(word.lower())
for word in no_punct_words if letters_only(word)
and word.lower() not in all_names_lower]
And maybe you can do the same with .lower():
lower_no_punct_words = map(str.lower, map(remove_punctuation, doc.split()))
# ...
[lemmatizer.lemmatize(word)
for word in lower_no_punct_words if letters_only(word)
and word not in all_names_lower]
Trying to guess the intention (the code seems to have a few bugs), I'd say you should be good with something like the code below. Note the laziness of the whole thing: it should make the code less greedy in memory consumption.
def normalized_words_of(doc):
    for word in doc.split():
        if letters_only(word):
            yield remove_punctuation(word.lower())

def clean_text(docs):
    for doc in docs:
        yield ' '.join(word for word in normalized_words_of(doc) if word not in all_names_lower)

print(list(clean_text(['hi there, you', 'good bye - till next time'])))
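If the goal is the one described in the question (keep "best!" but drop the exclamation mark), the punctuation probably has to be stripped before the isalpha check rather than after it. Here is a minimal sketch under two assumptions: the excluded names are available as a lowercase set, and an identity function stands in for lemmatizer.lemmatize. The names clean_doc, excluded_lower and lemmatize are hypothetical, not from the original code.

```python
import string

# Build the translation table once instead of on every call.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_doc(doc, excluded_lower=frozenset(), lemmatize=lambda w: w):
    """Strip punctuation and lowercase each word once, then filter."""
    # Punctuation is removed exactly once per word, lazily.
    words = (w.translate(_PUNCT_TABLE).lower() for w in doc.split())
    return ' '.join(lemmatize(w) for w in words
                    if w.isalpha() and w not in excluded_lower)

print(clean_doc("The best! movie, ever - Bob said", {"bob"}))
```

A lone "-" becomes an empty string after stripping, and isalpha() rejects it, so no extra check is needed for that case.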
Related
I am trying to get raw_input from user and then find a required word from that input. If the required word is there, then a function runs. So I tried .split to split the input but how do I find if the required word is in the list.
It's really simple to get this done. Python has an in operator that does exactly what you need. You can see if a word is present in a string and then do whatever else you'd like to do.
sentence = 'hello world'
required_word = 'hello'
if required_word in sentence:
    pass  # do whatever you'd like
You can see some basic examples of the in operator in action here.
Depending on the complexity of your input or lack of complexity of your required word, you may run into some problems. To deal with that you may want to be a little more specific with your required word.
Let's take this for example:
sentence = 'i am harrison'
required_word = 'is'
This example will evaluate to True if you were to do if required_word in sentence: because technically the letters is are a substring of the word "harrison".
To fix that you would just simply do this:
sentence = 'i am harrison'
required_word = ' is '
By putting the empty space before and after the word it will specifically look for occurrences of the required word as a separate word, and not as a part of a word.
HOWEVER, if you are okay with matching substrings as well as word occurrences then you can ignore what I previously explained.
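One caveat with the padded version: ' is ' will not match when the word is the first or last word of the sentence, because there is no space on that side. Here is a small sketch that pads the sentence as well. contains_word is a hypothetical helper name, and the sketch assumes words are separated by single spaces with no punctuation attached.

```python
def contains_word(sentence, word):
    # Pad both strings so a word at the very start or end still has
    # a space on each side.
    return f" {word} " in f" {sentence} "

print(contains_word("i am harrison", "is"))
print(contains_word("is this it", "is"))
```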
If there's a group of words and if any of them is the required one, then what should I do? Like, the required word is either "yes" or "yeah". And the input by user contains "yes" or "yeah".
As per this question, an implementation would look like this:
sentence = 'yes i like to code in python'
required_words = ['yes', 'yeah']
#                 ^   ^  ^    ^
# add spaces before and after each word if you don't
# want to accidentally run into a chance where either word
# is a substring of one of the words in sentence
if any(word in sentence for word in required_words):
    pass  # do whatever you'd like
This makes use of the built-in any function. The if statement will evaluate to true as long as at least one of the words in required_words is found in sentence.
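If only whole words should count, one possible alternative is to compare sets of words instead of searching the raw string. This is a sketch, not the answer's method; has_required_word is a hypothetical name, and lowercasing the input is an added assumption.

```python
required_words = {'yes', 'yeah'}

def has_required_word(sentence, required=required_words):
    # split() yields whole words, so 'yes' does not match inside 'eyes'.
    return not required.isdisjoint(sentence.lower().split())

print(has_required_word('yes i like to code in python'))
print(has_required_word('my eyes hurt'))
```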
Harrison's way is one way. Here are other ways:
Way 1:
sentence = raw_input("enter input:")
words = sentence.split(' ')
desired_word = 'test'
if desired_word in words:
    pass  # do required operations
Way 2:
import re
sentence = raw_input("enter input:")
desired_word = 'test'
if re.search(r'(^|\s)' + re.escape(desired_word) + r'(\s|$)', sentence):
    pass  # do required operations
Way 3 (especially if there is punctuation at the end of the word):
import re
sentence = raw_input("enter input:")
desired_word = 'test'
if re.search(r'(^|\s)' + re.escape(desired_word) + r'([\s,:;]|$)', sentence):
    pass  # do required operations
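A word-boundary pattern is another way to express a whole-word match with re. The sketch below assumes the desired word starts and ends with a letter, digit or underscore, since \b only works next to such characters. has_word is a hypothetical helper name.

```python
import re

def has_word(sentence, word):
    # re.escape guards against regex metacharacters in the word;
    # \b matches at the start/end of the string and next to punctuation.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, sentence) is not None

print(has_word("this is a test.", "test"))
print(has_word("contest results", "test"))
```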
I am processing a large text file and as output I have a list of words:
['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December', ...]
What I want to achieve next is to transform everything to lowercase, remove all the words that belong to a stopset (commonly used words) and remove punctuation. I can do it by doing 3 iterations:
lower=[word.lower() for word in mywords]
removepunc=[word for word in lower if word not in string.punctuation]
final=[word for word in removepunc if word not in stopset]
I tried to use
[word for word in lower if word not in string.punctuation or word not in stopset]
to achieve what the last 2 lines of code are supposed to do, but it's not working. Where is my error, and is there any faster way to achieve this than to iterate through the list 3 times?
If your code is working as intended, I don't think compressing it is a good idea. As it stands it is readable and can easily be modified with additional processing. One-liners are good for getting upvotes on SO, but you'll have a hard time understanding the logic some time later.
You can replace intermediate steps with generators instead of lists, to make your computation work once, and not to generate several lists:
lower = (word.lower() for word in mywords)
removepunc = (word for word in lower if word not in string.punctuation)
final = [word for word in removepunc if word not in stopset]
You can certainly compress the logic:
final = [word for word in map(str.lower, mywords)
if word not in string.punctuation and word not in stopset]
For example, if I define stopset = ['is'] I get:
['today', 'cold', 'outside', '2013', 'december']
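As for the error in the question's attempt: with or, a word is kept whenever either test passes, and every word passes at least one of them. A short demonstration, using a made-up stop list and a shortened word list:

```python
import string

stopset = ['is']  # sample stop words, for illustration only
lower = ['today', ',', 'is', 'cold', '?']

# `or` keeps a word if EITHER test passes; ',' is not in stopset and
# 'is' is not in string.punctuation, so nothing is ever dropped.
with_or = [w for w in lower if w not in string.punctuation or w not in stopset]

# `and` keeps a word only if BOTH tests pass.
with_and = [w for w in lower if w not in string.punctuation and w not in stopset]

print(with_or)
print(with_and)
```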
Here is the equivalent single list comprehension, although I agree with alko that what you already have is clearer:
final = [lower_word for word in mywords for lower_word in (word.lower(),) if lower_word not in string.punctuation and lower_word not in stopset]
Note that list comprehensions are not the best way to go when it comes to large files, as the entire file will have to be loaded into memory.
Instead, do something like the approach in "Read large text files in Python, line by line without loading it in to memory":
with open("log.txt") as infile:
    for line in infile:
        # the if clause (filtering) goes here
        ...
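Combining the line-by-line idea with the filtering from the question could look roughly like this. It is a sketch only: iter_clean_words is a hypothetical name, and a list of strings stands in for the open file so the example runs on its own.

```python
import string

def iter_clean_words(lines, stopset):
    """Yield cleaned words one at a time from an iterable of lines."""
    for line in lines:
        for word in line.split():
            word = word.lower()
            if word not in string.punctuation and word not in stopset:
                yield word

# Any iterable of lines works; an open file object yields one line at a time.
sample_lines = ["today , is cold\n", "outside 2013 ? December\n"]
print(list(iter_clean_words(sample_lines, {"is"})))
```

Because the function is a generator, only one line is held in memory at a time when it is fed a file object.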
I'd guess the fastest approach is try to move as much as possible of the computation from Python to C. First precompute the set of forbidden strings. This needs to be done just once.
avoid = set(string.punctuation) | set(x.lower() for x in stopset)
Then let the set subtraction operation do as much of the filtering as possible:
final = set(x.lower() for x in mywords) - avoid
Note that final is a set here, so duplicates and the original word order are lost. Converting the whole source of words to lowercase at once before starting would probably improve speed too. In that case the code would be
final = set(mywords) - avoid
You can use map to fold in the .lower processing
final = [word for word in map(str.lower, mywords) if word not in string.punctuation and word not in stopset]
You can simply add string.punctuation to stopset, then it becomes
final = [word for word in map(str.lower, mywords) if word not in stopset]
Are you sure you don't want to preserve the case of the words in the output, though?
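Folding the punctuation into the stop set might look like the sketch below. The sample data is made up, and avoid is a hypothetical name. One difference from the original test is worth noting: set(string.punctuation) contains only single characters, so a token such as '--' would be kept here.

```python
import string

stopset = {'is'}  # sample stop words
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']

# One set holding everything to drop; set membership is O(1) on average.
avoid = set(string.punctuation) | {w.lower() for w in stopset}

final = [w for w in map(str.lower, mywords) if w not in avoid]
print(final)
```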
is there any faster way to achieve this than to iterate through the
list 3 times?
Turn johnsharpe's code into a generator. This may drastically speed up the use and lower memory use as well.
import string
stopset = ['is']
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
final = (word.lower() for word in mywords if (word not in string.punctuation and
word not in stopset))
print "final", list(final)
To display results outside of an iterator for debugging, use list as in this example
If you use filter you can do it with one list comprehension and it is easier to read.
final = filter(lambda s: s not in string.punctuation and s not in stopset, [word.lower() for word in mywords])
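Under Python 3, filter returns a lazy iterator rather than a list, so it needs to be wrapped in list() if a list is wanted. A small sketch, with sample data borrowed from the other answers and a named function (keep, a hypothetical name) in place of the lambda:

```python
import string

stopset = ['is']
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']

def keep(word):
    # True for words that are neither punctuation nor stop words.
    return word not in string.punctuation and word not in stopset

final = list(filter(keep, (word.lower() for word in mywords)))
print(final)
```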
given the string below,
sentences = "He is a student. She is a teacher. They're students, indeed. Babies sleep much. Tell me the truth. Bell--push it!"
how can i print the words in the "sentences" that contain only one "e", but no other vowels?
so, basically, i want the following:
He She Tell me the
my code below does not give me what i want:
for word in sentences.split():
    if re.search(r"\b[^AEIOUaeiou]*[Ee][^AEIOUaeiou]*\b", word):
        print word
any suggestions?
You're already splitting out the words, so use anchors (as opposed to word boundaries) in your regular expression:
>>> for word in sentences.split():
...     if re.search(r"^[^AEIOUaeiou]*[Ee][^AEIOUaeiou]*$", word):
...         print word
He
She
Tell
me
the
>>>
Unless you're going for a "regex-only" solution, some other options could be:
others = set('aiouAIOU')
[w for w in re.split(r"[^\w']", sentences) if w.count('e') == 1 and not others & set(w)]
which will return a list of the matching words. That led me to a more readable version below, which I'd probably prefer to run into in a maintenance situation as it's easier to see (and adjust) the different steps of breaking down the sentence and the discrete business rules:
for word in re.split(r"[^\w']", sentences):
    if word.count('e') == 1 and not others & set(word):
        print word
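The same rule can also be written without any regex by collecting the vowels of each word. This is only a sketch (only_one_e is a hypothetical name); it splits on whitespace, so trailing punctuation stays attached to the words.

```python
VOWELS = set("aeiou")

def only_one_e(word):
    # Collect the vowels in order; the word qualifies when that
    # list is exactly one 'e' (case-insensitive).
    return [ch for ch in word.lower() if ch in VOWELS] == ['e']

sentences = ("He is a student. She is a teacher. They're students, indeed. "
             "Babies sleep much. Tell me the truth. Bell--push it!")
matches = [w for w in sentences.split() if only_one_e(w)]
print(' '.join(matches))
```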
Ok, I'm trying to figure out how to make an inputted phrase such as this in Python ....
Self contained underwater breathing apparatus
output this...
SCUBA
Which would be the first letter of each word. Is this something to do with index? and maybe a .upper function?
This is the pythonic way to do it:
output = "".join(item[0].upper() for item in input.split())
# SCUBA
There you go. Short and easy to understand.
Later edit:
If you have other delimiters than space, you can split by words, like this:
import re
input = "self-contained underwater breathing apparatus"
output = "".join(item[0].upper() for item in re.findall(r"\w+", input))
# SCUBA
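Wrapped in a function, the regex variant might look like this. It is a sketch; acronym is a hypothetical name, and using the parameter name phrase avoids shadowing the built-in input.

```python
import re

def acronym(phrase):
    # \w+ finds runs of word characters, so hyphens and other
    # punctuation act as separators; an empty phrase gives ''.
    return ''.join(word[0] for word in re.findall(r"\w+", phrase)).upper()

print(acronym("self-contained underwater breathing apparatus"))
```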
Here's the quickest way to get it done
input = "Self contained underwater breathing apparatus"
output = ""
for i in input.upper().split():
    output += i[0]
#here is my trial, brief and potent!
str = 'Self contained underwater breathing apparatus'
reduce(lambda x,y: x+y[0].upper(),str.split(),'')
#=> SCUBA
Pythonic Idioms
Using a generator expression over str.split()
Optimize the inner loop by moving upper() to a single call outside the loop.
Implementation:
input = 'Self contained underwater breathing apparatus'
output = ''.join(word[0] for word in input.split()).upper()
Another way
input = 'Self contained underwater breathing apparatus'
output = ''.join(item[0].capitalize() for item in input.split())
def myfunction(string):
    return (''.join(map(str, [s[0] for s in string.upper().split()])))
myfunction("Self contained underwater breathing apparatus")
Returns SCUBA
s = "Self contained underwater breathing apparatus"
for item in s.split():
    print item[0].upper()
Some list comprehension love:
"".join([word[0].upper() for word in sentence.split()])
Another way, which may be easier for total beginners to understand:
acronym = input('Please give what names you want acronymized: ')
acro = acronym.split()  # acro is now a list of the words
for word in acro:
    print(word[0].upper(), end='')  # end='' keeps the letters on one line instead of one per line
print()  # ends the line so the next prompt starts on a new line
I believe you can get it done this way as well.
def get_first_letters(s):
    return ''.join(map(lambda x:x[0].upper(),s.split()))
Why is no one using regex? In JavaScript I would use a regex so I don't need a loop; please find a Python example below.
import re
input = "Self-contained underwater & breathing apparatus google.com"
output = ''.join(re.findall(r"\b(\w)", input.upper()))
print(output)
# SCUBAGC
Please note that the above regex, /\b(\w)/g, uses the \b word boundary and the \w word character, so it matches the first word character after each boundary between a non-word character and a word character. For example, "&" is not matched, the "c" in ".com" is matched, and both "s" and "c" are matched in "self-contained".
Alternative Regex using lookahead and lookbehind:
/(?!a\s)\b[\w]|&/g Excluding " a " and including "&"
/(?<=\s)[\w&]|^./g Any word character and "&" after every whitespace. This prevents matching "c" on .com but also prevents matching "c" on "self-contained"
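In Python, the lookbehind variant could be tried like this. It is a sketch; re.findall plays the role of the JavaScript /g flag, and the sample text is the one used above.

```python
import re

text = "Self-contained underwater & breathing apparatus google.com"

# First character of the string, or a word character / '&' that
# directly follows whitespace.
letters = re.findall(r"(?<=\s)[\w&]|^.", text.upper())
print(''.join(letters))
```

With this pattern the "C" of "contained" and the "C" of ".com" are both skipped, while "&" is kept.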
#!/usr/bin/python
#this looks for words in dictionary that begin with 'in' and the suffix is a real word
wordlist = [line.strip() for line in open('/usr/share/dict/words')]
newlist = []
for word in wordlist:
    if word.startswith("in"):
        newlist.append(word)
for word in newlist:
    word = word.split('in')
print newlist
How would I get the program to remove the string "in" from all the words that start with it? Right now it does not work.
#!/usr/bin/env python
# Look for all words beginning with 'in'
# such that the rest of the word is also
# a valid word.
# load the dictionary:
with open('/usr/share/dict/words') as inf:
    allWords = set(word.strip() for word in inf)    # one word per line
using 'with' ensures the file is always properly closed;
I make allWords a set; this makes searching it an O(1) operation
then we can do
# get the remainder of all words beginning with 'in'
inWords = [word[2:] for word in allWords if word.startswith("in")]
# filter to get just those which are valid words
inWords = [word for word in inWords if word in allWords]
or roll it into a single statement, like
inWords = [word for word in (word[2:] for word in allWords if word.startswith("in")) if word in allWords]
Doing it the second way also lets us use a generator for the inside loop, reducing memory requirements.
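The same idea can be tried without the dictionary file. The sketch below uses a small made-up word list in place of /usr/share/dict/words; in_prefixed_words is a hypothetical name.

```python
def in_prefixed_words(all_words):
    """Return remainders of 'in...' words that are themselves words."""
    all_words = set(all_words)  # O(1) average membership tests
    return sorted(word[2:] for word in all_words
                  if word.startswith("in") and word[2:] in all_words)

sample = ["in", "inside", "side", "input", "put", "index", "into", "to"]
print(in_prefixed_words(sample))
```

The word "in" itself leaves an empty remainder, which is not in the set, so it is filtered out along with "index".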
split() returns a list of the segments obtained by splitting. Furthermore,
word = word.split('in')
doesn't modify your list, it just modifies the variable being iterated.
Try replacing your second loop with this:
for i in range(len(newlist)):
    word = newlist[i].split('in', 1)
    newlist[i] = word[1]
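The point about the loop variable can be seen in a short experiment. This sketch uses a made-up two-word list:

```python
newlist = ["inside", "input"]

for word in newlist:
    word = word.split('in')   # rebinds the local name only
print(newlist)                # the list is unchanged

# Building a new list (or assigning by index) is what changes the contents.
stripped = [word[2:] for word in newlist]
print(stripped)
```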
It's difficult to tell from your question what you want in newlist. If you just want the words that start with "in", but with "in" removed, then you can use a slice:
newlist = [word[2:] for word in wordlist if word.startswith('in')]
If you want the words that start with "in" and are still in wordlist once they've had "in" removed (is that what you meant by "real" in your comment?), then you need something a little different:
newlist = [word for word in wordlist if word.startswith('in') and word[2:] in wordlist]
Note that in Python we use a list, not an "array".
Suppose that wordlist is the list of words. Following code should do the trick:
for i in range(len(wordlist)):
    if wordlist[i].startswith("in"):
        wordlist[i] = wordlist[i][2:]
It is better to use a while loop if the number of words in the list is quite big.