Here is the code to find the words that start with 'a' in the sentence "This is an apple tree.":
st = 'This is an apple tree'
for word in st.split():
    if word[0] == 'a':
        print(word)
I want to make it a function that takes in any sentence I want. How do I do that?
Here is the code I came up with, but it is not doing what I want.
def find_words(text):
    for word in find_words.split():
        if word[0] == 'a':
            print(word)
    return find_words
find_words('This is an apple tree')
Thank you.
You can use the code below. It will give you the list of words that start with 'a'.
This is a simple list comprehension with an if clause. split() without an argument splits the sentence on whitespace by default, and the startswith method filters for words beginning with 'a'.
sentence = 'This is an apple tree'
words = [word for word in sentence.split() if word.startswith('a')]
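For example, printing the result for this sentence gives:
print(words)  # ['an', 'apple']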
The problem is in how you define the for loop. It should be:
for word in text.split(' '):
    ...
because text is the parameter of your function; find_words is the function itself, so find_words.split() won't work.
If you want to print the result, try this:
st = 'This is an apple tree'
def find_words(text):
    for word in text.split():
        if word.startswith('a'):
            print(word)
find_words(st)
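If you would rather collect the matches than print them, a minimal variant along the same lines (a sketch):
def find_words(text):
    # Return the matching words instead of printing them
    return [word for word in text.split() if word.startswith('a')]
print(find_words(st))  # ['an', 'apple']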
I have allow_wd as the words that I want to search for.
sentench is an array from the main database.
The output I need:
Newsentench = ['one three','']
Please help.
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
It is difficult to understand what you are asking. Assuming you want any word in sentench to be kept if it contains anything in allow_wd, something like the following will work:
sentench = ['one from twooo or three people are here', 'he is here']
allow_wd = ['one', 'two', 'three', 'four']
result = []
for sentence in sentench:
    filtered = []
    for word in sentence.split():
        for allowed_word in allow_wd:
            if allowed_word.lower() in word.lower():
                filtered.append(word)
    result.append(" ".join(filtered))
print(result)
If you want the word in the sentence to be exactly equal to an allowed word, rather than just contain one, change if allowed_word.lower() in word.lower(): to if allowed_word.lower() == word.lower():
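For example, the exact-match variant can be written compactly (a sketch; for the sample data it produces the expected output):
result = [' '.join(w for w in s.split() if w.lower() in [a.lower() for a in allow_wd]) for s in sentench]
print(result)  # ['one three', '']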
Using regex word boundaries with \b will ensure that two is matched strictly and won't match inside twooo.
import re
sentench = ['one from twooo or three people are here', 'he is here']
allow_wd = ['one', 'two', 'three', 'four']
newsentench = []
for sent in sentench:
    output = []
    for wd in allow_wd:
        if re.findall(r'\b' + wd + r'\b', sent):
            output.append(wd)
    newsentench.append(' '.join(output))
print(newsentench)
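For the sample data this prints ['one three', ''], since the \b boundaries keep two from matching inside twooo.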
Thanks for your clarification, this should be what you want.
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
print([" ".join([word for word in s.split(" ") if word in allow_wd]) for s in sentench])
returning: ['one three', '']
I have a list of sentences (~100k sentences total) and a list of "infrequent words" (length ~20k). I would like to run through each sentence and replace any word that matches an entry in "infrequent_words" with the tag "UNK".
So, as a small example, if
infrequent_words = ['dog','cat']
sentence = 'My dog likes to chase after cars'
then after applying the transformation it should be
sentence = 'My unk likes to chase after cars'
I am having trouble finding an efficient way to do this. The function below (applied to each sentence) works, but it is very slow and I know there must be something better. Any suggestions?
def replace_infrequent_words(text, infrequent_words):
    for word in infrequent_words:
        text = text.replace(word, 'unk')
    return text
Thank you!
infrequent_words = {'dog', 'cat'}
sentence = 'My dog likes to chase after cars'
def replace_infrequent_words(text, infrequent_words):
    words = text.split()
    for i in range(len(words)):
        if words[i] in infrequent_words:
            words[i] = 'unk'
    return ' '.join(words)
print(replace_infrequent_words(sentence, infrequent_words))
Two things that should improve performance:
Use a set instead of a list for storing infrequent_words.
Use a list to store each word in text so you don't have to scan the entire text string with each replacement.
This doesn't account for grammar and punctuation but this should be a performance improvement from what you posted.
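As a compact equivalent (a sketch; same behavior, and still punctuation-naive), the loop body can be collapsed into a single expression:
def replace_infrequent_words(text, infrequent_words):
    # Replace whole whitespace-delimited tokens found in the set
    return ' '.join('unk' if word in infrequent_words else word for word in text.split())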
I would like the program to detect whether a certain word comes before the search word, and to add the search word to a list if it does not.
This is what I have come up with myself:
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
all = ["take", "take"]
all2= [w for w in all if not(re.search(r'not' + w + r'\b', sentence))]
print(all2)
The expected output is ["take"], but it stays the same: ["take", "take"].
Here is how it should be formulated: gather all occurrences of the word take that are not preceded by the word not:
import re
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
search_word = 'take'
all_takes_without_not = re.findall(fr'(?<!\bnot)\s+({search_word})\b', sentence)
print(all_takes_without_not)
The output:
['take']
It may be simpler to first convert your sentence to a list of words.
from itertools import chain
# Get individual words from the string
words = sentence.split()
# Create an iterator which yields the previous word at each position
previous = chain([None], words)
output = [word for prev, word in zip(previous, words) if word=='take' and prev != 'not']
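Printing the result confirms that the second take (preceded by not) is excluded:
print(output)  # ['take']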
I have a corpus of sentences in a specific domain.
I am looking for an open-source code/package that I can give the data to, and it will generate a good, reliable language model (meaning: given a context, it knows the probability of each word).
Is there such a code/project?
I saw this GitHub repo: https://github.com/rafaljozefowicz/lm, but it didn't work.
I recommend writing your own basic implementation. First, we need some sentences:
import nltk
from nltk.corpus import brown
# The Brown corpus ships with NLTK but may need a one-time download: nltk.download('brown')
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())
sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:
punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = [word for word in sentence if word not in punctuation]
    sentences[i] = new_sentence
Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:
for i, sentence in enumerate(sentences.copy()):
    new_sentence = list()
    for word in sentence:
        new_word = word.lower()  # Lower-case all characters in the word
        new_sentence.append(new_word)
    sentences[i] = new_sentence
Next, we need special start and end words to represent words that are valid at the beginning and end of sentences. You should pick start and end words that don't exist in your training data.
start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:
new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)
unigram_fdist = nltk.FreqDist(new_words)
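A quick way to sanity-check the distribution (the exact counts depend on the preprocessing choices above):
print(unigram_fdist.most_common(5))  # the most frequent tokens, e.g. 'the' and the start/end markers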
And now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, with the start and end markers added, the sentence "i am the walrus" gives the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".
bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams
Now we can create a frequency distribution:
bigram_fdist = nltk.ConditionalFreqDist(bigrams)
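To peek at it (assuming the lower-casing step above was applied, so 'the' is a condition):
print(bigram_fdist['the'].most_common(3))  # the words that most often follow 'the'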
Finally, we want to know the probability of each word in the model:
def getUnigramProbability(word):
    if word in unigram_fdist:
        return unigram_fdist[word] / total_words
    else:
        return -1  # You should figure out how you want to handle out-of-vocabulary words

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1  # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
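A quick usage sketch (again assuming the lower-casing step was applied):
print(getUnigramProbability('the'))
print(getBigramProbability('<<START>>', 'the'))  # the probability that a sentence starts with 'the'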
While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
You might try word_language_model from the PyTorch examples. There might be an issue if you have a big corpus, though: they load all the data into memory.
I have a list of sentences as follows:
['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
In the above sentences I need to identify all sentences ending with '?' or '.' or 'gy', and print the final word.
My approach is as follows:
# words will contain the list I pasted above.
word = [w for w in words if re.search('(?|.|gy)$', w)]
for i in word:
    print i
The result I get is:
Hey, how are you?
My name is Mathews.
I hate vegetables
French fries came out soggy
The expected result is:
you?
Mathews.
soggy
Use the endswith() method.
>>> for line in testList:
...     for word in line.split():
...         if word.endswith(('?', '.', 'gy')):
...             print word
Output:
you?
Mathews.
soggy
Use endswith with a tuple.
lines = ['Hey, how are you?\n', 'My name is Mathews.\n', 'I hate vegetables\n', 'French fries came out soggy\n']
for line in lines:
    for word in line.split():
        if word.endswith(('?', '.', 'gy')):
            print word
Regular expression alternative:
import re
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
    for word in re.findall(r'\w+(?:\?|\.|gy\b)', line):
        print word
You were close.
You just need to escape the special characters (? and .) in the pattern:
re.search(r'(\?|\.|gy)$', w)
More details in the documentation.
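Note that searching the whole line like this still keeps (and prints) entire lines; to get just the final word, as in your expected output, one option is (a sketch):
import re
words = ['Hey, how are you?\n', 'My name is Mathews.\n', 'I hate vegetables\n', 'French fries came out soggy\n']
finals = [w.split()[-1] for w in words if re.search(r'(\?|\.|gy)$', w.rstrip())]
print(finals)  # ['you?', 'Mathews.', 'soggy']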