String comparison in Python: words ending with ?, . or 'gy'

I have a set of words as follows:
['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
From the above sentences I need to identify all sentences ending with ?, ., or 'gy', and print the final word.
My approach is as follows:
# words will contain the string i have pasted above.
word = [w for w in words if re.search('(?|.|gy)$', w)]
for i in word:
    print i
The result i get is:
Hey, how are you?
My name is Mathews.
I hate vegetables
French fries came out soggy
The expected result is:
you?
Mathews.
soggy

Use endswith() method.
>>> for line in testList:
...     for word in line.split():
...         if word.endswith(('?', '.', 'gy')):
...             print(word)
Output:
you?
Mathews.
soggy
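The same idea can be written as a single list comprehension that collects the matches instead of printing them (testList is assumed to be the list from the question):

```python
testList = ['Hey, how are you?\n', 'My name is Mathews.\n',
            'I hate vegetables\n', 'French fries came out soggy\n']

# One pass: keep each word that ends with one of the target suffixes.
matches = [word for line in testList for word in line.split()
           if word.endswith(('?', '.', 'gy'))]
print(matches)  # ['you?', 'Mathews.', 'soggy']
```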

Use endswith with a tuple.
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
    for word in line.split():
        if word.endswith(('?', '.', 'gy')):
            print(word)
Regular expression alternative:
import re
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
    for word in re.findall(r'\w+(?:\?|\.|gy\b)', line):
        print(word)

You were close.
You just need to escape the special characters (? and .) in the pattern:
re.search(r'(\?|\.|gy)$', w)
More details in the documentation.
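Putting the corrected pattern to work on the question's data and printing only the final word of each matching sentence (a sketch):

```python
import re

words = ['Hey, how are you?\n', 'My name is Mathews.\n',
         'I hate vegetables\n', 'French fries came out soggy\n']

for w in words:
    w = w.strip()  # remove the trailing newline before anchoring at $
    if re.search(r'(\?|\.|gy)$', w):
        print(w.split()[-1])
```

This prints you?, Mathews. and soggy, matching the expected result in the question.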


How to filter a sentence based on list of the allowed words in python?

I have allow_wd as the words that I want to search for.
sentench is the array of source sentences.
The output I need:
Newsentench = ['one three','']
Please help.
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
It is difficult to understand what you are asking. Assuming you want any word in sentench to be kept if it contains anything in allow_wd, something like the following will work:
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
result = []
for sentence in sentench:
    filtered = []
    for word in sentence.split():
        for allowed_word in allow_wd:
            if allowed_word.lower() in word.lower():
                filtered.append(word)
    result.append(" ".join(filtered))
print(result)
If you want the word to be exactly equal to an allowed word, instead of just containing one, change if allowed_word.lower() in word.lower(): to if allowed_word.lower() == word.lower():
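The difference shows up on the question's own data: 'twooo' contains 'two' but is not equal to it. A quick sketch of both variants:

```python
sentench = ['one from twooo or three people are here', 'he is here']
allow_wd = ['one', 'two', 'three', 'four']

# Substring test: 'twooo' is kept because it contains 'two'.
contains = [" ".join(w for w in s.split() if any(a in w.lower() for a in allow_wd))
            for s in sentench]
# Exact test: 'twooo' is dropped.
exact = [" ".join(w for w in s.split() if any(a == w.lower() for a in allow_wd))
         for s in sentench]

print(contains)  # ['one twooo three', '']
print(exact)     # ['one three', '']
```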
Using regex boundaries with \b will ensure that two will be strictly matched and won't match twoo.
import re
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
newsentench = []
for sent in sentench:
    output = []
    for wd in allow_wd:
        if re.findall(r'\b' + wd + r'\b', sent):
            output.append(wd)
    newsentench.append(' '.join(output))
print(newsentench)
Thanks for your clarification, this should be what you want.
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
print([" ".join([word for word in s.split(" ") if word in allow_wd]) for s in sentench])
returning: ['one three', '']

Python code to find the words starting with 'a' in a sentence

Here is the code to find the words that start with 'a' in the sentence "This is an apple tree.":
st = 'This is an apple tree'
for word in st.split():
    if word[0] == 'a':
        print(word)
I want to turn it into a function that takes in any sentence I want. How do I do that?
Here is the code I came up with, but it is not doing what I want.
def find_words(text):
    for word in find_words.split():
        if word[0] == 'a':
            print(word)
    return find_words

find_words('This is an apple tree')
Thank you.
You can use the code below. It returns the list of words that start with 'a'.
This is a simple list comprehension with an if clause. split() without an argument splits the sentence on whitespace by default, and the startswith method filters for 'a'.
sentence = 'This is an apple tree'
words = [word for word in sentence.split() if word.startswith('a')]
The problem is how you are defining the for loop. It should be:
for word in text.split(' '):
    ...
because text is the parameter of your function.
If you want to print the result, try this:
st = 'This is an apple tree'

def find_words(text):
    for word in text.split():
        if word.startswith('a'):
            print(word)

find_words(st)
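If the function should return the matches rather than print them, the loop collapses to a list comprehension (a sketch; the function name is kept from the question):

```python
def find_words(text):
    # Collect the words beginning with 'a' instead of printing them.
    return [word for word in text.split() if word.startswith('a')]

print(find_words('This is an apple tree'))  # ['an', 'apple']
```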

Using .replace effectively on text

I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.
from collections import Counter
from string import punctuation
path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f
                          for word in line.replace(")", " ").replace("(", " ")
                          .replace(":", " ").replace("", " ").split())
wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")
unique = [word for word, count in word_counts.items() if count == 1]
for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))
print(wordlist)
The output should be "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.", as sojournings is the first word that only appears once. Instead, it outputs "GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan." Because some of the other letters appear in keywords, it tries to capitalize them as well.
Any ideas?
I rewrote the code pretty significantly since some of the chained replace calls might prove to be unreliable.
import string

# The sentence.
sentence = "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan."

# Remove punctuation (Python 3: str.translate with a deletion table).
rm_punc = sentence.translate(str.maketrans('', '', string.punctuation))
words = rm_punc.split(' ')  # split on spaces to get a list of words

# Find all unique word occurrences.
single_occurrences = []
for word in words:
    # if the word only occurs once, append it to the list
    if words.count(word) == 1:
        single_occurrences.append(word)

# For each unique word, find its index and capitalize the letter at that index
# in the initial string (the letter at that index is also the first letter of
# the word). Note that strings are immutable, so we are actually creating a new
# string on each iteration. Also, sometimes small words occur inside of other
# words, e.g. 'an' inside of 'land'. In order to make sure that our call to
# `index()` doesn't find these small words, we keep track of `start`, which
# makes sure we only ever search from the end of the previously found word.
start = 0
for word in single_occurrences:
    try:
        word_idx = start + sentence[start:].index(word)
    except ValueError:
        # Could not find the word in the sentence. Skip it.
        pass
    else:
        # Update the counter.
        start = word_idx + len(word)
        # Rebuild the sentence with capitalization.
        first_letter = sentence[word_idx].upper()
        sentence = sentence[:word_idx] + first_letter + sentence[word_idx+1:]

print(sentence)
Text replacement by patterns calls for regex.
Your text is a bit tricky: you have to
remove digits
remove punctuation
split into words
care about capitalisation: 'It's' vs 'it's'
only replace full matches: 'remote' vs 'mote' when replacing 'mote'
etc.
This should do it - see the comments inside for explanations (bible.txt is from your link):
from collections import Counter, defaultdict
from string import punctuation, digits
import re

with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()

# get a set of unwanted characters and clean the text
ps = set(punctuation + digits)
s2 = ''.join(c for c in s if c not in ps)

# split into words
s3 = s2.split()

# create a set of all capitalizations of each word
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word)  # e.g. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}

# count all words _upper case_ and use those that only occur once
single_occurence_upper_words = [w for w, n in Counter(w.upper() for w in s3).most_common() if n == 1]

text = s
# now the replace part - for all upper single words
for upp in single_occurence_upper_words:
    # for all occurring capitalizations in the text
    for orig in repl[upp]:
        # use a regex replace to find the original word from our repl dict with
        # space/punctuation before/after it, and replace it with the uppercase word
        text = re.sub(f"(?<=[{punctuation} ])({orig})(?=[{punctuation} ])", upp, text)

print(text)
Output (shortened):
Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.
2 These are the GENERATIONS of Jacob.
Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him.
<snipp>
The regex uses lookahead '(?=...)' and lookbehind '(?<=...)' syntax to make sure we replace only full words; see the regex syntax documentation.
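A minimal, self-contained illustration of that lookaround trick on a toy sentence (the word 'mote' is a stand-in, not from the original text):

```python
import re
from string import punctuation

text = "a mote in a remote place, mote!"
# Only 'mote' with space/punctuation on both sides is replaced;
# the 'mote' inside 'remote' is left alone.
result = re.sub(f"(?<=[{punctuation} ])(mote)(?=[{punctuation} ])", "MOTE", text)
print(result)  # a MOTE in a remote place, MOTE!
```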

filtering stopwords near punctuation

I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() has elements like 'word.' that don't match to the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])
But is there a better way?
Tokenize the text first, then remove the stopwords. A tokenizer usually recognizes punctuation.
import nltk

text = ('Son, if you really want something in this life, '
        'you have to work for it. Now quiet! They are about '
        'to announce the lottery numbers.')
stopwords = ['in', 'to', 'for', 'the']

sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join(w for w in tokens if w not in stopwords))

print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
You could use something like this:
import re
clean = ' '.join([word for word in text.split() if re.match('([a-z]|[A-Z])+', word).group().lower() not in (stopwords)])
This pulls out the leading run of lowercase and uppercase ASCII letters from each token and checks it against the words in your stopword set or list. It also assumes that all of the words in stopwords are lowercase, which is why I converted the matched word to lowercase; take that out if I made too great an assumption.
Also, I'm not proficient in regex, so sorry if there's a cleaner or more robust way of doing this.
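One caveat with that line: re.match returns None for a token containing no ASCII letters at all (e.g. a lone '-'), and the chained .group() call would then raise an AttributeError. A slightly more defensive sketch (the sample text and stopwords here are made up):

```python
import re

text = "The quick - brown fox. Jumps over the lazy dog."
stopwords = {'the', 'over'}

kept = []
for word in text.split():
    m = re.match('[a-zA-Z]+', word)
    # Keep tokens with no letters, and tokens whose letter-part is not a stopword.
    if m is None or m.group().lower() not in stopwords:
        kept.append(word)
clean = ' '.join(kept)
print(clean)  # quick - brown fox. Jumps lazy dog.
```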

Python: Auto-correct

I have two files, check.txt and orig.txt. I want to check every word in check.txt and see if it matches any word in orig.txt. If it does match, the code should replace that word with its first match; otherwise it should leave the word as it is. But somehow it's not working as required. Kindly help.
check.txt looks like this:
ukrain
troop
force
and orig.txt looks like:
ukraine cnn should stop pretending & announce: we will not report news while it reflects bad on obama #bostonglobe #crowleycnn #hardball
rt #cbcnews: breaking: .#vice journalist #simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou
russia 'outraged' at deadly shootout in east #ukraine - moscow:... http://t.co/nqim7uk7zg
#groundtroops #russianpresidentvladimirputin
http://pastebin.com/XJeDhY3G
f = open('check.txt', 'r')
orig = open('orig.txt', 'r')
new = open('newfile.txt', 'w')

for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()
            if word in word2:
                word = word2
            else:
                print('not found')
    new.write(word)
There are two problems with your code:
when you loop over the words in f, each word still has a trailing newline character, so your in check does not work
you want to iterate over orig once for each of the words from f, but files are iterators, exhausted after the first word from f
You can fix those by doing word = word.strip() and orig = list(orig), or you can try something like this:
# get all stemmed words
stemmed = [line.strip() for line in f]

# set of lowercased original words
original = set(word.lower() for line in orig for word in line.split())

# map stemmed words to unstemmed words
unstemmed = {word: None for word in stemmed}

# find original words for the word stems in the map
for stem in unstemmed:
    for word in original:
        if stem in word:
            unstemmed[stem] = word

print(unstemmed)
Or shorter (without that final double loop), using difflib, as suggested in the comments:
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
Also, remember to close your files, or use the with keyword to close them automatically.
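A sketch of what the difflib version produces on made-up data (get_close_matches returns a list, empty when nothing is within the default similarity cutoff):

```python
import difflib

stemmed = ['ukrain', 'troop', 'force']
original = {'ukraine', 'groundtroops', 'outraged', 'moscow'}

# For each stem, keep the single closest original word, if any is close enough.
unstemmed = {word: difflib.get_close_matches(word, original, 1)
             for word in stemmed}
print(unstemmed['ukrain'])  # ['ukraine']
```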
