Using a keyword to print a sentence in Python

Hello, I am writing a Python program that reads through a given .txt file and looks for keywords. Once I have found my keyword (for example, 'data'), I would like to print out the entire sentence the word is associated with.
I have read in my input file and used the split() method to get rid of spaces, tabs and newlines and put all the words into a list.
Here is the code I have thus far:
text_file = open("file.txt", "r")
lines = text_file.read().split()
keyword = 'data'

for token in lines:
    if token == keyword:
        # I have found my keyword; what methods can I use to
        # print out the words before and after the keyword?
        # I have a feeling I want to use '.' as a marker for sentences.
        print(sentence)  # prints the entire sentence
file.txt reads as follows:
Welcome to SOF! This website securely stores data for the user.
Desired output:
This website securely stores data for the user.

We can just split the text on characters that represent sentence endings and then loop through those lines, printing the ones that contain our keyword.
To split text on multiple characters (for example, a sentence ending can be marked with !, ? or .), we can use a regex:
import re

keyword = "data"
line_end_chars = "!", "?", "."
example = "Welcome to SOF! This website securely stores data for the user?"

regexPattern = '|'.join(map(re.escape, line_end_chars))
line_list = re.split(regexPattern, example)
# line_list looks like this:
# ['Welcome to SOF', ' This website securely stores data for the user', '']

# Now we just need to see which lines have our keyword
for line in line_list:
    if keyword in line:
        print(line)
But keep in mind that if keyword in line: matches a sequence of characters, not necessarily a whole word - for example, 'data' in 'datamine' is True. If you only want to match whole words, you ought to use regular expressions.
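For instance, a minimal sketch of whole-word matching with a word-boundary anchor (this example is my own illustration, not from the original answer):

import re

keyword = "data"
# \b anchors the match at word boundaries, so 'datamine' no longer matches
word_pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")

print(bool(word_pattern.search("stores data for the user")))  # True
print(bool(word_pattern.search("datamine is different")))     # False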

My approach is similar to Alberto Poljak's, but a little more explicit.
The motivation is to realise that splitting on words is unnecessary - Python's in operator will happily find a word in a sentence. What is necessary is splitting into sentences. Unfortunately, sentences can end with ., ? or !, and Python's split function does not allow multiple separators. So we have to get a little complicated and use re.
re requires us to put a | between each delimiter and escape some of the delimiters, because both . and ? have special meanings in a regex by default. Alberto's solution used re itself to do all this, which is definitely the way to go. But if you're new to re, my hard-coded version might be clearer.
The other addition I made was to put each sentence's trailing delimiter back on the sentence it belongs to. To do this I wrapped the delimiters in (), which captures them in the output. I then used zip to put them back on the sentence they came from. The 0::2 and 1::2 slices will take every even index (the sentences) and concatenate them with every odd index (the delimiters). Uncomment the print statement to see what's happening.
import re

lines = "Welcome to SOF! This website securely stores data for the user. Another sentence."
keyword = "data"

sentences = re.split(r'(\.|!|\?)', lines)
sentences_terminated = [a + b for a, b in zip(sentences[0::2], sentences[1::2])]
# print(sentences_terminated)

for sentence in sentences_terminated:
    if keyword in sentence:
        print(sentence)
        break
Output:
This website securely stores data for the user.

This solution uses a fairly simple regex to find your keyword in a sentence, with words that may or may not appear before and after it, followed by a final period character. It handles the surrounding spaces, and it takes only one call to re.search().
import re

text_file = open("file.txt", "r")
text = text_file.read()
keyword = 'data'

# (\w+\s)* and (\w+\s?)* allow any number of words before and after the
# keyword; \. matches the sentence's final period
match = re.search(r"\s?(\w+\s)*" + keyword + r"\s?(\w+\s?)*\.", text)
if match:
    print(match.group().strip())

Another solution:
def check_for_stop_punctuation(token):
    stop_punctuation = ['.', '?', '!']
    for punct in stop_punctuation:
        if token.find(punct) > -1:
            return True
    return False

text_file = open("file.txt", "r")
lines = text_file.read().split()
keyword = 'data'

sentence = []
i = 0
while i < len(lines):
    token = lines[i]
    sentence.append(token)
    if token == keyword:
        found_stop_punctuation = check_for_stop_punctuation(token)
        while not found_stop_punctuation and i < len(lines) - 1:
            i += 1
            token = lines[i]
            sentence.append(token)
            found_stop_punctuation = check_for_stop_punctuation(token)
        print(sentence)
        sentence = []
    elif check_for_stop_punctuation(token):
        sentence = []
    i += 1
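As an aside, the punctuation check could be written more compactly with any(); this is an equivalent sketch, not part of the original answer:

def check_for_stop_punctuation(token):
    # any() short-circuits as soon as one stop mark is found in the token
    return any(p in token for p in ('.', '?', '!'))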

Related

Replacing/Removing parenthetical comments in a TEXT FILE using python

I am trying to remove all parenthetical comments that are in a text file. Here is a very brief example called "sample.txt":
Sentence one (comment 1). Second sentence (second comment).
I would like it to look like this instead:
Sentence one . Second sentence .
I have tried re.sub in the form below, but can only get it to work for strings, not text files. Here is one of the many things I've tried:
intext = 'C:\\Users\\Sarah\\PycharmProjects\\pythonProject1\\sample.txt'
outtext = 'C:\\Users\\Sarah\\PycharmProjects\\pythonProject1\\EDITEDsample.txt'

with open(intext, 'r') as f, open(outtext, 'w') as fo:
    for line in f:
        fo.write(re.sub(r'\([^()]*\)', ''))
This doesn't get me an error message but it also doesn't do anything to the text.
with open(intext, 'r') as f, open(outtext, 'w') as fo:
    for line in f:
        fo.write(line.replace('(', " ").replace(')', " "))
This successfully removes the parenthesis, but since .replace doesn't handle regex, I don't see how I can use it to also remove any text that is between parenthesis.
I also tried
with open(intext, 'r') as f, open(outtext, 'w') as fo:
    for line in f:
        re.sub(r'\([^()]*\)', '', outtext)
but I get an error saying I'm missing a string, which is expected since re.sub requires strings. What can I use to remove/replace parenthetical comments from a TEXT file?
re.sub takes three parameters (pattern, replacement, string), and the result must be assigned back to a variable:
import re

# Read the entire file as a string into a variable
with open('input.txt') as f:
    data = f.read()

# Replace all parenthesized items.
# Make sure to use a non-greedy match or it will replace everything
# from the very first parenthesis to the very last parenthesis.
# Note that this will NOT handle nested parentheses correctly,
# e.g. "a (commented (nested)) sentence" -> "a ) sentence".
# Regular expressions don't handle nesting well. Use a parser for that.
data = re.sub(r'\(.*?\)', '', data)

# Write the file back out
with open('output.txt', 'w') as f:
    f.write(data)
Input file:
sentence (comment) sentence (comment)
sentence (comment) sentence (comment)
sentence (comment) sentence (comment)
Output file:
sentence sentence
sentence sentence
sentence sentence
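To make the greedy-versus-non-greedy comment concrete, a small demo (my own example, not from the original answer):

import re

s = 'keep (drop) keep (drop) keep'
print(re.sub(r'\(.*\)', '', s))   # greedy: 'keep  keep'
print(re.sub(r'\(.*?\)', '', s))  # non-greedy: 'keep  keep  keep'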
The thing with regular expressions is that they can't handle nested parentheses; something like "(())" is bound to fail.
So, assuming each '(' is eventually matched by a ')', it would be better to handle it this way:
infile = ["Sentence one (comment 2). Second sentence (second comment).\n",
"Sentence one (comment 2). Second sentence (second (comment)((((((())))))))."]
open_parenthese_counter = 0
for line in infile:
for char in line:
if open_parenthese_counter == 0:
print(char, end='') # write into the output file.
elif char == '(':
open_parenthese_counter += 1
elif char == ')':
open_parenthese_counter -= 1 # = max(open_parenthese_counter-1, 0)
and then make changes as you see fit.
Output:
Sentence one . Second sentence .
Sentence one . Second sentence .
You may use the newer regex module with
\((?:[^()]+|(?R))+\)
where (?R) recursively matches the whole pattern again, which is what lets it cope with nesting. See a demo on regex101.com.
In Python this could be
import regex as re
rx = re.compile(r'\((?:[^()]+|(?R))+\)')
data = """
Sentence one (comment 1). Second sentence (second comment).
Or even ((nested) ones).
"""
data = rx.sub('', data)
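Printing data afterwards would show all the parenthetical comments removed, including the nested ones:

Sentence one . Second sentence .
Or even .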

Python - how to separate paragraphs from text?

I need to separate texts into paragraphs and be able to work with each of them. How can I do that? Between every 2 paragraphs there can be at least 1 empty line. Like this:
Hello world,
this is an example.

Let´s program something.

Creating new program.
Thanks in advance.
This should work:
text.split('\n\n')
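For example, with the question's text (assuming a single empty line between paragraphs):

text = "Hello world,\nthis is an example.\n\nLet´s program something.\n\nCreating new program."
print(text.split('\n\n'))
# ['Hello world,\nthis is an example.', 'Let´s program something.', 'Creating new program.']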
Try
result = list(filter(lambda x: x != '', text.split('\n\n')))
Not an entirely trivial problem, and the standard library doesn't seem to have any ready solutions.
Paragraphs in your example are split by at least two newlines, which unfortunately makes text.split("\n\n") invalid. I think that instead, splitting by regular expressions is a workable strategy:
import fileinput
import re

NEWLINES_RE = re.compile(r"\n{2,}")  # two or more "\n" characters

def split_paragraphs(input_text=""):
    no_newlines = input_text.strip("\n")  # remove leading and trailing "\n"
    split_text = NEWLINES_RE.split(no_newlines)  # regex splitting

    # p + "\n" ensures that all lines in the paragraph end with a newline
    # p.strip() is truthy if the paragraph has characters other than whitespace
    paragraphs = [p + "\n" for p in split_text if p.strip()]

    return paragraphs

# sample code, to split all script input files into paragraphs
text = "".join(fileinput.input())
for paragraph in split_paragraphs(text):
    print(f"<<{paragraph}>>\n")
Edited to add:
It is probably cleaner to use a state machine approach. Here's a fairly simple example using a generator function, which has the added benefit of streaming through the input one line at a time, and not storing complete copies of the input in memory:
import fileinput

def split_paragraph2(input_lines):
    paragraph = []  # store current paragraph as a list
    for line in input_lines:
        if line.strip():  # True if line is non-empty (apart from whitespace)
            paragraph.append(line)
        elif paragraph:  # If we see an empty line, yield the paragraph (if any)
            yield "".join(paragraph)
            paragraph = []
    if paragraph:  # After end of input, yield the final paragraph (if any)
        yield "".join(paragraph)

# sample code, to split all script input files into paragraphs
for paragraph in split_paragraph2(fileinput.input()):
    print(f"<<{paragraph}>>\n")
I usually split then filter out the '' and strip. ;)
a = '''
Hello world,
this is an example.

Let´s program something.

Creating new program.
'''

data = [content.strip() for content in a.splitlines() if content]
print(data)
This worked for me:
text = "".join(text.splitlines())
text.split('something that is almost always used to separate sentences (i.e. a period, question mark, etc.)')
Easier. I had the same problem.
Just replace the double \n\n entry with a character that you seldom see in the text (here ¾):
a = '''
Hello world,
this is an example.

Let´s program something.

Creating new program.'''

a = a.replace("\n\n", "¾")
splitted_text = a.split('¾')
print(splitted_text)

Automate the Boring Stuff With Python Madlibs: Trouble with Replacing Matched Regex (Losing Punctuation Marks)

This is my code:
import os, re

def madLibs():
    madLibsDirectory = 'madLibsFiles'
    os.chdir(madLibsDirectory)
    madLibsFile = 'panda.txt'
    madLibsFile = open(madLibsFile)
    file = madLibsFile.read()
    madLibsFile.close()
    wordRegex = re.compile(r"ADJECTIVE|VERB|ADVERB|NOUN")

    file = file.split()  # split the madlib into a list with each word
    for word in file:
        # check if word matches regex
        if wordRegex.match(word):
            foundWord = wordRegex.search(word)  # create match object on word
            newWord = input(f'Please Enter A {foundWord.group()}: ')  # receive word
            file[file.index(word)] = wordRegex.sub(newWord, foundWord.group(), 1)
    file = ' '.join(file)
    print(file)

def main():
    madLibs()

if __name__ == '__main__':
    main()
The problem line is file[file.index(word)] = wordRegex.sub(newWord, foundWord.group(), 1).
When my program runs across the words ADJECTIVE, VERB, ADVERB, or NOUN, it will prompt the user for a word and replace the placeholder with the input. Currently this code correctly replaces the word; however, it does not keep punctuation.
For example here is panda.txt:
The ADJECTIVE panda walked to the NOUN and then VERB. A nearby NOUN
was unaffected by these events.
When I replace VERB with, say, "ate", it will do so but remove the period: "...and then ate A nearby...".
I'm sure the answer isn't too complicated, but my regex knowledge is not fantastic yet, unfortunately.
Thanks!
You've correctly identified the line that has the problem:
file[file.index(word)] = wordRegex.sub(newWord, foundWord.group(), 1)
The problem with this line is that you're replacing only a part of foundWord.group(), which only contains the matched word, not any punctuation marks that appear around it.
One easy fix is to drop foundWord completely and just use word as the text to do your replacement in. The line above would become:
file[file.index(word)] = wordRegex.sub(newWord, word, 1)
That should work! You can however improve your code in a number of other ways. For instance, rather than needing to search file for word to get the correct index for the assignment, you should use enumerate to get the index of each word as you go:
for i, word in enumerate(file):
    if ...:
        ...
        file[i] = ...
Or you could make a bigger change. The re.sub function (and the equivalent method of compiled pattern objects) can make multiple substitutions in a single pass, and it can take a function, rather than a string to use as the replacement. The function will be called with a match object each time the pattern matches in the text. So why not use a function to prompt the user for the replacement word, and replace all the keywords in a single go?
def madLibs():
    madLibsDirectory = 'madLibsFiles'
    os.chdir(madLibsDirectory)
    filename = 'panda.txt'  # changed this variable name, to avoid duplication
    with open(filename) as file:  # a with statement will automatically close the file
        text = file.read()  # renamed this variable too
    wordRegex = re.compile(r"ADJECTIVE|VERB|ADVERB|NOUN")
    modified_text = wordRegex.sub(lambda match: input(f'Please Enter A {match.group()}: '),
                                  text)  # all the substitutions happen in this one call
    print(modified_text)
The lambda in the call to wordRegex.sub is equivalent to this named function:
def func(match):
    return input(f'Please Enter A {match.group()}: ')
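With the named function, the whole call would simply be wordRegex.sub(func, text).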

Parsing a huge dictionary file with python. Simple task I can't get my head around

I just got a giant 1.4m-line dictionary for other programming uses, and I'm sad to see Notepad++ is not powerful enough to do the parsing job. The dictionary contains three types of lines:
<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>
and I want to extract every word of it into a list of words without duplicates. Let's start with my code.
f = open('dic.txt')
p = open('parsed_dic.txt', 'r+')
lines = f.readlines()
for line in lines:
    # <ar><k> lines
    # <kref> lines
    # lines ending with ";"
    for word in listofwordsfromaline:
        p.write(word, "\n")
f.close()
p.close()
I'm not particularly asking you how to do this whole thing, but anything would be helpful. A link to a tutorial or one type of line parsing method would be highly appreciated.
For the first two cases you can see that any word starts and ends with a specific tag. If we look closely, we can say that every word has a ">-" string preceding it and a "</" string following it:
# First and second cases
start = line.find(">-") + 1  # include the leading "-"
end = line.find("</")        # stop just before the closing tag
required_word = line[start:end]
In the last case you can use the split method:
word_list = line.split(";")
ans = []
for word in word_list:
    start = word.find("-")
    ans.append(word[start:])
ans = set(ans)
First find what defines a word for you.
Make a regular expression to capture those matches. For example, the word-boundary token '\b' will match word boundaries (non-word characters): https://docs.python.org/2/howto/regex.html
If the word definition in each type of line is different, use if statements to match the line type first, then the corresponding regular expression for the word, and so on, capturing the word itself with match groups.
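A rough sketch of that per-line-type approach, assuming the three line formats from the question (the patterns and the extract_words helper are my own illustration, not from the answer):

import re

# <ar><k>...</k> and <kref>...</kref> lines: the word sits inside the tags
TAGGED_RE = re.compile(r"<k>(.*?)</k>|<kref>(.*?)</kref>")
# inflection lines like "yks.nom. -aaltoinen; yks.gen. -aaltoisen; ...":
# each word starts with a dash
INFLECTION_RE = re.compile(r"(-\w+)")

def extract_words(line):
    if "<k>" in line or "<kref>" in line:
        # each match fills only one of the two groups; drop the empty one
        return [g for m in TAGGED_RE.finditer(line) for g in m.groups() if g]
    return INFLECTION_RE.findall(line)

words = set()  # a set removes duplicates automatically
with open('dic.txt', encoding='utf-8') as f:
    for line in f:
        words.update(extract_words(line))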

file.replace('abcd') also replaces 'abcde'. How do I only replace the exact value?

def censor2(filename):
    infile = open(filename, 'r')
    contents = infile.read()
    contentlist = contents.split()
    print(contents)
    print(contentlist)
    for letter in contentlist:
        if len(letter) == 4:
            print(letter)
            contents = contents.replace(letter, 'xxxx')
    outfile = open('censor.txt', 'w')
    outfile.write(contents)
    infile.close()
    outfile.close()
This code works in Python. It accepts a file 'example.txt', reads it, and loops through it replacing all 4-letter words with the string 'xxxx', outputting the result (keeping the original format!) into a new file called censor.txt.
I used the replace function to find the words to be replaced. However, the word 'abcd' is replaced, and the next word 'abcde' is turned into 'xxxxe'.
How do I prevent 'abcde' from being changed?
I could not get the below examples to work, but after working with the re.sub function I found that the following code works to replace only 4-letter words and not 5-letter words:
contents = re.sub(r"(\b)\w{4}(\b)", r"\1xxxx\2", contents)
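A quick check of that pattern on the example words (my own demonstration):

import re

print(re.sub(r"(\b)\w{4}(\b)", r"\1xxxx\2", "abcd abcde"))  # prints 'xxxx abcde'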
How about:
re.sub(r'\babcd\b', '', my_text)
This will require the match to have word boundaries on either side.
This is where regular expressions can be helpful. You would want something like this:
import re
...
contents = re.sub(r'\babcd\b', 'xxxx', contents)
...
The \b is the "word boundary" marker. It matches the change from a word to whitespace characters, punctuation, etc.
You'll need the r'' style string for the regex pattern so that the backslashes are not treated as escape characters.
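As a quick illustration of why the raw string matters (my own example): without it, '\b' is the ASCII backspace character rather than a word boundary, and the pattern silently fails to match:

import re

print(len('\b'))   # 1 - '\b' is a single backspace character
print(len(r'\b'))  # 2 - a backslash followed by 'b', which is what re expects

print(re.sub('\babcd\b', 'xxxx', 'abcd abcde'))   # abcd abcde (no change)
print(re.sub(r'\babcd\b', 'xxxx', 'abcd abcde'))  # xxxx abcde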
