Extract phrase count from text files based on a keyword - python

I have a set of text files with blurbs of text, and I need to search them for a particular keyword so that a set of words before and/or after the keyword (i.e. phrases) is returned, along with a count of each phrase across the files. For example, the contents of a few of the files are:
File 1: This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!
File 2: Having a beautiful green park close to your house is great.
File 3: I visited a green park today. My friend also visited a green park today.
So if I search for the keyword park, I'm looking for the output to be a set of phrases (let's say one word before & after park), ranked based on how many times the phrase occurs across files. So in this example, the output should be:
green park today: 2
green park close: 1
Is there a way I can achieve this in Python, maybe using some NLP libraries or even without them? I have some code in my post here, but it doesn't solve the purpose (I'll perhaps delete that post once I get a response to this one).
Thank you

Based on your expected output above, it looks like you only want to add one to the count for a given phrase per file (even if it appears several times in the same file). Below is an example of how you can do this without any special NLP libraries, just defining "words" as runs of non-space characters delimited by spaces (I'm assuming you know how to read text from a file, so I'm leaving that part out).
from collections import Counter
str1 = "This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!"
str2 = "Having a beautiful green park close to your house is great."
str3 = "I visited a green park today. My friend also visited a green park today."
str1_words = ["START"] + str1.split(" ") + ["END"]
str2_words = ["START"] + str2.split(" ") + ["END"]
str3_words = ["START"] + str3.split(" ") + ["END"]
print(str1_words)
all_phrases = []
SEARCH_WORD = "park"
for words in [str1_words, str2_words, str3_words]:
    phrases = []
    for i in range(1, len(words) - 1):
        if words[i] == SEARCH_WORD:
            phrases.append(" ".join(words[i-1:i+2]))
    # Only count each phrase once for this text
    phrases = set(phrases)
    all_phrases.extend(phrases)

phrase_count = Counter(all_phrases)
print(phrase_count.most_common())
The output is:
[('green park today', 1), ('green park close', 1), ('green park today.', 1)]
This perfectly demonstrates the problem with the definition of a "word" above - punctuation is treated as part of the word. For a better way to do it, look into the NLTK library, specifically methods for "word tokenization".
Hopefully the above gives you an idea of how to get started on this.
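For instance, here is a minimal sketch of the same approach using NLTK's word_tokenize so punctuation no longer sticks to the words (it assumes NLTK and its tokenizer data are installed, and reuses str1, str2 and str3 from above):
from collections import Counter
from nltk.tokenize import word_tokenize  # may require nltk.download('punkt')

SEARCH_WORD = "park"
all_phrases = []
for text in [str1, str2, str3]:
    # Tokenize, then drop punctuation-only tokens
    words = ["START"] + [w for w in word_tokenize(text) if w.isalnum()] + ["END"]
    # Only count each phrase once per text
    phrases = {" ".join(words[i-1:i+2]) for i in range(1, len(words) - 1) if words[i] == SEARCH_WORD}
    all_phrases.extend(phrases)

print(Counter(all_phrases).most_common())
# Expected: [('green park today', 2), ('green park close', 1)]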

Related

How to count the number of words in the list, provided there is more than one?

For example, I have a text:
text = '''
Wales’ greatest moment. Lille is so close to the Belgian border, this was
essentially a home game for one of the tournament favourites. Their confident supporters mingled
with their new Welsh fans on the streets, buying into the carnival spirit - perhaps
more relaxed than some might have been before a quarter-final because they
thought this was their time.
In the driving rain, Wales produced the best performance in their history to carry
the nation into uncharted territory. Nobody could quite believe it.
'''
I need to get the number of occurrences of a word in this text; the word is entered with input().
Using a list, dict, or set is a required condition.
It is also not clear how to deal with the punctuation marks.
My solution, but perhaps there is a cleaner way:
text = list(text.split(' '))
word = input('Enter a word: ')
for i in text:
    if text.count(word) < 2:
        break
    if word in text:
        print(f'{word} - {text.count(word)}')
        break
Output:
this - 2
the - 7
The word 'moment' occurs only once in the text, so we do not output it.
You can think of this as two steps:
Clean the input
Find the count
A fast way to clean the input is to strip all the punctuation first using translate combined with string.punctuation:
import string
clean = text.translate(str.maketrans('', '', string.punctuation)).split()
Now you have all the text with no punctuation and can split it into words and count:
import string
clean = text.translate(str.maketrans('', '', string.punctuation)).split()
word = "this"
count = clean.count(word)
if count > 1:
    print(f'{word} - {count}')
    # prints: this - 2
Since you are using count you don't need to loop. Just be careful not to call count more times than you need to: each call scans the whole list. Notice that the code above calls it once and saves the result so the count can be used in multiple places.
You can use collections.Counter() to get a dictionary of the number of occurrences of each element in a list:
import collections
text = '''
Wales’ greatest moment. Lille is so close to the Belgian border, this was
essentially a home game for one of the tournament favourites. Their confident supporters mingled
with their new Welsh fans on the streets, buying into the carnival spirit - perhaps
more relaxed than some might have been before a quarter-final because they
thought this was their time.
In the driving rain, Wales produced the best performance in their history to carry
the nation into uncharted territory. Nobody could quite believe it.
'''
word = input('Enter a word: ')
# Remove punctuation from text (keep whitespace so words on separate lines stay separate)
for char in text:
    if char.lower() not in "abcdefghijklmnopqrstuvwxyz \n":
        text = text.replace(char, "")
wordcount = collections.Counter(text.split())
print(f"{word} - {wordcount[word]}")

How to extract string that contains specific characters in Python

I'm trying to extract ONLY one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead, you can iterate over the lines and over each word, check whether it starts with '$', and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
This assumes there is only one occurrence, like you said.
Alternatively, you could use a regex:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
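For reference, a capture group avoids the replace step. A minimal sketch, assuming the listing text from the question is stored in a variable named text:
import re

match = re.search(r'\$(\d[\d,]*)', text)  # first '$' followed immediately by digits
if match:
    print(match.group(1))  # 2000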

Iterate over a text and find the distance between predefined substrings

I decided I wanted to take a text and find how close some labels were in the text. Basically, the idea is to check if two persons are less than 14 words apart and if they are we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi, w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi+1, wi+x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi], ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure of the best way to proceed. Any hints are welcome.
You could first preprocess your text so that all the names in text are replaced with single-word ids. The ids would have to be strings that you would not expect to appear as other words in the text. As you preprocess the text, you could keep a mapping of ids to names to know which name corresponds to which id. This would allow to keep your current algorithm as is.
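A minimal sketch of that idea, reusing text and involved from the question (the PERSON<n> id format is just an assumption; any token that cannot occur as a normal word will do):
ids_to_names = {}
processed = text
for idx, name in enumerate(involved):
    token = f"PERSON{idx}"  # single-word id for this name
    ids_to_names[token] = name
    processed = processed.replace(name, token)

# The original word-based loop now handles multi-word names as well
ws = processed.split()
x = 14
for wi, w in enumerate(ws):
    if w not in ids_to_names:
        continue
    for i in range(wi + 1, min(wi + x, len(ws))):
        if ws[i] in ids_to_names:
            print(ids_to_names[w], ids_to_names[ws[i]])
If some names are prefixes of other names, replace the longer names first so an earlier substitution doesn't split them.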

Error: match word in file

There are two sentences in "test_tweet1.txt"
#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
#mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt"
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H
My code:
import re
popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print "\n", count1, line
    ltext1 = line.split(" ")
    for i, text in enumerate(ltext1):
        if text in rpopular_person:
            print text
    text2 = ' '.join(ltext1)
Results from the code showed:
1 #francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the
2 #mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga
I tried to match words from "test_tweet1.txt" with "Personal.txt".
Expected result:
Tony
Romo
Any suggestion?
Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask things like 'to' in rpopular_person, you get a value of True, because the characters 't', 'o' appear in sequence. I am assuming that the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noticing that I'm calling split(), with no arguments. This splits on all whitespace--newlines, tabs, and so on. This way you get everything separately like you intend. Or, if you don't actually want this (since this will give you results of "The" all the time based on your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
You're not getting "Romo" because that's not a word in your tweet. The word in your tweet is "Romo." with a period. This is very likely to remain a problem for you, so what I would do is rearrange your logic (assuming speed isn't an issue). Rather than looking at each word in your tweet, look at each name in your Personal.txt file, and see if it's in your full tweet. This way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    people = p.read().split('\n') # Get full names rather than partial names

with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print person
You need to split rpopular_person to get it to match words instead of substrings:
rpopular_person = open('C:/Users/Personal.txt').read().split()
this gives:
Tony
The
The reason Romo isn't showing up is that after your line split you have "Romo." with the trailing period. Maybe you should look for rpopular_person in the lines, instead of the other way around. Maybe something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")
array = []
for count1, line in enumerate(file1):
    print "\n", count1, line
    for person in popular_person:
        if person in line:
            print person

Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
    abstract: "...long abstract here...",
    highlights: [
        {
            concept: 'a word',
            start: 1,
            end: 10
        },
        {
            concept: 'cancer',
            start: 123,
            end: 135
        }
    ]
}
I am looping over each highlight, locating its start index in the abstract (the end doesn't really matter, as I just need a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option, but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it. Or should I somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?
You are right, the NLTK tokenizer is really what you should be using in this situation since it is robust enough to delimit nearly all sentences, including sentences ending with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100,169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
    for index in highlightIndices:
        if 0 < index - subtractFromIndex < len(sentence):
            sentencesWithHighlights.append(sentence)
            break
    subtractFromIndex += len(sentence)
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).
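In Python, that could look something like this sketch (paragraph and highlights are assumed to be the same as in the answer above):
import re

group = "|".join(re.escape(h) for h in highlights)
pattern = r"(?:[.!?]|^)\s*([^.!?]*(?:" + group + r")[^.!?]*?)(?=\s*[.!?])"
# Each captured group is a sentence containing at least one highlight
print(re.findall(pattern, paragraph, flags=re.IGNORECASE))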
Another option (though it's tough to say how reliable it would be with variably defined text) would be to split the text into a list of sentences and test against them:
re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
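And a sketch of the "test against them" step, assuming text holds the abstract and highlights is the list of terms from above:
import re

sentences = re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
matched = [s for s in sentences if any(h in s for h in highlights)]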
