Python: most frequent sentences in large text

My goal is to extract the x most frequent sentences that include the word y. My solution currently works this way:
1. Extract all sentences that contain the word y.
2. Count the most frequent words in these sentences and store them in a list.
3. Extract the sentences that include z of the words from that list.
4. To end up with exactly x sentences, I simply increase or decrease z.
So I get the most frequent sentences by looking for the most frequent words. This method works on a small amount of data, but it takes forever on a larger amount of data.
EDIT - code
from collections import Counter

# Extract all sentences from data containing the word
def getSentences(word):
    sentences = []
    for x in data_lemmatized:
        if word in x:
            sentences.append(x)
    return sentences

# Get the most frequent words from all the sentences
def getSentenceWords(sentences):
    cnt = Counter()
    for x in sentences:
        for y in x:
            cnt[y] += 1
    words = []
    for x, y in cnt.most_common(30):
        if x not in exclude and x != ' ':
            words.append(x)
    return words

# Get sentences that contain as many of the frequent words as possible
def countWordsHelp(allSentences, words, amountWords):
    tempList = []
    for sentence in allSentences:
        temp = len(words[:amountWords])
        count = 0
        for word in words[:amountWords]:
            if word in sentence:
                count += 1
        if count == temp:
            tempList.append(sentence)
    return tempList

def countWords(allSentences, words, nrSentences):
    tempList = []
    prevList = []
    amountWords = 1
    tempList = countWordsHelp(allSentences, words, amountWords)
    while len(tempList) > nrSentences:
        amountWords += 1
        newAllSentences = tempList
        prevList = tempList
        tempList = countWordsHelp(newAllSentences, words, amountWords)
        if len(tempList) < nrSentences:
            return prevList[:nrSentences]
    return tempList

if __name__ == '__main__':
    for x in terms:
        for y in x:
            allSentences = getSentences(y)
            words = getSentenceWords(allSentences)
            test = countWords(allSentences, words, nrSentences)
            allTest.append(test)
terms is a list of lists, each containing 10 words; data_lemmatized is the large data set in lemmatized form (one list of tokens per sentence).
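For reference, here is a minimal illustration of the shapes the code above assumes. The literals are made-up stand-ins; the real values come from the poster's lemmatization pipeline.
terms = [["price", "cost", "value", "market", "trade",
          "sell", "buy", "offer", "deal", "rate"]]        # one inner list of 10 words per topic
data_lemmatized = [["the", "price", "rise", "today"],     # one list of lemmatized tokens per sentence
                   ["market", "price", "fall", "sharply"]]
exclude = {"the", "a"}    # words to ignore when collecting frequent words
nrSentences = 5           # how many sentences to keep
allTest = []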

Related

How can I iterate over a file looking for keywords defined within a list

I have a defined list of keywords and a text file. I would like to search the text file and count how many times each of the keywords in my list appears. Example:
kw = ['max speed', 'time', 'distance', 'travel', 'down', 'up']

with open("file.txt", "r") as f:
    data_file = f.read()

d = dict()
for line in data_file:
    line = line.strip()
    line = line.lower()
    words = line.split(" ")
    for word in words:
        if word in d:
            d[word] = d[word] + 1
        else:
            d[word] = 1

for key in list(d.keys()):
    print(key, ":", d[key])
Now let's say we run the code: it should search file.txt and loop through the list. If it finds a keyword from the list, it should print that word and how many times it was found. If a keyword is not found, it shouldn't be reported.
Example output:
Keywords Found:
max speed: 3
travel: 7
distance: 3
Can't quite get this to work like I want. Any feedback would be great! Thank you in advance!
There are several algorithms you can use; there are specialized algorithms for finding specific words in texts. The simplest is the naive algorithm; here is code that I wrote:
def naive_string_matching(text, pattern):
    txt_len, pat_len = len(text), len(pattern)
    result = []
    for s in range(txt_len - pat_len + 1):
        if pattern == text[s:s+pat_len]:
            result.append(s)
    return result
This naive algorithm takes as input a text and a single pattern to search for, and returns the start indices of all occurrences. Its complexity is O((n − m + 1)·m), where m is the length of the pattern and n is the length of the text.
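For example, a quick check of the function above:
print(naive_string_matching("ababcabc", "abc"))  # [2, 5]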
The next algorithm you can use, with better matching complexity than the naive algorithm, is the finite automaton algorithm; you can read more about it if you are interested. Here is my implementation of it:
def transition_table(pattern):
    alphabet = set(pattern)
    ptt_len = len(pattern)
    result = []
    for q in range(ptt_len + 1):
        result.append({})
        for l in alphabet:
            k = min(len(pattern), q + 1)
            while True:
                if k == 0 or pattern[:k] == (pattern[:q] + l)[-k:]:
                    break
                k -= 1
            result[q][l] = k
    return result
def fa_string_matching(text, pattern):
    q = 0
    delta = transition_table(pattern)
    txt_len = len(text)
    result = []
    for s in range(txt_len):
        if text[s] in delta[q]:
            q = delta[q][text[s]]
            if q == len(delta) - 1:
                result.append(s + 1 - q)
        else:
            q = 0
    return result
The matching phase of this algorithm runs in O(n), but building the transition table (the transition_table function) adds pre-processing time that grows with the pattern length and the alphabet size (this straightforward construction is roughly O(m³·|Σ|); optimized constructions achieve O(m·|Σ|)), where again n is the length of the text and m the length of the pattern.
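A quick sanity check of the automaton-based matcher on the same example:
print(fa_string_matching("ababcabc", "abc"))  # [2, 5]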
The last algorithm I can propose is the KMP (Knuth–Morris–Pratt) algorithm, which is the fastest of the three. Again, my implementation:
def prefix_function(pattern):
    pat_len = len(pattern)
    pi = [0]
    k = 0
    for q in range(1, pat_len):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k-1]
        if pattern[k] == pattern[q]:
            k += 1
        pi.append(k)
    return pi
def kmp_string_matching(text, pattern):
    txt_len, pat_len = len(text), len(pattern)
    pi = prefix_function(pattern)
    q = 0
    result = []
    for i in range(txt_len):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q-1]
        if pattern[q] == text[i]:
            q += 1
        if q == pat_len:
            result.append(i - q + 1)
            q = pi[q-1]
    return result
As input it takes the full text and the pattern you are looking for. The matching complexity of the KMP algorithm is the same as that of the finite automaton algorithm, O(n), but its pre-processing (prefix_function) is faster, running in O(m).
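For example, the failure table KMP builds and the same test string again:
print(prefix_function("ababaca"))              # [0, 0, 1, 2, 3, 0, 1]
print(kmp_string_matching("ababcabc", "abc"))  # [2, 5]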
If you are interested in topics like pattern matching or finding the occurrences of a pattern in a text, I highly recommend becoming acquainted with all of them.
To open a file you can simply run:
with open(file_name) as file:
    text = file.read()
    result = naive_string_matching(text, pattern)
where file_name is the name of your file and pattern is the phrase you want to search for in the text. To search for several patterns from a list you can try:
example_patterns = ['max speed', 'time', 'distance', 'travel', 'down', 'up']

with open(file_name) as file:
    text = file.read()
    for pattern in example_patterns:
        result = kmp_string_matching(text, pattern)
import re

keywords = ['max speed', 'time', 'distance', 'travel', 'down', 'up']
keywords = [x.replace(' ', r'\s') for x in keywords]  # replaces spaces with a whitespace pattern

with open('file.txt', 'r') as file:
    data = file.read()

keywords_found = {}
for key in keywords:
    found = re.findall(key, data, re.I)  # re.I means it'll ignore case
    if found:
        keywords_found[key] = len(found)

print(keywords_found)
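As a side note (my own addition, not part of the answer above): if the keywords could contain regex metacharacters, escaping each word keeps the match literal while still allowing flexible whitespace between words:
patterns = [r'\s+'.join(re.escape(part) for part in key.split())
            for key in ['max speed', 'time', 'distance', 'travel', 'down', 'up']]
counts = {p: len(re.findall(p, data, re.I)) for p in patterns}
print(counts)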

Count of sub-strings that contain character X at least once. E.g. input: str = “abcd”, X = ‘b’; output: 6

This question was asked in an exam, but my code (given below) passed just 2 out of 7 cases.
Input format: a single line of input separated by a comma
Input: str = “abcd,b”
Output: 6
“ab”, “abc”, “abcd”, “b”, “bc” and “bcd” are the required sub-strings.
def slicing(s, k, n):
    loop_value = n - k + 1
    res = []
    for i in range(loop_value):
        res.append(s[i: i + k])
    return res

x, y = input().split(',')
n = len(x)
res1 = []
for i in range(1, n + 1):
    res1 += slicing(x, i, n)

count = 0
for ele in res1:
    if y in ele:
        count += 1
print(count)
When the target string (ts) is found in the string S, you can compute the number of substrings containing that instance by multiplying the number of characters before the target by the number of characters after it (plus one on each side). For "abcd" with target "b" there is 1 character before and 2 after, giving (1+1)*(2+1) = 6.
This covers all substrings that contain this instance of the target string, leaving only the "after" part to analyse further, which you can do recursively.
def countsubs(S, ts):
    if ts not in S: return 0                      # shorter or no match
    before, after = S.split(ts, 1)                # split on target
    result = (len(before)+1) * (len(after)+1)     # count for this instance
    return result + countsubs(ts[1:]+after, ts)   # recurse with right side

print(countsubs("abcd", "b"))  # 6
This works for single-character and multi-character targets and runs much faster than checking all the substrings one by one.
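For instance, with a multi-character target whose occurrences overlap (my own check of the function above):
print(countsubs("ababa", "aba"))  # 5: "aba", "abab", "ababa", "baba", and the "aba" starting at index 2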
Here is a simple solution without recursion:
def my_function(s):
    l, target = s.split(',')
    result = []
    for i in range(len(l)):
        for j in range(i+1, len(l)+1):
            ss = l[i] + l[i+1:j]
            if target in ss:
                result.append(ss)
    return f'count = {len(result)}, substrings = {result}'

print(my_function("abcd,b"))
# count = 6, substrings = ['ab', 'abc', 'abcd', 'b', 'bc', 'bcd']
Here you go, this should help
from itertools import combinations

output = []
initial = input('Enter string and needed letter separated by commas: ')  # asking for input
# Split the input into two parts: the actual text and the letter we want common in the output.
list1 = initial.split(',')
text = list1[0]
letter = list1[1]
# This is the core part of the code: this statement builds all the available combinations
# of the letters (all the way from 1-letter combinations to n-letter ones).
final = [''.join(l) for i in range(len(text)) for l in combinations(text, i+1)]
for i in final:
    if letter in i:
        # Only keep the results which have the required letter/phrase in them.
        output.append(i)

NLTK wordnet calculating path similarity of words in two lists

I'm trying to find the similarity of words in a text file. I have attached the code below, where I read from a text file and split the contents into two lists, but now I would like to compare the words in list 1 to the words in list 2.
file = open(r'M:\ThirdYear\CE314\Assignment2\sim_data\Assignment_Additional.txt', 'r')
word1 = []
word2 = []
split = [line.strip() for line in file]
count = 0
for line in split:
    if count == (len(split) - 1):
        break
    else:
        word1.append(line.split('\t')[0])
        word2.append(line.split('\t')[1])
        count = count + 1

print(word1)
print(word2)

for x, y in zip(word1, word2):
    w1 = wordnet.synset(x + '.n.1')
    w2 = wordnet.synset(y + '.n.1')
    print(w1.path_similarity(w2))
I want to iterate through both lists and print their path_similarity, but only when the lookup wordnet.synset(x + '.n.1') is valid; any word that does not have an '.n.1' synset I want to ignore and skip, but I'm not entirely sure how to make this check in Python.
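One way to make that check (an illustrative sketch, not part of the original post) is to catch the WordNetError that NLTK raises when a synset name does not exist:
from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

for x, y in zip(word1, word2):
    try:
        w1 = wordnet.synset(x + '.n.1')
        w2 = wordnet.synset(y + '.n.1')
    except WordNetError:
        continue  # skip pairs where either word has no first noun synset
    print(x, y, w1.path_similarity(w2))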

How to implement mapreduce pairs pattern in python

I am trying to implement the MapReduce pairs pattern in Python. I need to check if a word is in a text file, then find the word next to it and yield a pair of both words. I keep running into either:
neighbors = words[words.index(w) + 1]
ValueError: substring not found
or
ValueError: ("the") is not in list
File: cwork_trials.py
from mrjob.job import MRJob

class MRCountest(MRJob):
    # Word count
    def mapper(self, _, document):
        # Assume document is a list of words.
        #words = []
        words = document.strip()
        w = "the"
        neighbors = words.index(w)
        for word in words:
            #searchword = "the"
            #wor.append(str(word))
            #neighbors = words[words.index(w) + 1]
            yield(w, 1)

    def reducer(self, w, values):
        yield(w, sum(values))

if __name__ == '__main__':
    MRCountest.run()
Edit:
I'm trying to use the pairs pattern to search a document for every instance of a specific word and then find the word next to it each time, yielding a pair for each instance, i.e. find every instance of "the" and the word next to it, such as ("the", "book"), ("the", "cat"), etc.
from mrjob.job import MRJob

class MRCountest(MRJob):
    # Word count
    def mapper(self, _, document):
        # Assume document is a list of words.
        #words = []
        words = document.split(" ")
        want = "the"
        for w, want in enumerate(words, 1):
            if (w+1) < len(words):
                neighbors = words[w + 1]
                pair = (want, neighbors)
                for u in neighbors:
                    if want is "the":
                        #pair = (want, neighbors)
                        yield(pair), 1
        #neighbors = words.index(w)
        #for word in words:
        #    searchword = "the"
        #    wor.append(str(word))
        #    neighbors = words[words.index(w) + 1]
        #    yield(w,1)

    #def reducer(self, w, values):
    #    yield(w, sum(values))

if __name__ == '__main__':
    MRCountest.run()
As it stands, I get yields of every word pair, with multiples of the same pairing.
When you use words.index("the") you will only get the first instance of "the" in your list or string, and, as you have found, you will get an error if "the" isn't present.
You also mention that you are trying to produce pairs, but you only yield a single word.
I think what you are trying to do is something more like this:
def get_word_pairs(words):
    for i, word in enumerate(words):
        if (i + 1) < len(words):
            yield (word, words[i + 1]), 1
        if (i - 1) >= 0:  # >= 0 so the second word also pairs with the first
            yield (word, words[i - 1]), 1
assuming you are interested in neighbours in both directions. (If not, you only need the first yield.)
Lastly, since you use document.strip(), I suspect that document is in fact a string and not a list. If that's the case, you can use words = document.split(" ") to get the word list, assuming you don't have any punctuation.
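Putting it together, here is a minimal sketch (my own, with illustrative names) of how the mapper might use that helper inside an mrjob job, keeping only pairs whose first word is "the" as in the question:
from mrjob.job import MRJob

class MRThePairs(MRJob):
    def mapper(self, _, line):
        words = line.split()                     # split the input line into words
        for pair, one in get_word_pairs(words):  # get_word_pairs as defined above
            if pair[0] == "the":                 # keep only pairs starting with "the"
                yield pair, one

    def reducer(self, pair, counts):
        yield pair, sum(counts)                  # total occurrences of each pair

if __name__ == '__main__':
    MRThePairs.run()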

Function won't work when using a list created from a file

I am trying to create a list of words from a file as it is read, and then delete all words that contain duplicate letters. I was able to do it successfully with a list of words that I entered manually; however, when I try to use the function on the list created from a file, the result still includes words with duplicates.
This works:
words = ['word','worrd','worrrrd','wordd']
alpha = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
x = 0
while x in range(0, len(alpha)):
    i = 0
    while i in range(0, len(words)):
        if words[i].count(alpha[x]) > 1:
            del(words[i])
            i = i - 1
        else:
            i = i + 1
    x = x + 1
print(words)
This is how I'm trying to do it when reading the file:
words = []
length = 5
file = open('dictionary.txt')
for word in file:
    if len(word) == length+1:
        words.append(word.splitlines())
alpha = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
x = 0
while x in range(0, len(alpha)):
    i = 0
    while i in range(0, len(words)):
        if words[i].count(alpha[x]) > 1:
            del(words[i])
            i = i - 1
        else:
            i = i + 1
    x = x + 1
print(words)
Try something like this. First, the string module is not quite deprecated, but it's unpopular; lucky for you, it defines some useful constants that save you a bunch of typing, so you don't have to write out all those quotes and commas.
Next, use the with open('filespec') as ... context for reading files: it's what it was put there for!
Finally, be aware of how iteration works for text files: for line in file: reads lines, including any trailing newlines, so strip those off. If you don't have one word per line, you'll have to split the lines after you read them.
# Read words (possibly more than one per line) from dictionary.txt into Lexicon[].
# Convert the words to lower case.
import string

Lexicon = []
with open('dictionary.txt') as file:
    for line in file:
        words = line.strip().lower().split()
        Lexicon.extend(words)

# Drop every word that contains any letter more than once.
for ch in string.ascii_lowercase:
    i = 0
    while i < len(Lexicon):
        if Lexicon[i].count(ch) > 1:
            del Lexicon[i]   # the next word shifts down to index i
        else:
            i += 1

print('\n'.join(Lexicon))
How about this:
# This more comprehensive sample allows me to reproduce the file-reading
# problem in the script itself (before I changed the code "tee" would
# print, for instance).
words = ['green','word','glass','worrd','door','tee','wordd']

outlist = []
for word in words:
    chars = [c for c in word]
    # A `set` only contains unique characters, so if it is shorter than the
    # `word` itself, we found a word with duplicate characters and keep looping.
    if len(set(chars)) < len(chars):
        continue
    else:
        outlist.append(word)
print(outlist)
Result:
['word']
import string
words = ['word','worrd','worrrrd','wordd','5word']
new_words = [x for x in words if len(x) == len(set(x)) if all(i in string.ascii_letters for i in x)]
print(new_words)
