Having trouble with two of my functions for text analysis - python

I'm having trouble finding the number of unique words in a speech text file (well, actually 3 files). I'm just going to give you my full code so there are no misunderstandings.
#This program will serve to analyze text files for the number of words in
#the text file, number of characters, sentences, unique words, and the longest
#word in the text file. This program will also provide the frequency of unique
#words. In particular, the text will be three political speeches which we will
#analyze, building on searching techniques in Python.
def main():
    harper = readFile("Harper's Speech.txt")
    newWords = cleanUpWords(harper)
    print(numCharacters(harper), "Characters.")
    print(numSentences(harper), "Sentences.")
    print(numWords(newWords), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))
    obama1 = readFile("Obama's 2009 Speech.txt")
    newWords = cleanUpWords(obama1)
    print(numCharacters(obama1), "Characters.")
    print(numSentences(obama1), "Sentences.")
    print(numWords(obama1), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))
    obama2 = readFile("Obama's 2008 Speech.txt")
    newWords = cleanUpWords(obama2)
    print(numCharacters(obama2), "Characters.")
    print(numSentences(obama2), "Sentences.")
    print(numWords(obama2), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))
def readFile(filename):
    '''Function that reads a text file, then prints the name of the file without
    '.txt'. The function returns the read file for main() to call, and prints
    the file's name so the user knows which file was read.'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.read()
    inFile1.close()
    print("\n", filename.replace(".txt", "") + ":")
    return fileContentsList
def numCharacters(file):
    '''Function returns the length of the READ file (not readlines, because that
    would only count the number of lines and the character count would be wrong),
    which will be the correct number of total characters in the text file.'''
    return len(file)
def numSentences(file):
    '''Function returns the occurrences of a period, exclamation point, or
    a question mark, thus counting the number of full sentences in the text file.'''
    return file.count(".") + file.count("!") + file.count("?")
def cleanUpWords(file):
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return onlyAlpha.replace("  ", " ")
def numWords(newWords):
    '''Function finds the number of words in the text file by returning
    the length of the cleaned-up version of words from cleanUpWords().'''
    return len(newWords.split())
def uniqueWords(newWords):
    unique = sorted(newWords.split())
    unique = set(unique)
    return str(len(unique))

def longestWord(file):
    max(file.split())
main()
So, my last two functions, uniqueWords and longestWord, will not work properly, or at least my output is wrong. For unique words, I'm supposed to get 527, but I'm actually getting 567 for some odd reason. Also, my longestWord function always prints None, no matter what I do. I've tried many ways to get the longest word; the above is just one of those ways, but all return None. Please help me with my two sad functions!

Try to do it this way:
def longestWord(file):
    return sorted(file.split(), key=len)[-1]
Or it would be even easier to do it in uniqueWords:
def uniqueWords(newWords):
    unique = set(newWords.split())
    return (str(len(unique)), max(unique, key=len))

info = uniqueWords("My name is Harper")
print("Unique words: " + info[0])
print("Longest word: " + info[1])
And you don't need sorted before set to get all the unique words, because a set is an unordered collection of unique elements.
Also look at cleanUpWords: if you have a string like "Hello I'm Harper. Harper I am.", after cleaning it up you will get the word "Im" among your unique words, because stripping the apostrophe from "I'm" creates a word that was never in the text. That is why your unique-word count comes out too high.
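To illustrate that point, here is a minimal sketch of a cleanUpWords that keeps apostrophes inside words (my variant, not the answerer's code; whether it brings the count down to the expected 527 depends on the actual speech files):
def cleanUpWords(file):
    # Treat hyphens and newlines as spaces, as before.
    words = file.replace("-", " ").replace("\n", " ")
    onlyAlpha = ""
    for i in words:
        # Keep letters, spaces, and apostrophes so "I'm" stays one word.
        if i.isalpha() or i in (" ", "'"):
            onlyAlpha += i
    return " ".join(onlyAlpha.split())  # collapse runs of spaces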

Related

Counting word occurrences from specific part of txt files using python3

I have a folder with a number of txt files.
I want to count the number of occurrences of a set of words in a certain part of each txt file and export the results to a new Excel file.
Specifically, I want to look for the occurrences of the words only in the part of the text that begins after the words "Company A" and ends at the words "Company B".
For example:
I want to look for the words "Corporation" and "Board" in the part of the following text between "Company A" and "Company B":
...the Board of Company A oversees the management of risks inherent in the operation of the Corporation businesses and the implementation of its strategic plan. The Board reviews the risks associated with the Corporation strategic plan at an annual strategic planning session and periodically throughout the year as part of its consideration of the strategic direction of Company B. In addition, the Board addresses the primary risks associated with...
I have managed to count the occurrences of the set of words, but over the whole txt file and not just the part from Company A up to Company B.
import os
import sys
import glob

for filename in glob.iglob('file path' + '**/*', recursive=True):
    def countWords(filename, list_words):
        try:
            reading = open(filename, "r+", encoding="utf-8")
            check = reading.readlines()
            reading.close()
            for each in list_words:
                lower = each.lower()
                count = 0
                for string in check:
                    word_check = string.split()
                    for word in word_check:
                        lowerword = word.lower()
                        line = lowerword.strip("!##$%^&*()_+?><:.,-'\\ ")
                        if lower == line:
                            count += 1
                print(lower, ":", count)
        except FileNotFoundError:
            print("This file doesn't exist.")
            for zero in list_words:
                if zero != "":
                    print(zero, ":", "0")
                else:
                    pass
    print('----')
    print(os.path.basename(filename))
    countWords(filename, ["Corporation", "Board"])
The final output for the example text should be like this:
txtfile1
Corporation: 2
Board: 1
And the above process should be replicated for all txt files in the folder and exported as an Excel file.
Thanks for the consideration and I apologize in advance for the length of the question.
You might try a regexp, assuming you want the whole string even if you see repetitions of "company a" before you see "company b":
re.findall('company a.*?company b', 'company a did some things in agreement with company b')
That will return a list of all the text substrings starting with "company a" and ending with "company b".
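Building on that, a rough sketch of the whole pipeline might look like the following (the glob pattern keeps the question's 'file path' placeholder; exporting as CSV, which Excel opens directly, is my simplification, since writing real .xlsx files needs an extra library):
import csv
import glob
import os
import re

def count_words_in_section(text, words):
    # Take only the span between "Company A" and "Company B" (case-insensitive).
    match = re.search(r'Company A(.*?)Company B', text, re.S | re.I)
    section = match.group(1) if match else ''
    # Count whole-word, case-insensitive occurrences of each target word.
    return {w: len(re.findall(r'\b' + re.escape(w) + r'\b', section, re.I))
            for w in words}

with open('results.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['file', 'Corporation', 'Board'])
    for filename in glob.iglob('file path' + '**/*.txt', recursive=True):
        with open(filename, encoding='utf-8') as f:
            counts = count_words_in_section(f.read(), ['Corporation', 'Board'])
        writer.writerow([os.path.basename(filename),
                         counts['Corporation'], counts['Board']])
For the example text this gives Corporation 2 and Board 1, matching the expected output.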

how to get the number of occurrences of an expression in a file using python

I have code that reads files, finds the expression matching the user input, and highlights it, using the findall function from the regular expression module.
I am also trying to save several pieces of information based on this matching in a JSON file, like:
file name
matching expression
number of occurrences
The problem is that the program reads the file and displays the text with the highlighted expression, but in the JSON file it saves the number of occurrences as the number of lines.
In this example the word "this" is the searched word; it exists in the text file twice.
The result in the JSON file is 12, which is the number of text lines.
[screenshot: result in the JSON file and the highlighted text]
Code:
# Assumed imports for this snippet (not shown in the original):
from collections import defaultdict
import json, os, re, time

def MatchFunc(self):
    self.textEdit_PDFpreview.clear()
    x = self.lineEditSearch.text()
    TextString = self.ReadingFileContent(self.FileListSelected())
    d = defaultdict(list)
    filename = os.path.basename(self.FileListSelected())
    RepX = '<u><b style="color:#FF0000">' + x + '</b></u>'
    for counter, myLine in enumerate(filename):
        self.textEdit_PDFpreview.clear()
        thematch = re.sub(x, RepX, TextString)
        thematchFilt = re.findall(x, TextString, re.M | re.I)
        if thematchFilt:
            d[thematchFilt[0]].append(counter + 1)
            self.textEdit_PDFpreview.insertHtml(str(thematch))
        else:
            self.textEdit_PDFpreview.insertHtml('No Match Found')
    OutPutListMetaData = []
    for match, positions in d.items():
        print("this is match {}".format(match))
        print("this is position {}".format(positions))
        listMetaData = {"File Name": filename, "Searched Word": match, "Number Of Occurence": len(positions)}
        OutPutListMetaData.append(listMetaData)
        for p in positions:
            print("on line {}".format(p))
    jsondata = json.dumps(OutPutListMetaData, indent=4)
    print(jsondata)
    folderToCreate = "search_result"
    today = time.strftime("%Y%m%d__%H-%M")
    jsonFileName = "{}_searchResult.json".format(today)
    if not (os.path.exists(os.getcwd() + os.sep + folderToCreate)):
        os.mkdir("./search_result")
    fpJ = os.path.join(os.getcwd() + os.sep + folderToCreate, jsonFileName)
    print(fpJ)
    with open(fpJ, "a") as jsf:
        jsf.write(jsondata)
    print("finish writing")
It's straightforward using Counter: once you pass it an iterable, it returns each element along with its number of occurrences, as tuples.
And since re.findall returns a list, you can just do len(result) to get the number of occurrences instead of counting lines.
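A minimal sketch of both ideas (the sample text and pattern are made up for illustration):
import re
from collections import Counter

text = "this is a line\nand this is another line"
pattern = "this"

matches = re.findall(pattern, text, re.M | re.I)
print(len(matches))      # 2 -- the number of occurrences, not the number of lines
print(Counter(matches))  # Counter({'this': 2})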

list index out of range python decompressing text

The code I currently have is shown below. What it does first is ask the user to input a sentence. The program then finds the position of each word in the sentence and also splits the sentence into a list to get the individual words. The program then gets rid of any repeated words, so the words in the list are unique. The program then proceeds to save (using json) the positions of the words in the sentence (e.g. 1,2,3,4,1,1,2,3,5) and the unique words to a separate file (which the user can name). The next part of the program tries to decompress the unique text from the separate file and recreate the original sentence from the positions of the words in the sentence and the unique words. I know this stage works, as I have tested it separately. However, when I run the program now, I keep getting this error message:
File "/Users/Sid/Desktop/Task3New.py", line 70, in OutputDecompressed
decompression.append(orgwords[i])
IndexError: list index out of range
I have no idea why this isn't working; can anyone help? All help appreciated, thanks.
import json
import os.path

def InputSentence():
    global sentence
    global words
    sentence = input("Enter a sentence: ")
    words = sentence.split(' ')

def Validation():
    if sentence == (""):
        print ("No sentence was inputted. \nPlease input a sentence...")
        Error()

def Uniquewords():
    print ("Words in the sentence: " + str(words))
    for i in range(len(words)):
        if words[i] not in unilist:
            unilist.append(words[i])
    print ("Unique words: " + str(unilist))

def PosText():
    global find
    global pos
    find = dict((sentence, words.index(sentence)+1) for sentence in list(words))
    pos = (list(map(lambda sentence: find[sentence], words)))
    return (pos)

def OutputText():
    print ("The positions of the word(s) in the sentence are: " + str(pos))

def SaveFile():
    filename = input("We are now going to save the contents of this program into a new file. \nWhat would you like to call the new file? ")
    newfile = open((filename)+'.txt', 'w')
    json.dump([unilist, pos], newfile)
    newfile.close

def InputFile():
    global compfilename
    compfilename = input("Please enter an existing compressed file to be decompressed: ")

def Validation2():
    if compfilename == (""):
        print ("Nothing was entered for the filename. Please re-enter a valid filename.")
        Error()
    if os.path.exists(filename + ".txt") == False:
        print ("No such file exists. Please enter a valid existing file.")
        Error()

def OutputDecompressed():
    newfile = open((compfilename)+'.txt', 'r')
    saveddata = json.load(newfile)
    orgpos = saveddata[1]
    orgwords = saveddata[0]
    print ("Unique words in the original sentence: " + str(orgwords) + "\nPosition of words in the sentence: " + str(orgpos))
    decompression = []
    prev = orgpos[0]
    x = 0
    #decomposing the index locations
    for cur in range(1, len(orgpos)):
        if (prev == orgpos[cur]): x += 1
        else:
            orgpos[cur] -= x
            x = 0
        prev = orgpos[cur]
    #Getting the output
    for i in orgpos:
        decompression.append(orgwords[i-1])
    finalsentence = (' '.join(decompression))
    print ("Original sentence from file: " + finalsentence)

def Error():
    MainCompression()

def MainCompression():
    global unilist
    unilist = []
    InputSentence()
    Uniquewords()
    PosText()
    OutputText()
    SaveFile()
    InputFile()
    Validation()
    OutputDecompressed()

MainCompression()
The problem is that you are using the indices from words as indices into unilist/orgwords.
Let's take a look at the problem:
def PosText():
    global find
    global pos
    find = dict((sentence, words.index(sentence)+1) for sentence in list(words))
    pos = (list(map(lambda sentence: find[sentence], words)))
    return (pos)
Here find maps every word to its position in the list words. (BTW why is the variable that iterates over words called sentence?) Then, for every word this position is stored in a new list. This process could be expressed in one line: pos = [words.index(word)+1 for word in words]
When you now look at OutputDecompressed, you see:
for i in orgpos:
    decompression.append(orgwords[i-1])
Here orgpos is pos and orgwords is the list of unique words. Now every stored index is used to get back the original words, but this is flawed: orgpos contains indices into words, even though they are used to access orgwords.
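To make the failure concrete, here is a toy example (mine, not from the question):
words = ["a", "a", "b"]                    # the split sentence
unilist = ["a", "b"]                       # its unique words
pos = [words.index(w) + 1 for w in words]  # what PosText computes: [1, 1, 3]
# Decompression then evaluates orgwords[3 - 1], i.e. unilist[2], but unilist
# only has indices 0 and 1 -> IndexError: list index out of range.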
The solution to this problem is to rewrite PosText and parts of OutputDecompressed:
def PosText():
    global pos
    pos = [unilist.index(word)+1 for word in words]
    return pos

def OutputDecompressed():
    newfile = open((compfilename)+'.txt', 'r')
    saveddata = json.load(newfile)
    orgpos = saveddata[1]
    orgwords = saveddata[0]
    print ("Unique words in the original sentence: " + str(orgwords) + "\nPosition of words in the sentence: " + str(orgpos))
    decompression = []
    # I could not figure out what this middle part was doing, so I left it out
    for i in orgpos:
        decompression.append(orgwords[i-1])
    finalsentence = (' '.join(decompression))
    print ("Original sentence from file: " + finalsentence)
Some comments on your code:
After InputSentence(), Validation() should be called to validate the input
After InputFile() you must call Validation2() and not Validation()
In Validation2() it should be compfilename and not filename
You should use parameters instead of global variables. This makes it clearer what the functions are supposed to do. For example, Uniquewords could accept the list of words and return the list of unique words (see the sketch below). It also makes the program much easier to debug by testing every function one by one, which is currently not possible.
To make it easier for other Python programmers to read your code, you could follow the Python coding style specified in PEP 8
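As an illustration of the parameters suggestion, here is a minimal sketch (my wording, not the original code) of Uniquewords rewritten to take and return values instead of touching globals:
def unique_words(words):
    # Collect each word the first time it appears, preserving order.
    unique = []
    for word in words:
        if word not in unique:
            unique.append(word)
    return unique

print(unique_words("the cat and the hat".split(' ')))  # ['the', 'cat', 'and', 'hat']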

Search in matrix - python

I need to write a function that will search for words in a matrix. For the moment I'm trying to search line by line to see if the word is there. This is my code:
def search(p):
    w = []
    for i in p:
        w.append(i)
    s = read_wordsearch()  #This is my matrix full of letters
    for line in s:
        l = []
        for letter in line:
            l.append(letter)
        if w == l:
            return True
        else:
            pass
This code works only if my word begins in the first position of a line.
For example, I have this matrix:
[['a','f','l','y'],['h','e','r','e'],['b','n','o','i']]
I want to find the word "fly" but can't, because my code only finds words like "here" or "her" that begin in the first position of a line...
Any form of help, hint, or advice would be appreciated. (And sorry if my English is bad...)
You can convert each line in the matrix to a string and try to find the search word in it.
def search(p):
    s = read_wordsearch()
    for line in s:
        if p in ''.join(line):
            return True
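Note that this version implicitly returns None (which is falsy, but not False) when the word is not found. A self-contained variant with an explicit result, taking the matrix as a parameter (my adaptation, not the answerer's code):
def search(p, matrix):
    # True if p occurs left-to-right in any row, else False.
    for line in matrix:
        if p in ''.join(line):
            return True
    return False

print(search('fly', [['a','f','l','y'], ['h','e','r','e'], ['b','n','o','i']]))  # True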
I'll give you a tip for searching within a text for a word. I think you will be able to extrapolate to your data matrix.
s = "xxxxxxxxxhiddenxxxxxxxxxxx"
target = "hidden"
for i in xrange(len(s)-len(target)):
if s[i:i+len(target)] == target:
print "Found it at index",i
break
If you want to search for words of any length, perhaps if you had a list of possible solutions:
s = "xxxxxxxxxhiddenxxxtreasurexxxxxxxx"
targets = ["hidden", "treasure"]
for i in xrange(len(s)):
    for j in xrange(i + 1, len(s) + 1):
        if s[i:j] in targets:
            print "Found", s[i:j], "at index", i
def search(p):
    w = ''.join(p)
    s = read_wordsearch()  #This is my matrix full of letters
    for line in s:
        word = ''.join(line)
        if word.find(w) >= 0:
            return True
    return False
Edit: there are already a lot of string functions available in Python; you just need to work with strings to be able to use them.
Join the characters in the inner lists to create a word and search within it with in.
def search(word, data):
    return any(word in ''.join(characters) for characters in data)

data = [['a','f','l','y'], ['h','e','r','e'], ['b','n','o','i']]
if search('fly', data):
    print('found')
data contains the matrix, and characters is the name of each individual inner list. any will stop after it has found the first match (it short-circuits).
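For example, reusing search and data from above:
print(search('fly', data))   # True: 'fly' is in the first row
print(search('xyz', data))   # False: no row contains it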

Creating a simple searching program

Decided to delete and ask again, it was just easier! Please do not vote down, as I have taken on board what people have been saying.
I have two nested dictionaries:
wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
The first dictionary links words to a file number and the number of times they appear in that file. The second contains searches, linking each word to the number of times it appears in the current search.
I want to extract certain values so that for each search I can calculate the scalar product between the number of times words appear in a file and the number of times they appear in the search, divided by their magnitudes, and then see which file is most similar to the current search, i.e. (word 1 appearances in search * word 1 appearances in file) + (word 2 appearances in search * word 2 appearances in file), etc. The function should then return a dictionary mapping each search to a list of file numbers, most similar first, least similar last.
Expected output is a dictionary:
{1:[4,3,1,2],2:[1,2,4,3]}
etc.
The key is the search number, the value is a list of files most relevant first.
(These may not actually be right.)
This is what I have:
def retrieve():
    results = {}
    for word in search:
        numberOfAppearances = wordFrequency.get(word).values()
        for appearances in numberOfAppearances:
            results[fileNumber] = numberOfAppearances.dot()
    return sorted(results.iteritems(), key=lambda (fileNumber, appearances): appearances, reverse=True)
Sorry, no, it just says wdir = and then the directory the .py file is in.
Edit
The entire Retrieve.py file:
from collections import Counter

def retrieve():
    wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
    search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results
I am using the Spyder GUI/IDE for Anaconda Python 2.7; I just press the green play button and the output is:
wdir='/Users/danny/Desktop'
Edit 2
In regards to the magnitude: for example, for search number 3 and file 1 it would be
sqrt(2^2 + 3^2 + 0^2) * sqrt(3^2 + 0^2 + 3^2)
Here is a start:
from collections import Counter

def retrieve():
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results

print retrieve()
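That ranks files by the raw dot product only. To also divide by the magnitudes, as described in Edit 2, one possible sketch (written for Python 3, so items() and print() instead of iteritems() and the print statement; the handling of a zero-magnitude file is my assumption):
import math
from collections import Counter

wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0},
                 'red': {1: 0, 2: 0, 3: 15, 4: 0},
                 'dog': {1: 3, 2: 0, 3: 4, 4: 5}}
search = {1: {'bit': 1}, 2: {'red': 1, 'dog': 1}, 3: {'bit': 2, 'red': 3}}

def retrieve():
    results = {}
    file_ids = {f for counts in wordFrequency.values() for f in counts}
    for search_number, words in search.items():
        # Magnitude of the search vector; words absent from the search count as 0.
        search_mag = math.sqrt(sum(n * n for n in words.values()))
        scores = Counter()
        for file_id in file_ids:
            dot = sum(n * wordFrequency[w].get(file_id, 0)
                      for w, n in words.items() if w in wordFrequency)
            # Magnitude of the file vector over all words, matching the Edit 2
            # example: for search 3 and file 1 this is sqrt(3^2 + 0^2 + 3^2).
            file_mag = math.sqrt(sum(counts.get(file_id, 0) ** 2
                                     for counts in wordFrequency.values()))
            scores[file_id] = dot / (search_mag * file_mag) if file_mag else 0.0
        results[search_number] = [f for f, score in scores.most_common()]
    return results

print(retrieve())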
