Creating a simple searching program - python

Decided to delete and ask again, it was just easier! Please do not vote down, as I have taken on board what people have been saying.
I have two nested dictionaries:-
wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
The first dictionary links each word to a file number and the number of times it appears in that file. The second contains searches, linking each word to the number of times it appears in the current search.
For each search I want to calculate the scalar product between the number of times words appear in a file and the number of times they appear in the search, divided by their magnitudes, then see which file is most similar to the current search, i.e. (word 1 appearances in search * word 1 appearances in file) + (word 2 appearances in search * word 2 appearances in file) etc. I then want to return a dictionary mapping each search to a list of file numbers, most similar first, least similar last.
Expected output is a dictionary:
{1:[4,3,1,2],2:[1,2,4,3]}
etc.
The key is the search number, the value is a list of files most relevant first.
(These may not actually be right.)
This is what I have:-
def retrieve():
    results = {}
    for word in search:
        numberOfAppearances = wordFrequency.get(word).values()
        for appearances in numberOfAppearances:
            results[fileNumber] = numberOfAppearances.dot()
    return sorted(results.iteritems(), key=lambda (fileNumber, appearances): appearances, reverse=True)
Sorry, no, it just says wdir = and then the directory the .py file is in.
Edit
The entire Retrieve.py file:
from collections import Counter
def retrieve():
    wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
    search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results
I am using the Spyder GUI / IDE for Anaconda Python 2.7; I just press the green play button and the only output is:
wdir='/Users/danny/Desktop'
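(If that wdir line is the only output, a likely explanation is that Retrieve.py defines retrieve() but never calls it, so nothing gets printed. A minimal, hypothetical addition at the bottom of the file would be:)
if __name__ == '__main__':
    print retrieve()  # Python 2 print statement, matching the rest of the file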
Edit 2
In regard to the magnitude: for example, for search number 3 and file 1 it would be:
sqrt(2^2 + 3^2 + 0^2) * sqrt(3^2 + 0^2 + 3^2)

Here is a start:
from collections import Counter

def retrieve():
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results

print retrieve()
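The version above ranks purely by the raw dot product. One possible way to fold in the magnitude normalisation from Edit 2 is sketched below; this is only an illustration, keeping the same Python 2 idioms and assuming wordFrequency and search are available as above.
import math
from collections import Counter

def retrieve_normalised():
    # Sketch only: same data as above, but each dot product is divided by the
    # product of the two vector magnitudes (cosine similarity), per Edit 2.
    results = {}
    for search_number, words in search.iteritems():
        search_mag = math.sqrt(sum(n * n for n in words.values()))
        scores = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                scores[file_id] += num_appearances * appear_in_file
        for file_id in scores:
            file_mag = math.sqrt(sum(counts.get(file_id, 0) ** 2
                                     for counts in wordFrequency.values()))
            scores[file_id] = scores[file_id] / (search_mag * file_mag or 1.0)
        results[search_number] = [f for f, s in scores.most_common()]
    return results
For search 3 and file 1 this sketch gives 6 / (sqrt(13) * sqrt(18)) ≈ 0.39, i.e. the dot product divided by the two magnitudes quoted in Edit 2.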

Related

How can I pull out text snippets around specific words?

I have a large txt file and I'm trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above.
def occurs(word1, word2, filename):
    import os
    infile = open(filename, 'r')   # opens the file and splits it into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]      # this list allows for multiple words
    wordsString = ''.join(lines)   # joins the lines back into one string
    words = wordsString.split()    # splits the string into individual words
    f = open(filename, 'w')
    f.write("start")
    f.write(os.linesep)
    for word in wordlist:
        matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
        for m in matches:
            l = " ".join(words[m-15:m+16])
            f.write(f"...{l}...")  # writes the data to the external file
            f.write(os.linesep)
    f.close()
So far, when two of the same word are too close together, the program just doesn't run on one of them. Instead, I want to get out a longer chunk of text that extends 15 words behind the earliest occurrence and 15 words beyond the latest one.
This snippet will collect the given number of words around the chosen keyword. If several keywords occur close together, it joins their snippets:
s = '''xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above. xxx'''

words = s.split()

from itertools import groupby, chain

word = 'xxx'

def get_snippets(words, word, l):
    snippets, current_snippet, cnt = [], [], 0
    for v, g in groupby(words, lambda w: w != word):
        w = [*g]
        if v:
            if len(w) < l:
                current_snippet += [w]
            else:
                current_snippet += [w[:l] if cnt % 2 else w[-l:]]
                snippets.append([*chain.from_iterable(current_snippet)])
                current_snippet = [w[-l:] if cnt % 2 else w[:l]]
                cnt = 0
            cnt += 1
        else:
            if current_snippet:
                current_snippet[-1].extend(w)
            else:
                current_snippet += [w]

    if current_snippet[-1][-1] == word or len(current_snippet) > 1:
        snippets.append([*chain.from_iterable(current_snippet)])

    return snippets

for snippet in get_snippets(words, word, 15):
    print(' '.join(snippet))
Prints:
xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15
other, which I'm trying to get as one large snippet of text. I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working
topic. So far, I have working code for all instances except the scenario mentioned above. xxx
With the same data and a different length:
for snippet in get_snippets(words, word, 2):
    print(' '.join(snippet))
Prints:
xxx and I'm
I have xxx trying to
trying to xxx get chunks
mentioned above. xxx
As always, a variety of solutions is available here. A fun one would be a recursive wordFind, where it searches the next 15 words and, if it finds the target word, calls itself.
A simpler, though perhaps less efficient, solution would be to add words one at a time:
for m in matches:
    l = " ".join(words[m-15:m])
    i = 1
    while i < 16 and m + i < len(words):
        l += " " + words[m + i]
        if words[m + i].lower() == word:
            m, i = m + i, 1   # another occurrence: restart the 15-word window after it
        else:
            i += 1
    f.write(f"...{l}...")  # writes the data to the external file
    f.write(os.linesep)
Or if you're wanting the subsequent uses to be removed...
bExtend = False
for m in matches:
    if not bExtend:
        l = " ".join(words[m-15:m])
        f.write("...")
    bExtend = False
    i = 1
    while i < 16 and m + i < len(words):
        if words[m + i].lower() == word:
            l += " " + words[m + i]
            bExtend = True
            break
        else:
            l += " " + words[m + i]
            i += 1
    f.write(l)
    if not bExtend:
        f.write("...")
        f.write(os.linesep)
Note, I have not tested this, so it may require a bit of debugging. But the gist is clear: add words piecemeal and extend the addition process when a target word is encountered. With a small addition to the second conditional, this also allows you to extend on target words other than the current one.
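For completeness, another way to get the merged chunks the question asks for, without patching the word-by-word loop, is to turn every match into a (start, end) window first and merge the windows that overlap. A rough sketch, reusing the words and matches lists from the question's code:
def merged_windows(matches, num_words, radius=15):
    # Turn each match index into a window of +/- radius words, then merge
    # any windows that touch or overlap so nearby matches become one chunk.
    windows = []
    for m in sorted(matches):
        start, end = max(0, m - radius), min(num_words, m + radius + 1)
        if windows and start <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return windows

# Usage with the question's variables:
# for start, end in merged_windows(matches, len(words)):
#     f.write("..." + " ".join(words[start:end]) + "...")
#     f.write(os.linesep)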

Join the results of two MapReduce jobs together

I am trying to join the results I get from two MapReduce jobs. The first job returns the 5 most influential papers. Below is the code for the first reducer.
import sys
import operator

current_word = None
current_count = 0
word = None
topFive = {}

# input comes from stdin
for line in sys.stdin:
    line = line.strip()
    # parse the input we got from mapper.py
    word, check = line.split('\t')
    if check != None:
        count = 1
        if current_word == word:
            current_count += count
        else:
            if current_word:
                topFive.update({current_word: current_count})
                #print(current_word, current_count)
            current_count = count
            current_word = word

if current_word == word:
    print(current_word, current_count)

t = sorted(topFive.iteritems(), key=lambda x: -x[1])[:6]
print("Top five most cited papers")
count = 1
for x in t:
    if x[0] != 'nan' and count <= 5:
        print("{0}: {1}".format(*x))
        count = count + 1
The second job finds the 5 most influential authors, and its code is more or less the same as the code above. I want to take the results from these two jobs and join them so that I can determine, for each author, the average number of citations of their 3 most influential papers. I cannot figure out how to do this; it seems I need to somehow join the results?
So far you will end up with two output directories, one for the authors and one for the papers.
Now you want to do a JOIN operation (in DB lingo) on the two sets of files. The MapReduce way to do it is to make a third job that performs this operation on the two output files.
JOIN operations in Hadoop are well studied. One way to do it is the reducer-side join pattern. The pattern consists of the mapper creating a composite key of two subkeys (the original key plus a boolean key specifying whether the record comes from table 0 or table 1).
Before getting to the reducer you need a partitioner that partitions on those composite keys, so that each reducer receives all the records for a given key from both tables.
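If it helps, here is a rough Hadoop Streaming style sketch of that pattern in Python 2. The field layout and the 'papers' directory name are assumptions, and this simple version relies on Streaming's default behaviour of partitioning and sorting on the first tab-separated field, so both tables' records for the same key reach the same reducer:
# join_mapper.py -- tags each record with which job's output it came from.
import os
import sys

# Hadoop Streaming exposes the current input file path as an environment
# variable (map_input_file on older versions, mapreduce_map_input_file on
# newer ones); we use it to decide which "table" a record belongs to.
source = os.environ.get('mapreduce_map_input_file',
                        os.environ.get('map_input_file', ''))
tag = '0' if 'papers' in source else '1'   # assumption: output dir name contains "papers"

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    print '%s\t%s\t%s' % (key, tag, value)   # original key first, then the table tag

# join_reducer.py -- input arrives sorted by key, so both tables' records
# for the same key are adjacent and can be joined here.
import sys
from itertools import groupby

def records(stdin):
    for line in stdin:
        yield line.rstrip('\n').split('\t', 2)

for key, rows in groupby(records(sys.stdin), key=lambda r: r[0]):
    papers, authors = [], []
    for _, tag, value in rows:
        (papers if tag == '0' else authors).append(value)
    # What you emit here is up to your averaging logic.
    print '%s\t%s\t%s' % (key, ';'.join(papers), ';'.join(authors))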
Let me know if you need further clarification, I wrote this one pretty fast.

Finding document frequency using Python

Hey everyone, I know this has been asked a couple of times here already, but I am having a hard time finding document frequency using Python. I am trying to find TF-IDF and then the cosine scores between the documents and a query, but am stuck at finding document frequency. This is what I have so far:
#includes
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

#Read in the directory to the files
path = sys.argv[1]

#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

#counts total number of documents in the directory
doccounter = len(glob.glob1(path, "*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_IDF = re.findall(r'\w+', open(filename).read().lower())
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]
        word_IDF = doc_IDF

    #psudocode!!
    """
    for key in word_idf:
        if key in word_idf:
            word_idf[key] =+1
        else:
            word_idf[key] = 1
    print word_IDF
    """

    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_TF = re.findall(r'\w+', open(filename).read().lower())
        #scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]
        #this assigns values to each term this is my TF for each vector
        TFvec = Counter(doc_TF)
        #weighing the Tf with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])
        #placed here so I dont get a command line full of text
        print TFvec

#Error checker
else:
    print "That path does not exist"
I am using Python 2, and so far I don't really have any idea how to count how many documents a term appears in. I can find the total number of documents, but I am really stuck on finding the number of documents a term appears in. I was just going to create one large dictionary that held all of the terms from all of the documents, which could be fetched later when a query needed those terms. Thank you for any help you can give me.
DF for a term x is the number of documents in which x appears. In order to find that, you need to iterate over all documents first. Only then can you compute IDF from DF.
You can use a dictionary for counting DF:
Iterate over all documents.
For each document, retrieve the set of its words (without repetitions).
Increase the DF count for each word from stage 2. You thus increase the count by exactly one, regardless of how many times the word appeared in the document.
Python code could look like this:
from collections import defaultdict
import math

DF = defaultdict(int)
for filename in glob.glob(os.path.join(path, '*.txt')):
    words = re.findall(r'\w+', open(filename).read().lower())
    for word in set(words):
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1  # defaultdict simplifies your "if key in word_idf: ..." part.

# Now you can compute IDF.
IDF = dict()
for word in DF:
    IDF[word] = math.log(doccounter / float(DF[word]))  # Don't forget that Python 2 uses integer division.
PS It's good for learning to implement things manually, but if you ever get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).
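Once DF and IDF are computed, the TF weights from your existing loop can be turned into TF-IDF vectors and scored against the query with cosine similarity. A rough sketch follows (reusing math, TFvec, IDF and Query_vec from the code above; the helper names are made up):
def to_tfidf(tf_counts, idf):
    # Multiply each term's TF weight (already 1 + log10(count) in your loop)
    # by its IDF; terms the corpus never saw get weight 0.
    return dict((term, weight * idf.get(term, 0.0))
                for term, weight in tf_counts.items())

def cosine(vec_a, vec_b):
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Inside your per-file loop, after TFvec has been weighted:
#     doc_vec = to_tfidf(TFvec, IDF)
#     query_vec = to_tfidf(Query_vec, IDF)
#     print filename, cosine(doc_vec, query_vec)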

Storing 3 different variables (dict. or list) while iterating through documents?

I am iterating through hundreds of thousands of words in several documents, looking to find the frequencies of contractions in English. I have formatted the documents appropriately, and it's now a matter of writing the correct function and storing the data properly. I need to store information for each document on which contractions were found and how frequently they were used in the document. Ideally, my data frame would look something like the following:
filename  contraction  count
file1     it's         34
file1     they're      13
file1     she's        9
file2     it's         14
file2     we're        15
file3     it's         4
file4     it's         45
file4     she's        13
How can I best go about this?
Edit: Here's my code, thus far:
for i in contractions_list: # for each of the 144 contractions in my list
    for l in every_link: # for each speech
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if i in contractions:
                count = count + 1
Where processURL_short() is a function I wrote that scrapes a website and returns a speech as str.
Edit 2:
link_store = {}
for i in contractions_list_test: # for each of the 144 contractions
    for l in every_link_test: # for each speech
        link_store[l] = {}
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if word == i:
                count = count + 1
        if count: link_store[l][i] = count
        print i, l, count
Here's my file-naming code:
splitlink = l.split("/")
president = splitlink[4]
speech_num = splitlink[-1]
filename = "{0}_{1}".format(president,speech_num)
Opening and reading are slow operations: don't cycle through the entire file list 144 times.
Exceptions are slow: throwing an exception for every non-contraction in every speech will be ponderous.
Don't cycle through your list of contractions checking against words. Instead, use the built-in in operator to see whether a word is on the contraction list, and then use a dictionary to tally the entries, just as you might do by hand.
Go through the files, word by word. When you see a word on the contraction list, check whether it's already on your tally sheet. If so, add a mark; if not, add it to the sheet with a count of 1.
Here's an example. I've made very short speeches and a trivial processURL_short function.
def processURL_short(string):
    return string.lower()

every_link = [
    "It's time for going to Sardi's",
    "We're in the mood; it's about DST",
    "They're he's it's don't",
    "I'll be home for Christmas"]

contraction_list = [
    "it's",
    "don't",
    "can't",
    "i'll",
    "he's",
    "she's",
    "they're"
]

for l in every_link:  # for each speech
    contraction_count = {}
    content = processURL_short(l)
    for word in content.split():
        if word in contraction_list:
            if word in contraction_count:
                contraction_count[word] += 1
            else:
                contraction_count[word] = 1
    for key, value in contraction_count.items():
        print key, '\t', value
you can have your structure set up like this:
links = {}
for l in every_link:
    links[l] = {}
    for i in contractions_list:
        count = 0
        ... # here is where you do your count, which you seem to know how to do
        ... # note that in your code, i think you meant if i in word / if i == word for your final if statement
        if count: links[l][i] = count  # only adds the value if count is not 0
you would end up with a data structure like this:
links = {
    'file1': {
        "it's": 34,
        "they're": 14,
        ...,
    },
    'file2': {
        ...,
    },
    ...,
}
which you could easily iterate through to write the necessary data to your file (which I again assume you know how to do, since it's seemingly not part of the question)
Dictionaries seem to be the best option here, because they allow easier manipulation of your data. Your goal should be to index the results by the filename extracted from the link (the URL to your speech text), mapping each filename to a mapping of contraction to its count.
Something like:
{"file1": {"it's": 34, "they're": 13, "she's": 9},
"file2": {"it's": 14, "we're": 15},
"file3": {"it's": 4},
"file4": {"it's": 45, "she's": 13}}
Here's the full code:
ret = {}
for link, text in ((l, processURL_short(l)) for l in every_link):
    contractions = {c: 0 for c in contractions_list}
    for word in text.split():
        try:
            contractions[word] += 1
        except KeyError:
            # Word or contraction not found.
            pass
    ret[file_naming_code(link)] = contractions
Let's go into each step.
First we initialize ret, which will be the resulting dictionary. Then we use a generator expression to run processURL_short() one link at a time (instead of going through the whole link list at once); it yields tuples of (<link-name>, <speech-text>), so we can use the link name later.
Next, contractions is the contraction-count mapping, initialized to 0s; it is used to count the contractions.
Then we split the text into words. For each word we look it up in the contractions mapping: if it is found we count it, otherwise a KeyError is raised for each key not found.
(Another answer here stated that this will perform poorly; another possibility is checking membership with in, like word in contractions.)
Finally:
ret[file_naming_code(link)] = contractions
Now ret is a dictionary mapping each filename to its contraction occurrences, and you can easily create your table from it. Here's how you would get your output:
print '\t'.join(('filename', 'contraction', 'count'))
for link, counts in ret.items():
    for name, count in counts.items():
        print '\t'.join((link, name, str(count)))  # count is an int, so convert it before joining
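Two small follow-ups to this sketch: file_naming_code() is not defined anywhere, but it can simply wrap the file-naming lines from the question, and the try/except tally can be replaced with the in check mentioned above by using collections.Counter. A possible version, under those assumptions:
from collections import Counter

def file_naming_code(l):
    # Wraps the question's own naming logic (assumes the same URL structure).
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    return "{0}_{1}".format(president, speech_num)

contraction_set = set(contractions_list)
ret = {}
for link in every_link:
    text = processURL_short(link)
    counts = Counter(w for w in text.split() if w in contraction_set)
    ret[file_naming_code(link)] = dict(counts)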

Having trouble with two of my functions for text analysis

I'm having trouble trying to find the number of unique words in a speech text file (well, actually 3 files). I'm just going to give you my full code so there are no misunderstandings.
#This program will serve to analyze text files for the number of words in
#the text file, number of characters, sentances, unique words, and the longest
#word in the text file. This program will also provide the frequency of unique
#words. In particular, the text will be three political speeches which we will
#analyze, building on searching techniques in Python.

def main():
    harper = readFile("Harper's Speech.txt")
    newWords = cleanUpWords(harper)
    print(numCharacters(harper), "Characters.")
    print(numSentances(harper), "Sentances.")
    print(numWords(newWords), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))

    obama1 = readFile("Obama's 2009 Speech.txt")
    newWords = cleanUpWords(obama1)
    print(numCharacters(obama1), "Characters.")
    print(numSentances(obama1), "Sentances.")
    print(numWords(obama1), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))

    obama2 = readFile("Obama's 2008 Speech.txt")
    newWords = cleanUpWords(obama2)
    print(numCharacters(obama2), "Characters.")
    print(numSentances(obama2), "Sentances.")
    print(numWords(obama2), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))

def readFile(filename):
    '''Function that reads a text file, then prints the name of file without
    '.txt'. The fuction returns the read file for main() to call, and print's
    the file's name so the user knows which file is read'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.read()
    inFile1.close()
    print("\n", filename.replace(".txt", "") + ":")
    return fileContentsList

def numCharacters(file):
    '''Fucntion returns the length of the READ file (not readlines because it
    would only read the amount of lines and counting characters would be wrong),
    which will be the correct amount of total characters in the text file.'''
    return len(file)

def numSentances(file):
    '''Function returns the occurances of a period, exclamation point, or
    a question mark, thus counting the amount of full sentances in the text file.'''
    return file.count(".") + file.count("!") + file.count("?")

def cleanUpWords(file):
    words = (file.replace("-", " ").replace(" ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return onlyAlpha.replace(" ", " ")

def numWords(newWords):
    '''Function finds the amount of words in the text file by returning
    the length of the cleaned up version of words from cleanUpWords().'''
    return len(newWords.split())

def uniqueWords(newWords):
    unique = sorted(newWords.split())
    unique = set(unique)
    return str(len(unique))

def longestWord(file):
    max(file.split())

main()
So, my last two functions, uniqueWords and longestWord, will not work properly, or at least my output is wrong. For unique words, I'm supposed to get 527, but I'm actually getting 567 for some odd reason. Also, my longest word function always prints None, no matter what I do. I've tried many ways to get the longest word; the above is just one of those ways, but all return None. Please help me with my two sad functions!
Try to do it this way:
def longestWord(file):
    return sorted(file.split(), key=len)[-1]
Or it would be even easier to do it in uniqueWords:
def uniqueWords(newWords):
    unique = set(newWords.split())
    return (str(len(unique)), max(unique, key=len))

info = uniqueWords("My name is Harper")
print("Unique words" + info[0])
print("Longest word" + info[1])
You don't need sorted before set to get all the unique words, because a set is an unordered collection of unique elements.
Also look at cleanUpWords: if you have a string like "Hello I'm Harper. Harper I am.", then after cleaning it up you will get 6 unique words, because you will have the word Im.
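If the over-count comes from case differences and from apostrophes being stripped (both are plausible given cleanUpWords above), a gentler clean-up might look like the sketch below. This is only an illustration, not the assignment's required behaviour: it lowercases the text and keeps apostrophes inside words, so "I'm" no longer turns into "Im".
def cleanUpWords(file):
    # Lowercase so "The" and "the" are counted once; keep apostrophes so
    # contractions survive; turn every other character into a space.
    onlyAlpha = ""
    for ch in file.lower():
        if ch.isalpha() or ch == "'":
            onlyAlpha += ch
        else:
            onlyAlpha += " "
    return " ".join(onlyAlpha.split())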
