I am trying to join the results I get from two MapReduce jobs. The first job returns the 5 most influential papers. Below is the code for the first reducer.
import sys
import operator

current_word = None
current_count = 0
word = None
topFive = {}

# input comes from stdin
for line in sys.stdin:
    line = line.strip()
    # parse the input we got from mapper.py
    word, check = line.split('\t')
    if check != None:
        count = 1
    if current_word == word:
        current_count += count
    else:
        if current_word:
            topFive.update({current_word: current_count})
            #print(current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print(current_word, current_count)

t = sorted(topFive.iteritems(), key=lambda x: -x[1])[:6]
print("Top five most cited papers")
count = 1
for x in t:
    if x[0] != 'nan' and count <= 5:
        print("{0}: {1}".format(*x))
        count = count + 1
The second job finds the 5 most influential authors, and the code is more or less the same as the code above. I want to take the results from these two jobs and join them so that I can determine, for each author, the average number of citations of their 3 most influential papers. I cannot figure out how to do this; it seems I need to somehow join the results?
So far you will end up with two output directories, one for the authors and one for the papers.
Now you want to do a JOIN operation (in DB lingo) on those two files. To do so, the MapReduce way is to make a third job which performs this operation on the two output files.
JOIN operations in Hadoop are well studied. One way to do it is the reducer-side join pattern. In this pattern the mapper creates a composite key made of two subkeys (the original join key plus a flag specifying whether the record comes from table 0 or table 1).
Before the data reaches the reducer you need a partitioner that partitions on the original key only, ignoring the table flag, so that each reducer receives all the records for a given key from both tables.
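To make the pattern concrete, here is a minimal sketch of what the mapper and reducer of that third (streaming) job could look like in Python. The 'papers' substring used to detect the source file, the 0/1 tags, and the tab-separated key/value layout of the two outputs are assumptions; with Hadoop Streaming the partitioning itself is configured through job options (e.g. KeyFieldBasedPartitioner on the first field) rather than written in Python.

#!/usr/bin/env python
# join_mapper.py -- tag every record with the "table" it came from
import os
import sys

# Hadoop Streaming exposes the path of the current input split in an
# environment variable (map_input_file or mapreduce_map_input_file,
# depending on the Hadoop version); use it to decide which output we are reading.
source = os.environ.get('mapreduce_map_input_file',
                        os.environ.get('map_input_file', ''))
table = '0' if 'papers' in source else '1'

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    # composite key: the join key first, the table flag second
    print('%s\t%s\t%s' % (key, table, value))

#!/usr/bin/env python
# join_reducer.py -- records arrive sorted by key, so all records for one
# key (from both tables) can be collected here and joined
import sys
from itertools import groupby

def records(stdin):
    for line in stdin:
        key, table, value = line.rstrip('\n').split('\t', 2)
        yield key, table, value

for key, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    table0, table1 = [], []
    for _, table, value in group:
        (table0 if table == '0' else table1).append(value)
    # emit the joined rows for this key
    for v0 in table0:
        for v1 in table1:
            print('%s\t%s\t%s' % (key, v0, v1))

The join key and the value layout of course have to match whatever your first two jobs actually emit.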
Let me know if you need further clarification, I wrote this one pretty fast.
This is the CSV in question. I got this data from Extra History, which is a video series that covers many historical topics through mini-series, such as 'Rome: The Punic Wars' or 'Europe: The First Crusades' (in the CSV). Episodes in these mini-series are numbered #1 up to #6 (though Justinian & Theodora has two series, the first numbered #1-#6, the second #7-#12).
I would like to do some statistical analysis on these mini-series, which entails sorting the numbered episodes (e.g. episodes #2-#6) into their appropriate series, i.e. the end result should look something like this; I can then easily automate sorting them into the appropriate Python list.
My Python code matches the #2-#6 episodes to the #1 episode correctly 99% of the time, with only 1 big error in red and 1 slight error in yellow (because the first episode of that series is #7, not #1). However, I get the nagging feeling that there is an easier and foolproof way, since the strings are well organized with regular patterns in their names. Is that possible? And can I achieve it with my current code, or should I change it and approach the problem from a different angle? (A rough sketch of the kind of pattern-based approach I have in mind follows my code below.)
import csv
import re

eh_csv = '/Users/Work/Desktop/Extra History Playlist Video Data.csv'
with open(eh_csv, newline='', encoding='UTF-8') as f:
    reader = csv.reader(f)
    data = list(reader)

series_first = []
series_rest = []
singles_music_lies = []
all_episodes = []  # list of the names of all episodes

# Separates all videos into 3 non-overlapping lists: first episodes of a series,
# the other numbered episodes of the series, and singles/music/lies videos
for video in data:
    all_episodes.append(video[0])
    # need regex b/c a normal string search for #1 also matched (Justinian &) Theodora #10
    if len(re.findall(r'\b1\b', video[0])) == 1:
        series_first.append(video[0])
    elif '#' not in video[0]:
        singles_music_lies.append(video[0])
    else:
        series_rest.append(video[0])
# Dice's Coefficient
# got from here: John Rutledge's answer with NinjaMeTimbers' modification
# https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings
# ------------------------------------------------------------------------------------------
def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]


def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                pairs2.remove(y)
                break
    return (2.0 * hit_count) / union
# -------------------------------------------------------------------------------------------
# Only take the first couple of words of the episode names for comparison, b/c the first couple
# of words are 99% of the time the name of the series; can't make it too short or words like
# 'the', 'of', etc. will get matched (now or in the future), or too long because that will
# increase the chance of a superfluous match; this does much better than without limiting
# to the first few words.
def first_three_words(name_string):
    # e.g. ''.join vs ' '.join
    first_three = ' '.join(name_string.split()[:5])   # --> 'The Haitian Revolution', slightly worse
    #first_three = ''.join(name_string.split()[:5])   # --> 'TheHaitianRevolution', slightly better
    return first_three
# Compare a given episode with all first videos, and return a list of comparison scores
def compared_with_first(episode, series_name=series_first):
    episode_scores = []
    for i in series_name:
        x = first_three_words(episode)
        y = first_three_words(i)
        #comparison_score = round(string_similarity(episode, i), 4)
        comparison_score = round(string_similarity(x, y), 4)
        episode_scores.append((comparison_score, i))
    return episode_scores


matches = []
# Go through video number 2, 3, 4, etc. in a series and compare them with the first episode
# of all series, then get a comparison score
for episode in series_rest:
    scores_list = compared_with_first(episode)
    similarity_score = 0
    most_likely_match = []
    # Go through the list of comparison scores returned from compared_with_first,
    # then append the current episode / highest score / first episode to
    # most_likely_match; repeat for all non-first episodes
    for score in scores_list:
        if score[0] > similarity_score:
            similarity_score = score[0]
            most_likely_match.clear()  # MIGHT HAVE BEEN THE CRUCIAL KEY
            most_likely_match.append((episode, score))
    matches.append(most_likely_match)

final_match = []
for i in matches:
    final_match.append((i[0][0], i[0][1][1], i[0][1][0]))

# Just to get the output in the desired presentation
path = '/Users/Work/Desktop/'
with open('EH Sorting Episodes.csv', 'w', newline='', encoding='UTF-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    for currentRow in final_match:
        csvwriter.writerow(currentRow)
        #print(currentRow)
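Here is the kind of pattern-based approach I have in mind (a rough sketch only): since every numbered episode contains a '#N' marker, the text before that marker can be treated as the series name and the episodes grouped exactly instead of fuzzily. If titles carry a per-episode subtitle before the '#', the prefix would still need trimming to the first few words, as first_three_words does.

import csv
import re
from collections import defaultdict

eh_csv = '/Users/Work/Desktop/Extra History Playlist Video Data.csv'
with open(eh_csv, newline='', encoding='UTF-8') as f:
    data = list(csv.reader(f))

series = defaultdict(list)   # series prefix -> list of episode titles
singles_music_lies = []

for video in data:
    title = video[0]
    m = re.search(r'#\d+', title)
    if m:
        # treat everything before the '#N' marker as the series name
        prefix = title[:m.start()].strip(' -:')
        series[prefix].append(title)
    else:
        singles_music_lies.append(title)

for name, episodes in series.items():
    print(name, '->', len(episodes), 'episodes')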
I'm an amateur with basic coding skills in Python. I'm working on a data frame that has a column as below. The intent is to group the output of nltk.FreqDist by the first word.
What I have so far
t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)

# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))
sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1
I have 10000+ rows in my output.
My Expected Output
I would like to group the output by the first word and extract it as a dataframe
What I have tried among other solutions
I have tried adapting solutions given here and here, but no satisfactory results.
Any help/guidance appreciated.
Try the following (documentation is inside the code):
import itertools
import nltk

# I assume the input, t_words, is a list of strings (each containing multiple words)
t_words = ...

# This creates a counter from a string to its occurrences
input_frequencies = nltk.FreqDist(t_words)

# Taking inputs only if they appear more than 3 times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m) where m was the message. If you want to filter by the string length,
# you can restore it to len(input_str) > 3
frequent_inputs = {
    input_str: count
    for input_str, count in input_frequencies.items()
    if count > 3
}

# We will apply this function on each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
    # You can replace this by a better implementation from nltk
    return value.split(' ')[0]

# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
# Note that groupby only groups *adjacent* items, so the strings are sorted
# by their first word before grouping.
first_word_to_inputs = itertools.groupby(
    # Take the strings from the above dictionary, sorted by their first word
    sorted(frequent_inputs.keys(), key=first_word),
    # And key by the first word
    first_word)

# If you would also want to keep the count of each input, we can map from
# first word to a list of (string, count) pairs:
first_word_to_inputs_and_counts = itertools.groupby(
    # Pairs of string and count, sorted by the first word of the string
    sorted(frequent_inputs.items(), key=lambda pair: first_word(pair[0])),
    # Extract the string from the pair, and then take the first word
    lambda pair: first_word(pair[0])
)
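Since groupby only yields (key, iterator) pairs, you will probably want to materialize the groups before using them. A small sketch of that last step (first_word_to_inputs is the grouping created above; the pandas conversion is just one possible way to get the dataframe asked for):

# Collect the groups into a plain dict: first word -> list of messages
grouped = {first: list(messages) for first, messages in first_word_to_inputs}

# And, since the question asked for a dataframe:
import pandas as pd
df_grouped = pd.DataFrame(list(grouped.items()), columns=['first_word', 'messages'])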
I managed to do it like below. There could be an easier implementation. But for now, this gives me what I had expected.
import pandas as pd

temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())

# Removing empty rows
filter = temp["word"] != ""
dfNew = temp[filter]

# Splitting off the first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)

# New column with the sentences split without the first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]

# Subsetting required columns
dfNew = dfNew[['first_word', 'rest_words']]

# Grouping by first word
dfNew = dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()

# Transpose
dfNew.T
Sample Output
I have a large txt file and I'm trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above.
def occurs(word1, word2, filename):
    import os

    infile = open(filename, 'r')  # opens the file, reads it, splits it into lines
    lines = infile.read().splitlines()
    infile.close()

    wordlist = [word1, word2]  # this list allows for multiple words

    wordsString = ' '.join(lines)  # rejoins the lines into one string
    words = wordsString.split()    # splits the text into individual words

    f = open(filename, 'w')
    f.write("start")
    f.write(os.linesep)

    for word in wordlist:
        matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
        for m in matches:
            l = " ".join(words[max(0, m - 15):m + 16])
            f.write(f"...{l}...")  # writes the data to the external file
            f.write(os.linesep)

    f.close()
So far, when two of the same word are too close together, the program just doesn't run on one of them. Instead, I want to get out a longer chunk of text that extends 15 words behind the earliest of the nearby occurrences and 15 words in front of the latest one.
This snippet will get the chosen number of words around the keyword. If some keywords are close together, it will join them into one snippet:
s = '''xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above. xxx'''
words = s.split()

from itertools import groupby, chain

word = 'xxx'

def get_snippets(words, word, l):
    snippets, current_snippet, cnt = [], [], 0
    for v, g in groupby(words, lambda w: w != word):
        w = [*g]
        if v:
            if len(w) < l:
                current_snippet += [w]
            else:
                current_snippet += [w[:l] if cnt % 2 else w[-l:]]
                snippets.append([*chain.from_iterable(current_snippet)])
                current_snippet = [w[-l:] if cnt % 2 else w[:l]]
                cnt = 0
            cnt += 1
        else:
            if current_snippet:
                current_snippet[-1].extend(w)
            else:
                current_snippet += [w]

    if current_snippet[-1][-1] == word or len(current_snippet) > 1:
        snippets.append([*chain.from_iterable(current_snippet)])

    return snippets

for snippet in get_snippets(words, word, 15):
    print(' '.join(snippet))
Prints:
xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15
other, which I'm trying to get as one large snippet of text. I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working
topic. So far, I have working code for all instances except the scenario mentioned above. xxx
With the same data and a different length:
for snippet in get_snippets(words, word, 2):
    print(' '.join(snippet))
Prints:
xxx and I'm
I have xxx trying to
trying to xxx get chunks
mentioned above. xxx
As always, a variety of solutions are available here. A fun one would be a recursive wordFind, where it searches the next 15 words and, if it finds the target word, calls itself.
A simpler, though perhaps not efficient, solution would be to add words one at a time:
for m in matches:
    l = " ".join(words[max(0, m - 15):m + 1])
    i = 1
    while i < 16 and m + i < len(words):
        if words[m + i].lower() == word:
            # found the keyword again: keep it and restart the 15-word window from it
            l += " " + words[m + i]
            m, i = m + i, 1
        else:
            l += " " + words[m + i]
            i += 1
    f.write(f"...{l}...")  # writes the data to the external file
    f.write(os.linesep)
Or, if you want the subsequent occurrences to be absorbed into the same snippet rather than repeated...
b_extend = False
for m in matches:
    if not b_extend:
        l = " ".join(words[max(0, m - 15):m + 1])
    b_extend = False
    i = 1
    while i < 16 and m + i < len(words):
        l += " " + words[m + i]
        if words[m + i].lower() == word:
            # another keyword ahead: let the next match continue this snippet
            b_extend = True
            break
        i += 1
    if not b_extend:
        f.write(f"...{l}...")
        f.write(os.linesep)
Note: I have not tested this, so it may require a bit of debugging. But the gist is clear: add words piecemeal and extend the addition process when a target word is encountered. With a small addition to the second conditional, this also lets you extend on target words other than the current one.
Decided to delete and ask again; it was just easier! Please do not vote down, as I have taken on board what people have been saying.
I have two nested dictionaries:-
wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
The first dictionary links each word to a file number and the number of times it appears in that file. The second contains searches, linking each word to the number of times it appears in the current search.
I want to extract certain values so that, for each search, I can calculate the scalar product between the number of times words appear in a file and the number of times they appear in the search, divided by their magnitudes, i.e. (word 1 appearances in search * word 1 appearances in file) + (word 2 appearances in search * word 2 appearances in file), etc., and then see which file is most similar to the current search. I then want to return a dictionary of searches to lists of file numbers, most similar first, least similar last.
Expected output is a dictionary:
{1:[4,3,1,2],2:[1,2,4,3]}
etc.
The key is the search number, the value is a list of files most relevant first.
(These may not actually be right.)
This is what I have:-
def retrieve():
    results = {}
    for word in search:
        numberOfAppearances = wordFrequency.get(word).values()
        for appearances in numberOfAppearances:
            results[fileNumber] = numberOfAppearances.dot()
    return sorted(results.iteritems(), key=lambda (fileNumber, appearances): appearances, reverse=True)
Sorry, no, it just says wdir = and then the directory the .py file is in.
Edit
The entire Retrieve.py file:
from collections import Counter

def retrieve():
    wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
    search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results
I am using the Spyder GUI/IDE for Anaconda Python 2.7; I just press the green play button and the output is:
wdir='/Users/danny/Desktop'
Edit 2
Regarding the magnitude: for example, for search number 3 and file 1 it would be:
sqrt (2^2 + 3^2 + 0^2) * sqrt (3^2 + 0^2 + 3^2)
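So for that search/file pair the similarity would be (2*3 + 3*0 + 0*3) / (sqrt(13) * sqrt(18)) = 6 / sqrt(234) ≈ 0.39.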
Here is a start:
from collections import Counter

def retrieve():
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results

print retrieve()
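To also divide by the magnitudes described in Edit 2, the same loop can be extended. This is only a sketch built on the dictionaries from the question (Python 2.7, to match the code above); the magnitude of a file is taken over all words in wordFrequency, following the example in Edit 2.

import math
from collections import Counter

wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}

def retrieve():
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        # dot product between the search vector and each file vector
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        # magnitude of the search vector
        search_magnitude = math.sqrt(sum(n * n for n in words.itervalues()))
        # divide each dot product by the product of the two magnitudes
        for file_id in file_relevancy:
            file_magnitude = math.sqrt(sum(
                counts.get(file_id, 0) ** 2 for counts in wordFrequency.itervalues()))
            if search_magnitude and file_magnitude:
                file_relevancy[file_id] /= search_magnitude * file_magnitude
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results

print retrieve()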
I'm trying to write a MapReduce program for computing Trigrams using the mrjob framework in Python. So far, this is what I have:
from mrjob.job import MRJob

class MRTrigram(MRJob):
    def mapper(self, _, line):
        w = line.split()
        for idx, word in enumerate(w):
            if idx < len(w) - 2:
                # Generate a trigram using the current word and the next 2 words
                trigram = w[idx] + " " + w[idx + 1] + " " + w[idx + 2]
                yield trigram, 1

    def reducer(self, key, values):
        yield sum(values), key

# ignore this part - it's just standard boilerplate for mrjob!
if __name__ == '__main__':
    MRTrigram.run()
As can be seen, I've not handled the case where a trigram is split across lines (say, "it was" at the end of line 3 and "the best of times" at the beginning of line 4; my code would not capture the trigram "it was the" in this case!).
How do I go about preserving state across multiple map calls, ensuring that no matter how the mappers are assigned work by the underlying runtime, only trigrams across consecutive lines are counted? I thought of storing the last 2 words of each line in a persistent data structure inside the MRTrigram class, but then I realized I could not guarantee that I was comparing words across lines i and i+1 (and not lines i and j, where j could be anywhere in the document!).
Any ideas to set me on the right track?
You might get a hint as to how this could be done by writing a custom protocol, but I believe mrjob splits the input stream on newlines before you get the chance to add customized behavior (i.e., to form your own keys and values), so it might not be possible with mrjob.
If you are using Hadoop directly (i.e., native Java), then you can write a custom input format that takes multi-line text and parses a key-value pair out of it.
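That said, within a single mapper task mrjob does let you keep state between calls (instance attributes plus the mapper_init/mapper_final hooks), so trigrams that span consecutive lines of the same input split can be recovered by carrying the last two words of each line forward. Trigrams that span split boundaries are still missed, which is the limitation described above. A sketch of that idea:

from mrjob.job import MRJob

class MRTrigramAcrossLines(MRJob):
    def mapper_init(self):
        # the last two words seen by this mapper task, carried across lines
        self.tail = []

    def mapper(self, _, line):
        # prepend the previous line's tail so line-spanning trigrams are formed;
        # within one task the lines of a split arrive in order
        w = self.tail + line.split()
        for idx in range(len(w) - 2):
            yield " ".join(w[idx:idx + 3]), 1
        self.tail = w[-2:]

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRTrigramAcrossLines.run()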