Map Reduce program to calculate the average and count

I am trying to calculate the trip count and average distance for taxi trips using a MapReduce program in Python.
In the mapper I have written the following code, which emits a key for each row.
import sys
for line in sys.stdin:
    line = line.strip()
    words = line.split(",")
    trip = words[0]
    km = words[1]
    print('%s\t%s\t%s' % (trip, km, "1"))
The reducer program is below.
#!/usr/bin/env python3
import sys
current_trip = None
current_km = 0
current_count = 0
trip = None
gender = None
for line in sys.stdin:
line = line.strip()
trip,gender,count = line.split(",")
try:
count = int(count)
except ValueError:
continue
if current_trip == trip:
current_km = (km + current_km)
current_count += count
print('%s\t%s' % (current_trip,current_count, {current_km/current_count}))
current_trip = trip
current_count = count
current_km = 0
else:
if current_trip == trip:
current_count += count
print('%s\t%s' % (current_trip, current_count,km))
When I run it, I get this error:
Traceback (most recent call last):
File "reducer.py", line 23, in <module>
print('%s\t%s\t%s' % (current_trip, current_count, {current_km / current_count}))
ZeroDivisionError: division by zero
I am not able to debug it properly, because when I add a print statement, nothing from it shows up in the output.
Can someone please help?

If the first line has a count of 0, or if negative counts bring current_count to 0 at some point, you will get this error. Add a check before the print call to debug the problem:
if current_count != 0:
    print('%s\t%s\t%s' % (current_trip, current_count, current_km / current_count))
else:
    print(f"error: the current_count is 0 and the count is {count}")
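For reference, here is a minimal reducer sketch, assuming the input is the mapper's tab-separated output (trip, km, 1) already sorted by trip. The posted reducer splits on "," while the mapper emits tabs, and it never converts km to a number, so both are handled here. The name reduce_trips and the sample rows are illustrative and not from the original post. If the job runs under something like Hadoop Streaming (an assumption), debug messages are safer on sys.stderr, because stdout is treated as the reducer's data.

```python
import sys


def reduce_trips(lines):
    """Yield (trip, count, average_km) for lines sorted by trip key."""
    current_trip = None
    total_km = 0.0
    total_count = 0
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 3:
            # Debug messages go to stderr so they do not mix with the data.
            print("skipping malformed line: %r" % line, file=sys.stderr)
            continue
        trip, km, count = parts
        try:
            km = float(km)
            count = int(count)
        except ValueError:
            continue
        if trip != current_trip:
            # Key changed: emit the finished group, then start a new one.
            if current_trip is not None and total_count:
                yield current_trip, total_count, total_km / total_count
            current_trip, total_km, total_count = trip, 0.0, 0
        total_km += km
        total_count += count
    # Emit the last group, guarding against a zero count.
    if current_trip is not None and total_count:
        yield current_trip, total_count, total_km / total_count


# Illustrative rows; a real reducer would pass sys.stdin instead.
sample = ["t1\t2.0\t1\n", "t1\t4.0\t1\n", "t2\t10.0\t1\n"]
for trip, count, average in reduce_trips(sample):
    print("%s\t%s\t%s" % (trip, count, average))
```

Because the division happens only when total_count is non-zero, the ZeroDivisionError cannot occur in this version.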

IndexError: list index out of range when using tuples

I'm very confused. I get an error on line 43 saying that the list index is out of range. Any help is appreciated.
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        end = start + 1
        while start < len(line):
            character = line[start]
            if character.isspace():
                end += 1
            elif character.isalpha():
                end = start + 1
                while end < len(line) and line[end].isalpha():
                    end += 1
                words.append(line[start:end].lower())
            elif character.isdigit():
                end = start + 1
                while end < len(line) and line[end].isdigit():
                    end += 1
                words.append(line[start:end])
            else:
                end += 1
                words.append(line[start:end])
            start = end
    return words
def countWords(words, stopWords):
    wordDict = {}
    for word in words:
        if word in stopWords:
            continue
        elif not word in wordDict:
            wordDict[word] = 1
        else:
            frequency = wordDict.get(word)
            wordDict[word] = frequency + 1
    return wordDict
def printTopMost(frequencies, n):
    listOfTuples = sorted(frequencies.items(), key=lambda x:x[1], reverse=True)
    for x in range(n):
        pair = listOfTuples[x]
        word = pair[0]
        frequency = str(pair[1])
        print(word.ljust(20), frequency.rjust(5))
The line pair = listOfTuples[x] is the one that raises the error.
This is how the function is called (test.py). There are instructions for the other functions I have created, such as tokenize and countWords, but the error I'm getting is not in those, which is why I've left them out.
def printTopMost(freq,n):
    saved = sys.stdout
    sys.stdout = io.StringIO()
    wordfreq.printTopMost(freq,n)
    out = sys.stdout.getvalue()
    sys.stdout = saved
    return out
test(printTopMost,({"horror": 5, "happiness": 15},0),"")
test(printTopMost,({"C": 3, "python": 5, "haskell": 2, "java": 1},3),"python 5\nC 3\nhaskell 2\n")
Full error message
Traceback (most recent call last):
File "C:/Users/Daniel/Documents/Scripts/Chalmers/lab1/test.py", line 81, in <module>
run()
File "C:/Users/Daniel/Documents/Scripts/Chalmers/lab1/test.py", line 70, in run
test(printTopMost,({},10),"")
File "C:/Users/Daniel/Documents/Scripts/Chalmers/lab1/test.py", line 8, in test
z = fun(*x)
File "C:/Users/Daniel/Documents/Scripts/Chalmers/lab1/test.py", line 41, in printTopMost
wordfreq.printTopMost(freq,n)
File "C:\Users\Daniel\Documents\Scripts\Chalmers\lab1\wordfreq.py", line 4, in printTopMost
pair = listOfTuples[x]
IndexError: list index out of range
Condition failed:
printTopMost({'horror': 5, 'happiness': 15}, 0) == ''
printTopMost returned/printed:
happiness 15
horror 5
Condition failed:
printTopMost({'C': 3, 'python': 5, 'haskell': 2, 'java': 1}, 3) == 'python 5\nC 3\nhaskell 2\n'
printTopMost returned/printed:
python 5
C 3
haskell 2
java 1
https://i.imgur.com/9SciXtx.png

def printTopMost(frequencies, n):
    listOfTuples = sorted(frequencies.items(), key=lambda x:x[1], reverse=True)
    n=min(n,len(listOfTuples))
    for x in range(n):
        pair = listOfTuples[x]
        word = pair[0]
        frequency = str(pair[1])
        print(word.ljust(20), frequency.rjust(5))
A simple fix: clamp n to the length of the list with min() so the loop never indexes past the end.
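The same clamping can also be written with a slice, since slicing past the end of a list does not raise an error. A small sketch; the helper name top_most_lines, and returning the lines instead of printing them, are choices made here for illustration only:

```python
def top_most_lines(frequencies, n):
    """Return formatted lines for the n most frequent words."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    # ranked[:n] yields at most n items, even when n > len(ranked).
    return [word.ljust(20) + " " + str(count).rjust(5) for word, count in ranked[:n]]


for line in top_most_lines({"C": 3, "python": 5, "haskell": 2, "java": 1}, 3):
    print(line)
```

With an empty dictionary and n = 10, as in the failing test call from the traceback, this returns an empty list instead of raising IndexError.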

read multiple file and compare with the fixed files

I have 50 files in a directory that are supposed to be compared with one file, e.g., original.txt. I have the following code. It works well when I give the file names one by one, manually. To automate this, I used glob.glob:
folder = "files/"
path = '*.rbd'
path = folder + path
files=sorted(glob.glob(path))
Here is the complete code:
import glob
from itertools import islice
import linecache
num_lines_nonbram = 1891427
bits_perline = 32
total_bit_flips = 0
num_bit_diff_flip_zero = 0
num_bit_diff_flip_ones = 0
folder = "files/"
path = '*.rbd'
path = folder + path
files=sorted(glob.glob(path))
original=open('files/mull-original-readback.rbd','r')
#source1 = open(file1, "r")
for filename in files:
del_lines = 101
with open(filename,'r') as f:
i=1
while i <= del_lines:
line1 = f.readline()
lineoriginal=original.readline()
i+=1
i=0
num_bit_diff_flip_zero = 0
num_bit_diff_flip_ones = 0
num_lines_diff =0
i=0
j=0
k=0
a_write2 = ""
while i < (num_lines_nonbram-del_lines):
line1 = f.readline()
lineoriginal = original.readline()
while k < bits_perline:
if ((lineoriginal[k] == line1[k])):
a_write2 += " "
else:
if (lineoriginal[k]=="0"):
#if ((line1[k]=="0" and line1[k]=="1")):
num_bit_diff_flip_zero += 1
if (lineoriginal[k]=="1"):
#if ((line1[k]=="0" and line1[k]=="1")):
num_bit_diff_flip_ones += 1
#if ((line1[k]==1 and line1[k]==0)):
#a_write_file2 = str(i+1) + " " + str(31-k) + "\n" + a_write_file2
#a_write2 += "^"
#num_bit_diff_flip_one += 1
# else:
# a_write2 += " "
k+=1
total_bit_flips=num_bit_diff_flip_zero+num_bit_diff_flip_ones
i+=1
k=0
i = 0
print files
print "Number of bits flip zero= %d" %num_bit_diff_flip_zero +"\n" +"Number of bits flip one= %d" %num_bit_diff_flip_ones +"\n" "Total bit flips = %d " %total_bit_flips
f.close()
original.close()
I got the error:
Traceback (most recent call last):
File "random-ones-zeros.py", line 65, in <module>
if ((lineoriginal[k] == line1[k])):
IndexError: string index out of range
I guess there is some issue with reading the files automatically instead of giving the names manually, but I wasn't able to find the solution.

The string index is out of range because k takes a value beyond the last valid index of one of the strings. It may be possible to fix this by shifting the index:
if ((lineoriginal[k-1] == line1[k-1])):
Hope this helps, but I can't access Python right now, so I haven't been able to test it. :-)
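Another possible cause, offered as an untested reading of the posted code: original is opened once, before the loop over the files, so after the first file it is at end-of-file and readline() returns an empty string, and indexing an empty string fails for any index. That would match the script working when a single file name is given. Below is a sketch that compares the files line by line and expects fresh inputs for every comparison; count_bit_flips and its parameters are hypothetical names, and Python 3 is assumed.

```python
from itertools import islice


def count_bit_flips(original_lines, candidate_lines, skip_lines=101, bits_per_line=32):
    """Return (flips where the original bit was 0, flips where it was 1)."""
    zero_flips = 0
    one_flips = 0
    # zip stops at the shorter input, so a short file cannot cause an IndexError.
    pairs = zip(original_lines, candidate_lines)
    for ref_line, line in islice(pairs, skip_lines, None):
        # Slicing (unlike indexing) tolerates lines shorter than bits_per_line.
        for ref_bit, bit in zip(ref_line[:bits_per_line], line[:bits_per_line]):
            if ref_bit == bit:
                continue
            if ref_bit == "0":
                zero_flips += 1
            elif ref_bit == "1":
                one_flips += 1
    return zero_flips, one_flips


# Tiny in-memory example with one header line to skip.
reference = ["header\n", "0101\n", "1111\n"]
candidate = ["header\n", "0111\n", "1011\n"]
print(count_bit_flips(reference, candidate, skip_lines=1, bits_per_line=4))
```

With real files, opening the original inside the loop over files (for example in the same with statement as the candidate file) makes every comparison start from the first line of the original.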

ValueError: need more than 1 value to unpack

I am working on a word count program.
#!/usr/bin/env python
import sys
# maps words to their counts
word2count = {}
# input comes from STDIN
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    try:
        word2count[word] = word2count[word]+count
    except:
        word2count[word] = count
for word in word2count.keys():
    print '%s\t%s'% ( word, word2count[word] )
Error for this code:
word, count = line.split('\t', 1)
ValueError : need more than 1 value to unpack

Moving word, count = line.split('\t', 1) into the try-except should work:
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue
This skips every line that does not consist of a word and an integer count separated by a tab.
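As a self-contained illustration of that approach, here is a Python 3 sketch. The function name count_words and the sample lines are made up for the example; the original script reads from sys.stdin and uses Python 2 print statements.

```python
def count_words(lines):
    """Sum the counts per word from 'word<TAB>count' lines."""
    word2count = {}
    for line in lines:
        try:
            # Both the unpacking and int() raise ValueError on bad input.
            word, count = line.strip().split("\t", 1)
            count = int(count)
        except ValueError:
            continue
        word2count[word] = word2count.get(word, 0) + count
    return word2count


totals = count_words(["a\t1\n", "b\t2\n", "no tab here\n", "a\t3\n", "c\tx\n"])
for word, total in totals.items():
    print("%s\t%s" % (word, total))
```

Using dict.get with a default of 0 also replaces the bare except around the dictionary update in the question.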

Putting functions that read user-input files into a loop using exceptions

The program reads a file of keywords with number values attached to them. It then reads a file of a couple of thousand tweets, each containing the latitude, longitude, and text of the tweet. You have to sort the tweets into specific regions and then calculate a sentiment average for each region based on the keywords and values from the first file. The user has to input these two files, and the code has to use a try statement with exception handling. The functions work on their own to calculate the proper values, but when I put them in the try statement I get these errors:
Traceback (most recent call last): line 129, main(); line 16, sortKeys(keys); and the last error, line 56, keyword[lines[0]] = int(lines[1]) IndexError: list index out of range
Is there anything I can do to fix it?
eastern = []
central = []
mountain = []
pacific = []
keyword = {}
easternsum =[]
centralsum= []
mountainsum = []
pacificsum = []
def main() :
    done = False
    while not done:
        try:
            keys = input("Enter file: ")
            readkeys(keys)
            sortKeys(keys)
            tweets = input("Enter second file: ")
            readtweets(tweets)
            sorttweet(tweets)
            calcsentiment()
            print("The eastern amount of tweets is",len(easternsum))
            print("The eastern happiness score is",sum(easternsum)/len(easternsum))
            print("The central amount of tweets is",len(centralsum))
            print("The central happiness score is",sum(centralsum)/len(centralsum))
            print("The mountain amount of tweets is",len(mountainsum))
            print("The mountain happiness score is",sum(mountainsum)/len(mountainsum))
            print("The pacific amount of tweets is",len(pacificsum))
            print("The pacific happiness score is",sum(pacificsum)/len(pacificsum))
            done = True
        except IOError:
            print("Error, file not found.")
        except ValueError:
            print("Invalid file.")
        except RuntimeError as error:
            print("Error", str(error))
def readkeys(keys):
    keys = open(keys, "r")
def readtweets(tweets):
    tweets = open(tweets, "r")
def sortKeys(keys):
    for line in keys :
        lines = line.split(",")
        keyword[lines[0]] = int(lines[1])
def sorttweet(tweets) :
    for line in tweets :
        stuff = line.split(" ",5)
        long = float(stuff[0].strip("[,"))
        lat = float(stuff[1].strip('],'))
        tweet = stuff[5]
        if 24.660845 < long < 49.189787 and -87.518395 < lat < -67.444574 :
            eastern.append(tweet)
        if 24.660845 < long < 49.189787 and -101.998892 < lat < -87.518395 :
            central.append(tweet)
        if 24.660845 < long < 49.189787 and -115.236428 < lat < -101.998892 :
            mountain.append(tweet)
        if 24.660845 < long < 49.189787 and -125.242264 < lat < -115.236428 :
            pacific.append(tweet)
def calcsentiment():
    for tweet in eastern :
        tweetlist = tweet.split()
        count = 0
        tweetV = 0
        for word in tweetlist:
            if word in keyword :
                count = count + 1
                tweetV = tweetV + keyword[word]
        if count > 0:
            easternsum.append(tweetV / count)
    for tweet in central:
        tweetlist2 = tweet.split()
        count = 0
        tweetV = 0
        for word in tweetlist2 :
            if word in keyword :
                count = count + 1
                tweetV = tweetV + keyword[word]
        if count > 0:
            centralsum.append(tweetV / count)
    for tweet in mountain:
        tweetlist3 = tweet.split()
        count = 0
        tweetV = 0
        for word in tweetlist3 :
            if word in keyword :
                count = count + 1
                tweetV = tweetV + keyword[word]
        if count > 0:
            mountainsum.append(tweetV / count)
    for tweet in pacific:
        tweetlist4 = tweet.split()
        count = 0
        tweetV = 0
        for word in tweetlist4 :
            if word in keyword :
                count = count + 1
                tweetV = tweetV + keyword[word]
        if count > 0:
            pacificsum.append(tweetV / count)
calcsentiment()
main()

You have a problem here:
def sortKeys(keys):
    for line in keys :
        lines = line.split(",")
        keyword[lines[0]] = int(lines[1])
When you split the line, you get only one token, not two.
That happens when the line you are trying to split does not contain a ',' character.
Try something like "xxxx".split(",") in the Python console and you will see that the result is ["xxxx"], a list with just one element, while in your code lines[1] tries to access the second element of the list.
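In the posted code there is a likely reason why no comma is found: readkeys() opens the file but only rebinds its own local variable, so sortKeys(keys) still receives the file name string and iterates over its characters. A minimal sketch of one way around that follows; parse_keywords is a hypothetical helper that takes the opened file (or any iterable of lines) and skips lines it cannot parse.

```python
def parse_keywords(lines):
    """Build {keyword: value} from 'keyword,value' lines, skipping bad ones."""
    keyword = {}
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) < 2:
            continue  # blank line or no comma: there is no second element
        try:
            keyword[parts[0]] = int(parts[1])
        except ValueError:
            continue  # the value is not an integer
    return keyword


# Iterating over a str yields single characters, not lines:
print([piece.split(",") for piece in "ab"])

print(parse_keywords(["happy,10\n", "\n", "sad,1\n", "broken\n"]))
```

In main(), the file would then be opened in place and passed along, for example with open(keys) as handle followed by parse_keywords(handle), so that the IOError handler still catches a missing file.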

Python script showing different results (with one error) for two similar input files

The script, originally taken and modified from (http://globplot.embl.de/):
#!/usr/bin/env python
# Copyright (C) 2003 Rune Linding - EMBL
# GlobPlot TM
# GlobPlot is licensed under the Academic Free license
from string import *
from sys import argv
from Bio import File
from Bio import SeqIO
import fpformat
import sys
import tempfile
import os
from os import system,popen3
import math
# Russell/Linding
RL = {'N':0.229885057471264,'P':0.552316012226663,'Q':-0.187676577424997,'A':-0.261538461538462,'R':-0.176592654077609, \
'S':0.142883029808825,'C':-0.0151515151515152,'T':0.00887797506611258,'D':0.227629796839729,'E':-0.204684629516228, \
'V':-0.386174834235195,'F':-0.225572305974316,'W':-0.243375458622095,'G':0.433225711769886,'H':-0.00121743364986608, \
'Y':-0.20750516775322,'I':-0.422234699606962,'K':-0.100092289621613,'L':-0.337933495925287,'M':-0.225903614457831}
def Sum(seq,par_dict):
sum = 0
results = []
raws = []
sums = []
p = 1
for residue in seq:
try:
parameter = par_dict[residue]
except:
parameter = 0
if p == 1:
sum = parameter
else:
sum = sum + parameter#*math.log10(p)
ssum = float(fpformat.fix(sum,10))
sums.append(ssum)
p +=1
return sums
def getSlices(dydx_data, DOM_join_frame, DOM_peak_frame, DIS_join_frame, DIS_peak_frame):
DOMslices = []
DISslices = []
in_DOMslice = 0
in_DISslice = 0
beginDOMslice = 0
endDOMslice = 0
beginDISslice = 0
endDISslice = 0
for i in range( len(dydx_data) ):
#close dom slice
if in_DOMslice and dydx_data[i] > 0:
DOMslices.append([beginDOMslice, endDOMslice])
in_DOMslice = 0
#close dis slice
elif in_DISslice and dydx_data[i] < 0:
DISslices.append([beginDISslice, endDISslice])
in_DISslice = 0
# elseif inSlice expandslice
elif in_DOMslice:
endDOMslice += 1
elif in_DISslice:
endDISslice += 1
# if not in slice and dydx !== 0 start slice
if dydx_data[i] > 0 and not in_DISslice:
beginDISslice = i
endDISslice = i
in_DISslice = 1
elif dydx_data[i] < 0 and not in_DOMslice:
beginDOMslice = i
endDOMslice = i
in_DOMslice = 1
#last slice
if in_DOMslice:
DOMslices.append([beginDOMslice, endDOMslice])
if in_DISslice:
DISslices.append([beginDISslice,endDISslice])
k = 0
l = 0
while k < len(DOMslices):
if k+1 < len(DOMslices) and DOMslices[k+1][0]-DOMslices[k][1] < DOM_join_frame:
DOMslices[k] = [ DOMslices[k][0], DOMslices[k+1][1] ]
del DOMslices[k+1]
elif DOMslices[k][1]-DOMslices[k][0]+1 < DOM_peak_frame:
del DOMslices[k]
else:
k += 1
while l < len(DISslices):
if l+1 < len(DISslices) and DISslices[l+1][0]-DISslices[l][1] < DIS_join_frame:
DISslices[l] = [ DISslices[l][0], DISslices[l+1][1] ]
del DISslices[l+1]
elif DISslices[l][1]-DISslices[l][0]+1 < DIS_peak_frame:
del DISslices[l]
else:
l += 1
return DOMslices, DISslices
def SavitzkyGolay(window,derivative,datalist):
SG_bin = 'sav_gol'
stdin, stdout, stderr = popen3(SG_bin + '-D' + str(derivative) + ' -n' + str(window)+','+str(window))
for data in datalist:
stdin.write(`data`+'\n')
try:
stdin.close()
except:
print stderr.readlines()
results = stdout.readlines()
stdout.close()
SG_results = []
for result in results:
SG_results.append(float(fpformat.fix(result,6)))
return SG_results
def reportSlicesTXT(slices, sequence, maskFlag):
if maskFlag == 'DOM':
coordstr = '|GlobDoms:'
elif maskFlag == 'DIS':
coordstr = '|Disorder:'
else:
raise SystemExit
if slices == []:
#by default the sequence is in uppercase which is our search space
s = sequence
else:
# insert seq before first slide
if slices[0][0] > 0:
s = sequence[0:slices[0][0]]
else:
s = ''
for i in range(len(slices)):
#skip first slice
if i > 0:
coordstr = coordstr + ', '
coordstr = coordstr + str(slices[i][0]+1) + '-' + str(slices[i][1]+1)
#insert the actual slice
if maskFlag == 'DOM':
s = s + lower(sequence[slices[i][0]:(slices[i][1]+1)])
if i < len(slices)-1:
s = s + upper(sequence[(slices[i][1]+1):(slices[i+1][0])])
#last slice
elif slices[i][1] < len(sequence)-1:
s = s + lower(sequence[(slices[i][1]+1):(len(sequence))])
elif maskFlag == 'DIS':
s = s + upper(sequence[slices[i][0]:(slices[i][1]+1)])
#insert untouched seq between disorder segments, 2-run labelling
if i < len(slices)-1:
s = s + sequence[(slices[i][1]+1):(slices[i+1][0])]
#last slice
elif slices[i][1] < len(sequence)-1:
s = s + sequence[(slices[i][1]+1):(len(sequence))]
return s,coordstr
def runGlobPlot():
try:
smoothFrame = int(sys.argv[1])
DOM_joinFrame = int(sys.argv[2])
DOM_peakFrame = int(sys.argv[3])
DIS_joinFrame = int(sys.argv[4])
DIS_peakFrame = int(sys.argv[5])
file = str(sys.argv[6])
db = open(file,'r')
except:
print 'Usage:'
print ' ./GlobPipe.py SmoothFrame DOMjoinFrame DOMpeakFrame DISjoinFrame DISpeakFrame FASTAfile'
print ' Optimised for ELM: ./GlobPlot.py 10 8 75 8 8 sequence_file'
print ' Webserver settings: ./GlobPlot.py 10 15 74 4 5 sequence_file'
raise SystemExit
for cur_record in SeqIO.parse(db, "fasta"):
#uppercase is searchspace
seq = upper(str(cur_record.seq))
# sum function
sum_vector = Sum(seq,RL)
# Run Savitzky-Golay
smooth = SavitzkyGolay('smoothFrame',0, sum_vector)
dydx_vector = SavitzkyGolay('smoothFrame',1, sum_vector)
#test
sumHEAD = sum_vector[:smoothFrame]
sumTAIL = sum_vector[len(sum_vector)-smoothFrame:]
newHEAD = []
newTAIL = []
for i in range(len(sumHEAD)):
try:
dHEAD = (sumHEAD[i+1]-sumHEAD[i])/2
except:
dHEAD = (sumHEAD[i]-sumHEAD[i-1])/2
try:
dTAIL = (sumTAIL[i+1]-sumTAIL[i])/2
except:
dTAIL = (sumTAIL[i]-sumTAIL[i-1])/2
newHEAD.append(dHEAD)
newTAIL.append(dTAIL)
dydx_vector[:smoothFrame] = newHEAD
dydx_vector[len(dydx_vector)-smoothFrame:] = newTAIL
globdoms, globdis = getSlices(dydx_vector, DOM_joinFrame, DOM_peakFrame, DIS_joinFrame, DIS_peakFrame)
s_domMask, coordstrDOM = reportSlicesTXT(globdoms, seq, 'DOM')
s_final, coordstrDIS = reportSlicesTXT(globdis, s_domMask, 'DIS')
sys.stdout.write('>'+cur_record.id+coordstrDOM+coordstrDIS+'\n')
print s_final
print '\n'
return
runGlobPlot()
My input and output files are here: link
The script takes an input (input1.fa) and gives the following output (output1.txt).
But when I run it with a similar but larger input file (input2.fa), it shows the following error:
Traceback (most recent call last):
File "final_script_globpipe.py", line 207, in <module>
runGlobPlot()
File "final_script_globpipe.py", line 179, in runGlobPlot
smooth = SavitzkyGolay('smoothFrame',0, sum_vector)
File "final_script_globpipe.py", line 105, in SavitzkyGolay
stdin.write(`data`+'\n')
IOError: [Errno 22] Invalid argument
I have no idea where the problem is. Any suggestion is appreciated.
I am using Python 2.7 on a Windows 7 machine. I have also attached the Savitzky-Golay module, which is needed to run the script.
Thanks

UPDATE:
After trying to reproduce the error on Linux, I see similar behavior: the script works fine with the first file, but with the second it fails with Errno 32.
Traceback:
Traceback (most recent call last):
File "Glob.py", line 207, in <module>
runGlobPlot()
File "Glob.py", line 179, in runGlobPlot
smooth = SavitzkyGolay('smoothFrame',0, sum_vector)
File "Glob.py", line 105, in SavitzkyGolay
stdin.write(`data`+'\n')
IOError: [Errno 32] Broken pipe
Update:
Some calls to SG_bin report that the -n parameter has the wrong type:
Wrong type of parameter for flag -n. Has to be unsigned,unsigned
This parameter comes from the window variable that is passed to the SavitzkyGolay function.
Wrapping stdin.write in a try/except block reveals that it breaks a handful of times.
try:
    for data in datalist:
        stdin.write(repr(data)+'\n')
except:
    print "It broke"
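The -n complaint is consistent with the posted call SavitzkyGolay('smoothFrame', 0, sum_vector), which passes the literal string 'smoothFrame' rather than the integer variable smoothFrame. The child process may then exit early, and a larger input would run into the closed pipe while a small one fits in the pipe buffer. That reading is an inference from the tracebacks and has not been verified. Below is a Python 3 sketch of building the command with a type check; build_sav_gol_command and run_sav_gol are hypothetical names, and only the -D and -n flags shown in the question are used.

```python
import subprocess


def build_sav_gol_command(window, derivative, binary="sav_gol"):
    """Return the argument list for sav_gol, rejecting invalid windows."""
    if isinstance(window, bool) or not isinstance(window, int):
        raise TypeError("window must be an int, got %r" % (window,))
    if window < 0:
        raise ValueError("window must be non-negative, got %d" % window)
    return [binary, "-D%d" % derivative, "-n%d,%d" % (window, window)]


def run_sav_gol(window, derivative, datalist):
    """Feed datalist to sav_gol on stdin and return its output as floats."""
    command = build_sav_gol_command(window, derivative)
    payload = "".join("%r\n" % value for value in datalist)
    # check=True raises CalledProcessError if sav_gol exits with an error.
    result = subprocess.run(command, input=payload, capture_output=True,
                            text=True, check=True)
    return [float(line) for line in result.stdout.splitlines() if line.strip()]


print(build_sav_gol_command(10, 1))
```

Passing the arguments as a list also avoids depending on the spacing of a concatenated command string, and the early TypeError points at the caller instead of surfacing later as a pipe error.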
