sum two columns, calculate max, min and mean value in MapReduce


I have a sample mapper, shown below. The key is UCO and the value is TaxiTotal, which should be the sum of two columns, TaxiIn and TaxiOut. How do I sum the two columns?
My current attempt, TaxiIn + TaxiOut, concatenates the digits instead of adding them: 333 + 444 gives 333444, but I need 777. How should I write the code?
#! /usr/bin/env python
import sys
# -- Airline Data
# Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum,
# TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn,
# TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay
for line in sys.stdin:
    line = line.strip()
    unpacked = line.split(",")
    Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn,TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay = line.split(",")
    UCO = "-".join([UniqueCarrier, Origin])
    results = [UCO, TaxiIn+TaxiOut]
    print("\t".join(results))

Replace TaxiIn + TaxiOut with:
int(TaxiIn) + int(TaxiOut)
See the example below:
In [1612]: TaxiIn = '333'
In [1613]: TaxiOut = '444'
In [1614]: TaxiIn + TaxiOut
Out[1614]: '333444'
In [1615]: int(TaxiIn) + int(TaxiOut)
Out[1615]: 777
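The fields returned by split() are strings, and + on strings concatenates them. To add them numerically, convert each str to int or float first. Because "\t".join() only accepts strings, the sum has to be converted back with str().
Your code should be: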
results = [UCO, str(int(TaxiIn) + int(TaxiOut))]
print("\t".join(results))
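The title also asks for the max, min and mean, which the answer above does not cover. The following is only a rough sketch of one way to do it, under several assumptions that are not in the original post: the column positions follow the header comment in the question, records whose taxi fields are not integers (the header row, or values such as "NA") are skipped, and the reducer receives its lines already sorted by key, as Hadoop Streaming normally arranges. The names map_line and reduce_pairs are made up for illustration.

```python
def map_line(line):
    """Turn one CSV record into a (key, taxi_total) pair, or None to skip it."""
    fields = line.strip().split(",")
    if len(fields) != 29:
        return None  # malformed record
    carrier, origin = fields[8], fields[16]    # UniqueCarrier, Origin
    taxi_in, taxi_out = fields[19], fields[20]  # TaxiIn, TaxiOut
    try:
        total = int(taxi_in) + int(taxi_out)
    except ValueError:
        return None  # header row or a non-numeric value such as "NA"
    return "-".join([carrier, origin]), total


def reduce_pairs(lines):
    """Yield (key, minimum, maximum, mean) from key-sorted "key<TAB>value" lines."""
    current, lo, hi, total, count = None, 0, 0, 0, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        value = int(value)
        if key != current:
            if current is not None:
                yield current, lo, hi, total / count
            # Start a new group with this value as both min and max.
            current, lo, hi, total, count = key, value, value, 0, 0
        lo, hi = min(lo, value), max(hi, value)
        total += value
        count += 1
    if current is not None:
        yield current, lo, hi, total / count


# Hypothetical mapper output, already sorted by key.
sample = ["WN-IAD\t12\n", "WN-IAD\t20\n", "WN-TPA\t7\n"]
for key, lo, hi, mean in reduce_pairs(sample):
    print("\t".join([key, str(lo), str(hi), str(mean)]))
```

In a real streaming job the mapper script would call map_line on each line of sys.stdin and print the pair joined by a tab, and the reducer script would pass sys.stdin to reduce_pairs.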

Related

Get column total from list

I'm trying to get the sum total of a particular column from a list in a CSV file. I'm able to select the column and remove the header but I can't add up all of the values.
import csv
projectFile = open('data.csv')
projectReader = csv.reader(projectFile)
projectData = list(projectReader)
sum = 0
for amount in projectData[1:]:
    amount = amount[1]
    print(amount)
I tried sum(amount), which didn't work. I then tried adding a global variable, sum = 0, and adding the float of each value to it, e.g. total = int(sum + float(amount)), and got errors. I can't use Pandas or mapping for this.
EDIT:
CSV example -

Here's an example of calculating the sum of the 3rd column from a 3x3 matrix (stored as list of lists). Note that column index of 2 corresponds to the 3rd column:
col = 2
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
total = sum(row[col] for row in my_matrix)  # "total" avoids shadowing the built-in sum()
print(total)
The output is:
18
(calculated as 3+6+9)
For a matrix of strings (based on a comment by @mpstring)
Just add float() to convert each string to a float.
col = 2
mymat = [['1','2','3'],['4','5','6'],['7','8','9']]
total = sum(float(row[col]) for row in mymat)
print(total)
Given the example data.csv (based on the updated question by @mpstring)
import csv
with open('data.csv') as projectFile:
    projectReader = csv.reader(projectFile)
    next(projectReader)  # skip the header row
    total = sum(float(row[1]) for row in projectReader)
print(total)
Output is
216.61
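The question's own attempt was a running total in a loop, and that can also work. Below is a minimal sketch of that variant. The sample data is invented, because the real data.csv is not shown, and io.StringIO is used only so the example runs without a file on disk.

```python
import csv
import io

# Hypothetical sample standing in for data.csv; the real file is not shown.
sample = io.StringIO("item,amount\npens,10.50\npaper,4.25\nink,20.00\n")

reader = csv.reader(sample)
next(reader)  # skip the header row

total = 0.0  # named "total" so the built-in sum() is not shadowed
for row in reader:
    total += float(row[1])  # CSV fields are strings; convert before adding

print(total)
```

Naming the accumulator sum, as in the question, replaces the built-in function with a number, which is one likely source of the errors described there.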

Summing up datetimes without using pandas

I have a data set of rain fall in half hour intervals. I want to sum up the rainfall for each day and keep track of how many data points are summed per day to account for data gaps. Then I want to create a new file with a column for the date, a column for the rainfall, and a column for how many data points were available to sum for each day.
dailysum is my function that tries to do this; get_data is my function for extracting the data.
def get_data(avrains):
    print('opening{}'.format(avrains))
    with open(avrains, 'r') as rfile:
        header = rfile.readline()
        dates = []
        rainfalls = []
        for line in rfile:
            line = (line.strip())
            row = line.split(',')
            d = datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
            r = row[-1]
            dates.append(d)
            rainfalls.append(float(r))
        data = zip(dates, rainfalls)
        data = sorted(data)
        return (data)
def dailysum(rains):
    day_date = []
    rain_sum = []
    for i in rains:
        dayi = i[0]
        rainsi = i[1]
    for i in dayi:
        try:
            if dayi[i]== dayi[i+1]:
                s= rains[i]+rains[i+1]
                rain_sum.append(float(s))
        except:
            pass
        day_date.append(dayi[i])

There are many ways to solve this, but I'll try to stay as close to your existing code as I can:
from datetime import datetime

def get_data(avrains):
    """
    opens the file specified in avrains and returns a dictionary
    keyed by date, containing a 2-tuple of the total rainfall and
    the count of data points, like so:
    {
        date(2018, 11, 1) : (0.25, 6),
        date(2018, 11, 2) : (0.00, 5),
    }
    """
    print('opening{}'.format(avrains))
    rainfalls = dict()
    with open(avrains, 'r') as rfile:
        header = rfile.readline()
        for line in rfile:
            line = (line.strip())
            row = line.split(',')
            # keep only the date part so all readings from one day share a key
            d = datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S').date()
            # convert the reading from text to a number before adding
            r = float(row[-1])
            try:
                daily_rainfall, daily_count = rainfalls[d]
                daily_rainfall += r
                daily_count += 1
                rainfalls[d] = (daily_rainfall, daily_count)
            except KeyError:
                # if we don't find that date in rainfalls, add it
                rainfalls[d] = (r, 1)
    return rainfalls
Now when you call get_data("/path/to/file"), you'll get back a dictionary. You can print the values with something like this:
foo = get_data("/path/to/file")
for (measure_date, (rainfall, observations)) in foo.items():
    print(measure_date, rainfall, observations)
(I will leave the formatting of the date, and any sorting or file-writing as an exercise :) )
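For readers who want a starting point for that exercise, here is one possible sketch of the sorting and file-writing step. The dictionary contents, the column names and the function name write_daily are all invented for illustration, and io.StringIO stands in for a real output file.

```python
import csv
import io
from datetime import date

# Hypothetical totals in the shape described above: {date: (rainfall, count)}.
daily = {
    date(2018, 11, 2): (0.0, 5),
    date(2018, 11, 1): (0.25, 6),
}


def write_daily(totals, out):
    """Write one row per day, oldest first, to an open text file object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "rainfall", "observations"])
    for day in sorted(totals):  # date objects sort chronologically
        rainfall, count = totals[day]
        writer.writerow([day.isoformat(), rainfall, count])


buffer = io.StringIO()  # stands in for open("daily.csv", "w", newline="")
write_daily(daily, buffer)
print(buffer.getvalue())
```

The observations column is what lets you spot days with data gaps: a complete day of half-hourly readings would have 48 data points.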

getting max value from each column of the csv file

Could anybody help me solve the following problem? I have tried it on my own and attached my solution. I used a 2-D list, but I would like a different, more Pythonic solution that does not use one.
Please suggest any other way of doing this.
Q) Consider share prices for N companies, given for each month since 1990 in a CSV file. The format of the file is as below, with the first line as the header.
Year,Month,Company A, Company B,Company C, .............Company N
1990, Jan, 10, 15, 20, , ..........,50
1990, Feb, 10, 15, 20, , ..........,50
.
.
.
.
2013, Sep, 50, 10, 15............500
The solution should be in this format.
a) For each company, list the year and month in which the share price was highest.
Here is my answer using a 2-D list.
def generate_list(file_path):
    '''
    return list of list's containing file data.'''
    data_list=None #local variable
    try:
        file_obj = open(file_path,'r')
        try:
            gen = (line.split(',') for line in file_obj) #generator, to generate one line each time until EOF (End of File)
            for j,line in enumerate(gen):
                if not data_list:
                    #if dl is None then create list containing n empty lists, where n will be number of columns.
                    data_list = [[] for i in range(len(line))]
                if line[-1].find('\n'):
                    line[-1] = line[-1][:-1] #to remove last list element's '\n' character
                #loop to convert numbers from string to float, and leave others as strings only
                for i,l in enumerate(line):
                    if i >=2 and j >= 1:
                        data_list[i].append(float(l))
                    else:
                        data_list[i].append(l)
        except IOError, io_except:
            print io_except
        finally:
            file_obj.close()
    except IOError, io_exception:
        print io_exception
    return data_list
def generate_result(file_path):
    '''
    return list of tuples containing (max price, year, month,
    company name).
    '''
    data_list = generate_list(file_path)
    re=[] #list to store results in tuple format as follows [(max_price, year, month, company_name), ....]
    if data_list:
        for i,d in enumerate(data_list):
            if i >= 2:
                m = max(data_list[i][1:]) #max_price for the company
                idx = data_list[i].index(m) #getting index of max_price in the list
                yr = data_list[0][idx] #getting year by using index of max_price in list
                mon = data_list[1][idx] #getting month by using index of max_price in list
                com = data_list[i][0] #getting company_name
                re.append((m,yr,mon,com))
    return re
if __name__ == '__main__':
    file_path = 'C:/Document and Settings/RajeshT/Desktop/nothing/imp/New Folder/tst.csv'
    re = generate_result(file_path)
    print 'result ', re
I have also tried to solve it with a generator, but in that case it gave the result for only one company, i.e. only one column.
p = 'filepath.csv'
f = open(p,'r')
head = f.readline()
gen = ((float(line.split(',')[n]), line.split(',',2)[0:2], head.split(',')[n]) for n in range(2,len(head.split(','))) for i,line in enumerate(f))
x = max((i for i in gen),key=lambda x:x[0])
print x
You can use the input data below, which is in CSV format.
year,month,company 1,company 2,company 3,company 4,company 5
1990,jan,201,245,243,179,133
1990,feb,228,123,124,121,180
1990,march,63,13,158,88,79
1990,april,234,68,187,67,135
1990,may,109,128,46,185,236
1990,june,53,36,202,73,210
1990,july,194,38,48,207,72
1990,august,147,116,149,93,114
1990,september,51,215,15,38,46
1990,october,16,200,115,205,118
1990,november,241,86,58,183,100
1990,december,175,97,143,77,84
1991,jan,190,68,236,202,19
1991,feb,39,209,133,221,161
1991,march,246,81,38,100,122
1991,april,37,137,106,138,26
1991,may,147,48,182,235,47
1991,june,57,20,156,38,245
1991,july,165,153,145,70,157
1991,august,154,16,162,32,21
1991,september,64,160,55,220,138
1991,october,162,72,162,222,179
1991,november,215,207,37,176,30
1991,december,106,153,31,247,69
The expected output is as follows.
[(246.0, '1991', 'march', 'company 1'),
(245.0, '1990', 'jan', 'company 2'),
(243.0, '1990', 'jan', 'company 3'),
(247.0, '1991', 'december', 'company 4'),
(245.0, '1991', 'june', 'company 5')]
Thanks in advance...

Using collections.OrderedDict and collections.namedtuple:
import csv
from collections import OrderedDict, namedtuple
with open('abc1') as f:
    reader = csv.reader(f)
    tup = namedtuple('tup', ['price', 'year', 'month'])
    d = OrderedDict()
    names = next(reader)[2:]
    for name in names:
        #initialize the dict
        d[name] = tup(0, 'year', 'month')
    for row in reader:
        year, month = row[:2] # Use year, month, *prices = row in py3.x
        for name, price in zip(names, map(int, row[2:])): # map(int, prices) py3.x
            if d[name].price < price:
                d[name] = tup(price, year, month)
print d
Output:
OrderedDict([
('company 1', tup(price=246, year='1991', month='march')),
('company 2', tup(price=245, year='1990', month='jan')),
('company 3', tup(price=243, year='1990', month='jan')),
('company 4', tup(price=247, year='1991', month='december')),
('company 5', tup(price=245, year='1991', month='june'))])
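The code comments above hint at a Python 3 form of the same single-pass idea. A minimal sketch of it could look like this, using a plain dict (which preserves insertion order in Python 3.7+) and a few rows of the question's sample data. io.StringIO stands in for the real file, and the variable name best is an illustrative choice.

```python
import csv
import io

# A few rows in the question's format; io.StringIO stands in for the real file.
sample = io.StringIO(
    "year,month,company 1,company 2\n"
    "1990,jan,201,245\n"
    "1990,feb,228,123\n"
    "1991,march,246,81\n"
)

reader = csv.reader(sample)
names = next(reader)[2:]  # company names from the header row
best = {}  # company name -> (price, year, month)

for year, month, *prices in reader:
    for name, price in zip(names, map(int, prices)):
        # Strict > keeps the earliest month when a price is tied.
        if name not in best or price > best[name][0]:
            best[name] = (price, year, month)

for name, (price, year, month) in best.items():
    print(name, price, year, month)
```

Only one row is held in memory at a time, so this avoids the 2-D list the question wanted to get rid of.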

I wasn't entirely sure how you wanted the output, so for now I just print it to the screen.
import os
import csv
import codecs
## Import data !!!!!!!!!!!! CHANGE TO APPROPRIATE PATH !!!!!!!!!!!!!!!!!
filename= os.path.expanduser("~/Documents/PYTHON/StackTest/tailor_raj/Workbook1.csv")
## Get useable data
data = [row for row in csv.reader(codecs.open(filename, 'rb', encoding="utf_8"))]
## Find Number of rows
row_count= (sum(1 for row in data)) -1
## Find Number of columns
## Since this cannot be explicitly done, I set it to run through the columns on one row until it fails.
## Failure is caught by try/except so the program does not crash
columns_found = False
column_try =1
while columns_found == False:
    column_try +=1
    try:
        identify_column = data[0][column_try]
    except:
        columns_found=True
## Set column count to discovered column count (1 before it failed)
column_count=column_try-1
## Set which company we are checking (start with the first company listed. Since it starts at 0 the first company is at 2 not 3)
companyIndex = 2
#This will keep all the company bests as single rows of text. I was not sure how you wanted to output them.
companyBest=[]
## Set loop to go through each company
while companyIndex <= (column_count):
    ## For each new company reset the rowIndex and highestShare
    rowIndex=1
    highestShare=rowIndex
    ## Set loop to go through each row
    while rowIndex <=row_count:
        ## Test if data point is above or equal to current max
        ## Currently set to use the most recent high point
        if int(data[highestShare][companyIndex]) <= int(data[rowIndex][companyIndex]):
            highestShare=rowIndex
        ## Move on to next row
        rowIndex+=1
    ## Company best = Company Name + year + month + value
    companyBest.append(str(data[0][companyIndex])+": "+str(data[highestShare][0]) +", "+str(data[highestShare][1])+", "+str(data[highestShare][companyIndex]))
    ## Move on to next company
    companyIndex +=1
for item in companyBest:
    print item
Be sure to change the filename path to one that is appropriate for your system.
Output is currently displayed like this:
Company A: 1990, Nov, 1985
Company B: 1990, May, 52873
Company C: 1990, May, 3658
Company D: 1990, Nov, 156498
Company E: 1990, Jul, 987

No generator unfortunately but small code size, especially in Python 3:
from operator import itemgetter
from csv import reader
with open('test.csv') as f:
    year, month, *data = zip(*reader(f))
for pricelist in data:
    name = pricelist[0]
    prices = map(int, pricelist[1:])
    i, price = max(enumerate(prices), key=itemgetter(1))
    print(name, price, year[i+1], month[i+1])
In Python 2.X you can do the same thing, though slightly more clumsily, using the following (and a print statement instead of the print function):
with open('test.csv') as f:
    columns = zip(*reader(f))
    year, month = columns[:2]
    data = columns[2:]
Okay, I came up with some gruesome generators! They also make use of lexicographic tuple comparison and reduce to compare consecutive lines:
from functools import reduce # only in Python 3
import csv
def group(year, month, *prices):
    return ((int(p), year, month) for p in prices)
def compare(a, b):
    return map(max, zip(a, group(*b)))
def run(fname):
    with open(fname) as f:
        r = csv.reader(f)
        names = next(r)[2:]
        return zip(names, reduce(compare, r, group(*next(r))))
list(run('test.csv'))

Sort CSV using a key computed from two columns, grab first n largest values

Python amateur here... let's say I have this snippet of an example CSV file:
Country, Year, GDP, Population
Country1,2002,44545,24352
Country2,2004,14325,75677
Country3,2005,23132412,1345234
Country4,,2312421,12412
I need to sort the file by descending GDP per capita (GDP/Population) in a certain year, say, 2002, then grab the first 10 rows with the largest GDP per capita values.
So far, after I import the csv to a 'data' variable, I grab all the 2002 data without missing fields using:
data_2 = []
for row in data:
    if row[1] == '2002' and row[2]!= ' ' and row[3] != ' ':
        data_2.append(row)
I need to find some way to sort data_2 by row[2]/row[3] descending, preferably without using a class, and then grab each entire row tied to each of the largest 10 values to then write to another csv. If someone could point me in the right direction I would be forever grateful as I've tried countless googles...

This approach lets you get the top 10 for each year in a single scan of the file.
It is possible to do this without pandas by using the heapq module. The following is untested, but it should give you a base from which to consult the appropriate documentation and adapt for your purposes:
import csv
import heapq
from itertools import islice
freqs = {}
with open('yourfile') as fin:
    csvin = csv.reader(fin)
    rows_with_gdp = ([float(row[2]) / float(row[3])] + row for row in islice(csvin, 1, None) if row[2] and row[3])
    for row in rows_with_gdp:
        cnt = freqs.setdefault(row[2], [[]] * 10) # 2 = year, 10 = num to keep
        heapq.heappushpop(cnt, row)
for year, vals in freqs.iteritems():
    print year, [row[1:] for row in sorted(filter(None, vals), reverse=True)]

The relevant modules would be:
csv for parsing the input
collections.namedtuple to name the fields
the filter() function to extract the specified year range
heapq.nlargest() to find the largest values
pprint.pprint() for nice output
Here's a little bit to get you started (I would do it all but what is the fun in having someone write your whole program and deprive you of the joy of finishing it):
from __future__ import division
import csv, collections, heapq, pprint
filecontents = '''\
Country, Year, GDP, Population
Country1,2002,44545,24352
Country2,2004,14325,75677
Country3,2004,23132412,1345234
Country4,2004,2312421,12412
'''
CountryStats = collections.namedtuple('CountryStats', ['country', 'year', 'gdp', 'population'])
dialect = csv.Sniffer().sniff(filecontents)
data = []
for country, year, gdp, pop in csv.reader(filecontents.splitlines()[1:], dialect):
    row = CountryStats(country, int(year), int(gdp), int(pop))
    if row.year == 2004:
        data.append(row)
data.sort(key = lambda s: s.gdp / s.population)
pprint.pprint(data)

Use the optional key argument to the sort function:
array.sort(key=lambda x: x[2])
will sort array using its third element as a key. The value of the key argument should be a lambda expression that takes in a single argument (an arbitrary element of the array being sorted) and returns the key for sorting.
For your GDP example, the lambda function to use would be:
lambda x: float(x[2])/float(x[3]) # x[2] is GDP, x[3] is population
The float function converts the CSV fields from strings into floating point numbers. Since there are no guarantees that this will be successful (improper formatting, bad data, etc), I'd typically do this before sorting, when inserting stuff into the array. You should use floating point division here explicitly, as integer division won't give you the results you expect. If you find yourself doing this often, changing the behavior of the division operator is an option (http://www.python.org/dev/peps/pep-0238/ and related links).
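Putting the pieces from these answers together, a rough Python 3 sketch of the whole task might look like the following. It filters one year, skips rows with missing fields, picks the largest values with heapq.nlargest and a key function, and writes the rows out as CSV. The sample rows are adapted from the question with invented values, n is 2 instead of 10 to keep the example short, and io.StringIO stands in for the input and output files.

```python
import csv
import heapq
import io

# Sample rows adapted from the question; io.StringIO stands in for the input file.
source = io.StringIO(
    "Country,Year,GDP,Population\n"
    "Country1,2002,44545,24352\n"
    "Country2,2002,14325,75677\n"
    "Country3,2002,23132412,1345234\n"
    "Country4,2002,,12412\n"
    "Country5,2004,2312421,12412\n"
)


def gdp_per_capita(row):
    """Sort key: GDP divided by population, converted from CSV strings."""
    return float(row[2]) / float(row[3])


reader = csv.reader(source)
header = next(reader)
rows_2002 = [
    row for row in reader
    if row[1] == "2002" and row[2].strip() and row[3].strip()  # skip missing fields
]
top = heapq.nlargest(2, rows_2002, key=gdp_per_capita)  # use 10 for the real task

out = io.StringIO()  # stands in for open("top.csv", "w", newline="")
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(top)
print(out.getvalue())
```

heapq.nlargest returns the rows already ordered from largest to smallest key, so no separate sort is needed, and no class is involved.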

Print out table in a format specified by user in Python?

The program starts by asking: How many rows? How many columns? What is the alignment of each column (Left (L), Centre (C), Right (R))? It then accepts the entries (the data in the table) from the user. The entries should be printed in the format specified by the user. Here's what I have done so far:
rows = input("How many rows?")
coloumns = input("How many coloumns?")
alignment = raw_input("Enter alignment of each table?")
entry = raw_input("Enter rows x cols entries:")
print entry
I think I have to format entry in such a way that it comes out exactly how the user wants. How can I do it? Thanks

This block of code referenced from http://ginstrom.com/scribbles/2007/09/04/pretty-printing-a-table-in-python/ will help you.
import locale
locale.setlocale(locale.LC_NUMERIC, "")
def format_num(num):
    """Format a number according to given places.
    Adds commas, etc. Will truncate floats into ints!"""
    try:
        inum = int(num)
        return locale.format("%.*f", (0, inum), True)
    except (ValueError, TypeError):
        return str(num)
def get_max_width(table, index):
    """Get the maximum width of the given column index"""
    return max([len(format_num(row[index])) for row in table])
def pprint_table(out, table):
    """Prints out a table of data, padded for alignment
    @param out: Output stream (file-like object)
    @param table: The table to print. A list of lists.
    Each row must have the same number of columns. """
    col_paddings = []
    for i in range(len(table[0])):
        col_paddings.append(get_max_width(table, i))
    for row in table:
        # left col
        print >> out, row[0].ljust(col_paddings[0] + 1),
        # rest of the cols
        for i in range(1, len(row)):
            col = format_num(row[i]).rjust(col_paddings[i] + 2)
            print >> out, col,
        print >> out
table = [["", "taste", "land speed", "life"],
         ["spam", 300101, 4, 1003],
         ["eggs", 105, 13, 42],
         ["lumberjacks", 13, 105, 10]]
import sys
out = sys.stdout
pprint_table(out, table)
In your case, because you are collecting the rows, columns, alignment and entries as input, you can plug them in to construct your table variable.
len(table[0]) is equivalent to the number of columns (-1 to avoid counting the "y-axis" labels, also known as the table index).
len(table) is equivalent to your number of rows (-1 to avoid counting the table header).
col_paddings (the column widths) is computed dynamically, and the rjust and ljust methods apply the alignment when each column is printed.
Each element in your table list can be updated using standard Python list syntax.
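The referenced code hard-codes left alignment for the first column and right alignment for the rest. To honour a user-supplied L/C/R code per column, as the question asks, one possible Python 3 sketch is shown below. The function name format_table, the " | " separator and the sample table are illustrative choices, not part of the referenced code, and collecting the values with input() is left out.

```python
def format_table(rows, alignments):
    """Return the table as text, padding each column per its alignment code.

    rows       -- list of rows, each a list of strings of equal length
    alignments -- one of "L", "C" or "R" per column
    """
    justify = {"L": str.ljust, "C": str.center, "R": str.rjust}
    # Width of each column is the length of its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        cells = [
            justify[align](cell, width)
            for cell, align, width in zip(row, alignments, widths)
        ]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


table = [["name", "qty", "price"],
         ["spam", "4", "1003"],
         ["eggs", "13", "42"]]
print(format_table(table, ["L", "C", "R"]))
```

The alignment string typed by the user (for example "LCR") can be passed as list("LCR"), and the entries can be split into rows of the requested length before calling the function.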
