compare two data in a columns using one csv file in python - python

I am trying to compare two people's data in one CSV file, and I cannot use pandas.
What I want is the total units sold by each of the two people, summed over all the years, and then a comparison of who sold more based on those totals. I also want the lowest number of units each person sold in a single year, and which year that was.
For example, my .csv is setup like this:
John Smith, 343, 2020
John Smith, 522, 2019
John Smith, 248, 2018
Sherwin Cooper, 412, 2020
Sherwin Cooper, 367, 2019
Sherwin Cooper, 97, 2018
Dorothy Lee, 612, 2020
Dorothy Lee, 687, 2019
Dorothy Lee, 591, 2018
I want to compare John and Dorothy's unit sold and who sold more. So the output should be:
Dorothy Lee sold more units than John smith. A total of 1890 to 1113.
Dorothy Lee sold less in 2018, for only 591.
John Smith sold less in 2018, for only 248.
My code so far is:
import csv

def compare1(employee1):
    with open("employeedata.csv") as file:
        rows = list(csv.DictReader(file, fieldnames = ['c1', 'c2', 'c3']))
        res = {}
        for row in rows:
            if row['c1'] == employee1:
                res[employee1] = res.get(employee1, 0) + int(row['c2'])
        print(res)

def compare2(employee2):
    with open("employee2.csv") as file:
        rows = list(csv.DictReader(file, fieldnames = ['c1', 'c2', 'c3']))
        res = {}
        for row in rows:
            if row['c1'] == employee2:
                res[employee2] = res.get(employee2, 0) + int(row['c2'])
        print(res)

employee1 = input("Enter the first name: ")
employee2 = input("Enter the first name: ")
compare1(employee1)
compare2(employee2)
I don't know the rest and I am stuck. I am a beginner and I can't use pandas. The output I need should look like this:
Dorothy Lee sold more units than John smith. A total of 1890 to 1113.
Dorothy Lee sold less in 2018, for only 591.
John Smith sold less in 2018, for only 248.
Right now I get this output:
{'John Smith': 1113}
{'Dorothy Lee': 1890}

Suppose my.csv has the columns name, sales, year:
import pandas as pd

emp_df = pd.read_csv("my.csv")
# Total units sold per person, indexed by name
emp_gp = emp_df.groupby("name")["sales"].sum()

def compare(saler1, saler2):
    if saler1 in emp_gp.index and saler2 in emp_gp.index:
        saler1_tol = emp_gp[saler1]
        saler2_tol = emp_gp[saler2]
        if saler1_tol > saler2_tol:
            print(f"{saler1} sold more units than {saler2}. A total of {saler1_tol} to {saler2_tol}.")
        else:
            print(f"{saler2} sold more units than {saler1}. A total of {saler2_tol} to {saler1_tol}.")
        # For each person, pick the row with the lowest sales so the year comes along with it
        emp_min = emp_df.loc[emp_df.groupby("name")["sales"].idxmin()].set_index("name")
        for saler in (saler1, saler2):
            print(f"{saler} sold less in {emp_min.loc[saler, 'year']}, for only {emp_min.loc[saler, 'sales']}.")
    else:
        print("names of salers are not in the table")

Instead of creating a function for each result you want, first build a small database (a dict is fine) that aggregates the units sold for each name and each year. It then becomes easy to answer all kinds of comparisons without repeating code. You can start with something like this:
import csv
from collections import defaultdict

db = defaultdict(lambda: defaultdict(int))
with open('teste.csv', newline='') as csvfile:
    # DictReader without fieldnames takes the column names (name, units, year) from a header row
    reader = csv.DictReader(csvfile)
    for row in reader:
        db[row['name']][int(row['year'])] += int(row['units'])

print(db['Dorothy Lee'][2019])  # Units sold by Dorothy Lee in 2019
print(sum(db['Dorothy Lee'].values()))  # Total units sold by Dorothy Lee
Don't be afraid of defaultdict (it lives in the collections module). Check the docs; it is really handy in this kind of scenario. A defaultdict is a dictionary that supplies a default value for every missing key. Here the default value of the outer defaultdict is another defaultdict, whose own default value is 0 (the result of calling int()), since we want to accumulate a sum of units sold (therefore an integer).
With this approach you don't need to check whether a key already exists; defaultdict takes care of that for you.
PS: the lambda in the first defaultdict is needed to nest the second defaultdict. If you are not familiar with lambda either, look up lambda expressions in the Python documentation.
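Building on the nested-dict idea above, here is a minimal sketch of how the requested three-line report could be produced without pandas. It assumes the file has no header row and that the columns are name, units, year, as in the question's sample. The helper names `load_sales` and `report` are made up for illustration, and the sample rows are fed in through `io.StringIO` instead of a real file.

```python
import csv
import io
from collections import defaultdict


def load_sales(csvfile):
    """Build {name: {year: units}} from rows of name, units, year (no header assumed)."""
    db = defaultdict(lambda: defaultdict(int))
    # skipinitialspace drops the blank that follows each comma in the sample data
    for name, units, year in csv.reader(csvfile, skipinitialspace=True):
        db[name][int(year)] += int(units)
    return db


def report(db, first, second):
    """Return the three output lines comparing two people (hypothetical helper)."""
    totals = {name: sum(db[name].values()) for name in (first, second)}
    winner, loser = sorted((first, second), key=totals.get, reverse=True)
    lines = [f"{winner} sold more units than {loser}. "
             f"A total of {totals[winner]} to {totals[loser]}."]
    for name in (winner, loser):
        # Year in which this person sold the fewest units
        year = min(db[name], key=db[name].get)
        lines.append(f"{name} sold less in {year}, for only {db[name][year]}.")
    return lines


# Sample rows copied from the question; a real run would pass
# open("employeedata.csv", newline="") instead of the StringIO object.
sample = io.StringIO(
    "John Smith, 343, 2020\nJohn Smith, 522, 2019\nJohn Smith, 248, 2018\n"
    "Dorothy Lee, 612, 2020\nDorothy Lee, 687, 2019\nDorothy Lee, 591, 2018\n"
)
db = load_sales(sample)
for line in report(db, "John Smith", "Dorothy Lee"):
    print(line)
```

One caveat of this sketch: a name that is not in the file ends up with an empty inner dict, so `min` would raise a ValueError; a real script would check `name in db` before calling `report`.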

Related

Text file handling, data prep, date--time format, list within list python program problem of living presidents unique question

The question is as below:
Donald John Trump (born June 14, 1946) has taken office as the 45th
President of the United States on January 20, 2017. Since that day,
five former American presidents are alive at the same time. In the
history of the United States there have been only four periods when
this was the case:
March 4, 1861 - January 18, 1862: Martin Van Buren, John Tyler,
Millard Fillmore, Franklin Pierce, James Buchanan
January 20, 1993 - April 22, 1994: Richard Nixon, Gerald Ford, Jimmy
Carter, Ronald Reagan, George H. W. Bush
January 20, 2001 - June 5, 2004: Gerald Ford, Jimmy Carter, Ronald
Reagan, George H. W. Bush, Bill Clinton
January 20, 2017 - November 30, 2018: Jimmy Carter, George H. W. Bush,
Bill Clinton, George W. Bush, Barack Obama
Herbert Hoover lived another 11,553 days (31 years and 230 days) after
leaving office. James Polk died only three months (103 days) after
leaving his presidency. Of the individuals elected as US President,
eight never obtained the status of "former president" because they
died in office: William H. Harrison (pneumonia), Zachary Taylor
(bilious diarrhea), Abraham Lincoln (assassinated), James A. Garfield
(assassinated), William McKinley (assassinated), Warren G. Harding
(heart attack), Franklin D. Roosevelt (cerebral hemorrhage) and John
F. Kennedy (assassinated).
In this problem we will process text files that contain information
about the lifespan and term of the heads of a particular state. Each
line contains five tab-separated information fields: i) name of head
of state, ii) birth date, iii) (first) term start date, iv) (last)
term end date and v) death date. The four date fields are given in the
format dd/mm/yyyy with each fragment being a natural number without
leading zeroes: dd indicates the day, mm the month and yyyy the year.
The following link shows the contents of a text file that contains
information about the last ten Presidents of the United States.
[1]:
https://medusa.ugent.be/en/activities/58851522/description/kwhRYw8usry-8zoF/media/us_presidents.txt
In case a head of state has served multiple non-consecutive terms, we
make the assumption that he has served only one consecutive term that
runs from the start date of the first term until the end date of the
last term. In case the head of state is still serving today, the end
date of his term is represented by an empty string. In case the head
of state is still alive today, the death date is represented by an
empty string.
I need to write a function headsOfState that takes the location of a text file that contains information about the lifespan and term of the heads of a particular state. The function must return a dictionary that maps the names of all heads of state in the file onto a tuple with the dates (datetime.date objects) of the four events mentioned in the file, in the same order of appearance as in the file. Dates of events that have not yet occurred must be represented by the value None.
Below is my code:
from datetime import date
def headsOfState(filepath_of_workdir):
    open_file = open(filepath_of_workdir, 'r', encoding='utf-8')
    '''accessesing the text file from working directory. here we are \
    creating a list of lists of all words line wise using (readlines)'''
    content = open_file.readlines()
    open_file.close()
    content_list = []
    for i in range(len(content)):
        content_list.append(content[i].split('\t'))
    prez_list = []
    for i in range(len(content_list)):
        prez_list.append(content_list[i][0])
        del (content_list[i][0])
    #print(prez_list)
    #print(content_list)
    temp_date = None
    inter_date_list = []
    final_date_list = []
    for i in range(len(content_list)):
        temp_date = (content_list[i])
        for j in range(len(temp_date)):
            item1 = temp_date[j].strip()
            item2 = item1.split('/')
            if item2 == '' or item2 == ['\n']:
                inter_date_list.append(None)
            else:
                year = int(item2[2])
                month = int(item2[1])
                day = int(item2[0])
                inter_date_list.append(date(year, month, day))
        if len(inter_date_list) == len(content_list[i]):
            final_date_list.append(inter_date_list)
        temp_date = None
        inter_date_list = []
    dict_prez = dict(zip(prez_list, final_list))
    return dict_prez
I am getting the below error:
headsOfState('us_presidents.txt')
Traceback (most recent call last):
File "<ipython-input-66-1730ba5bcf8b>", line 1, in <module>
headsOfState('us_presidents.txt')
File "<ipython-input-65-1fbaf479e49a>", line 28, in headsOfState
year = int(item2[2])
IndexError: list index out of range
The output should look like:
>>> events = headsOfState('us_presidents.txt')
>>> events['George Washington']
(datetime.date(1732, 2, 22), datetime.date(1789, 4, 30), datetime.date(1797, 3, 4), datetime.date(1799, 12, 14))
>>> events['Barack Obama']
(datetime.date(1961, 8, 4), datetime.date(2009, 1, 20), datetime.date(2017, 1, 20), None)
>>> events['Donald Trump']
(datetime.date(1946, 6, 14), datetime.date(2017, 1, 20), None, None)
Kindly help me in resolving the error or suggest a better strategy.
Thanks @mkrieger1. Below is my final code. It works.
from datetime import date

def headsOfState(filepath_of_workdir):
    # access the text file from the working directory and read it
    # line by line into a list (readlines)
    open_file = open(filepath_of_workdir, 'r', encoding='utf-8')
    content = open_file.readlines()
    open_file.close()
    # split every line at the tab characters
    content_list = []
    for i in range(len(content)):
        content_list.append(content[i].split('\t'))
    # collect all president names and delete them from content_list
    prez_list = []
    for i in range(len(content_list)):
        prez_list.append(content_list[i][0])
        del (content_list[i][0])
    # interim date list for date manipulation, and final date list holding
    # one tuple of dates per president name in prez_list
    inter_date_list = []
    final_date_list = []
    for i in range(len(content_list)):
        temp_date = content_list[i]
        for j in range(len(temp_date)):
            item1 = temp_date[j].strip()
            item2 = item1.split('/')
            # create integers for year, month and day for the date object
            if len(item2) > 1:
                year = int(item2[2])
                month = int(item2[1])
                day = int(item2[0])
                inter_date_list.append(date(year, month, day))
        # events that have not happened yet are represented by None
        while len(inter_date_list) <= 3:
            inter_date_list.append(None)
        # add the dates for this president to the final list as a tuple
        final_date_list.append(tuple(inter_date_list))
        # empty inter_date_list for the next line of dates from content_list
        inter_date_list = []
    # create the dictionary with president names as keys and date tuples as values
    dict_prez = dict(zip(prez_list, final_date_list))
    return dict_prez
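A more compact variant of the same idea, offered as a sketch rather than as the required solution: strip only the newline (so trailing empty fields survive), split on tabs, and turn each non-empty field into a date. The helper `parse_date` and the function `heads_of_state_from_lines` are illustrative names, and the two sample lines are reconstructed from the expected output shown in the question, not copied from the real file.

```python
from datetime import date, datetime


def parse_date(text):
    """Convert 'd/m/yyyy' to a date; an empty field means the event has not happened."""
    text = text.strip()
    return datetime.strptime(text, "%d/%m/%Y").date() if text else None


def heads_of_state_from_lines(lines):
    """Map each name to a 4-tuple of dates (birth, term start, term end, death)."""
    result = {}
    for line in lines:
        # rstrip('\n') rather than strip(): strip() would also remove trailing tabs
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        name, raw_dates = fields[0], fields[1:5]
        raw_dates += [""] * (4 - len(raw_dates))  # pad if trailing fields are missing
        result[name] = tuple(parse_date(d) for d in raw_dates)
    return result


def headsOfState(path):
    """Read the tab-separated file at `path` and return the name -> dates mapping."""
    with open(path, encoding="utf-8") as f:
        return heads_of_state_from_lines(f)


# Sample lines reconstructed from the question's expected output (assumed file layout)
sample = [
    "Barack Obama\t4/8/1961\t20/1/2009\t20/1/2017\t\n",
    "Donald Trump\t14/6/1946\t20/1/2017\t\t\n",
]
events = heads_of_state_from_lines(sample)
print(events["Donald Trump"])
```

Splitting the work into a line-based function and a thin file wrapper keeps the parsing testable without touching the file system.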

Reading a csv file and performing formulations with certain rows

I asked a question similar last night, but my professor gave some clarification on how she wants the question answered, and it's thrown me for a loop.
I've got a csv file with 3 columns. I'm saving them as a dictionary, but I'm trying to find a way to read the year and the title_field and find a specific title_field (Occupied Housing Units), match it up with the earliest year (2008) and take the number in the value column next to those and match it to the NEXT year (2009) with the same title_field (Occupied Housing Units), find the difference between those two values, and print the outcome and do the same with 2009 & 2010, etc. like this:
2008-2009 795
2009-2010 5091
etc.
The csv looks like this:
year,title_field,value
2014,Total Housing Units,49109
2014,Vacant Housing Units,2814
2014,Occupied Housing Units,46295
2013,Total Housing Units,47888
2013,Vacant Housing Units,4215
2013,Occupied Housing Units,43673
2012,Total Housing Units,45121
2012,Vacant Housing Units,3013
2012,Occupied Housing Units,42108
2011,Total Housing Units,44917
2011,Vacant Housing Units,4213
2011,Occupied Housing Units,40704
2010,Total Housing Units,44642
2010,Vacant Housing Units,3635
2010,Occupied Housing Units,41007
2009,Total Housing Units,39499
2009,Vacant Housing Units,3583
2009,Occupied Housing Units,35916
2008,Total Housing Units,41194
2008,Vacant Housing Units,4483
2008,Occupied Housing Units,36711
And the code I have so far is:
import csv

def process(year, field_name, value):
    print(year, field_name, value)

with open('denton_housing.csv', 'r', encoding='utf8',newline='') as f:
    reader = csv.DictReader(f, delimiter=',')
    housing_stats = []
    for row in reader:
        year = row["year"]
        field_name = row["title_field"]
        value = int(row["value"])
        denton_dict = {"year": year, "field_name": field_name, "value": value}
        housing_stats.append(denton_dict)
        process(year, field_name, value)
Thanks! I'm new to programming, and I'm an older dude. I love that the programming community is beyond helpful, as if you all are welcoming everyone into a cult (the good kind?).
You could do it like this:
Create a list of those row dicts which have the target title_field value in them.
Sort it by the year value in each one.
Use the itertools recipe for the pairwise() generator to process each pair of rows/years in the sorted list.
Code implementing the above:
import csv
from itertools import tee

# From https://docs.python.org/3/library/itertools.html#recipes
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

target_title_field = 'Occupied Housing Units'
csv_filename = 'denton_housing.csv'

with open(csv_filename, 'r', encoding='utf8', newline='') as f:
    housing_stats = []
    for row in csv.DictReader(f, delimiter=','):
        if row['title_field'] == target_title_field:
            year = int(row["year"])
            field_name = row["title_field"]
            value = int(row["value"])
            denton_dict = {"year": year, "field_name": field_name, "value": value}
            housing_stats.append(denton_dict)

housing_stats.sort(key=lambda row: row['year'])
for r1, r2 in pairwise(housing_stats):
    print('{}-{} {:5}'.format(r1['year'], r2['year'], abs(r2['value'] - r1['value'])))
Output:
2008-2009 795
2009-2010 5091
2010-2011 303
2011-2012 1404
2012-2013 1565
2013-2014 2622
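As a side note, Python 3.10 and later ship `itertools.pairwise`, so the recipe does not have to be copied on those versions. On any version, zipping a list with itself shifted by one gives the same pairs. A minimal sketch, with the Occupied Housing Units values from the question hard-coded in place of reading the CSV:

```python
# (year, value) pairs for 'Occupied Housing Units', copied from the question's CSV
occupied = [
    (2014, 46295), (2013, 43673), (2012, 42108), (2011, 40704),
    (2010, 41007), (2009, 35916), (2008, 36711),
]

occupied.sort()  # tuples sort by year first, so this orders rows oldest to newest

# zip(seq, seq[1:]) pairs each row with the one that follows it
diffs = [
    (y1, y2, abs(v2 - v1))
    for (y1, v1), (y2, v2) in zip(occupied, occupied[1:])
]

for y1, y2, diff in diffs:
    print(f"{y1}-{y2} {diff:5}")
```

The slice makes a copy of the list, which is fine for a handful of rows; for very large inputs the iterator-based `pairwise` avoids that copy.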
One easy way is to use three lists (one per title_field) to hold the year and value fields; then you can process each list.
import csv

total = []
vacant = []
occupied = []
with open('denton_housing.csv', 'r', encoding='utf8', newline='') as f:
    spamreader = csv.reader(f, delimiter=',')
    for row in spamreader:
        if row[1] == 'Occupied Housing Units':
            # use whatever data structure you prefer; this example uses a tuple
            # convert to int so the values can be subtracted later
            mytuple = (int(row[0]), int(row[2]))
            occupied.append(mytuple)
        # do the same for the total and vacant lists; skip this if you don't need them
        ...
# then you can process a list, for example occupied
# assuming the csv file is sorted by year, the entries in the
# occupied list are sorted by year as well
for i in range(len(occupied)-1):
    # if the data contains every year, i.e. 2008-2014 with none missing,
    # the year field is not needed here, so you can just do
    value_diff = abs(occupied[i][1] - occupied[i+1][1])
# if the entries are not sorted by year, or some years may be missing
occupied.sort(key=lambda x: x[0])  # this sorts in ascending order
for i in range(len(occupied)-1):
    this_year = occupied[i][0]
    next_year = occupied[i+1][0]
    if next_year - this_year == 1:
        value_diff = abs(occupied[i][1] - occupied[i+1][1])
I suggest using pandas for this.
Then you can use groupby and aggregation with very little code,
along these lines:
df.groupby(df['year'].dt.year)['a'].agg(['value'])
Result:
2012 14
2015 6

Pandas - Matching reference number to find earliest date

I'm hoping to pick your brains on optimization. I am still learning more and more about python and using it for my day to day operation analyst position. One of the tasks I have is sorting through approx 60k unique record identifiers, and searching through another dataframe that has approx 120k records of interactions, the employee who authored the interaction and the time it happened.
For Reference, the two dataframes at this point look like:
main_data = Unique Identifier Only
nok_data = Authored By Name, Unique Identifier (known as Case File Identifier), Note Text, Created On.
My setup currently sorts through and matches my data at approximately 2500 rows per minute, so a run takes approximately 25-30 minutes. What I am curious about is whether any of the steps I perform are:
Redundant or inefficient, slowing down the overall process
A poor use of syntax to work around my lack of knowledge.
Below is my code:
nok_data = pd.read_csv("raw nok data.csv") #Data set from warehouse
main_data = pd.read_csv("exampledata.csv") #Data set taken from iTx ids from referral view
row_count = 0
error_count = 0
print(nok_data.columns.values.tolist())
print(main_data.columns.values.tolist()) #Commented out, used to grab header titles if needed.
data_length = len(main_data) #used for counting how many records left.
earliest_nok = {}
nok_data["Created On"] = pd.to_datetime(nok_data["Created On"]) #convert all dates to datetime at beginning.
for row in main_data["iTx Case ID"]:
    list_data = []
    nok = nok_data["Case File Identifier"] == row
    matching_dates = nok_data[["Created On", "Authored By Name"]][nok == True] #takes created on date only if nok shows row was true
    if len(matching_dates) > 0:
        try:
            min_dates = matching_dates.min(axis=0)
            earliest_nok[row] = [min_dates[0], min_dates[1]]
        except ValueError:
            error_count += 1
            earliest_nok[row] = None
    row_count += 1
    print("{} out of {} records").format(row_count, data_length)
with open('finaloutput.csv','wb') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in earliest_nok.items():
        writer.writerow([key, value])
Looking for any advice or expertise from those who have been writing code like this much longer than I have. I appreciate all of you who even just took the time to read this. Happy Tuesday,
Andy M.
**** EDIT REQUESTED TO SHOW DATA
Sorry for my novice move there not including any data type.
main_data example
ITX Case ID
2017-023597
2017-023594
2017-023592
2017-023590
nok_data aka "raw nok data.csv"
Authored By:   Case File Identifier:   Note Text:    Authored on
John Doe       2017-023594             Random Text   4/1/2017 13:24:35
John Doe       2017-023594             Random Text   4/1/2017 13:11:20
Jane Doe       2017-023590             Random Text   4/3/2017 09:32:00
Jane Doe       2017-023590             Random Text   4/3/2017 07:43:23
Jane Doe       2017-023590             Random Text   4/3/2017 7:41:00
John Doe       2017-023592             Random Text   4/5/2017 23:32:35
John Doe       2017-023592             Random Text   4/6/2017 00:00:35
It looks like you want to group on the Case File Identifier and get the minimum date and corresponding author.
# Sort the data by `Case File Identifier:` and `Authored on` date
# so that you can easily get the author corresponding to the min date using `first`.
nok_data.sort_values(['Case File Identifier:', 'Authored on'], inplace=True)
df = (
    nok_data[nok_data['Case File Identifier:'].isin(main_data['ITX Case ID'])]
    # select several columns with a list (double brackets)
    .groupby('Case File Identifier:')[['Authored on', 'Authored By:']].first()
)
# .items() replaces the Python 2 only .iteritems()
d = {k: [v['Authored on'], v['Authored By:']] for k, v in df.to_dict('index').items()}
>>> d
{'2017-023590': ['4/3/17 7:41', 'Jane Doe'],
'2017-023592': ['4/5/17 23:32', 'John Doe'],
'2017-023594': ['4/1/17 13:11', 'John Doe']}
>>> df
                        Authored on Authored By:
Case File Identifier:
2017-023590             4/3/17 7:41     Jane Doe
2017-023592            4/5/17 23:32     John Doe
2017-023594            4/1/17 13:11     John Doe
It is probably easier to use df.to_csv(...).
The items from main_data['ITX Case ID'] where there is no matching record have been ignored but could be included if required.
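If pandas is not available, the same "earliest note per case" lookup can be sketched with a plain dict in a single pass over the notes, which avoids rescanning every note for each case ID. The tuple layout and the function name `earliest_notes` are assumptions for illustration. The sample rows are taken from the question, and the timestamps are assumed to be in month/day/year order.

```python
from datetime import datetime


def earliest_notes(notes, wanted_ids):
    """Return {case_id: (timestamp, author)} for the earliest note of each wanted case."""
    wanted = set(wanted_ids)  # set membership checks are fast
    earliest = {}
    for author, case_id, created in notes:
        if case_id not in wanted:
            continue
        # Assumed format: month/day/year hour:minute:second
        when = datetime.strptime(created, "%m/%d/%Y %H:%M:%S")
        # Keep the note only if it is the first seen or older than the stored one
        if case_id not in earliest or when < earliest[case_id][0]:
            earliest[case_id] = (when, author)
    return earliest


notes = [
    ("John Doe", "2017-023594", "4/1/2017 13:24:35"),
    ("John Doe", "2017-023594", "4/1/2017 13:11:20"),
    ("Jane Doe", "2017-023590", "4/3/2017 09:32:00"),
    ("Jane Doe", "2017-023590", "4/3/2017 07:43:23"),
    ("Jane Doe", "2017-023590", "4/3/2017 7:41:00"),
    ("John Doe", "2017-023592", "4/5/2017 23:32:35"),
    ("John Doe", "2017-023592", "4/6/2017 00:00:35"),
]
case_ids = ["2017-023597", "2017-023594", "2017-023592", "2017-023590"]

result = earliest_notes(notes, case_ids)
for case_id in case_ids:
    # Cases without any note print None
    print(case_id, result.get(case_id))
```

Case IDs with no matching note are simply absent from the result, so `result.get(case_id)` yields None for them.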

adding data from different rows in a csv belonging to a common variable

this is my csv excel file information:
Receipt  merchant  Address      Date      Time   Total price
25007    A         ABC pte ltd  3/7/2016  10:40  12.30
25008    A         ABC ptd ltd  3/7/2016  11.30  6.70
25009    B         CCC ptd ltd  4/7/2016  07.35  23.40
25010    A         ABC pte ltd  4/7/2016  12:40  9.90
How can I add up the 'Total price' of each line only when the lines belong to the same 'merchant', 'date' and 'time', and then group the results together in a list or dict? For example: {['A','3/7/2016', '19.0'], ['A',4/7/2016, '9.90'],..}
My previous code does what I want, except that it lacks the code to sum the total price for each date and merchant.
from collections import defaultdict
from csv import reader

with open("assignment_info.csv") as f:
    next(f)
    group_dict = defaultdict(list)
    for rec, name, _, dte, time, price in reader(f):
        group_dict[name, dte].extend(time)
    for v in group_dict.values():v.sort()

from pprint import pprint as pp
print 'Sales tracker:'
pp(dict(group_dict))
import pandas as pd
df = pd.read_csv('assignment_info.csv')
df = df.groupby(['merchant', 'Date', 'Time']).sum().reset_index()
df
As the other answer points out, pandas is an excellent library for this kind of data manipulation. My answer won't use pandas though.
A few issues:
In your problem description, you state that you want to group by three columns, but in your example cases you are only grouping by two. Since the latter makes more sense, I am only grouping by name and date.
You are looping and sorting each value, but for the life of me I can't figure out why.
You declare the default type of the defaultdict a list and then extend with a string, which ends up giving you a (sorted!) list of characters. You don't really want to do this.
Your example uses the syntax of a set: { [a,b,c], [d,e,f] } but the syntax of a dict makes more sense: { (a, b): c, }. I have changed the output to the latter.
Here is a working example:
from collections import defaultdict
from csv import reader

with open("assignment_info.csv") as f:
    next(f)  # skip the header line
    group_dict = defaultdict(float)
    for rec, name, _, dte, time, price in reader(f):
        group_dict[name, dte] += float(price)
group_dict is now:
{('A', '3/7/2016'): 19.0, ('A', '4/7/2016'): 9.9, ('B', '4/7/2016'): 23.4}
I removed extra columns which aren't in your example: here's the file I worked with:
Receipt,merchant,Address,Date,Time,Total price
25007,A,ABC pte ltd,3/7/2016,10:40,12.30
25008,A,ABC ptd ltd,3/7/2016,11.30,6.70
25009,B,CCC ptd ltd,4/7/2016,07.35,23.40
25010,A,ABC pte ltd,4/7/2016,12:40,9.90

reading and writing files for a food processing company

I am working on a Python project where a food processing company is trying to calculate its total sales for the year. Python has to read from a text file in which each line is divided into four categories separated by commas. The first category is the type of product, which can be cereal, chocolate, candy, etc. produced by the company. The second category is the brand of that product, for example Kaptain Krunch for cereal or Coco Jam for chocolate. The third category is the sales for the last fiscal year (2014) and the last category is the sales for this fiscal year (2015). Note that only sales for fiscal year 2015 are to be calculated; the 2014 figures are not used in this program, but they are there. Here is what the text file looks like. Its name is product.txt
Cereal,Magic Balls,2200,2344
Cereal,Kaptain Krunch,3300,3123
Cereal,Coco Bongo,1800,2100
Cereal,Sugar Munch,4355,6500
Cereal,Oats n Barley,3299,5400
Sugar Candy,Pop Rocks,546,982
Sugar Candy,Lollipop,1233,1544
Sugar Candy,Gingerbud,2344,2211
Sugar Candy,Respur,1245,2211
Chocolate,Coco Jam,3322,4300
Chocolate,Larkspur,1600,2200
Chocolate,Mighty Milk,1234,2235
Chocolate,Almond Berry,998,1233
Condiments,Peanut Butter,3500,3902
Condiments,Hot Sauce,1234,1560
Condiments,Jelly,346,544
Condiments,Spread,2334,5644
What we are looking to do is add up the sales for fiscal year 2015 by product type, and then compute the total sales for everything in 2015.
The output should look something like this in the written text file:
Total sales for cereal in 2015 : {Insert total number here}
Total sales for Sugar Candy in 2015 : {Insert total number here}
Total sales for Chocolate in 2015 : {Insert total number here}
Total sales for Condiments in 2015 : {Insert total number here}
Total sales for the company in 2015: {Insert total for all the products sold in 2015}
Along with that, it should also print the grand total on the Python run screen in the IDE, in addition to writing it to the text file.
Total sales for the company in 2015: {Insert total for all the products sold in 2015}
Here is my code. I am new to Python and reading and writing files so I can't really say if I am on the right track.
PRODUCT_FILE = "products.txt"
REPORT_FILE = "report.txt"

def main():
    #open the file
    productFile = open(PRODUCT_FILE, "r")
    reportFile = open(REPORT_FILE, "w")
    # reading the file
    proData = extractDataRecord(productFile)
    product = proData[0]
    category = proData[1]
    salesLastYear = prodata[2]
    salesThisYear = proData[3]
    #computing
    product = 0.0
    product = salesThisYear
    productFile.close()
    reportFile.close()

def extractDataRecord(infile) :
    line = infile.readline()
    if line == "" :
        return []
    else :
        parts = line.rsplit(",", 1)
        parts[1] = int(parts[1])
        return parts

# Start the program.
main()
The short version here is that you're doing this wrong. Never roll your own parsing code if you can help it. I'd suggest taking a look at the built-in csv module, and trying using that to "contract out" the CSV parsing, letting you focus on the rest of the logic.
Simple rewrite and completed code with csv:
import collections
import csv

PRODUCT_FILE = "products.txt"
REPORT_FILE = "report.txt"

def main():
    # Easy way to get a dictionary where lookup defaults to 0
    categorycounts = collections.defaultdict(int)
    # open the files using with statements to ensure they're closed properly
    # without the need for an explicit call to close, even on exceptions
    with open(PRODUCT_FILE, newline='') as productfile,\
         open(REPORT_FILE, "w") as reportfile:
        pcsv = csv.reader(productfile)
        # Sum sales by product type letting csv parse
        # Filter removes empty rows for us; assume all other rows complete
        for category, brand, sales_lastyear, sales_thisyear in filter(None, pcsv):
            categorycounts[category] += int(sales_thisyear)
        # Print categories in sorted order with their total sales
        for category, sales in sorted(categorycounts.items()):
            print('Total sales for', category, 'in 2015:', sales, file=reportfile)
        print('-'*80, file=reportfile)  # Separator line between categories and total
        # Sum and print total sales to both file and screen
        totalsales = sum(categorycounts.values())
        print("Total sales for the company in 2015:", totalsales, file=reportfile)
        print("Total sales for the company in 2015:", totalsales)

if __name__ == '__main__':
    main()
