Python: Getting values of one column based on another column name

I am new to Python and am trying to read a CSV file and generate a sales report. For example, I have the dataset below:
Category   | State      | Sales | Profits
-----------|------------|-------|--------
Clothes    | California | 2389  | 608
Stationery | Georgia    | 687   | 54
Gadgets    | Washington | 324   | 90
How can I get the sum of the profit based on the state and category without using pandas? That is, I need to sum the values of "Sales" and "Profits" when the category is "Clothes".
I am currently using the code below, which requires a lot of manual effort.
import csv

with open(path, "r") as store_file:
    reader = csv.DictReader(store_file)
    total_sales_clothes = 0
    total_sales_stationery = 0
    total_sales_gadgets = 0
    for row in reader:
        category = row.get("Category")
        if category == "Clothes":
            total_sales_clothes += float(row.get("Sales"))
        elif category == "Stationery":
            total_sales_stationery += float(row.get("Sales"))
        elif category == "Gadgets":
            total_sales_gadgets += float(row.get("Sales"))

print("Total Sales for Clothes is: {0}".format(total_sales_clothes))
print("Total Sales for Stationery is {0}".format(total_sales_stationery))
print("Total Sales for Gadgets is {0}".format(total_sales_gadgets))

You can use a Python dict. This way, you don't have to hardcode the categories.
The Python dictionary is a key-value data structure and is extremely useful here.
In your case, the keys would be the categories and the values would be the total sales.
E.g.
{ "Clothes": 2389, "Stationery": 0, "Gadgets": 0 }
Edit: note that you should check whether the category already exists. If it does, add the sales value to it; otherwise, assign the value.
with open(path, "r") as store_file:
    reader = csv.DictReader(store_file)
    total_sales = {}
    for row in reader:
        category = row.get("Category")
        if category in total_sales:
            total_sales[category] += float(row.get("Sales"))
        else:
            total_sales[category] = float(row.get("Sales"))

print("Total Sales for Clothes is: {0}".format(total_sales['Clothes']))
If you want to traverse the dict and print the sales for all the categories, use
for key in total_sales:
    print("Total Sales for {0} is: {1}".format(key, total_sales[key]))
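If you also need the breakdown by state as well as category (as the question asks), the same dict pattern works with a tuple as the key. A minimal sketch, using an in-memory copy of the sample data (the extra Clothes/Georgia row is made up here just to show grouping) and a defaultdict so missing keys start at zero:

```python
import csv
import io
from collections import defaultdict

# stand-in for the CSV file; the last row is invented for illustration
csv_text = """Category,State,Sales,Profits
Clothes,California,2389,608
Stationery,Georgia,687,54
Gadgets,Washington,324,90
Clothes,Georgia,100,10
"""

totals = defaultdict(lambda: {"Sales": 0.0, "Profits": 0.0})
for row in csv.DictReader(io.StringIO(csv_text)):
    key = (row["Category"], row["State"])  # group by both columns at once
    totals[key]["Sales"] += float(row["Sales"])
    totals[key]["Profits"] += float(row["Profits"])

for (category, state), sums in totals.items():
    print("Totals for {0}/{1}: sales={2}, profit={3}".format(
        category, state, sums["Sales"], sums["Profits"]))
```

With a real file, replace the `io.StringIO(csv_text)` with the open file handle.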
And, though you mentioned you don't want to use pandas, I would suggest checking it out if you are going to work with these kinds of CSV datasets for a long period. Once you start, you'll find how much easier it makes your job; you will save more time than you spend learning pandas.
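For comparison, here is roughly what that looks like in pandas: a one-line groupby replaces the whole loop. This is only a sketch against an in-memory copy of the sample data:

```python
import io
import pandas as pd

# in-memory stand-in for the CSV file
csv_text = """Category,State,Sales,Profits
Clothes,California,2389,608
Stationery,Georgia,687,54
Gadgets,Washington,324,90
"""

df = pd.read_csv(io.StringIO(csv_text))
# one line: total Sales per Category
sales_by_category = df.groupby("Category")["Sales"].sum()
print(sales_by_category)
```

`df.groupby(["Category", "State"])[["Sales", "Profits"]].sum()` gives the state-level split the same way.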

Related

How to append a new value in a CSV file in Python?

I have a CSV sheet, having data like this:
| not used | Day 1 | Day 2 |
| Person 1 | Score | Score |
| Person 2 | Score | Score |
But with a lot more rows and columns. Every day I get progress of how much each person progressed, and I get that data as a dictionary where keys are names and values are score amounts.
The thing is, sometimes that dictionary will include new people and not include already existing ones. When a new person appears, they should get 0 for every previous day, and when the dict doesn't include an already existing person, that person should get a 0 score for that day.
My idea for solving this is doing lines = file.readlines() on that CSV file, making a new list of people's names with
for line in lines:
    names.append(line.split(",")[0])
then making a copy of lines (newLines = lines)
and going through the dict's keys, seeing if each person is already in the CSV, and if so, appending the value followed by a comma.
But I'm stuck at the part of adding a score of 0.
Any help or contributions would be appreciated.
EXAMPLE: Before I will have this
-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
And I have this dictionary to add
{'Mark': 1750, 'Hannah':1640, 'Brian':1780}
The result should be
-,day1,day2,day3,day4
Mark,1500,0,1660,1750
John,1800,1640,0,0
Peter,1670,1680,1630,0
Hannah,1480,1520,1570,1640
Brian,0,0,0,1780
See how Brian is in the dict but not in the "before" CSV, and he got added with a score of 0 for every other day. I figured out that one line.split(',') gives a list of N elements, where N - 2 is the number of zero scores to add prior to that person's first day.
This is easy to do in pandas as an outer join. Read the CSV into a dataframe and generate a new dataframe from the dictionary. The join is almost what you want except that since not-a-number values are inserted for empty cells, you need to fill the NaN's with zero and reconvert everything to integer.
The one potential problem is that the resulting CSV is sorted by name; the new rows are not simply appended to the bottom.
import pandas as pd
import errno
import os

INDEX_COL = "-"

def add_days_score(filename, colname, scores):
    try:
        df = pd.read_csv(filename, index_col=INDEX_COL)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # file doesn't exist, create empty df
            df = pd.DataFrame([], columns=[INDEX_COL])
            df = df.set_index(INDEX_COL)
        else:
            raise
    new_df = pd.DataFrame.from_dict({colname: scores})
    merged = df.join(new_df, how="outer").fillna(0).astype(int)
    # write to a temp file first so a failed write doesn't clobber the original
    merged.to_csv(filename + ".tmp", index_label=[INDEX_COL])
    os.rename(filename + ".tmp", filename)
    return merged
#============================================================================
# TEST
#============================================================================
test_file = "this_is_a_test.csv"
before = """-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
"""
after = """-,day1,day2,day3,day4
Brian,0,0,0,1780
Hannah,1480,1520,1570,1640
John,1800,1640,0,0
Mark,1500,0,1660,1750
Peter,1670,1680,1630,0
"""
test_dicts = [
    ["day4", {'Mark': 1750, 'Hannah': 1640, 'Brian': 1780}],
]

with open(test_file, "w") as f:
    f.write(before)

for name, scores in test_dicts:
    add_days_score(test_file, name, scores)

print("want\n", after, "\n")
got = open(test_file).read()
print("got\n", got, "\n")
if got != after:
    print("FAILED")

For loop: writing dictionary iteration output to excel using pandas?

I am trying to iterate over a dictionary whose values are lists of row indexes, applying pd.nsmallest to generate the top 3 smallest values for each set of row indexes in the dictionary. However, there seems to be something wrong with my for loop: I overwrite the top 3 values on every iteration, so the final Excel output shows only the 3 rows from the last run of the loop.
When I use print statements this works as expected and I get output for all 16 keys in the dictionary, but when writing to the Excel file I only get the output of the last run of the loop.
import pandas as pd
from tabulate import tabulate
VA = pd.read_excel('Columnar BU P&L.xlsx', sheet_name = 'Variance by Co')
legcon = VA[['Expense', 'Consolidation', 'Exp Category']]
legcon['Variance Type'] = ['Unfavorable' if x < 0 else 'favorable' for x in legcon['Consolidation']]
d = {'Travel & Entertainment': [1,2,3,4,5,6,7,8,9,10,11], 'Office supplies & Expenses': [13,14,15,16,17],
'Professional Fees':[19,20,21,22,23], 'Fees & Assessments':[25,26,27], 'IT Expenses':[29],
'Bad Debt Expense':[31],'Miscellaneous expenses': [33,34,35,36,37],'Marketing Expenses':[40,41,42],
'Payroll & Related Expenses': [45,46,47,48,49,50,51,52,53,54,55,56], 'Total Utilities':[59,60],
'Total Equipment Maint, & Rental Expense': [63,64,65,66,67,68],'Total Mill Expense':[70,71,72,73,74,75,76,77],
'Total Taxes':[80,81],'Total Insurance Expense':[83,84,85],'Incentive Compensation':[88],
'Strategic Initiative':[89]}
Printing output directly works fine when I do this:
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    print(a)
Expense Consolidation Exp Category Variance Type
5 Transportation - AIR -19054 Travel & Entertainment Unfavorable
9 Meals -9617 Travel & Entertainment Unfavorable
7 Lodging -9439 Travel & Entertainment Unfavorable
Expense Consolidation Exp Category Variance Type
26 Bank Charges -4320 Fees & Assessments Unfavorable
27 Finance Charges -1389 Fees & Assessments Unfavorable
25 Payroll Fees -1145 Fees & Assessments Unfavorable
However when I use the below code to write to excel:
writer = pd.ExcelWriter('testt.xlsx', engine='xlsxwriter')
row = 0
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    for i in range(0, 16):
        a.to_excel(writer, sheet_name='test', startrow=row+4, index=False)
writer.save()
my output looks like this and does not show all the exp categories:
I would really appreciate any feedback on how to correct this. Thanks in advance!
With some help from a friend I just realized my silly mistake: there was no row iterator in my for loop, so the output never advanced to the next lines. The code below fixed the issue (initially I had placed the row iterator within my df.to_excel statement):
writer = pd.ExcelWriter('testt.xlsx', engine='xlsxwriter')
row = 0
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    a.to_excel(writer, sheet_name='Testt', startrow=row, index=False)
    row = row + 4
writer.save()
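An alternative sketch: instead of tracking a row offset, collect the per-category frames in a list, pd.concat them once, and make a single to_excel call. The frames below are made-up stand-ins for the nsmallest results:

```python
import pandas as pd

# made-up stand-ins for the per-category nsmallest(3, 'Consolidation') frames
frames = [
    pd.DataFrame({"Expense": ["Transportation - AIR", "Meals", "Lodging"],
                  "Consolidation": [-19054, -9617, -9439]}),
    pd.DataFrame({"Expense": ["Bank Charges", "Finance Charges"],
                  "Consolidation": [-4320, -1389]}),
]

# stack every result once, so a single to_excel call writes all rows
combined = pd.concat(frames, ignore_index=True)
print(combined)
# combined.to_excel("testt.xlsx", sheet_name="test", index=False)  # needs xlsxwriter or openpyxl
```

This avoids the offset bookkeeping entirely, at the cost of losing the blank spacer rows between categories.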

What is the best way to compute a rolling (lag and lead) difference in sales?

I'm looking to add a field or two into my data set that represents the difference in sales from the last week to current week and from current week to the next week.
My dataset is about 4.5 million rows, so I'm looking for an efficient way of doing this. Currently I'm getting into a lot of iteration and for loops, and I'm quite sure I'm going about it the wrong way. I'm trying to write code that will be reusable on other datasets, and there are situations where you might have nulls or no change in sales week to week (in which case there is no record).
The dataset looks like the following:
Store Item WeekID WeeklySales
1 1567 34 100.00
2 2765 34 86.00
3 1163 34 200.00
1 1567 35 160.00
. .
. .
. .
I have each week as its own dictionary and then each store sales for that week in a dictionary within. So I can use the week as a key and then within the week I access the store's dictionary of item sales.
weekly_sales_dict = {}
for i in df['WeekID'].unique():
    store_items_dict = {}
    subset = df[df['WeekID'] == i]
    subset = subset.groupby(['Store', 'Item']).agg({'WeeklySales': 'sum'}).reset_index()
    for j in subset['Store'].unique():
        storeset = subset[subset['Store'] == j]
        store_items_dict.update({str(j): storeset})
    weekly_sales_dict.update({str(i): store_items_dict})
Then I iterate through each week in weekly_sales_dict and compare each store/item within it to the week behind it (I planned to do the same for the next week as well). The 'lag_list' I create can be indexed by week, store, and item, so I was going to iterate through and add the values to my df as a new lag column, but I feel I am way overthinking this.
count = 0
key_list = list(df['WeekID'].unique())
lag_list = []
for k, v in weekly_sales_dict.items():
    if count != 0 and count != len(df['WeekID'].unique()) - 1:
        prev_wk = weekly_sales_dict[str(key_list[count - 1])]
        current_wk = weekly_sales_dict[str(key_list[count])]
        for i in df['Store'].unique():
            prev_df = prev_wk[str(i)]
            current_df = current_wk[str(i)]
            for j in df['Item'].unique():
                print('in j')
                if j in list(current_df['Item'].unique()) and j in list(prev_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values - prev_df[prev_df['Item'] == int(j)]['WeeklySales'].values
                    df[df['Item'] == j][df['Store'] == i][df['WeekID'] == key_list[count]]['lag'] = item_lag[0]
                    lag_list.append((str(i), str(j), item_lag[0]))
                elif j in list(current_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values
                    lag_list.append((str(i), str(j), item_lag[0]))
                else:
                    pass
        count += 1
    else:
        count += 1
Using pd.diff() the problem was solved. I sorted all rows by week, then created a subset with a multi-index by grouping on store, items, and week. Finally I used pd.diff() with a period of 1, and ended up with the sales difference from the current week to the prior week.
df = df.sort_values(by='WeekID')
subset = df.groupby(['Store', 'Items', 'WeekID']).agg({'WeeklySales': 'sum'})
subset['lag'] = subset[['WeeklySales']].diff(1)
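For the lead (next-week) difference the question also asks about, groupby(...).diff() with a negative period works the same way without building any dictionaries. A sketch on a made-up miniature of the dataset (column names taken from the question's sample):

```python
import pandas as pd

# made-up miniature of the Store/Item/WeekID/WeeklySales dataset
df = pd.DataFrame({
    "Store":       [1, 1, 1, 2, 2],
    "Item":        [1567, 1567, 1567, 2765, 2765],
    "WeekID":      [34, 35, 36, 34, 35],
    "WeeklySales": [100.0, 160.0, 130.0, 86.0, 90.0],
})

df = df.sort_values(["Store", "Item", "WeekID"])
grp = df.groupby(["Store", "Item"])["WeeklySales"]
df["lag_diff"] = grp.diff(1)     # this week minus previous week (NaN on first week)
df["lead_diff"] = -grp.diff(-1)  # next week minus this week (NaN on last week)
print(df)
```

The grouping keeps each store/item series separate, so the diff never crosses from one store's rows into another's.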

Calculate totals by reading from a text file in Python

I am trying to write a program in Python in which we have to add the numbers from different categories and sub-categories. The program is about a farmer's annual sales of produce from his farm. The text file we have to read from has 4 columns. The first column is the type of product, for example Vegetables, Fruits, Condiments. The second column tells us the specific product, for example Potatoes, Apples, Hot Sauce. The third column gives the sales in 2014 and the fourth column gives the sales in 2015. In this program, we only have to calculate the totals from the 2015 numbers. The 2014 numbers are present in the text file but are irrelevant.
Here is how the text file looks like
PRODUCT,CATEGORY,2014 Sales,2015 Sales
Vegetables,Potatoes,4455,5644
Vegetables,Tomatoes,5544,6547
Vegetables,Peas,987,1236
Vegetables,Carrots,7877,8766
Vegetables,Broccoli,5564,3498
Fruits,Apples,398,4233
Fruits,Grapes,1099,1234
Fruits,Pear,2342,3219
Fruits,Bananas,998,1235
Fruits,Peaches,1678,1875
Condiments,Peanut Butter,3500,3902
Condiments,Hot Sauce,1234,1560
Condiments,Jelly,346,544
Condiments,Spread,2334,5644
Condiments,Ketchup,3321,3655
Condiments,Olive Oil,3211,2344
What we are looking to do is add the sales for 2015 by product type and then the total sales for everything in 2015.
The output should look something like this in the written text file:
Total sales for Vegetables in 2015 : {Insert total number here}
Total sales for Fruits in 2015 : {Insert total number here}
Total sales for Condiments in 2015 : {Insert total number here}
Total sales for the farmer in 2015: {Insert total for all the
products sold in 2015}
Along with that, it should also print the grand total on the Python run screen in the IDE along with the text file:
Total sales for the farmer in 2015: {Insert total for all the
products sold in 2015}
Here is my code. I am new to Python and reading and writing files so I can't really say if I am on the right track.
PRODUCT_FILE = "products.txt"
REPORT_FILE = "report.txt"

def main():
    # open the files
    productFile = open(PRODUCT_FILE, "r")
    reportFile = open(REPORT_FILE, "w")
    # reading the file
    proData = extractDataRecord(productFile)
    product = proData[0]
    category = proData[1]
    salesLastYear = proData[2]
    salesThisYear = proData[3]
    # computing
    product = 0.0
    product = salesThisYear
    productFile.close()
    reportFile.close()

def extractDataRecord(infile):
    line = infile.readline()
    if line == "":
        return []
    else:
        parts = line.rsplit(",", 1)
        parts[1] = int(parts[1])
        return parts

# Start the program.
main()
You have a CSV file, so you should probably use Python's built-in csv module to parse it. The DictReader class turns every line into a dictionary whose keys come from the header row. If your CSV file is called product_sales.csv, the following code will work.
import csv

product_dict = {}
cat_dict = {}
with open('product_sales.csv', 'r') as f:
    for line in csv.DictReader(f):
        cat = line['CATEGORY']
        product = line['PRODUCT']
        sales_15 = int(line['2015 Sales'])
        if cat in cat_dict:
            cat_dict[cat] += sales_15
        else:
            cat_dict[cat] = sales_15
        if product in product_dict:
            product_dict[product] += sales_15
        else:
            product_dict[product] = sales_15

total = sum(cat_dict.values())
print(product_dict)
print(cat_dict)
print(total)
Your code looks like a good start; here are a few pointers:
it is a good idea to use with when opening files because it guarantees they will get closed properly. Instead of
productFile = open(PRODUCT_FILE, "r")
# do something with the file
productFile.close()
you should do
with open(PRODUCT_FILE, "r") as product_file:
# do something with the file
# the file has been closed!
you only call proData = extractDataRecord(productFile) once (i.e., you get the header line but none of the data). You could put it in a while loop, but it is much more idiomatic to iterate directly over the file:
for line in product_file:
    product, category, _, sales = line.split(',')
    sales = int(sales)
    # now do something with the values!
(using _ as a variable name is shorthand for "I don't care about this value")
you can use a dict to keep track of products and total sales for each,
product_sales = {}
then in the for loop,
product_sales[product] = product_sales.get(product, 0) + sales
If you from collections import defaultdict, this becomes even simpler:
product_sales = defaultdict(int)
product_sales[product] += sales
once you have processed the entire file, you need to report on the results like
all_sales = 0
for product, sales in product_sales.items():
    # write the sales for this product
    all_sales += sales
# print all_sales
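Putting those pointers together, a minimal end-to-end sketch might look like the following. It reads from an in-memory string instead of products.txt so it is self-contained; note that in the sample file the PRODUCT column holds the top-level category (Vegetables, Fruits, Condiments):

```python
import csv
import io
from collections import defaultdict

# in-memory stand-in for a few rows of products.txt
data = """PRODUCT,CATEGORY,2014 Sales,2015 Sales
Vegetables,Potatoes,4455,5644
Vegetables,Tomatoes,5544,6547
Fruits,Apples,398,4233
Condiments,Jelly,346,544
"""

totals = defaultdict(int)  # missing categories start at 0
grand_total = 0
for row in csv.DictReader(io.StringIO(data)):
    sales = int(row["2015 Sales"])
    totals[row["PRODUCT"]] += sales  # PRODUCT is the top-level category here
    grand_total += sales

report_lines = ["Total sales for {0} in 2015 : {1}".format(p, t)
                for p, t in totals.items()]
report_lines.append("Total sales for the farmer in 2015: {0}".format(grand_total))
print("\n".join(report_lines))
```

With the real file, wrap the reading in `with open(PRODUCT_FILE) as f:` and write the report lines to REPORT_FILE the same way.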

Improving the algorithm to distinguish the different types of table

I have two tables with the following structures. In table 1, the ID is next to the Name, while in table 2 the ID is next to Title 1. The one similarity between the two tables is that the first person always has the ID next to their name; they differ for subsequent people.
Table 1:
Name&Title | ID #
----------------------
Random_Name 1|2000
Title_1_1 | -
Title_1_2 | -
Random_Name 2| 2000
Title_2_1 | -
Title_2_2 | -
... |...
Table 2:
Name&Title | ID #
----------------------
Random_Name 1| 2000
Title_1_1 | -
Title_1_2 | -
Random_Name 2| -
Title_2_1 | 2000
Title_2_2 | -
... |...
I have code that recognizes table 1 but struggle to incorporate structure 2. The table is stored as a nested list of rows (each row is a list). Usually, each person has only 1 name row but multiple title rows. The pseudo-code is this:
set count = 0
find the ID next to the first name, set it to be a recognizer
for row_i, row in enumerate(table):
    compare the ID of each following row until row[1] == recognizer
    set count = row_i
    slice the table to get the first person
The actual code is this:
header_ind = 0 # something related to the rest of the code
recognizer = data[header_ind+1][1]
count = header_ind+1
result = []
result.append(data[0]) #this append the headers
for i, row in enumerate(data[header_ind+2:]):
if i <= len(data[header_ind+4:]):
if row[1] and data[i+1+header_ind+2][1] is recognizer:
print data[i+header_ind+3]
one_person = data[count:i+header_ind+3]
result.append(one_person)
count = i+header_ind+3
else:
if i == len(data[header_ind+3:]):
last_person = data[count:i+header_ind+3]
result.append(last_person)
count = i+header_ind+3
I have been thinking about this for a while, so I just want to know whether it is possible to get an algorithm that incorporates Table 2, given that we cannot distinguish the name rows from the title rows.
Going to stick this here
So these are your inputs; the assumption is you are restricted to this:
# Table 1
data1 = [['Name&Title', 'ID#'],
         ['Random_Name1', '2000'],
         ['Title_1_1', '-'],
         ['Title_1_2', '-'],
         ['Random_Name2', '2000'],
         ['Title_2_1', '-'],
         ['Title_2_2', '-']]

# Table 2
data2 = [['Name&Title', 'ID#'],
         ['Random_Name1', '2000'],
         ['Title_1_1', '-'],
         ['Title_1_2', '-'],
         ['Random_Name2', '-'],
         ['Title_2_1', '2000'],
         ['Title_2_2', '-']]
And this is your desired output:
for x in data:
    print(x)

['Name&Title', 'ID#']
[['Random_Name1', '2000'], ['Title_1_1', '-'], ['Title_1_2', '-']]
[['Random_Name2', '2000'], ['Title_2_1', '-'], ['Title_2_2', '-']]
