I asked a similar question last night, but my professor gave some clarification on how she wants the question answered, and it's thrown me for a loop.
I've got a csv file with 3 columns, and I'm saving each row as a dictionary. For a specific title_field (Occupied Housing Units), I want to take the value for the earliest year (2008), take the value for the next year (2009) with the same title_field, find the difference between the two values, and print the outcome. Then do the same with 2009 & 2010, and so on, like this:
2008-2009 795
2009-2010 5091
etc.
The csv looks like this:
year,title_field,value
2014,Total Housing Units,49109
2014,Vacant Housing Units,2814
2014,Occupied Housing Units,46295
2013,Total Housing Units,47888
2013,Vacant Housing Units,4215
2013,Occupied Housing Units,43673
2012,Total Housing Units,45121
2012,Vacant Housing Units,3013
2012,Occupied Housing Units,42108
2011,Total Housing Units,44917
2011,Vacant Housing Units,4213
2011,Occupied Housing Units,40704
2010,Total Housing Units,44642
2010,Vacant Housing Units,3635
2010,Occupied Housing Units,41007
2009,Total Housing Units,39499
2009,Vacant Housing Units,3583
2009,Occupied Housing Units,35916
2008,Total Housing Units,41194
2008,Vacant Housing Units,4483
2008,Occupied Housing Units,36711
And the code I have so far is:
import csv

def process(year, field_name, value):
    print(year, field_name, value)

with open('denton_housing.csv', 'r', encoding='utf8', newline='') as f:
    reader = csv.DictReader(f, delimiter=',')
    housing_stats = []
    for row in reader:
        year = row["year"]
        field_name = row["title_field"]
        value = int(row["value"])
        denton_dict = {"year": year, "field_name": field_name, "value": value}
        housing_stats.append(denton_dict)
        process(year, field_name, value)
Thanks! I'm new to programming, and I'm an older dude. I love that the programming community is beyond helpful, as if you all are welcoming everyone into a cult (the good kind?).
You could do it like this:
Create a list of the row dicts that have the target title_field value in them.
Sort it by the year value in each one.
Use the itertools recipe for the pairwise() generator to process each pair of rows/years in the sorted list.
Code implementing the above:
import csv
from itertools import tee

# From https://docs.python.org/3/library/itertools.html#recipes
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

target_title_field = 'Occupied Housing Units'
csv_filename = 'denton_housing.csv'

with open(csv_filename, 'r', encoding='utf8', newline='') as f:
    housing_stats = []
    for row in csv.DictReader(f, delimiter=','):
        if row['title_field'] == target_title_field:
            year = int(row["year"])
            field_name = row["title_field"]
            value = int(row["value"])
            denton_dict = {"year": year, "field_name": field_name, "value": value}
            housing_stats.append(denton_dict)

housing_stats.sort(key=lambda row: row['year'])

for r1, r2 in pairwise(housing_stats):
    print('{}-{} {:5}'.format(r1['year'], r2['year'], abs(r2['value'] - r1['value'])))
Output:
2008-2009 795
2009-2010 5091
2010-2011 303
2011-2012 1404
2012-2013 1565
2013-2014 2622
One easy way is to use 3 lists (one per title_field) to keep the year and value fields; then you can process each list.
import csv

total = []
vacant = []
occupied = []

with open('denton_housing.csv', 'r', encoding='utf8', newline='') as f:
    spamreader = csv.reader(f, delimiter=',')
    for row in spamreader:
        if row[1] == 'Occupied Housing Units':
            # use the data structure you prefer; in this example I use a tuple
            # (cast both fields to int so the arithmetic below works on numbers)
            mytuple = (int(row[0]), int(row[2]))
            occupied.append(mytuple)
        # do the same for the total and vacant lists; ignore them if you don't need them
        ...

# then you can process the list, for example, occupied
# I assume your csv file is sorted by year, so you may safely assume that the
# year field of each entry in the occupied list is sorted as well
for i in range(len(occupied) - 1):
    # if your data contains every year, i.e. 2008-2014 without missing any,
    # the year field is useless in this case, so you can just do
    value_diff = abs(occupied[i][1] - occupied[i + 1][1])

# if the year entries are not sorted, or some years are missing
occupied.sort(key=lambda x: x[0])  # this sorts in ascending order
for i in range(len(occupied) - 1):
    this_year = occupied[i][0]
    next_year = occupied[i + 1][0]
    if next_year - this_year == 1:
        value_diff = abs(occupied[i][1] - occupied[i + 1][1])
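Putting those pieces together into something runnable (reading a few sample rows from an inline string via io.StringIO here, so the sketch is self-contained; swap in open('denton_housing.csv') for the real file):

```python
import csv
import io

# header plus three of the sample rows from the question
sample = """year,title_field,value
2010,Occupied Housing Units,41007
2009,Occupied Housing Units,35916
2008,Occupied Housing Units,36711
"""

occupied = []
reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row
for row in reader:
    if row[1] == 'Occupied Housing Units':
        occupied.append((int(row[0]), int(row[2])))  # ints, so subtraction works

occupied.sort(key=lambda x: x[0])  # ascending by year

results = []
for i in range(len(occupied) - 1):
    this_year, this_value = occupied[i]
    next_year, next_value = occupied[i + 1]
    if next_year - this_year == 1:  # only compare consecutive years
        results.append(f"{this_year}-{next_year} {abs(next_value - this_value)}")

print(results)  # ['2008-2009 795', '2009-2010 5091']
```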
I suggest using pandas for this: groupby and aggregation make these computations a breeze. For this particular question you don't even need groupby; filter on title_field, sort by year, and take the differences with diff(), like this:
import pandas as pd

df = pd.read_csv('denton_housing.csv')
occ = df[df['title_field'] == 'Occupied Housing Units'].sort_values('year')
print(occ.set_index('year')['value'].diff().abs())
Result:
year
2008       NaN
2009     795.0
2010    5091.0
2011     303.0
2012    1404.0
2013    1565.0
2014    2622.0
Name: value, dtype: float64
I've been given a homework task to get data from a csv file without using Pandas. The info in the csv file contains headers such as...
work year:
experience level: EN (Entry-level / Junior), MI (Mid-level / Intermediate), SE (Senior-level / Expert), EX (Executive-level / Director)
employment type: PT (Part-time), FT (Full-time), CT (Contract), FL (Freelance)
job title:
salary:
salary currency:
salary_in_usd: The salary in USD
employee residence: Employee's primary country of residence
remote ratio:
One of the questions is:
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
The only way I've managed to do this is to iterate through the csv and add a load of 'if' statements according to the experience level and job title, but this is taking me forever.
Any ideas of how to tackle this differently? Not using any libraries/modules.
Example of my code:
with open('/Users/xxx/Desktop/ds_salaries.csv', 'r') as f:
    csv_reader = f.readlines()

en_ai_scientist = 0
count_en_ai_scientist = 0
for row in csv_reader[1:]:
    new_row = row.split(',')
    experience_level = new_row[2]
    job_title = new_row[4]
    salary_in_usd = new_row[7]
    if experience_level == 'EN' and job_title == 'AI Scientist':
        en_ai_scientist += int(salary_in_usd)
        count_en_ai_scientist += 1

avg_en_ai_scientist = en_ai_scientist / count_en_ai_scientist
print(avg_en_ai_scientist)
Data:
When working out an example like this, I find it helpful to ask, "What data structure would make this question easy to answer?"
For example, the question asks
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
To me, this implies that I want a dictionary keyed by a tuple of experience level and job title, with the salaries of every person who matches. Something like this:
data = {
    ("EN", "AI Scientist"): [1000, 2000, 3000],
    ("SE", "AI Scientist"): [2000, 3000, 4000],
}
The next question is: how do I get my data into that format? I would read the data in with csv.DictReader, and add each salary number into the structure.
import csv

data = {}
with open('input.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        experience_level = row['experience_level']
        job_title = row['job_title']
        key = experience_level, job_title
        if key not in data:
            # provide a default value if the key doesn't exist yet
            # (look at collections.defaultdict for a better way to do this)
            data[key] = []
        data[key].append(int(row['salary_in_usd']))
Now that you have your data organized, you can compute average salaries:
for (experience_level, job_title), salary_data in data.items():
    print(experience_level, job_title, sum(salary_data) / len(salary_data))
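As the comment in the loop hints, collections.defaultdict removes the `if key not in data` check entirely. A sketch with a few made-up rows standing in for what csv.DictReader would produce:

```python
from collections import defaultdict

# made-up stand-ins for rows from csv.DictReader
rows = [
    {"experience_level": "EN", "job_title": "AI Scientist", "salary_in_usd": "1000"},
    {"experience_level": "EN", "job_title": "AI Scientist", "salary_in_usd": "3000"},
    {"experience_level": "SE", "job_title": "AI Scientist", "salary_in_usd": "4000"},
]

data = defaultdict(list)  # missing keys start out as an empty list
for row in rows:
    key = row["experience_level"], row["job_title"]
    data[key].append(int(row["salary_in_usd"]))

for (experience_level, job_title), salaries in data.items():
    print(experience_level, job_title, sum(salaries) / len(salaries))
# EN AI Scientist 2000.0
# SE AI Scientist 4000.0
```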
Hello, I am supposed to do the steps below. I have finished, but I'm getting this error:
File "C:/Users/User/Desktop/question2.py", line 37, in <module>
    jobtype_salary[li['job']] = int(li['salary'])
ValueError: invalid literal for int() with base 10: 'SECRETARY'
a. Read the file into a list of lists (14 rows, 5 columns)
b. Transform each row of the list into a dictionary. The keys are : ename, job, salary, comm, dno. Call the resulting list of dictionaries dict_of_emp
c. Display the table dict_of_emp, one row per line
d. Perform the following computations on dict_of_emp:
D1. Compute and print the incomes of Richard and Mary (add salary and comm)
D2 Compute and display the sum of salaries paid to each type of job (i.e. salary paid to analysts is 3500 + 3500= 7000)
D3. Add 5000 to the salaries of employees in department 30. Display the new table
import csv

# Open the file in read mode
f = open("employeeData.csv", 'r')
reader = csv.reader(f)

# To read the file into a list of lists we use the list() method
emps = list(reader)
#print(emps)

# Transform each row into a dictionary.
dict_of_emp = []  # list of dictionaries
for row in emps:
    d = {}
    d['ename'] = row[0]
    d['job'] = row[1]
    d['salary'] = row[2]
    d['comm'] = row[3]
    d['dno'] = row[4]
    dict_of_emp.append(d)
print("*************************************************")

# Display the table dict_of_emp, one row per line.
for li in dict_of_emp:
    print(li)
print("*************************************************")

# Incomes of Richard and Mary: to add salary and commission, first we need to cast them to integers.
d1 = ['RICHARD', 'MARY']
for li in dict_of_emp:
    if li['ename'] in d1:
        print('income of ', li['ename'], " is ", int(li['salary']) + int(li['comm']))
print("*************************************************")

# Sum of salaries based on type of job; a dictionary is used so the job type is the key
# and the sum of salaries is the value
jobtype_salary = {}
for li in dict_of_emp:
    if li['job'] in jobtype_salary.keys():
        jobtype_salary[li['job']] += int(li['salary'])
    else:
        jobtype_salary[li['job']] = int(li['salary'])
print(jobtype_salary)
print("*************************************************")

# Add 5000 to the salaries of employees in department 30.
for li in dict_of_emp:
    if li['dno'] == '30':
        li['salary'] = int(li['salary']) + 5000
for li in dict_of_emp:
    print(li)
Here is the csv as an image:
I think the indexing of your columns is slightly off. You do d['salary'] = row[2], which, according to the CSV, corresponds to the third column, i.e. the position of the person (SECRETARY, SALESPERSON). When you then try to convert this string to an integer, you get the error.
Does it run with this instead?
for row in emps:
    d = {}
    d['ename'] = row[1]
    d['job'] = row[2]
    d['salary'] = row[3]
    d['comm'] = row[4]
    d['dno'] = row[5]
    dict_of_emp.append(d)
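A way to sidestep this kind of off-by-one guessing altogether is csv.DictReader with explicit fieldnames, so each column is addressed by name rather than position. A sketch, assuming the column order implied above (an id column first) and using a made-up inline row in place of employeeData.csv:

```python
import csv
import io

# one made-up row standing in for employeeData.csv
sample = "1,RICHARD,ANALYST,3500,500,30\n"

# the file has no header row, so fieldnames supplies the column names
reader = csv.DictReader(
    io.StringIO(sample),
    fieldnames=["id", "ename", "job", "salary", "comm", "dno"],
)
dict_of_emp = list(reader)

emp = dict_of_emp[0]
print(emp["ename"], emp["job"], int(emp["salary"]) + int(emp["comm"]))
# RICHARD ANALYST 4000
```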
I am trying to compare two sets of data in one csv file, and I cannot use pandas.
What I am trying to get is the total units sold by each of two people, summed over all the years, and then a comparison of who sold more based on those totals. I also want the year each person sold the least, and that year's amount.
For example, my .csv is setup like this:
John Smith, 343, 2020
John Smith, 522, 2019
John Smith, 248, 2018
Sherwin Cooper, 412, 2020
Sherwin Cooper, 367, 2019
Sherwin Cooper, 97, 2018
Dorothy Lee, 612, 2020
Dorothy Lee, 687, 2019
Dorothy Lee, 591, 2018
I want to compare John and Dorothy's unit sold and who sold more. So the output should be:
Dorothy Lee sold more units than John smith. A total of 1890 to 1113.
Dorothy Lee sold less in 2018, for only 591.
John Smith sold less in 2018, for only 248.
My code so far is:
import csv

def compare1(employee1):
    with open("employeedata.csv") as file:
        rows = list(csv.DictReader(file, fieldnames=['c1', 'c2', 'c3']))
    res = {}
    for row in rows:
        if row['c1'] == employee1:
            res[employee1] = res.get(employee1, 0) + int(row['c2'])
    print(res)

def compare2(employee2):
    with open("employee2.csv") as file:
        rows = list(csv.DictReader(file, fieldnames=['c1', 'c2', 'c3']))
    res = {}
    for row in rows:
        if row['c1'] == employee2:
            res[employee2] = res.get(employee2, 0) + int(row['c2'])
    print(res)

employee1 = input("Enter the first name: ")
employee2 = input("Enter the second name: ")
compare1(employee1)
compare2(employee2)
I don't know the rest. I am stuck. I am a beginner and I can't use Panda. The output I need to have should look like this:
Dorothy Lee sold more units than John smith. A total of 1890 to 1113.
Dorothy Lee sold less in 2018, for only 591.
John Smith sold less in 2018, for only 248.
right now I got the output:
{'John Smith': 1113}
{'Dorothy Lee' : 1890}
Suppose my.csv has columns name, sales, year:
import pandas as pd

# the csv has no header row, so supply the column names
emp_df = pd.read_csv("my.csv", names=["name", "sales", "year"])
emp_gp = emp_df.groupby("name", as_index=False).sales.sum()

def compare(saler1, saler2):
    if saler1 in emp_gp.name.values and saler2 in emp_gp.name.values:
        saler1_tol = emp_gp.loc[emp_gp.name == saler1, "sales"].iloc[0]
        saler2_tol = emp_gp.loc[emp_gp.name == saler2, "sales"].iloc[0]
        if saler1_tol > saler2_tol:
            print(f"{saler1} sold more units than {saler2}. A total of {saler1_tol} to {saler2_tol}.")
        else:
            print(f"{saler2} sold more units than {saler1}. A total of {saler2_tol} to {saler1_tol}.")
        # row of the worst year for each saler
        worst = emp_df.loc[emp_df.groupby("name").sales.idxmin()]
        for saler in (saler1, saler2):
            row = worst[worst.name == saler].iloc[0]
            print(f"{row['name']} sold less in {row['year']}, for only {row['sales']}.")
    else:
        print("names of salers are not in the table")
Instead of creating a function for each result you want to get, first create a database (a dict is OK) that aggregates the sum of units sold for each name and for each year. Then it is easier to answer to all kind of comparisons without having to repeat code. You can start with something like this,
import csv
from collections import defaultdict

db = defaultdict(lambda: defaultdict(int))

with open('teste.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        db[row['name']][int(row['year'])] += int(row['units'])

print(db['Dorothy Lee'][2019])          # units sold by Dorothy Lee in 2019
print(sum(db['Dorothy Lee'].values()))  # total units sold by Dorothy Lee
Don't be afraid of defaultdict (from the collections module). Check the docs; it is really handy in this kind of scenario. A defaultdict creates a dictionary with a default value for every missing key. In this case, the default value of the outer defaultdict is another defaultdict, this time with a default value of 0 (the result of calling int()), since we want to compute a sum of units sold (therefore an integer).
With this approach, you don't need to check if the key already exists or not, defaultdict takes care of that for you.
PS: the lambda in the first defaultdict is needed to nest a second defaultdict. If you are not familiar with lambda either, check this
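A tiny demonstration of that behaviour, with made-up numbers:

```python
from collections import defaultdict

db = defaultdict(lambda: defaultdict(int))

# first access creates the inner dict and the 0 automatically
db["Dorothy Lee"][2019] += 687
db["Dorothy Lee"][2018] += 591

print(sum(db["Dorothy Lee"].values()))  # 1278
print(db["Nobody"][2020])               # 0 -- missing keys just yield the default
```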
I am trying to iterate over a dictionary whose values are lists of row indexes, applying pd.nsmallest to generate the top 3 smallest values for each set of row indexes. However, something is wrong with my for loop: I keep overwriting the top 3 values on every iteration, so my final Excel file contains only the 3 rows from the last run of the loop.
When I use print statements this works as expected and I get output for all 16 keys in the dictionary, but when writing to the Excel file it only gives me the output of the last run of the loop.
import pandas as pd
from tabulate import tabulate

VA = pd.read_excel('Columnar BU P&L.xlsx', sheet_name='Variance by Co')
legcon = VA[['Expense', 'Consolidation', 'Exp Category']]
legcon['Variance Type'] = ['Unfavorable' if x < 0 else 'favorable' for x in legcon['Consolidation']]

d = {'Travel & Entertainment': [1,2,3,4,5,6,7,8,9,10,11],
     'Office supplies & Expenses': [13,14,15,16,17],
     'Professional Fees': [19,20,21,22,23],
     'Fees & Assessments': [25,26,27],
     'IT Expenses': [29],
     'Bad Debt Expense': [31],
     'Miscellaneous expenses': [33,34,35,36,37],
     'Marketing Expenses': [40,41,42],
     'Payroll & Related Expenses': [45,46,47,48,49,50,51,52,53,54,55,56],
     'Total Utilities': [59,60],
     'Total Equipment Maint, & Rental Expense': [63,64,65,66,67,68],
     'Total Mill Expense': [70,71,72,73,74,75,76,77],
     'Total Taxes': [80,81],
     'Total Insurance Expense': [83,84,85],
     'Incentive Compensation': [88],
     'Strategic Initiative': [89]}
Printing output directly works fine when I do this:
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    print(a)
Expense Consolidation Exp Category Variance Type
5 Transportation - AIR -19054 Travel & Entertainment Unfavorable
9 Meals -9617 Travel & Entertainment Unfavorable
7 Lodging -9439 Travel & Entertainment Unfavorable
Expense Consolidation Exp Category Variance Type
26 Bank Charges -4320 Fees & Assessments Unfavorable
27 Finance Charges -1389 Fees & Assessments Unfavorable
25 Payroll Fees -1145 Fees & Assessments Unfavorable
However when I use the below code to write to excel:
writer = pd.ExcelWriter('testt.xlsx', engine='xlsxwriter')
row = 0
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    for i in range(0, 16):
        a.to_excel(writer, sheet_name='test', startrow=row + 4, index=False)
writer.save()
my output looks like this and does not show all the exp categories:
I would really appreciate any feedback on how to correct this. Thanks in advance!
With some help from a friend I realized my silly mistake: there was no row increment in my for loop, so each block of output overwrote the previous one instead of landing on the next lines. The code below fixed the issue (initially I had placed the row offset inside my df.to_excel statement):
writer = pd.ExcelWriter('testt.xlsx', engine='xlsxwriter')
row = 0
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    a.to_excel(writer, sheet_name='Testt', startrow=row, index=False)
    row = row + 4
writer.save()
This is the information in my csv file:
Receipt merchant Address Date Time Total price
25007 A ABC pte ltd 3/7/2016 10:40 12.30
25008 A ABC ptd ltd 3/7/2016 11.30 6.70
25009 B CCC ptd ltd 4/7/2016 07.35 23.40
25010 A ABC pte ltd 4/7/2016 12:40 9.90
How is it possible to add the 'Total price' of each line together only if they belong to the same 'merchant', 'Date' and 'Time', then group them together in a list or dict? Example: {['A','3/7/2016', '19.0'], ['A','4/7/2016', '9.90'], ...}
My previous code does what I wanted, except that it lacks the part that totals the price for each same date and merchant.
from collections import defaultdict
from csv import reader

with open("assignment_info.csv") as f:
    next(f)
    group_dict = defaultdict(list)
    for rec, name, _, dte, time, price in reader(f):
        group_dict[name, dte].extend(time)

for v in group_dict.values():
    v.sort()

from pprint import pprint as pp
print 'Sales tracker:'
pp(dict(group_dict))
import pandas as pd
df = pd.read_csv('assignment_info.csv')
df = df.groupby(['merchant', 'Date', 'Time']).sum().reset_index()
df
As the other answer points out, pandas is an excellent library for this kind of data manipulation. My answer won't use pandas though.
A few issues:
In your problem description, you state that you want to group by three columns, but in your example cases you are only grouping by two. Since the latter makes more sense, I am grouping only by name and date.
You are looping and sorting each value, but for the life of me I can't figure out why.
You declare the default type of the defaultdict a list and then extend with a string, which ends up giving you a (sorted!) list of characters. You don't really want to do this.
Your example uses the syntax of a set: { [a,b,c], [d,e,f] } but the syntax of a dict makes more sense: { (a, b): c, }. I have changed the output to the latter.
Here is a working example:
from collections import defaultdict
from csv import reader
with open("assignment_info.csv") as f:
next(f)
group_dict = defaultdict(float)
for rec, name, _, dte, time, price in reader(f):
group_dict[name, dte] += float(price)
group_dict is now:
{('A', '3/7/2016'): 19.0, ('A', '4/7/2016'): 9.9, ('B', '4/7/2016'): 23.4}
I removed extra columns which aren't in your example: here's the file I worked with:
Receipt,merchant,Address,Date,Time,Total price
25007,A,ABC pte ltd,3/7/2016,10:40,12.30
25008,A,ABC ptd ltd,3/7/2016,11.30,6.70
25009,B,CCC ptd ltd,4/7/2016,07.35,23.40
25010,A,ABC pte ltd,4/7/2016,12:40,9.90