I've been given a homework task to get data from a csv file without using Pandas. The info in the csv file contains headers such as...
work year:
experience level: EN Entry-level / Junior, MI Mid-level / Intermediate, SE Senior-level / Expert, EX Executive-level / Director
employment type: PT Part-time, FT Full-time, CT Contract, FL Freelance
job title:
salary:
salary currency:
salary in usd: The salary in USD
employee residence: Employee’s primary country of residence
remote ratio:
One of the questions is:
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
The only way I've managed to do this is to iterate through the csv and add a load of 'if' statements according to the experience level and job title, but this is taking me forever.
Any ideas of how to tackle this differently? Not using any libraries/modules.
Example of my code:
with open('/Users/xxx/Desktop/ds_salaries.csv', 'r') as f:
    csv_reader = f.readlines()
# running total and count for one (experience level, job title) pair
en_ai_scientist = 0
count_en_ai_scientist = 0
for row in csv_reader[1:]:
    new_row = row.split(',')
    experience_level = new_row[2]
    job_title = new_row[4]
    salary_in_usd = new_row[7]
    if experience_level == 'EN' and job_title == 'AI Scientist':
        en_ai_scientist += int(salary_in_usd)
        count_en_ai_scientist += 1
avg_en_ai_scientist = en_ai_scientist / count_en_ai_scientist
print(avg_en_ai_scientist)
When working out an example like this, I find it helpful to ask, "What data structure would make this question easy to answer?"
For example, the question asks
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
To me, this implies that I want a dictionary keyed by a tuple of experience level and job title, with the salaries of every person who matches. Something like this:
data = {
    ("EN", "AI Scientist"): [1000, 2000, 3000],
    ("SE", "AI Scientist"): [2000, 3000, 4000],
}
The next question is: how do I get my data into that format? I would read the data in with csv.DictReader, and add each salary number into the structure.
import csv
data = {}
with open('input.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # column names assumed to match the headers listed in the question
        experience_level = row['experience_level']
        job_title = row['job_title']
        key = experience_level, job_title
        if key not in data:
            # provide default value if no key exists
            # look at collections.defaultdict if you want to see a better way to do this
            data[key] = []
        # convert to int so the salaries can be summed later
        data[key].append(int(row['salary_in_usd']))
Now that you have your data organized, you can compute average salaries:
for (experience_level, job_title), salary_data in data.items():
    print(experience_level, job_title, sum(salary_data)/len(salary_data))
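Since the question rules out libraries and modules entirely, here is a minimal sketch of the same grouping idea using only built-ins. The sample rows are made up, and the column positions (2, 4 and 7) are taken from the question's own code. It keeps a running sum and count per key instead of a list, and dict.setdefault plays the role of the "if key not in data" check.

```python
# Hypothetical sample rows in the column order the question's code assumes:
# index 2 = experience level, 4 = job title, 7 = salary in USD.
lines = [
    "0,2020,EN,FT,AI Scientist,10000,USD,10000,US,0",
    "1,2021,EN,FT,AI Scientist,20000,USD,20000,US,50",
    "2,2022,SE,FT,Data Engineer,90000,USD,90000,GB,100",
]


def average_salaries(rows):
    """Return {(experience_level, job_title): average salary in USD}."""
    totals = {}  # key -> [running sum, count]
    for line in rows:
        fields = line.rstrip("\n").split(",")
        key = (fields[2], fields[4])
        entry = totals.setdefault(key, [0, 0])
        entry[0] += int(fields[7])
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


averages = average_salaries(lines)
for (level, title), avg in sorted(averages.items()):
    print(level, title, avg)
```

Note that a plain split(',') breaks if any field contains a quoted comma; if that can happen in your file, csv.reader is the safer choice once modules are allowed.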
Related
I am trying to iterate over a dictionary that contains multiple row indexes in its values and then apply pd.nsmallest function to generate the top 3 smallest values for the multiple sets of row indexes that are in the dictionary. However, there seems to be something wrong with my for loop statement as I am overwriting the top 3 values over and over till the last set of values in the dictionary and so my final excel file output shows only 3 rows for the last run of the for loop.
When I use print statements this works as expected and I get an output for all 16 values in the dictionary but when writing to excel file it only gives me the output of the last run on the loop
import pandas as pd
from tabulate import tabulate
VA = pd.read_excel('Columnar BU P&L.xlsx', sheet_name = 'Variance by Co')
legcon = VA[['Expense', 'Consolidation', 'Exp Category']]
legcon['Variance Type'] = ['Unfavorable' if x < 0 else 'favorable' for x in legcon['Consolidation']]
d = {'Travel & Entertainment': [1,2,3,4,5,6,7,8,9,10,11], 'Office supplies & Expenses': [13,14,15,16,17],
     'Professional Fees':[19,20,21,22,23], 'Fees & Assessments':[25,26,27], 'IT Expenses':[29],
     'Bad Debt Expense':[31],'Miscellaneous expenses': [33,34,35,36,37],'Marketing Expenses':[40,41,42],
     'Payroll & Related Expenses': [45,46,47,48,49,50,51,52,53,54,55,56], 'Total Utilities':[59,60],
     'Total Equipment Maint, & Rental Expense': [63,64,65,66,67,68],'Total Mill Expense':[70,71,72,73,74,75,76,77],
     'Total Taxes':[80,81],'Total Insurance Expense':[83,84,85],'Incentive Compensation':[88],
     'Strategic Initiative':[89]}
Printing output directly works fine when I do this:
for key,value in d.items():
    a = legcon.iloc[value][legcon.iloc[:,1]<0].nsmallest(3,'Consolidation')
    print(a)
Expense Consolidation Exp Category Variance Type
5 Transportation - AIR -19054 Travel & Entertainment Unfavorable
9 Meals -9617 Travel & Entertainment Unfavorable
7 Lodging -9439 Travel & Entertainment Unfavorable
Expense Consolidation Exp Category Variance Type
26 Bank Charges -4320 Fees & Assessments Unfavorable
27 Finance Charges -1389 Fees & Assessments Unfavorable
25 Payroll Fees -1145 Fees & Assessments Unfavorable
However when I use the below code to write to excel:
writer = pd.ExcelWriter('testt.xlsx', engine = 'xlsxwriter')
row = 0
for key,value in d.items():
    a = legcon.iloc[value][legcon.iloc[:,1]<0].nsmallest(3,'Consolidation')
    for i in range(0,16):
        a.to_excel(writer, sheet_name = 'test', startrow = row+4, index = False)
writer.save()
my output looks like this and does not show all the exp categories:
I would really appreciate any feedback on how to correct this. Thanks in advance!
With some help from a friend I realized my silly mistake: the row counter was never advanced inside my for loop, so each table was written on top of the previous one instead of on the next lines. The code below fixed the issue (initially I had placed the row increment inside my df.to_excel statement):
writer = pd.ExcelWriter('testt.xlsx', engine = 'xlsxwriter')
row = 0
for key,value in d.items():
    a = legcon.iloc[value][legcon.iloc[:,1]<0].nsmallest(3,'Consolidation')
    a.to_excel(writer, sheet_name = 'Testt', startrow = row, index = False)
    # each table takes 1 header row + up to 3 data rows
    row = row+4
# pandas 2.0 removed ExcelWriter.save(); use writer.close() there
writer.save()
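A fixed step of 4 works here because nsmallest(3, ...) returns at most three rows plus a header. If the tables could differ in length, the start row would have to be derived from each table's size. Here is a small sketch of that arithmetic in plain Python; start_rows and its gap parameter are hypothetical names for illustration, not part of pandas.

```python
def start_rows(table_lengths, gap=1):
    """Return the first sheet row for each table when stacked vertically.

    Each table occupies one header row plus its data rows; `gap` blank
    rows separate consecutive tables.
    """
    rows = []
    row = 0
    for n in table_lengths:
        rows.append(row)
        row += 1 + n + gap  # header + data rows + blank separator rows
    return rows


offsets = start_rows([3, 3, 1, 2])
print(offsets)
```

Inside the loop above this would amount to advancing row by len(a) + 1 (plus any blank rows you want) after each to_excel call.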
Following up on the question I submitted in the last few days: I have a defaultdict in which each entry is a record of a ticket sale for a deviceID, i.e. a passenger for a bus sale. The whole devicedict contains all the tickets sold in a given year, around 1 million. The defaultdict is keyed by deviceID.
I need to know the average delay between the purchase date and the actual date of departure for each ticket purchase. My problem is that I can't seem to extract each record from the dictionary.
So for each key, devicedict[key] contains a list of over 60 different characteristics: date_departure, date_arrival, etc. On each pass through the loop I want to process something like devicedict[deviceID][field of interest], do something with it, and for example extract the median delay between each purchase.
I've tried using append, and using nested arrays, but it doesn't return each individual record by itself.
valoresDias is the sum of the delays for each ticket (purchase date minus departure) in seconds divided by the number of seconds in a day (86400), and valoresTotalesDias is just a counter. The overall average delay should be valoresDias/valoresTotalesDias over all the records.
with open('Salida1.csv',newline='', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    #rows1 = list(csv_reader)
    #print(len(rows1))
    line_count = 0
    count=0
    for row in csv_reader:
        key = row[20]
        devicedict[key].append(row)
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            #print(f'\t{row[0]} works in the {row[20]} department, and was born in {row[2]}.')
            #print(row['id'], row['idapp'])
            #print(len(row))
            #print(list(row))
            mydict5ordenado.append(list(row))
            line_count += 1
print(len(devicedict.keys()))
f = "%Y-%m-%d %H:%M:%S"
p = devicedict.keys()
for i in range(0,len(devicedict)):
    mydict.append(devicedict[list(p)[i]])
    print(mydict[i])
    print("Los campos temporales:")
    #print(mydict[i][4])
    #print(mydict[i][3])
    out1=datetime.datetime.strptime(mydict[i][4], f)
    out2=datetime.datetime.strptime(mydict[i][3], f)
    out3=out1-out2
    valoresTotalesDias+=1
    valoresDias+=out3.seconds/86400
#This is what i am trying to obtain for each record without hardcoding
#I want to access each field in the above loop
count1=len(devicedict['4ff70ad8e2e74f49'])
for i in range(0,count1):
    mydict5.append(devicedict['4ff70ad8e2e74f49'][i])
print(len(mydict5))
for i in range (0,len(mydict5)):
    print(mydict5[i][7])
    print("Tipo de Bus:")
    print(mydict5[i][16])
    print(mydict5[i][14])
    if (mydict5[i][16]=='P'):
        preferente+=1
Mydict[i] should contain only one line of the record, that is one sale for each passenger not the whole record.
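One way to read the problem: devicedict[key] is a list of rows, so devicedict[key][4] is the fifth ticket for that device, not the fifth field of one ticket. Reaching a single field takes two levels of looping, one over the devices and one over each device's tickets. Below is a rough sketch under these assumptions: columns 3 and 4 hold the purchase and departure timestamps (the positions used in the code above), the header row has been kept out of devicedict, and the sample records are made up. It also uses total_seconds(), because timedelta.seconds ignores whole days.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
PURCHASE_COL, DEPARTURE_COL = 3, 4  # assumed positions, from the question's code


def average_delay_days(devicedict):
    """Mean purchase-to-departure delay in days over every ticket."""
    total_days = 0.0
    tickets = 0
    for records in devicedict.values():   # one list of rows per deviceID
        for record in records:            # one row per ticket sale
            purchase = datetime.strptime(record[PURCHASE_COL], FMT)
            departure = datetime.strptime(record[DEPARTURE_COL], FMT)
            total_days += (departure - purchase).total_seconds() / 86400
            tickets += 1
    return total_days / tickets if tickets else 0.0


# Hypothetical records with only the two timestamp columns filled in.
devicedict = defaultdict(list)
devicedict["dev-a"].append(["", "", "", "2019-01-01 08:00:00", "2019-01-03 08:00:00"])
devicedict["dev-a"].append(["", "", "", "2019-02-01 12:00:00", "2019-02-02 00:00:00"])
devicedict["dev-b"].append(["", "", "", "2019-03-10 00:00:00", "2019-03-10 12:00:00"])

print(average_delay_days(devicedict))
```

The same inner loop is where any other per-ticket field (bus type, etc.) can be read as record[index], without hardcoding a deviceID.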
I am trying to run the code below to read a csv file so that I can search by month and get that month's utility charges. If I put a print statement above "try", it prints every line. However, when I try to print (or reference) the bills defaultdict the function should create, it is empty. Any ideas what I might be doing wrong here? Thanks in advance for any help on this!
import csv
from collections import defaultdict, namedtuple
source = 'C:/Users/George/PycharmProjects/Day6-Monthly-bills/Monthly Bills record - Sheet3 (1).csv'
Bills = namedtuple('Utilities', 'Month Gas Electric Water Total')
def lookup_bills_by_month(data=source):
    bills = defaultdict(list)
    with open(data, encoding='utf-8') as f:
        for line in csv.DictReader(f):
            try:
                month = line['Month']
                electric = int(line['Electric'])
                gas = int(line['Gas'])
                water = int(line['Water'])
                total = int(line['Total'])
            except ValueError:
                # any row with a non-integer value is silently skipped
                continue
            # the namedtuple has no defaults, so Month must be passed too
            b = Bills(Month=month, Gas=gas, Electric=electric, Water=water, Total=total)
            bills[month].append(b)
    print(bills)
    return bills
bills = lookup_bills_by_month()
print(bills)
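One possible explanation, which is only a guess without seeing the file: if every row raises ValueError inside the try (for example because the amounts look like "$40.50" or "12.75", which int() rejects), the continue skips every row and the defaultdict stays empty. The sketch below records why rows are skipped instead of discarding them silently. The sample data, the to_number helper and the "$"/comma format it strips are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict, namedtuple

Bills = namedtuple('Utilities', 'Month Gas Electric Water Total')

# Hypothetical sample; the real file's formatting is unknown.
sample = io.StringIO(
    "Month,Gas,Electric,Water,Total\n"
    "January,$40.50,$60.25,$20.00,$120.75\n"
    "February,35,55,bad,110\n"
)


def to_number(text):
    """Parse text such as '$1,234.50' as a float (assumed format)."""
    return float(text.replace('$', '').replace(',', '').strip())


def lookup_bills_by_month(f):
    bills = defaultdict(list)
    skipped = []  # (month, reason) for every row that failed to parse
    for line in csv.DictReader(f):
        try:
            amounts = [to_number(line[k]) for k in ('Gas', 'Electric', 'Water', 'Total')]
        except ValueError as exc:
            skipped.append((line['Month'], str(exc)))
            continue
        bills[line['Month']].append(Bills(line['Month'], *amounts))
    return bills, skipped


bills, skipped = lookup_bills_by_month(sample)
print(dict(bills))
print(skipped)
```

If skipped ends up containing every month, the messages in it should show which column is failing to convert.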
I asked a question similar last night, but my professor gave some clarification on how she wants the question answered, and it's thrown me for a loop.
I've got a csv file with 3 columns, and I'm saving the rows as dictionaries. I'm trying to find a way to read the year and the title_field, pick out a specific title_field (Occupied Housing Units), start with the earliest year (2008), take the number in the value column for that row and pair it with the value for the NEXT year (2009) with the same title_field, find the difference between those two values, and print the outcome. Then do the same with 2009 & 2010, etc., like this:
2008-2009 795
2009-2010 5091
etc.
The csv looks like this:
year,title_field,value
2014,Total Housing Units,49109
2014,Vacant Housing Units,2814
2014,Occupied Housing Units,46295
2013,Total Housing Units,47888
2013,Vacant Housing Units,4215
2013,Occupied Housing Units,43673
2012,Total Housing Units,45121
2012,Vacant Housing Units,3013
2012,Occupied Housing Units,42108
2011,Total Housing Units,44917
2011,Vacant Housing Units,4213
2011,Occupied Housing Units,40704
2010,Total Housing Units,44642
2010,Vacant Housing Units,3635
2010,Occupied Housing Units,41007
2009,Total Housing Units,39499
2009,Vacant Housing Units,3583
2009,Occupied Housing Units,35916
2008,Total Housing Units,41194
2008,Vacant Housing Units,4483
2008,Occupied Housing Units,36711
And the code I have so far is:
import csv
def process(year, field_name, value):
    print(year, field_name, value)
with open('denton_housing.csv', 'r', encoding='utf8',newline='') as f:
    reader = csv.DictReader(f, delimiter=',')
    housing_stats = []
    for row in reader:
        year = row["year"]
        field_name = row["title_field"]
        value = int(row["value"])
        denton_dict = {"year": year, "field_name": field_name, "value": value}
        housing_stats.append(denton_dict)
        process(year, field_name, value)
Thanks! I'm new to programming, and I'm an older dude. I love that the programming community is beyond helpful, as if you all are welcoming everyone into a cult (the good kind?).
You could do it like this:
1. Create a list of those row dicts which have the target title_field value in them.
2. Sort it by the year value in each one.
3. Use the itertools recipe for the pairwise() generator to process each pair of rows/years in the sorted list.
Code implementing the above:
import csv
from itertools import tee
# From https://docs.python.org/3/library/itertools.html#recipes
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
target_title_field = 'Occupied Housing Units'
csv_filename = 'denton_housing.csv'
with open(csv_filename, 'r', encoding='utf8', newline='') as f:
    housing_stats = []
    for row in csv.DictReader(f, delimiter=','):
        if row['title_field'] == target_title_field:
            year = int(row["year"])
            field_name = row["title_field"]
            value = int(row["value"])
            denton_dict = {"year": year, "field_name": field_name, "value": value}
            housing_stats.append(denton_dict)
housing_stats.sort(key=lambda row: row['year'])
for r1, r2 in pairwise(housing_stats):
    print('{}-{} {:5}'.format(r1['year'], r2['year'], abs(r2['value'] - r1['value'])))
Output:
2008-2009 795
2009-2010 5091
2010-2011 303
2011-2012 1404
2012-2013 1565
2013-2014 2622
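If you would rather not copy the recipe, the same pairing can be had by zipping a sorted list with itself shifted by one (and on Python 3.10 or newer, itertools.pairwise is built in). A minimal sketch using the first three Occupied Housing Units values from the CSV above; the occupied dict and its year-to-value layout are just an illustration, not the structure used in the answer:

```python
# Occupied Housing Units per year, copied from the CSV in the question.
occupied = {2008: 36711, 2009: 35916, 2010: 41007}

years = sorted(occupied)
# zip(years, years[1:]) yields (2008, 2009), (2009, 2010), ...
diffs = [(a, b, abs(occupied[b] - occupied[a])) for a, b in zip(years, years[1:])]

for a, b, diff in diffs:
    print('{}-{} {:5}'.format(a, b, diff))
```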
One easy way is to use 3 lists (one for each of your title_field values) to keep the year and value fields; then you can process each list.
import csv
total = []
vacant = []
occupied = []
with open('denton_housing.csv', 'r', encoding='utf8',newline='') as f:
    spamreader = csv.reader(f, delimiter=',')
    for row in spamreader:
        if row[1] == 'Occupied Housing Units':
            # use the data structure you prefer; in this example I use a tuple
            # convert to int so the values can be subtracted later
            mytuple = (int(row[0]), int(row[2]))
            occupied.append(mytuple)
        # do the same for the total and vacant lists, ignore if you don't need them
        ...
# then you can process the list, for example, occupied
# I assume your csv file is sorted by year, so you may safely assume that each
# year field of the data entry in the occupied list is sorted as well
for i in range(len(occupied)-1):
    # if your data contains every year, i.e. 2008-2014 without missing any,
    # the year field is useless in this case, so you can just
    value_diff = abs(occupied[i][1] - occupied[i+1][1])
# if the year entries are not sorted, or some years may be missing
occupied.sort(key=lambda x: x[0]) # this sorts in ascending order
for i in range(len(occupied)-1):
    this_year = occupied[i][0]
    next_year = occupied[i+1][0]
    if next_year - this_year == 1:
        value_diff = abs(occupied[i][1] - occupied[i+1][1])
I suggest using pandas for this.
Then you can use groupby and aggregation with ease,
like this:
df.groupby(df['year'].dt.year)['a'].agg(['value'])
Result:
2012 14
2015 6
I need a little help reading specific values into a dictionary using Python. I have a csv file with user numbers (user 1, 2, 3...). Each user is within a specific department (1, 2, 3...) and each department is in a specific building (1, 2, 3...). I need to know how I can list all the users in department 1 in building 1, then department 2 in building 1, and so on. I have read everything into one massive dictionary using csv.DictReader, but that only helps if I can search through the entries I read into each dictionary of dictionaries. Any ideas for how to sort through this file? The CSV has over 150,000 entries for users. Each row is a new user and lists 3 attributes: user_name, department number, department building. There are 100 departments, 100 buildings and 150,000 users. Any ideas on a short script to sort them all out? Thanks in advance for your help.
A brute-force approach would look like
import csv
csvFile = csv.reader(open('myfile.csv'))
data = list(csvFile)
data.sort(key=lambda x: (x[2], x[1], x[0]))
It could then be extended to
import csv
import collections
# building -> department -> list of user names
data = collections.defaultdict(lambda: collections.defaultdict(list))
with open('myfile.csv', newline='') as f:
    for name, dept, building in csv.reader(f):
        data[building][dept].append(name)
# Python 3: dict.keys() has no sort(), so iterate over sorted(...) instead
for building in sorted(data):
    print("Building {0}".format(building))
    for dept in sorted(data[building]):
        print("  Dept {0}".format(dept))
        for name in sorted(data[building][dept]):
            print("    ", name)