Hello, I am supposed to do the steps below. I have finished, but I am getting this error:
File "C:/Users/User/Desktop/question2.py", line 37, in <module>
jobtype_salary[li['job']] = int(li['salary'])
ValueError: invalid literal for int() with base 10: 'SECRETARY'
a. Read the file into a list of lists (14 rows, 5 columns)
b. Transform each row of the list into a dictionary. The keys are : ename, job, salary, comm, dno. Call the resulting list of dictionaries dict_of_emp
c. Display the table dict_of_emp, one row per line
d. Perform the following computations on dict_of_emp:
D1. Compute and print the incomes of Richard and Mary (add salary and comm)
D2. Compute and display the sum of salaries paid to each type of job (e.g. the salary paid to analysts is 3500 + 3500 = 7000)
D3. Add 5000 to the salaries of employees in department 30. Display the new table
import csv
#Open the file in read mode
f = open("employeeData.csv",'r')
reader = csv.reader(f)
#To read the file into list of lists we use list() method
emps = list(reader)
#print(emps)
#Transform each row into a dictionary.
dict_of_emp = [] #list of dictionaries
for row in emps:
    d={}
    d['ename'] = row[0]
    d['job'] = row[1]
    d['salary']=row[2]
    d['comm']=row[3]
    d['dno']=row[4]
    dict_of_emp.append(d)
print("*************************************************")
#display the table dict_of_emp, one row per line.
for li in dict_of_emp:
    print(li)
print("*************************************************")
#Incomes of Richard and Mary, to add salary and commision, first we need to cast them to integers.
d1 = ['RICHARD','MARY']
for li in dict_of_emp:
    if li['ename'] in d1:
        print('income of ', li['ename']," is ",int(li['salary']+li['comm']))
print("*************************************************")
#Sum of salaries based on type of job, dictionary is used so the job type is key
#and sum of salary is value
jobtype_salary = {}
for li in dict_of_emp:
    if li['job'] in jobtype_salary.keys():
        jobtype_salary[li['job']] += int(li['salary'])
    else:
        jobtype_salary[li['job']] = int(li['salary'])
print(jobtype_salary)
print("*************************************************")
#Add 5000 to salaries of employees in department 30.
for li in dict_of_emp:
    if li['dno']=='30':
        li['salary']=int(li['salary'])+5000
for li in dict_of_emp:
    print(li)
Here is the csv as an image:
I think the indexing of your columns is slightly off. You do d['salary'] = row[2], which, according to the CSV, corresponds to the third column, i.e. the person's position (SECRETARY, SALESPERSON). If you then try to convert this string to an integer, you get the error.
Does it run with this instead?
for row in emps:
    d={}
    d['ename'] = row[1]
    d['job'] = row[2]
    d['salary']=row[3]
    d['comm']=row[4]
    d['dno']=row[5]
    dict_of_emp.append(d)
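Separately from the column offset, the D1 line in the question computes int(li['salary']+li['comm']). Both values are still strings at that point, so the + joins them before int() is applied. A minimal sketch of the difference, assuming the fields arrive as digit strings and that an empty comm should count as 0 (both are assumptions, and the sample values are made up):

```python
# Hypothetical row; the real values come from employeeData.csv.
row = {'ename': 'RICHARD', 'salary': '1600', 'comm': '300'}


def to_int(text):
    """Convert a CSV field to int, treating an empty field as 0 (assumption)."""
    text = text.strip()
    return int(text) if text else 0


# String concatenation first: '1600' + '300' becomes '1600300'.
concatenated = int(row['salary'] + row['comm'])

# Converting each field first gives the intended sum.
income = to_int(row['salary']) + to_int(row['comm'])

print(concatenated, income)
```

Converting each field on its own also makes it easy to decide in one place what an empty commission should mean.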
Following up on the question I submitted a few days ago: I have a defaultdict in which each entry holds the ticket-sale records for one deviceID (one passenger) for bus sales. The whole devicedict contains all the tickets sold in a given year, around 1 million. The defaultdict is keyed by deviceID.
I need to know the average delay between the purchase date and the actual date of departure for each ticket purchase. My problem is that I can't seem to extract each record from the dictionary.
For each key, devicedict[key] holds records with over 60 different characteristics: date_departure, date_arrival, etc. On each pass of the loop I want to process something like devicedict[deviceID][field of interest], do something with it, and for example extract the median delay between each purchase.
I've tried using append and nested arrays, but that doesn't return each individual record by itself.
valoresDias is the sum of the delays for each ticket (purchase date minus departure) in seconds, divided by the number of seconds in a day (86400), and valoresTotalesDias is just a counter. The overall average delay should be valoresDias/valoresTotalesDias over all the records.
import csv
import datetime
from collections import defaultdict

devicedict = defaultdict(list)
mydict = []
mydict5 = []
mydict5ordenado = []
valoresDias = 0
valoresTotalesDias = 0
preferente = 0

with open('Salida1.csv',newline='', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    #rows1 = list(csv_reader)
    #print(len(rows1))
    line_count = 0
    count=0
    for row in csv_reader:
        key = row[20]
        devicedict[key].append(row)
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            #print(f'\t{row[0]} works in the {row[20]} department, and was born in {row[2]}.')
            #print(row['id'], row['idapp'])
            #print(len(row))
            #print(list(row))
            mydict5ordenado.append(list(row))
            line_count += 1
print(len(devicedict.keys()))
f = "%Y-%m-%d %H:%M:%S"
p = devicedict.keys()
for i in range(0,len(devicedict)):
    mydict.append(devicedict[list(p)[i]])
    print(mydict[i])
    print("Los campos temporales:")
    #print(mydict[i][4])
    #print(mydict[i][3])
    out1=datetime.datetime.strptime(mydict[i][4], f)
    out2=datetime.datetime.strptime(mydict[i][3], f)
    out3=out1-out2
    valoresTotalesDias+=1
    valoresDias+=out3.seconds/86400
#This is what i am trying to obtain for each record without hardcoding
#I want to access each field in the above loop
count1=len(devicedict['4ff70ad8e2e74f49'])
for i in range(0,count1):
    mydict5.append(devicedict['4ff70ad8e2e74f49'][i])
print(len(mydict5))
for i in range (0,len(mydict5)):
    print(mydict5[i][7])
    print("Tipo de Bus:")
    print(mydict5[i][16])
    print(mydict5[i][14])
    if (mydict5[i][16]=='P'):
        preferente+=1
mydict[i] should contain only one line of the record, that is, one sale for one passenger, not the whole list of records.
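One way to reach each individual sale is a nested loop: the outer loop walks the device IDs and the inner loop walks that device's list of rows, so each row is a single ticket. This is only a sketch. It assumes, following the indices used in the loop above, that column 3 holds the purchase timestamp and column 4 the departure timestamp, both in "%Y-%m-%d %H:%M:%S" format. The device IDs and rows are made up and shortened to five columns.

```python
import datetime
from collections import defaultdict

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical, shortened records: device ID -> list of rows (one row per sale).
devicedict = defaultdict(list)
devicedict['dev-a'].append(['1', 'x', 'y', '2019-01-01 10:00:00', '2019-01-03 10:00:00'])
devicedict['dev-a'].append(['2', 'x', 'y', '2019-02-01 08:00:00', '2019-02-02 20:00:00'])
devicedict['dev-b'].append(['3', 'x', 'y', '2019-03-01 00:00:00', '2019-03-01 12:00:00'])

total_days = 0.0
ticket_count = 0
for device_id, rows in devicedict.items():   # one entry per device
    for row in rows:                         # one row per ticket sale
        purchase = datetime.datetime.strptime(row[3], FMT)
        departure = datetime.datetime.strptime(row[4], FMT)
        # total_seconds() includes whole days; .seconds alone would drop them
        total_days += (departure - purchase).total_seconds() / 86400
        ticket_count += 1

average_delay_days = total_days / ticket_count
print(average_delay_days)
```

Note that the reading loop above stores every row in devicedict before checking line_count, so the header row ends up in the dictionary too and would need to be skipped before parsing dates.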
I am trying to write code that will handle my input file of numbers and then perform various operations on them. For example, the first column is a name, the second is an hourly rate, and the third is hours. The file looks like this:
John 15 8
Sam 10 4
Mike 16 10
John 19 15
I want to go through the file and, if a name is a duplicate (John in the example), average the 2nd number (hourly rate), sum the 3rd number (hours), and delete the duplicate, leaving one John with the average wage and total hours. If a name is not a duplicate, it should just output the original entry.
I cannot figure out how to keep track of the duplicate and then move on to the next line in the file. Is there any way to do this without using line.split()?
This problem is easier if you break it up into parts.
First, you want to read through the file and parse each line into three variables, the name, the hourly rate, and the hours.
Second, you need to handle the matching on the first value (the name). You need some kind of data structure to store values in; a dict is probably the right thing here.
Thirdly, you need to compute the average at the end (you can't compute it along the way because you need the count of values).
Putting it together, I would do something like this:
import sys


class PersonRecord:
    def __init__(self, name):
        self.name = name
        self.hourly_rates = []
        self.total_hours = 0

    def add_record(self, hourly_rate, hours):
        self.hourly_rates.append(hourly_rate)
        self.total_hours += hours

    def get_average_hourly_rate(self):
        return sum(self.hourly_rates) / len(self.hourly_rates)


def compute_person_records(data_file_path):
    person_records = {}
    with open(data_file_path, 'r') as data_file:
        for line in data_file:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            name = parts[0]
            hourly_rate = int(parts[1])
            hours = int(parts[2])
            person_record = person_records.get(name)
            if person_record is None:
                person_record = PersonRecord(name)
                person_records[name] = person_record
            person_record.add_record(hourly_rate, hours)
    return person_records


def main():
    # The path of the data file is taken from the command line.
    person_records = compute_person_records(sys.argv[1])
    for person_name, person_record in person_records.items():
        print('{name} {average_hourly_rate} {total_hours}'.format(
            name=person_name,
            average_hourly_rate=person_record.get_average_hourly_rate(),
            total_hours=person_record.total_hours))


if __name__ == '__main__':
    main()
Here we go. Just groupby the name and aggregate on the rate and hours taking the mean and sum as shown below.
#assume d is the name of your DataFrame.
d.groupby(by =['name']).agg({'rate': "mean", 'hours':'sum'})
Here's a version that's not particularly efficient. I wouldn't run it on lots of data, but it's easy to read and returns your data to its original form, which is apparently what you want...
from statistics import mean
# named 'text' rather than 'input' so the built-in input() is not shadowed
text = '''John 15 8
Sam 10 4
Mike 16 10
John 19 15'''
lines = text.splitlines()
data = [line.split(' ') for line in lines]
names = set([item[0] for item in data])
processed = [(name,
              str(mean([int(i[1]) for i in data if i[0] == name])),
              str(sum([int(i[2]) for i in data if i[0] == name])))
             for name in names]
joined = [' '.join(p) for p in processed]
line_joined = '\n'.join(joined)
print(line_joined)
a=[] #list to store all the values
while(True): #loop until end of input, to take any number of lines
    try:
        l=input().split()
        if l: #ignore blank lines
            a.append(l)
    except(EOFError):
        break
result=[] #final list, one entry per name
seen=set() #names that have already been handled
for i in a:
    if i[0] in seen:
        continue #duplicate, already merged into an earlier entry
    seen.add(i[0])
    m=[j for j in a if j[0]==i[0]] #temporary list with all rows for this name
    if(len(m)>1):
        hr=0 #initializing total hourly rate and hours with 0
        hrs=0
        for k in m:
            hr+=int(k[1])
            hrs+=int(k[2]) #calculating total hourly rate and hours
        result.append([i[0],hr/len(m),hrs]) #average rate, total hours
    else:
        result.append(i) #not a duplicate, keep the original entry
for i in result:
    print(i[0],i[1],i[2]) #printing the final list
See the comments in the code for an explanation.
You can do:
from collections import defaultdict

with open('file_name') as fd:
    data = fd.read().splitlines()

line_elems = []
for line in data:
    line_elems.append(line.split())

a_dict = defaultdict(list)
for e in line_elems:
    a_dict[e[0]].append((e[1], e[2]))

final_dict = {}
for key in a_dict:
    if len(a_dict[key]) > 1:
        hour_rates = [float(x[0]) for x in a_dict[key]]
        hours = [float(x[1]) for x in a_dict[key]]
        ave_rate = sum(hour_rates) / len(hour_rates)
        total_hours = sum(hours)
        final_dict[key] = (ave_rate, total_hours)
    else:
        final_dict[key] = a_dict[key]
print(final_dict)
# write to file or do whatever
I'm attempting to learn how to search csv files. In this example, I've worked out how to search a specific column (date of birth) and how to search indexes within that column to get the year of birth.
I can search for greater than a specific year - e.g. typing in 45 will give me everyone born in or after 1945, but the bit I'm stuck on is if I type in a year not specifically in the csv/list I will get an error saying the year isn't in the list (which it isn't).
What I'd like to do is iterate through the years in the column until the next year that is in the list is found and print anything greater than that.
I've tried a few bits with iteration, but my brain has finally ground to a halt. Here is my code so far...
import csv

data=[]
with open("users.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
print(data)
lookup = input("Please enter a year of birth to start at (eg 67): ")
#lookupint = int(lookup)
#searching column 3 eg [3]
#but also searching index 6-8 in column 3
#eg [6:8] being the year of birth within the DOB field
col3 = [x[3][6:8] for x in data]
#just to check if col3 is showing the right data
print(col3)
print ("test3")
#looks in column 3 for 'lookup' which is a string
#in the table
if lookup in col3: #can get rid of this
    output = col3.index(lookup)
    print (col3.index(lookup))
    print("test2")
for k in range (0, len(col3)):
    #looks for data that is equal or greater than YOB
    if col3[k] >= lookup:
        print(data[k])
Thanks in advance!
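Because the final loop compares with >=, the typed year does not have to appear in the column at all, so the membership test and the index() lookup can be dropped. Comparing integers rather than strings also avoids surprises with an input such as '5'. A minimal sketch, assuming the date of birth sits in column 3 in a dd/mm/yy-style layout (so [6:8] is the two-digit year) and that there is no header row; the sample rows and the function name are made up:

```python
# Hypothetical rows standing in for users.csv (date of birth in column 3).
data = [
    ['1', 'Ann', 'Lee', '03/02/45'],
    ['2', 'Bob', 'Ray', '17/11/67'],
    ['3', 'Cy', 'Dow', '25/06/72'],
]


def born_in_or_after(rows, year):
    """Return the rows whose two-digit birth year is >= year."""
    return [row for row in rows if int(row[3][6:8]) >= year]


# 50 is not a year in the data, but the comparison still works.
matches = born_in_or_after(data, 50)
for row in matches:
    print(row)
```

With real input, the value from input() would be passed through int() before calling the function.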
I have a spreadsheet with the below structure (Data starts from Column B. Col A is empty)
A   B            C           D
    Name         city        salary
    Jennifer     Boston      100
    Andrew       Pittsburgh  1000
    Sarah        LA          100
    Grand Total              1200
I need to filter out the row with the grand total before loading it into the database.
For this, I'm reading the Grand Total as:
import xlrd
import pymssql
#open workbook
book = xlrd.open_workbook(r"C:\_Workspace\Test\MM.xls")
print( "The number of worksheets is", book.nsheets)
#get the sheet (assumed here to be the first one)
curr_sheet = book.sheet_by_index(0)
#for each row in xls file loop
#skip last row
last_row = curr_sheet.nrows
print(last_row)
print(curr_sheet.ncols)
skip_val = curr_sheet.cell(last_row,1).value
print( skip_val)
if skip_val == "Grand Total":
    last_row = last_row - 1
else:
    last_row = last_row
for rx in range(last_row):
    print( curr_sheet.row(rx))
However, I'm getting the below error:
Traceback (most recent call last):
File "C:\_Workspace\Test\xldb.py", line 26, in <module>
skip_val = curr_sheet.cell(last_row,1).value
File "c:\Python34\lib\site-packages\xlrd-0.9.3-py3.4.egg\xlrd\sheet.py", line 399, in cell
self._cell_types[rowx][colx],
IndexError: list index out of range
I'm not able to figure out what is wrong with the syntax above. Hoping someone here can spot why it's throwing the error.
Thanks much in advance,
Bee
I think your problem is not accounting for the zero-based index. last_row = curr_sheet.nrows returns the number of rows in the worksheet, so accessing the last row requires:
skip_val = curr_sheet.cell_value(last_row-1, 1)
The first element in Python is indexed by 0, so the first element of a list mylist would be mylist[0]. The last element is not mylist[len(mylist)]; it is mylist[len(mylist)-1], which is more idiomatically written as mylist[-1]. You can therefore write the following:
skip_val = curr_sheet.cell_value(-1, 1)
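The same off-by-one can be seen with a plain list, without xlrd. A small sketch with made-up values standing in for column B of the sheet:

```python
# Hypothetical values standing in for column B.
mylist = ['Name', 'Jennifer', 'Andrew', 'Sarah', 'Grand Total']
n = len(mylist)

last_by_length = mylist[n - 1]   # the last valid index is n - 1
last_by_negative = mylist[-1]    # negative indices count from the end

try:
    mylist[n]                    # one past the end
    overflowed = False
except IndexError:
    overflowed = True

print(last_by_length, last_by_negative, overflowed)
```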
Starting from a large imported data set, I am trying to identify and print each line corresponding to a city that has at least 2 unique colleges/universities there.
So far (the relevant code):
for line in file:
    fields = line.split(",")
    ID, name, city = fields[0], fields[1], fields[3]
    count = line.count()
    if line.count(city) >= 2:
        if line.count(ID) < 2:
            print "ID:", ID, "Name: ", name, "City: ", city
In other words, I want to be able to eliminate 1) any duplicate school listings (by ID - this file has many institutions appearing repeatedly), 2) any cities that do not have two or more institutions there.
Thank you!
dicts come in handy when you want to group data by some key. In your case, nested dicts that first index by city and then by ID should do the trick.
# will hold cities[city][ID] = [ID, name, city]
cities = {}
for line in file:
    fields = line.strip().split(",")
    ID, name, city = fields[0], fields[1], fields[3]
    cities.setdefault(city, {})[ID] = [ID, name, city]
# each value of 'cities' maps the IDs in that city to their records;
# keep only the cities with at least 2 IDs
multi_schooled_cities = [list(ids_by_city.values()) for ids_by_city in cities.values() if len(ids_by_city) >= 2]
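To get from the grouped data to the printed lines the question asks for, something like the following could work. It is a sketch under assumptions: the sample lines are invented, the layout (ID in field 0, name in field 1, city in field 3) is taken from the question's code, and a repeated ID is assumed to always refer to the same school.

```python
# Hypothetical sample lines: ID, name, state, city (city is field 3, as in the question).
lines = [
    "100,Alpha College,MA,Boston",
    "100,Alpha College,MA,Boston",      # repeated listing of the same school
    "101,Beta University,MA,Boston",
    "200,Gamma College,TX,Austin",
]

cities = {}
for line in lines:
    fields = line.strip().split(",")
    ID, name, city = fields[0], fields[1], fields[3]
    # Keying the inner dict by ID collapses duplicate listings.
    cities.setdefault(city, {})[ID] = (ID, name, city)

printed = []
for city, schools in cities.items():
    if len(schools) >= 2:                 # at least two unique IDs in this city
        for ID, name, city_name in schools.values():
            printed.append(f"ID: {ID} Name: {name} City: {city_name}")

for text in printed:
    print(text)
```

Austin is dropped because it has a single institution, and the duplicated Boston listing appears only once.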