I am very new to using Python to process data in CSV files. I have a CSV file with the data below. I want to average the times in the Sprint, Jog, and Walk columns for each session. The example below has the subject John Doe with Session2 and Session3; I would like to compute the averages for each session separately and write them to a new CSV file. Is there a way to do this without pandas, using other modules such as csv or NumPy, to group the data by person (subject) and then by session? I have tried to build a dictionary, but the keys get overwritten. I have also tried using a list, but I cannot figure out how to target the sessions to average them. I also tried using DictReader to read the fieldnames and then process the data, but I cannot figure out how to group all the John Doe Session2 rows to find the average of the times.
Subject, Session, Course, Size, Category, Sprint, Jog, Walk
John Doe, Session2, 17, 2, Bad, 25s, 36s, 55s
John Doe, Session2, 3, 2, Good, 26s, 35s, 45s
John Doe, Session2, 1, 2, Good, 22s, 31s, 47s
John Doe, Session3, 5, 2, Good, 16s, 32s, 55s
John Doe, Session3, 2, 2, Good, 13s, 24s, 52s
John Doe, Session3, 16, 2, Bad, 15s, 26s, 49s
PS I say no PANDAS because my groupmates are not adding this module since we have so many other dependencies.
Given your input, these built-in Python libraries can generate the output you want:
import csv
from itertools import groupby
from operator import itemgetter
from collections import defaultdict
with open('input.csv','r',newline='') as fin,open('output.csv','w',newline='') as fout:
    # skipinitialspace is needed because the sample data has spaces after the comma delimiters.
    reader = csv.DictReader(fin,skipinitialspace=True)
    # Output file will have these fieldnames
    writer = csv.DictWriter(fout,fieldnames='Subject Session Sprint Jog Walk'.split())
    writer.writeheader()
    # for each subject/session, groupby returns a 2-tuple of sort key and an
    # iterator over the rows of that key. Data must be sorted by the key already!
    for (subject,session),group in groupby(reader,key=itemgetter('Subject','Session')):
        # build the row to output. defaultdict(int) assumes integer(0) if key doesn't exist.
        row = defaultdict(int)
        row['Subject'] = subject
        row['Session'] = session
        # Count the items for average.
        count = 0
        for item in group:
            count += 1
            # sum the rows, removing the 's'
            for col in ('Sprint','Jog','Walk'):
                row[col] += int(item[col][:-1])
        # produce the average
        for col in ('Sprint','Jog','Walk'):
            row[col] /= count
        writer.writerow(row)
Output:
Subject,Session,Sprint,Jog,Walk
John Doe,Session2,24.333333333333332,34.0,49.0
John Doe,Session3,14.666666666666666,27.333333333333332,52.0
Function links: itemgetter, groupby, defaultdict
If your data is not pre-sorted, you can use the following replacement lines to read in and sort the data by using the same key used in groupby. However, in this implementation the data must be small enough to load it all into memory at once.
sortkey = itemgetter('Subject','Session')
data = sorted(reader,key=sortkey)
for (subject,session),group in groupby(data,key=sortkey):
...
As you want the average grouped by subject and session, just compose unique keys out of that information:
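import csv
times = {}
with open('yourfile.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
    next(reader)  # skip the header row
    for row in reader:
        key = row[0]+row[1]
        if key not in times:
            # one list of values per column: Sprint, Jog, Walk
            times[key] = [[], [], []]
        for values, entry in zip(times[key], row[-3:]):
            values.append(int(entry[:-1]))  # drop the trailing 's'
# average each column separately for every subject/session key
average = {k: [sum(col)/len(col) for col in v] for k, v in times.items()}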
This assumes that the first two entries do have regular structure as in your example and there is no ambiguity when composing the two first entries per row. To be sure one could insert a special delimiter between them in the key.
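A minimal sketch of that ambiguity, using made-up subject and session names (they are not from the question's data): plain concatenation can map two different pairs to the same key, while a delimiter or, more simply, a tuple key keeps them apart.

```python
# Hypothetical rows: two different (subject, session) pairs
pairs = [("Ann", "aSession1"), ("Anna", "Session1")]

# Three ways of building a dictionary key from the two fields
concatenated = {subject + session for subject, session in pairs}
delimited = {subject + "|" + session for subject, session in pairs}
as_tuples = set(pairs)

print(len(concatenated))  # the two pairs collapse into one key
print(len(delimited))     # the delimiter keeps them distinct
print(len(as_tuples))     # tuples keep them distinct as well
```

A delimiter only helps as long as it never occurs in the data itself, so a tuple such as (row[0], row[1]) is probably the safer choice.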
If you are also the person storing the data: Writing the unit of a column in the column header saves transformation effort later and avoids redundant information storage.
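As an illustration of that suggestion, here is a small sketch with a hypothetical layout in which the unit is part of the header ("Sprint (s)" and so on are invented names, not the question's actual columns). The cells then hold plain numbers and no suffix needs stripping.

```python
import csv
import io

# Hypothetical layout: the unit lives in the header, the cells are plain numbers
sample = io.StringIO(
    "Subject,Session,Sprint (s),Jog (s),Walk (s)\n"
    "John Doe,Session2,25,36,55\n"
    "John Doe,Session2,26,35,45\n"
)

rows = list(csv.DictReader(sample))
sprint_times = [int(row["Sprint (s)"]) for row in rows]  # no trailing 's' to remove
sprint_average = sum(sprint_times) / len(sprint_times)
print(sprint_average)
```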
I have python code that gives data in the following list = [author, item, number]
I want to add this data to an Excel file that looks like this: (image not shown)
The python script will:
Check if the author given in the list is in the Author Names column, and add the name if it is not present.
Then the code will add the number in the column that matches the item given.
For example:
['author2', 'Oranges', 300], 300 would be added to Oranges column on the row for author2.
If the person adds a list again like ['author2', 'Oranges', 500] and an input already exists for the item, the number will be added together so the final result is 800.
How do I get started with this? I'm mostly confused about how to read columns/rows to find where to insert things.
Here's one example of how you might do it:
import csv
from collections import defaultdict
# Make a dictionary for the authors, that will automatically start all the
# values at 0 when you try to add a new author
authors = defaultdict(lambda: dict({'Oranges':0, 'Apples':0, 'Peaches':0}))
# Add some items
authors['author1']['Oranges'] += 300
authors['author2']['Peaches'] += 200
authors['author3']['Apples'] += 50
authors['author1']['Apples'] += 20
authors['author2']['Oranges'] += 250
authors['author3']['Apples'] += 100
# Write the output to a csv file, for opening in Excel
with open('authors_csv.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Write Header
    writer.writerow(['Author Names', 'Oranges', 'Apples', 'Peaches'])
    for key, val in authors.items():
        writer.writerow(
            [key,
             val['Oranges'], val['Apples'], val['Peaches']
            ])
For more details on writing out to CSV's you can see the documentation here: https://docs.python.org/3/library/csv.html
Alternatively, just search using DuckDuckGo or your favorite search engine.
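The question also asks how to apply incoming [author, item, number] lists to a sheet that already exists. A rough sketch of one way to do that with only the csv module, assuming the sheet has been exported as CSV with the header used above; the helper names load_totals, add_entry and save_totals are invented for this example, and an in-memory string stands in for the real file.

```python
import csv
import io
from collections import defaultdict

ITEMS = ['Oranges', 'Apples', 'Peaches']  # assumed item columns

def load_totals(file):
    """Read existing totals into {author: {item: number}}."""
    totals = defaultdict(lambda: dict.fromkeys(ITEMS, 0))
    for row in csv.DictReader(file):
        for item in ITEMS:
            # empty cells count as 0
            totals[row['Author Names']][item] = int(row[item] or 0)
    return totals

def add_entry(totals, entry):
    """Apply one [author, item, number] list; unknown authors start at 0."""
    author, item, number = entry
    totals[author][item] += number

def save_totals(totals, file):
    """Write the totals back out with the same header."""
    writer = csv.DictWriter(file, fieldnames=['Author Names'] + ITEMS)
    writer.writeheader()
    for author, counts in totals.items():
        writer.writerow({'Author Names': author, **counts})

# In-memory stand-in for the existing file
existing = io.StringIO("Author Names,Oranges,Apples,Peaches\nauthor1,10,,\n")
totals = load_totals(existing)
add_entry(totals, ['author2', 'Oranges', 300])
add_entry(totals, ['author2', 'Oranges', 500])

out = io.StringIO()
save_totals(totals, out)
print(out.getvalue())
```

With real files, the two StringIO objects would be replaced by open(..., newline='') handles. An item that is not listed in ITEMS raises a KeyError here, which may or may not be the behaviour you want.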
It appears your spreadsheet is stored externally, and you want to update it with new data in the list format [author, item, number].
pandas is great for this. The code below reads in the data file, let's call it authorVolumes.xlsx. This assumes the spreadsheet is already in the folder we are working in and looks as it does in your first picture. The items are also limited to the ones already in the spreadsheet, as the question did not mention adding new ones.
import pandas as pd
df = pd.read_excel('authorVolumes.xlsx', index_col='Author Names').fillna(0)
print(df)
Author Names  Oranges  Apples  Peaches
author1             0       0        0
author2             0       0        0
author3             0       0        0
author4             0       0        0
author5             0       0        0
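Now let's define a function to handle the updates.
def updateVolumes(author, item, amount):
    global df  # df is reassigned below, so it must be declared global
    try:
        df.loc[author, item] += amount
    except KeyError:
        # the author is not in the index yet: append a new row, other items become 0
        df = pd.concat([df,pd.DataFrame([amount], index=[author], columns=[item])]).fillna(0)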
Time to handle the first update: ['author2', 'Oranges', 300]
author, item, amount = ['author2', 'Oranges', 300]
updateVolumes(author, item, amount)
Now to handle one where the author is not there:
author, item, amount = ['author10', 'Apples', 300]
updateVolumes(author, item, amount)
When we are done we can save our Excel file back out to the file system.
df.to_excel('authorVolumes.xlsx')
This is almost the same as my question from yesterday. There, I took it for granted that I could use a list of unique values to create the nested dict & list structure. But then I ran into the question of how to build this dict & list structure (referred to below as the data structure) row by row from the Excel data.
The excel files (multiple files in a folder) all look like the following:
Category    Subcategory  Name
Main Dish   Noodle       Tomato Noodle
Main Dish   Stir Fry     Chicken Rice
Main Dish   Soup         Beef Goulash
Drink       Wine         Bordeaux
Drink       Softdrink    Cola
My desired structure of dict & list structure is:
data = [
    {'data': 0, 'Category': [
        {'name': 'Main Dish', 'Subcategory': [
            {'name': 'Noodle', 'key': 0, 'data': [{'key': 1, 'title': 'Tomato Noodle'}]},
            {'name': 'Stir Fry', 'key': 1, 'data': [{'key': 2, 'title': 'Chicken Rice'}]},
            {'name': 'Soup', 'key': 2, 'data': [{'key': 3, 'title': 'Beef Goulash'}]}]},
        {'name': 'Drink', 'Subcategory': [
            {'name': 'Wine', 'key': 0, 'data': [{'key': 1, 'title': 'Bordeaux'}]},
            {'name': 'Softdrink', 'key': 1, 'data': [{'key': 2, 'title': 'Cola'}]}]}]},
    {'data': 1, 'Category': ...},  # same structure as dataset 0
]
So, for each excel file, it is fine, just loop through and set {'data':0, 'Category':[]}, {'data':1, 'Category':[]} and so on. The key is, for each Category and Subcategory values, Main Dish has three entries in excel, but only needs 1 in the data structure, and Drink has two entries in excel, but only 1 in the data structure. For each subcategory nested in the category list, they follow the same rule, only unique values should be nested to category. Then, each corresponding Name of dishes, they go into the data structure depending on their category and subcategory.
The issue is that I cannot find a good way to convert the data to this data structure. Plus, there are other columns after the Name column, so it is fairly complicated. I was thinking of first extracting the unique values from the entire Category and Subcategory columns. This simplifies the process, but leads to problems when filling in the corresponding Name values. If I take a row-by-row approach instead, then designing an "if subcategory exists" or "if category exists" test to keep the values unique is difficult with my current programming skills...
Therefore, what would be the best approach to convert this excel file into such a data structure? Thank you very much.
One way could be to read the Excel file into a DataFrame using pandas, and then build on this excellent answer: Pandas convert DataFrame to Nested Json
import pandas as pd
excel_file = 'path-to-your-excel.xls'
def fdrec(df):
    drec = dict()
    ncols = df.values.shape[1]
    for line in df.values:
        d = drec
        for j, col in enumerate(line[:-1]):
            if not col in d.keys():
                if j != ncols-2:
                    d[col] = {}
                    d = d[col]
                else:
                    d[col] = line[-1]
            else:
                if j!= ncols-2:
                    d = d[col]
    return drec
df = pd.read_excel(excel_file)
print(fdrec(df))
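If the list-based layout from the question is needed rather than nested dicts, the row-by-row grouping can also be sketched in plain Python. The following is only one reading of the desired structure: it assumes each row has exactly three values (category, subcategory, name), that a subcategory's 'key' is its position within its category, and that the dish 'key' is a running count within the category starting at 1. The function name build_categories is made up for this example.

```python
def build_categories(rows):
    """Group (category, subcategory, name) rows into the nested list layout."""
    # First pass: category -> {subcategory -> [dish names]}.
    # setdefault creates each level only the first time it is seen,
    # so repeated categories and subcategories stay unique.
    grouped = {}
    for category, subcategory, name in rows:
        grouped.setdefault(category, {}).setdefault(subcategory, []).append(name)

    # Second pass: turn the dicts into the list-of-dicts layout.
    result = []
    for category, subcategories in grouped.items():
        dish_key = 0
        sub_list = []
        for sub_key, (subcategory, names) in enumerate(subcategories.items()):
            dishes = []
            for name in names:
                dish_key += 1
                dishes.append({'key': dish_key, 'title': name})
            sub_list.append({'name': subcategory, 'key': sub_key, 'data': dishes})
        result.append({'name': category, 'Subcategory': sub_list})
    return result

# Rows as they appear in the example sheet
rows = [
    ('Main Dish', 'Noodle', 'Tomato Noodle'),
    ('Main Dish', 'Stir Fry', 'Chicken Rice'),
    ('Main Dish', 'Soup', 'Beef Goulash'),
    ('Drink', 'Wine', 'Bordeaux'),
    ('Drink', 'Softdrink', 'Cola'),
]
data = [{'data': 0, 'Category': build_categories(rows)}]
print(data)
```

This relies on dicts preserving insertion order (Python 3.7+), so categories and subcategories come out in the order they first appear in the sheet. Extra columns after Name would need the unpacking in the first loop adjusted.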
I know Python is almost made for this kind of purpose, but I am really struggling to understand how to access specific values in the dataset, and I tried both the pandas and csv modules. It is probably a matter of syntax. Here's the thing: I have a csv file in the form of
Nation, Year, No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
...
...
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
...
and so on. What I would like to do is to get the total amount of refugees per year, i.e. selecting all the rows with a certain year and summing all the elements in the related "no. of refugees" column. I managed to do this:
import csv
with open('refugees.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    print headers
    #2013
    list2013=[]
    for line in d_reader:
        if (line['Year']=='2013'):
            list2013.append(line['Refugees'])
    list2013=map(int,list2013) #I have str values in my file
    ref13=sum(list2013)
but I am looking for a more elegant (and, above all, iterative) solution. Moreover, if I perform that procedure multiple times for different years, I always get 0: it works for 2013 only, not sure why.
Edit: I tried this as well, without success, but I think this could be totally wrong:
import csv
refugees_dict={}
a=range(2005,2014)
a=map(str, a)
with open('refugees.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    for element in a:
        for line in d_reader:
            if (line['Year']==element):
                print 'hello!'
                temp_list=[]
                temp_list.append(line['Refugees'])
                temp_list=map(int, temp_list)
                refugees_dict[a]=sum(temp_list)
print refugees_dict
The next step of my work will involve further studies on the dataset, eg I am probably gonna need to access data nation-wise instead of year-wise, and I really appreciate any hint so I understand how to manipulate data.
Thanks a lot.
Since you tagged pandas in the question, here's a pandas solution to getting the number of refugees per year.
Let's say my input csv looks like this (note that I've eliminated the extra space before the column names):
Nation,Year,No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
You can read that into a pandas DataFrame like this:
import pandas as pd
df = pd.read_csv('data.csv')
You can then get the total like this:
df.groupby('Year')[['No. of refugees']].sum()
This gives:
      No. of refugees
Year
2012             7370
2013             7150
Consider:
from collections import defaultdict
by_year = defaultdict(int) # a dict that has a 0 under every key.
and then
by_year[line['Year']] += int(line['Refugees'])
Now you can just look at by_year['2013'] and see your sum (same for other years).
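Put together, the idea might look like the sketch below. It uses an in-memory sample in place of refugees.csv and keeps the 'Refugees' column name from the question's code, both of which are assumptions. It also makes only one pass over the reader: a csv reader is an iterator that can be consumed once, which is why looping over it a second time for another year yields nothing.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for refugees.csv (a subset of the question's rows)
sample = io.StringIO(
    "Nation,Year,Refugees\n"
    "Afghanistan,2013,6657\n"
    "Albania,2013,199\n"
    "Afghanistan,2012,6960\n"
    "Albania,2012,157\n"
)

by_year = defaultdict(int)
by_nation = defaultdict(int)
for line in csv.DictReader(sample):  # a single pass over the reader
    by_year[line['Year']] += int(line['Refugees'])
    by_nation[line['Nation']] += int(line['Refugees'])

print(dict(by_year))
print(dict(by_nation))
```

Collecting both totals in the same loop also covers the nation-wise access mentioned at the end of the question.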
To sum by year you can try this:
f = open('file.csv').readlines()
# split each line into fields, skipping the header row
f = [i.strip('\n').split(',') for i in f[1:]]
years = {i[1]:0 for i in f}
for i in f:
    years[i[1]] += int(i[-1])
Now, you have a dictionary that has the sum of all the refugees by year.
To access nation-wise:
nations = {i[0]:0 for i in f}
for i in f:
    nations[i[0]] += int(i[-1])
I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use the thousands=',' argument for numbers that contain a comma:
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check that Prize_Pool is now numerical:
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop rows with a repeated timestamp, keep the first one observed (keep='last' keeps the last one instead):
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
  Sport                                              Entry  \
0   NBA  NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...

      Contest_Date_EST  Place  Points  Winnings_Non_Ticket  Winnings_Ticket  \
0  2015-03-01 13:00:00     35  283.25                13.33                0

   Contest_Entries  Entry_Fee  Prize_Pool  Places_Paid
0              171         20        3000           35
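For comparison, the same two steps can be sketched with only the standard library. The sample below is a cut-down, partly invented version of the data (three columns, shortened entry names, and a made-up third row), and the helpers to_float and unique_by are names chosen for this example. The keep argument mirrors the choice of which repeated entry survives.

```python
import csv
import io

# Cut-down, partly made-up sample in the same quoting style as the question's file
sample = io.StringIO(
    '"Entry","Contest_Date_EST","Prize_Pool"\n'
    '"Crossover #3","2015-03-01 13:00:00","3,000.00"\n'
    '"Layup #4","2015-03-01 13:00:00","1,500.00"\n'
    '"Late Game","2015-03-01 19:00:00","500.00"\n'
)

def to_float(text):
    """Convert a string such as '3,000.00' to a float."""
    return float(text.replace(',', ''))

def unique_by(rows, column, keep='first'):
    """Keep one row per distinct value of column (the first or the last seen)."""
    chosen = {}
    for row in rows:
        if keep == 'last' or row[column] not in chosen:
            chosen[row[column]] = row
    return list(chosen.values())

rows = list(csv.DictReader(sample))
for row in rows:
    row['Prize_Pool'] = to_float(row['Prize_Pool'])

first = unique_by(rows, 'Contest_Date_EST')
last = unique_by(rows, 'Contest_Date_EST', keep='last')
print([r['Entry'] for r in first])
print([r['Entry'] for r in last])
```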
Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools
class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from the CSV file, grouped by timestamp
        rows = self.readCsv(filename)
        for key in rows.keys():
            print("\nKey: " + key)
            for i, value in enumerate(rows[key], start=1):
                print("\nValue {index} : {value}".format(index = i, value = value))
    def readCsv(self, fileName):
        with open(fileName, newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            # groupby only merges adjacent rows, so sort by the grouping key first
            getDate = lambda x : x["Contest_Date_EST"]
            groupedRows = {}
            for k, g in itertools.groupby(sorted(reader, key=getDate), getDate):
                groupedRows[k] = [self.normalizeRow(v) for v in g]
            return groupedRows
    def normalizeRow(self, row):
        # Strip the thousands separator so the value can be used as a number
        row["Prize_Pool"] = float(row["Prize_Pool"].replace(',',''))
        # and so on
        return row
if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
I need a little help reading specific values into a dictionary using Python. I have a csv file with user numbers: user 1, 2, 3, and so on. Each user is in a specific department (1, 2, 3...) and each department is in a specific building (1, 2, 3...). I need to know how to list all the users in department 1 in building 1, then department 2 in building 1, and so on. I have read everything into one massive dictionary using csv.DictReader, but that would only help if I could search through the entries in each dictionary of dictionaries. Any ideas for how to sort through this file? The CSV has over 150,000 entries for users. Each row is a new user and lists 3 attributes: user name, department number, and department building. There are 100 departments, 100 buildings, and 150,000 users. Any ideas on a short script to sort them all out? Thanks in advance for your help.
A brute-force approach would look like
import csv
csvFile = csv.reader(open('myfile.csv'))
data = list(csvFile)
data.sort(key=lambda x: (x[2], x[1], x[0]))
It could then be extended to
import csv
import collections
data = collections.defaultdict(lambda: collections.defaultdict(list))
with open('myfile.csv', newline='') as f:
    for name, dept, building in csv.reader(f):
        data[building][dept].append(name)
for building in sorted(data):
    print("Building {0}".format(building))
    for dept in sorted(data[building]):
        print("  Dept {0}".format(dept))
        for name in sorted(data[building][dept]):
            print("    ", name)
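One caveat with the sorting above: the csv module yields every field as a string, so building and department numbers sort as text and "10" comes before "2". A small sketch of the difference, using invented rows in the question's column order and assuming the department and building fields are purely numeric:

```python
# Hypothetical rows in the question's column order: user, department, building
data = [
    ['user7', '10', '2'],
    ['user3', '2', '10'],
    ['user5', '2', '2'],
]

# Text sort: '10' sorts before '2' because the comparison is character by character
as_text = sorted(data, key=lambda x: (x[2], x[1], x[0]))
# Numeric sort: convert building and department to int inside the key
as_numbers = sorted(data, key=lambda x: (int(x[2]), int(x[1]), x[0]))

print([row[0] for row in as_text])
print([row[0] for row in as_numbers])
```

If the file starts with a header row, it would need to be skipped before the int conversion.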