I'm parsing a big CSV file using csv.DictReader.
quotes = open("file.csv", "rb")
csvReader = csv.DictReader(quotes)
Then for each row I'm converting the date and time values in the CSV to a datetime using this:
for data in csvReader:
    year = int(data["Date"].split("-")[2])
    month = strptime(data["Date"].split("-")[1], '%b').tm_mon
    day = int(data["Date"].split("-")[0])
    hour = int(data["Time"].split(":")[0])
    minute = int(data["Time"].split(":")[1])
    bars = datetime.datetime(year, month, day, hour, minute)
Now I would like to perform actions only on the rows of the same day. Would it be possible to do it in the same for loop, or should I save the data out per day and then perform the actions? What would be an efficient way of doing the parsing?
As jogojapan has pointed out, it is important to know whether we can assume that the CSV file is sorted by date. If it is, then you could use itertools.groupby to simplify your code. For example, the for loop in this code iterates over the data one day at a time:
import csv
import datetime
import itertools
with open("file.csv", "rb") as quotes:
    csvReader = csv.DictReader(quotes)
    lmb = lambda d: datetime.datetime.strptime(d["Date"], "%d-%b-%Y").date()
    for k, g in itertools.groupby(csvReader, key=lmb):
        # do stuff per day
        counts = (int(data["Count"]) for data in g)
        print "On {0} the total count was {1}".format(k, sum(counts))
I created a test "file.csv" containing the following data:
Date,Time,Count
1-Apr-2012,13:23,10
2-Apr-2012,10:57,5
2-Apr-2012,11:38,23
2-Apr-2012,15:10,1
3-Apr-2012,17:47,123
3-Apr-2012,18:21,8
and when I ran the above code I got the following results:
On 2012-04-01 the total count was 10
On 2012-04-02 the total count was 29
On 2012-04-03 the total count was 131
But remember that this will only work if the data in "file.csv" is sorted by date.
If (for some reason) you can assume that the input rows are already sorted by date, you could put them into a local container one by one as long as the date of any new row is the same as the previous one:
same_date_rows = []
prev_date = None
for data in csvReader:
    # ... your existing code
    bars = datetime.datetime(year, month, day, hour, minute)
    # Compare calendar dates only; bars also carries the hour and minute
    if bars.date() == prev_date:
        same_date_rows.append(data)
    else:
        # New date. We process all rows collected so far
        # (nothing has been collected yet on the very first row)
        if same_date_rows:
            do_something(same_date_rows)
        # Then we start a new collection for the new date
        same_date_rows = [data]
    # Remember the date of the current row
    prev_date = bars.date()
# Finally, process the final group of rows
if same_date_rows:
    do_something(same_date_rows)
But if you cannot make that assumption, you will have to
Either: Put the rows in a long list, sort that by date, and then apply an algorithm like the above to the sorted list
Or: Put the rows in a dictionary, using the date as key, and a list of rows as value for each key. Then you can iterate through the keys of that dictionary to get access to all rows that share a date.
The second of these two approaches is a little more space-consuming, but it may allow you to do some of the date-specific processing in the main loop: whenever you receive a new row for an already-existing date, you could apply some of the date-specific processing right away, possibly avoiding the need to store all date-specific rows explicitly. Whether that is possible depends on what kind of processing you apply to the rows.
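The first option (sort, then group) can be sketched roughly as follows. This is only an illustration in Python 3 syntax, not the asker's actual code: the in-memory SAMPLE text stands in for "file.csv", the column names Date and Count follow the test file shown earlier, and row_date and totals_per_day are made-up helper names.

```python
import csv
import io
import itertools
from datetime import datetime

# Hypothetical, deliberately unsorted sample in the same layout as the test file.
SAMPLE = """Date,Time,Count
3-Apr-2012,17:47,123
1-Apr-2012,13:23,10
2-Apr-2012,10:57,5
3-Apr-2012,18:21,8
2-Apr-2012,11:38,23
"""


def row_date(row):
    """Parse the Date column (e.g. '2-Apr-2012') into a datetime.date."""
    return datetime.strptime(row["Date"], "%d-%b-%Y").date()


def totals_per_day(lines):
    """Return {date: total count}; sorting first lets groupby see each day once."""
    rows = sorted(csv.DictReader(lines), key=row_date)
    return {
        day: sum(int(row["Count"]) for row in group)
        for day, group in itertools.groupby(rows, key=row_date)
    }


totals = totals_per_day(io.StringIO(SAMPLE))
for day, total in totals.items():
    print("On {0} the total count was {1}".format(day, total))
```

Sorting loads every row into memory, so this only suits files that fit in RAM; for larger files the dictionary approach with on-the-fly processing is probably the better fit.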
If you are not going for space efficiency, an elegant solution would be to create a dictionary where the key is your day and the value is a list object in which all the information for that day is stored. Later you can do whatever operations you want on a per-day basis.
For example
d = {}  # Initialize empty dictionary
for data in csvReader:
    Day = int(data["Date"].split("-")[0])
    try:
        d[Day].append('Some_Val')
    except KeyError:
        d[Day] = ['Some_Val']
This will either modify or create a new list object for each day. This is later easily accessible either by iterating over the dictionary or simply referring to the day as a key.
For example:
d[Some_Day]
will simply give you a list object with all the information you have stored. Since a dictionary lookup takes constant time on average, this should be quite efficient in terms of time.
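One caveat with keying on the day of the month alone: rows from 1-Apr and 1-May would land in the same list. A possible variation, sketched here in Python 3 with a made-up in-memory sample instead of the real file, keys on the full date and uses collections.defaultdict so the try/except is not needed. The name rows_by_date is just illustrative.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical sample: two different months share the same day of the month.
SAMPLE = """Date,Time,Count
1-Apr-2012,13:23,10
1-May-2012,09:00,7
1-Apr-2012,15:10,1
"""

# A missing key automatically starts out as an empty list.
rows_by_date = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    day = datetime.strptime(row["Date"], "%d-%b-%Y").date()
    rows_by_date[day].append(row)

# Iterate over the days in chronological order.
for day in sorted(rows_by_date):
    print(day, len(rows_by_date[day]))
```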
I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple excel files.)
I use a for-loop and append(), but it returns only the last values, not all of them.
Here is my code:
hh = ['01:00', '02:00', '03:00', '04:00', '05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly, reads individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    s_td = data.iloc[:, 4].std()
    meean = data.iloc[:, 4].mean()
    final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
    final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
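I am not sure, but I believe you should assign the result of final.append(...) back to a variable, since append returns a new DataFrame rather than modifying final in place:
final = final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)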
Update
If time efficiency is of interest to you, it is better to collect your desired values ({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}) in a list and build the dataframe from that list. This is reported to perform better. (Thanks to @stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh = ['01:00', '02:00', '03:00', '04:00', '05:00']
lister = []
final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    s_td = data.iloc[:, 4].std()
    meean = data.iloc[:, 4].mean()
    lister.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
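Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent install the list of dicts can be handed straight to the DataFrame constructor. A minimal sketch, assuming a stand-in get_data (the real one reads Excel sheets and was not posted) with made-up load values:

```python
import pandas as pd


def get_data(month, hour):
    """Hypothetical stand-in for the asker's Excel reader; values are made up."""
    loads = {"01:00": [10.0, 12.0, 14.0], "02:00": [20.0, 20.0, 20.0]}
    return {"Total Load (MWh)": loads[hour]}


rows = []
for hour in ["01:00", "02:00"]:
    data = pd.DataFrame(get_data(1, hour))
    load = data["Total Load (MWh)"]
    # Collect one plain dict per hour; no DataFrame is touched inside the loop.
    rows.append({
        "Month": 1,
        "Hour": hour,
        "standard deviation": load.std(),
        "average": load.mean(),
    })

# Build the result once, after the loop.
final = pd.DataFrame(rows, columns=["Month", "Hour", "standard deviation", "average"])
print(final)
```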
Conceptually you're just aggregating by hour with the two functions std and mean, then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note that .agg()/.aggregate() accepts a list of aggregating functions (or, on a DataFrame, a dict of {'column': aggregating_function(s)}), so you can compute several statistics in one call without declaring temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), then there is no need to read in columns 0..3.
final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
for hour in hh:
    # Read in columns-of-interest from individual Excel sheet for this month and hour...
    data = get_data(1, hour)
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    # Compute corresponding row of the aggregate...
    stats = data['Total Load (MWh)'].agg(['std', 'mean'])
    dat_hh_aggregate = {'Month': 1, 'Hour': hour, 'standard deviation': stats['std'], 'average': stats['mean']}
    final = final.append(dat_hh_aggregate, ignore_index=True)
Notes:
pd.read_excel usecols=['Flowday','Interval',...] allows you to avoid reading in columns that you aren't interested in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass the list of columns-of-interest. But you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store separate local variables s_td, meean, just directly use .aggregate()
There's no need to have both lister and final. Just have one results dataframe final, and append to it, ignoring the index. (If you get issues with that, post updated code here, make sure it's reproducible)
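If all the hourly sheets can be stacked into one dataframe with an Hour column, the loop over aggregates may not be needed at all; a single groupby plus agg could do it. The sketch below assumes made-up load values in place of the Excel data, and the names frames and combined are only illustrative.

```python
import pandas as pd

# Hypothetical per-hour data standing in for the individual Excel sheets.
sheets = {"01:00": [10.0, 12.0, 14.0], "02:00": [20.0, 22.0]}

frames = []
for hour, loads in sheets.items():
    # Tag every row with its hour so the frames can be stacked.
    frames.append(pd.DataFrame({"Hour": hour, "Total Load (MWh)": loads}))
combined = pd.concat(frames, ignore_index=True)

# One pass: std and mean of the load column for each hour.
final = (
    combined.groupby("Hour")["Total Load (MWh)"]
    .agg(["std", "mean"])
    .rename(columns={"std": "standard deviation", "mean": "average"})
    .reset_index()
)
print(final)
```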
This is an assignment for school. I have a text file which contains the following abbreviated list, with each entry on a single line. The first entry is the date and the second entry, after the pipe, is the value for the stock market close. There are approximately 365 entries in the file.
8/28/2018|26064.01953
8/29/2018|26124.57031
8/30/2018|25986.91992
Using the following code I have split the data into a list of lists with the date and value separated.
import os
import math
import statistics

def main():
    infile = open('DJI.txt', 'r')
    values = infile.read()
    infile.close()
    values = values.split("\n")
    values = [value.split("|") for value in values]
    print(values)
    avg = sum([float(l[1]) for l in values])/len(values)

main()
This gives the following output
[['8/28/2018', '26064.01953'], ['8/29/2018', '26124.57031'], ['8/30/2018', '25986.91992'],
The avg line gives the following error: IndexError: list index out of range
My task is to create a program which calculates
Average close value for the entire year.
Average close value per month
Highest close value and the date on which that happened.
Lowest close value and the date on which that happened.
Sort prices lowest to highest and write the sorted list to a new text file called DJI_Sorted.
I am having trouble with how to access the second value in the list of lists to perform the statistics on the file. I am also unsure how I would write code which sorts the list from lowest to highest, or which computes the average close for each month rather than for the entire file.
Your help is greatly appreciated.
You can access each list in your list with values[i] and each element of such a list with values[i][j]. So the value of the 10th date would be values[9][1]. Since you know the number of elements per inner list and it is rather small, you could also unpack your lists.
Example with three elements per inner list: a,b,c = values[i].
You want to iterate over the entire list, so a for-loop is what you need and instead of handling indices you can directly unpack the inner lists in variables with meaningful names.
highest = float("-inf")  # start below any real closing value
highest_date = None
for date, value in values:
    value = float(value)
    if value > highest:
        highest = value
        highest_date = date
Another option would be a list comprehension:
avg = sum([float(l[1]) for l in values])/len(values)
Since this is your homework, I don't want to give you a complete solution, but this should be enough to solve all your questions described above.
One last tip: for monthly statistics you need to further split the date. From there you have multiple options (saving the values for each month in a list/dictionary, or computing the statistics on the fly).
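As a partial illustration of that last tip (not a full solution), here is one way the splitting and grouping could look. It also skips blank lines: if the file ends with a newline, split("\n") produces a final empty string, and that is a likely cause of the IndexError on l[1]. The sample values are made up, and parse, records and by_month are hypothetical names.

```python
from collections import defaultdict

# Made-up closing values in the same date|value layout, ending with a newline.
SAMPLE = "8/28/2018|10.0\n8/29/2018|20.0\n9/4/2018|40.0\n"


def parse(text):
    """Turn 'date|value' lines into (date, float) pairs, skipping blank lines."""
    records = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # the trailing newline leaves an empty string with no second field
        date, value = line.split("|")
        records.append((date, float(value)))
    return records


records = parse(SAMPLE)

# Group the values by (year, month) taken from the m/d/yyyy date.
by_month = defaultdict(list)
for date, value in records:
    month, _day, year = date.split("/")
    by_month[(int(year), int(month))].append(value)

monthly_avg = {key: sum(vals) / len(vals) for key, vals in by_month.items()}
print(monthly_avg)
```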
I am trying to write a simple code in which I have units produced in a dataframe 'Yield' and 'Date' on which they were produced. Multiple records are present for the same date. I am going to use numpy cumsum function to get running total for each row and then subtract the value for the current row. I do not wish to do aggregation for the date since I need the original raw records to remain.
I can do this for one date by selecting its rows with .loc and then applying the function, but I can't figure out how to do this iteratively for every date.
data_43102 = data[['Yield_Done', 'PDate']].loc[data['PDate'] == 43102]
# gives me Yield Done for only 43102
data_43102['Running_total'] = cumsum(data_43102['Yield_Done'])  # gives me cumulative total
data_43102['Running_total'] = data_43102['Running_total'] - data_43102['Yield_Done']
What I expect after running the code is output for every date like the output I got for this one date.
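You can store all the dates in a list and then use isin to get the data filtered for all of those dates (the values must have the same type as the PDate column, which is compared to the integer 43102 in the question):
dates = [43102, 23102, 43102, ...]
data_filtered_by_date = data[['Yield_Done', 'PDate']].loc[data['PDate'].isin(dates)]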
I hope this helps.
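For the per-date running total itself, a different technique from the isin filter above may be closer to what was asked: groupby with cumsum keeps every raw row and restarts the total for each date. A small sketch with made-up numbers, using the column names from the question:

```python
import pandas as pd

# Made-up records; several rows share the same PDate.
data = pd.DataFrame({
    "PDate": [43102, 43102, 43103, 43102, 43103],
    "Yield_Done": [5, 3, 10, 2, 4],
})

# Cumulative sum within each date, aligned to the original row order.
running = data.groupby("PDate")["Yield_Done"].cumsum()

# Subtract the current row so each row shows the total of the rows before it.
data["Running_total"] = running - data["Yield_Done"]
print(data)
```

No aggregation happens here, so the frame keeps one row per original record.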
I am currently trying to figure out a way to get information stored across multiple datasets as .csv files.
Context
For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv, and experiment_4.csv. In each dataset, there are 20,000+ rows with 80+ columns in each row. Each row represents an Animal, identified by an ID number, and each column represents various experimental data about that Animal. Assume each row's Animal ID number is unique within each dataset, but not across all datasets. For instance, ID#ABC123 can be found in experiment_1.csv and experiment_2.csv, but not in experiment_3.csv and experiment_4.csv.
Problem
Say a user wants to get info for ~100 Animals by looking up each Animal's ID # across all datasets. How would I go about doing this? I'm relatively new to programming, and I would like to improve. Here's what I have so far.
class Animal:
    def __init__(self, id_number, *other_parameters):
        self.animal_id = id_number
        self.animal_data = {}

    def store_info(self, csv_row, dataset):
        self.animal_data[dataset] = csv_row

# Main function
# ...
# Assume animal_queries = list of Animal Objects
# Iterate through each dataset csv file
for dataset in all_datasets:
    # Make a copy of the list of queries
    animal_queries_copy = animal_queries[:]
    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file, delimiter=',')
        # Iterate through each row in the csv file
        for row in reader:
            # Check if the list is not empty
            if animal_queries_copy:
                # Get the current row's animal id number
                row_animal_id = row['ANIMAL ID']
                # Check if the animal id number matches with a query for
                # every animal in the list
                for animal in animal_queries_copy[:]:
                    if animal.animal_id == row_animal_id:
                        # If a match is found, store the info, remove the
                        # query from the list, and exit iterating through
                        # each query
                        animal.store_info(row, dataset)
                        animal_queries_copy.remove(animal)
                        break
            # If the list is empty, all queries were found for the current
            # dataset, so exit iterating through rows in reader
            else:
                break
Discussion
Is there a more obvious approach for this? Assume that I want to use .csv files for now, and I will consider converting these .csv files to an easier-to-use format like SQL Tables later down the line (I am an absolute beginner at databases and SQL, so I need to spend time learning this).
The one thing that sticks out to me is that I have to create multiple copies of animal_queries: 1 for each dataset, and 1 for each row in a dataset (in the for loop). Since 1 row only contains 1 ID, I can exit the loop early once I find a match to an ID from animal_queries. In addition, since that ID was already found, I no longer need to search for that ID for the rest of the current dataset, so I remove it from the list, but I need to keep the original copy of the queries since I also need it to search the remaining datasets. However, I can't remove an element from a list while inside a for loop, so I need to create another copy as well. This doesn't seem optimal to me and I'm wondering if I'm approaching this in the wrong direction. Any help would be appreciated, thanks!
Well, you could greatly speed this up by using the pandas library for one thing. Ignoring the class definition for now, you could do the following:
import pandas as pd
file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar'] #input by user
#create list of data sets
data_sets = [pd.read_csv(_file) for _file in file_names]
#create store of retrieved data
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]
#concatenate the data
final_data = pd.concat(retrieved_data)
#export to csv
final_data.to_csv('your_data')
This simplifies things a lot. The isin method selects the rows of each data frame whose ANIMAL ID is found in the list animal_queries. Incidentally, pandas will also help you work with SQL tables, so it is probably a good route for you to go down.
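If you would rather stay with the csv module for now, a set of the queried IDs removes both the inner loop and the list copies, because membership tests on a set take constant time on average. This is only a sketch: the in-memory DATASETS dict stands in for the real files, WEIGHT is a made-up column, and lookup is a hypothetical helper name.

```python
import csv
import io

# Hypothetical in-memory stand-ins for experiment_1.csv and experiment_2.csv.
DATASETS = {
    "experiment_1.csv": "ANIMAL ID,WEIGHT\nABC123,4.2\nXYZ789,3.1\n",
    "experiment_2.csv": "ANIMAL ID,WEIGHT\nABC123,4.5\nDEF456,2.8\n",
}


def lookup(animal_ids, datasets):
    """Return {animal_id: {dataset_name: row}} for the requested IDs."""
    wanted = set(animal_ids)
    found = {animal_id: {} for animal_id in wanted}
    for name, text in datasets.items():
        remaining = set(wanted)  # fresh copy of the queries for each dataset
        for row in csv.DictReader(io.StringIO(text)):
            animal_id = row["ANIMAL ID"]
            if animal_id in remaining:
                found[animal_id][name] = row
                remaining.discard(animal_id)
                if not remaining:
                    break  # every queried ID was found in this dataset
    return found


results = lookup(["ABC123", "DEF456"], DATASETS)
for animal_id, per_dataset in sorted(results.items()):
    print(animal_id, sorted(per_dataset))
```

With real files, the io.StringIO(text) part would be replaced by an open(...) call like the one in the question.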
I am beginning to move from R to Python and have a stupid question.
I have been looking for close to 5 hours to find a solution to my question.
I have the following code in R, which essentially takes the dataframe df and aggregates the outdates from a hospital based on unique ids. So my original table has many UIds repeated since someone may visit a hospital many times and each time they leave the hospital they have an out date. I want the UID, and all the outdates in one row. I could do this very easily with the following code in R.
newdf= aggregate(data = df, OutDate~UID, FUN=paste, sep="," )
Can anyone tell me how this can be accomplished in Python?
Here's what my table looks like after using the above function in R:
-UID1, 10/20/2008, 11/30/2008, 1/1/1900, 1/1/1900
-UID2, 6/19/2010, 1/1/1900
-UID3, 11/17/2009
-UID4, 3/14/2010 , 4/20/2010, 1/1/1900, 1/1/1900
-UID5, 12/12/2008, 8/27/2009, 1/1/1900
Ignore the dates, I just made them up. But the output needs to look like the above.
Previously I had multiple UID1 rows, one for each of the dates that are now in columns.
Now how do I do this in Python?
You can do this with a defaultdict:
from collections import defaultdict
d = defaultdict(list)
for f in df.values:
    # Assuming the first value is the UID and the second is the OutDate:
    d[f[0]].append(f[1])
Now d is a dictionary, where each key is the UID and each value is a list of the out dates for that UID. You can combine them into a string (like what you are doing with paste), like this:
for uid, values in d.items():
    print('{},{}'.format(uid, ','.join(str(v) for v in values)))
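If df is a pandas DataFrame, the closest counterpart to R's aggregate with paste is probably groupby followed by agg with a string join. A sketch with made-up rows, assuming the columns are named UID and OutDate as in the R formula and that the dates are stored as strings:

```python
import pandas as pd

# Made-up visits; UID1 appears twice.
df = pd.DataFrame({
    "UID": ["UID1", "UID2", "UID1", "UID3"],
    "OutDate": ["10/20/2008", "6/19/2010", "11/30/2008", "11/17/2009"],
})

# One row per UID, with all of its out dates joined by commas.
newdf = df.groupby("UID")["OutDate"].agg(",".join).reset_index()
print(newdf)
```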
This sounds like building a dictionary where the key is the UID and you append each out date to that key's list as you loop through the data. This assumes that you are getting the data in the form of a csv file where each row of data is read by csv.DictReader. I make the assumption based on what you seem to show of the data file and the separators. As a result, each entry in the row (which can include in time, out time, diagnosis, etc.) is keyed by the header row. I will also assume that you know how to read the data in with the csv module. The quick code below shows how to generate the dictionary entries from each row once you have it.
I show the final way the data will look, followed by how it was derived.
# data = {UID1: [out1, out2, out3], UID2: [out3, out4]}
data = {}
for d in datarow:
    # "UID" and "OutDate" are assumed to be the header names in the csv file
    uid = d["UID"]
    if uid not in data:
        data[uid] = []  # a list, because tuples have no append()
    out = d["OutDate"]
    data[uid].append(out)