aggregating and pasting - python

I am beginning to move from R to Python and have a stupid question.
I have been looking for close to 5 hours to find a solution to my question.
I have the following code in R, which essentially takes the dataframe df and aggregates the outdates from a hospital based on unique ids. So my original table has many UIds repeated since someone may visit a hospital many times and each time they leave the hospital they have an out date. I want the UID, and all the outdates in one row. I could do this very easily with the following code in R.
newdf= aggregate(data = df, OutDate~UID, FUN=paste, sep="," )
Can anyone pray tell me how this can be accomplished in Python?
HEre's what my table looks like after using the above function in R
-UID1, 10/20/2008, 11/30/2008, 1/1/1900, 1/1/1900
-UID2, 6/19/2010, 1/1/1900
-UID3, 11/17/2009
-UID4, 3/14/2010 , 4/20/2010, 1/1/1900, 1/1/1900
-UID5, 12/12/2008, 8/27/2009, 1/1/1900
Ignore the dates, i just made them up. But the output needs to look like above.
Previously I had multiple UID1 rows for each of the dates in the current columns.
Now how do I do this in python.

You can do this with a dictionary comprehension:
from collections import defauldict
d = defaultdict(list)
for f in df.values():
// Assuming the first value is the UID:
d[f[0]].append(f)
Now d is a dictionary, where each key is the UID and the values are a list of rows from the dataframe. You can combine them into a string (like what you are doing with paste), like this:
for uid,values in d.iteritems():
for value in values:
print('{},{}'.format(uid,','.join(value)))

This sounds like building a dictionary where the key is the UID and you append each outdate to the key as you loop through the data. This assumes that you are getting the data in the form of a csv file where3 each row of data is read by csv.DictReader. I make the assumption based on what you seem to show of the data file and the separators. As a result, each entry in the row (which can include in time, out time, diagnosis, etc) is keyed by the header row. I will alsao assume that you can tell how to read the data into csv processing. The quick code below shows how to generate the dictionary entries from the row once you have it in.
I show the final way the data will look followed by how it was derived.
data = {UID1:(out1, out2, out3), UID2:(out3, out4)}
data = {}
for d in datarow:
uid = d[UID]
if uid not in data.keys():
data[uid] = ()
out = d[OUT]
data[uid].append(out)

Related

Append std,mean columns to a DataFrame with a for-loop

I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple excel files.)
I use a for-loop and append(), but it returns the last ones, not the whole.
here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
month = 1
hour = j
data = get_data(month, hour) ## it works correctly, reads individual Excel spreadsheet
data = pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
s_td = data.iloc[:,4].std()
meean = data.iloc[:,4].mean()
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
final.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean},ignore_index=True)
I am not sure, but I believe you should assign the final.append(... to a variable:
final = final.append({'Month':j ,'Hour':j,'standard deviation':x,'average':y},ignore_index=True)
Update
If time efficiency is of interest to you, it is suggested to use a list of your desired values ({'Month':j ,'Hour':j,'standard deviation':x,'average':y}), and assign this list to the dataframe. It is said it has better performance.(Thanks to #stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh=['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:``
month=1
hour = j
data = get_data(month, hour) ## it works correctly
data=pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
s_td=data.iloc[:,4].std()
meean=data.iloc[:,4].mean()
lister.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean})
final = final.append(pd.DataFrame(lister),ignore_index=True)
Conceptually you're just doing aggregate by hour, with the two functions std, mean; then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note the .agg/.aggregate() function accepts a dict of {'result_col': aggregating_function} and allows you to pass multiple aggregating functions, and directly name their result column, so no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), then no need to read in columns 0..3.
for hour in hh:
# Read in columns-of-interest from individual Excel sheet for this month and day...
data = get_data(1, hour)
data = pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
# Compute corresponding row of the aggregate...
dat_hh_aggregate = pd.DataFrame({['Month':whatever ,'Hour':hour]})
dat_hh_aggregate = dat_hh_aggregate.append(data.agg({'standard deviation':pd.Series.std, 'average':pd.Series.mean)})
final = final.append(dat_hh_aggregate, ignore_index=True)
Notes:
pd.read_excel usecols=['Flowday','Interval',...] allows you to avoid reading in columns that you aren't interested in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass the list of columns-of-interest. But you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store separate local variables s_td, meean, just directly use .aggregate()
There's no need to have both lister and final. Just have one results dataframe final, and append to it, ignoring the index. (If you get issues with that, post updated code here, make sure it's reproducible)

how to divide pandas dataframe into different dataframes based on unique values from one column and itterate over that?

I have a dataframe with three columns
The first column has 3 unique values I used the below code to create unique dataframes, However I am unable to iterate over that dataframe and not sure how to use that to iterate.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique()) ### lets assume Unique values are 0,1,2
mtlist = []
for index, value in enumerate(unique_groups):
globals()['df%s' % index] = df[df.iloc[:,0] == value]
mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
for example lets say I want to find out the length of the first unique dataframe
if I manually type the name of the DF I get the correct output
len(df0)
O/P
35
But I am trying to automate the code so technically I want to find the length and itterate over that dataframe normally as i would by typing the name.
What I'm looking for is
if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a Dictionary using the below code but I cant figure out how to iterate over the dictionary when the DF columns are more than two, where key would be the unique group and the value containes the two columns in same line.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Lets assume there are 100 rows of data
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is GROUP A tickets should go to one group, however for each description an unique task has to be created, but we can club 10 task once and submit as one request so if I divide the df's into different df based on the Assignment_group it would be easier to iterate over(thats the only idea which i could think of)
For example lets say we have REQUEST001
within that request it will have multiple sub tasks such as STASK001,STASK002 ... STASK010.
hope this helps
Your problem is easily solved by groupby: one of the most useful tools in pandas. :
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kind of operations (sum, count, std, etc) on your remaining columns, like getting the mean value of price for each group if that was a column.
I think you want to try something like len(eval('df%s' % 0))

parsing CSV to pandas dataframes (one-to-many unmunge)

I have a csv file imported to a pandas dataframe. It probably came from a database export that combined a one-to-many parent and detail table. The format of the csv file is as follows:
header1, header2, header3, header4, header5, header6
sample1, property1,,,average1,average2
,,detail1,detail2,,
,,detail1,detail2,,
,,detail1,detail2,,
sample2, ...
,,detail1,detail2,,
,,detail1,detail2,,
...
(i.e. line 0 is the header, line 1 is record 1, lines 2 through n are details, line n+1 is record 2 and so on...)
What is the best way to extricate (renormalize?) the details into separate DataFrames that can be referenced using values in the sample# records? The number of each subset of details are different for each sample.
I can use:
samplelist = df.header2[pd.notnull(df.header2)]
to get the starting index of each sample so that I can grab samplelist.index[0] to samplelist.index[1] and put it in a smaller dataframe. Detail records by themselves have no reference to which sample they came from, so that has to be inferred from the order of the csv file (notice that there is no intersection of filled/empty fields in my example).
Should I make a list of dataframes, a dict of dataframes, or a panel of dataframes?
Can I somehow create variables from the sample1 record fields and somehow attach them to each dataframe that has only detail records (like a collection of objects that have several scalar members and one dataframe each)?
Eventually I will create statistics on data from each detail record grouping and plot them against values in the sample records (e.g. sampletype, day or date, etc. vs. mystatistic). I will create intermediate Series to also be attached to the sample grouping like a kernel density estimation PDF or histogram.
Thanks.
You can use the fact that the first column seems to be empty unless it's a new sample record to .fillna(method='ffill') and then .groupby('header1') to get all the separate groups. On these, you can calculate statistics right away or store as separate DataFrame. High level sketch as follows:
df.header1 = df.header1.fillna(method='ffill')
for sample, data in df.groupby('header1'):
print(sample) # access to sample name
data = ... # process sample records
The answer above got me going in the right direction. With further work, the following was used. It turns out I needed to use two columns as a compound key to uniquely identify samples.
df.header1 = df.header1.fillna(method='ffill')
df.header2 = df.header2.fillna(method='ffill')
grouped = df.groupby(['header1','header2'])
samplelist = []
dfParent = pd.DataFrame()
dfDetail = pd.DataFrame()
for sample, data in grouped:
samplelist.append(sample)
dfParent = dfParent.append(grouped.get_group(sample).head(n=1), ignore_index=True)
dfDetail = dfDetail.append(data[1:], ignore_index=True)
dfParent = dfParent.drop(['header3','header4',etc...]) # remove columns only used in
# detail records
dfDetail = dfDetail.drop(['header5','header6',etc...]) # remove columns only used once
# per sample
# Now details can be extracted by sample number in the sample list
# (e.g. the first 10 for sample 0)
samplenumber = 0
dfDetail[
(dfDetail['header1'] == samplelist[samplenumber][0]) &
(dfDetail['header2'] == samplelist[samplenumber][1])
].header3[:10]
Useful links were:
Pandas groupby and get_group
Pandas append to DataFrame

Getting information for multiple queries across multiple .csv files

I am currently trying to figure out a way to get information stored across multiple datasets as .csv files.
Context
For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv, and experiment_4.csv. In each dataset, there are 20,000+ rows with 80+ columns in each row. Each row represents an Animal, identified by a id number, and each column represents various experimental data about that Animal. Assume each row's Animal ID number is unique for each dataset, but not across all datasets. For instance, ID#ABC123 can be found in experiment_1.csv, experiment_2.csv, but not experiment_3.csv and experiment_4.csv
Problem
Say a user wants to get info for ~100 Animals by looking up each Animal's ID # across all datasets. How would I go about doing this? I'm relatively new to programming, and I would like to improve. Here's what I have so far.
class Animal:
def __init__(self, id_number, *other_parameters):
self.animal_id = id_number
self.animal_data = {}
def store_info(self, csv_row, dataset):
self.animal_data[dataset] = csv_row
# Main function
# ...
# Assume animal_queries = list of Animal Objects
# Iterate through each dataset csv file
for dataset in all_datasets:
# Make a copy of the list of queries
copy_animal_queries = animal_queries[:]
with open(dataset, 'r', newline='') as dataset_file:
reader = csv.DictReader(dataset_file, delimiter=',')
# Iterate through each row in the csv file
for row in reader:
# Check if the list is not empty
if animal_queries_copy:
# Get the current row's animal id number
row_animal_id = row['ANIMAL ID']
# Check if the animal id number matches with a query for
# every animal in the list
for animal in animal_queries_copy[:]:
if animal.animal_id == row_animal_id:
# If a match is found, store the info, remove the
# query from the list, and exit iterating through
# each query
animal.store_info(row, dataset)
animal_list_copy.remove(animal)
break
# If the list is empty, all queries were found for the current
# dataset, so exit iterating through rows in reader
else:
break
Discussion
Is there a more obvious approach for this? Assume that I want to use .csv files for now, and I will consider converting these .csv files to an easier-to-use format like SQL Tables later down the line (I am an absolute beginner at databases and SQL, so I need to spend time learning this).
The one thing that sticks out to me is that I have to create multiple copies of animal_queries: 1 for each dataset, and 1 for each row in a dataset (in the for loop). Since 1 row only contains 1 ID, I can exit the loop early once I find a match to an ID from animal_queries. In addition, since that ID was already found, I no longer need to search for that ID for the rest of the current dataset, so I remove it from the list, but I need to keep the original copy of the queries since I also need it to search the remaining datasets. However, I can't remove an element from a list while inside a for loop, so I need to create another copy as well. This doesn't seem optimal to me and I'm wondering if I'm approaching this in the wrong direction. Any help would be appreciated, thanks!
Well, you could greatly speed this up by using the pandas library for one thing. Ignoring the class definition for now, you could do the following:
import pandas as pd
file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar'] #input by user
#create list of data sets
data_sets = [pd.read_csv(_file) for _file in file_names]
#create store of retrieved data
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]
#concatenate the data
final_data = pd.concat(retrieved_data)
#export to csv
final_data.to_csv('your_data')
This simplifies things a lot. The isin method slices each data frame where ANIMAL ID is found in the list animal_queires. Incidentally pandas will also help you to cope with sql tables also so is probably a good route for you to go down.

Dealing with subsets of data using csv.DictReader

I'm parsing a big CSV file using csv.DictReader.
quotes=open( "file.csv", "rb" )
csvReader= csv.DictReader( quotes )
Then for each row I'm converting the time value in the CSV in datetime using this :
for data in csvReader:
year = int(data["Date"].split("-")[2])
month = strptime(data["Date"].split("-")[1],'%b').tm_mon
day = int(data["Date"].split("-")[0])
hour = int(data["Time"].split(":")[0])
minute = int(data["Time"].split(":")[1])
bars = datetime.datetime(year,month,day,hour,minute)
Now I would like to perform actions only on the rows of the same day. Would it be possible to do it in the same for loop or should I maybe save the data out per day and then perform actions? What would be an efficient way of baking the parsing?
As jogojapan has pointed out, it is important to know whether we can assume that the CSV file is sorted by date. If it is, then you could use itertools.groupby to simplify your code. For example, the for loop in this code iterates over the data one day at time:
import csv
import datetime
import itertools
with open("file.csv", "rb") as quotes:
csvReader = csv.DictReader(quotes)
lmb = lambda d: datetime.datetime.strptime(d["Date"], "%d-%b-%Y").date()
for k, g in itertools.groupby(csvReader, key = lmb):
# do stuff per day
counts = (int(data["Count"]) for data in g)
print "On {0} the total count was {1}".format(k, sum(counts))
I created a test "file.csv" containing the following data:
Date,Time,Count
1-Apr-2012,13:23,10
2-Apr-2012,10:57,5
2-Apr-2012,11:38,23
2-Apr-2012,15:10,1
3-Apr-2012,17:47,123
3-Apr-2012,18:21,8
and when I ran the above code I got the following results:
On 2012-04-01 the total count was 10
On 2012-04-02 the total count was 29
On 2012-04-03 the total count was 131
But remember that this will only work if the data in "file.csv" is sorted by date.
If (for some reason) you can assume that the input rows are already sorted by date, you could put them into a local container one by one as long as the date of any new row is the same as the previous one:
same_date_rows = []
prev_date = None
for data in csvReader:
# ... your existing code
bars = datetime.datetime(year,month,day,hour,minute)
if bars == prev_date:
same_date_rows.append(data)
else:
# New date. We process all rows collected so far
do_something(same_date_rows)
# Then we start a new collection for the new date
same_date_rows = [date]
# Remember the date of the current row
prev_date = bars
# Finally, process the final group of rows
do_something(same_date_rows)
But if you cannot make that assumption, you will have to
Either: Put the rows in a long list, sort that by date, and then apply an algorithm like the above to the sorted list
Or: Put the rows in a dictionary, using the date as key, and a list of rows as value for each key. Then you can iterate through the keys of that dictionary to get access to all rows that share a date.
The second of these two approaches is a little more space-consuming, but it may allow you do to some of the date-specific processing in the main loop, because whenever you receive a new row for an already-existing date, you could apply some of the date-specific processing right away, possibly avoiding the need to actually store all date-specific rows explicitly. Whether that is possible depends on what kind of processing you apply to the rows.
If you are not going for space efficeny, an elegant solution would be to create a dictionary where the key is your day, and the value is a list object, where all the information for each day is stored. Later you can do whatever operations you want based on per day.
For example
d = {} #Initialize emptry dictionry
for data in csvReader:
Day = int(data["Date"].split("-")[0])
try:
d[Day].append('Some_Val')
except KeyError:
d[Day] = ['Some_val']
This will either modify or create a new list object for each day. This is later easily accessible either by iterating over the dictionary or simply referring to the day as a key.
For example:
d[Some_Day]
will give you simply a list object with all the information you have stored. Given the linear lookup time of a dictionary, it should be quite efficent in terms of time.

Categories