Multiple Groupings on Pandas DataFrame - python

Forgive any bad wording as I'm rather new to Pandas. I've done a fair amount of Googling but can't quite figure out the keywords I need to get the answer I'm looking for. I have some rather simple data containing counts of a certain flag grouped by IDs and dates, similar to the below:
id date flag count
-------------------------------------
CAZ1 02/03/2012 Y 12
CAZ1 02/03/2012 N 7
CAZ2 03/03/2012 Y 6
CAZ2 03/03/2012 N 2
CRI2 02/03/2012 Y 14
CRI2 02/03/2012 G 5
LMU3 01/12/2013 G 7
LMU4 02/12/2013 G 4
LMU5 01/12/2014 G 3
LMU6 01/12/2014 G 2
LMU7 05/12/2014 G 2
EUR4 01/16/2014 N 3
What I'm looking to do is group the IDs by certain flag combinations, sum their counts, and then get means for these per year. Resulting data should look something like:
2012 2013 2014 Mean Calculations:
--------------------------------------
Y,N | 6.75 NaN NaN (((12+7)/2)+((6+2)/2))/2
--------------------------------------
Y,G | 9.5 NaN NaN (14+5)/2
--------------------------------------
G | NaN 5.5 2.33 (7+4)/2, (3+2+2)/3
--------------------------------------
N | NaN NaN 3 (3)
Not sure if this makes sense. I think I need to perform multiple GroupBys at the same time, with the option to define the different criteria for each of the different groupings.
Happy to clarify further if needed. My initial attempts at coding this have been filled with errors so I don't think there's much benefit in posting progress so far. In fact, I just tried to write something and it seemed more misleading than helpful. Sorry, >_<.

IIUC, you can get what you want by first doing a groupby and then building a pivot_table:
[original version]
df["date"] = pd.to_datetime(df["date"])
grouped = df.groupby(["id","date"], as_index=False)
df_new = grouped.agg({"flag": ",".join, "count": "sum"})
df_new["year"] = df_new["date"].dt.year
df_final = df_new.pivot_table(index="flag", columns="year")
produces
>>> df_final
count
year 2012 2013 2014
flag
G NaN 5.5 2.333333
N NaN NaN 3.000000
Y,G 19.0 NaN NaN
Y,N 13.5 NaN NaN
[updated after the question was edited]
If you want the mean instead of the sum, just write mean instead of sum when doing the aggregation, i.e.
df_new = grouped.agg({"flag": ",".join, "count": "mean"})
which gives
>>> df_final
count
year 2012 2013 2014
flag
G NaN 5.5 2.333333
N NaN NaN 3.000000
Y,G 9.50 NaN NaN
Y,N 6.75 NaN NaN
The only tricky part is passing the dictionary to agg so we can perform two aggregation operations at once (the df_new shown below is from the original sum version):
>>> df_new
id date count flag year
0 CAZ1 2012-02-03 19 Y,N 2012
1 CAZ2 2012-03-03 8 Y,N 2012
2 CRI2 2012-02-03 19 Y,G 2012
3 EUR4 2014-01-16 3 N 2014
4 LMU3 2013-01-12 7 G 2013
5 LMU4 2013-02-12 4 G 2013
6 LMU5 2014-01-12 3 G 2014
7 LMU6 2014-01-12 2 G 2014
8 LMU7 2014-05-12 2 G 2014
It's usually easier to work with these flat formats as much as you can and then pivot only at the end.
For example, if your real dataset is more complicated than the one you posted, you might need another groupby -- but that's easy enough using this pattern.
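For reference, here is a minimal end-to-end sketch (an editorial addition, not part of the original answer) that rebuilds the sample data from the question and runs the same pipeline; the explicit to_datetime format is an assumption based on the month-first dates shown:
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    "id":    ["CAZ1", "CAZ1", "CAZ2", "CAZ2", "CRI2", "CRI2",
              "LMU3", "LMU4", "LMU5", "LMU6", "LMU7", "EUR4"],
    "date":  ["02/03/2012", "02/03/2012", "03/03/2012", "03/03/2012",
              "02/03/2012", "02/03/2012", "01/12/2013", "02/12/2013",
              "01/12/2014", "01/12/2014", "05/12/2014", "01/16/2014"],
    "flag":  ["Y", "N", "Y", "N", "Y", "G", "G", "G", "G", "G", "G", "N"],
    "count": [12, 7, 6, 2, 14, 5, 7, 4, 3, 2, 2, 3],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")  # month-first, as in the sample

# Combine the flags and average the counts per (id, date), then pivot by year
grouped = df.groupby(["id", "date"], as_index=False)
df_new = grouped.agg({"flag": ",".join, "count": "mean"})
df_new["year"] = df_new["date"].dt.year
df_final = df_new.pivot_table(index="flag", columns="year", values="count")
print(df_final)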

Preserving id columns in dataframe after applying assign and groupby

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded within a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
 # .drop(columns = 'gestationalAgeInWeeks') # don't need this
 .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc'] # change here
 .max() # here
 .unstack()
 .add_prefix('abdomCirc_') # prefix the tm columns after unstacking
 .reset_index() # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 0 0 NaN 200.0 NaN
1 1 1 NaN 315.0 350.0
2 2 2 180.0 NaN NaN
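Since the stated goal is to merge on the id columns afterwards, here is a hedged, self-contained sketch (an editorial addition; other is a hypothetical second table, not from the question) of the pivot_table version followed by such a merge:
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'MotherID': [0, 0, 1, 1, 1, 2, 2, 2],
                   'PregnancyID': [0, 0, 1, 1, 1, 2, 2, 2],
                   'gestationalAgeInWeeks': [14, 21, 20, 25, 30, 8, 9, 18],
                   'abdomCirc': [150, 200, 294, 315, 350, 170, 180, np.nan]})

summary = (df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
             .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                          values='abdomCirc', aggfunc='max')
             .add_prefix('abdomCirc_')
             .reset_index())

# Hypothetical second table keyed by the same ids (not from the question)
other = pd.DataFrame({'MotherID': [0, 1, 2],
                      'PregnancyID': [0, 1, 2],
                      'outcome': ['a', 'b', 'c']})

merged = summary.merge(other, on=['MotherID', 'PregnancyID'], how='left')
print(merged)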

Pandas combine two dataseries into one series

I need to combine the two data series rateScore and rate into one.
This is the DataFrame I currently have:
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series, and for each row the index number needs to stay the same.
You can try the following with fillna(), redefining the rateScore column and dropping rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from a second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
Let us do add
df['rateScore'] = df['rateScore'].add(df['rate'],fill_value=0)
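A small self-contained sketch (an editorial addition, using only the rows shown above and omitting the elided ones) of the combine_first approach, dropping rate afterwards as requested:
import numpy as np
import pandas as pd

df = pd.DataFrame({'rateScore': [np.nan, 2.5, 4.5, np.nan, np.nan, 3.8],
                   'rate':      [4.5, np.nan, np.nan, 5.0, 4.7, np.nan]},
                  index=[10, 11, 12, 13, 235, 236])

# Fill the gaps in rateScore from rate, keep the original index, then drop rate
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
df = df.drop(columns='rate')
print(df)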

How to update a Pandas Panel without duplicates

Currently I'm working on live-timing software for a motorsport application. For that I have to crawl a live-timing webpage and copy the data into a big DataFrame, which is the source of several diagrams I want to make. To keep the DataFrame up to date, I have to crawl the webpage very often.
I can download the data and save it as a pandas DataFrame. My problem is the step from the freshly downloaded DataFrame to the big DataFrame that contains all the data.
import pandas as pd
import numpy as np
df1= pd.DataFrame({'Pos':[1,2,3,4,5,6],'CLS':['V5','V5','V5','V4','V4','V4'],
'Nr.':['13','700','30','55','24','985'],
'Zeit':['1:30,000','1:45,000','1:50,000','1:25,333','1:13,366','1:17,000'],
'Laps':['1','1','1','1','1','1']})
df2= pd.DataFrame({'Pos':[1,2,3,4,5,6],'CLS':['V5','V5','V5','V4','V4','V4'],
'Nr.':['13','700','30','55','24','985'],
'Zeit':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,],
'Laps':['2','2','2','2','2','2']})
df3= pd.DataFrame({'Pos':[1,2,3,4,5,6],'CLS':['V5','V5','V5','V4','V4','V4'],
'Nr.':['13','700','30','55','24','985'],
'Zeit':['1:31,000','1:41,000','1:51,000','1:21,333','1:11,366','1:11,000'],
'Laps':['2','2','2','2','2','2']})
df1.set_index(['CLS','Nr.','Laps'],inplace=True)
df2.set_index(['CLS','Nr.','Laps'],inplace=True)
df3.set_index(['CLS','Nr.','Laps'],inplace=True)
df1 is a DataFrame from previous laps.
df2 is a DataFrame during the second lap; the lap is not completed yet, so it contains a NaN.
df3 is a DataFrame after the second lap is completed.
My target is to have just one row per lap per car per class.
Either I end up with duplicates from incomplete laps, or all data gets overwritten.
I hope someone can help me with this problem. Thanks so far.
MrCrunsh
If I understand your problem correctly, your issue is that you have overlapping data for the second lap: information while the lap is still in progress and information after it's over. If you want to put all the information for a given lap in one row, I'd suggest using multi-index columns or changing the column names to reflect the difference between measurements during and after laps.
df = pd.concat([df1, df3])
df = pd.concat([df, df2], axis=1, keys=['after', 'during'])
The result will look like this:
after during
Pos Zeit Pos Zeit
CLS Nr. Laps
V4 24 1 5 1:13,366 NaN NaN
2 5 1:11,366 5.0 NaN
55 1 4 1:25,333 NaN NaN
2 4 1:21,333 4.0 NaN
985 1 6 1:17,000 NaN NaN
2 6 1:11,000 6.0 NaN
V5 13 1 1 1:30,000 NaN NaN
2 1 1:31,000 1.0 NaN
30 1 3 1:50,000 NaN NaN
2 3 1:51,000 3.0 NaN
700 1 2 1:45,000 NaN NaN
2 2 1:41,000 2.0 NaN
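If the goal is instead a single big DataFrame in which each new crawl overwrites the provisional values of an unfinished lap, here is a hedged sketch (an editorial addition, not part of the original answer) of an update pattern based on combine_first; it assumes every crawled frame is indexed by ['CLS', 'Nr.', 'Laps'] as above:
# Start the master frame from the first crawl
master = df1.copy()

def update_master(master, new_df):
    # Non-NaN values from the newer crawl take precedence; cells that are still
    # missing (e.g. the Zeit of an unfinished lap) keep whatever was known before.
    return new_df.combine_first(master)

master = update_master(master, df2)   # adds the lap-2 rows, Zeit still NaN
master = update_master(master, df3)   # fills in the finished lap-2 times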

what is the most efficient way to synchronize two large data frames in pandas?

I would like to synchronize two very long data frames, performance is key in this use case. The two data frames are indexed in chronological order (this should be exploited to be as fast as possible) using datetimes or Timestamps.
One way to synch is provided in this example:
import pandas as pd
df1=pd.DataFrame({'A':[1,2,3,4,5,6], 'B':[1,5,3,4,5,7]}, index=pd.date_range('20140101 101501', freq='u', periods=6))
df2=pd.DataFrame({'D':[10,2,30,4,5,10], 'F':[1,5,3,4,5,70]}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6))
# synch data frames
df3=df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')
My question is whether this is the most efficient way to do it. I am ready to explore other solutions (e.g. using numpy or cython) if there are faster ways to solve this task.
Thanks
Note: the timestamps are in general not equally spaced (unlike in the example above); the method should also work in that case.
Comment after reading the answers
I think there are many use cases in which neither align nor merge/join help. The point is to avoid DB-style semantics for aligning (which, for time series, are not so relevant in my opinion). For me, aligning means mapping series A onto series B with a way to deal with missing values (typically a sample-and-hold method); align and join cause unwanted effects like repeated timestamps in the result. I still do not have a perfect solution, but it seems np.searchsorted can help (it is much faster than using several calls to join/align to do what I need). I could not find a pandas way to do this so far.
How can I map A onto B so that the result has all timestamps of A and B but no repetitions (beyond those already present in A and B)?
Another typical use case is sample-and-hold sync, which can be solved efficiently as follows (sync A with B, i.e. for every timestamp in A take the corresponding values in B):
idx = np.searchsorted(B.index.values, A.index.values, side='right') - 1
df = A.copy()
for col in B:
    df[col] = B[col].iloc[idx].values
The resulting df has the same index as A and the synchronized values from B.
Is there an effective way to do such things directly in pandas?
If you need to synchronize, then use align (see the pandas docs). Otherwise merge is a good option.
In [18]: N=100000
In [19]: df1=pd.DataFrame({'A':[1,2,3,4,5,6]*N, 'B':[1,5,3,4,5,7]*N}, index=pd.date_range('20140101 101501', freq='u', periods=6*N))
In [20]: df2=pd.DataFrame({'D':[10,2,30,4,5,10]*N, 'F':[1,5,3,4,5,70]*N}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6*N))
In [21]: %timeit df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')
10 loops, best of 3: 69.3 ms per loop
In [22]: %timeit df1.align(df2)
10 loops, best of 3: 36.5 ms per loop
In [24]: pd.set_option('max_rows',10)
In [25]: x, y = df1.align(df2)
In [26]: x
Out[26]:
A B D F
2014-01-01 10:15:01 1 1 NaN NaN
2014-01-01 10:15:01.000001 2 5 NaN NaN
2014-01-01 10:15:01.000002 3 3 NaN NaN
2014-01-01 10:15:01.000003 4 4 NaN NaN
2014-01-01 10:15:01.000004 5 5 NaN NaN
... .. .. .. ..
2014-01-01 10:15:01.599998 5 5 NaN NaN
2014-01-01 10:15:01.599999 6 7 NaN NaN
2014-01-01 10:15:01.600000 NaN NaN NaN NaN
2014-01-01 10:15:01.600001 NaN NaN NaN NaN
2014-01-01 10:15:01.600002 NaN NaN NaN NaN
[600003 rows x 4 columns]
In [27]: y
Out[27]:
A B D F
2014-01-01 10:15:01 NaN NaN NaN NaN
2014-01-01 10:15:01.000001 NaN NaN NaN NaN
2014-01-01 10:15:01.000002 NaN NaN NaN NaN
2014-01-01 10:15:01.000003 NaN NaN 10 1
2014-01-01 10:15:01.000004 NaN NaN 2 5
... .. .. .. ..
2014-01-01 10:15:01.599998 NaN NaN 2 5
2014-01-01 10:15:01.599999 NaN NaN 30 3
2014-01-01 10:15:01.600000 NaN NaN 4 4
2014-01-01 10:15:01.600001 NaN NaN 5 5
2014-01-01 10:15:01.600002 NaN NaN 10 70
[600003 rows x 4 columns]
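For completeness, an editorial sketch (not part of the original answer) showing one way to collapse the two aligned frames into a single forward-filled frame, which should be equivalent to the merge approach above:
x, y = df1.align(df2)           # both frames now share the union of index and columns
combined = x.combine_first(y)   # take df1's values where present, otherwise df2's
combined = combined.ffill()     # sample-and-hold fill of the remaining gaps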
If you wish to use the index of one of your DataFrames as the pattern for synchronizing, this may be useful:
df3 = df1.iloc[df1.index.isin(df2.index),]
Note: this assumes df1 has more rows than df2.
In the previous code snippet you get only the elements present in both df1 and df2, but if you also want to add the indexes that are missing from df2, you may prefer:
new_indexes = df1.index.difference(df2.index)  # indexes in df1 and not in df2
default_values = np.zeros((new_indexes.shape[0], df2.shape[1]))
df2 = pd.concat([df2, pd.DataFrame(default_values, index=new_indexes, columns=df2.columns)]).sort_index()
You can see another way to synchronize in this post
In my view, syncing time series is a very simple procedure. Assume ts# (# = 0, 1, 2) to be filled with
ts#[0,:] - time
ts#[1,:] - ask
ts#[2,:] - bid
ts#[3,:] - asksz
ts#[4,:] - bidsz
output is
totts[0,:] - sync time
totts[1-4,:] - ask/bid/asksz/bidsz of ts0
totts[5-8,:] - ask/bid/asksz/bidsz of ts1
totts[9-12,:] - ask/bid/asksz/bidsz of ts2
function:
def syncTS(ts0, ts1, ts2):
    ti0 = ts0[0, :]
    ti1 = ts1[0, :]
    ti2 = ts2[0, :]

    totti = np.union1d(ti0, ti1)
    totti = np.union1d(totti, ti2)
    totts = np.ndarray((13, len(totti)))

    it0 = it1 = it2 = 0
    nT0 = len(ti0) - 1
    nT1 = len(ti1) - 1
    nT2 = len(ti2) - 1

    for it, tim in enumerate(totti):
        if tim >= ti0[it0] and it0 < nT0:
            it0 += 1
        if tim >= ti1[it1] and it1 < nT1:
            it1 += 1
        if tim >= ti2[it2] and it2 < nT2:
            it2 += 1

        totts[0, it] = tim
        for k in range(1, 5):
            totts[k, it] = ts0[k, it0]
            totts[k + 4, it] = ts1[k, it1]
            totts[k + 8, it] = ts2[k, it2]

    return totts
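For comparison, a short editorial sketch (not from the original answers) of the sample-and-hold synchronization done with plain pandas; it assumes both frames have sorted DatetimeIndexes, and reindex(..., method='ffill') picks, for every timestamp of df1, the most recent earlier row of df2:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': [1, 5, 3, 4, 5, 7]},
                   index=pd.date_range('20140101 101501', freq='us', periods=6))
df2 = pd.DataFrame({'D': [10, 2, 30, 4, 5, 10], 'F': [1, 5, 3, 4, 5, 70]},
                   index=pd.date_range('20140101 101501.000003', freq='us', periods=6))

# Sample-and-hold: for each df1 timestamp, take the last known df2 row
# (timestamps earlier than df2's first observation stay NaN)
synced = pd.concat([df1, df2.reindex(df1.index, method='ffill')], axis=1)
print(synced)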

Group by multiple time units in pandas data frame

I have a data frame that consists of time series data with 15-second intervals:
date_time value
2012-12-28 11:11:00 103.2
2012-12-28 11:11:15 103.1
2012-12-28 11:11:30 103.4
2012-12-28 11:11:45 103.5
2012-12-28 11:12:00 103.3
The data spans many years. I would like to group by both year and time to look at the distribution of time-of-day effect over many years. For example, I may want to compute the mean and standard deviation of every 15-second interval across days, and look at how the means and standard deviations change from 2010, 2011, 2012, etc. I naively tried data.groupby(lambda x: [x.year, x.time]) but it didn't work. How can I do such grouping?
In case date_time is not your index, a date_time-indexed DataFrame could be created with:
dfts = df.set_index('date_time')
From there you can group by intervals using
dfts.groupby(lambda x : x.month).mean()
to see mean values for each month. Similarly, you can do
dfts.groupby(lambda x : x.year).std()
for standard deviations across the years.
If I understood the example task you would like to achieve, you could simply split the data into years using xs, group each year, and concatenate the results into a new DataFrame.
years = range(2012, 2015)
yearly_month_stats = [dfts.xs(str(year)).groupby(lambda x : x.month).mean() for year in years]
df2 = pd.concat(yearly_month_stats, axis=1, keys = years)
From which you get something like
2012 2013 2014
value value value
1 NaN 5.324165 15.747767
2 NaN -23.193429 9.193217
3 NaN -14.144287 23.896030
4 NaN -21.877975 16.310195
5 NaN -3.079910 -6.093905
6 NaN -2.106847 -23.253183
7 NaN 10.644636 6.542562
8 NaN -9.763087 14.335956
9 NaN -3.529646 2.607973
10 NaN -18.633832 0.083575
11 NaN 10.297902 14.059286
12 33.95442 13.692435 22.293245
You were close:
data.groupby([lambda x: x.year, lambda x: x.time])
Also be sure to set date_time as the index, as in kermit666's answer
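As an editorial sketch (not part of the original answer), the same grouping can also be written with index attributes instead of lambdas; assuming dfts is the date_time-indexed frame from above, this computes the mean and standard deviation of every time-of-day bucket within each year:
stats = (dfts.groupby([dfts.index.year, dfts.index.time])['value']
             .agg(['mean', 'std'])
             .unstack(0))   # years become columns, time of day stays as the row index
print(stats)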
