python pandas: Can you perform multiple operations in a groupby?

Suppose I have the following DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'year': [2015, 2015, 2018, 2018, 2020],
        'total': [100, 200, 50, 150, 400],
        'tax': [10, 20, 5, 15, 40]
    }
)
I want to sum up the total and tax columns by year and obtain the size at the same time.
The following code gives me the sum of the two columns:
df_total_tax = df.groupby('year', as_index=False)
df_total_tax = df_total_tax[['total','tax']].apply(np.sum)
However, I can't figure out how to also include a column for size at the same time. Must I perform a different groupby, then use .size() and then append that column to df_total_tax? Or is there an easier way?
The end result would look like this:
   year  total  tax  size
0  2015    300   30     2
1  2018    200   20     2
2  2020    400   40     1
Thanks

You can specify a separate aggregate function for each column by using named aggregation:
df = df.groupby('year', as_index=False).agg(total=('total', 'sum'),
                                            tax=('tax', 'sum'),
                                            size=('tax', 'size'))
print (df)
year total tax size
0 2015 300 30 2
1 2018 200 20 2
2 2020 400 40 1
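The ('column', 'aggfunc') tuples above can equivalently be written with pd.NamedAgg, which some find more readable. A minimal sketch of the same aggregation (it produces the same output as above):
import pandas as pd

df = pd.DataFrame({'year': [2015, 2015, 2018, 2018, 2020],
                   'total': [100, 200, 50, 150, 400],
                   'tax': [10, 20, 5, 15, 40]})

# pd.NamedAgg(column=..., aggfunc=...) is the explicit form of the ('column', 'func') tuples
out = df.groupby('year', as_index=False).agg(
    total=pd.NamedAgg(column='total', aggfunc='sum'),
    tax=pd.NamedAgg(column='tax', aggfunc='sum'),
    size=pd.NamedAgg(column='tax', aggfunc='size'),
)
print(out)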

Related

Pandas select values from each hour for each ID

I have a dataframe in which I have some IDs, and for each ID I have some values and timestamps (roughly one value every 5 minutes, for 5 to 7 days in a row). I would like to select, for each hour and for each ID, the mean, median and variance of the values in that hour and store them in different columns, as in the following result:
hour mean var median ID
0 2 4 4 1234
1 4 5 3 1234
...
23 2 2 3 1234
My columns are:
ID int64
Value float64
Date datetime64[ns]
dtype: object
My timestamps are in the following type:
%Y-%m-%d %H:%M:%S.%f
How do I create the final dataframe for each ID? Thank you very much
Edit:
With the following line I created a column correctly with the hours:
df['hour'] = df.Date.dt.hour
Now the problem is that I have a very long column with the hours, many of them repeated, and if I use resample like this:
df = df.set_index('Date').resample('60T').mean().reset_index()
it automatically drops the value columns and overwrites them with the mean values. I would like to keep those columns, so that I can create different columns for mean, variance and median based on the values in the Value column. How can I do that part?
Try this:
# Extract the hour from the Date column
h = df['Date'].dt.hour.rename('Hour')
# Group by ID and Hour
df.groupby(['ID', h]).agg({
'Value': ['mean', 'var', 'median']
})
You can replace the h series with pd.Grouper. By default pd.Grouper groups on the index; you can set the key parameter so that it targets another column:
df.groupby([pd.Grouper(freq='1H', key='Date'), 'ID']).agg({
    'Value': ['mean', 'var', 'median']
})
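Note that pd.Grouper with freq='1H' buckets by calendar hour (one group per hour of each day), while the dt.hour approach groups by hour of day across all days, which is what the expected output suggests. Either way the agg call returns MultiIndex columns; here is a minimal sketch of flattening them into the mean/var/median layout from the question, using invented sample data:
import pandas as pd

# hypothetical sample, just to make the snippet runnable
df = pd.DataFrame({
    'ID': [1234] * 6,
    'Value': [2.0, 4.0, 3.0, 5.0, 1.0, 2.0],
    'Date': pd.to_datetime(['2021-01-01 00:05:00', '2021-01-01 00:35:00',
                            '2021-01-01 00:55:00', '2021-01-01 01:10:00',
                            '2021-01-01 01:40:00', '2021-01-01 01:50:00']),
})

h = df['Date'].dt.hour.rename('hour')
out = df.groupby(['ID', h]).agg({'Value': ['mean', 'var', 'median']})

# drop the top 'Value' level so the columns are plain mean/var/median
out.columns = out.columns.get_level_values(1)
out = out.reset_index()
print(out)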

How to load this kind of data in pandas

Background: I have logs which are generated during the testing of the devices after manufacture. Each device has a serial number and a corresponding csv log file with all the data. Something like this.
DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019 07:37:17.432 AM,1,23,10,FAIL
01/01/2019 07:37:23.661 AM,2,3,3,PASS
So there are many such log files. Each with the test data.
I have the serial numbers of the devices which failed in the field. I want to create a model using these log files, and then use it to predict whether a given device has a chance of failing in the field, given its log file.
Till now as a part of learning, I have worked with data like housing price. Every row was complete. Depending on area, number of rooms etc, it was easy to define a model for expected selling price.
Here I wish to find a way to somehow flatten all the logs into a single row. I am thinking of having something like:
DATE_1,TESTSTEP_1,READING_1,LIMIT_1,RESULT_1,DATE_2,TESTSTEP_2,READING_2,LIMIT_2,RESULT_2
1/1/2019 07:37:17.432 AM,1,23,10,FAIL,01/01/2019 07:37:23.661 AM,2,3,3,PASS
Is this the right way to deal with this kind of data?
If so, does Pandas have any built-in support for this?
I will be using scikit-learn to create models.
First convert the columns to an ordered CategoricalIndex so the output keeps the original column order, convert the DATE column with to_datetime, and reduce the datetimes to dates with Series.dt.date, using cumcount as a counter. Then create a MultiIndex with set_index, reshape with unstack, and sort the second level of the column MultiIndex with sort_index. Last, flatten the columns with a list comprehension and call reset_index:
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date
df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
df = df.set_index([dates, g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(drop=True)
print (df)
DATE_1 TESTSTEP_1 READING_1 LIMIT_1 RESULT_1 \
0 2019-01-01 07:37:17.432 1 23 10 FAIL
DATE_2 TESTSTEP_2 READING_2 LIMIT_2 RESULT_2
0 2019-01-01 07:37:23.661 2 3 3 PASS
If you also need the date as a separate first column:
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date
df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
df = df.set_index([dates.rename('DAT'), g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
DAT DATE_1 TESTSTEP_1 READING_1 LIMIT_1 RESULT_1 \
0 2019-01-01 2019-01-01 07:37:17.432 1 23 10 FAIL
DATE_2 TESTSTEP_2 READING_2 LIMIT_2 RESULT_2
0 2019-01-01 07:37:23.661 2 3 3 PASS
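To try either snippet on the sample log shown in the question, the CSV text can be read straight from a string; a minimal sketch (the io.StringIO wrapper and the explicit datetime format are only for illustration):
import io
import pandas as pd

csv_text = '''DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019 07:37:17.432 AM,1,23,10,FAIL
01/01/2019 07:37:23.661 AM,2,3,3,PASS
'''

df = pd.read_csv(io.StringIO(csv_text))
# parse the timestamps; the format matches 01/01/2019 07:37:17.432 AM
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y %I:%M:%S.%f %p')
print(df)
In practice you would read one such file per serial number, apply the reshaping above to each, and stack the resulting one-row frames with pd.concat before handing them to scikit-learn.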

Python: Summarizing & Aggregating Groups and Sub-groups in DataFrame

I am trying to build a table that has groups that are divided by subgroups with count and average for each subgroup. For example, I want to convert the following data frame:
To a table that looks like this, where interval is the outer group and columns a through i become subgroups within it, each holding the subgroup's count and average:
I have tried this with no success:
Try:
df.groupby(['interval']).apply(lambda x: x.stack()
                                          .groupby(level=-1)
                                          .agg(['count', 'mean']))
Use groupby with apply to apply a function to each group, then stack and groupby again with agg to find the count and mean.
Use DataFrame.melt with GroupBy.agg, passing tuples of (new column name, aggregate function):
df1 = (df.melt('interval', var_name='source')
         .groupby(['interval','source'])['value']
         .agg([('cnt','count'), ('average','mean')])
         .reset_index())
print (df1.head())
interval source cnt average
0 0 a 1 5.0
1 0 b 1 0.0
2 0 c 1 0.0
3 0 d 1 0.0
4 0 f 1 0.0
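If you want the wide layout from the question, with one block of columns per subgroup, you could pivot the long result back; a sketch building on df1 from above (the pivot step is not part of the original answer):
# one row per interval, with a (statistic, source) column pair per subgroup
wide = df1.pivot(index='interval', columns='source')
# optionally flatten the MultiIndex columns to names like 'cnt_a', 'average_a'
wide.columns = [f'{stat}_{src}' for stat, src in wide.columns]
print(wide.head())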
The following code solves the problem I asked about:
df.groupby('interval', as_index=False).agg({
    'a': ['count', 'mean'],
    'b': ['count', 'mean'],
    'c': ['count', 'mean'],
    'd': ['count', 'mean'],
    'f': ['count', 'mean'],
    'g': ['count', 'mean'],
    'i': ['count', 'mean']
})

Pandas how to aggregate more than one column

Here is the snippet:
test = pd.DataFrame({'userid': [1,1,1,2,2], 'order_id': [1,2,3,4,5], 'fee': [2,1,5,3,1]})
I'd like to group based on userid and count the 'order_id' column and sum the 'fee' column:
test.groupby('userid').order_id.count()
test.groupby('userid').fee.sum()
Is it possible to perform these two operations in one line of code, so that I can get a resulting df that looks like this:
userid counts sum
...
I've tried pivot_table:
test.pivot_table(index='userid', values=['order_id', 'fee'], aggfunc=[np.size, np.sum])
It gives something like this:
       size           sum
        fee order_id  fee order_id
userid
1         3        3    8        6
2         2        2    4        9
Is it possible to tell pandas to use np.size & np.sum on one column but not both?
Use DataFrameGroupBy.agg and then rename the columns:
d = {'order_id':'counts','fee':'sum'}
df = (test.groupby('userid')
          .agg({'order_id':'count', 'fee':'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
userid sum counts
0 1 8 3
1 2 4 2
But it is better to aggregate with size, because count is only needed if you want to exclude NaNs:
df = (test.groupby('userid')
          .agg({'order_id':'size', 'fee':'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
userid sum counts
0 1 8 3
1 2 4 2
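On pandas 0.25 or newer you could also get the renamed columns in one step with named aggregation, the same idea as in the first answer on this page; a minimal sketch:
import pandas as pd

test = pd.DataFrame({'userid': [1, 1, 1, 2, 2],
                     'order_id': [1, 2, 3, 4, 5],
                     'fee': [2, 1, 5, 3, 1]})

# name each output column directly instead of renaming afterwards
out = test.groupby('userid', as_index=False).agg(counts=('order_id', 'size'),
                                                 sum=('fee', 'sum'))
print(out)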

How do I aggregate multiple columns with one function in pandas when using groupby?

I have a data frame with a "group" variable, a "count" variable, and a "total" variable. For each group I want to sum the count column and divide that by the sum of the total column. How do I do this, ideally in one line of code?
Here is an example to work with:
test_dc = {1:{'group':'A','cnt':3,'total':5},
2:{'group':'B','cnt':1,'total':8},
3:{'group':'A','cnt':2,'total':4},
4:{'group':'B','cnt':6,'total':13}
}
test_df = pd.DataFrame.from_dict(test_dc, orient='index')
Expected output (roughly):
group | average
A | 0.55555
B | 0.33333
Edit: changed column name from "count" to "cnt" because there seems to be an existing count() method on groupby objects.
You can use DataFrame.groupby to group by a column, and then call sum on that to get the sums.
>>> df = test_df.groupby('group').sum()
>>> df
       cnt  total
group
A        5      9
B        7     21
Then you can grab the column and divide them through to get your answer.
>>> df['cnt'] / df['total']
group
A 0.555556
B 0.333333
dtype: float64
You can do this in one expression by taking advantage of the DataFrame.pipe method:
(test_df
 .groupby('group')
 .sum()
 .pipe(lambda df: df['cnt'] / df['total']))
I'd use a combination of agg and eval
test_df.groupby('group').agg('sum').eval('cnt / total')
group
A 0.555556
B 0.333333
dtype: float64
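To end up with the exact two-column layout from the question (group and average), you can name the resulting Series and reset the index; a minimal sketch (the 'average' label is just the name used in the expected output):
import pandas as pd

test_dc = {1: {'group': 'A', 'cnt': 3, 'total': 5},
           2: {'group': 'B', 'cnt': 1, 'total': 8},
           3: {'group': 'A', 'cnt': 2, 'total': 4},
           4: {'group': 'B', 'cnt': 6, 'total': 13}}
test_df = pd.DataFrame.from_dict(test_dc, orient='index')

# sum both columns per group, then divide the sums and name the result
sums = test_df.groupby('group')[['cnt', 'total']].sum()
out = (sums['cnt'] / sums['total']).rename('average').reset_index()
print(out)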
