I'm attempting to compute the mean by date for all categories. However, each category (called mygroup in the example) does not have a value for each date. I would like to use an apply in pandas to compute the mean at each date, filling in the value using the closest date less than or equal to the current date. For instance if I have:
import pandas as pd

df = pd.DataFrame({'date': ['1', '2', '3', '6', '1', '3', '4', '5', '1', '2', '3', '4'],
                   'mygroup': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
                   'myval': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]})
date mygroup myval
0 1 a 10
1 2 a 20
2 3 a 30
3 6 a 40
4 1 b 50
5 3 b 60
6 4 b 70
7 5 b 80
8 1 c 90
9 2 c 100
10 3 c 110
11 4 c 120
Computing the mean for date == 1 is straightforward: (10 + 50 + 90)/3 = 50, which a typical groupby on date with a mean gives directly. For date == 6, however, I would like to use the last known value for each mygroup. The mean for date == 6 should therefore be
(40 + 80 + 120)/3 = 80: a has a value of 40 at date == 6, b has no value at date == 6 so its last known value is 80 (from date == 5), and c's last known value is 120 (from date == 4). The final result should look like:
date meanvalue
1 50
2 56.67
3 66.67
4 73.33
5 76.67
6 80
Is it possible to compute the mean by date with a groupby and apply in this manner, using each mygroup and filling in with the last known value if there is no value for the current date? This will have to be done for thousands of dates and tens of thousands of categories, so for loops are to be avoided.
df.set_index(['mygroup', 'date']).unstack().ffill(axis=1) \
.stack().groupby(level=1).mean()
myval
date
1 50.000000
2 56.666667
3 66.666667
4 73.333333
5 76.666667
6 80.000000
set your index to the key columns
unstack the date level into columns
fill the gaps horizontally - you now have a dense matrix you can calculate against
stack the date level back in
group by date - that is your expected output granularity
apply the math - here you want a mean
(each step is sketched in code right below)
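A minimal sketch of the same chain broken into named intermediate steps, assuming the df built in the question:

wide = df.set_index(['mygroup', 'date']).unstack()   # one row per mygroup, one column per date
dense = wide.ffill(axis=1)                           # carry the last known value forward across dates
long = dense.stack()                                 # back to one row per (mygroup, date)
result = long.groupby(level='date').mean()           # average across groups for each date
print(result)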
The key point, useful for a number of problems, is that stacking / unstacking / pivoting - "rubikscubing" your dataframe - turns a sparse format (like the long, columnar format you start with) into a dense one whose gaps are explicit NAs.
So if the calculation is easy on a full dense matrix, focus first on obtaining that dense matrix; the math afterwards is then straightforward.
You can convert all implicit missing values to explicit ones, fill them with a forward-fill scheme, and then do a normal groupby average:
from itertools import product
import pandas as pd

# get all combinations of date and mygroup using product from itertools
all_combinations = list(product(df.date.drop_duplicates(), df.mygroup.drop_duplicates()))

# convert implicit missing values to explicit missing values by merging all
# combinations with the original data frame
df1 = pd.merge(df, pd.DataFrame.from_records(all_combinations, columns=['date', 'mygroup']),
               how='outer')

# sort so each group's dates are contiguous, forward-fill the missing myval values
# (every group has a value at the first date, so the fill stays within groups here),
# then take a normal groupby average
df1.sort_values(['mygroup', 'date']).ffill().groupby('date')[['myval']].mean()
# myval
#date
#1 50.000000
#2 56.666667
#3 66.666667
#4 73.333333
#5 76.666667
#6 80.000000
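For reference, the same dense grid can also be built with pd.MultiIndex.from_product and reindex instead of a merge - a minimal sketch, assuming the df from the question:

full_grid = pd.MultiIndex.from_product(
    [df['mygroup'].unique(), df['date'].unique()], names=['mygroup', 'date'])

dense = df.set_index(['mygroup', 'date']).reindex(full_grid).sort_index()
# forward-fill within each group, then average per date
dense['myval'] = dense.groupby(level='mygroup')['myval'].ffill()
print(dense.groupby(level='date')['myval'].mean())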
I have a dataframe like this one:
Name Team Shots Goals Actions Games Minutes
1 Player 1 ABC 5 3 20 2 15
2 Player 2 ATL 6 2 15 1 30
3 Player 3 RMA 3 3 16 1 20
4 Player 4 BAR 9 0 22 3 28
5 Player 5 ATL 8 1 19 2 32
Actually, in my df I have around 120 columns, but this example shows what the solution should look like. I need the same df but with the values of most of the columns divided by one of the columns. In this case I would like the values of 'Shots', 'Goals' and 'Actions' divided by 'Minutes', but I don't want to apply this to 'Games' (and 3 or 4 other specific columns in my real case).
Is there a way to do this while specifying the exception columns that the division should not be applied to?
try:
exclude=['Games','Minutes']
#create a list of excluded columns
cols=df.columns[df.dtypes!='O']
#keep only the numeric (non-object) columns
cols=cols[~cols.isin(exclude)]
#drop the columns that appear in the exclude list
Finally:
out=df[cols].div(df['Minutes'],axis=0)
Update:
If you want the complete, final df with the excluded columns and their original values, you can use the join() method:
finaldf=out.join(df[exclude])
#if you want to join back only the excluded columns
OR
cols=df.columns[df.dtypes=='O'].tolist()+exclude
finaldf=out.join(df[cols])
#if you want the excluded columns plus the string (object) ones
You can use df.div() to divide multiple columns by one column at once:
df[['Shots','Goals','Actions']].div(df.Minutes, axis=0)
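To overwrite those columns rather than create a separate frame, the result can be assigned back - a minimal sketch, with the column list spelled out as an assumption:

cols_to_scale = ['Shots', 'Goals', 'Actions']   # in the real df, build this list however you like
df[cols_to_scale] = df[cols_to_scale].div(df['Minutes'], axis=0)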
I am looking for a way, if one exists, to perform an aggregation on a df using only a lambda approach, subject to a condition from another column. Here is a small microcosm of the problem.
df = pd.DataFrame({'ID':[1,1,1,1,2,2],
'revenue':[40,55,75,80,35,60],
'month':['2012-01-01','2012-02-01','2012-01-01','2012-03-01','2012-02-01','2012-03-01']})
print(df)
ID month revenue
0 1 2012-01-01 40
1 1 2012-02-01 55
2 1 2012-01-01 75
3 1 2012-03-01 80
4 2 2012-02-01 35
5 2 2012-03-01 60
If you only need the number of unique months for every ID, the following code works (this code is just for demonstration; 'month':'nunique' would also work here).
df = df.groupby(['ID']).agg({'month':lambda x:x.nunique()}).reset_index()
print(df)
ID month
0 1 3
1 2 2
But I need to count the unique months where revenue was greater than 50, taking two variables (revenue & month) into the lambda, something like lambda x, y: ... .
I could have done it like df[df['revenue'] > 50].groupby(...), but there are many other columns in the agg() where this condition is not needed. So, does there exist an approach where the lambda could take 2 variables simultaneously?
Expected output:
ID month
0 1 3
1 2 1
Unfortunately there is no easy or performant way to do this, because GroupBy.agg processes each column separately:
Don't use the following in practice - it is extremely slow for a large df or many groups.
def f(x):
    # x is the 'month' column of one group; use its index to reach back into
    # the full df and apply the revenue condition there
    a = df.loc[x.index]
    return a.loc[a['revenue'] > 50, 'month'].nunique()
df1 = df.groupby(['ID']).agg({'month':f}).reset_index()
print(df1)
ID month
0 1 3
1 2 1
So one possible solution is to filter first, or to use GroupBy.apply, as sketched below.
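A minimal sketch of the GroupBy.apply route, assuming the same df as above:

df1 = (df.groupby('ID')
         .apply(lambda g: g.loc[g['revenue'] > 50, 'month'].nunique())
         .rename('month')
         .reset_index())
print(df1)
#    ID  month
# 0   1      3
# 1   2      1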
What I want to do:
Column 'angle' has about 20 angle readings per second (this can vary), but my 'Time' timestamp only has an accuracy of 1 s, so roughly 20 rows share the same timestamp (the dataframe has over 1 million rows in total).
My result should be a new dataframe with one row per timestamp. The angle for each timestamp should be the median of the ~20 angle readings in that interval.
My Idea:
I iterate through the rows and check whether the timestamp has changed.
If it has, I select all rows up to that point, calculate the median, and append it to a new dataframe.
However, I have many big data files, and I am wondering whether there is a faster way to achieve my goal.
Right now my code is the following (see below).
It is not fast and I think there must be a better way to do that with pandas/numpy (or something else?).
a = 0
for i in range(1, len(df1.index)):
    if df1.iloc[[a], [1]].iloc[0][0] == df1.iloc[[i], [1]].iloc[0][0]:
        continue
    else:
        if a == 0:
            df_result = df1[a:i-1].median()
        else:
            df_result = df_result.append(df1[a:i-1].median(), ignore_index=True)
        a = i
You can use groupby here. Below, I made a simple dummy dataframe.
import pandas as pd
df1 = pd.DataFrame({'time': [1,1,1,1,1,1,2,2,2,2,2,2],
'angle' : [8,9,7,1,4,5,11,4,3,8,7,6]})
df1
time angle
0 1 8
1 1 9
2 1 7
3 1 1
4 1 4
5 1 5
6 2 11
7 2 4
8 2 3
9 2 8
10 2 7
11 2 6
Then, we group by the timestamp and take the median of the angle column within that group, and convert the result to a pandas dataframe.
df2 = pd.DataFrame(df1.groupby('time')['angle'].median())
df2 = df2.reset_index()
df2
time angle
0 1 6.0
1 2 6.5
You can use .agg after grouping to select the operation per column:
df1.groupby('Time', as_index=False).agg({"angle":"median"})
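If the real files carry more measurement columns than just angle, .agg lets you pick a different operation per column - a minimal sketch with a made-up second column 'speed':

import pandas as pd

df1 = pd.DataFrame({'Time':  [1, 1, 1, 2, 2, 2],
                    'angle': [8, 9, 7, 11, 4, 3],
                    'speed': [0.5, 0.7, 0.6, 1.2, 1.1, 1.0]})   # 'speed' is hypothetical

# median for angle, mean for the hypothetical speed column
print(df1.groupby('Time', as_index=False).agg({'angle': 'median', 'speed': 'mean'}))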
I have a dataset like this:
Participant Type Rating
1 A 6
1 A 5
1 B 4
1 B 3
2 A 9
2 A 8
2 B 7
2 B 6
I want to obtain this:
Type MeanRating
A mean(6,9)
A mean(5,8)
B mean(4,7)
B mean(3,6)
So, for each type, I want the mean of the highest value in each group, then the mean of the second-highest value in each group, and so on.
I can't think of a proper way to do this with pandas, since means always seem to be computed within groups, not across them.
First use groupby.rank to create a column that allows you to align the highest values, second highest values, etc. Then perform another groupby using the newly created column to compute the means:
# Get the grouping column.
df['Grouper'] = df.groupby(['Type', 'Participant'])['Rating'].rank(method='first', ascending=False)
# Perform the groupby and format the result.
result = df.groupby(['Type', 'Grouper'])['Rating'].mean().rename('MeanRating')
result = result.reset_index(level=1, drop=True).reset_index()
The resulting output:
Type MeanRating
0 A 7.5
1 A 6.5
2 B 5.5
3 B 4.5
I used the method='first' parameter of groupby.rank to handle the case of duplicate ratings within a ['Type', 'Participant'] group. You can omit it if this is not a possibility within your dataset, but it won't change the output if you leave it and there are no duplicates.
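For reference, here is how the intermediate Grouper column comes out on the sample data - a minimal sketch of the ranking step, assuming the dataframe from the question:

import pandas as pd

df = pd.DataFrame({'Participant': [1, 1, 1, 1, 2, 2, 2, 2],
                   'Type': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
                   'Rating': [6, 5, 4, 3, 9, 8, 7, 6]})

# rank within each (Type, Participant) pair: 1 = highest rating, 2 = second highest, ...
df['Grouper'] = df.groupby(['Type', 'Participant'])['Rating'].rank(method='first', ascending=False)
print(df)
#    Participant Type  Rating  Grouper
# 0            1    A       6      1.0
# 1            1    A       5      2.0
# 2            1    B       4      1.0
# 3            1    B       3      2.0
# 4            2    A       9      1.0
# 5            2    A       8      2.0
# 6            2    B       7      1.0
# 7            2    B       6      2.0

Rows that share the same (Type, Grouper) pair are then averaged across participants: (6 + 9) / 2 = 7.5, (5 + 8) / 2 = 6.5, and so on.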
I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
A B C D
0 2 0 11 0.053095
1 2 0 11 0.059815
2 0 35 11 0.055268
3 0 35 11 0.054573
4 0 1 11 0.054081
5 0 2 11 0.054426
6 0 1 11 0.054426
7 0 1 11 0.054426
8 42 7 3 0.048208
9 42 7 3 0.050765
10 42 7 3 0.05325
....
The problem is, the data is naturally "clustered" into groups, but this data is not given. From the above, rows 0-1 are one group, rows 2-3 are a group, rows 4-7 are a group, and 8-10 are a group.
I need to infer this grouping. One could use machine learning; however, is it possible to do this using only pandas?
Can one group by the values of the columns to create these groups? The problem is that the values are not exact: for the third group, column B has the values 1, 2, 1, 1.
A pure pandas solution would involve binning, assuming that your values are close to each other within a cluster and your bin size is large enough to absorb within-cluster variation but smaller than the distance between clusters. Whether that holds depends on your data.
The binning approach uses the cut function in pandas. You give the function a series (or array) and the number of bins you want; it evenly subdivides the range of the series into that many bins and determines which bin each input value falls into. The output below, for each column, is the bin each value fell into, and that is what you can group by, following your original train of thought (a grouping sketch follows the loop below).
The way this would come out in practice for bins of size ~5 is
import numpy as np

for col in df.columns:
    binned_name = col + '_binned'
    # number of bins so that each bin is roughly 5 wide (pd.cut needs an int)
    num_bins = int(np.ceil(df[col].max() / 5))
    df[binned_name] = pd.cut(df[col], num_bins, labels=False)
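To turn the bins into group labels, one option is to group by all of the binned columns at once and use ngroup, which assigns one integer per distinct bin combination - a minimal sketch, assuming the *_binned columns created above:

binned_cols = [c for c in df.columns if c.endswith('_binned')]

# rows whose values land in the same bin for every column get the same cluster id
df['cluster'] = df.groupby(binned_cols, sort=False).ngroup()
print(df[['A', 'B', 'C', 'D', 'cluster']])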