I have a dataframe that looks like:
value 1 value 2
1 10
4 1
5 8
6 10
10 12
I want to go down each entry of value 1, average the previous value, and then create a new column beside value 2 with the average.
The output needs to look like:
value 1 value 2 avg
1 10 nan
4 1 2.5
5 8 4.5
6 10 5.5
10 12 8.0
.
.
How would I go about doing this?
shift
You can sum a series with the shifted version of itself:
df['avg'] = (df['value1'] + df['value1'].shift()) / 2
print(df)
value1 value2 avg
0 1 10 NaN
1 4 1 2.5
2 5 8 4.5
3 6 10 5.5
4 10 12 8.0
Related
I would like to create an additional column in my data-frame without having to loop through the steps
This is created in the following steps.
1.Start from end of the data.For each date resample every nth row
(in this case its 5th) from the end.
2.Take the rolling sum of x numbers from 1 (x=2)
a worked example for
11/22:5,7,3,2 (every 5th row being picked) but x=2 so 5+7=12
11/15:6,5,2 (every 5th row being picked) but x=2 so 6+5=11
cumulative
8/30/2019 2
9/6/2019 4
9/13/2019 1
9/20/2019 2
9/27/2019 3 5
10/4/2019 3 7
10/11/2019 5 6
10/18/2019 5 7
10/25/2019 7 10
11/1/2019 4 7
11/8/2019 9 14
11/15/2019 6 11
11/22/2019 5 12
Let's assume we have a set of 15 integers:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], columns=['original_data'])
We define which nth row should be added n and how many times x we add the nth row
n = 5
x = 2
(
df
# Add `x` columsn which are all shifted `n` rows
.assign(**{
'n{} x{}'.format(n, x): df['original_data'].shift(n*x)
for x in range(1, reps)})
# take the rowwise sum
.sum(axis=1)
)
Output:
original_data n5 x1
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 1.0
6 7 2.0
7 8 3.0
8 9 4.0
9 10 5.0
10 11 6.0
11 12 7.0
12 13 8.0
13 14 9.0
14 15 10.0
following is my input data frame
>>data frame after getting avg
a b c d avg
0 1 4 7 8 5
1 3 4 5 6 4.5
2 6 8 2 9 6.25
3 2 9 5 6 5.5
Output required after adding criteria
>>
a b c d avg avg_criteria
0 1 4 7 8 5 7.5 (<=5)
1 3 4 5 6 4.5 5.5 (<=4.5)
2 6 8 2 9 6.25 8.5 (<=6.25)
3 2 9 5 6 5.5 7.5 (<=5.5)
> This is the code I have tried
read file
df_input_data = pd.DataFrame(pd.read_excel(file_path,header=2).dropna(axis=1, how= 'all'))
adding column after calculating average
df_avg = df_input_data.assign(Avg=df_input_data.mean(axis=1, skipna=True))
criteria
criteria = df_input_data.iloc[, :] >= df_avg.iloc[1][-1]
#creating output data frame
df_output = df_input_data.assign(Avg_criteria= criteria)
I am unable to solve this issue. I have tried and googled it many times
From what I understand, you can try df.mask/df.where after comparing with the mean and then calculate mean:
m=df.drop("avg",1)
m.where(m.ge(df['avg'],axis=0)).mean(1)
0 7.5
1 5.5
2 8.5
3 7.5
dtype: float64
print(df.assign(Avg_criteria=m.where(m.ge(df['avg'],axis=0)).mean(1)))
a b c d avg Avg_criteria
0 1 4 7 8 5.00 7.5
1 3 4 5 6 4.50 5.5
2 6 8 2 9 6.25 8.5
3 2 9 5 6 5.50 7.5
I have a df like below
userId movieId rating
0 1 31 2.0
1 2 10 4.0
2 2 17 5.0
3 2 39 5.0
4 2 47 4.0
5 3 31 3.0
6 3 10 2.0
I need to add two column, one is mean for each movie, the other is diff which is the difference between rating and mean.
Please note that movieId can be repeated because different users may rate the same movie. Here row 0 and 5 is for movieId 31, row 1 and 6 is for movieId 10
userId movieId rating mean diff
0 1 31 2.0 2.5 -0.5
1 2 10 4.0 3 1
2 2 17 5.0 5 0
3 2 39 5.0 5 0
4 2 47 4.0 4 0
5 3 31 3.0 2.5 0.5
6 3 10 2.0 3 -1
here is some of my code which calculates the mean
df = df.groupby('movieId')['rating'].agg(['count','mean']).reset_index()
You can use transform to keep the same number of rows when calculating mean with groupby. Calculating the difference is straightforward from that:
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']
My data is like this:
ARTICLE Day Row
a 2 10
a 3 10
a 4 10
a 5 10
a 6 10
a 7 10
a 8 10
a 9 10
a 10 10
a 11 10
b 3 1
I want to generate a new column, called Date. Firstly, I group the data by ARTICLE. Then for every article group, if Row is 1, then the value in Date is the same as the one in Day. Otherwise, move all the values in Day one step upward and set the last value be 100. So, the new data should look like this:
ARTICLE Day Row Date
a 2 10 3
a 3 10 4
a 4 10 5
a 5 10 6
a 6 10 7
a 7 10 8
a 8 10 9
a 9 10 10
a 10 10 11
a 11 10 100
b 3 1 3
I assume this can be done by groupby and transform. A function is taken to generate Date. So, my code is:
def myFUN_PostDate1(NRow,Date):
if (NRow.unique()==1):
return Date
else:
Date1 = Date[1:Date.shape[0]]
Date1[Date1.shape[0] + 1] = 19800312
return Date1
a = pd.DataFrame({'ARTICLE': ['a','a','a','a','a','a','a','a','a','a','b'],
'Day': [2,3,4,5,6,7,8,9,10,11,3],
'Row':[10,10,10,10,10,10,10,10,10,10,1]})
a.loc[:,'Date'] = a.groupby(['ARTICLE']).transform(lambda x: myFUN_PostDate1(x.loc[:,'Row'],x.loc[:,'Day']))
But I have the error information:
pandas.core.indexing.IndexingError: ('Too many indexers', 'occurred at index Day')
I also tried groupby + np.where. But I have got the same error.
IIUC:
In [14]: df['Date'] = (df.groupby('ARTICLE')['Day']
.apply(lambda x: x.shift(-1).fillna(100) if len(x) > 1 else x))
In [15]: df
Out[15]:
ARTICLE Day Row Date
0 a 2 10 3.0
1 a 3 10 4.0
2 a 4 10 5.0
3 a 5 10 6.0
4 a 6 10 7.0
5 a 7 10 8.0
6 a 8 10 9.0
7 a 9 10 10.0
8 a 10 10 11.0
9 a 11 10 100.0
10 b 3 1 3.0
I have a dataframe of the following type
df = pd.DataFrame({'Days':[1,2,5,6,7,10,11,12],
'Value':[100.3,150.5,237.0,314.15,188.0,413.0,158.2,268.0]})
Days Value
0 1 100.3
1 2 150.5
2 5 237.0
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
and I would like to add a column '+5Ratio' whose date is the ratio betwen Value corresponding to the Days+5 and Days.
For example in first row I would have 3.13210368893 = 314.15/100.3, in the second I would have 1.24916943522 = 188.0/150.5 and so on.
Days Value +5Ratio
0 1 100.3 3.13210368893
1 2 150.5 1.24916943522
2 5 237.0 ...
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
I'm strugling to find a way to do it using lambda function.
Could someone give a help to find a way to solve this problem?
Thanks in advance.
Edit
In the case I am interested in the "Days" field can vary sparsly from 1 to 18180 for instance.
You can using merge , and the benefit from doing this , can handle missing value
s=df.merge(df.assign(Days=df.Days-5),on='Days')
s.assign(Value=s.Value_y/s.Value_x).drop(['Value_x','Value_y'],axis=1)
Out[359]:
Days Value
0 1 3.132104
1 2 1.249169
2 5 1.742616
3 6 0.503581
4 7 1.425532
Consider left merging on a helper dataframe, days, for consecutive daily points and then shift by 5 rows for ratio calculation. Finally remove the blank day rows:
days_df = pd.DataFrame({'Days':range(min(df.Days), max(df.Days)+1)})
days_df = days_df.merge(df, on='Days', how='left')
print(days_df)
# Days Value
# 0 1 100.30
# 1 2 150.50
# 2 3 NaN
# 3 4 NaN
# 4 5 237.00
# 5 6 314.15
# 6 7 188.00
# 7 8 NaN
# 8 9 NaN
# 9 10 413.00
# 10 11 158.20
# 11 12 268.00
days_df['+5ratio'] = days_df.shift(-5)['Value'] / days_df['Value']
final_df = days_df[days_df['Value'].notnull()].reset_index(drop=True)
print(final_df)
# Days Value +5ratio
# 0 1 100.30 3.132104
# 1 2 150.50 1.249169
# 2 5 237.00 1.742616
# 3 6 314.15 0.503581
# 4 7 188.00 1.425532
# 5 10 413.00 NaN
# 6 11 158.20 NaN
# 7 12 268.00 NaN