I would like to create an additional column in my dataframe without having to loop through the steps.
The column is created as follows:
1. Start from the end of the data. For each date, pick every nth row
(in this case n = 5) counting backwards.
2. Take the rolling sum of x of the values from step 1 (here x = 2).
A worked example:
11/22: 5, 7, 3, 2 (every 5th row picked), but x = 2, so 5 + 7 = 12
11/15: 6, 5, 2 (every 5th row picked), but x = 2, so 6 + 5 = 11
date        value  cumulative
8/30/2019 2
9/6/2019 4
9/13/2019 1
9/20/2019 2
9/27/2019 3 5
10/4/2019 3 7
10/11/2019 5 6
10/18/2019 5 7
10/25/2019 7 10
11/1/2019 4 7
11/8/2019 9 14
11/15/2019 6 11
11/22/2019 5 12
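The worked example above can be reproduced with a plain shift (a sketch; note that "every 5th row" counted inclusively is a 4-row step on this weekly data, and x = 2 means summing the current value with one stepped-back value):

```python
import pandas as pd

dates = pd.to_datetime([
    '8/30/2019', '9/6/2019', '9/13/2019', '9/20/2019', '9/27/2019',
    '10/4/2019', '10/11/2019', '10/18/2019', '10/25/2019', '11/1/2019',
    '11/8/2019', '11/15/2019', '11/22/2019'])
df = pd.DataFrame({'value': [2, 4, 1, 2, 3, 3, 5, 5, 7, 4, 9, 6, 5]},
                  index=dates)

# every 5th row counted inclusively = a 4-row step;
# x = 2 sums the current row with one stepped-back row
df['cumulative'] = df['value'] + df['value'].shift(4)
```

For example, `df.loc['2019-11-22', 'cumulative']` is 5 + 7 = 12, matching the worked example.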
Let's assume we have a set of 15 integers:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], columns=['original_data'])
We define the row step n and the number of values x we want to sum:
n = 5
x = 2
(
    df
    # add `x - 1` columns, each shifted a multiple of `n` rows
    .assign(**{
        'n{} x{}'.format(n, i): df['original_data'].shift(n * i)
        for i in range(1, x)})
    # take the row-wise sum
    .sum(axis=1)
)
Intermediate output of the assign step (before the row-wise sum):
original_data n5 x1
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 1.0
6 7 2.0
7 8 3.0
8 9 4.0
9 10 5.0
10 11 6.0
11 12 7.0
12 13 8.0
13 14 9.0
14 15 10.0
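For completeness, running the full pipeline on the 15-integer df returns the row-wise sum as a Series (a sketch; note that .sum(axis=1) skips NaN by default, so the first n rows fall back to the original value rather than NaN):

```python
import pandas as pd

df = pd.DataFrame(range(1, 16), columns=['original_data'])
n, x = 5, 2

result = (
    df
    # add x - 1 shifted copies of the column
    .assign(**{'n{} x{}'.format(n, i): df['original_data'].shift(n * i)
               for i in range(1, x)})
    # row-wise sum; NaN is treated as 0 by default
    .sum(axis=1)
)
```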
Related
I have a df that looks like this:
period value
1 2
2 3
3 4
4 6
5 8
6 10
7 11
I need a way to calculate the values for periods 8, 9 and 10 as the mean of the 3 previous periods, e.g. P8 = mean(8, 10, 11) ≈ 9.6, P9 = mean(10, 11, 9.6) ≈ 10.2, P10 = mean(11, 9.6, 10.2) ≈ 10.3.
Resulting in the following DF:
period value
1 2
2 3
3 4
4 6
5 8
6 10
7 11
8 9.6
9 10.2
10 10.3
Iterate over the required new sequence of periods, and keep appending each period together with the mean of the previous 3 values using DataFrame.loc:
newPeriods = (8, 9, 10)
for p in newPeriods:
    rowCount = df.shape[0]
    # loc slicing is label-based and end-inclusive, so this picks the last 3 rows
    df.loc[rowCount] = [p, df.loc[rowCount - 3:rowCount, 'value'].mean()]
OUTPUT:
period value
0 1.0 2.000000
1 2.0 3.000000
2 3.0 4.000000
3 4.0 6.000000
4 5.0 8.000000
5 6.0 10.000000
6 7.0 11.000000
7 8.0 9.666667
8 9.0 10.222222
9 10.0 10.296296
You can set period as the index first, then run a for loop that calculates each needed value and assigns it to the frame with loc. After the loop, we restore period to a column. To keep track of the last 3 values, we can use a deque:
from collections import deque
import numpy as np

# keep `period` aside
df = df.set_index("period")
# this will always store the last 3 values
last_three = deque(df.value.tail(3), maxlen=3)
# for 3 times, do..
for _ in range(3):
    # get the mean
    mean = np.mean(last_three)
    # the new index to put is current last index + 1
    df.loc[df.index[-1] + 1, "value"] = mean
    # update the deque
    last_three.append(mean)
# restore `period` to columns
df = df.reset_index()
to get
>>> df
period value
0 1 2.000000
1 2 3.000000
2 3 4.000000
3 4 6.000000
4 5 8.000000
5 6 10.000000
6 7 11.000000
7 8 9.666667
8 9 10.222222
9 10 10.296296
Let's say you have k as your original dataset
import numpy as np
import pandas as pd

period = [1, 2, 3, 4, 5, 6, 7]
value = [2, 3, 4, 6, 8, 10, 11]
k = pd.DataFrame([period, value]).T
k.columns = ['period', 'value']
# append empty rows for periods 8-10, then fill them positionally
k = pd.concat([k, pd.DataFrame([[i, None] for i in range(8, 11)],
                               columns=['period', 'value'])])
for i in range(8, 11):
    k.iloc[i - 1, 1] = np.mean(np.array([k.iloc[i - 2, 1],
                                         k.iloc[i - 3, 1],
                                         k.iloc[i - 4, 1]]))
I have a dataframe that looks like:
value 1 value 2
1 10
4 1
5 8
6 10
10 12
I want to go down each entry of value 1, average it with the previous value, and create a new column beside value 2 holding that average.
The output needs to look like:
value 1 value 2 avg
1 10 nan
4 1 2.5
5 8 4.5
6 10 5.5
10 12 8.0
How would I go about doing this?
shift
You can sum a series with the shifted version of itself:
df['avg'] = (df['value1'] + df['value1'].shift()) / 2
print(df)
value1 value2 avg
0 1 10 NaN
1 4 1 2.5
2 5 8 4.5
3 6 10 5.5
4 10 12 8.0
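Equivalently, a rolling window of size 2 gives the same pairwise average (a sketch using the question's data):

```python
import pandas as pd

df = pd.DataFrame({'value1': [1, 4, 5, 6, 10],
                   'value2': [10, 1, 8, 10, 12]})

# mean of each value and its predecessor; the first row has no
# predecessor, so it comes out NaN
df['avg'] = df['value1'].rolling(2).mean()
```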
My data is like this:
ARTICLE Day Row
a 2 10
a 3 10
a 4 10
a 5 10
a 6 10
a 7 10
a 8 10
a 9 10
a 10 10
a 11 10
b 3 1
I want to generate a new column called Date. First, I group the data by ARTICLE. Then, for each article group, if Row is 1, the value in Date is the same as the one in Day; otherwise, move all the values in Day one step upward and set the last value to 100. So the new data should look like this:
ARTICLE Day Row Date
a 2 10 3
a 3 10 4
a 4 10 5
a 5 10 6
a 6 10 7
a 7 10 8
a 8 10 9
a 9 10 10
a 10 10 11
a 11 10 100
b 3 1 3
I assume this can be done by groupby and transform. A function is taken to generate Date. So, my code is:
def myFUN_PostDate1(NRow, Date):
    if (NRow.unique() == 1):
        return Date
    else:
        Date1 = Date[1:Date.shape[0]]
        Date1[Date1.shape[0] + 1] = 19800312
        return Date1

a = pd.DataFrame({'ARTICLE': ['a','a','a','a','a','a','a','a','a','a','b'],
                  'Day': [2,3,4,5,6,7,8,9,10,11,3],
                  'Row': [10,10,10,10,10,10,10,10,10,10,1]})
a.loc[:,'Date'] = a.groupby(['ARTICLE']).transform(lambda x: myFUN_PostDate1(x.loc[:,'Row'], x.loc[:,'Day']))
But I have the error information:
pandas.core.indexing.IndexingError: ('Too many indexers', 'occurred at index Day')
I also tried groupby + np.where. But I have got the same error.
IIUC:
In [14]: df['Date'] = (df.groupby('ARTICLE')['Day']
                         .apply(lambda x: x.shift(-1).fillna(100) if len(x) > 1 else x))
In [15]: df
Out[15]:
ARTICLE Day Row Date
0 a 2 10 3.0
1 a 3 10 4.0
2 a 4 10 5.0
3 a 5 10 6.0
4 a 6 10 7.0
5 a 7 10 8.0
6 a 8 10 9.0
7 a 9 10 10.0
8 a 10 10 11.0
9 a 11 10 100.0
10 b 3 1 3.0
I have a dataframe of the following type
df = pd.DataFrame({'Days':[1,2,5,6,7,10,11,12],
'Value':[100.3,150.5,237.0,314.15,188.0,413.0,158.2,268.0]})
Days Value
0 1 100.3
1 2 150.5
2 5 237.0
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
and I would like to add a column '+5Ratio' whose value is the ratio between the Value corresponding to Days + 5 and the Value at Days.
For example, in the first row I would have 3.13210368893 = 314.15/100.3, in the second 1.24916943522 = 188.0/150.5, and so on.
Days Value +5Ratio
0 1 100.3 3.13210368893
1 2 150.5 1.24916943522
2 5 237.0 ...
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
I'm struggling to find a way to do this using a lambda function.
Could someone help me solve this problem?
Thanks in advance.
Edit
In the case I am interested in, the Days field can vary sparsely from 1 to 18180, for instance.
You can use merge; a benefit of this approach is that it handles missing values:
s = df.merge(df.assign(Days=df.Days - 5), on='Days')
s.assign(Value=s.Value_y / s.Value_x).drop(['Value_x', 'Value_y'], axis=1)
Out[359]:
Days Value
0 1 3.132104
1 2 1.249169
2 5 1.742616
3 6 0.503581
4 7 1.425532
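If you'd rather keep the trailing days (10, 11, 12) with NaN ratios instead of dropping them, one option is a left merge (a sketch using the same shifted-key trick; `Ratio` is a hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame({'Days': [1, 2, 5, 6, 7, 10, 11, 12],
                   'Value': [100.3, 150.5, 237.0, 314.15, 188.0, 413.0, 158.2, 268.0]})

# shift the Days key by -5 so each row lines up with the row 5 days later;
# how='left' keeps rows that have no Days+5 partner (their ratio becomes NaN)
s = df.merge(df.assign(Days=df.Days - 5), on='Days', how='left')
out = s.assign(Ratio=s.Value_y / s.Value_x).drop(['Value_x', 'Value_y'], axis=1)
```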
Consider left-merging onto a helper dataframe of consecutive daily points, then shifting by 5 rows for the ratio calculation. Finally, remove the blank day rows:
days_df = pd.DataFrame({'Days':range(min(df.Days), max(df.Days)+1)})
days_df = days_df.merge(df, on='Days', how='left')
print(days_df)
# Days Value
# 0 1 100.30
# 1 2 150.50
# 2 3 NaN
# 3 4 NaN
# 4 5 237.00
# 5 6 314.15
# 6 7 188.00
# 7 8 NaN
# 8 9 NaN
# 9 10 413.00
# 10 11 158.20
# 11 12 268.00
days_df['+5ratio'] = days_df.shift(-5)['Value'] / days_df['Value']
final_df = days_df[days_df['Value'].notnull()].reset_index(drop=True)
print(final_df)
# Days Value +5ratio
# 0 1 100.30 3.132104
# 1 2 150.50 1.249169
# 2 5 237.00 1.742616
# 3 6 314.15 0.503581
# 4 7 188.00 1.425532
# 5 10 413.00 NaN
# 6 11 158.20 NaN
# 7 12 268.00 NaN
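The same idea can be written with reindex instead of a merge (a sketch assuming integer Days; `full` and `ratio` are hypothetical intermediate names):

```python
import pandas as pd

df = pd.DataFrame({'Days': [1, 2, 5, 6, 7, 10, 11, 12],
                   'Value': [100.3, 150.5, 237.0, 314.15, 188.0, 413.0, 158.2, 268.0]})

# expand to one row per day so that "5 days later" is exactly 5 rows later
full = df.set_index('Days').reindex(range(df.Days.min(), df.Days.max() + 1))
ratio = full['Value'].shift(-5) / full['Value']

# map the per-day ratio back onto the original sparse rows
df['+5Ratio'] = df['Days'].map(ratio)
```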
import pandas as pd
df = pd.DataFrame({'A':[3,5,3,4,2,3,2,3,4,3,2,2,2,3],
'B':[10,20,30,40,20,30,40,10,20,30,15,60,20,15]})
A B
0 3 10
1 5 20
2 3 30
3 4 40
4 2 20
5 3 30
6 2 40
7 3 10
8 4 20
9 3 30
10 2 15
11 2 60
12 2 20
13 3 15
I'd like to append a C column containing the rolling average of B, where the rolling period is given by A.
For example, the C value at row index 2 should be df.B.rolling(3).mean() = mean(10, 20, 30), and the C value at row index 4 should be df.B.rolling(2).mean() = mean(40, 20).
probably stupid slow... but this gets it done
def crazy_apply(row):
    # ordinal position of the current row
    p = df.index.get_loc(row.name)
    a = row.A
    # average the `a` values of B ending at (and including) this row
    return df.B.iloc[p - a + 1:p + 1].mean()

df.apply(crazy_apply, 1)
0 NaN
1 NaN
2 20.000000
3 25.000000
4 30.000000
5 30.000000
6 35.000000
7 26.666667
8 25.000000
9 20.000000
10 22.500000
11 37.500000
12 40.000000
13 31.666667
dtype: float64
explanation
apply iterates through each column or each row; here it iterates through rows because of axis=1 (the 1 passed as the second argument to apply). Each iteration passes a pandas Series representing the current row: the current index value is in the row's name attribute, and the index of the row object is the same as the columns of df.
So df.index.get_loc(row.name) finds the ordinal position of the current index value held in row.name, and row.A is column A for that row.
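The same positional logic can be written as a plain comprehension over positions, which makes the index arithmetic explicit (a sketch; the explicit NaN guard avoids iloc's negative-index wraparound on the first rows):

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 5, 3, 4, 2, 3, 2, 3, 4, 3, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 20, 30, 40, 10, 20, 30, 15, 60, 20, 15]})

# for each position p, average the A[p] most recent B values ending at p;
# NaN when the window would start before the first row
df['C'] = [df.B.iloc[p - a + 1:p + 1].mean() if p - a + 1 >= 0 else float('nan')
           for p, a in enumerate(df.A)]
```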