My pandas DataFrame looks like this:
DOY Value
0 5 5118
1 10 5098
2 15 5153
I've been trying to resample my data and fill in the gaps using pandas' resample function. My worry is that since I'm not working with actual datetime values, I won't be able to resample my data.
My attempt was the following line of code, but it raised an error saying I was using a RangeIndex. Perhaps I need to use a PeriodIndex somehow, but I'm not sure how to go about it.
inter.resample('1D').mean().interpolate()
Here's my intended result
DOY Value
0 5 5118
1 6 5114
2 7 5110
3 8 5106
4 9 5102
5 10 5098
: : :
10 15 5153
Convert with to_datetime, perform the resample, and then drop the unwanted column:
df["date"] = pd.to_datetime(df["DOY"].astype(str),format="%j")
output = df.resample("D", on="date").last().drop("date", axis=1).interpolate().reset_index(drop=True)
>>> output
DOY Value
0 5.0 5118.0
1 6.0 5114.0
2 7.0 5110.0
3 8.0 5106.0
4 9.0 5102.0
5 10.0 5098.0
6 11.0 5109.0
7 12.0 5120.0
8 13.0 5131.0
9 14.0 5142.0
10 15.0 5153.0
pd.DataFrame.interpolate works on the index. So let's start by setting an appropriate index, then reindex onto a new, complete one over which we will interpolate.
d0 = df.set_index('DOY')
idx = pd.RangeIndex(d0.index.min(), d0.index.max()+1, name='DOY')
d0.reindex(idx).interpolate().reset_index()
DOY Value
0 5 5118.0
1 6 5114.0
2 7 5110.0
3 8 5106.0
4 9 5102.0
5 10 5098.0
6 11 5109.0
7 12 5120.0
8 13 5131.0
9 14 5142.0
10 15 5153.0
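Putting the pieces above together as a self-contained snippet (using the sample data from the question):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"DOY": [5, 10, 15], "Value": [5118, 5098, 5153]})

# Reindex onto the complete integer range of DOY values, then interpolate
d0 = df.set_index("DOY")
idx = pd.RangeIndex(d0.index.min(), d0.index.max() + 1, name="DOY")
out = d0.reindex(idx).interpolate().reset_index()
print(out)
```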
Related
I have a pd.DataFrame df with one column, say:
A = [1,2,3,4,5,6,7,8,2,4]
df = pd.DataFrame(A,columns = ['A'])
For each row, I want to take the previous 2 values, the current value and the next 2 values (a window of 5), sum them, and store the result in a new column. Desired output:
A A_sum
1 6
2 10
3 15
4 20
5 25
6 30
7 28
8 27
2 21
4 14
I have tried,
df['A_sum'] = df['A'].rolling(2).sum()
I tried with shift, but it goes only forward or backward; I'm looking for a combination of both.
Use a rolling window of 5 and pass center=True and min_periods=1 to Series.rolling:
df['A_sum'] = df['A'].rolling(5, center=True, min_periods=1).sum()
print (df)
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
If you are allowed to use numpy, you can use numpy.convolve to get the desired output:
import numpy as np
import pandas as pd
A = [1,2,3,4,5,6,7,8,2,4]
B = np.convolve(A,[1,1,1,1,1], 'same')
df = pd.DataFrame({"A":A,"A_sum":B})
print(df)
output
A A_sum
0 1 6
1 2 10
2 3 15
3 4 20
4 5 25
5 6 30
6 7 28
7 8 27
8 2 21
9 4 14
You can use shift for this (straightforward if not elegant); note that you need both neighbours on each side, four shifts in total:
df["A_sum"] = (df.A + df.A.shift(-2).fillna(0) + df.A.shift(-1).fillna(0)
               + df.A.shift(1).fillna(0) + df.A.shift(2).fillna(0))
output:
   A  A_sum
0  1    6.0
1  2   10.0
2  3   15.0
3  4   20.0
4  5   25.0
5  6   30.0
6  7   28.0
7  8   27.0
8  2   21.0
9  4   14.0
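For reference, the centered window-of-5 sum can be obtained three equivalent ways: rolling, convolve, or a full set of shifts from -2 to +2. A quick sanity check on the sample data:

```python
import numpy as np
import pandas as pd

A = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4]
s = pd.Series(A)

# Centered rolling sum, shrinking the window at the edges
rolling = s.rolling(5, center=True, min_periods=1).sum()
# Convolution with a length-5 kernel of ones, 'same' keeps the length
conv = pd.Series(np.convolve(A, np.ones(5, dtype=int), "same"))
# Sum of the five shifts -2..2, with NaN edges treated as 0
shifted = sum(s.shift(k).fillna(0) for k in range(-2, 3))

print(rolling.tolist())
```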
I would like to create an additional column in my dataframe without looping through the steps.
It is created in the following steps:
1. Start from the end of the data. For each date, take every nth row (in this case every 5th) counting back from that row.
2. Take the rolling sum of x of those numbers (here x=2).
A worked example:
11/22: 5,7,3,2 (every 5th row being picked), but x=2, so 5+7=12
11/15: 6,5,2 (every 5th row being picked), but x=2, so 6+5=11
date        value  cumulative
8/30/2019 2
9/6/2019 4
9/13/2019 1
9/20/2019 2
9/27/2019 3 5
10/4/2019 3 7
10/11/2019 5 6
10/18/2019 5 7
10/25/2019 7 10
11/1/2019 4 7
11/8/2019 9 14
11/15/2019 6 11
11/22/2019 5 12
Let's assume we have a set of 15 integers:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], columns=['original_data'])
We define which nth row should be added (n) and how many values x we sum per row:
n = 5
x = 2
(
df
# add x-1 columns, each shifted a further n rows
.assign(**{
'n{} x{}'.format(n, i): df['original_data'].shift(n*i)
for i in range(1, x)})
# take the row-wise sum
.sum(axis=1)
)
Intermediate result (the shifted columns, before the row-wise sum):
original_data n5 x1
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 1.0
6 7 2.0
7 8 3.0
8 9 4.0
9 10 5.0
10 11 6.0
11 12 7.0
12 13 8.0
13 14 9.0
14 15 10.0
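For reference, a self-contained sketch of the shift-and-sum idea, with the loop variable renamed to i so it does not shadow x:

```python
import pandas as pd

df = pd.DataFrame({"original_data": range(1, 16)})
n, x = 5, 2  # pick every 5th row, sum x values per row

result = (
    df
    # add x-1 extra columns, each shifted a further n rows
    .assign(**{f"n{n} x{i}": df["original_data"].shift(n * i)
               for i in range(1, x)})
    # row-wise sum; the NaNs introduced by the shift are skipped
    .sum(axis=1)
)
print(result.tolist())
```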
I am trying to create a new column that will list down the last recorded peak values, until the next peak comes along. For example, suppose this is my existing DataFrame:
index values
0 10
1 20
2 15
3 17
4 15
5 22
6 20
I want to get something like this:
index values last_recorded_peak
0 10 10
1 20 20
2 15 20
3 17 17
4 15 17
5 22 22
6 20 22
So far, I have tried with np.maximum.accumulate, which 'accumulates' the max value but not quite the "peaks" (some peaks might be lower than the max value).
I have also tried with scipy.signal.find_peaks which returns an array of indexes where my peaks are (in the example, index 1, 3, 5), which is not what I'm looking for.
I'm relatively new to coding, any pointer is very much appreciated!
You're on the right track, scipy.signal.find_peaks is the way I would go, you just need to work a little bit from the result:
from scipy import signal
peaks = signal.find_peaks(df['values'])[0]
df['last_recorded_peak'] = (df.assign(last_recorded_peak=float('nan'))
.last_recorded_peak
.combine_first(df.loc[peaks,'values'])
.ffill()
.combine_first(df['values']))
print(df)
index values last_recorded_peak
0 0 10 10.0
1 1 20 20.0
2 2 15 20.0
3 3 17 17.0
4 4 15 17.0
5 5 22 22.0
6 6 20 22.0
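The same forward-fill idea can be written without scipy (a sketch on the example data; for this series the interior local maxima are exactly what find_peaks returns):

```python
import pandas as pd

df = pd.DataFrame({"values": [10, 20, 15, 17, 15, 22, 20]})

# A peak is a value strictly greater than both neighbours
v = df["values"]
is_peak = (v > v.shift(1)) & (v > v.shift(-1))

# Keep only the peak values, carry the last one forward, and fall
# back to the original values before the first peak
df["last_recorded_peak"] = v.where(is_peak).ffill().fillna(v)
print(df)
```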
If I understand you correctly, you are looking for a rolling max.
Note: you might have to play around with the window size, which I set to 2 for your example dataframe:
df['last_recorded_peak'] = df['values'].rolling(2).max().fillna(df['values'])
Output
values last_recorded_peak
0 10 10.0
1 20 20.0
2 15 20.0
3 17 17.0
4 15 17.0
5 22 22.0
6 20 22.0
I have a dataframe of the following type
df = pd.DataFrame({'Days':[1,2,5,6,7,10,11,12],
'Value':[100.3,150.5,237.0,314.15,188.0,413.0,158.2,268.0]})
Days Value
0 1 100.3
1 2 150.5
2 5 237.0
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
and I would like to add a column '+5Ratio' whose value is the ratio between the Value corresponding to Days+5 and the Value at Days.
For example, in the first row I would have 3.13210368893 = 314.15/100.3, in the second 1.24916943522 = 188.0/150.5, and so on.
Days Value +5Ratio
0 1 100.3 3.13210368893
1 2 150.5 1.24916943522
2 5 237.0 ...
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
I'm struggling to find a way to do it using a lambda function.
Could someone help me find a way to solve this problem?
Thanks in advance.
Edit
In the case I am interested in, the "Days" field can vary sparsely from 1 to 18180, for instance.
You can use merge; a benefit of doing it this way is that it handles missing values:
s=df.merge(df.assign(Days=df.Days-5),on='Days')
s.assign(Value=s.Value_y/s.Value_x).drop(['Value_x','Value_y'],axis=1)
Out[359]:
Days Value
0 1 3.132104
1 2 1.249169
2 5 1.742616
3 6 0.503581
4 7 1.425532
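Spelled out end to end with the sample frame from the question (relying on merge's default _x/_y suffixes):

```python
import pandas as pd

df = pd.DataFrame({"Days": [1, 2, 5, 6, 7, 10, 11, 12],
                   "Value": [100.3, 150.5, 237.0, 314.15, 188.0, 413.0, 158.2, 268.0]})

# Shift the Days key down by 5 and merge, so each row meets the row 5 days ahead
s = df.merge(df.assign(Days=df.Days - 5), on="Days")
out = s.assign(Value=s.Value_y / s.Value_x).drop(["Value_x", "Value_y"], axis=1)
print(out)
```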
Consider left merging on a helper dataframe, days, for consecutive daily points and then shift by 5 rows for ratio calculation. Finally remove the blank day rows:
days_df = pd.DataFrame({'Days':range(min(df.Days), max(df.Days)+1)})
days_df = days_df.merge(df, on='Days', how='left')
print(days_df)
# Days Value
# 0 1 100.30
# 1 2 150.50
# 2 3 NaN
# 3 4 NaN
# 4 5 237.00
# 5 6 314.15
# 6 7 188.00
# 7 8 NaN
# 8 9 NaN
# 9 10 413.00
# 10 11 158.20
# 11 12 268.00
days_df['+5ratio'] = days_df.shift(-5)['Value'] / days_df['Value']
final_df = days_df[days_df['Value'].notnull()].reset_index(drop=True)
print(final_df)
# Days Value +5ratio
# 0 1 100.30 3.132104
# 1 2 150.50 1.249169
# 2 5 237.00 1.742616
# 3 6 314.15 0.503581
# 4 7 188.00 1.425532
# 5 10 413.00 NaN
# 6 11 158.20 NaN
# 7 12 268.00 NaN
I have a dataframe with the quarterly U.S. GDP as column values. I would like to look at the values, 3 at a time, and find the index where the GDP fell for the next two consecutive quarters. This means I need to compare individual elements within df['GDP'] with each other, in groups of 3.
Here's an example dataframe.
df = pd.DataFrame(data=np.random.randint(0,10,10), columns=['GDP'])
df
GDP
0 4
1 4
2 4
3 1
4 4
5 4
6 8
7 2
8 3
9 9
I'm using df.rolling().apply(find_recession), but I don't know how I can access individual elements of the rolling window within my find_recession() function.
gdp['Recession_rolling'] = gdp['GDP'].rolling(window=3).apply(find_recession_start)
How can I access individual elements within the rolling window, so I can make a comparison such as gdp_val_2 < gdp_val_1 < gdp_val?
The .rolling().apply() will go through the entire dataframe, 3 values at a time, so let's take a look at one particular window, which starts at index location 6:
GDP
6 8 # <- gdp_val
7 2 # <- gdp_val_1
8 3 # <- gdp_val_2
How can I access gdp_val, gdp_val_1, and gdp_val_2 within the current window?
Using a lambda expression within .apply() will pass an array into the custom function (find_recession_start), and so I can just access the elements as I would any list/array e.g. arr[0], arr[1], arr[2]
df = pd.DataFrame(data=np.random.randint(0,10,10), columns=['GDP'])

def my_func(arr):
    # return 1 when the window fell for two consecutive quarters
    if (arr[2] < arr[1]) and (arr[1] < arr[0]):
        return 1
    return 0

# raw=True passes a plain numpy array, so positional indexing works
df['Result'] = df['GDP'].rolling(window=3).apply(my_func, raw=True)
df
GDP Result
0 8 NaN
1 0 NaN
2 8 0.0
3 1 0.0
4 9 0.0
5 7 0.0
6 9 0.0
7 8 0.0
8 3 1.0
9 9 0.0
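For this specific test (two consecutive falls), a vectorized alternative avoids apply entirely; a sketch on the GDP values shown above (the two leading rows come out 0 rather than NaN):

```python
import pandas as pd

df = pd.DataFrame({"GDP": [8, 0, 8, 1, 9, 7, 9, 8, 3, 9]})

fell = df["GDP"].diff() < 0                          # fell vs. the previous quarter
# A recession signal needs two consecutive falls
df["Result"] = (fell & fell.shift(1, fill_value=False)).astype(int)
print(df)
```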
The short answer is: you can't, but you can use your knowledge about the structure of the dataframe/series.
You know the size of the window, you know the current index - therefore, you can output the shift relative to the current index:
Let's pretend this is your gdp:
In [627]: gdp
Out[627]:
0 8
1 0
2 0
3 4
4 0
5 3
6 6
7 2
8 5
9 5
dtype: int64
The naive approach is just to return the (argmin() - 2) and add it to the current index:
In [630]: gdp.rolling(window=3).apply(lambda win: win.argmin() - 2) + gdp.index
Out[630]:
0 NaN
1 NaN
2 1.0
3 1.0
4 2.0
5 4.0
6 4.0
7 7.0
8 7.0
9 7.0
dtype: float64
The naive approach won't always return the correct result, since you can't predict which index argmin returns when there are ties, or when there is a rise in the middle of the window. But you get the idea.