I have a dataframe:
import pandas as pd

df = pd.DataFrame({
    'epoch': [1, 4, 7, 8, 9, 11, 12, 15, 16, 17],
    'price': [1, 2, 3, 3, 1, 4, 2, 3, 4, 4]
})
epoch price
0 1 1
1 4 2
2 7 3
3 8 3
4 9 1
5 11 4
6 12 2
7 15 3
8 16 4
9 17 4
I have to create a new column that should be calculated in the following way:
For each row:
Find the current row's epoch (call it e_cur).
Calculate e_cur-3 = e_cur - 3 (three is a constant here, but it will be a variable).
Calculate the max price over the rows where epoch >= e_cur-3 and epoch <= e_cur.
In other words, find the maximum price among the rows whose epoch is at most three epochs before the current row's epoch.
For example:
Index=0: e_cur = epoch = 1, e_cur-3 = 1 - 3 = -2; there is only one (the first) row whose epoch is between -2 and 1, so the price from the first row is the maximum price.
Index=6: e_cur = epoch = 12, e_cur-3 = 12 - 3 = 9; there are three rows whose epoch is between 9 and 12, and the row with index=5 has the maximum price = 4.
Here are the results for every row that I calculated manually:
epoch price max_price_where_epoch_is_between_e_cur-3_and_e_cur
0 1 1 1
1 4 2 2
2 7 3 3
3 8 3 3
4 9 1 3
5 11 4 4
6 12 2 4
7 15 3 3
8 16 4 4
9 17 4 4
As you can see, epoch mostly increases one by one, but sometimes there are "holes".
How to calculate that with pandas?
Using rolling window:
In [161]: df['between'] = df.epoch.map(df.set_index('epoch')
...: .reindex(np.arange(df.epoch.min(), df.epoch.max()+1))
...: .rolling(3, min_periods=1)
...: .max()['price'])
...:
In [162]: df
Out[162]:
epoch price between
0 1 1 1.0
1 4 2 2.0
2 7 3 3.0
3 8 3 3.0
4 9 1 3.0
5 11 4 4.0
6 12 2 4.0
7 15 3 3.0
8 16 4 4.0
9 17 4 4.0
Explanation:
Helper DF:
In [165]: df.set_index('epoch').reindex(np.arange(df.epoch.min(), df.epoch.max()+1))
Out[165]:
price
epoch
1 1.0
2 NaN
3 NaN
4 2.0
5 NaN
6 NaN
7 3.0
8 3.0
9 1.0
10 NaN
11 4.0
12 2.0
13 NaN
14 NaN
15 3.0
16 4.0
17 4.0
In [166]: df.set_index('epoch').reindex(np.arange(df.epoch.min(), df.epoch.max()+1)).rolling(3, min_periods=1).max()
Out[166]:
price
epoch
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
7 3.0
8 3.0
9 3.0
10 3.0
11 4.0
12 4.0
13 4.0
14 2.0
15 3.0
16 4.0
17 4.0
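Since the lookback of 3 will eventually be a variable, the same reindex-and-roll trick can be wrapped in a function. A minimal sketch, assuming the df above (the helper name max_price_lookback is mine):

import numpy as np

def max_price_lookback(df, k):
    # fill the epoch "holes" so that a fixed-size window equals a fixed epoch span
    full = df.set_index('epoch').reindex(np.arange(df.epoch.min(), df.epoch.max() + 1))
    # rolling(k) spans epochs e-k+1..e; pass k + 1 if the inclusive range [e-k, e] is required
    return df.epoch.map(full.rolling(k, min_periods=1).max()['price'])

df['between'] = max_price_lookback(df, 3)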
Consider applying a function to the epoch column that locates the required rows and calculates their max price:
df['between'] = df['epoch'].apply(lambda e: df.loc[
    (df['epoch'] >= e - 3) & (df['epoch'] <= e), 'price'].max())
df
epoch price between
0 1 1 1
1 4 2 2
2 7 3 3
3 8 3 3
4 9 1 3
5 11 4 4
6 12 2 4
7 15 3 3
8 16 4 4
9 17 4 4
I have tried both solutions, from Tarashypka and MaxU.
The first solution I tried was Tarashypka's. I tested it on 100k rows; it took about one minute.
Then I tried MaxU's solution, which finished in about 4 seconds.
I prefer MaxU's solution because of the speed, but from Tarashypka's solution I also learned how to use a lambda function with a DataFrame.
Thank you very much to all of you.
Best regards and wishes.
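The gap is expected: the apply solution rescans the whole frame once per row (quadratic), while the rolling solution makes a single linear pass over the reindexed epoch range. A minimal sketch of how such a timing might be reproduced (the random test data here is illustrative, not the original 100k-row set):

import time
import numpy as np
import pandas as pd

n = 100_000
test = pd.DataFrame({'epoch': np.sort(np.random.choice(np.arange(2 * n), n, replace=False)),
                     'price': np.random.randint(1, 100, n)})

t0 = time.perf_counter()
test['between'] = test.epoch.map(test.set_index('epoch')
                                     .reindex(np.arange(test.epoch.min(), test.epoch.max() + 1))
                                     .rolling(3, min_periods=1)
                                     .max()['price'])
print('rolling:', time.perf_counter() - t0, 's')

t0 = time.perf_counter()
test['between2'] = test['epoch'].apply(lambda e: test.loc[
    (test['epoch'] >= e - 3) & (test['epoch'] <= e), 'price'].max())  # slow: full scan per row
print('apply:', time.perf_counter() - t0, 's')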
Related
I have a pd.DataFrame df with one column, say:
A = [1,2,3,4,5,6,7,8,2,4]
df = pd.DataFrame(A,columns = ['A'])
For each row, I want to take the previous 2 values, the current value, and the next 2 values (a window of 5), sum them, and store the result in a new column. Desired output:
A A_sum
1 6
2 10
3 15
4 20
5 25
6 30
7 28
8 27
2 21
4 14
I have tried:
df['A_sum'] = df['A'].rolling(2).sum()
I also tried shift, but everything I found works either forward or backward only; I'm looking for a combination of both.
Use a rolling window of 5, and pass center=True (so the window extends two rows on each side of the current row) and min_periods=1 (so the truncated edge windows still produce values) to Series.rolling:
df['A_sum'] = df['A'].rolling(5, center=True, min_periods=1).sum()
print (df)
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
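The min_periods=1 part is what keeps the first and last two rows from becoming NaN: the centered window is simply truncated at the edges. A quick check against the df above:

print(df['A'].iloc[0:3].sum())   # 6 -> window at index 0 is truncated to rows 0..2
print(df['A'].iloc[7:10].sum())  # 14 -> window at index 9 is truncated to rows 7..9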
If you are allowed to use numpy, you can use numpy.convolve to get the desired output:
import numpy as np
import pandas as pd
A = [1,2,3,4,5,6,7,8,2,4]
B = np.convolve(A,[1,1,1,1,1], 'same')
df = pd.DataFrame({"A":A,"A_sum":B})
print(df)
Output:
A A_sum
0 1 6
1 2 10
2 3 15
3 4 20
4 5 25
5 6 30
6 7 28
7 8 27
8 2 21
9 4 14
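This matches the rolling version because mode='same' pads beyond the array edges with zeros, so the truncated edge windows pick up nothing extra, just like min_periods=1 does for a sum. A tiny illustration:

import numpy as np
print(np.convolve([1, 2, 3], [1, 1, 1], 'same'))  # [3 6 5]: edge windows are zero-padded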
You can use shift for this (straightforward if not elegant). Note that a centered window of 5 needs shifts of -2, -1, 1 and 2 around the current value:
df["A_sum"] = (df.A + df.A.shift(-2).fillna(0) + df.A.shift(-1).fillna(0)
               + df.A.shift(1).fillna(0) + df.A.shift(2).fillna(0))
Output:
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
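If the window size needs to vary, the same idea can be written as a loop over shifts rather than spelled out term by term. A short sketch (the name half is mine; the window width is 2*half + 1):

half = 2  # centered window of width 2*half + 1 = 5
df['A_sum'] = sum(df.A.shift(k).fillna(0) for k in range(-half, half + 1))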
I would like to create an additional column in my data frame without having to loop through the steps.
The column is created in the following steps:
1. Start from the end of the data. For each date, pick every nth row (in this case the 5th) going backwards.
2. Take the rolling sum of x of those picked numbers (here x=2).
A worked example:
11/22: 5, 7, 3, 2 (every 5th row being picked), but x=2, so 5+7=12
11/15: 6, 5, 2 (every 5th row being picked), but x=2, so 6+5=11
date        value  cumulative
8/30/2019   2
9/6/2019    4
9/13/2019   1
9/20/2019   2
9/27/2019   3      5
10/4/2019   3      7
10/11/2019  5      6
10/18/2019  5      7
10/25/2019  7      10
11/1/2019   4      7
11/8/2019   9      14
11/15/2019  6      11
11/22/2019  5      12
Let's assume we have a set of 15 integers:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], columns=['original_data'])
We define which nth row should be added (n) and how many values in total (x) go into each sum:
n = 5
x = 2
(
    df
    # add x-1 extra columns, each shifted a multiple of n rows
    .assign(**{
        'n{} x{}'.format(n, i): df['original_data'].shift(n * i)
        for i in range(1, x)})
    # take the row-wise sum
    .sum(axis=1)
)
Intermediate output (the frame after .assign, before the row-wise sum):
original_data n5 x1
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 1.0
6 7 2.0
7 8 3.0
8 9 4.0
9 10 5.0
10 11 6.0
11 12 7.0
12 13 8.0
13 14 9.0
14 15 10.0
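Piping that frame into .sum(axis=1), as in the snippet above, returns the final Series (NaN contributes 0 to a row sum). For example:

result = (
    df
    .assign(**{'n{} x{}'.format(n, i): df['original_data'].shift(n * i)
               for i in range(1, x)})
    .sum(axis=1)
)
print(result.head(7).tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 7.0, 9.0]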
Following is my input data frame, after the avg column has been added:
a b c d avg
0 1 4 7 8 5
1 3 4 5 6 4.5
2 6 8 2 9 6.25
3 2 9 5 6 5.5
Output required after adding the criteria column (avg_criteria is the mean of the row's values that are >= avg):
a b c d avg avg_criteria
0 1 4 7 8 5 7.5 (<=5)
1 3 4 5 6 4.5 5.5 (<=4.5)
2 6 8 2 9 6.25 8.5 (<=6.25)
3 2 9 5 6 5.5 7.5 (<=5.5)
This is the code I have tried:
# read file
df_input_data = pd.DataFrame(pd.read_excel(file_path, header=2).dropna(axis=1, how='all'))
# add a column after calculating the average
df_avg = df_input_data.assign(Avg=df_input_data.mean(axis=1, skipna=True))
# criteria
criteria = df_input_data.iloc[:, :] >= df_avg.iloc[1][-1]
# create the output data frame
df_output = df_input_data.assign(Avg_criteria=criteria)
I am unable to solve this issue; I have tried and googled many times.
From what I understand, you can use df.where (or df.mask) after comparing with the row average, and then take the mean of the values that survive:
m = df.drop(columns="avg")
m.where(m.ge(df['avg'], axis=0)).mean(axis=1)
0 7.5
1 5.5
2 8.5
3 7.5
dtype: float64
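For intuition, here is the intermediate frame that feeds the mean: every value below its row average has been replaced by NaN, and mean(axis=1) then skips the NaNs:

print(m.where(m.ge(df['avg'], axis=0)))
    a    b    c    d
0 NaN  NaN  7.0  8.0
1 NaN  NaN  5.0  6.0
2 NaN  8.0  NaN  9.0
3 NaN  9.0  NaN  6.0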
print(df.assign(Avg_criteria=m.where(m.ge(df['avg'], axis=0)).mean(axis=1)))
a b c d avg Avg_criteria
0 1 4 7 8 5.00 7.5
1 3 4 5 6 4.50 5.5
2 6 8 2 9 6.25 8.5
3 2 9 5 6 5.50 7.5
I have a pandas data frame that I pivoted exactly the way I want. Now I want to unpivot it to get the position data (row and column) back together with the values in the newly formed data frame. For example, in the unpivoted frame I want the first row to have 1 under "row", 1 under "a", and 13 as the value (example below). How can I unpivot so that I get the row and column values? I have tried pd.melt, but it didn't seem to work (it made no difference). Thanks! Directly below is the code to make the pivoted data frame.
import pandas as pd
row = [1, 2, 3, 4, 5]
df67 = {'row':row,}
df67 = pd.DataFrame(df67,columns=['row'])
df67['a'] = [1, 2, 3, 4, 5]
df67['b'] =[13, 18, 5, 10, 6]
#df67 (dataframe before pivot)
df68 = df67.pivot(index='row', columns = 'a')
#df68 (dataframe after pivot)
What I want the result to be for the first line:
row | a | value
1 | 1 | 13
Use DataFrame.stack with DataFrame.reset_index:
df = df68.stack().reset_index()
print (df)
row a b
0 1 1 13.0
1 2 2 18.0
2 3 3 5.0
3 4 4 10.0
4 5 5 6.0
EDIT:
To keep missing values from being dropped, use the dropna=False parameter:
df = df68.stack(dropna=False).reset_index()
print (df)
row a b
0 1 1 13.0
1 1 2 NaN
2 1 3 NaN
3 1 4 NaN
4 1 5 NaN
5 2 1 NaN
6 2 2 18.0
7 2 3 NaN
8 2 4 NaN
9 2 5 NaN
10 3 1 NaN
11 3 2 NaN
12 3 3 5.0
13 3 4 NaN
14 3 5 NaN
15 4 1 NaN
16 4 2 NaN
17 4 3 NaN
18 4 4 10.0
19 4 5 NaN
20 5 1 NaN
21 5 2 NaN
22 5 3 NaN
23 5 4 NaN
24 5 5 6.0
I am trying to decile the column score of a DataFrame.
I use the following code:
np.percentile(df['score'], np.arange(0, 100, 10))
My problem is that score contains lots of zeros. How can I filter out these 0 values and decile only the remaining values?
Filter them with boolean indexing:
df.loc[df['score']!=0, 'score']
or
df['score'][lambda x: x!=0]
and pass that to the percentile function.
np.percentile(df['score'][lambda x: x!=0], np.arange(0,100,10))
Consider the dataframe df
df = pd.DataFrame(
dict(score=np.random.rand(20))
).where(
np.random.choice([True, False], (20, 1), p=(.8, .2)),
0
)
score
0 0.380777
1 0.559356
2 0.103099
3 0.800843
4 0.262055
5 0.389330
6 0.477872
7 0.393937
8 0.189949
9 0.571908
10 0.133402
11 0.033404
12 0.650236
13 0.593495
14 0.000000
15 0.013058
16 0.334851
17 0.000000
18 0.999757
19 0.000000
Use pd.qcut to decile
pd.qcut(df.loc[df.score != 0, 'score'], 10, range(10))
0 4
1 6
2 1
3 9
4 3
5 4
6 6
7 5
8 2
9 7
10 1
11 0
12 8
13 8
15 0
16 3
18 9
Name: score, dtype: category
Categories (10, int64): [0 < 1 < 2 < 3 ... 6 < 7 < 8 < 9]
Or all together
df.assign(decile=pd.qcut(df.loc[df.score != 0, 'score'], 10, range(10)))
score decile
0 0.380777 4.0
1 0.559356 6.0
2 0.103099 1.0
3 0.800843 9.0
4 0.262055 3.0
5 0.389330 4.0
6 0.477872 6.0
7 0.393937 5.0
8 0.189949 2.0
9 0.571908 7.0
10 0.133402 1.0
11 0.033404 0.0
12 0.650236 8.0
13 0.593495 8.0
14 0.000000 NaN
15 0.013058 0.0
16 0.334851 3.0
17 0.000000 NaN
18 0.999757 9.0
19 0.000000 NaN
You can simply mask zeros and then remove them from your column using boolean indexing:
score = df['score']
score_no_zero = score[score != 0]
np.percentile(score_no_zero, np.arange(0,100,10))
or in one step:
np.percentile(df['score'][df['score'] != 0], np.arange(0,100,10))