How to find the row index in a pandas column? - python

I am very new to pandas and am trying to get the row index for any value higher than lprice. Can someone give me a quick idea of what I am doing wrong?
DataFrame:
    StrikePrice
0         40.00
1         50.00
2         60.00
3         70.00
4         80.00
5         90.00
6        100.00
7        110.00
8        120.00
9        130.00
10       140.00
11       150.00
12       160.00
13       170.00
14       180.00
15       190.00
16       200.00
17       210.00
18       220.00
19       230.00
20       240.00
Now I am trying to figure out how to get the row index for any value higher than lprice:
lprice = 99
for strike in df['StrikePrice']:
    strike = float(strike)
    # print(strike)
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        ce_1 = strike
        print(df.index['StrikePrice' == ce_1])
The above gives 0 as the index
I am not sure what I am doing wrong here.

Use the index attribute after boolean slicing:
lprice = 99
df[df.StrikePrice >= lprice].index
Int64Index([6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
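As an aside, the reason the original loop always prints 0: 'StrikePrice' == ce_1 compares the string literal with a float, which is simply False, and False (a bool, i.e. a subclass of int in Python) then acts like position 0 when indexing:
ce_1 = 100.0
'StrikePrice' == ce_1  # False: a string literal is never equal to a float
# so df.index['StrikePrice' == ce_1] behaves like df.index[0], hence the 0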
If you insist on iterating and reporting matches as you find them, you can modify your code:
lprice = 99
for idx, strike in df['StrikePrice'].items():  # .items(); iteritems() was removed in pandas 2.0
    strike = float(strike)
    # print(strike)
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        ce_1 = strike
        print(idx)

I think the best approach is to filter the index by boolean indexing:
a = df.index[df['StrikePrice'] >= 99]
#alternative
#a = df.index[df['StrikePrice'].ge(99)]
Your code can be changed similarly:
lprice = 99
for strike in df['StrikePrice']:
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        print(df.index[df['StrikePrice'] == strike])
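If you only need the first matching index label, a small sketch (note that idxmax would also return the first label when nothing matches, hence the .any() guard):
mask = df['StrikePrice'] >= lprice
if mask.any():
    first_idx = mask.idxmax()  # first index label where the condition is True
    print(first_idx)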

numpy.where(condition[, x, y]) does exactly this if we specify only condition: np.where() then returns the tuple condition.nonzero(), i.e. the indices where condition is True.
In [36]: np.where(df.StrikePrice >= lprice)[0]
Out[36]: array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=int64)
PS: thanks @jezrael for the hint -- np.where() returns numerical index positions instead of DataFrame index values:
In [41]: df = pd.DataFrame({'val':np.random.rand(10)}, index=pd.date_range('2018-01-01', freq='9999S', periods=10))
In [42]: df
Out[42]:
val
2018-01-01 00:00:00 0.459097
2018-01-01 02:46:39 0.148380
2018-01-01 05:33:18 0.945564
2018-01-01 08:19:57 0.105181
2018-01-01 11:06:36 0.570019
2018-01-01 13:53:15 0.203373
2018-01-01 16:39:54 0.021001
2018-01-01 19:26:33 0.717460
2018-01-01 22:13:12 0.370547
2018-01-02 00:59:51 0.462997
In [43]: np.where(df['val']>0.5)[0]
Out[43]: array([2, 4, 7], dtype=int64)
workaround:
In [44]: df.index[np.where(df['val']>0.5)[0]]
Out[44]: DatetimeIndex(['2018-01-01 05:33:18', '2018-01-01 11:06:36', '2018-01-01 19:26:33'], dtype='datetime64[ns]', freq=None)

Related

Create column based on date conditions, but I get this error AttributeError: 'SeriesGroupBy' object has no attribute 'sub'?

Hey, a Python newbie here.
Suppose I have the first two columns of this dataframe:
df = pd.DataFrame({'group': ["Sun", "Moon", "Sun", "Moon", "Mars", "Mars"],
                   'score': [2, 13, 24, 15, 11, 44],
                   'datetime': ["2017-08-30 07:00:00", "2017-08-30 08:00:00",
                                "2017-08-31 07:00:00", "2017-08-31 08:00:00",
                                "2017-08-29 21:00:00", "2017-08-28 21:00:00"],
                   'difference': [2, 13, 22, 2, -33, 44]})
I want to create a new column named difference (I have put it there as an illustration), such that it equals:
the score value in that row minus the score value of the day before at the same hour, for that group.
e.g. difference in row 3 is equal to:
the score in that row minus the score on the day before (the 30th) at 08:00:00 for that group (i.e. Moon), i.e. 15 - 13 = 2. If the day before at the same time does not exist, the score of that row itself is taken (e.g. in row 0, for time 2017-08-30 07:00:00 there is no 2017-08-29 07:00:00, hence just the 2 is taken).
I write the following:
df['datetime'] = pd.to_datetime(df['datetime'])
before = df['datetime'] - pd.DateOffset(days=1)
df['difference'] = df.groupby(["group", "datetime"])['score'].sub(
    before.map(df.set_index('datetime')['score']), fill_value=0)
but I get the error:
AttributeError: 'SeriesGroupBy' object has no attribute 'sub'
What am I missing? Is there a more elegant solution?
MultiIndex.map
We can set the group column along with the before column as the index of the dataframe, then map that MultiIndex with the score values belonging to the same group, and finally subtract the mapped score values from the score column to calculate the difference.
s = df.set_index(['group', before]).index.map(df.set_index(['group', 'datetime'])['score'])
df['difference'] = df['score'].sub(list(s), fill_value=0)
>>> df
  group  score            datetime  difference
0   Sun      2 2017-08-30 07:00:00         2.0
1  Moon     13 2017-08-30 08:00:00        13.0
2   Sun     24 2017-08-31 07:00:00        22.0
3  Moon     15 2017-08-31 08:00:00         2.0
4  Mars     11 2017-08-29 21:00:00       -33.0
5  Mars     44 2017-08-28 21:00:00        44.0
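A hedged alternative sketch, if the MultiIndex mapping feels opaque: self-merge the frame against a copy whose datetime is shifted forward one day (this assumes datetime was already converted with pd.to_datetime; the _prev suffix is introduced here for illustration):
# line up yesterday's score with today's timestamp within each group
prev = df.assign(datetime=df['datetime'] + pd.DateOffset(days=1))
prev = prev[['group', 'datetime', 'score']]
out = df.merge(prev, on=['group', 'datetime'], how='left', suffixes=('', '_prev'))
# where no same-hour observation existed the day before, fall back to 0
out['difference'] = out['score'] - out['score_prev'].fillna(0)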

interpolate pandas frame using time index from another data frame

So, I have 2 data frames, where the first one has the following structure:
                   ds  1_sensor_id  1_val_1  1_val_2
0 2019-09-13 12:40:00        33469       30       43
1 2019-09-13 12:45:00        33469       43       43
The second one has the following structure:
                   ds  2_sensor_id  2_val_1  2_val_2
0 2019-09-13 12:42:00        20006        6       50
1 2019-09-13 12:47:00        20006        5       80
So what I want to do is merge the two frames through interpolation: the merged frame should have a row for each timestamp (ds) in frame 1, with the 2_val_1 and 2_val_2 columns interpolated at those timestamps. What would be the best way to do this in pandas? I tried the merge_asof function, but that does nearest-neighbour matching rather than interpolation, and I did not get all the timestamps back.
You can concatenate one frame with the other and use interpolate(); example:
import datetime
import pandas as pd

df1 = pd.DataFrame(columns=['ds', '1_sensor_id', '1_val_1', '1_val_2'],
                   data=[[datetime.datetime(2019, 9, 13, 12, 40, 0), 33469, 30, 43],
                         [datetime.datetime(2019, 9, 13, 12, 45, 0), 33469, 33, 43]])
df2 = pd.DataFrame(columns=['ds', '2_sensor_id', '2_val_1', '2_val_2'],
                   data=[[datetime.datetime(2019, 9, 13, 12, 42, 0), 20006, 6, 50],
                         [datetime.datetime(2019, 9, 13, 12, 47, 0), 20006, 5, 80]])
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df1, df2], sort=False)
df.set_index('ds', inplace=True)
# note: newer pandas may require a sorted index for time interpolation,
# in which case add df = df.sort_index() first (this changes the row order)
df.interpolate(method='time', limit_direction='backward', inplace=True)
print(df)
1_sensor_id 1_val_1 ... 2_val_1 2_val_2
ds ...
2019-09-13 12:40:00 33469.0 30.0 ... 6.0 50.0
2019-09-13 12:45:00 33469.0 33.0 ... 5.4 68.0
2019-09-13 12:42:00 NaN NaN ... 6.0 50.0
2019-09-13 12:47:00 NaN NaN ... 5.0 80.0
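If, as asked, the merged frame should only contain the timestamps of frame 1, a hedged follow-up is to select those rows after interpolating (merged is a name introduced here):
# keep only the rows at frame 1's timestamps; df1['ds'] still holds those values
merged = df.loc[df1['ds']]
print(merged)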

Convert a datetime index to sequential numbers for the x values of machine learning

This seems like a basic question. I want to use the datetime index of a pandas dataframe as the x values of a machine learning algorithm for univariate time series comparisons.
I tried to isolate the index and then convert it to a number, but I get an error.
df = data["Close"]
idx = df.index
df.index.get_loc(idx)
Date
2014-03-31 0.9260
2014-04-01 0.9269
2014-04-02 0.9239
2014-04-03 0.9247
2014-04-04 0.9233
This is what I get when I add your code:
2019-04-24 00:00:00 0.7097
2019-04-25 00:00:00 0.7015
2019-04-26 00:00:00 0.7018
2019-04-29 00:00:00 0.7044
x (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
Name: Close, Length: 1325, dtype: object
I need a column of 1 to the number of values in my dataframe.
First select the Close column with double [] to get a one-column DataFrame, so it is possible to add a new column:
import numpy as np

df = data[["Close"]]
df["x"] = np.arange(1, len(df) + 1)
print(df)
             Close  x
Date
2014-03-31  0.9260  1
2014-04-01  0.9269  2
2014-04-02  0.9239  3
2014-04-03  0.9247  4
2014-04-04  0.9233  5
You can add a column with the values range(1, len(data) + 1) like so:
df = pd.DataFrame({"y": [5, 4, 3, 2, 1]}, index=pd.date_range(start="2019-08-01", periods=5))
In [3]: df
Out[3]:
y
2019-08-01 5
2019-08-02 4
2019-08-03 3
2019-08-04 2
2019-08-05 1
df["x"] = range(1, len(df) + 1)
In [7]: df
Out[7]:
y x
2019-08-01 5 1
2019-08-02 4 2
2019-08-03 3 3
2019-08-04 2 4
2019-08-05 1 5
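Since the question mentions feeding these x values into a machine learning algorithm, a hedged usage sketch (assuming scikit-learn is available; LinearRegression is just a stand-in model):
from sklearn.linear_model import LinearRegression

# x must be a 2-D feature matrix (n_samples, 1); y is the target column
model = LinearRegression().fit(df[["x"]], df["y"])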

set_codes in multiIndexed pandas series

I want to MultiIndex an array of data.
Initially, I was indexing my data with datetimes, but for some later applications I had to add another numeric index (that goes from 0 to len(array)-1).
I have written these few lines:
import datetime
import pandas as pd

O = [0.701733664614, 0.699495411782, 0.572129320819, 0.613315597684,
     0.58079660603, 0.596638918579, 0.48453382119]
Ab = [datetime.datetime(2018, 12, 11, 14, 0), datetime.datetime(2018, 12, 21, 10, 0),
      datetime.datetime(2018, 12, 21, 14, 0), datetime.datetime(2019, 1, 1, 10, 0),
      datetime.datetime(2019, 1, 1, 14, 0), datetime.datetime(2019, 1, 11, 10, 0),
      datetime.datetime(2019, 1, 11, 14, 0)]
tst = pd.Series(O, index=Ab)
ld = len(tst)
index = pd.MultiIndex.from_product([range(ld), Ab], names=['id', 'dtime'])
print(index)
data = pd.Series(O, index=index)
But when printing index, I get some bizarre 'codes':
The levels & names are perfect, but the codes go from 0 to 763 ... 764 times (instead of once)!
I tried to add the set_codes command:
index.set_codes([x for x in range(0,ld)], level=0)
print (index)
In vain; I get the following error:
ValueError: Unequal code lengths: [764, 583696]
The initial pandas series:
print(tst)
2005-01-01 14:00:00 0.544177
2005-01-01 14:00:00 0.544177
2005-01-21 14:00:00 0.602239
...
2019-05-21 10:00:00 0.446813
2019-05-21 14:00:00 0.466573
Length: 764, dtype: float64
The new expected one:
id dtime
0 2005-01-01 14:00:00 0.544177
1 2005-01-01 14:00:00 0.544177
2 2005-01-21 14:00:00 0.602239
...
762 2019-05-21 10:00:00 0.446813
763 2019-05-21 14:00:00 0.466573
Thanks in advance
You can create a new index with MultiIndex.from_arrays and reassign it to the Series:
s.index = pd.MultiIndex.from_arrays([np.arange(len(s)), s.index], names=['id','dtime'])
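Applied to the question's tst, a minimal sketch (assuming numpy is imported as np):
import numpy as np

tst.index = pd.MultiIndex.from_arrays([np.arange(len(tst)), tst.index],
                                      names=['id', 'dtime'])
print(tst)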

Pandas: De-seasonalizing time-series data

I have the following dataframe df:
[Out]:
VOL
2011-04-01 09:30:00 11297
2011-04-01 09:30:10 6526
2011-04-01 09:30:20 14021
2011-04-01 09:30:30 19472
2011-04-01 09:30:40 7602
...
2011-04-29 15:59:30 79855
2011-04-29 15:59:40 83050
2011-04-29 15:59:50 602014
This df consists of volume observations every 10 seconds for 22 non-consecutive days. I want to de-seasonalize my time series by dividing each observation by the average volume of its respective 5-minute time interval. To do so, I need to take the average volume of every 5-minute interval across the 22 days. So I would end up with a series of averages for 9:30:00 - 9:35:00, 9:35:00 - 9:40:00, 9:40:00 - 9:45:00, ... up to 16:00:00. The average for the interval 9:30:00 - 9:35:00 is the average volume in this interval across all 22 days (i.e. the total volume between 9:30:00 and 9:35:00 on (day 1 + day 2 + ... + day 22) / 22. Does it make sense?). I would then divide each observation in df that falls between 9:30:00 and 9:35:00 by the average of this time interval.
Is there a package in Python / Pandas that can do this?
Edited answer:
date_times = pd.date_range(datetime.datetime(2011, 4, 1, 9, 30),
                           datetime.datetime(2011, 4, 16, 0, 0),
                           freq='10s')
VOL = np.random.sample(date_times.size) * 10000.0
df = pd.DataFrame(data={'VOL': VOL, 'time': date_times}, index=date_times)
df['h'] = df.index.hour
df['m'] = df.index.minute
# resample(..., how=...) was removed in newer pandas; use .agg instead
df1 = df.resample('5Min').agg({'VOL': 'mean'})
times = pd.to_datetime(df1.index)
df2 = df1.groupby([times.hour, times.minute]).VOL.mean().reset_index()
df2.columns = ['h', 'm', 'VOL']
df_norm = df.merge(df2, on=['h', 'm'])
df_norm['norm'] = df_norm['VOL_x'] / df_norm['VOL_y']
Older answer (kept temporarily):
Use the resample function:
df.resample('5Min').agg({'VOL': 'mean'})
e.g.:
date_times = pd.date_range(datetime.datetime(2011, 4, 1, 9, 30),
                           datetime.datetime(2011, 4, 16, 0, 0),
                           freq='10s')
VOL = np.random.sample(date_times.size) * 10000.0
df = pd.DataFrame(data={'VOL': VOL}, index=date_times)
df.resample('5Min').agg({'VOL': 'mean'})
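For reference, a hedged sketch of the whole de-seasonalization in one pass on recent pandas, grouping by the 5-minute time-of-day bucket across all days (VOL_deseason is a name introduced here for illustration):
# bucket each timestamp into its 5-minute time-of-day slot (09:30, 09:35, ...)
bucket = df.index.floor('5min').time
# average VOL of each slot across all days, broadcast back to every row
slot_mean = df['VOL'].groupby(bucket).transform('mean')
# divide each observation by its slot's cross-day average
df['VOL_deseason'] = df['VOL'] / slot_mean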
