I wrote a function that uses the psar function from the pandas_ta library. It seems to work incorrectly: it gives the PSARl, PSARs and PSARr values on the wrong dates. Using an interval of 1 day on BTC-USD I get the following output:
Used function:
psar = df.ta.psar(high=df['High'], low=df['Low'], close=df['Close'], af0=0.02, af=0.02, max_af=0.2)
print(psar)
Output:
PSARl_0.02_0.2 PSARs_0.02_0.2 PSARaf_0.02_0.2 PSARr_0.02_0.2
Date
2021-10-29 NaN NaN 0.02 0
2021-10-30 NaN 62927.609375 0.02 1
2021-10-31 NaN 62927.609375 0.04 0
2021-11-01 NaN 62813.478125 0.06 0
2021-11-02 59695.183594 NaN 0.02 1
2021-11-03 59695.183594 NaN 0.02 0
2021-11-04 59786.135781 NaN 0.02 0
2021-11-05 59875.268925 NaN 0.02 0
2021-11-06 59962.619406 NaN 0.02 0
2021-11-07 60048.222877 NaN 0.02 0
2021-11-08 60132.114279 NaN 0.04 0
2021-11-09 60433.779395 NaN 0.06 0
2021-11-10 60919.572788 NaN 0.08 0
2021-11-11 61549.176965 NaN 0.08 0
2021-11-12 62128.412808 NaN 0.08 0
2021-11-13 62333.914062 NaN 0.08 0
2021-11-14 62333.914062 NaN 0.08 0
2021-11-15 62850.370938 NaN 0.08 0
2021-11-16 NaN 68789.625000 0.02 1
2021-11-17 NaN 68594.159219 0.04 0
2021-11-18 NaN 68191.009256 0.06 0
2021-11-19 NaN 67492.596279 0.08 0
2021-11-20 NaN 66549.602952 0.08 0
2021-11-21 NaN 65682.049091 0.08 0
2021-11-22 NaN 64883.899538 0.10 0
2021-11-23 NaN 63963.493569 0.12 0
2021-11-24 NaN 62963.805747 0.12 0
2021-11-25 NaN 62084.080463 0.12 0
2021-11-26 NaN 61309.922214 0.14 0
2021-11-27 NaN 60226.300292 0.14 0
2021-11-28 NaN 59294.385438 0.14 0
2021-11-29 NaN 58492.938664 0.14 0
Looking at the yfinance chart for BTC-USD, I should not be getting a 1 in the PSARr column on 2021-10-30, but I somehow am. It seems random: some values are correct but others aren't. What am I doing wrong, or is there something wrong within the function?
Thanks!
Other data:
Open High Low Close \
Date
2021-10-29 60624.871094 62927.609375 60329.964844 62227.964844
2021-10-30 62239.363281 62330.144531 60918.386719 61888.832031
2021-10-31 61850.488281 62406.171875 60074.328125 61318.957031
2021-11-01 61320.449219 62419.003906 59695.183594 61004.406250
2021-11-02 60963.253906 64242.792969 60673.054688 63226.402344
2021-11-03 63254.335938 63516.937500 61184.238281 62970.046875
2021-11-04 62941.804688 63123.289062 60799.664062 61452.230469
2021-11-05 61460.078125 62541.468750 60844.609375 61125.675781
2021-11-06 61068.875000 61590.683594 60163.781250 61527.480469
2021-11-07 61554.921875 63326.988281 61432.488281 63326.988281
2021-11-08 63344.066406 67673.742188 63344.066406 67566.828125
2021-11-09 67549.734375 68530.335938 66382.062500 66971.828125
2021-11-10 66953.335938 68789.625000 63208.113281 64995.230469
2021-11-11 64978.890625 65579.015625 64180.488281 64949.960938
2021-11-12 64863.980469 65460.816406 62333.914062 64155.941406
2021-11-13 64158.121094 64915.675781 63303.734375 64469.527344
2021-11-14 64455.371094 65495.179688 63647.808594 65466.839844
2021-11-15 65521.289062 66281.570312 63548.144531 63557.871094
2021-11-16 63721.195312 63721.195312 59016.335938 60161.246094
2021-11-17 60139.621094 60823.609375 58515.410156 60368.011719
2021-11-18 60360.136719 60948.500000 56550.792969 56942.136719
2021-11-19 56896.128906 58351.113281 55705.179688 58119.578125
2021-11-20 58115.082031 59859.878906 57469.726562 59697.195312
2021-11-21 59730.507812 60004.425781 58618.929688 58730.476562
2021-11-22 58706.847656 59266.359375 55679.839844 56289.289062
2021-11-23 56304.554688 57875.515625 55632.761719 57569.074219
2021-11-24 57565.851562 57803.066406 55964.222656 56280.425781
2021-11-25 57165.417969 59367.968750 57146.683594 57274.679688
2021-11-26 58960.285156 59183.480469 53569.765625 53569.765625
2021-11-27 53736.429688 55329.257812 53668.355469 54815.078125
2021-11-28 54813.023438 57393.843750 53576.734375 57248.457031
2021-11-29 57474.843750 58749.250000 56856.371094 58749.250000
Adj Close Volume
Date
2021-10-29 62227.964844 36856881767
2021-10-30 61888.832031 32157938616
2021-10-31 61318.957031 32241199927
2021-11-01 61004.406250 36150572843
2021-11-02 63226.402344 37746665647
2021-11-03 62970.046875 36124731509
2021-11-04 61452.230469 32615846901
2021-11-05 61125.675781 30605102446
2021-11-06 61527.480469 29094934221
2021-11-07 63326.988281 24726754302
2021-11-08 67566.828125 41125608330
2021-11-09 66971.828125 42357991721
2021-11-10 64995.230469 48730828378
2021-11-11 64949.960938 35880633236
2021-11-12 64155.941406 36084893887
2021-11-13 64469.527344 30474228777
2021-11-14 65466.839844 25122092191
2021-11-15 63557.871094 30558763548
2021-11-16 60161.246094 46844335592
2021-11-17 60368.011719 39178392930
2021-11-18 56942.136719 41388338699
2021-11-19 58119.578125 38702407772
2021-11-20 59697.195312 30624264863
2021-11-21 58730.476562 26123447605
2021-11-22 56289.289062 35036121783
2021-11-23 57569.074219 37485803899
2021-11-24 56280.425781 36635566789
2021-11-25 57274.679688 34284016248
2021-11-26 53569.765625 41810748221
2021-11-27 54815.078125 30560857714
2021-11-28 57248.457031 28116886357
2021-11-29 58749.250000 33326104576
psar = df.ta.psar(high=df['High'], low=df['Low'], close=df['Close'], af0=0.02, af=0.02, max_af=0.2)
print(psar)
# The PSAR reversal column you need (the suffix is af0_max_af, i.e. _0.02_0.2)
mypsar = psar["PSARr_0.02_0.2"].iloc[-1]
if mypsar > 0:
    mysartrend = "Bull"
else:
    mysartrend = "Bear"
print(mysartrend)
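To dig into the suspicious flags myself, I also ran a sanity check of my own (not part of pandas_ta): away from the very first bars, a 1 in PSARr should line up with the indicator flipping between the long (PSARl) and short (PSARs) columns. Mismatches on the earliest bars may simply be the implementation seeding an initial trend from the first candles, which would explain the flag on 2021-10-30 in the output above.
import pandas as pd

# True where the indicator is on its long leg (PSARl populated)
long_leg = psar['PSARl_0.02_0.2'].notna()
compare = pd.DataFrame({
    'flag': psar['PSARr_0.02_0.2'].astype(bool),
    'flip': long_leg.ne(long_leg.shift()),
})
# Rows where the reversal flag and the actual long/short flip disagree
print(compare[compare['flag'] != compare['flip']])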
My goal is to select the column Sabah in dataframe prdt and enter each of its values into the repeated rows called Sabah in dataframe prcal.
prcal
Vakit Start_Date End_Date Start_Time End_Time
0 Sabah 2022-01-01 2022-01-01 NaN NaN
1 Güneş 2022-01-01 2022-01-01 NaN NaN
2 Öğle 2022-01-01 2022-01-01 NaN NaN
3 İkindi 2022-01-01 2022-01-01 NaN NaN
4 Akşam 2022-01-01 2022-01-01 NaN NaN
..........................................................
2184 Sabah 2022-12-31 2022-12-31 NaN NaN
2185 Güneş 2022-12-31 2022-12-31 NaN NaN
2186 Öğle 2022-12-31 2022-12-31 NaN NaN
2187 İkindi 2022-12-31 2022-12-31 NaN NaN
2188 Akşam 2022-12-31 2022-12-31 NaN NaN
2189 rows × 5 columns
prdt
Day Sabah Güneş Öğle İkindi Akşam Yatsı
0 2022-01-01 06:51:00 08:29:00 13:08:00 15:29:00 17:47:00 19:20:00
1 2022-01-02 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:21:00
2 2022-01-03 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:22:00
3 2022-01-04 06:51:00 08:29:00 13:09:00 15:31:00 17:49:00 19:22:00
4 2022-01-05 06:51:00 08:29:00 13:10:00 15:32:00 17:50:00 19:23:00
...........................................................................
360 2022-12-27 06:49:00 08:27:00 13:06:00 15:25:00 17:43:00 19:16:00
361 2022-12-28 06:50:00 08:28:00 13:06:00 15:26:00 17:43:00 19:17:00
362 2022-12-29 06:50:00 08:28:00 13:07:00 15:26:00 17:44:00 19:18:00
363 2022-12-30 06:50:00 08:28:00 13:07:00 15:27:00 17:45:00 19:18:00
364 2022-12-31 06:50:00 08:28:00 13:07:00 15:28:00 17:46:00 19:19:00
365 rows × 7 columns
I selected every row called Sabah with prcal.iloc[::6,:] and made a list from prdt['Sabah'].
But when I assign prcal.iloc[::6,:] = prdt['Sabah'][0:365] I get a value error:
ValueError: Must have equal len keys and value when setting with an iterable
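The error comes from assigning a one-dimensional Series (365 values) to a two-dimensional slice (365 rows × 5 columns). A minimal sketch of one way around it, assuming the times are meant to land in the Start_Time column (my assumption):
import pandas as pd

# Positional index of the target column, so it can be combined with
# positional row selection in .iloc
col = prcal.columns.get_loc('Start_Time')

# .to_numpy() drops prdt's index so pandas doesn't try to align the
# 0..364 index of prdt['Sabah'] against the 0,6,12,... target rows
prcal.iloc[::6, col] = prdt['Sabah'].to_numpy()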
I have a df sample with one of the columns named date_code, dtype datetime64[ns]:
date_code
2022-03-28
2022-03-29
2022-03-30
2022-03-31
2022-04-01
2022-04-07
2022-04-07
2022-04-08
2022-04-12
2022-04-12
2022-04-14
2022-04-14
2022-04-15
2022-04-16
2022-04-16
2022-04-17
2022-04-18
2022-04-19
2022-04-20
2022-04-20
2022-04-21
2022-04-22
2022-04-25
2022-04-25
2022-04-26
I would like to create a column based on some conditions comparing the current row with the previous one. I am trying to create a function like:
def start_date(row):
    if (row['date_code'] - row['date_code'].shift(-1)).days > 1:
        val = row['date_code'].shift(-1)
    elif row['date_code'] == row['date_code'].shift(-1):
        val = row['date_code']
    else:
        val = np.nan()
    return val
But once I apply
sample['date_zero_recorded'] = sample.apply(start_date, axis=1)
I get error:
AttributeError: 'Timestamp' object has no attribute 'shift'
How I should compare current row with previous with condition?
Edited: expected output
If the current row is greater than the previous one by 2 or more days, return the previous value.
If the current row equals the previous one, return the current value.
Otherwise, return NaN (including when the current row is exactly 1 day after the previous one).
date_code date_zero_recorded
2022-03-28 NaN
2022-03-29 NaN
2022-03-30 NaN
2022-03-31 NaN
2022-04-01 NaN
2022-04-07 2022-04-01
2022-04-07 2022-04-07
2022-04-08 NaN
2022-04-12 2022-04-08
2022-04-12 2022-04-12
2022-04-14 2022-04-12
2022-04-14 2022-04-14
2022-04-15 NaN
2022-04-16 NaN
2022-04-16 2022-04-16
2022-04-17 NaN
2022-04-18 NaN
2022-04-19 NaN
2022-04-20 NaN
2022-04-20 2022-04-20
2022-04-21 NaN
2022-04-22 NaN
2022-04-25 2022-04-22
2022-04-25 2022-04-25
2022-04-26 NaN
You shouldn't process rows one at a time (with apply or iterrows); use vectorized code instead.
For example:
sample['date_code'] = pd.to_datetime(sample['date_code'])
sample['date_zero_recorded'] = (
    sample['date_code'].shift()
    .where(sample['date_code'].diff().ne('1d'))
)
output:
date_code date_zero_recorded
0 2022-03-28 NaT
1 2022-03-29 NaT
2 2022-03-30 NaT
3 2022-03-31 NaT
4 2022-04-01 NaT
5 2022-04-07 2022-04-01
6 2022-04-07 2022-04-07
7 2022-04-08 NaT
8 2022-04-12 2022-04-08
9 2022-04-12 2022-04-12
10 2022-04-14 2022-04-12
11 2022-04-14 2022-04-14
12 2022-04-15 NaT
13 2022-04-16 NaT
14 2022-04-16 2022-04-16
15 2022-04-17 NaT
16 2022-04-18 NaT
17 2022-04-19 NaT
18 2022-04-20 NaT
19 2022-04-20 2022-04-20
20 2022-04-21 NaT
21 2022-04-22 NaT
22 2022-04-25 2022-04-22
23 2022-04-25 2022-04-25
24 2022-04-26 NaT
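The condensed where() covers the "equal" rule because when two consecutive dates are identical, the shifted (previous) value already equals the current one. If you prefer the three rules spelled out explicitly, a sketch of an equivalent (variable names are mine):
import pandas as pd

diff = sample['date_code'].diff()
prev = sample['date_code'].shift()

# Rule 1: gap of 2+ days -> previous date (everything else NaT for now)
out = prev.where(diff >= pd.Timedelta(days=2))
# Rule 2: same date as previous -> current date
out = out.mask(diff == pd.Timedelta(0), sample['date_code'])
# Rule 3 (the 1-day gap) is left as NaT by both steps above
sample['date_zero_recorded'] = out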
How do I compute the returns for the following dataframe? Let the name of the dataframe be refined_df:
0 1
Date
2020-02-03 TSLA MSFT
2020-02-19 AMZN ADBE
2020-03-05 OYST GPRO
2020-03-20 AMZN OYST
2020-04-06 SGEN AEYE
2020-04-22 AEYE TSLA
2020-05-07 AAPL SGEN
and we also have another dataframe, storage_openprices
AAL AAPL ADBE AEYE AMZN GOOG GPRO MSFT OYST PACB RADI SGEN TSLA
Date
2020-01-14 27.910000 79.175003 347.010010 5.300000 1885.880005 1439.010010 4.230 163.389999 28.010000 4.850000 NaN 104.849998 108.851997
2020-01-15 27.450001 77.962502 346.420013 5.020000 1872.250000 1430.209961 4.160 162.619995 26.510000 4.800000 NaN 108.550003 105.952003
2020-01-16 27.790001 78.397499 345.980011 5.060000 1882.989990 1447.439941 4.280 164.350006 25.530001 4.930000 NaN 107.330002 98.750000
2020-01-17 28.299999 79.067497 349.000000 4.840000 1885.890015 1462.910034 4.360 167.419998 24.740000 5.030000 NaN 108.410004 101.522003
2020-01-21 27.969999 79.297501 346.369995 4.880000 1865.000000 1479.119995 4.280 166.679993 26.190001 4.950000 NaN 108.379997 106.050003
What I want is to return a new dataframe with the log return of each stock over its specific holding period.
For example, the (0,0) entry of the returned dataframe is the log return for holding TSLA from 2020-02-03 till 2020-02-19. We take the open prices for TSLA from storage_openprices.
Similarly, for the (1,0) entry we return the log return for holding AMZN from 2020-02-19 till 2020-03-05.
I'm unsure if I should be using apply with a lambda function. My issue is accessing the next row when computing the log returns.
EDIT:
The output, a dataframe should look like
0 1
Date
2020-02-03 0.14 0.21
2020-02-19 0.18 0.19
2020-03-05 XXXX XXXX
2020-03-20 XXXX XXXX
2020-04-06 XXXX XXXX
2020-04-22 XXXX XXXX
2020-05-07 XXXX XXXX
where 0.14 (a made-up number) is the log return of TSLA from 2020-02-03 to 2020-02-19, i.e. log(TSLA open price on 2020-02-19) - log(TSLA open price on 2020-02-03)
Thanks!
You can use merge_asof with the direction='forward' parameter, after reshaping both DataFrames to long format with DataFrame.stack:
print (refined_df)
0 1
Date
2020-02-03 TSLA MSFT
2020-02-19 AMZN ADBE
2020-03-05 OYST GPRO
2020-03-20 AMZN OYST
2020-04-06 SGEN AEYE
2020-04-22 AEYE TSLA
2020-05-07 AAPL SGEN
# changed datetimes so the samples match
print (storage_openprices)
AAL AAPL ADBE AEYE AMZN GOOG \
Date
2020-02-14 27.910000 79.175003 347.010010 5.30 1885.880005 1439.010010
2020-02-15 27.450001 77.962502 346.420013 5.02 1872.250000 1430.209961
2020-02-16 27.790001 78.397499 345.980011 5.06 1882.989990 1447.439941
2020-02-17 28.299999 79.067497 349.000000 4.84 1885.890015 1462.910034
2020-02-21 27.969999 79.297501 346.369995 4.88 1865.000000 1479.119995
GPRO MSFT OYST PACB RADI SGEN TSLA
Date
2020-02-14 4.23 163.389999 28.010000 4.85 NaN 104.849998 108.851997
2020-02-15 4.16 162.619995 26.510000 4.80 NaN 108.550003 105.952003
2020-02-16 4.28 164.350006 25.530001 4.93 NaN 107.330002 98.750000
2020-02-17 4.36 167.419998 24.740000 5.03 NaN 108.410004 101.522003
2020-02-21 4.28 166.679993 26.190001 4.95 NaN 108.379997 106.050003
# Long format: one row per (Date, ticker) price and one per (Date, slot) holding
df1 = storage_openprices.stack().rename_axis(['Date','type']).reset_index(name='new')
df2 = refined_df.stack().rename_axis(['Date','cols']).reset_index(name='type')
# For each holding, take the first available price on or after its date,
# then reshape back to the original wide layout
new = (pd.merge_asof(df2, df1, on='Date', by='type', direction='forward')
         .pivot(index='Date', columns='cols', values='new'))
print (new)
cols 0 1
Date
2020-02-03 108.851997 163.389999
2020-02-19 1865.000000 346.369995
2020-03-05 NaN NaN
2020-03-20 NaN NaN
2020-04-06 NaN NaN
2020-04-22 NaN NaN
2020-05-07 NaN NaN
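The merged frame above holds entry prices only; to get the log returns asked for, you also need the price of the same ticker at the next rebalance date. A sketch of one way to extend the same idea (untested against the real data; next_date, signal_date and the price column names are mine):
import numpy as np
import pandas as pd

# Entry price: first open on or after each rebalance date (as above)
prices = storage_openprices.stack().rename_axis(['Date', 'type']).reset_index(name='price')
holds = refined_df.stack().rename_axis(['Date', 'cols']).reset_index(name='type')
entry = (pd.merge_asof(holds, prices, on='Date', by='type', direction='forward')
           .rename(columns={'price': 'entry_price'}))

# Exit price: the same ticker, priced at the next rebalance date
dates = list(refined_df.index)
next_date = dict(zip(dates[:-1], dates[1:]))
both = entry.assign(signal_date=entry['Date'],
                    Date=entry['Date'].map(next_date)).dropna(subset=['Date'])
both = pd.merge_asof(both.sort_values('Date'), prices,
                     on='Date', by='type', direction='forward')

# Log return = log(exit open) - log(entry open), reshaped to the wide layout
both['log_ret'] = np.log(both['price']) - np.log(both['entry_price'])
print(both.pivot(index='signal_date', columns='cols', values='log_ret'))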
I had a df such as
ID | Half Hour Bucket | clock in time | clock out time | Rate
232 | 4/1/19 8:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54
342 | 4/1/19 8:30 PM | 4/1/19 7:12 PM | 4/1/19 7:22 PM | 0.23
232 | 4/1/19 7:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54
I want my output to be
ID | Half Hour Bucket | clock in time | clock out time | Rate | Mins
232 | 4/1/19 8:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54 |
342 | 4/1/19 8:30 PM | 4/1/19 7:12 PM | 4/1/19 7:22 PM | 0.23 |
232 | 4/1/19 7:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54 |
Where minutes represents the difference between clock out time and clock in time.
But each row should only contain the minutes that fall within that row's half hour bucket.
For example, for ID 342 it would be ten minutes, and those 10 mins would sit on that row.
For ID 232, though, the clock in to clock out time spans over 3 hours. I would only want the 30 mins for 8:00-8:30 in the first row and the 18 mins in the third row. For the half hour buckets like 8:30-9:00 or 9:00-9:30 that don't exist in the original rows, I would want to create new rows in the same df containing NaNs for everything except the half hour bucket and mins fields.
So the 30 mins from 8:00-8:30 would stay in the first row, but I would want 5 new rows for all the half hour buckets that aren't 4/1/19 8:00 PM, with only the half hour bucket and the rate carrying over from the original row. Is this possible?
I thank anyone for their time!
Realised my first answer probably wasn't what you wanted. This version, hopefully, is. It was a bit more involved than I first assumed!
Create Data
First of all create a dataframe to work with, based on that supplied in the question. The resultant formatting isn't quite the same but that would be easily fixed, so I've left it as-is here.
import math
import numpy as np
import pandas as pd

# Create a dataframe to work with from the data provided in the question
columns = ['id', 'half_hour_bucket', 'clock_in_time', 'clock_out_time', 'rate']
data = [[232, '4/1/19 8:00 PM', '4/1/19 7:12 PM', '4/1/19 10:45 PM', 0.54],
        [342, '4/1/19 8:30 PM', '4/1/19 7:12 PM', '4/1/19 07:22 PM', 0.23],
        [232, '4/1/19 7:00 PM', '4/1/19 7:12 PM', '4/1/19 10:45 PM', 0.54]]
df = pd.DataFrame(data, columns=columns)

def convert_cols_to_dt(df):
    # Convert relevant columns to datetime format
    for col in df:
        if col not in ['id', 'rate']:
            df[col] = pd.to_datetime(df[col])
    return df

df = convert_cols_to_dt(df)

# Create the mins column
df['mins'] = (df.clock_out_time - df.clock_in_time)
Output:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 0 days 03:33:00.000000000
1 342 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 0 days 00:10:00.000000000
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 0 days 03:33:00.000000000
Solution
Next define a simple function to return a list of length equal to the number of 30-minute intervals in the mins column.
def upsample_list(x):
    # Number of 30-minute intervals covered by the timedelta, rounded up
    multiplier = math.ceil(x.total_seconds() / (60 * 30))
    return list(range(multiplier))
And apply this to the dataframe:
df['samples'] = df.mins.apply(upsample_list)
Next, create a new row for each list item in the 'samples' column (using the answer provided by Roman Pekar here):
s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'sample'
Join s to the dataframe and clean up the extra columns:
df = df.drop('samples', axis=1).join(s, how='inner').drop('sample', axis=1)
Which gives us this:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 342 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
Nearly there!
Reset the index:
df = df.reset_index(drop=True)
Set duplicate rows to NaN:
df = df.mask(df.duplicated())
Which gives:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232.0 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 NaN NaT NaT NaT NaN NaT
2 NaN NaT NaT NaT NaN NaT
3 NaN NaT NaT NaT NaN NaT
4 NaN NaT NaT NaT NaN NaT
5 NaN NaT NaT NaT NaN NaT
6 NaN NaT NaT NaT NaN NaT
7 NaN NaT NaT NaT NaN NaT
8 342.0 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
9 232.0 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
10 NaN NaT NaT NaT NaN NaT
11 NaN NaT NaT NaT NaN NaT
12 NaN NaT NaT NaT NaN NaT
13 NaN NaT NaT NaT NaN NaT
14 NaN NaT NaT NaT NaN NaT
15 NaN NaT NaT NaT NaN NaT
16 NaN NaT NaT NaT NaN NaT
Lastly, forward fill the half_hour_bucket and rate columns.
df[['half_hour_bucket', 'rate']] = df[['half_hour_bucket', 'rate']].ffill()
Final output:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232.0 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
2 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
3 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
4 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
5 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
6 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
7 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
8 342.0 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
9 232.0 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
10 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
11 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
12 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
13 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
14 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
15 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
16 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
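One loose end: after the forward fill, every duplicated row still carries the original bucket timestamp. A hedged follow-up sketch (continuing the session above) to step each group of filler rows forward by 30 minutes; splitting the exact minutes per bucket against the clock in/out times would still need extra clipping logic:
# Rows created by the upsample have NaN in id, so a cumulative count of
# the non-NaN ids numbers each original row's group 1, 2, 3, ...
grp = df['id'].notna().cumsum()
# ...and cumcount() gives the 0-based position within each group
step = df.groupby(grp).cumcount()
df['half_hour_bucket'] += pd.to_timedelta(step * 30, unit='min')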
I'm having an issue changing a pandas DataFrame index to a datetime from an integer. I want to do it so that I can call reindex and fill in the dates between those listed in the table. Note that I have to use pandas 0.7.3 at the moment because I'm also using qstk, and qstk relies on pandas 0.7.3
First, here's my layout:
(Pdb) df
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
12 0 0 0 4000 2011-12-20 16:00:00
(Pdb) type(df['date'])
<class 'pandas.core.series.Series'>
(Pdb) df2 = DataFrame(index=df['date'])
(Pdb) df2
Empty DataFrame
Columns: array([], dtype=object)
Index: array([2011-01-13 16:00:00, 2011-01-26 16:00:00, 2011-02-02 16:00:00,
2011-02-10 16:00:00, 2011-03-03 16:00:00, 2011-06-03 16:00:00,
2011-05-03 16:00:00, 2011-06-10 16:00:00, 2011-08-01 16:00:00,
2011-12-20 16:00:00], dtype=object)
(Pdb) df2.merge(df,left_index=True,right_on='date')
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
12 0 0 0 4000 2011-12-20 16:00:00
I have tried multiple things to get a datetime index:
1.) Using the reindex() method with a list of datetime values. This creates a datetime index, but then fills in NaNs for the data in the DataFrame. I'm guessing that this is because the original values are tied to the integer index and reindexing to datetime tries to fill the new indices with default values (NaNs if no fill method is indicated). Thus:
(Pdb) df.reindex(index=df['date'])
AAPL GOOG IBM XOM date
date
2011-01-13 16:00:00 NaN NaN NaN NaN NaN
2011-01-26 16:00:00 NaN NaN NaN NaN NaN
2011-02-02 16:00:00 NaN NaN NaN NaN NaN
2011-02-10 16:00:00 NaN NaN NaN NaN NaN
2011-03-03 16:00:00 NaN NaN NaN NaN NaN
2011-06-03 16:00:00 NaN NaN NaN NaN NaN
2011-05-03 16:00:00 NaN NaN NaN NaN NaN
2011-06-10 16:00:00 NaN NaN NaN NaN NaN
2011-08-01 16:00:00 NaN NaN NaN NaN NaN
2011-12-20 16:00:00 NaN NaN NaN NaN NaN
2.) Using DataFrame.merge with my original df and a second dataframe, df2, that is basically just a datetime index with nothing else. So I end up doing something like:
(pdb) df2.merge(df,left_index=True,right_on='date')
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
(and vice-versa). But I always end up with this kind of thing, with integer indices.
3.) Starting with an empty DataFrame with a datetime index (created from the 'date' field of df) and a bunch of empty columns, then attempting to assign each column by setting the columns with the same names equal to the columns from df:
(Pdb) df2['GOOG']=0
(Pdb) df2
GOOG
date
2011-01-13 16:00:00 0
2011-01-26 16:00:00 0
2011-02-02 16:00:00 0
2011-02-10 16:00:00 0
2011-03-03 16:00:00 0
2011-06-03 16:00:00 0
2011-05-03 16:00:00 0
2011-06-10 16:00:00 0
2011-08-01 16:00:00 0
2011-12-20 16:00:00 0
(Pdb) df2['GOOG'] = df['GOOG']
(Pdb) df2
GOOG
date
2011-01-13 16:00:00 NaN
2011-01-26 16:00:00 NaN
2011-02-02 16:00:00 NaN
2011-02-10 16:00:00 NaN
2011-03-03 16:00:00 NaN
2011-06-03 16:00:00 NaN
2011-05-03 16:00:00 NaN
2011-06-10 16:00:00 NaN
2011-08-01 16:00:00 NaN
2011-12-20 16:00:00 NaN
So, how in pandas 0.7.3 do I get df to be re-created with a datetime index instead of the integer index? What am I missing?
I think you are looking for set_index:
In [11]: df.set_index('date')
Out[11]:
AAPL GOOG IBM XOM
date
2011-01-13 16:00:00 0 0 4000 0
2011-01-26 16:00:00 0 1000 4000 0
2011-02-02 16:00:00 0 1000 4000 0
2011-02-10 16:00:00 0 1000 4000 4000
2011-03-03 16:00:00 0 0 1800 4000
2011-06-03 16:00:00 0 0 3300 4000
2011-05-03 16:00:00 0 0 0 4000
2011-06-10 16:00:00 1200 0 0 4000
2011-08-01 16:00:00 1200 0 0 4000
2011-12-20 16:00:00 0 0 0 4000
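As a footnote on why attempt 3 produced NaNs: assigning df['GOOG'] to df2 aligns on the index, and an integer index shares no labels with a datetime index, so every value comes back missing. With the index set properly, the original goal (filling in the dates between those listed) becomes a reindex. A sketch in modern pandas syntax; 0.7.3 spells some of this differently (e.g. it had pandas.DateRange rather than pd.date_range), so treat it as the idea rather than version-exact code:
import pandas as pd

df = df.set_index('date')
# Daily index spanning the first to the last recorded date
full_index = pd.date_range(df.index.min(), df.index.max(), freq='D')
# Forward-fill so each position persists until the next recorded change
df_daily = df.reindex(full_index).ffill()
print(df_daily.head())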