Adding a row into DataFrame with multiindex - python

I would like to create a new DataFrame and add a bunch of stock data for each date.
I declare a DataFrame with a multi-index: date and stock ticker.
Adding data for 2020-06-07
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
Adding data for 2020-06-08
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
2020-06-08 AAPL 32.50 34.20 31.1 32.30
2020-06-08 MSFT 58.50 59.20 52.1 53.30
What would be the best and most efficient solution?
Here's my current version that doesn't work as I expect.
df = pd.DataFrame()
for date in dates:
    universe500 = get_universe(date)  # returns stocks on a specific date
    for security in universe500:
        prices = data.get_prices(security, ['open','high','low','close'], 1, '1d')  # returns pd.DataFrame
        df.iloc[(date, security),:] = prices

If prices is a DataFrame formatted in the same manner as the original df, you can use concat:
In[0]:
#constructing a fake entry
arrays = [['2020-06-09'], ['ABCD']]
multi = pd.MultiIndex.from_arrays(arrays, names=('date', 'stock'))
to_add = pd.DataFrame({'open': 1, 'high': 2, 'low': 3, 'close': 4}, index=multi)
print(to_add)
Out[0]:
open high low close
date stock
2020-06-09 ABCD 1 2 3 4
In[1]:
#now adding to your data
df = pd.concat([df, to_add])
print(df)
Out[1]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
If the data (prices) were just an array of four numbers (the open, high, low, and close values), then loc would work where you used iloc:
In[2]:
df.loc[('2020-06-10','WXYZ'),:] = [10,20,30,40]
Out[2]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
2020-06-10 WXYZ 10.0 20.0 30.0 40.0
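A note on efficiency: growing df inside a loop (whether with concat or loc) rebuilds the frame on every iteration. A minimal sketch of the question's loop rewritten to collect pieces in a list and concatenate once at the end (get_universe and data.get_prices are the question's own helpers; this assumes get_prices returns a single-row DataFrame per security, as described there):
frames = []
for date in dates:
    universe500 = get_universe(date)  # stocks on a specific date
    for security in universe500:
        prices = data.get_prices(security, ['open', 'high', 'low', 'close'], 1, '1d')
        prices.index = pd.MultiIndex.from_tuples([(date, security)], names=('date', 'stock'))
        frames.append(prices)
df = pd.concat(frames)  # one concat at the end instead of growing df each iteration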

Related

NaN when converting df to a series

I have a dataframe with OHLC data. I need to get the close price into a pandas Series, using the timestamp column as the index.
I am reading from a sqlite db into my df:
conn = sql.connect('allStockData.db')
price = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
price['timestamp'] = pd.to_datetime(price['timestamp'])
print(price)
Which returns:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-16 08:00:00+00:00 3.19 3.570 3.19 3.350 66475 458 3.404240 AAOI NaN
1 2022-09-16 08:05:00+00:00 3.35 3.440 3.33 3.430 28925 298 3.381131 AAOI NaN
2 2022-09-16 08:10:00+00:00 3.44 3.520 3.35 3.400 62901 643 3.445096 AAOI NaN
3 2022-09-16 08:15:00+00:00 3.37 3.390 3.31 3.360 17943 184 3.339721 AAOI NaN
4 2022-09-16 08:20:00+00:00 3.36 3.410 3.34 3.400 29123 204 3.383370 AAOI NaN
... ... ... ... ... ... ... ... ... ... ...
8759 2022-09-08 23:35:00+00:00 1.35 1.360 1.35 1.355 3835 10 1.350613 RUBY 515994.5
8760 2022-09-08 23:40:00+00:00 1.36 1.360 1.35 1.350 2780 7 1.353687 RUBY 515994.5
8761 2022-09-08 23:45:00+00:00 1.35 1.355 1.35 1.355 7080 11 1.350424 RUBY 515994.5
8762 2022-09-08 23:50:00+00:00 1.35 1.360 1.33 1.360 11664 30 1.351104 RUBY 515994.5
8763 2022-09-08 23:55:00+00:00 1.36 1.360 1.33 1.340 21394 32 1.348223 RUBY 515994.5
[8764 rows x 10 columns]
When I try to get the close into a series with the timestamp:
price = pd.Series(price['close'], index=price['timestamp'])
It returns a bunch of NaNs:
2022-09-16 08:00:00+00:00 NaN
2022-09-16 08:05:00+00:00 NaN
2022-09-16 08:10:00+00:00 NaN
2022-09-16 08:15:00+00:00 NaN
2022-09-16 08:20:00+00:00 NaN
..
2022-09-08 23:35:00+00:00 NaN
2022-09-08 23:40:00+00:00 NaN
2022-09-08 23:45:00+00:00 NaN
2022-09-08 23:50:00+00:00 NaN
2022-09-08 23:55:00+00:00 NaN
Name: close, Length: 8764, dtype: float64
If I remove the index:
price = pd.Series(price['close'])
The close is returned normally:
0 3.350
1 3.430
2 3.400
3 3.360
4 3.400
...
8759 1.355
8760 1.350
8761 1.355
8762 1.360
8763 1.340
Name: close, Length: 8764, dtype: float64
How can I return the close column as a pandas series, using my timestamp column as the index?
It's because price['close'] has its own index, which is incompatible with the timestamp values. Try using .values instead:
price = pd.Series(price['close'].values, index=price['timestamp'])
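To see why the NaNs appear, here is a toy illustration (made-up data): passing an existing Series to pd.Series together with a new index aligns (reindexes) on the old labels instead of relabelling, so labels that aren't found become NaN; passing the raw .values relabels instead:
import pandas as pd

s = pd.Series([1.0, 2.0], index=[0, 1])
print(pd.Series(s, index=['a', 'b']))         # a NaN, b NaN  (aligned on the old labels)
print(pd.Series(s.values, index=['a', 'b']))  # a 1.0, b 2.0  (raw values get the new labels)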
I needed to set the timestamp as the index before getting the close as a series:
conn = sql.connect('allStockData.db')
price = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
price['timestamp'] = pd.to_datetime(price['timestamp'])
price = price.set_index('timestamp')
print(price)
price = pd.Series(price['close'])
print(price)
Gives:
2022-09-16 08:00:00+00:00 3.350
2022-09-16 08:05:00+00:00 3.430
2022-09-16 08:10:00+00:00 3.400
2022-09-16 08:15:00+00:00 3.360
2022-09-16 08:20:00+00:00 3.400
...
2022-09-08 23:35:00+00:00 1.355
2022-09-08 23:40:00+00:00 1.350
2022-09-08 23:45:00+00:00 1.355
2022-09-08 23:50:00+00:00 1.360
2022-09-08 23:55:00+00:00 1.340
Name: close, Length: 8764, dtype: float64
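For completeness, the two steps can be collapsed into a single selection right after reading from the database (assuming the sql alias in the snippet above refers to the sqlite3 module):
import sqlite3 as sql  # assumption: this is what 'sql' refers to in the question
import pandas as pd

conn = sql.connect('allStockData.db')
price = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
price['timestamp'] = pd.to_datetime(price['timestamp'])
close = price.set_index('timestamp')['close']  # Series indexed by timestamp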

Conditional count per day from pandas dataframe

I have a dataset with a reading (Tank Level) every minute from a piece of equipment and want to create a new dataset (dataframe) with a count of the number of samples per day and the number of readings above a set value.
Noxious Tank Level.MIN Noxious Tank Level.MAX Date_Time
0 9.32 9.33 2019-12-31 05:01:00
1 9.32 9.34 2019-12-31 05:02:00
2 9.32 9.35 2019-12-31 05:03:00
3 9.31 9.35 2019-12-31 05:04:00
4 9.31 9.35 2019-12-31 05:05:00
... ... ... ...
528175 2.98 3.01 2020-12-31 23:56:00
528176 2.98 3.02 2020-12-31 23:57:00
528177 2.98 3.01 2020-12-31 23:58:00
528178 2.98 3.02 2020-12-31 23:59:00
528179 2.98 2.99 2021-01-01 00:00:00
Using a lambda function I can see whether each value is an overflow (Tank Level > setpoint); I have also indexed the dataframe by Date_Time:
df['Overflow'] = df.apply(lambda x: True if x['Noxious Tank Level.MIN'] > 89 else False , axis=1)
Noxious Tank Level.MIN Noxious Tank Level.MAX Overflow
Date_Time
2019-12-31 05:01:00 9.32 9.33 False
2019-12-31 05:02:00 9.32 9.34 False
2019-12-31 05:03:00 9.32 9.35 False
2019-12-31 05:04:00 9.31 9.35 False
2019-12-31 05:05:00 9.31 9.35 False
... ... ... ...
2020-12-31 23:56:00 2.98 3.01 False
2020-12-31 23:57:00 2.98 3.02 False
2020-12-31 23:58:00 2.98 3.01 False
2020-12-31 23:59:00 2.98 3.02 False
2021-01-01 00:00:00 2.98 2.99 False
Now I want to count the number of samples per day and the number of True values in the Overflow column, to work out what fraction of each day is in overflow.
I get the feeling that resample or groupby will be the way to go, but I can't figure out how to create a new dataset with just these counts, including the conditional count from the Overflow column.
First use:
df['Overflow'] = df['Noxious Tank Level.MIN'] > 89
Then, to count the True values use sum, and to count all values use size, per day/date:
df1 = df.resample('d')['Overflow'].agg(['sum','size'])
Or:
df1 = df.groupby(pd.Grouper(freq='D'))['Overflow'].agg(['sum','size'])
Or:
df2 = df.groupby(df.index.date)['Overflow'].agg(['sum','size'])
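Since the goal is the fraction of each day spent in overflow, a small follow-up sketch building on the first option above (assuming df is indexed by Date_Time, as in the question):
df1 = df.resample('D')['Overflow'].agg(['sum', 'size'])
df1['fraction'] = df1['sum'] / df1['size']  # share of readings per day above the setpoint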

Compute returns from data frame

How do I compute the returns for the following dataframe? Let the name of the dataframe be refined_df.
0 1
Date
2020-02-03 TSLA MSFT
2020-02-19 AMZN ADBE
2020-03-05 OYST GPRO
2020-03-20 AMZN OYST
2020-04-06 SGEN AEYE
2020-04-22 AEYE TSLA
2020-05-07 AAPL SGEN
and we also have another dataframe, storage_openprices
AAL AAPL ADBE AEYE AMZN GOOG GPRO MSFT OYST PACB RADI SGEN TSLA
Date
2020-01-14 27.910000 79.175003 347.010010 5.300000 1885.880005 1439.010010 4.230 163.389999 28.010000 4.850000 NaN 104.849998 108.851997
2020-01-15 27.450001 77.962502 346.420013 5.020000 1872.250000 1430.209961 4.160 162.619995 26.510000 4.800000 NaN 108.550003 105.952003
2020-01-16 27.790001 78.397499 345.980011 5.060000 1882.989990 1447.439941 4.280 164.350006 25.530001 4.930000 NaN 107.330002 98.750000
2020-01-17 28.299999 79.067497 349.000000 4.840000 1885.890015 1462.910034 4.360 167.419998 24.740000 5.030000 NaN 108.410004 101.522003
2020-01-21 27.969999 79.297501 346.369995 4.880000 1865.000000 1479.119995 4.280 166.679993 26.190001 4.950000 NaN 108.379997 106.050003
What I want is to return a new dataframe with the log returns of particular stock for the specific duration.
For example, the (0,0) entry of the returned dataframe is the log return for holding TSLA from 2020-02-03 until 2020-02-19; the ticker comes from refined_df and its open prices from storage_openprices.
Similarly, for the (1,0) entry we return the log return for holding AMZN from 2020-02-19 until 2020-03-05.
I'm unsure if I should be using apply with a lambda function. My issue is accessing the next row when computing the log returns.
EDIT:
The output, a dataframe should look like
0 1
Date
2020-02-03 0.14 0.21
2020-02-19 0.18 0.19
2020-03-05 XXXX XXXX
2020-03-20 XXXX XXXX
2020-04-06 XXXX XXXX
2020-04-22 XXXX XXXX
2020-05-07 XXXX XXXX
where 0.14 (a made-up number) is the log return of TSLA from 2020-02-03 to 2020-02-19, i.e. log(TSLA open price on 2020-02-19) - log(TSLA open price on 2020-02-03)
Thanks!
You can use merge_asof with the direction='forward' parameter, after reshaping both DataFrames with DataFrame.stack:
print (refined_df)
0 1
Date
2020-02-03 TSLA MSFT
2020-02-19 AMZN ADBE
2020-03-05 OYST GPRO
2020-03-20 AMZN OYST
2020-04-06 SGEN AEYE
2020-04-22 AEYE TSLA
2020-05-07 AAPL SGEN
#changed the datetimes so the prices match the example dates
print (storage_openprices)
AAL AAPL ADBE AEYE AMZN GOOG \
Date
2020-02-14 27.910000 79.175003 347.010010 5.30 1885.880005 1439.010010
2020-02-15 27.450001 77.962502 346.420013 5.02 1872.250000 1430.209961
2020-02-16 27.790001 78.397499 345.980011 5.06 1882.989990 1447.439941
2020-02-17 28.299999 79.067497 349.000000 4.84 1885.890015 1462.910034
2020-02-21 27.969999 79.297501 346.369995 4.88 1865.000000 1479.119995
GPRO MSFT OYST PACB RADI SGEN TSLA
Date
2020-02-14 4.23 163.389999 28.010000 4.85 NaN 104.849998 108.851997
2020-02-15 4.16 162.619995 26.510000 4.80 NaN 108.550003 105.952003
2020-02-16 4.28 164.350006 25.530001 4.93 NaN 107.330002 98.750000
2020-02-17 4.36 167.419998 24.740000 5.03 NaN 108.410004 101.522003
2020-02-21 4.28 166.679993 26.190001 4.95 NaN 108.379997 106.050003
df1 = storage_openprices.stack().rename_axis(['Date','type']).reset_index(name='new')
df2 = refined_df.stack().rename_axis(['Date','cols']).reset_index(name='type')
new = (pd.merge_asof(df2, df1, on='Date', by='type', direction='forward')
         .pivot(index='Date', columns='cols', values='new'))
print (new)
print (new)
cols 0 1
Date
2020-02-03 108.851997 163.389999
2020-02-19 1865.000000 346.369995
2020-03-05 NaN NaN
2020-03-20 NaN NaN
2020-04-06 NaN NaN
2020-04-22 NaN NaN
2020-05-07 NaN NaN
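Note that the table above contains the forward-matched open prices, not the returns themselves. A minimal follow-up sketch for the log returns (an illustrative addition, not part of the original answer: it assumes every ticker in refined_df has a column in storage_openprices, uses bfill to play the role of direction='forward' above, and leaves the last rebalance date as NaN because it has no exit date):
import numpy as np
import pandas as pd

aligned = storage_openprices.reindex(refined_df.index, method='bfill')  # price at/after each rebalance date
log_ret = pd.DataFrame(index=refined_df.index, columns=refined_df.columns, dtype=float)
dates = refined_df.index
for i in range(len(dates) - 1):
    start, end = dates[i], dates[i + 1]
    for col in refined_df.columns:
        ticker = refined_df.loc[start, col]
        log_ret.loc[start, col] = np.log(aligned.loc[end, ticker]) - np.log(aligned.loc[start, ticker])
# the last rebalance date stays NaN: there is no later date to sell on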

How to assign values to a dataframe's column by comparing values in another dataframe

I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the columns TIMESTAMP and DATE are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to do is add a column to df holding each row's rank, looked up from the corresponding day's rank in ranked based on the TIMESTAMP (so within a day all the 5-minute timesteps share the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values
df = df.set_index(df.TIMESTAMP.dt.date)\
       .assign(RANK=ranked.set_index('DATE').RANK)\
       .set_index(df.index)
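If the dates in ranked are normalized (midnight) datetimes, another sketch that avoids iterrows is to build a date-to-rank mapping and map it onto the normalized timestamps (an alternative added for illustration, not part of the answer above):
rank_by_day = ranked.set_index('DATE')['RANK']
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(rank_by_day)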

How to select rows from different points in a pandas DataFrame

How do I select/slice from multiple points, which in this case means starting from the max() for each of the columns? Each stock has its own max value, so the selection should begin from that particular point.
df
>>> TSLA MSFT
2017-05-15 00:00:00+00:00 314 68
2017-05-16 00:00:00+00:00 319 69
2017-05-17 00:00:00+00:00 320 61
2017-05-18 00:00:00+00:00 313 66
2017-05-19 00:00:00+00:00 316 70
2017-05-22 00:00:00+00:00 314 65
2017-05-23 00:00:00+00:00 310 63
max_idx = df.idxmax() # returns index of max value
>>> TSLA 2017-05-17 00:00:00+00:00
>>> MSFT 2017-05-19 00:00:00+00:00
max_value = df.max() # returns max value
>>> TSLA = 320
>>> MSFT = 70
Is there any way, something like using:
df2 = df.loc[max_idx:]
I want the output such that I can later find the max_value and max_idx again on this new output, starting from:
TSLA 2017-05-17 00:00:00+00:00
MSFT 2017-05-19 00:00:00+00:00
EDIT: I am expecting the following output:
df2
>>> TSLA MSFT
2017-05-17 00:00:00+00:00 320 2017-05-19 00:00:00+00:00 70
2017-05-18 00:00:00+00:00 313 2017-05-22 00:00:00+00:00 65
2017-05-19 00:00:00+00:00 316 2017-05-23 00:00:00+00:00 63
2017-05-22 00:00:00+00:00 314
2017-05-23 00:00:00+00:00 310
Similar to how @unutbu used MultiIndexing, the new dataframe can be multi-indexed if possible.
Just as an example I only posted 2 columns, but there will be hundreds of columns,
so please keep such big data in mind. Thanks!
You could use the apply method:
In [204]: df.apply(lambda s: s.loc[s.idxmax():])
Out[204]:
MSFT TSLA
2017-05-17 NaN 320
2017-05-18 NaN 313
2017-05-19 70.0 316
2017-05-22 65.0 314
2017-05-23 63.0 310
or, building on MaxU's answer,
In [205]: pd.concat({c:df.loc[max_idx[c]:, c] for c in df.columns}).unstack(level=0)
Out[205]:
MSFT TSLA
2017-05-17 NaN 320.0
2017-05-18 NaN 313.0
2017-05-19 70.0 316.0
2017-05-22 65.0 314.0
2017-05-23 63.0 310.0
Both of these solutions loop over the columns. (df.apply's loop is done under
the hood, but it amounts to a Python-speed loop performance-wise.) I know you
are looking for a vectorized solution but in this case I don't see a way to
avoid the loop.
If you want to avoid the NaNs, you could leave the answer unstacked:
In [208]: pd.concat({c:df.loc[max_idx[c]:, c] for c in df.columns})
Out[208]:
MSFT 2017-05-19 70
2017-05-22 65
2017-05-23 63
TSLA 2017-05-17 320
2017-05-18 313
2017-05-19 316
2017-05-22 314
2017-05-23 310
dtype: int64
or, if you're using df.apply, call stack to move the column labels into a level of the row index:
In [213]: df.apply(lambda s: s.loc[s.idxmax():]).T.stack()
Out[213]:
MSFT 2017-05-19 70.0
2017-05-22 65.0
2017-05-23 63.0
TSLA 2017-05-17 320.0
2017-05-18 313.0
2017-05-19 316.0
2017-05-22 314.0
2017-05-23 310.0
dtype: float64
So let's look at performance. With this setup (to test on a bigger DataFrame):
N, M = 1000, 2000
bigdf = pd.DataFrame(np.random.randint(100, size=(N, M)),
                     index=pd.date_range('2000-1-1', periods=N))
def using_apply(df):
    return df.apply(lambda s: s.loc[s.idxmax():])

def using_loop(df):
    max_idx = df.idxmax()
    return pd.concat({c: df.loc[max_idx[c]:, c] for c in df.columns}).unstack(level=0)
MaxU's using_loop is slightly faster than using_apply:
In [202]: %timeit using_apply(bigdf)
1 loop, best of 3: 1.45 s per loop
In [203]: %timeit using_loop(bigdf)
1 loop, best of 3: 1.22 s per loop
Note, however, that it is best to test benchmarks on your own machine as results
may vary.
We could do something like this:
In [120]: {c:df.loc[max_idx[c]:, c].max() for c in df.columns}
Out[120]: {'MSFT': 70, 'TSLA': 320}
If you want to slice based on the index of the max, you can use:
df[(df.index > max_idx.TSLA) & (df.index > max_idx.MSFT)]
which gives you the rows with a timestamp greater than both maxima (you could choose one or the other; I wasn't sure which you wanted).
