I have several dataframes which look like the following:
In [2]: skew
Out[2]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 96 entries, 2006-01-31 00:00:00 to 2013-12-31 00:00:00
Freq: BM
Data columns (total 3 columns):
AAPL 96 non-null values
GOOG 96 non-null values
MSFT 96 non-null values
dtypes: float64(3)
In [3]: skew.head()
Out[3]:
AAPL GOOG MSFT
2006-01-31 0.531769 -0.567731 2.132850
2006-02-28 -0.389711 0.028723 0.724277
2006-03-31 1.184884 1.009587 -0.959136
2006-04-28 1.664745 0.852869 -4.020731
2006-05-31 -0.419757 -0.288422 0.240444
In [5]: skew.index
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-01-31 00:00:00, ..., 2013-12-31 00:00:00]
Length: 96, Freq: BM, Timezone: None
I want to generate a single column from them with a unique index, so that I can merge it with the columns from the other dataframes at a later point. It would look somewhat like this, but with a unique index:
frame
Out[6]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 288 entries, 2006-01-31 00:00:00 to 2013-12-31 00:00:00
Data columns (total 3 columns):
Returns 285 non-null values
Skew 288 non-null values
WinLose 288 non-null values
dtypes: bool(1), float64(2)
In [7]: frame.head()
Out[7]:
Returns Skew WinLose
2006-01-31 NaN 0.531769 True
2006-02-28 -0.092968 -0.389711 False
2006-03-31 -0.084246 1.184884 True
2006-04-28 0.122290 1.664745 False
2006-05-31 -0.150874 -0.419757 False
i.e., something like:
In [7]: frame.head()
Out[7]:
Returns Skew WinLose
2006-01-31-AAPL NaN 0.531769 True
2006-02-28-MSFT -0.092968 -0.389711 False
2006-03-31-AAPL -0.084246 1.184884 True
2006-04-28-GOOG 0.122290 1.664745 False
2006-05-31-AAPL -0.150874 -0.419757 False
The code is:
import pandas as pd
import pandas.io.data as web

# Class parameters
names = ['AAPL','GOOG','MSFT']

# Functions
def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Close']

def getWinnerLoser(stock, medRet, retsM):
    return retsM[stock].shift(-1) >= medRet.shift(-1)

def getSkew(stock, rets, period):
    return pd.rolling_skew(rets[stock], period).asfreq('BM').fillna(method='pad')

px = pd.DataFrame(data={n: get_px(n, '1/1/2006', '1/1/2014') for n in names})
px = px.asfreq('B').fillna(method='pad')
rets = px.pct_change()

# Monthly returns and median return
retsM = px.asfreq('BM').fillna(method='pad').pct_change()
medRet = retsM.median(axis=1)

# Dataframes
winLose = pd.DataFrame(data={n: getWinnerLoser(n, medRet, retsM) for n in names})
skew = pd.DataFrame(data={n: getSkew(n, rets, 20) for n in names})

# Concatenating
retsMCon = pd.concat(retsM[n] for n in names)
winLoseCon = pd.concat(winLose[n] for n in names)
skewCon = pd.concat(skew[n] for n in names)

frame = pd.DataFrame({'Returns': retsMCon, 'Skew': skewCon, 'WinLose': winLoseCon})
I have yet to find a good solution to this.
Related
I have this dataframe:
df.head()
Open High Low Close Volume day_month
2006-04-13 10:00:00 1921.75 1922.00 1918.00 1918.25 11782 2006-04-13
2006-04-13 10:30:00 1918.25 1931.75 1918.00 1931.00 39744 2006-04-13
2006-04-13 11:00:00 1931.25 1934.00 1929.00 1930.25 34385 2006-04-13
2006-04-13 11:30:00 1930.50 1932.00 1928.50 1931.25 13539 2006-04-13
2006-04-13 12:00:00 1931.25 1932.25 1928.25 1928.75 10045 2006-04-13
df.tail()
Open High Low Close Volume day_month
2021-06-18 14:30:00 14077.50 14085.25 14033.00 14039.00 19573 2021-06-18
2021-06-18 15:00:00 14039.00 14085.50 14023.50 14077.00 27464 2021-06-18
2021-06-18 15:30:00 14077.00 14092.75 14028.75 14041.75 39410 2021-06-18
2021-06-18 16:00:00 14041.75 14049.00 14019.50 14042.75 17071 2021-06-18
2021-06-18 16:30:00 14040.00 14042.25 14015.00 14017.75 3167 2021-06-18
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 57233 entries, 2006-04-13 10:00:00 to 2021-06-18 16:30:00
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Open 57233 non-null float64
1 High 57233 non-null float64
2 Low 57233 non-null float64
3 Close 57233 non-null float64
4 Volume 57233 non-null int32
5 day_month 57233 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int32(1)
I am using the market_profile package to create a function, and I want to store the output in different variables.
from market_profile import MarketProfile
I created this function to store the values:
def mp_va(df):
    mp = MarketProfile(df, tick_size = 0.25)
    mp_slice = mp[df.index.min():df.index.max()]
    return mp_slice.value_area[0], mp_slice.value_area[1], mp_slice.poc_price
I want to store the three outputs in a data frame for all days in my dataset.
Using the code below, I applied the function I just created to all days in my data frame:
df_mp = df.groupby(['day_month']).apply(mp_va)
This was the output:
df_mp
day_month
2006-04-13 (1927.5, 1931.25, 1930.25)
2006-04-17 (1898.5, 1922.5, 1898.5)
2006-04-18 (1923.75, 1938.25, 1935.25)
2006-04-19 (1935.75, 1941.25, 1936.5)
2006-04-20 (1939.25, 1941.75, 1939.25)
...
2021-06-14 (13998.75, 14055.5, 14021.0)
2021-06-15 (14030.25, 14097.25, 14097.25)
2021-06-16 (13916.5, 14016.5, 13922.75)
2021-06-17 (14024.75, 14160.0, 14024.75)
2021-06-18 (14052.0, 14106.5, 14096.5)
Length: 3913, dtype: object
This was one of the suggestions:
def mp_va(df):
    global df_2
    mp = MarketProfile(df, tick_size = 0.25)
    mp_slice = mp[df.index.min():df.index.max()]
    data = {mp_slice.value_area[0], mp_slice.value_area[1], mp_slice.poc_price}
    df_2 = pd.DataFrame(data)
    return df_2

df_mp = df.groupby(['day_month']).apply(mp_va)
df_mp
The output for this code was:
0
day_month
2006-04-13 0 1930.25
1 1931.25
2 1927.50
2006-04-17 0 1898.50
1 1922.50
... ...
2021-06-17 0 14024.75
1 14160.00
2021-06-18 0 14096.50
1 14106.50
2 14052.00
[10202 rows x 1 columns]
ValueError: not enough values to unpack (expected 3, got 1)
Detailed traceback:
File "<string>", line 1, in <module>
This was my first attempt to create the data frame with all three variables, by unpacking:
va_high, va_low, op_price = df_mp
This is the output from that code, which gives me an error:
ValueError: too many values to unpack (expected 3)
I also tried:
def mp_va(df):
    mp = MarketProfile(df, tick_size = 0.25)
    mp_slice = mp[df.index.min():df.index.max()]
    data = {mp_slice.value_area[0], mp_slice.value_area[1], mp_slice.poc_price}
But this code gives me an error:
ValueError: not enough values to unpack (expected 2, got 0)
I also tried this:
def mp_va(df):
    global df_2
    mp = MarketProfile(df, tick_size = 0.25)
    mp_slice = mp[df.index.min():df.index.max()]
    data = {mp_slice.value_area[0], mp_slice.value_area[1], mp_slice.poc_price}
    df_2 = pd.DataFrame(data)

df_mp = df.groupby(['day_month']).apply(mp_va)
va_high, va_low, op_price = df_mp
But this gives me an error message:
ValueError: not enough values to unpack (expected 3, got 0)
My question is: Is there a way to store the values of the three outputs in a data frame from the above function?
The expected output will be:
va_low va_high op_price
2006-04-13 1927.5 1931.25 1930.25
2006-04-17 1898.5 1922.5 1934.50
I'm not able to test this at the moment, but based on the traceback, your function isn't returning any values, which is why it states (expected 3, got 0). If you add return df_2 to the end of the mp_va() function, it should fix your issue:
def mp_va(df):
    global df_2
    mp = MarketProfile(df, tick_size = 0.25)
    mp_slice = mp[df.index.min():df.index.max()]
    data = {mp_slice.value_area[0], mp_slice.value_area[1], mp_slice.poc_price}
    df_2 = pd.DataFrame(data)
    return df_2
UPDATE:
The code below now works:
def mp_va(df):
    global df_2
    mp = MarketProfile(df, tick_size = 0.25)
    mp_slice = mp[df.index.min():df.index.max()]
    data = {'va Low': [mp_slice.value_area[0]],
            'va High': [mp_slice.value_area[1]],
            'op Price': [mp_slice.poc_price]}
    df_2 = pd.DataFrame(data)
    return df_2

df_map = df.groupby(['day_month']).apply(mp_va)
Having issues with merging two dataframes (xrate and df) based on currency_str and created_date_time.
display(xrate.info())
Int64Index: 1611 entries, 6 to 112
Data columns (total 3 columns):
Date 1611 non-null datetime64[ns]
PX_LAST 1611 non-null object
Currency 1611 non-null object
display(xrate.head(3))
Date PX_LAST Currency
2018-05-30 1 CAD
2018-05-29 1 CAD
2018-05-28 1 CAD
I created a new date to merge on:
#df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d%m%Y')
df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d-%m-%Y')
#convert to date
#df['formatted_created_date_time'] = pd.to_datetime(df['formatted_created_date_time'], format='%d%m%Y')
df['formatted_created_date_time'] = pd.to_datetime(df['formatted_created_date_time'], format='%d-%m-%Y')
display(df.info())
RangeIndex: 3488 entries, 0 to 3487
Data columns (total 43 columns):
created_date_time 3488 non-null datetime64[ns]
rfq_create_date_time 3488 non-null datetime64[ns]
currency_str 3488 non-null object
display(df.head(3))
Now the two dataframes are merged:
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'], right_on=['Currency', 'Date'], how='left')
display(result.info())
RangeIndex: 3488 entries, 0 to 3487
Data columns (total 43 columns):
created_date_time 3488 non-null datetime64[ns]
rfq_create_date_time 3488 non-null datetime64[ns]
.
.
formatted_created_date_time 3488 non-null datetime64[ns]
The match has failed:
display(result.head(3))
Any ideas on this one?
It should work nicely.
But another solution is to merge by strings:
df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d-%m-%Y')
xrate['Date'] = xrate['Date'].dt.strftime('%d-%m-%Y')
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')
Your solution can be simplified by using floor, or by converting to date:
df['formatted_created_date_time'] = df['created_date_time'].dt.floor('d')
xrate['Date'] = xrate['Date'].dt.floor('d')
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')
Or, converting to python date objects:
df['formatted_created_date_time'] = df['created_date_time'].dt.date
xrate['Date'] = xrate['Date'].dt.date
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')
I would like to convert the date observations from a column into the index for my dataframe. I am able to do this with the code below:
Sample data:
test = pd.DataFrame({'Values':[1,2,3], 'Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
Indexing code:
test['Date Index'] = pd.to_datetime(test['Date'])
test = test.set_index('Date Index')
test['Index'] = test.index.date
However, when I try to include this code in a function, I am able to create the 'Date Index' column, but set_index does not seem to work as expected.
def date_index(df):
    df['Date Index'] = pd.to_datetime(df['Date'])
    df = df.set_index('Date Index')
    df['Index'] = df.index.date
If I inspect the output without using a function, info() returns:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
If I inspect the output of the function, info() returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
Date 3 non-null object
Values 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 120.0+ bytes
I would like the DatetimeIndex.
How can set_index be used within a function? Am I using it incorrectly?
IIUC, return df is missing: set_index returns a new DataFrame by default, and reassigning df inside the function only rebinds the local name, so the caller never sees the indexed frame.
df1 = pd.DataFrame({'Values':[1,2,3], 'Exam Completed Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})

def date_index(df):
    df['Exam Completed Date Index'] = pd.to_datetime(df['Exam Completed Date'])
    df = df.set_index('Exam Completed Date Index')
    df['Index'] = df.index.date
    return df

print (date_index(df1))
print (date_index(df1))
Exam Completed Date Values Index
Exam Completed Date Index
2016-01-01 17:49:00 1/1/2016 17:49 1 2016-01-01
2016-01-02 07:10:00 1/2/2016 7:10 2 2016-01-02
2016-01-03 15:19:00 1/3/2016 15:19 3 2016-01-03
print (date_index(df1).info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Exam Completed Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
None
I am trying to calculate Volume Weighted Average Price on a rolling basis.
To do this, I have a function vwap that does this for me, like so:
def vwap(bars):
    return ((bars.Close*bars.Volume).sum()/bars.Volume.sum()).round(2)
When I try to use this function with rolling_apply, as shown, I get an error:
import pandas.io.data as web
bars = web.DataReader('AAPL','yahoo')
print pandas.rolling_apply(bars,30,vwap)
AttributeError: 'numpy.ndarray' object has no attribute 'Close'
The error makes sense to me, because rolling_apply expects a Series or an ndarray as input, not a DataFrame, which is what I am passing.
Is there a way to use rolling_apply to a DataFrame to solve my problem?
This is not directly enabled, but you can do it like this:
In [29]: bars
Out[29]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 942 entries, 2010-01-04 00:00:00 to 2013-09-30 00:00:00
Data columns (total 6 columns):
Open 942 non-null values
High 942 non-null values
Low 942 non-null values
Close 942 non-null values
Volume 942 non-null values
Adj Close 942 non-null values
dtypes: float64(5), int64(1)
window = 30

In [30]: concat([ (Series(vwap(bars.iloc[i:i+window]),
                          index=[bars.index[i+window]])) for i in xrange(len(bars)-window) ])
Out[30]:
2010-02-17 203.21
2010-02-18 202.95
2010-02-19 202.64
2010-02-22 202.41
2010-02-23 202.19
2010-02-24 201.85
2010-02-25 201.65
2010-02-26 201.50
2010-03-01 201.31
2010-03-02 201.35
2010-03-03 201.42
2010-03-04 201.09
2010-03-05 200.95
2010-03-08 201.50
2010-03-09 202.02
...
2013-09-10 485.94
2013-09-11 487.38
2013-09-12 486.77
2013-09-13 487.23
2013-09-16 487.20
2013-09-17 486.09
2013-09-18 485.52
2013-09-19 485.30
2013-09-20 485.37
2013-09-23 484.87
2013-09-24 485.81
2013-09-25 486.41
2013-09-26 486.07
2013-09-27 485.30
2013-09-30 484.74
Length: 912
A cleaned-up version for reference; hopefully I got the indexing correct:
import numpy as np
import pandas as pd

def myrolling_apply(df, N, f, nn=1):
    ii = [int(x) for x in np.arange(0, df.shape[0] - N + 1, nn)]
    out = [f(df.iloc[i:(i + N)]) for i in ii]
    out = pd.Series(out)
    out.index = df.index[N-1::nn]
    return out
Modified @mathtick's answer to include na_fill. Also note that your function f needs to return a single value; it can't return a dataframe with multiple columns.
def rolling_apply_df(dfg, N, f, nn=1, na_fill=True):
    ii = [int(x) for x in np.arange(0, dfg.shape[0] - N + 1, nn)]
    out = [f(dfg.iloc[i:(i + N)]) for i in ii]
    if na_fill:
        out = pd.Series(np.concatenate([np.repeat(np.nan, N-1), np.array(out)]))
        out.index = dfg.index[::nn]
    else:
        out = pd.Series(out)
        out.index = dfg.index[N-1::nn]
    return out
I'm trying to retrieve stored data from a HDFStore using Pandas, using select and terms. A simple select(), without terms, returns all data. However, when I try to filter data based on a DateTimeIndex, everything but the last row is returned.
I suspect there is something fishy regarding how timestamps are stored internally and the precision of them, but I fail to see why it is not working or what I can do about it. Any pointers would be helpful, as I'm quite new at this.
I've created a small "unit test" to investigate ...
import os
import tempfile
import uuid
import pandas as pd
import numpy as np
import time
import unittest
import sys

class PandasTestCase(unittest.TestCase):
    def setUp(self):
        print "Pandas version: {0}".format(pd.version.version)
        print "Python version: {0}".format(sys.version)
        self._filename = os.path.join(tempfile.gettempdir(), '{0}.{1}'.format(str(uuid.uuid4()), 'h5'))
        self._store = pd.HDFStore(self._filename)

    def tearDown(self):
        self._store.close()
        if os.path.isfile(self._filename):
            os.remove(self._filename)

    def test_filtering(self):
        t_start = time.time() * 1e+9
        t_end = t_start + 1e+9  # 1 second later, i.e. 10^9 ns
        sample_count = 1000
        timestamps = np.linspace(t_start, t_end, num=sample_count).tolist()
        data = {'channel_a': range(sample_count)}
        time_index = pd.to_datetime(timestamps, utc=True, unit='ns')
        df = pd.DataFrame(data, index=time_index, dtype=long)

        key = 'test'
        self._store.append(key, df)

        retrieved_df = self._store.select(key)
        retrieved_timestamps = np.array(retrieved_df.index.values, dtype=np.uint64).tolist()
        print "Retrieved {0} timestamps, w/o filter.".format(len(retrieved_timestamps))
        self.assertItemsEqual(retrieved_timestamps, timestamps)

        stored_time_index = self._store[key].index
        # Create a filter based on first and last values of index, i.e. from <= index <= to.
        from_filter = pd.Term('index>={0}'.format(pd.to_datetime(stored_time_index[0], utc=True, unit='ns')))
        to_filter = pd.Term('index<={0}'.format(pd.to_datetime(stored_time_index[-1], utc=True, unit='ns')))

        retrieved_df_interval = self._store.select(key, [from_filter, to_filter])
        retrieved_timestamps_interval = np.array(retrieved_df_interval.index.values, dtype=np.uint64).tolist()
        print "Retrieved {0} timestamps, using filter".format(len(retrieved_timestamps_interval))
        self.assertItemsEqual(retrieved_timestamps_interval, timestamps)

if __name__ == '__main__':
    unittest.main()
... which outputs the following:
Pandas version: 0.12.0
Python version: 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3]
Retrieved 1000 timestamps, w/o filter.
Retrieved 999 timestamps, using filter
F
======================================================================
FAIL: test_filtering (__main__.PandasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "pandastest.py", line 53, in test_filtering
self.assertItemsEqual(retrieved_timestamps_interval, timestamps)
AssertionError: Element counts were not equal:
First has 1, Second has 0: 1.377701660170978e+18
----------------------------------------------------------------------
Ran 1 test in 0.039s
FAILED (failures=1)
Process finished with exit code 1
Update: After modifying the creation of the terms to use the alternate constructor, everything works just fine, like so:
# Create a filter based on first and last values of index, i.e. from <= index <= to.
#from_filter = pd.Term('index>={0}'.format(pd.to_datetime(stored_time_index[0], utc=True, unit='ns')))
from_filter = pd.Term('index','>=', stored_time_index[0])
#to_filter = pd.Term('index<={0}'.format(pd.to_datetime(stored_time_index[-1], utc=True, unit='ns')))
to_filter = pd.Term('index','<=', stored_time_index[-1])
String formatting of a Timestamp defaults to 6 decimal places (which is what your formatting on the Term is doing), but ns timestamps need 9 places, so the formatted bound silently truncates the final nanoseconds. Use the alternative form of the Term constructor:
Term("index","<=",stamp)
Here's an example
In [2]: start = Timestamp('20130101 9:00:00')
In [3]: start.value
Out[3]: 1357030800000000000
In [5]: index = pd.to_datetime([ start.value + i for i in list(range(1000)) ])

In [8]: index
Out[8]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-01 09:00:00.000000999]
Length: 1000, Freq: None, Timezone: None
In [9]: df = DataFrame(randn(1000,2),index=index)
In [10]: df.to_hdf('test.h5','df',mode='w',fmt='t')
In [12]: pd.read_hdf('test.h5','df')
Out[12]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2013-01-01 09:00:00 to 2013-01-01 09:00:00
Data columns (total 2 columns):
0 1000 non-null values
1 1000 non-null values
dtypes: float64(2)
In [15]: pd.read_hdf('test.h5','df',where=[pd.Term('index','<=',index[-1])])
Out[15]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2013-01-01 09:00:00 to 2013-01-01 09:00:00
Data columns (total 2 columns):
0 1000 non-null values
1 1000 non-null values
dtypes: float64(2)
In [16]: pd.read_hdf('test.h5','df',where=[pd.Term('index','<=',index[-1].value-1)])
Out[16]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 999 entries, 2013-01-01 09:00:00 to 2013-01-01 09:00:00
Data columns (total 2 columns):
0 999 non-null values
1 999 non-null values
dtypes: float64(2)
Note that in 0.13 (this example uses master), this will be even easier: you can directly include it like 'index<=index[-1]' (the index on the rhs of the expression is actually the local variable index).