Convert datetime64 to integer hours using Python - python

I have the following dataframe
Timestamp KE EH LA PR
0 2013-02-27 00:00:00 1.000000 2.0 0.03 201.289993
1 2013-02-27 00:15:00 0.990000 2.0 0.03 210.070007
2 2013-02-27 00:30:00 0.950000 2.0 0.02 207.779999
3 2013-02-27 00:45:00 0.990000 2.0 0.03 151.960007
4 2013-02-27 01:00:00 341.209991 2.0 0.04 0.000000
... ... ... ... ... ...
3449 2013-04-03 22:15:00 NaN 2.0 0.03 0.000000
3450 2013-04-03 22:30:00 NaN NaN 0.07 0.000000
I would like to turn the timestamp column into a numeric value while keeping only the 'time' part, as illustrated below:
Timestamp KE EH LA PR
0 0 1.000000 2.0 0.03 201.289993
1 15 0.990000 2.0 0.03 210.070007
2 30 0.950000 2.0 0.02 207.779999
3 45 0.990000 2.0 0.03 151.960007
4 100 341.209991 2.0 0.04 0.000000
... ... ... ... ... ...
3449 2215 NaN 2.0 0.03 0.000000
3450 2230 NaN NaN 0.07 0.000000
I tried the code below, based on the post "Pandas: convert dtype 'object' to int":
h3_hours["Timestamp"] = h3_hours["Timestamp"].astype(str).astype(int)
But this gives me the error below
ValueError: invalid literal for int() with base 10: '2013-02-27 00:00:00'
Then I found this approach online:
# Turn datetime into hours and minutes
h3_hours["Timestamp"] = h3_hours["Timestamp"].dt.strftime("%H:%M")
h3_hours["Timestamp"] = h3_hours["Timestamp"].astype(int)
But this gives me the same error as above.
Any idea on how I can turn my Timestamp column into integer hours?

You are close; you just need to remove the : so the format is %H%M, which gives HHMM integers:
h3_hours["Timestamp"] = h3_hours["Timestamp"].dt.strftime("%H%M").astype(int)

Related

Interpolating a data set in pandas while ignoring missing data

I have a question about how to get interpolated data across several different "blocks" of time. In a nutshell, I have a dataset like this:
>>> import pandas as pd
>>> test_data = pd.read_csv("test_data.csv")
>>> test_data
ID Condition_num Condition_type Rating Timestamp_ms
0 101 1 Active 58.0 30
1 101 1 Active 59.0 60
2 101 1 Active 65.0 90
3 101 1 Active 70.0 120
4 101 1 Active 80.0 150
5 101 2 Break NaN 180
6 101 3 Active_2 55.0 210
7 101 3 Active_2 60.0 240
8 101 3 Active_2 63.0 270
9 101 3 Active_2 70.0 300
10 101 4 Break NaN 330
11 101 5 Active_3 69.0 360
12 101 5 Active_3 71.0 390
13 101 5 Active_3 50.0 420
14 101 5 Active_3 41.0 450
15 101 5 Active_3 43.0 480
I need to "resample" the final column to a another time interval (e.g. 40 ms) to match it to an external data set. I have been using the following code:
#Setting the column with timestamps as a datetime with the correct units, then set index
test_data['Timestamp_ms'] = pd.to_datetime(test_data['Timestamp_ms'], unit='ms')
test_data = test_data.set_index('Timestamp_ms')
# Resample index to start at 0, resample to the highest resolution 1ms, then resample to 40ms
test_data = test_data.reindex(
    pd.date_range(start=pd.to_datetime(0, unit='ms'), end=test_data.index.max(), freq='ms')
)
test_data = test_data.resample('1ms').interpolate().resample('40ms').interpolate()
# Round the values to integers
test_data = test_data.round()
Which gives me this:
ID Condition_num Condition_type Rating
1970-01-01 00:00:00.000 NaN NaN NaN NaN
1970-01-01 00:00:00.040 101.0 1.000000 NaN 58.333333
1970-01-01 00:00:00.080 101.0 1.000000 NaN 63.000000
1970-01-01 00:00:00.120 101.0 1.000000 Active 70.000000
1970-01-01 00:00:00.160 101.0 1.333333 NaN 75.833333
1970-01-01 00:00:00.200 101.0 2.666667 NaN 59.166667
1970-01-01 00:00:00.240 101.0 3.000000 Active_2 60.000000
1970-01-01 00:00:00.280 101.0 3.000000 NaN 65.333333
1970-01-01 00:00:00.320 101.0 3.666667 NaN 69.666667
1970-01-01 00:00:00.360 101.0 5.000000 Active_3 69.000000
1970-01-01 00:00:00.400 101.0 5.000000 NaN 64.000000
1970-01-01 00:00:00.440 101.0 5.000000 NaN 44.000000
1970-01-01 00:00:00.480 101.0 5.000000 Active_3 43.000000
The only issue is that I cannot figure out which ratings happen during the "Active" conditions, and whether the ratings I am seeing are caused by interpolation across the "breaks", where there are no ratings. In other words, I want interpolation within the "Active" blocks while still keeping everything aligned to the beginning of the whole data set.
I have tried entering zero ratings for NaN and interpolating from the top of each condition, but that seems only to make the problem worse by altering the ratings more.
Any advice would be greatly appreciated!
I think you need to do all of your logic inside of a groupby, IIUC:
mask = df.Condition_type.ne('Break')
df2 = (df[mask]
       .groupby('Condition_type')              # Group by Condition_type, excluding "Break" rows.
       .apply(lambda x: x.resample('1ms')      # For each group... resample it.
                         .interpolate()        # Interpolate
                         .ffill()              # Fill values; this just applies to Condition_type.
                         .resample('40ms')     # Resample to 40ms
                         .asfreq())            # No need to interpolate in this direction.
       .reset_index('Condition_type', drop=True))  # We no longer need this extra index.
# Force the index to our resampled interval; this will reveal the breaks:
df2 = df2.asfreq('40ms')
print(df2)
Output:
ID Condition_num Condition_type Rating
Timestamp_ms
1970-01-01 00:00:00.000 NaN NaN NaN NaN
1970-01-01 00:00:00.040 101.0 1.0 Active 58.333333
1970-01-01 00:00:00.080 101.0 1.0 Active 63.000000
1970-01-01 00:00:00.120 101.0 1.0 Active 70.000000
1970-01-01 00:00:00.160 NaN NaN NaN NaN
1970-01-01 00:00:00.200 NaN NaN NaN NaN
1970-01-01 00:00:00.240 101.0 3.0 Active_2 60.000000
1970-01-01 00:00:00.280 101.0 3.0 Active_2 65.333333
1970-01-01 00:00:00.320 NaN NaN NaN NaN
1970-01-01 00:00:00.360 101.0 5.0 Active_3 69.000000
1970-01-01 00:00:00.400 101.0 5.0 Active_3 64.000000
1970-01-01 00:00:00.440 101.0 5.0 Active_3 44.000000
1970-01-01 00:00:00.480 101.0 5.0 Active_3 43.000000
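For completeness, a small setup sketch for the answer above, since it assumes the frame already has Timestamp_ms as a DatetimeIndex (the same preprocessing the question performs):
import pandas as pd

# Assumes test_data.csv has the columns shown in the question, with Timestamp_ms in milliseconds
df = pd.read_csv("test_data.csv")
df['Timestamp_ms'] = pd.to_datetime(df['Timestamp_ms'], unit='ms')
df = df.set_index('Timestamp_ms').sort_index()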

Irregular export data to csv file with NaN values

I have this data on a csv file:
Date/Time kWh kVArh kVA PF
0 2021-01-01 00:30:00 471.84 0.00 943.6800 1.0000
1 2021-01-01 01:00:00 491.04 1.44 982.0842 1.0000
2 2021-01-01 01:30:00 475.20 0.00 950.4000 1.0000
3 2021-01-01 02:00:00 470.88 0.00 941.7600 1.0000
4 2021-01-01 02:30:00 466.56 0.00 933.1200 1.0000
... ... ... ... ... ...
9223 2021-07-14 04:00:00 1104.00 53.28 2210.5698 0.9988
9224 2021-07-14 04:30:00 1156.30 49.92 2314.7542 0.9991
9225 2021-07-14 05:00:00 1176.00 37.92 2353.2224 0.9995
9226 2021-07-14 05:30:00 1177.00 27.36 2354.6359 0.9997
9227 2021-07-14 06:00:00 1196.60 22.56 2393.6253 0.9998
And I use this code to read it and then export it to a csv file, after calculating the average for every hour:
import pandas as pd
file = pd.read_csv('Electricity_data.csv',
                   sep=',',
                   skiprows=0,
                   dayfirst=True,
                   parse_dates=['Date/Time'])
pd_mean = file.groupby(pd.Grouper(key='Date/Time', freq='H')).mean().reset_index()
pd_mean.to_csv("data_1h_year_.csv")
However, when I run it, my final file has a gap.
Data before running the code (date: 03/01/2021):
Date/Time kWh kVArh kVA PF
90 2021-02-01 21:30:00 496.83 0.00 993.6600 1.0
91 2021-02-01 22:00:00 486.72 0.00 973.4400 1.0
92 2021-02-01 22:30:00 490.08 0.00 980.1600 1.0
93 2021-02-01 23:00:00 503.00 1.92 1006.0073 1.0
94 2021-02-01 23:30:00 484.84 0.00 969.6800 1.0
95 2021-03-01 00:00:00 484.80 0.00 969.6000 1.0
96 2021-03-01 00:30:00 487.68 0.00 975.3600 1.0
97 2021-03-01 01:00:00 508.30 1.44 1016.6041 1.0
98 2021-03-01 01:30:00 488.66 0.00 977.3200 1.0
99 2021-03-01 02:00:00 486.24 0.00 972.4800 1.0
100 2021-03-01 02:30:00 495.36 1.44 990.7242 1.0
101 2021-03-01 03:00:00 484.32 0.00 968.6400 1.0
102 2021-03-01 03:30:00 485.76 0.00 971.5200 1.0
103 2021-03-01 04:00:00 492.48 1.44 984.9642 1.0
104 2021-03-01 04:30:00 476.16 0.00 952.3200 1.0
105 2021-03-01 05:00:00 477.12 0.00 954.2400 1.0
Data after running the code (date: 03/01/2021):
Date/Time kWh kVArh kVA PF
45 2021-01-02 21:00:00 1658.650 292.32 3368.45000 0.98485
46 2021-01-02 22:00:00 1622.150 291.60 3296.34415 0.98420
47 2021-01-02 23:00:00 1619.300 261.36 3280.52380 0.98720
48 2021-01-03 00:00:00 NaN NaN NaN NaN
49 2021-01-03 01:00:00 NaN NaN NaN NaN
50 2021-01-03 02:00:00 NaN NaN NaN NaN
51 2021-01-03 03:00:00 NaN NaN NaN NaN
52 2021-01-03 04:00:00 NaN NaN NaN NaN
53 2021-01-03 05:00:00 NaN NaN NaN NaN
54 2021-01-03 06:00:00 1202.400 158.40 2425.57730 0.99140
55 2021-01-03 07:00:00 1209.375 168.00 2441.98105 0.99050
56 2021-01-03 08:00:00 1260.950 162.72 2542.89820 0.99175
57 2021-01-03 09:00:00 1308.975 195.60 2647.07935 0.98900
58 2021-01-03 10:00:00 1334.150 193.20 2696.17005 0.98965
I do not know why this is happening, but it did not calculate the mean values there, and I end up with NaN gaps in the final csv file.
Pandas does not interpret your dates correctly. Specify the format yourself.
Use the code below to solve your problem:
parser = lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M')
df = pd.read_csv('data.csv', sep=',', skiprows=0,
                 parse_dates=['Date/Time'], date_parser=parser)
pd_mean = df.groupby(pd.Grouper(key='Date/Time', freq='H')).mean()
Check your dates before the operation:
93 2021-02-01 23:00:00 # February, 1st
94 2021-02-01 23:30:00 # February, 1st
95 2021-03-01 00:00:00 # March, 1st
96 2021-03-01 00:30:00 # March, 1st
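On pandas 2.x, where the date_parser argument is deprecated, a roughly equivalent sketch is to read first and parse afterwards with an explicit format (the '%m/%d/%Y %H:%M' string is the same assumption the answer makes; adjust it to whatever the raw file actually contains):
import pandas as pd

file = pd.read_csv('Electricity_data.csv', sep=',', skiprows=0)
# An explicit format means pandas never has to guess whether 03/01 is March 1st or January 3rd
file['Date/Time'] = pd.to_datetime(file['Date/Time'], format='%m/%d/%Y %H:%M')
pd_mean = file.groupby(pd.Grouper(key='Date/Time', freq='H')).mean().reset_index()
pd_mean.to_csv("data_1h_year_.csv")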

Reindexing timeseries data

I have an issue similar to "ValueError: cannot reindex from a duplicate axis", but the solution isn't provided there.
I have an excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although that is not shown in the sample below. I want to reindex the time column at 5 minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel('E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at 5 min frequency so that I can interpolate the NaN values later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns again later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this for example:
import pandas as pd
ts = pd.read_excel('E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is a datetime type, then try
ts.asfreq('5T')
or use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
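A minimal end-to-end sketch of that approach, with a toy frame standing in for the Excel data (a single Temp column is assumed; the real code would build the index from read_excel as above):
import pandas as pd

ts = pd.DataFrame(
    {"Temp": [30.6, None, 30.7, 30.7, 30.9]},
    index=pd.to_datetime(["2018-04-01 00:05", "2018-04-01 00:10", "2018-04-01 00:20",
                          "2018-04-01 00:30", "2018-04-01 00:40"]),
)
ts = ts.asfreq("5min")                 # inserts the missing 00:15, 00:25, 00:35 slots as NaN
ts["Temp"] = ts["Temp"].interpolate()  # then fills both the new slots and the original NaN
print(ts)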
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. For this example, three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing entirely.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see above the trick in all this. Your date data was converted to datetime but your time data is just a string. Below a proper index is created by use of a lambda function.
from datetime import datetime
# pd.datetime was removed in recent pandas versions, so use datetime.combine directly
rawidx = rawpd.apply(lambda r: datetime.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be applied to the rawpd dataframe as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a frame ready for interpolation.
I have got it to work. Thank you everyone for your time. I am providing the working code below.
import pandas as pd
df = pd.read_excel('E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)

How come pd.concat(axis=1) is row-binding instead of column binding?

I am using pandas and I have two dataframes which have the same number of rows:
All I want to do is to merge them together by ticker, which I tried with
a_no_nan2_last.join(ref, on = "ticker")
which gave me this error
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
So I checked the types of the ticker column in both, and I verified that ticker is the only column that overlaps:
set(a_no_nan2_last.columns).intersection(set(res.columns))
which gives
{'ticker'}
and
type(a_no_nan2_last["ticker"].iloc[0])
<class 'str'>
type(res["ticker"].iloc[0])
<class 'str'>
the data looks like
a_no_nan2_last
ticker Date Close Close_lead1
2141 1AD 2018-08-14 0.2845 0.160
5354 1AG 2018-08-14 0.0450 0.036
75158 1AL 2018-08-14 0.9287 0.800
188 1ST 2018-08-14 0.0370 0.079
81195 3DP 2018-08-14 0.0320 0.041
... ... ... ...
111688 ZMI 2018-08-14 0.1200 0.095
111310 ZNC 2018-08-14 0.1650 0.078
111877 ZNO 2018-08-13 0.1150 0.079
111373 ZTA 2018-08-10 0.0700 0.070
111940 ZYB 2018-08-09 0.0070 0.014
[1792 rows x 4 columns]
and
res
variable ticker ... Close__variance_larger_than_standard_deviation
0 1AD ... 0.0
1 1AG ... 0.0
2 1AL ... 0.0
3 1ST ... 0.0
4 3DP ... 0.0
... ... ...
1787 ZMI ... 0.0
1788 ZNC ... 0.0
1789 ZNO ... 0.0
1790 ZTA ... 0.0
1791 ZYB ... 0.0
[1792 rows x 795 columns]
So I tried concat, which appends by row rather than by column, giving me the wrong result:
pd.concat([a_no_nan2_last, res], axis=1)
ticker ... Close__variance_larger_than_standard_deviation
0 NaN ... 0.0
1 NaN ... 0.0
2 NaN ... 0.0
3 NaN ... 0.0
4 NaN ... 0.0
... ... ...
111688 ZMI ... NaN
111751 Z1P ... NaN
111814 ZIP ... NaN
111877 ZNO ... NaN
111940 ZYB ... NaN
[3556 rows x 799 columns]
I expected my output to be 1792 rows long.
Dropping the common column ticker also doesn't help, see
res3 = res.drop("ticker", axis=1)
pd.concat([a_no_nan2_last, res3], axis=1)
ticker ... Close__variance_larger_than_standard_deviation
0 NaN ... 0.0
1 NaN ... 0.0
2 NaN ... 0.0
3 NaN ... 0.0
4 NaN ... 0.0
... ... ...
111688 ZMI ... NaN
111751 Z1P ... NaN
111814 ZIP ... NaN
111877 ZNO ... NaN
111940 ZYB ... NaN
[3556 rows x 798 columns]
You need to reset_index(drop=True) on both frames, e.g.
res1 = pd.concat([a_no_nan2_last.reset_index(drop=True), res.reset_index(drop=True)], axis=1)
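The underlying reason is that pd.concat(..., axis=1) aligns on index labels rather than row position; a toy sketch (hypothetical frames, not the question's data) showing the union-of-indexes behaviour and the reset_index fix:
import pandas as pd

left = pd.DataFrame({"ticker": ["1AD", "1AG"]}, index=[2141, 5354])  # label index left over from earlier filtering
right = pd.DataFrame({"score": [0.1, 0.2]}, index=[0, 1])            # default RangeIndex

# The index labels do not overlap, so axis=1 concat takes their union: 4 rows padded with NaN
print(pd.concat([left, right], axis=1))

# Resetting both indexes makes the alignment positional again: 2 rows, as expected
print(pd.concat([left.reset_index(drop=True), right.reset_index(drop=True)], axis=1))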

How to split a column into multiple columns in pandas?

I have this data in a pandas dataframe,
name date close quantity daily_cumm_returns
0 AARTIIND 2000-01-03 3.84 21885.82 0.000000
1 AARTIIND 2000-01-04 3.60 56645.64 -0.062500
2 AARTIIND 2000-01-05 3.52 24460.62 -0.083333
3 AARTIIND 2000-01-06 3.58 42484.24 -0.067708
4 AARTIIND 2000-01-07 3.42 16736.21 -0.109375
5 AARTIIND 2000-01-10 3.42 20598.42 -0.109375
6 AARTIIND 2000-01-11 3.41 20598.42 -0.111979
7 AARTIIND 2000-01-12 3.27 100417.29 -0.148438
8 AARTIIND 2000-01-13 3.43 20598.42 -0.106771
9 AARTIIND 2000-01-14 3.60 5149.61 -0.062500
10 AARTIIND 2000-01-17 3.46 14161.42 -0.098958
11 AARTIIND 2000-01-18 3.50 136464.53 -0.088542
12 AARTIIND 2000-01-19 3.52 21885.82 -0.083333
13 AARTIIND 2000-01-20 3.73 75956.66 -0.028646
14 AARTIIND 2000-01-21 3.84 77244.07 0.000000
15 AARTIIND 2000-02-01 4.21 90118.08 0.000000
16 AARTIIND 2000-02-02 4.52 238169.21 0.073634
17 AARTIIND 2000-02-03 4.38 163499.94 0.040380
18 AARTIIND 2000-02-04 4.44 108141.71 0.054632
19 AARTIIND 2000-02-07 4.26 68232.27 0.011876
20 AARTIIND 2000-02-08 4.00 108141.71 -0.049881
21 AARTIIND 2000-02-09 3.96 32185.04 -0.059382
22 AARTIIND 2000-02-10 4.13 43771.63 -0.019002
23 AARTIIND 2000-02-11 3.96 3862.20 -0.059382
24 AARTIIND 2000-02-14 3.94 12874.01 -0.064133
25 AARTIIND 2000-02-15 3.90 33472.42 -0.073634
26 AARTIIND 2000-02-16 3.90 25748.02 -0.073634
27 AARTIIND 2000-02-17 3.90 60507.86 -0.073634
28 AARTIIND 2000-02-18 4.22 45059.04 0.002375
29 AARTIIND 2000-02-21 4.42 81106.27 0.049881
I wish to select every month's data and transpose it into a new row.
For example, the first 15 rows should become one row with name AARTIIND, date 2000-01-03, and then 15 columns holding the daily cumulative returns.
name date first second third fourth fifth .... fifteenth
0 AARTIIND 2000-01-03 0.00 -0.062 -0.083 -0.067 -0.109 .... 0.00
To group the data month wise I am using,
group = df.groupby([pd.Grouper(freq='1M', key='date'), 'name'])
Setting the rows individually using the code below is very slow, and my dataset has 1 million rows:
data = pd.DataFrame(columns = ('name', 'date', 'daily_zscore_1', 'daily_zscore_2', 'daily_zscore_3', 'daily_zscore_4', 'daily_zscore_5', 'daily_zscore_6', 'daily_zscore_7', 'daily_zscore_8', 'daily_zscore_9', 'daily_zscore_10', 'daily_zscore_11', 'daily_zscore_12', 'daily_zscore_13', 'daily_zscore_14', 'daily_zscore_15'))
data.loc[0] = [x['name'].iloc[0], x['date'].iloc[0]].extend(x['daily_cumm_returns'])
Is there any faster way to accomplish this? As I see it, this is just transposing one column and hence should be very fast. I tried pivot and melt but don't understand how to use them in this situation.
This is a bit sloppy but it gets the job done.
# grab AAPL data
from pandas_datareader import data
df = data.DataReader('AAPL', 'google', start='2014-01-01')[['Close', 'Volume']]
# add name column
df['name'] = 'AAPL'
# get daily return relative to first of month
df['daily_cumm_return'] = df.resample('M')['Close'].transform(lambda x: (x - x[0]) / x[0])
# get the first of the month for each date
df['first_month_date'] = df.assign(index_col=df.index).resample('M')['index_col'].transform('first')
# get a ranking of the days 1 to n
df['day_rank']= df.resample('M')['first_month_date'].rank(method='first')
# pivot to get final
df_final = df.pivot_table(index=['name', 'first_month_date'], columns='day_rank', values='daily_cumm_return')
Sample Output
day_rank 1.0 2.0 3.0 4.0 5.0 6.0 \
name first_month_date
AAPL 2014-01-02 0.0 -0.022020 -0.016705 -0.023665 -0.017464 -0.029992
2014-02-03 0.0 0.014375 0.022052 0.021912 0.036148 0.054710
2014-03-03 0.0 0.006632 0.008754 0.005704 0.005173 0.006102
2014-04-01 0.0 0.001680 -0.005299 -0.018222 -0.033600 -0.033600
2014-05-01 0.0 0.001775 0.015976 0.004970 0.001420 -0.005917
2014-06-02 0.0 0.014141 0.025721 0.029729 0.026834 0.043314
2014-07-01 0.0 -0.000428 0.005453 0.026198 0.019568 0.019996
day_rank 7.0 8.0 9.0 10.0 11.0 \
name first_month_date
AAPL 2014-01-02 -0.036573 -0.031511 -0.012149 0.007593 0.002025
2014-02-03 0.068667 0.068528 0.085555 0.084578 0.088625
2014-03-03 0.015785 0.016846 0.005571 -0.005704 -0.001857
2014-04-01 -0.020936 -0.033600 -0.040708 -0.036831 -0.043810
2014-05-01 -0.010059 0.002249 0.003787 0.004024 -0.004497
2014-06-02 0.049438 0.045095 0.027614 0.016368 0.026612
2014-07-01 0.016253 0.018178 0.031330 0.019247 0.013473
day_rank 12.0 13.0 14.0 15.0 16.0 \
name first_month_date
AAPL 2014-01-02 -0.022526 -0.007340 -0.002911 0.005442 -0.012782
2014-02-03 0.071458 0.059037 0.047313 0.051779 0.040893
2014-03-03 0.006897 0.006632 0.001857 0.009683 0.021754
2014-04-01 -0.041871 -0.030887 -0.019385 -0.018351 -0.031274
2014-05-01 0.010178 0.022130 0.022367 0.025089 0.026627
2014-06-02 0.025276 0.026389 0.022826 0.012248 0.011357
2014-07-01 -0.004598 0.009731 0.004491 0.012831 0.039243
day_rank 17.0 18.0 19.0 20.0 21.0 \
name first_month_date
AAPL 2014-01-02 -0.004809 -0.084282 -0.094660 -0.096431 -0.095039
2014-02-03 0.031542 0.052059 0.049267 NaN NaN
2014-03-03 0.032763 0.022815 0.018437 0.017244 0.017111
2014-04-01 0.048204 0.055958 0.096795 0.093564 0.089429
2014-05-01 0.038225 0.057751 0.054911 0.074201 0.070178
2014-06-02 0.005233 0.006124 0.012137 0.024162 0.034740
2014-07-01 0.037532 0.044376 0.058811 0.051967 0.049508
day_rank 22.0 23.0
name first_month_date
AAPL 2014-01-02 NaN NaN
2014-02-03 NaN NaN
2014-03-03 NaN NaN
2014-04-01 NaN NaN
2014-05-01 NaN NaN
2014-06-02 NaN NaN
2014-07-01 0.022241 NaN
Admittedly this does not get exactly what you want...
I think one way to handle this problem would be to create new month and day columns based on the datetime (date) column, set a MultiIndex on month and name, and then pivot the table.
df['month'] = df.date.dt.month
df['day'] = df.date.dt.day
df.set_index(['month', 'name'], inplace=True)
df[['day', 'daily_cumm_returns']].pivot(index=df.index, columns='day')
Result is:
daily_cumm_returns \
day 1 2 3 4 5
month name
1 AARTIIND NaN NaN 0.00000 -0.062500 -0.083333
2 AARTIIND 0.0 0.073634 0.04038 0.054632 NaN
I can't figure out a way to keep the first date of each month group as a column, otherwise I think this is more or less what you're after.
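One more sketch that does keep the first date of each month as a column, using groupby().cumcount() to number the days within each (name, month) group before pivoting (the toy frame below stands in for the question's data; column names are assumed to match):
import pandas as pd

df = pd.DataFrame({
    "name": ["AARTIIND"] * 6,
    "date": pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05",
                            "2000-02-01", "2000-02-02", "2000-02-03"]),
    "daily_cumm_returns": [0.0, -0.0625, -0.0833, 0.0, 0.0736, 0.0404],
})

month = df["date"].dt.to_period("M")
df["day_rank"] = df.groupby(["name", month]).cumcount() + 1                  # 1..n within each month
df["month_start"] = df.groupby(["name", month])["date"].transform("first")   # first date of the month kept

out = df.pivot_table(index=["name", "month_start"],
                     columns="day_rank",
                     values="daily_cumm_returns").reset_index()
print(out)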
