I have this data in a pandas DataFrame:
name date close quantity daily_cumm_returns
0 AARTIIND 2000-01-03 3.84 21885.82 0.000000
1 AARTIIND 2000-01-04 3.60 56645.64 -0.062500
2 AARTIIND 2000-01-05 3.52 24460.62 -0.083333
3 AARTIIND 2000-01-06 3.58 42484.24 -0.067708
4 AARTIIND 2000-01-07 3.42 16736.21 -0.109375
5 AARTIIND 2000-01-10 3.42 20598.42 -0.109375
6 AARTIIND 2000-01-11 3.41 20598.42 -0.111979
7 AARTIIND 2000-01-12 3.27 100417.29 -0.148438
8 AARTIIND 2000-01-13 3.43 20598.42 -0.106771
9 AARTIIND 2000-01-14 3.60 5149.61 -0.062500
10 AARTIIND 2000-01-17 3.46 14161.42 -0.098958
11 AARTIIND 2000-01-18 3.50 136464.53 -0.088542
12 AARTIIND 2000-01-19 3.52 21885.82 -0.083333
13 AARTIIND 2000-01-20 3.73 75956.66 -0.028646
14 AARTIIND 2000-01-21 3.84 77244.07 0.000000
15 AARTIIND 2000-02-01 4.21 90118.08 0.000000
16 AARTIIND 2000-02-02 4.52 238169.21 0.073634
17 AARTIIND 2000-02-03 4.38 163499.94 0.040380
18 AARTIIND 2000-02-04 4.44 108141.71 0.054632
19 AARTIIND 2000-02-07 4.26 68232.27 0.011876
20 AARTIIND 2000-02-08 4.00 108141.71 -0.049881
21 AARTIIND 2000-02-09 3.96 32185.04 -0.059382
22 AARTIIND 2000-02-10 4.13 43771.63 -0.019002
23 AARTIIND 2000-02-11 3.96 3862.20 -0.059382
24 AARTIIND 2000-02-14 3.94 12874.01 -0.064133
25 AARTIIND 2000-02-15 3.90 33472.42 -0.073634
26 AARTIIND 2000-02-16 3.90 25748.02 -0.073634
27 AARTIIND 2000-02-17 3.90 60507.86 -0.073634
28 AARTIIND 2000-02-18 4.22 45059.04 0.002375
29 AARTIIND 2000-02-21 4.42 81106.27 0.049881
I wish to select every month's data and transpose it into a new row.
For example, the first 15 rows should become one row with name AARTIIND, date 2000-01-03, and then 15 columns holding the daily cumulative returns:
name date first second third fourth fifth .... fifteenth
0 AARTIIND 2000-01-03 0.00 -0.062 -0.083 -0.067 -0.109 .... 0.00
To group the data month-wise I am using:
group = df.groupby([pd.Grouper(freq='1M', key='date'), 'name'])
Setting the rows individually with the code below is very slow, and my dataset has 1 million rows:
data = pd.DataFrame(columns = ('name', 'date', 'daily_zscore_1', 'daily_zscore_2', 'daily_zscore_3', 'daily_zscore_4', 'daily_zscore_5', 'daily_zscore_6', 'daily_zscore_7', 'daily_zscore_8', 'daily_zscore_9', 'daily_zscore_10', 'daily_zscore_11', 'daily_zscore_12', 'daily_zscore_13', 'daily_zscore_14', 'daily_zscore_15'))
# note: list.extend returns None, so concatenate instead of extending in place
data.loc[0] = [x['name'].iloc[0], x['date'].iloc[0]] + list(x['daily_cumm_returns'])
Is there a faster way to accomplish this? As I see it, this is just transposing one column and hence should be very fast. I tried pivot and melt but don't understand how to use them in this situation.
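For reference, one vectorized route (a sketch, untested against the full dataset; the day_num and month_start column names are mine) numbers the days within each (month, name) group once and pivots in a single pass:
import pandas as pd

# number the rows 1..n inside every (month, name) group
g = df.groupby([pd.Grouper(freq='1M', key='date'), 'name'])
df['day_num'] = g.cumcount() + 1
# first trading day of each month, kept so it survives the pivot
df['month_start'] = g['date'].transform('first')

wide = df.pivot_table(index=['name', 'month_start'],
                      columns='day_num',
                      values='daily_cumm_returns').reset_index()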
This is a bit sloppy but it gets the job done.
# grab AAPL data
from pandas_datareader import data
df = data.DataReader('AAPL', 'google', start='2014-01-01')[['Close', 'Volume']]
# add name column
df['name'] = 'AAPL'
# get daily return relative to first of month
df['daily_cumm_return'] = df.resample('M')['Close'].transform(lambda x: (x - x.iloc[0]) / x.iloc[0])
# get the first of the month for each date
df['first_month_date'] = df.assign(index_col=df.index).resample('M')['index_col'].transform('first')
# get a ranking of the days 1 to n
df['day_rank'] = df.resample('M')['first_month_date'].rank(method='first')
# pivot to get final
df_final = df.pivot_table(index=['name', 'first_month_date'], columns='day_rank', values='daily_cumm_return')
Sample Output
day_rank 1.0 2.0 3.0 4.0 5.0 6.0 \
name first_month_date
AAPL 2014-01-02 0.0 -0.022020 -0.016705 -0.023665 -0.017464 -0.029992
2014-02-03 0.0 0.014375 0.022052 0.021912 0.036148 0.054710
2014-03-03 0.0 0.006632 0.008754 0.005704 0.005173 0.006102
2014-04-01 0.0 0.001680 -0.005299 -0.018222 -0.033600 -0.033600
2014-05-01 0.0 0.001775 0.015976 0.004970 0.001420 -0.005917
2014-06-02 0.0 0.014141 0.025721 0.029729 0.026834 0.043314
2014-07-01 0.0 -0.000428 0.005453 0.026198 0.019568 0.019996
day_rank 7.0 8.0 9.0 10.0 11.0 \
name first_month_date
AAPL 2014-01-02 -0.036573 -0.031511 -0.012149 0.007593 0.002025
2014-02-03 0.068667 0.068528 0.085555 0.084578 0.088625
2014-03-03 0.015785 0.016846 0.005571 -0.005704 -0.001857
2014-04-01 -0.020936 -0.033600 -0.040708 -0.036831 -0.043810
2014-05-01 -0.010059 0.002249 0.003787 0.004024 -0.004497
2014-06-02 0.049438 0.045095 0.027614 0.016368 0.026612
2014-07-01 0.016253 0.018178 0.031330 0.019247 0.013473
day_rank 12.0 13.0 14.0 15.0 16.0 \
name first_month_date
AAPL 2014-01-02 -0.022526 -0.007340 -0.002911 0.005442 -0.012782
2014-02-03 0.071458 0.059037 0.047313 0.051779 0.040893
2014-03-03 0.006897 0.006632 0.001857 0.009683 0.021754
2014-04-01 -0.041871 -0.030887 -0.019385 -0.018351 -0.031274
2014-05-01 0.010178 0.022130 0.022367 0.025089 0.026627
2014-06-02 0.025276 0.026389 0.022826 0.012248 0.011357
2014-07-01 -0.004598 0.009731 0.004491 0.012831 0.039243
day_rank 17.0 18.0 19.0 20.0 21.0 \
name first_month_date
AAPL 2014-01-02 -0.004809 -0.084282 -0.094660 -0.096431 -0.095039
2014-02-03 0.031542 0.052059 0.049267 NaN NaN
2014-03-03 0.032763 0.022815 0.018437 0.017244 0.017111
2014-04-01 0.048204 0.055958 0.096795 0.093564 0.089429
2014-05-01 0.038225 0.057751 0.054911 0.074201 0.070178
2014-06-02 0.005233 0.006124 0.012137 0.024162 0.034740
2014-07-01 0.037532 0.044376 0.058811 0.051967 0.049508
day_rank 22.0 23.0
name first_month_date
AAPL 2014-01-02 NaN NaN
2014-02-03 NaN NaN
2014-03-03 NaN NaN
2014-04-01 NaN NaN
2014-05-01 NaN NaN
2014-06-02 NaN NaN
2014-07-01 0.022241 NaN
Admittedly this does not get exactly what you want...
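To get closer to the requested shape, one could trim and rename afterwards (a sketch, assuming only the first 15 trading days are wanted):
# keep the first 15 day_rank columns and rename them as in the question
df_final = df_final.loc[:, df_final.columns <= 15]
df_final.columns = ['daily_zscore_%d' % c for c in df_final.columns]
df_final = df_final.reset_index()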
I think one way to handle this problem would be to create new columns of month and day based on the datetime (date) column, then pivot the table with month and name as the index.
df['month'] = df.date.dt.month
df['day'] = df.date.dt.day
df.pivot_table(index=['month', 'name'], columns='day', values='daily_cumm_returns')
Result is:
day              1         2        3         4         5  ...
month name
1 AARTIIND NaN NaN 0.00000 -0.062500 -0.083333
2 AARTIIND 0.0 0.073634 0.04038 0.054632 NaN
I can't figure out a way to keep the first date of each month group as a column, otherwise I think this is more or less what you're after.
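One way to keep it (a sketch, building on the month and day columns created above; first_date is my own column name): compute each month's first date with a groupby transform and carry it along as an extra index level.
# earliest date within every (month, name) group
df['first_date'] = df.groupby(['month', 'name'])['date'].transform('min')
df.pivot_table(index=['month', 'name', 'first_date'],
               columns='day', values='daily_cumm_returns')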
Related
I want to update and overwrite the values of one dataframe with the values in another, based on the datetime index, for a repeated datetime index. This code illustrates my problem; I have given df1 crazy values for illustrative purposes:
#import packages
import pandas as pd
import numpy as np
#create dataframes and indices
df = pd.DataFrame(np.random.randint(0,30,size=(10, 3)), columns=(['MeanT', 'MaxT', 'MinT']))
df1 = pd.DataFrame(np.random.randint(900,1000,size=(10, 3)), columns=(['MeanT', 'MaxT', 'MinT']))
df['Location'] =[2,2,3,3,4,4,5,5,6,6]
df1['Location'] =[2,2,3,3,4,4,5,5,6,6]
df.index = ["2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00"]
df1.index = ["2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00"]
df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)
Take a look at both dataframes, which show dates on the 18th and 19th for df, and the 19th and 20th for df1.
print(df)
MeanT MaxT MinT Location
2020-05-18 12:00:00 28 0 9 2
2020-05-19 12:00:00 22 7 11 2
2020-05-18 12:00:00 2 7 7 3
2020-05-19 12:00:00 10 24 18 3
2020-05-18 12:00:00 10 12 25 4
2020-05-19 12:00:00 25 7 20 4
2020-05-18 12:00:00 1 8 11 5
2020-05-19 12:00:00 27 19 12 5
2020-05-18 12:00:00 25 10 26 6
2020-05-19 12:00:00 29 11 27 6
print(df1)
MeanT MaxT MinT Location
2020-05-19 12:00:00 912 991 915 2
2020-05-20 12:00:00 936 917 965 2
2020-05-19 12:00:00 918 977 901 3
2020-05-20 12:00:00 974 971 927 3
2020-05-19 12:00:00 979 929 953 4
2020-05-20 12:00:00 988 955 939 4
2020-05-19 12:00:00 969 983 940 5
2020-05-20 12:00:00 902 904 916 5
2020-05-19 12:00:00 983 942 965 6
2020-05-20 12:00:00 928 994 933 6
I want to create a new dataframe which updates df with the values from df1, so the new df has values for the 18th from df, and the 19th and 20th from df1.
I have tried using combine_first like so:
df = df.set_index(df.groupby(level=0).cumcount(), append=True)
df1 = df1.set_index(df1.groupby(level=0).cumcount(), append=True)
df3 = df.combine_first(df1).sort_index(level=[1,0]).reset_index(level=1, drop=True)
which updates the dataframe, but doesn't overwrite the data for the 19th with values in df1. It produces this output:
print(df3)
MeanT MaxT MinT Location
2020-05-18 12:00:00 28.0 0.0 9.0 2.0
2020-05-19 12:00:00 22.0 7.0 11.0 2.0
2020-05-20 12:00:00 936.0 917.0 965.0 2.0
2020-05-18 12:00:00 2.0 7.0 7.0 3.0
2020-05-19 12:00:00 10.0 24.0 18.0 3.0
2020-05-20 12:00:00 974.0 971.0 927.0 3.0
2020-05-18 12:00:00 10.0 12.0 25.0 4.0
2020-05-19 12:00:00 25.0 7.0 20.0 4.0
2020-05-20 12:00:00 988.0 955.0 939.0 4.0
2020-05-18 12:00:00 1.0 8.0 11.0 5.0
2020-05-19 12:00:00 27.0 19.0 12.0 5.0
2020-05-20 12:00:00 902.0 904.0 916.0 5.0
2020-05-18 12:00:00 25.0 10.0 26.0 6.0
2020-05-19 12:00:00 29.0 11.0 27.0 6.0
2020-05-20 12:00:00 928.0 994.0 933.0 6.0
So the values for the 18th and the 20th are correct, but the values for the 19th are still from df. I want the values from df to be overwritten with those in df1. Please help!
You just need to use combine_first backwards.
We can also use 'Location' as the index instead of groupby.cumcount:
df3 = (df1.set_index('Location', append=True)
.combine_first(df.set_index('Location', append=True))
.reset_index(level='Location')
.reindex(columns=df.columns)
.sort_values('Location'))
print(df3)
Location MeanT MaxT MinT
2020-05-18 12:00:00         2   28.0    0.0    9.0
2020-05-19 12:00:00         2  912.0  991.0  915.0
2020-05-20 12:00:00         2  936.0  917.0  965.0
2020-05-18 12:00:00         3    2.0    7.0    7.0
2020-05-19 12:00:00         3  918.0  977.0  901.0
2020-05-20 12:00:00         3  974.0  971.0  927.0
2020-05-18 12:00:00         4   10.0   12.0   25.0
2020-05-19 12:00:00         4  979.0  929.0  953.0
2020-05-20 12:00:00         4  988.0  955.0  939.0
2020-05-18 12:00:00         5    1.0    8.0   11.0
2020-05-19 12:00:00         5  969.0  983.0  940.0
2020-05-20 12:00:00         5  902.0  904.0  916.0
2020-05-18 12:00:00         6   25.0   10.0   26.0
2020-05-19 12:00:00         6  983.0  942.0  965.0
2020-05-20 12:00:00         6  928.0  994.0  933.0
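For intuition, a toy example (mine, not from the question): combine_first prefers the calling frame, so the frame it is called on wins wherever both frames have data, and the argument only fills the gaps.
import pandas as pd
import numpy as np

# b's values win wherever b has data; a only fills b's holes
a = pd.DataFrame({'x': [1.0, 2.0]}, index=[0, 1])
b = pd.DataFrame({'x': [np.nan, 100.0, 200.0]}, index=[0, 1, 2])
print(b.combine_first(a))
#        x
# 0    1.0
# 1  100.0
# 2  200.0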
I have a pandas DataFrame with data from 2018 to 2020. I want to structure it as follows.
Month | 2018 | 2019
Jan 115 73
Feb 112 63
....
up to December.
How can I solve this using pandas DataFrame syntax?
Date
2018-01-01 115.0
2018-02-01 112.0
2018-03-01 104.5
2018-04-01 91.1
2018-05-01 85.5
2018-06-01 76.5
2018-07-01 86.5
2018-08-01 77.9
2018-09-01 65.0
2018-10-01 71.0
2018-11-01 76.0
2018-12-01 72.5
2019-01-01 73.0
2019-02-01 63.0
2019-03-01 63.0
2019-04-01 61.0
2019-05-01 58.3
2019-06-01 59.0
2019-07-01 67.0
2019-08-01 64.0
2019-09-01 59.9
2019-10-01 70.4
2019-11-01 78.9
2019-12-01 75.0
2020-01-01 73.9
Name: Close, dtype: float64
This is more like a pivot, but with crosstab:
s = pd.crosstab(df.index.strftime('%b'), df.index.year, df.values, aggfunc='sum')
Out[87]:
col_0 2018 2019 2020
row_0
Apr 91.1 61.0 NaN
Aug 77.9 64.0 NaN
Dec 72.5 75.0 NaN
Feb 112.0 63.0 NaN
Jan 115.0 73.0 73.9
Jul 86.5 67.0 NaN
Jun 76.5 59.0 NaN
Mar 104.5 63.0 NaN
May 85.5 58.3 NaN
Nov 76.0 78.9 NaN
Oct 71.0 70.4 NaN
Sep 65.0 59.9 NaN
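Note the rows sort alphabetically. A quick fix (a sketch, assuming the default English month abbreviations):
import calendar

# reindex the rows from alphabetical into calendar order
s = s.reindex(list(calendar.month_abbr[1:]))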
You can use groupby and unstack:
(s.groupby([s.index.month, s.index.year]).first().unstack()
.rename_axis(columns='Year',index='Month')
)
Output:
Year 2018 2019 2020
Month
1 115.0 73.0 73.9
2 112.0 63.0 NaN
3 104.5 63.0 NaN
4 91.1 61.0 NaN
5 85.5 58.3 NaN
6 76.5 59.0 NaN
7 86.5 67.0 NaN
8 77.9 64.0 NaN
9 65.0 59.9 NaN
10 71.0 70.4 NaN
11 76.0 78.9 NaN
12 72.5 75.0 NaN
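If the Jan, Feb, ... labels from the question are wanted, the integer month index can be mapped afterwards (a sketch, assuming English abbreviations; out is my name for the result above):
import calendar

out = (s.groupby([s.index.month, s.index.year]).first().unstack()
        .rename_axis(columns='Year', index='Month'))
# turn month numbers 1..12 into 'Jan'..'Dec'
out.index = out.index.map(lambda m: calendar.month_abbr[m])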
I have an issue similar to "ValueError: cannot reindex from a duplicate axis", but the solution there isn't provided.
I have an Excel file containing multiple rows and columns of weather data. The data has missing values at certain intervals, although that is not shown in the sample below. I want to reindex the time column at 5-minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')  # raw string so the backslashes stay literal
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at 5-minute frequency so that I can interpolate the NaNs later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns again later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is a datetime type, then try
ts.asfreq('5T')
Or use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. For this example, three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see the trick in all this above. Your date data was converted to datetime, but your time data is still a plain object column. Below, a proper index is created with a lambda function.
# combine each row's date and time into a single Timestamp
rawidx = rawpd.apply(lambda r: pd.Timestamp.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be applied to the rawpd database as an index.
rawpd2 = pd.DataFrame(rawpd[['Col1', 'Col2']].values, index=rawidx, columns=['Col1', 'Col2'])
rawpd2 = rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2, overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00   1.0  10.0
2018-04-01 01:05:00   2.0   NaN
2018-04-01 01:10:00   NaN  10.0
2018-04-01 01:15:00   NaN   NaN
2018-04-01 01:20:00   NaN  10.0
2018-04-01 01:25:00   NaN   NaN
2018-04-01 01:30:00   5.0  10.0
You now have a frame ready for interpolation.
I have got it to work. Thank you everyone for your time. Here is the working code:
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days:
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the columns TIMESTAMP and DATE are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to do is add a RANK column to df, filled for each row with the corresponding day's rank from ranked based on the TIMESTAMP (so within a day, all the 5-minute timesteps share the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
df.loc[index, 'RANK'] = ranked[ranked['DATE']==row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values:
df = df.set_index(df.TIMESTAMP.dt.date)\
.assign(RANK=ranked.set_index('DATE').RANK)\
.set_index(df.index)
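An alternative sketch under the same assumptions (each DATE appears once in ranked): normalize the timestamps to midnight and map them against a DATE-indexed RANK series.
# midnight-normalized timestamps line up with the DATE keys
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(ranked.set_index('DATE')['RANK'])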
I am trying to roll up daily data into fiscal quarter data. For example, I have a table with fiscal quarter end dates:
Company Period Quarter_End
M 2016Q1 05/02/2015
M 2016Q2 08/01/2015
M 2016Q3 10/31/2015
M 2016Q4 01/30/2016
WFM 2015Q2 04/12/2015
WFM 2015Q3 07/05/2015
WFM 2015Q4 09/27/2015
WFM 2016Q1 01/17/2016
and a table of daily data:
Company Date Price
M 06/20/2015 1.05
M 06/22/2015 4.05
M 07/10/2015 3.45
M 07/29/2015 1.86
M 08/24/2015 1.58
M 09/02/2015 8.64
M 09/22/2015 2.56
M 10/20/2015 5.42
M 11/02/2015 1.58
M 11/24/2015 4.58
M 12/03/2015 6.48
M 12/05/2015 4.56
M 01/03/2016 7.14
M 01/30/2016 6.34
WFM 06/20/2015 1.05
WFM 06/22/2015 4.05
WFM 07/10/2015 3.45
WFM 07/29/2015 1.86
WFM 08/24/2015 1.58
WFM 09/02/2015 8.64
WFM 09/22/2015 2.56
WFM 10/20/2015 5.42
WFM 11/02/2015 1.58
WFM 11/24/2015 4.58
WFM 12/03/2015 6.48
WFM 12/05/2015 4.56
WFM 01/03/2016 7.14
WFM 01/17/2016 6.34
And I would like to create the table below.
Company Period Quarter_end Sum(Price)
M 2016Q2 8/1/2015 10.41
M 2016Q3 10/31/2015 18.2
M 2016Q4 1/30/2016 30.68
WFM 2015Q3 7/5/2015 5.1
WFM 2015Q4 9/27/2015 18.09
WFM 2016Q1 1/17/2016 36.1
However, I don't know how to group by varying dates without looping through each record. Any help is greatly appreciated.
Thanks!
I think you can use merge_ordered:
#first convert columns to datetime
df1.Quarter_End = pd.to_datetime(df1.Quarter_End)
df2.Date = pd.to_datetime(df2.Date)
df = pd.merge_ordered(df1,
df2,
left_on=['Company','Quarter_End'],
right_on=['Company','Date'],
how='outer')
print (df)
Company Period Quarter_End Date Price
0 M 2016Q1 2015-05-02 NaT NaN
1 M NaN NaT 2015-06-20 1.05
2 M NaN NaT 2015-06-22 4.05
3 M NaN NaT 2015-07-10 3.45
4 M NaN NaT 2015-07-29 1.86
5 M 2016Q2 2015-08-01 NaT NaN
6 M NaN NaT 2015-08-24 1.58
7 M NaN NaT 2015-09-02 8.64
8 M NaN NaT 2015-09-22 2.56
9 M NaN NaT 2015-10-20 5.42
10 M 2016Q3 2015-10-31 NaT NaN
11 M NaN NaT 2015-11-02 1.58
12 M NaN NaT 2015-11-24 4.58
13 M NaN NaT 2015-12-03 6.48
14 M NaN NaT 2015-12-05 4.56
15 M NaN NaT 2016-01-03 7.14
16 M 2016Q4 2016-01-30 2016-01-30 6.34
17 WFM 2015Q2 2015-04-12 NaT NaN
18 WFM NaN NaT 2015-06-20 1.05
19 WFM NaN NaT 2015-06-22 4.05
20 WFM 2015Q3 2015-07-05 NaT NaN
21 WFM NaN NaT 2015-07-10 3.45
22 WFM NaN NaT 2015-07-29 1.86
23 WFM NaN NaT 2015-08-24 1.58
24 WFM NaN NaT 2015-09-02 8.64
25 WFM NaN NaT 2015-09-22 2.56
26 WFM 2015Q4 2015-09-27 NaT NaN
27 WFM NaN NaT 2015-10-20 5.42
28 WFM NaN NaT 2015-11-02 1.58
29 WFM NaN NaT 2015-11-24 4.58
30 WFM NaN NaT 2015-12-03 6.48
31 WFM NaN NaT 2015-12-05 4.56
32 WFM NaN NaT 2016-01-03 7.14
33 WFM 2016Q1 2016-01-17 2016-01-17 6.34
Then backfill the NaN values in the Period and Quarter_End columns with bfill and aggregate with sum. If you need to remove the remaining NaN values, add Series.dropna and finally reset_index:
df.Period = df.Period.bfill()
df.Quarter_End = df.Quarter_End.bfill()
print (df.groupby(['Company','Period','Quarter_End'])['Price'].sum().dropna().reset_index())
Company Period Quarter_End Price
0 M 2016Q2 2015-08-01 10.41
1 M 2016Q3 2015-10-31 18.20
2 M 2016Q4 2016-01-30 30.68
3 WFM 2015Q3 2015-07-05 5.10
4 WFM 2015Q4 2015-09-27 18.09
5 WFM 2016Q1 2016-01-17 36.10
The plan: set_index on both frames, pd.concat to align the indices, then groupby with agg.
prd_df = period_df.set_index(['Company', 'Quarter_End'])
prc_df = price_df.set_index(['Company', 'Date'], drop=False)
df = pd.concat([prd_df, prc_df], axis=1)
df.groupby([df.index.get_level_values(0), df.Period.bfill()]) \
.agg(dict(Date='last', Price='sum')).dropna()
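Another route worth noting (a sketch, assuming a pandas version that supports merge_asof with the direction argument, and dates parsed to datetime as above): attach each daily row to the next quarter end per company with merge_asof, then finish with a plain groupby-sum.
df1.Quarter_End = pd.to_datetime(df1.Quarter_End)
df2.Date = pd.to_datetime(df2.Date)

# each daily row picks the first quarter end on or after its date
rolled = pd.merge_asof(df2.sort_values('Date'),
                       df1.sort_values('Quarter_End'),
                       left_on='Date', right_on='Quarter_End',
                       by='Company', direction='forward')
out = (rolled.groupby(['Company', 'Period', 'Quarter_End'])['Price']
             .sum().reset_index())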