We want to transform event-based data into multiple time series.
As an example, we use pandas to plot salary changes per employee in a company over time. Each salary change is an event: an entry in a table with a date, a name and the new salary.
employee salary
date
2000-01-01 anna 4500
2003-01-01 oli 5000
2010-01-01 anna 6500
2012-01-01 lena 5000
2013-01-01 oli 7000
2016-01-01 lena 6500
2017-01-09 joe 5000
2018-01-09 peter 5000
2019-01-09 joe 5500
2019-01-31 lena 0
2020-01-01 anna 8500
2020-01-09 peter 5500
2021-01-09 joe 6000
2022-02-28 peter 0
The changes happen at irregularly spaced intervals, so to work with the data we want to reindex to a common, regularly spaced index and then fill the missing data points.
import pandas as pd

# Common daily index spanning the first to the last event
time_series_index = pd.date_range(df_events.index.min(), df_events.index.max())
df_time_series = pd.DataFrame()
for name, group in df_events.groupby('employee'):
    # Reindex to daily, carry the last salary forward, use 0 before the first event
    time_series = group['salary'].reindex(time_series_index)
    time_series = time_series.ffill().fillna(0)
    df_time_series[name] = time_series
print(df_time_series)
anna joe lena oli peter
2000-01-01 4500.0 0.0 0.0 0.0 0.0
2000-01-02 4500.0 0.0 0.0 0.0 0.0
2000-01-03 4500.0 0.0 0.0 0.0 0.0
2000-01-04 4500.0 0.0 0.0 0.0 0.0
2000-01-05 4500.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
2022-02-24 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-25 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-26 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-27 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-28 8500.0 6000.0 0.0 7000.0 0.0
The loop above does the job of reindexing to a common index.
Now the question arose whether this approach is state of the art or whether there is a more compact and straightforward way to do it. We assume that transforming events into time series is a common problem, so we expected there to be a standard way to solve this kind of problem.
We tried to make it compact by removing the loop as follows.
df_time_series = df_events.groupby('employee')['salary'].reindex(time_series_index)
It throws AttributeError:
AttributeError: 'SeriesGroupBy' object has no attribute 'reindex'
This should work. If your index is already a DatetimeIndex, you do not need the .rename(pd.to_datetime) part.
(df.rename(pd.to_datetime)
   .set_index('employee', append=True)
   .unstack()
   .asfreq('D')
   .ffill()
   .fillna(0))
Output:
salary
employee anna joe lena oli peter
2000-01-01 4500.0 0.0 0.0 0.0 0.0
2000-01-02 4500.0 0.0 0.0 0.0 0.0
2000-01-03 4500.0 0.0 0.0 0.0 0.0
2000-01-04 4500.0 0.0 0.0 0.0 0.0
2000-01-05 4500.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
2022-02-24 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-25 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-26 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-27 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-28 8500.0 6000.0 0.0 7000.0 0.0
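As a variation on the chain above, the same reshaping can probably also be written with pivot. This is only a minimal sketch: it assumes the events frame has a DatetimeIndex and the columns employee and salary as in the question, and that no employee has two events on the same date. The sample rows are the first four events from the question.

```python
import pandas as pd

# First four events from the question, rebuilt so the sketch is self-contained.
df_events = pd.DataFrame(
    {
        "employee": ["anna", "oli", "anna", "lena"],
        "salary": [4500, 5000, 6500, 5000],
    },
    index=pd.to_datetime(["2000-01-01", "2003-01-01", "2010-01-01", "2012-01-01"]),
)
df_events.index.name = "date"

df_time_series = (
    df_events.pivot(columns="employee", values="salary")  # one column per employee
    .asfreq("D")  # regular daily index; the newly created rows are NaN
    .ffill()      # carry the last known salary forward
    .fillna(0)    # before an employee's first event the salary is 0
)
print(df_time_series)
```

Compared with set_index plus unstack, pivot with values="salary" yields plain employee names as columns rather than a two-level column index, which may be more convenient for plotting.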
I'm trying to insert a range of date labels in my dataframe, df1. I've managed part of the way, but I still have some bumps that I want to smooth out.
I'm trying to generate a column with dates from 2017-01-01 to 2020-12-31 with all dates repeated 24 times, i.e., a column with 35,068 rows.
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
repeated_dates = pd.DataFrame(dates.repeat(num_repeats))
df1.insert(0, 'Date', repeated_dates)
However, the repeated dates run out before the last row of df1, so my column is NaT for the remaining rows.
output:
Date DK1 Up DK1 Down DK2 Up DK2 Down
0 2017-01-01 0.0 0.0 0.0 0.0
1 2017-01-01 0.0 0.0 0.0 0.0
2 2017-01-01 0.0 0.0 0.0 0.0
3 2017-01-01 0.0 0.0 0.0 0.0
4 2017-01-01 0.0 0.0 0.0 0.0
... ... ... ... ... ...
35063 2020-12-31 0.0 0.0 0.0 0.0
35064 NaT 0.0 0.0 0.0 0.0
35065 NaT 0.0 -54.1 0.0 0.0
35066 NaT 25.5 0.0 0.0 0.0
35067 NaT 0.0 0.0 0.0 0.0
Furthermore, how can I change the date format from '2017-01-01' to '01-01-2017'?
You set this up perfectly, so here are the dates that you have:
import pandas as pd
import numpy as np
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
df = pd.DataFrame(dates.repeat(num_repeats),columns=['date'])
and converting the column to the format you want is simple with the strftime function
df['newFormat'] = df['date'].dt.strftime('%d-%m-%Y')
Which gives
date newFormat
0 2017-01-01 01-01-2017
1 2017-01-01 01-01-2017
2 2017-01-01 01-01-2017
3 2017-01-01 01-01-2017
4 2017-01-01 01-01-2017
... ... ...
35059 2020-12-31 31-12-2020
35060 2020-12-31 31-12-2020
35061 2020-12-31 31-12-2020
35062 2020-12-31 31-12-2020
35063 2020-12-31 31-12-2020
now
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
gives
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
'2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
'2020-12-30', '2020-12-31'],
dtype='datetime64[ns]', length=1461, freq='D')
and
1461 * 24 = 35064
so I am not sure where 35,068 comes from. Are you sure about that number?
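One possible explanation for the NaT rows, offered as an assumption since df1 itself is not shown: insert aligns the new values on the index, so if df1 has more rows than there are repeated dates, the surplus rows get NaT. A toy-sized sketch of that effect (the frame and its value column are hypothetical stand-ins for df1):

```python
import pandas as pd

# Hypothetical stand-in for df1: 10 rows, but only 4 days * 2 repeats = 8 dates.
df1 = pd.DataFrame({"value": range(10)})
dates = pd.date_range(start="2017-01-01", end="2017-01-04")
repeated = pd.Series(dates.repeat(2))

# The Series is aligned on df1's index, so rows 8 and 9 have no match and get NaT.
df1.insert(0, "Date", repeated)
n_missing = int(df1["Date"].isna().sum())

# Day-month-year strings for display; the underlying column stays datetime.
formatted = df1["Date"].dt.strftime("%d-%m-%Y")
print(df1.tail(3))
print(n_missing)
```

If that is what happens in your data, comparing len(df1) with len(dates) * num_repeats should show the same difference of four rows.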
To calculate a volume weighted moving average (VWMA) I am collecting a sum(price*volume) and dividing it by the sum(volume).
I need a faster way to get a value from the previous row and add it to a value on the current row.
I have the following dataframe:
import pandas as pd
from itertools import repeat
df = pd.DataFrame({'dtime': ['16:00', '15:00', '14:00', '13:00', '12:00', '11:00', '10:00', '09:00', '08:00', '07:00', '06:00', '05:00', '04:00', '03:00', '02:00', '01:00'],
'time': [1800, 1740, 1680, 1620, 1560, 1500, 1440, 1380, 1320, 1260, 1200, 1140, 1080, 1020, 960, 900],
'price': [100.1, 102.7, 108.5, 105.3, 107.1, 103.4, 101.8, 102.7, 101.6, 99.8, 100.2, 97.7, 99.3, 100.1, 102.5, 103.9],
'volume': [6.0, 6.5, 5.4, 6.3, 6.4, 7.1, 6.7, 6.2, 5.7, 1.2, 2.4, 3.9, 5.2, 8.9, 7.2, 6.5]
}, columns = ['dtime', 'time', 'price', 'volume']).set_index('dtime')
df.insert(df.shape[1], "PV", df['price']*df['volume'])
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))
Which is
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 0.0 0.0 0.0 0.0
15:00 1740 102.7 6.5 667.55 0.0 0.0 0.0 0.0
14:00 1680 108.5 5.4 585.90 0.0 0.0 0.0 0.0
13:00 1620 105.3 6.3 663.39 0.0 0.0 0.0 0.0
12:00 1560 107.1 6.4 685.44 0.0 0.0 0.0 0.0
11:00 1500 103.4 7.1 734.14 0.0 0.0 0.0 0.0
10:00 1440 101.8 6.7 682.06 0.0 0.0 0.0 0.0
09:00 1380 102.7 6.2 636.74 0.0 0.0 0.0 0.0
08:00 1320 101.6 5.7 579.12 0.0 0.0 0.0 0.0
07:00 1260 99.8 1.2 119.76 0.0 0.0 0.0 0.0
06:00 1200 100.2 2.4 240.48 0.0 0.0 0.0 0.0
05:00 1140 97.7 3.9 381.03 0.0 0.0 0.0 0.0
04:00 1080 99.3 5.2 516.36 0.0 0.0 0.0 0.0
03:00 1020 100.1 8.9 890.89 0.0 0.0 0.0 0.0
02:00 960 102.5 7.2 738.00 0.0 0.0 0.0 0.0
01:00 900 103.9 6.5 675.35 0.0 0.0 0.0 0.0
Right now I am using a for loop to check each row if 'flag' is set.
#----pseudo code----
#for each row in df (from bottom to top, excluding the very bottom row)
#    if flag[row] is not set:
#        PVsum_2[row] = PV[row] + PV[row + 1]
#        Vsum_2[row] = volume[row] + volume[row + 1]
#        VWMA_2[row] = PVsum_2[row] / Vsum_2[row]
#        flag[row] = 1.0
#----pseudo code----
# Column positions, counted with the index ('dtime') at position 0
my_dict = {'dtime'  : 0,
           'time'   : 1,
           'price'  : 2,
           'volume' : 3,
           'PV'     : 4,
           'flag'   : 5,
           'PVsum_2': 6,
           'Vsum_2' : 7,
           'VWMA_2' : 8}
for row in reversed(range(len(df)-1)):
    # if flag value is not set (i.e. flag == 0)
    if not df['flag'].iloc[row]:
        # sum of current and previous PV (price*volume) values
        a = df['PV'].iloc[row] + df['PV'].iloc[row+1]
        df.iloc[row, my_dict['PVsum_2']-1] = a
        # sum of current and previous volumes
        b = df['volume'].iloc[row] + df['volume'].iloc[row+1]
        df.iloc[row, my_dict['Vsum_2']-1] = b
        # PVsum_2 / Vsum_2
        c = (a / b) if b != 0.0 else 0.0
        df.iloc[row, my_dict['VWMA_2']-1] = c
        # set flag value to 1.0
        df.iloc[row, my_dict['flag']-1] = 1.0
But this takes too long on large data sets (500+ rows).
I'm looking for something faster and more elegant.
The dataframe should look like this when it is done (notice the bottom row has not been altered):
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
Eventually new data will be added to the top of the data frame as seen below, and will need to be updated again.
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
19:00 1980 100.1 6.0 600.60 0.0 0.0 0.0 0.0
18:00 1920 102.7 6.5 667.55 0.0 0.0 0.0 0.0
17:00 1860 108.5 5.4 585.90 0.0 0.0 0.0 0.0
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
It looks like you're not using pandas in the right way. I'd recommend taking a quick look at a tutorial.
For starters, the following lines
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))
can be written much more simply as:
df['flag'] = 0.0
df['PVsum_2'] = 0.0
df['Vsum_2'] = 0.0
df['VWMA_2'] = 0.0
But it seems you don't even need to initialise those columns really.
You also don't need the for loop: you can align two dataframes, your original and a copy in which all rows are shifted by one. For example:
df_shift = df.shift(-1)
You can then use normal vectorised calculations to achieve what you want, e.g.:
df['PVsum_2'] = df['PV'] + df_shift['PV']
df['Vsum_2'] = df['volume'] + df_shift['volume']
idx = df['Vsum_2'] != 0  # this is your check whether that value is different from 0
df.loc[idx, 'VWMA_2'] = df.loc[idx, 'PVsum_2'] / df.loc[idx, 'Vsum_2']  # only calculate VWMA_2 where Vsum_2 is not 0
Hopefully you get the idea and can make small adjustments to make it work exactly as you want.
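Putting the pieces above together, a minimal end-to-end sketch could look like this. It uses only the first four rows of the question's data, and it assumes that the untouched bottom row and any zero-volume rows should end up as 0.0, as in the expected output shown in the question.

```python
import pandas as pd

# First four rows of the question's data (newest row first).
df = pd.DataFrame(
    {"price": [100.1, 102.7, 108.5, 105.3], "volume": [6.0, 6.5, 5.4, 6.3]},
    index=pd.Index(["16:00", "15:00", "14:00", "13:00"], name="dtime"),
)
df["PV"] = df["price"] * df["volume"]

# shift(-1) moves each value up one row, so row i is paired with row i + 1.
df["PVsum_2"] = df["PV"] + df["PV"].shift(-1)
df["Vsum_2"] = df["volume"] + df["volume"].shift(-1)

# Mask zero denominators as NaN so the division never divides by zero.
df["VWMA_2"] = df["PVsum_2"] / df["Vsum_2"].where(df["Vsum_2"] != 0)

# The bottom row has no row below it (NaN after the shift); report it as 0.0.
cols = ["PVsum_2", "Vsum_2", "VWMA_2"]
df[cols] = df[cols].fillna(0.0)
print(df)
```

Because the whole column is computed in one pass, it is likely cheap enough to recompute after new rows are added at the top, in which case the flag column may not be needed at all.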
Trying to transpose and group data to look like this:
Current group by data:
MTD-Total Revenue YTD-Total Revenue MTD-Room Revenue YTD-Room Revenue MTD-Room Nights YTD-Room Nights MTD-ADR YTD-ADR MTD-OCC% YTD-OCC%
Market Group
Aff 0.0 0.0 2026136.99 21546922.96 857.0 8650.0 2457.02 2551.87 4.99 4.16
Air 0.0 0.0 2809312.53 32534587.15 925.0 9684.0 2392.08 3016.00 2.69 2.33
BAR 0.0 0.0 470866.23 8341596.95 131.0 2481.0 3189.75 3133.08 0.76 1.19
Cas 0.0 0.0 4801710.10 55466024.12 1652.0 18566.0 2365.23 2585.25 1.92 1.79
Com 0.0 0.0 3873151.63 43857524.55 1088.0 11980.0 2449.43 2632.57 6.34 5.76
Cor 0.0 0.0 7104841.79 88326080.23 2314.0 26836.0 1552.74 2919.07 4.14 3.97
Pro 0.0 0.0 335358.36 1907348.23 97.0 562.0 3457.30 3393.86 2.26 1.08
Soc 0.0 0.0 12706.96 82957.59 4.0 25.0 1588.37 3315.74 0.04 0.02
TA 0.0 0.0 1016565.12 15563472.77 416.0 6797.0 2412.55 2229.46 4.84 6.54
Wal 0.0 0.0 277267.66 3786378.41 68.0 812.0 4077.47 4663.03 1.58 1.56
Codes ran:
pd.DataFrame(df.values.reshape(-1,5))
df.reset_index().pivot('Market Group', 'MTD-Total Revenue', 'YTD-Total Revenue')
How data looks if it were to be transposed: df.T:
The answer to this would be:
df = pd.melt(df.reset_index(), id_vars=['Market Group'], value_vars=['MTD-Total Revenue','YTD-Total Revenue','MTD-Room Revenue','YTD-Room Revenue','MTD-Room Nights','YTD-Room Nights','MTD-ADR','YTD-ADR','MTD-OCC%','YTD-OCC%'])
This keeps the headers, unlike unstack or stack. The reset_index() call is needed when Market Group is the index, as in the printed data, because id_vars must name a column.
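To illustrate what melt produces, here is a small sketch on two rows and two columns taken from the data above. The column names Measure and Value are illustrative choices, not something from the question.

```python
import pandas as pd

# Two market groups and two measures copied from the question's table.
df = pd.DataFrame(
    {
        "MTD-Room Nights": [857.0, 925.0],
        "YTD-Room Nights": [8650.0, 9684.0],
    },
    index=pd.Index(["Aff", "Air"], name="Market Group"),
)

# Move the index into a column, then stack every measure into rows.
long_df = df.reset_index().melt(
    id_vars="Market Group",
    var_name="Measure",   # holds the former column headers
    value_name="Value",   # holds the numbers
)
print(long_df)
```

Each original header ends up as a value in the Measure column, once per market group, so no header information is lost.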
Even though this seems really simple, it drives me nuts. Why is .astype(int) not changing the floats to ints? Thank you
df_new = pd.crosstab(df["date"], df["place"]).reset_index()
places = ['cityA', "cityB", "cityC"]
df_new[places] = df_new[places].fillna(0).astype(int)
sums = df_new.select_dtypes(pd.np.number).sum().rename('total')
df_new = df_new.append(sums)
print(df_new)
Output:
place date cityA cityB cityC
0 2008-01-01 0.0 0.0 51.0
1 2009-06-01 0.0 618.0 0.0
2 2015-07-01 549.0 0.0 0.0
3 2016-01-01 41.0 0.0 0.0
4 2016-04-01 62.0 0.0 0.0
5 2017-01-01 800.0 0.0 0.0
6 2018-07-01 69.0 0.0 0.0
total NaT 1521.0 618.0 51.0
If there are NAs (which are floats in pandas), the other values in the same column will be floats as well.
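A small sketch of that upcasting, using three counts from the output above. The reindex call is only a stand-in for any operation that introduces a missing value, and the nullable Int64 dtype is one option pandas offers for keeping integers alongside missing values.

```python
import pandas as pd

counts = pd.Series([549, 41, 62], name="cityA")  # plain integer dtype

# Adding a row with no value introduces NaN, which forces the column to float64.
padded = counts.reindex(range(4))

# Option 1: the nullable integer dtype keeps integers and shows <NA> for the gap.
restored = padded.astype("Int64")

# Option 2: fill the gap first, then a plain integer cast works again.
filled = padded.fillna(0).astype(int)

print(padded.dtype, restored.dtype, filled.dtype)
```

In the question, the cast to int happens before the totals row is appended, so any missing value introduced by the append would turn the columns back into floats afterwards. Casting as the last step is one way around that.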