How to scale all data within a dataframe but one column - python

My data look like:
cycles os1 os2 os3 sm1 sm2 sm3 sm4 sm5 sm6 sm7 sm8 sm9 sm10 sm11 sm12 sm13 sm14 sm15 sm16 sm17 sm18 sm19 sm20 sm21
0 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2388.06 9046.19 1.3 47.47 521.66 2388.02 8138.62 8.4195 0.03 392 2388 100.0 39.06 23.4190
1 2 0.0019 -0.0003 100.0 518.67 642.15 1591.82 1403.14 14.62 21.61 553.75 2388.04 9044.07 1.3 47.49 522.28 2388.07 8131.49 8.4318 0.03 392 2388 100.0 39.00 23.4236
2 3 -0.0043 0.0003 100.0 518.67 642.35 1587.99 1404.20 14.62 21.61 554.26 2388.08 9052.94 1.3 47.27 522.42 2388.03 8133.23 8.4178 0.03 390 2388 100.0 38.95 23.3442
3 4 0.0007 0.0000 100.0 518.67 642.35 1582.79 1401.87 14.62 21.61 554.45 2388.11 9049.48 1.3 47.13 522.86 2388.08 8133.83 8.3682 0.03 392 2388 100.0 38.88 23.3739
4 5 -0.0019 -0.0002 100.0 518.67 642.37 1582.85 1406.22 14.62 21.61 554.00 2388.06 9055.15 1.3 47.28 522.19 2388.04 8133.80 8.4294 0.03 393 2388 100.0 38.90 23.4044
and I want to rescale the whole dataframe except the cycles column.
I've tried:
# Scale features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
for col in data[1:].columns:
    data[col] = sc.fit_transform(data[col].values.reshape(-1,1))
but it scales the whole dataframe anyway.
Help would be appreciated.
Thanks!

You can select all columns except the first with DataFrame.iloc: here the first : means all rows and 1: means all columns but the first. Your loop fails because data[1:] slices rows, not columns, so data[1:].columns still contains every column, including cycles. Pass the iloc selection to fit_transform:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data.iloc[:, 1:] = sc.fit_transform(data.iloc[:, 1:])
print(data)
cycles os1 os2 os3 sm1 sm2 sm3 sm4 sm5 \
0 1 0.074613 -1.128152 0.0 0.0 -1.847703 0.732430 -1.349237 0.0
1 2 1.287080 -0.725241 0.0 0.0 -0.276203 1.313986 -0.034171 0.0
2 3 -1.604187 1.692228 0.0 0.0 0.676221 0.263346 0.514636 0.0
3 4 0.727480 0.483494 0.0 0.0 0.676221 -1.163111 -0.691704 0.0
4 5 -0.484987 -0.322329 0.0 0.0 0.771464 -1.146651 1.560476 0.0
sm6 sm7 sm8 sm9 sm10 sm11 sm12 sm13 \
0 0.0 0.765578 -0.422577 -0.822800 0.0 1.050958 -1.607627 -1.209416
1 0.0 -1.617089 -1.267731 -1.339486 0.0 1.198981 -0.005169 0.950255
2 0.0 0.374977 0.422577 0.822312 0.0 -0.429265 0.356676 -0.777482
3 0.0 1.117119 1.690309 -0.020960 0.0 -1.465421 1.493904 1.382189
4 0.0 -0.640586 -0.422577 1.360934 0.0 -0.355254 -0.237784 -0.345547
sm14 sm15 sm16 sm17 sm18 sm19 sm20 sm21
0 1.866394 0.265372 0.0 0.204124 0.0 0.0 1.549015 0.867102
1 -1.140246 0.795254 0.0 0.204124 0.0 0.0 0.637830 1.020631
2 -0.406508 0.192136 0.0 -1.837117 0.0 0.0 -0.121491 -1.629404
3 -0.153495 -1.944623 0.0 0.204124 0.0 0.0 -1.184541 -0.638144
4 -0.166145 0.691862 0.0 1.224745 0.0 0.0 -0.880812 0.379816
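If you would rather keep the per-column loop, a minimal sketch of the corrected version (slicing the columns instead of the rows) looks like this:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# data.columns[1:] skips the first column ('cycles');
# data[1:] would slice away the first row instead
for col in data.columns[1:]:
    data[col] = sc.fit_transform(data[col].values.reshape(-1, 1))
The result matches the iloc version above, since StandardScaler standardises each column independently either way.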

Related

transform event based data into time series data with pandas using groupby and reindex

We want to transform event-based data into multiple time series.
As an example, we use pandas to plot the changes in salary per employee in a company over time. A salary-change event is an entry in a table with a date, a name and the new salary.
employee salary
date
2000-01-01 anna 4500
2003-01-01 oli 5000
2010-01-01 anna 6500
2012-01-01 lena 5000
2013-01-01 oli 7000
2016-01-01 lena 6500
2017-01-09 joe 5000
2018-01-09 peter 5000
2019-01-09 joe 5500
2019-01-31 lena 0
2020-01-01 anna 8500
2020-01-09 peter 5500
2021-01-09 joe 6000
2022-02-28 peter 0
The changes happen at irregularly spaced intervals, so to work with the data we want to reindex to a common, regularly spaced index and then do fill operations on the missing data points.
time_series_index = pd.date_range(df_events.index.min(), df_events.index.max())
df_time_series = pd.DataFrame()
for name, group in df_events.groupby('employee'):
    time_series = group['salary'].reindex(time_series_index)
    time_series = time_series.ffill().fillna(0)
    df_time_series[name] = time_series
print(df_time_series)
anna joe lena oli peter
2000-01-01 4500.0 0.0 0.0 0.0 0.0
2000-01-02 4500.0 0.0 0.0 0.0 0.0
2000-01-03 4500.0 0.0 0.0 0.0 0.0
2000-01-04 4500.0 0.0 0.0 0.0 0.0
2000-01-05 4500.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
2022-02-24 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-25 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-26 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-27 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-28 8500.0 6000.0 0.0 7000.0 0.0
The loop above does the job of reindexing to a common index.
Now the question arose whether this approach is state of the art or if there is a more compact and straightforward way to do it. We assume that transforming events into time series is a common problem, so we expected a standard solution to exist.
We tried to make it compact by removing the loop as follows.
df_time_series = df_events.groupby('employee')['salary'].reindex(time_series_index)
It throws AttributeError:
AttributeError: 'SeriesGroupBy' object has no attribute 'reindex'
This should work. If your index is already a datetime index, you do not need the .rename(pd.to_datetime) part:
(df.rename(pd.to_datetime)            # parse the string index into datetimes
 .set_index('employee', append=True)  # index becomes (date, employee)
 .unstack()                           # one salary column per employee
 .asfreq('D')                         # reindex to a regular daily frequency
 .ffill()                             # carry the last known salary forward
 .fillna(0))                          # zero before an employee's first event
Output:
salary
employee anna joe lena oli peter
2000-01-01 4500.0 0.0 0.0 0.0 0.0
2000-01-02 4500.0 0.0 0.0 0.0 0.0
2000-01-03 4500.0 0.0 0.0 0.0 0.0
2000-01-04 4500.0 0.0 0.0 0.0 0.0
2000-01-05 4500.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
2022-02-24 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-25 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-26 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-27 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-28 8500.0 6000.0 0.0 7000.0 0.0
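For reference, the attempted one-liner fails because a SeriesGroupBy has no reindex method, but moving the reindex into apply gets close to it; a sketch, assuming df_events is indexed by date as shown:
df_time_series = (df_events.groupby('employee')['salary']
                  .apply(lambda s: s.reindex(time_series_index))
                  .unstack(0)   # employee names become columns
                  .ffill()
                  .fillna(0))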

Get minimum of minimums per column - Pandas

I have the following data frame:
index timestamp Output_Energy Elevation one two three
.
.
538 2016-06-20 08:58:00+05:30 40.34829924338887 44.04129964199598 0.0 0.0 0.0
539 2016-06-20 08:59:00+05:30 40.5703298574816 44.25962644894399 0.0 0.0 0.0
540 2016-06-20 09:00:00+05:30 40.79141282764114 44.47799109475774 25.0 30.0 40.0
541 2016-06-20 09:01:00+05:30 41.01157539726741 44.69639316356193 0.0 0.0 0.0
542 2016-06-20 09:02:00+05:30 41.230790026853384 44.91483208973582 0.0 0.0 0.0
543 2016-06-20 09:03:00+05:30 41.44905311289469 45.13330730580419 0.0 0.0 0.0
544 2016-06-20 09:04:00+05:30 41.666364098311895 45.351818241921414 0.0 0.0 0.0
545 2016-06-20 09:05:00+05:30 41.8827166483967 45.57036432534446 0.0 0.0 0.0
546 2016-06-20 09:06:00+05:30 42.0981013074145 45.78894498194398 0.0 0.0 0.0
547 2016-06-20 09:07:00+05:30 42.31250713341641 46.007559633636575 0.0 0.0 0.0
548 2016-06-20 09:08:00+05:30 42.525960667204465 46.22620769987313 0.0 0.0 0.0
549 2016-06-20 09:09:00+05:30 42.738418433471544 46.44488859711554 0.0 0.0 0.0
550 2016-06-20 09:10:00+05:30 42.94990329039521 46.66360188496949 0.0 0.0 0.0
551 2016-06-20 09:11:00+05:30 43.160390522060574 46.88234668098421 0.0 0.0 0.0
552 2016-06-20 09:12:00+05:30 43.36988302062059 47.101122538059016 0.0 0.0 0.0
553 2016-06-20 09:13:00+05:30 43.57837124543777 47.319928859306344 0.0 0.0 0.0
554 2016-06-20 09:14:00+05:30 43.785848829859404 47.53876504413286 0.0 0.0 0.0
555 2016-06-20 09:15:00+05:30 43.992319514156094 47.75763048766155 0.0 0.0 0.0
556 2016-06-20 09:16:00+05:30 44.19779793479642 47.976524582260204 0.0 0.0 0.0
557 2016-06-20 09:17:00+05:30 44.402250365403276 48.19544671486692 0.0 0.0 0.0
558 2016-06-20 09:18:00+05:30 44.6056725858456 48.4143962684937 0.0 0.0 0.0
559 2016-06-20 09:19:00+05:30 44.80807132168398 48.63337262163986 0.0 0.0 0.0
560 2016-06-20 09:20:00+05:30 45.00943825682395 48.852375147711754 0.0 0.0 0.0
.
.
I am trying to find the single minimum value across the columns one, two and three with the following:
i = 0
for i in range(0, len(df_temp) - 1):
    target_batteries = ["one", "two", "three"]
    global_min_SOC = df_temp[target_batteries].min().min()
    min_SOC_battery = (df_temp[target_batteries].min() == global_min_SOC)
    min_SOC = df_temp[target_batteries].idxmin(axis=1)
    i += 1
min_SOC_battery
Somehow I am getting global_min_SOC = 0.0, min_SOC_battery as
one True
two True
three True
dtype: bool
and min_SOC as
0 one
1 one
2 one
3 one
4 one
...
2876 one
2877 one
2878 one
2879 one
2880 one
Length: 2881, dtype: object
Expected output is global_min_SOC = 25.0, min_SOC = one and min_SOC_battery as
one True
two False
three False
dtype: bool
What am I doing wrong?
Also, how can I access the value of min_SOC_battery? I want to call a function on it.
Thanks in advance!
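Strictly speaking the code is not wrong: 0.0 really is the minimum of every column, which is why all three comparisons come back True. Judging from the expected output of 25.0, the zeros are presumably placeholders to be ignored, and the surrounding loop is redundant since min() already scans all rows. A minimal sketch under the assumption that zeros should be excluded:
import numpy as np

target_batteries = ["one", "two", "three"]
# treat zeros as missing so they cannot win the minimum
nonzero = df_temp[target_batteries].replace(0.0, np.nan)

global_min_SOC = nonzero.min().min()               # 25.0
min_SOC_battery = nonzero.min() == global_min_SOC  # one True, two False, three False
min_SOC = min_SOC_battery.idxmax()                 # 'one'

# min_SOC_battery is a plain boolean Series: index it by label
print(min_SOC_battery['one'])  # True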

faster way to calculate a rolling sum in a dataframe

To calculate a volume weighted moving average (VWMA) I am collecting a sum(price*volume) and dividing it by the sum(volume).
I need a faster way to get a value from the previous row and add it to a value on the current row.
I have the following dataframe:
import pandas as pd
from itertools import repeat
df = pd.DataFrame({'dtime': ['16:00', '15:00', '14:00', '13:00', '12:00', '11:00', '10:00', '09:00', '08:00', '07:00', '06:00', '05:00', '04:00', '03:00', '02:00', '01:00'],
                   'time': [1800, 1740, 1680, 1620, 1560, 1500, 1440, 1380, 1320, 1260, 1200, 1140, 1080, 1020, 960, 900],
                   'price': [100.1, 102.7, 108.5, 105.3, 107.1, 103.4, 101.8, 102.7, 101.6, 99.8, 100.2, 97.7, 99.3, 100.1, 102.5, 103.9],
                   'volume': [6.0, 6.5, 5.4, 6.3, 6.4, 7.1, 6.7, 6.2, 5.7, 1.2, 2.4, 3.9, 5.2, 8.9, 7.2, 6.5]
                  }, columns=['dtime', 'time', 'price', 'volume']).set_index('dtime')
df.insert(df.shape[1], "PV", df['price']*df['volume'])
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))
Which is
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 0.0 0.0 0.0 0.0
15:00 1740 102.7 6.5 667.55 0.0 0.0 0.0 0.0
14:00 1680 108.5 5.4 585.90 0.0 0.0 0.0 0.0
13:00 1620 105.3 6.3 663.39 0.0 0.0 0.0 0.0
12:00 1560 107.1 6.4 685.44 0.0 0.0 0.0 0.0
11:00 1500 103.4 7.1 734.14 0.0 0.0 0.0 0.0
10:00 1440 101.8 6.7 682.06 0.0 0.0 0.0 0.0
09:00 1380 102.7 6.2 636.74 0.0 0.0 0.0 0.0
08:00 1320 101.6 5.7 579.12 0.0 0.0 0.0 0.0
07:00 1260 99.8 1.2 119.76 0.0 0.0 0.0 0.0
06:00 1200 100.2 2.4 240.48 0.0 0.0 0.0 0.0
05:00 1140 97.7 3.9 381.03 0.0 0.0 0.0 0.0
04:00 1080 99.3 5.2 516.36 0.0 0.0 0.0 0.0
03:00 1020 100.1 8.9 890.89 0.0 0.0 0.0 0.0
02:00 960 102.5 7.2 738.00 0.0 0.0 0.0 0.0
01:00 900 103.9 6.5 675.35 0.0 0.0 0.0 0.0
Right now I am using a for loop that checks each row for whether 'flag' is set.
#----pseudo code----
#for each row in df (from bottom to top, excluding the very bottom row)
# if flag[row] is not set:
# PVsum_2[row] = PV[row] + PV[row + 1]
# Vsum_2[row] = volume[row] + volume[row + 1]
# VWMA_2[row] = PVsum_2[row] / Vsum_2[row]
# flag[row] = 1.0
#----pseudo code----
my_dict = {'dtime'  : 0,
           'time'   : 1,
           'price'  : 2,
           'volume' : 3,
           'PV'     : 4,
           'flag'   : 5,
           'PVsum_2': 6,
           'Vsum_2' : 7,
           'VWMA_2' : 8}

# 'dtime' is the index, so iloc column positions are my_dict[...] - 1
for row in reversed(range(len(df)-1)):
    # if flag value is not set (i.e. flag == 0)
    if not df['flag'][row]:
        # sum of current and previous PV (price*volume) values
        a = df['PV'][row] + df['PV'][row+1]
        df.iloc[row, my_dict['PVsum_2']-1] = a
        # sum of current and previous volumes
        b = df['volume'][row] + df['volume'][row+1]
        df.iloc[row, my_dict['Vsum_2']-1] = b
        # PVsum_2 / Vsum_2
        c = (a / b) if b != 0.0 else 0.0
        df.iloc[row, my_dict['VWMA_2']-1] = c
        # set flag value to 1.0
        df.iloc[row, my_dict['flag']-1] = 1.0
but this takes too long on large sets of data (500+ rows)
I'm looking for something faster and more elegant.
The dataframe should look like this when it is done (notice the bottom row has not been altered):
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
Eventually new data will be added to the top of the data frame as seen below, and will need to be updated again.
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
19:00 1980 100.1 6.0 600.60 0.0 0.0 0.0 0.0
18:00 1920 102.7 6.5 667.55 0.0 0.0 0.0 0.0
17:00 1860 108.5 5.4 585.90 0.0 0.0 0.0 0.0
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
It looks like you're not using pandas in the right way. I'd recommend taking a quick look at a tutorial.
For starters, the following lines
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))
can be written much more simply as:
df['flag'] = 0
df['PVsum_2'] = 0
df['Vsum_2'] = 0
df['VWMA_2'] = 0
But it seems you don't really need to initialise those columns at all.
You also don't need the for loop, because you can align two dataframes: your original and a copy with all rows shifted. For example:
df_shift = df.shift(-1)
You can then use normal vectorised calculations to achieve what you want, e.g.:
df['PVsum_2'] = df['PV'] + df_shift['PV']
df['Vsum_2'] = df['volume'] + df_shift['volume']
idx = df['Vsum_2'] != 0  # only rows where the volume sum is non-zero
df.loc[idx, 'VWMA_2'] = df.loc[idx, 'PVsum_2'] / df.loc[idx, 'Vsum_2']  # compute VWMA_2 only where Vsum_2 is non-zero
Hopefully you get the idea and can make small adjustments to make it work exactly as you want.
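As a further sketch (not from the answer above): because the rows are stored newest-first, the two-row sums are just rolling sums over the reversed frame, which also generalises to longer windows, assuming the layout shown in the question:
n = 2  # window length in rows
# reverse into chronological order, take rolling sums, reverse back
pv_sum = df['PV'][::-1].rolling(n).sum()[::-1]
v_sum = df['volume'][::-1].rolling(n).sum()[::-1]

df['PVsum_2'] = pv_sum.fillna(0.0)
df['Vsum_2'] = v_sum.fillna(0.0)
df['VWMA_2'] = (pv_sum / v_sum).fillna(0.0)  # rows without a full window stay 0.0
Recomputing these columns whenever new rows arrive is vectorised and cheap, which may remove the need for the flag column altogether.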

Pivot group by data

Trying to transpose and group the data.
The current group-by data:
MTD-Total Revenue YTD-Total Revenue MTD-Room Revenue YTD-Room Revenue MTD-Room Nights YTD-Room Nights MTD-ADR YTD-ADR MTD-OCC% YTD-OCC%
Market Group
Aff 0.0 0.0 2026136.99 21546922.96 857.0 8650.0 2457.02 2551.87 4.99 4.16
Air 0.0 0.0 2809312.53 32534587.15 925.0 9684.0 2392.08 3016.00 2.69 2.33
BAR 0.0 0.0 470866.23 8341596.95 131.0 2481.0 3189.75 3133.08 0.76 1.19
Cas 0.0 0.0 4801710.10 55466024.12 1652.0 18566.0 2365.23 2585.25 1.92 1.79
Com 0.0 0.0 3873151.63 43857524.55 1088.0 11980.0 2449.43 2632.57 6.34 5.76
Cor 0.0 0.0 7104841.79 88326080.23 2314.0 26836.0 1552.74 2919.07 4.14 3.97
Pro 0.0 0.0 335358.36 1907348.23 97.0 562.0 3457.30 3393.86 2.26 1.08
Soc 0.0 0.0 12706.96 82957.59 4.0 25.0 1588.37 3315.74 0.04 0.02
TA 0.0 0.0 1016565.12 15563472.77 416.0 6797.0 2412.55 2229.46 4.84 6.54
Wal 0.0 0.0 277267.66 3786378.41 68.0 812.0 4077.47 4663.03 1.58 1.56
Code tried:
pd.DataFrame(df.values.reshape(-1,5))
df.reset_index().pivot('Market Group', 'MTD-Total Revenue', 'YTD-Total Revenue')
For comparison, df.T shows how the data looks if it is simply transposed.
The answer to this would be:
df = pd.melt(df.reset_index(), id_vars=['Market Group'], value_vars=['MTD-Total Revenue','YTD-Total Revenue','MTD-Room Revenue','YTD-Room Revenue','MTD-Room Nights','YTD-Room Nights','MTD-ADR','YTD-ADR','MTD-OCC%','YTD-OCC%'])
(reset_index turns the Market Group index back into a column so melt can use it as id_vars). This keeps the headers, unlike unstack or stack.
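A quick usage sketch (df_wide is a hypothetical name for the original group-by table; variable and value are melt's default output column names):
df_long = pd.melt(df_wide.reset_index(), id_vars=['Market Group'])
print(df_long.head(1))
#   Market Group           variable  value
# 0          Aff  MTD-Total Revenue    0.0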

astype() does not change floats

Even though this seems really simple, it drives me nuts. Why is .astype(int) not changing the floats to ints? Thank you
df_new = pd.crosstab(df["date"], df["place"]).reset_index()
places = ['cityA', "cityB", "cityC"]
df_new[places] = df_new[places].fillna(0).astype(int)
sums = df_new.select_dtypes(np.number).sum().rename('total')  # use numpy directly; pd.np is deprecated
df_new = df_new.append(sums)
print(df_new)
Output:
place date cityA cityB cityC
0 2008-01-01 0.0 0.0 51.0
1 2009-06-01 0.0 618.0 0.0
2 2015-07-01 549.0 0.0 0.0
3 2016-01-01 41.0 0.0 0.0
4 2016-04-01 62.0 0.0 0.0
5 2017-01-01 800.0 0.0 0.0
6 2018-07-01 69.0 0.0 0.0
total NaT 1521.0 618.0 51.0
If there are NAs (which are floats in pandas), the other values will be floats as well. The astype(int) itself works; it is append(sums) that undoes it: the appended 'total' row has no date value, so aligning it introduces a missing entry, the row becomes float, and the integer columns are upcast back to float.
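If the totals row has to coexist with integer display, pandas' nullable integer dtype (available since pandas 0.24) is one option; a sketch, converting after the append:
# 'Int64' (capital I) is the nullable integer dtype; it can hold
# missing values as <NA> without upcasting the column to float
df_new = df_new.append(sums)
df_new[places] = df_new[places].astype('Int64')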
